:: DeveloperApi :: A set of functions used to aggregate data.
:: DeveloperApi :: A set of functions used to aggregate data.
function to create the initial value of the aggregation.
function to merge a new value into the aggregation result.
function to merge outputs from multiple mergeValue function.
:: Experimental :: A TaskContext with extra contextual info and tooling for tasks in a barrier stage.
:: Experimental :: A TaskContext with extra contextual info and tooling for tasks in a barrier stage. Use BarrierTaskContext#get to obtain the barrier context for a running barrier task.
:: Experimental :: Carries all task infos of a barrier task.
:: Experimental :: Carries all task infos of a barrier task.
A FutureAction for actions that could trigger multiple Spark jobs.
A FutureAction for actions that could trigger multiple Spark jobs. Examples include take, takeSample. Cancellation works by setting the cancelled flag to true and cancelling any pending jobs.
:: DeveloperApi :: A TaskContext aware iterator.
:: DeveloperApi :: A TaskContext aware iterator.
As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. If an off-heap access exists in the parent iterator, it could cause segmentation fault which crashes the executor. Thus, we should use ContextAwareIterator to stop consuming after the task ends.
:: DeveloperApi :: Base class for dependencies.
:: DeveloperApi :: Base class for dependencies.
:: DeveloperApi :: Task failed due to a runtime exception.
:: DeveloperApi :: Task failed due to a runtime exception. This is the most common failure case and also captures user program exceptions.
stackTrace
contains the stack trace of the exception itself. It still exists for backward
compatibility. It's better to use this(e: Throwable, metrics: Option[TaskMetrics])
to
create ExceptionFailure
as it will handle the backward compatibility properly.
fullStackTrace
is a better representation of the stack trace because it contains the whole
stack trace including the exception and its causes
exception
is the actual exception that caused the task to fail. It may be None
in
the case that the exception is not in fact serializable. If a task fails more than
once (due to retries), exception
is that one that caused the last failure.
:: DeveloperApi :: The task failed because the executor that it was running on was lost.
:: DeveloperApi :: The task failed because the executor that it was running on was lost. This may happen because the task crashed the JVM.
:: DeveloperApi :: Task failed to fetch shuffle data from a remote node.
:: DeveloperApi :: Task failed to fetch shuffle data from a remote node. Probably means we have lost the remote executors the task is trying to fetch from, and thus need to rerun the previous stage.
A future for the result of an action to support cancellation.
A future for the result of an action to support cancellation. This is an extension of the Scala Future interface to support cancellation.
A org.apache.spark.Partitioner that implements hash-based partitioning using
Java's Object.hashCode
.
A org.apache.spark.Partitioner that implements hash-based partitioning using
Java's Object.hashCode
.
Java arrays have hashCodes that are based on the arrays' identities rather than their contents, so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.
:: DeveloperApi :: An iterator that wraps around an existing iterator to provide task killing functionality.
:: DeveloperApi :: An iterator that wraps around an existing iterator to provide task killing functionality. It works by checking the interrupted flag in TaskContext.
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
:: DeveloperApi :: Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD.
:: DeveloperApi :: Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
:: DeveloperApi :: Represents a one-to-one dependency between partitions of the parent and child RDDs.
:: DeveloperApi :: Represents a one-to-one dependency between partitions of the parent and child RDDs.
An identifier for a partition in an RDD.
An object that defines how the elements in a key-value pair RDD are partitioned by key.
An object that defines how the elements in a key-value pair RDD are partitioned by key.
Maps each key to a partition ID, from 0 to numPartitions - 1
.
Note that, partitioner must be deterministic, i.e. it must return the same partition id given the same partition key.
:: DeveloperApi :: Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
:: DeveloperApi :: Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
A org.apache.spark.Partitioner that partitions sortable records by range into roughly equal ranges.
A org.apache.spark.Partitioner that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in.
The actual number of partitions created by the RangePartitioner might not be the same
as the partitions
parameter, in the case where the number of sampled records is less than
the value of partitions
.
:: DeveloperApi :: Represents a dependency on the output of a shuffle stage.
:: DeveloperApi :: Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don't need it on the executor side.
A FutureAction holding the result of an action that triggers a single job.
A FutureAction holding the result of an action that triggers a single job. Examples include count, collect, reduce.
Configuration for a Spark application.
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf()
, which will load
values from any spark.*
Java system properties set in your application as well. In this case,
parameters you set directly on the SparkConf
object take priority over system properties.
For unit tests, you can also call new SparkConf(false)
to skip loading external settings and
get the same configuration no matter what the system properties are.
All setter methods in this class support chaining. For example, you can write
new SparkConf().setMaster("local").setAppName("My app")
.
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
Main entry point for Spark functionality.
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop()
the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
:: DeveloperApi :: Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, RpcEnv, block manager, map output tracker, etc.
:: DeveloperApi :: Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, RpcEnv, block manager, map output tracker, etc. Currently Spark code finds the SparkEnv through a global variable, so all the threads can access the same SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
NOTE: This is not intended for external use. This is exposed for Shark and may be made private in a future release.
Low-level status reporting APIs for monitoring job and stage progress.
Low-level status reporting APIs for monitoring job and stage progress.
These APIs intentionally provide very weak consistency semantics; consumers of these APIs should
be prepared to handle empty / missing information. For example, a job's stage ids may be known
but the status API may not have any information about the details of those stages, so
getStageInfo
could potentially return None
for a valid stage id.
To limit memory usage, these APIs only provide information on recent jobs / stages. These APIs
will provide information for the last spark.ui.retainedStages
stages and
spark.ui.retainedJobs
jobs.
NOTE: this class's constructor should be considered private and may be subject to change.
:: DeveloperApi :: Task requested the driver to commit, but was denied.
:: DeveloperApi :: Task requested the driver to commit, but was denied.
Contextual information about a task which can be read or mutated during execution.
Contextual information about a task which can be read or mutated during execution. To access the TaskContext for a running task, use:
org.apache.spark.TaskContext.get()
:: DeveloperApi :: Various possible reasons why a task ended.
:: DeveloperApi :: Various possible reasons why a task ended. The low-level TaskScheduler is supposed to retry tasks several times for "ephemeral" failures, and only report back failures that require some old stages to be resubmitted, such as shuffle map fetch failures.
:: DeveloperApi :: Various possible reasons why a task failed.
:: DeveloperApi :: Various possible reasons why a task failed.
:: DeveloperApi :: Task was killed intentionally and needs to be rescheduled.
:: DeveloperApi :: Task was killed intentionally and needs to be rescheduled.
:: DeveloperApi :: Exception thrown when a task is explicitly killed (i.e., task failure is expected).
:: DeveloperApi :: Exception thrown when a task is explicitly killed (i.e., task failure is expected).
A data type that can be accumulated, i.e.
A data type that can be accumulated, i.e. has a commutative and associative "add" operation,
but where the result type, R
, may be different from the element type being added, T
.
You must define how to add data, and how to merge two of these together. For some data types, such as a counter, these might be the same operation. In that case, you can use the simpler org.apache.spark.Accumulator. They won't always be the same, though -- e.g., imagine you are accumulating a set. You will add items to the set, and you will union two sets together.
Operations are not thread-safe.
the full accumulated data (result type)
partial data that can be added in
(Since version 2.0.0) use AccumulatorV2
Helper object defining how to accumulate values of a particular type.
Helper object defining how to accumulate values of a particular type. An implicit AccumulableParam needs to be available when you create Accumulables of a specific type.
the full accumulated data (result type)
partial data that can be added in
(Since version 2.0.0) use AccumulatorV2
A simpler value of Accumulable where the result type being accumulated is the same as the types of elements being merged, i.e.
A simpler value of Accumulable where the result type being accumulated is the same as the types of elements being merged, i.e. variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric value types, and programmers can add support for new types.
An accumulator is created from an initial value v
by calling SparkContext.accumulator
.
Tasks running on the cluster can then add to it using the +=
operator.
However, they cannot read its value. Only the driver program can read the accumulator's value,
using its #value method.
The interpreter session below shows an accumulator being used to add up the elements of an array:
scala> val accum = sc.accumulator(0) accum: org.apache.spark.Accumulator[Int] = 0 scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) ... 10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s scala> accum.value res2: Int = 10
result type
(Since version 2.0.0) use AccumulatorV2
A simpler version of org.apache.spark.AccumulableParam where the only data type you can add in is the same type as the accumulated value.
A simpler version of org.apache.spark.AccumulableParam where the only data type you can add in is the same type as the accumulated value. An implicit AccumulatorParam object needs to be available when you create Accumulators of a specific type.
type of value to accumulate
(Since version 2.0.0) use AccumulatorV2
:: DeveloperApi ::
A org.apache.spark.scheduler.ShuffleMapTask
that completed successfully earlier, but we
lost the executor before the stage completed.
:: DeveloperApi ::
A org.apache.spark.scheduler.ShuffleMapTask
that completed successfully earlier, but we
lost the executor before the stage completed. This means Spark needs to reschedule the task
to be re-executed on a different executor.
The SparkContext object contains a number of implicit conversions and parameters for use with various Spark features.
Resolves paths to files added through SparkContext.addFile()
.
:: DeveloperApi :: Task succeeded.
:: DeveloperApi :: Task succeeded.
:: DeveloperApi :: The task finished successfully, but the result was lost from the executor's block manager before it was fetched.
:: DeveloperApi :: The task finished successfully, but the result was lost from the executor's block manager before it was fetched.
:: DeveloperApi :: We don't know why the task ended -- for example, because of a ClassNotFound exception when deserializing the task result.
:: DeveloperApi :: We don't know why the task ended -- for example, because of a ClassNotFound exception when deserializing the task result.
Spark's broadcast variables, used to broadcast immutable datasets to all nodes.
ALPHA COMPONENT GraphX is a graph processing framework built on top of Spark.
ALPHA COMPONENT GraphX is a graph processing framework built on top of Spark.
IO codecs used for compression.
IO codecs used for compression. See org.apache.spark.io.CompressionCodec.
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
RDD-based machine learning APIs (in maintenance mode).
RDD-based machine learning APIs (in maintenance mode).
The spark.mllib
package is in maintenance mode as of the Spark 2.0.0 release to encourage
migration to the DataFrame-based APIs under the org.apache.spark.ml package.
While in maintenance mode,
spark.mllib
package will be accepted, unless they block
implementing new features in the DataFrame-based spark.ml
package;The developers will continue adding more features to the DataFrame-based APIs in the 2.x series to reach feature parity with the RDD-based APIs. And once we reach feature parity, this package will be deprecated.
SPARK-4591 to track the progress of feature parity
:: Experimental ::
:: Experimental ::
Support for approximate results. This provides convenient api and also implementation for approximate calculation.
Provides several RDD implementations.
Provides several RDD implementations. See org.apache.spark.rdd.RDD.
Spark's scheduling components.
Spark's scheduling components. This includes the org.apache.spark.scheduler.DAGScheduler
and
lower level org.apache.spark.scheduler.TaskScheduler
.
Pluggable serializers for RDD and shuffle data.
Pluggable serializers for RDD and shuffle data.
Allows the execution of relational queries, including those expressed in SQL using Spark.
Spark Streaming functionality.
Spark Streaming functionality. org.apache.spark.streaming.StreamingContext serves as the main entry point to Spark Streaming, while org.apache.spark.streaming.dstream.DStream is the data type representing a continuous sequence of RDDs, representing a continuous stream of data.
In addition, org.apache.spark.streaming.dstream.PairDStreamFunctions contains operations
available only on DStreams
of key-value pairs, such as groupByKey
and reduceByKey
. These operations are automatically
available on any DStream of the right type (e.g. DStream[(Int, Int)] through implicit
conversions.
For the Java API of Spark Streaming, take a look at the org.apache.spark.streaming.api.java.JavaStreamingContext which serves as the entry point, and the org.apache.spark.streaming.api.java.JavaDStream and the org.apache.spark.streaming.api.java.JavaPairDStream which have the DStream functionality.
Spark utilities.
(Since version 2.0.0) use AccumulatorV2
Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.
In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as
groupByKey
andjoin
; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)] through implicit conversions.Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.
Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.
Classes and methods marked with Developer API are intended for advanced users want to extend Spark through lower level interfaces. These are subject to changes or removal in minor releases.