2016/12/12 posted in  Spark



  • The doc comment of the abstract class RDD
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel. This class contains the
basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
pairs, such as `groupByKey` and `join`;
[[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
Doubles; and
[[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
can be saved as SequenceFiles.
All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
through implicit conversions.

Internally, each RDD is characterized by five main properties:

 - A list of partitions
 - A function for computing each split
 - A list of dependencies on other RDDs
 - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
   an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
reading data from a new storage system) by overriding these functions. Please refer to the
[[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
on RDD internals.
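To make the five properties above concrete, here is a toy, self-contained sketch of the abstraction: a simplified trait with one method per property, plus a custom "RDD" that reads from a hypothetical storage system. The method names mirror Spark's RDD API, but the signatures are simplified assumptions, not Spark's actual ones, and no Spark dependency is used.

```scala
// A toy stand-in for Spark's Partition: just an index into the dataset.
case class ToyPartition(index: Int)

// Simplified mirror of the five properties that characterize an RDD.
trait ToyRDD[T] {
  def getPartitions: Seq[ToyPartition]                  // 1. a list of partitions
  def compute(split: ToyPartition): Iterator[T]         // 2. a function computing each split
  def dependencies: Seq[ToyRDD[_]] = Nil                // 3. dependencies on parent RDDs (lineage)
  def partitioner: Option[Int => Int] = None            // 4. optional partitioner for key-value RDDs
  def preferredLocations(split: ToyPartition): Seq[String] = Nil // 5. optional preferred locations

  // "Scheduling and execution" reduced to its essence: compute every split.
  def collect(): Seq[T] = getPartitions.flatMap(p => compute(p).toSeq)
}

// A custom RDD for a new "storage system": each partition yields a range of ints.
class RangeRDD(numPartitions: Int, perPartition: Int) extends ToyRDD[Int] {
  def getPartitions: Seq[ToyPartition] =
    (0 until numPartitions).map(i => ToyPartition(i))
  def compute(split: ToyPartition): Iterator[Int] =
    (split.index * perPartition until (split.index + 1) * perPartition).iterator
}

object Demo extends App {
  val rdd = new RangeRDD(2, 3)
  println(rdd.collect().mkString(","))  // prints "0,1,2,3,4,5"
}
```

The point of the sketch is the division of labor the doc comment describes: a custom RDD only has to say what its partitions are and how to compute one of them; everything else (dependencies, partitioner, locations) is optional metadata the scheduler can exploit.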
  • A brief summary of my own


  • The 5 properties of an RDD
    • A list of partitions, the basic units that make up the dataset
    • A function for computing each split
    • A list of dependencies on parent RDDs; this is what Spark calls lineage