What Exactly Is an RDD?

2016/12/12 posted in Spark

Reading the RDD comments in the source code

When learning Spark, if you are unsure about something, go look at the relevant source code; those parts usually carry fairly detailed comments. After all, that is the authors' first-hand explanation, which tends to be more reliable than blog posts found online.

  • The doc comment on the abstract class RDD
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel. This class contains the
basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
pairs, such as `groupByKey` and `join`;
[[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
Doubles; and
[[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
can be saved as SequenceFiles.
All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
through implicit conversions.

Internally, each RDD is characterized by five main properties:

 - A list of partitions
 - A function for computing each split
 - A list of dependencies on other RDDs
 - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
   an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
reading data from a new storage system) by overriding these functions. Please refer to the
[[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
on RDD internals.
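
The last paragraph above is worth making concrete: a custom RDD only has to override a couple of these methods. Below is a minimal sketch of a custom RDD that produces the integers [0, n) from an in-memory range; `RangeRDD` and `RangePartition` are hypothetical names invented here for illustration, not Spark classes.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A toy partition: carries its index plus the sub-range it covers.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A minimal custom RDD yielding the integers [0, n) split into numSlices partitions.
// It overrides the two required members: getPartitions (property 1, the list of
// partitions) and compute (property 2, the function computing each split).
// Passing Nil as deps means no parent RDDs (property 3); the partitioner and
// preferred locations (properties 4 and 5) keep their default "none" implementations.
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}
```

Calling `new RangeRDD(sc, 100, 4).collect()` would return 0 to 99 computed across 4 partitions. A real storage-backed RDD (say, one reading a new file format) would additionally override `getPreferredLocations` so the scheduler can exploit data locality.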
  • A brief summary of my own

1. An RDD (Resilient Distributed Dataset) is the most basic abstraction in Spark: an immutable, partitioned collection of elements that can be processed in parallel. Spark provides operators such as map, filter, and persist to work on RDDs.
2. In addition, the class [[org.apache.spark.rdd.PairRDDFunctions]] contains operations that apply only to key-value RDDs, such as groupByKey and join.
3. Other classes handle RDDs of specific element types, for example Doubles ([[org.apache.spark.rdd.DoubleRDDFunctions]]) and data that can be saved as SequenceFiles ([[org.apache.spark.rdd.SequenceFileRDDFunctions]]).
4. All of these operations become available automatically on any RDD of the right type, mainly thanks to Scala's implicit conversions. Whether you are operating on an RDD of single elements or on a pair RDD is therefore hidden at the call site (see the sketch after this list).
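
To see the implicit conversion at work, here is a minimal, self-contained sketch (the `local[*]` master and app name are illustrative assumptions, not anything prescribed by Spark):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImplicitDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("implicit-demo").setMaster("local[*]"))

    // A plain RDD[Int]: map and filter are defined on the RDD class itself.
    val nums = sc.parallelize(1 to 10)
    println(nums.filter(_ % 2 == 0).count())

    // An RDD[(Int, Int)]: groupByKey is NOT defined on RDD. The compiler finds
    // the implicit conversion RDD.rddToPairRDDFunctions (in scope automatically
    // since Spark 1.3), wraps the RDD in PairRDDFunctions, and the call compiles
    // as if groupByKey were a method on the RDD.
    val pairs = nums.map(n => (n % 3, n))
    pairs.groupByKey().collect().foreach(println)

    sc.stop()
  }
}
```

Because the conversion only applies when the element type matches (here a tuple), calling groupByKey on a plain RDD[Int] simply fails to compile, which is exactly the type safety the implicit mechanism preserves.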

  • The five properties of an RDD
    • A list of partitions, the basic units that make up the dataset
    • A function for computing each split (partition)
    • A list of dependencies on parent RDDs; this is what Spark calls lineage