What is Resilient Distributed Datasets (RDDs) ? (Day 3)

Spark’s primary core abstraction is called Resilient Distributed Dataset or RDD.It is designed to support in-memory data storage, distributed across a cluster in a manner that is Resilient,fault-tolerant and efficient. RDD are Resilient as it relies on lineage graph , whenever there is  a failure in system, they can recompute themselves using the prior information. Similarly Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes.

In a layman language, you can RDD is representation of the data that’s coming into your system in an object format & allows you to do computation on it.”

Spark RDD’s can reference to a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Also we can define it as, just a distributed collection of elements that is parallelized across the cluster. Once data is loaded into an RDD, two basic types of operation can be carried out.. Transformations and Actions.

Transformations are those that do not return a value. In fact, nothing is evaluated during the definition of these transformation statements. It just creates a new RDD by changing the original through processes such as mapping, filtering, and more. The fault tolerance aspect of RDDs allows Spark to reconstruct the transformations used to build the lineage to get back the lost data.

Actions are when the transformations get evaluated along with the action that is called for that RDD. Actions return values. For example, you can do a count on a RDD, to get the number of elements within and that value is returned.

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn are logged, and can be repeated in the event of data loss or the failure of a cluster node.RDDs can be persistent in order to cache a dataset in memory across operations. This allows future actions to be much faster, by as much as ten times.
Spark’s cache is fault-tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by using the original transformations. For example, let’s say a node goes offline. All it needs to do when it comes back online is to re-evaluate the graph to where it left off. Caching is provided with Spark to enable the processing to happen in memory. If it does not fit in memory, it will spill to disk.
Interesting thing about Spark is , it’s lazy evaluation.  This is because RDD are not loaded into system as in when the system encounters an RDD , but only done when an Action is supposed to be performed. So to understand this concept, lets take an example:
  • We read a text file and load the data into new created RDD ‘m’   {scala>  val m=sc.textfile (“abc.txt”)  } . This step is interpreted by Spark and an DAG is created that tells it to read data from file and push it in RDD format. An RDD is made of multiple partitions. By default, the minimum # of partitions in an RDD will be two. However, this is customizable and will be different in vendor distributions of Spark. For example, when creating an RDD out of an HDFS file, each block in the file feeds one RDD partition, so a file with 30 unique blocks will create an RDD with 30 partitions. Or in Cassandra, every 100,000 rows get loaded into one RDD partition. So, a Cassandra table with 1 million rows will generate an RDD with 10 partitions.
  • Next step is to display the first item in this RDD,  {scala>  m.first() }
  • Now lets use the .filter() transformation on the ‘m‘ RDD to return a new RDD named “linesWithApache“, which will contain a subset of the items in the file (only the ones containing the string “Apache”: {scala> val linesWithApache = m.filter(line => line.contains(“Apache”))}
  • Now lets use an Action to find no. of lines with Apache word.  {scala> linesWithApache.count()}
  • To further see these lines, you can use .collect()  Action.  {scala> linesWithApache.collect()  }
Learn How Spark actually works? Click here

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s