
What are Resilient Distributed Datasets (RDDs)? (Day 3)

Spark’s primary core abstraction is called the Resilient Distributed Dataset, or RDD. It is designed to support in-memory data storage, distributed across a cluster in a manner that is resilient, fault-tolerant and efficient. RDDs are resilient because they rely on a lineage graph: whenever there is a failure in the system, they can recompute themselves using that prior information. Fault tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and by minimizing data replication between those nodes.

In layman’s terms, an RDD is a representation of the data coming into your system in an object format, which allows you to do computation on it.

Spark RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. We can also define an RDD simply as a distributed collection of elements that is parallelized across the cluster. Once data is loaded into an RDD, two basic types of operation can be carried out: transformations and actions.
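As a minimal sketch, here are both ways of creating an RDD from the Scala shell (sc is the SparkContext that spark-shell provides automatically; the file path below is just a placeholder, not a file from this tutorial):

    // 1. Parallelize an in-memory collection across the cluster.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Reference a dataset sitting in external storage (a local file or an HDFS path).
    val lines = sc.textFile("hdfs:///data/abc.txt")   // placeholder path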

Transformations do not return a value to your driver program; in fact, nothing is evaluated when a transformation statement is defined. A transformation simply describes a new RDD derived from the original through operations such as mapping and filtering. The fault-tolerance aspect of RDDs allows Spark to replay the transformations recorded in the lineage to rebuild any lost data.

Actions are the point at which the accumulated transformations get evaluated, together with the action itself. Actions return values. For example, you can do a count on an RDD to get the number of elements within it, and that value is returned to the driver.
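Here is a rough sketch of the difference (the numbers RDD is invented purely for illustration):

    val numbers = sc.parallelize(1 to 10)      // illustrative data
    val doubled = numbers.map(_ * 2)           // transformation: returns a new RDD, nothing runs yet
    val bigOnes = doubled.filter(_ > 10)       // transformation: lineage grows, still nothing runs

    bigOnes.count()      // action: triggers the whole chain and returns 5
    bigOnes.collect()    // action: returns Array(12, 14, 16, 18, 20) to the driver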

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged and can be replayed in the event of data loss or the failure of a cluster node. RDDs can also be persisted in order to cache a dataset in memory across operations. This allows future actions to be much faster, by as much as ten times.
Spark’s cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the original transformations. For example, if a node goes offline, all it needs to do when it comes back online is re-evaluate the lineage graph from where it left off. Caching lets the processing happen in memory; if the data does not fit in memory (and the chosen storage level allows it), Spark will spill it to disk.
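A small sketch of caching in the Scala shell (the log file and the "ERROR" filter are assumptions made up for this example):

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("server.log")              // placeholder file
    val errors = logs.filter(_.contains("ERROR"))

    errors.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
    // errors.persist(StorageLevel.MEMORY_AND_DISK)   // variant that spills to disk when memory is full

    errors.count()   // first action: reads the file and fills the cache
    errors.count()   // later actions reuse the cached partitions; lost partitions are recomputed from lineage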
Another interesting thing about Spark is its lazy evaluation: an RDD is not materialized the moment the system encounters it, but only when an action has to be performed on it. To understand this concept, let’s take an example (the full sequence is gathered into a single runnable sketch after the list below):
  • We read a text file and load the data into a newly created RDD ‘m’: {scala> val m = sc.textFile("abc.txt")}. Spark interprets this step and creates a DAG that tells it to read data from the file and expose it in RDD form. An RDD is made of multiple partitions. By default, the minimum number of partitions in an RDD is two; however, this is configurable and differs between vendor distributions of Spark. For example, when creating an RDD out of an HDFS file, each block in the file feeds one RDD partition, so a file with 30 blocks will create an RDD with 30 partitions. Or in Cassandra, every 100,000 rows are loaded into one RDD partition, so a Cassandra table with 1 million rows will generate an RDD with 10 partitions.
  • The next step is to display the first item in this RDD: {scala> m.first()}
  • Now let’s use the .filter() transformation on the ‘m’ RDD to return a new RDD named “linesWithApache”, which will contain a subset of the items in the file (only the ones containing the string “Apache”): {scala> val linesWithApache = m.filter(line => line.contains("Apache"))}
  • Now let’s use an action to find the number of lines containing the word “Apache”: {scala> linesWithApache.count()}
  • To actually see these lines, you can use the .collect() action: {scala> linesWithApache.collect()}
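Putting the steps above together, the whole session looks roughly like this (assuming abc.txt exists on the default file system):

    val m = sc.textFile("abc.txt")        // lazily defines the RDD; nothing is read yet
    m.partitions.size                     // number of partitions backing this RDD

    m.first()                             // action: returns the first line of the file

    val linesWithApache = m.filter(line => line.contains("Apache"))   // transformation, still lazy

    linesWithApache.count()               // action: number of lines containing "Apache"
    linesWithApache.collect()             // action: brings those lines back to the driver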
Want to learn how Spark actually works? Click here

 


Spark Components and Architecture (Day 2)

So here we are today… Day 2 of the Spark learning tutorial. As we all know, Spark is a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages and on a variety of architectures. Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time.

To make Spark easy to learn and to incorporate into existing applications as straightforwardly as possible, it has been developed to support many programming languages, including Java, Python, Scala, SQL and R. Spark is easy to download and install on a laptop or virtual machine. Spark was built to run in a couple of different ways: standalone, or as part of a cluster. For production workloads operating at scale, Spark needs to run on a big data cluster. These clusters are often also used for Hadoop jobs, and Hadoop’s YARN resource manager will generally be used to manage that cluster (including Spark). Spark can also run just as easily on clusters controlled by Apache Mesos. A series of scripts bundled with current releases of Spark simplify the process of launching Spark on Amazon Web Services’ Elastic Compute Cloud (EC2).

Spark Architecture

The Spark architecture, or stack, currently comprises Spark Core and four libraries that are optimized to address the requirements of four different use cases. Individual applications will typically require Spark Core and at least one of these libraries.

What are Spark Components?

Spark Core: It is a general-purpose engine providing basic functionality such as task scheduling, task distribution, fault recovery, interaction with storage systems and monitoring of applications across a cluster. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), Spark’s main programming abstraction.

Then you have the components on top of the core, which are designed to interoperate closely. The benefit of such a stack is that all the higher-layer components inherit the improvements made at the lower layers. For example, an optimization to Spark Core will speed up the SQL, streaming, machine learning and graph processing libraries as well.

  1. Spark Streaming: This module enables scalable and fault-tolerant processing of streaming data, and can integrate with established sources of data streams such as Flume. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
  2. Spark SQL: This module is for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON (see the short sketch after this list). Spark SQL also supports JDBC and ODBC connections, enabling a degree of integration with existing databases, data warehouses and business intelligence tools.
  3. GraphX: It supports analysis of and computation over graphs of data (e.g., a social network’s friend graph) and performs graph-parallel computations. Like Spark Streaming and Spark SQL, it extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. It provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
  4. Spark MLlib: Spark comes with a library of common machine learning (ML) functionality, called MLlib. It provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
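To give a feel for the Spark SQL module mentioned above, here is a minimal Spark 1.x-style sketch (the people.json file, with one JSON object per line, is a made-up example):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)                // Spark 1.x entry point (newer releases use SparkSession)

    val people = sqlContext.read.json("people.json")   // placeholder file, e.g. {"name":"Ann","age":34}
    people.registerTempTable("people")                 // expose the DataFrame as a temporary table

    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()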

What are Resilient Distributed Datasets (RDDs)? Click here for the Day 3 tutorial 🙂