So here we are, on Day 2 of the Spark learning tutorial. As we all know, Spark is a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages and on a variety of architectures. Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time.
To make Spark easy to learn and as straightforward as possible to incorporate into existing applications, it was developed to support many programming languages, including Java, Python, Scala, SQL and R. Spark is easy to download and install on a laptop or virtual machine. It was built to run in a couple of different ways: standalone, or as part of a cluster. Production workloads operating at scale will need to run Spark on a big data cluster. These clusters are often also used for Hadoop jobs, and Hadoop’s YARN resource manager will generally be used to manage that Hadoop cluster (including Spark). Spark can also run just as easily on clusters controlled by Apache Mesos. Older releases of Spark bundled a series of scripts that simplify the process of launching Spark on Amazon Web Services’ Elastic Compute Cloud (EC2).
The Spark architecture, or stack, currently comprises Spark Core and four libraries that are optimized to address the requirements of four different use cases. Individual applications will typically require Spark Core and at least one of these libraries.
What are Spark Components?
Spark Core: It is a general-purpose system providing basic functionality such as task scheduling, task distribution, fault recovery, interaction with storage systems, and monitoring of applications across a cluster. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.
Then you have the components on top of the core, which are designed to interoperate closely. The benefit of such a stack is that all the higher-layer components inherit the improvements made at the lower layers. Example: an optimization to Spark Core will speed up the SQL, streaming, machine learning and graph processing libraries as well.
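To make the RDD idea a bit more concrete, here is a minimal sketch in plain Python. It is not Spark’s API: the class `ToyPartitionedDataset` and its methods are hypothetical names invented for this illustration. It only mimics the basic idea of splitting a collection into partitions and operating on those partitions in parallel, and it assumes the function passed to `reduce` is associative and commutative.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyPartitionedDataset:
    """Toy stand-in for an RDD (hypothetical, NOT Spark's real API)."""

    def __init__(self, partitions):
        # Each partition is an independent chunk of the data.
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_list(cls, data, num_partitions):
        # Round-robin split of the input into num_partitions chunks.
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Apply fn to every element, one worker task per partition.
        def run(partition):
            return [fn(x) for x in partition]

        with ThreadPoolExecutor() as pool:
            return ToyPartitionedDataset(list(pool.map(run, self.partitions)))

    def reduce(self, fn):
        # Reduce each non-empty partition first, then combine the partial results.
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)

    def collect(self):
        # Gather all partitions back into a single list.
        return [x for p in self.partitions for x in p]


numbers = ToyPartitionedDataset.from_list(range(1, 11), num_partitions=3)
squares = numbers.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b)
print(total)  # sum of the squares of 1..10
```

In real Spark, the partitions live on different machines of the cluster and Spark Core handles scheduling and fault recovery; this toy version runs everything in one process and does neither.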
- Spark Streaming: This module enables scalable and fault-tolerant processing of streaming data, and can integrate with established sources of data streams like Flume. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
- Spark SQL: This module is for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HQL), and it supports many sources of data, including Hive tables, Parquet, and JSON. Spark SQL also supports JDBC and ODBC connections, enabling a degree of integration with existing databases, data warehouses and business intelligence tools.
- GraphX: It supports analysis of and computation over graphs of data (e.g., a social network’s friend graph) and performs graph-parallel computations. Like Spark Streaming and Spark SQL, it extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. It provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
- Spark MLlib: Spark comes with a library containing common machine learning (ML) functionality, called MLlib. It provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
What are Resilient Distributed Datasets (RDDs)? That is the topic of the Day 3 tutorial 🙂
Apache Spark is free and open-source software built for fast computation. It should not be confused with the Spark Framework, a simple Java web framework that is an alternative to other Java web application frameworks such as JAX-RS and Spring MVC; the two projects are unrelated. Apache Spark was started in 2009 at UC Berkeley.
To define it, Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. It actually extends the MapReduce model to support more kinds of computation, such as interactive and iterative algorithms, queries, stream processing and graph processing. It is designed to be highly accessible, offering simple APIs in languages like Python, Java, Scala and SQL. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
Is Spark a Hadoop module?
Spark is listed as a module on Hadoop’s project page, but Spark also has its own page because, while it can run in Hadoop clusters through YARN, it also has a standalone mode. So Spark is independent. Spark has no storage mechanism of its own, so it needs a fast and scalable file system to store data. It therefore uses S3, HDFS or any other file system; if you already use Hadoop, HDFS is a very low-cost choice.
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.
Hadoop vs Spark
A direct comparison of Hadoop and Spark is difficult because they do many of the same things, but are also non-overlapping in some areas. The most important thing to remember about Hadoop and Spark is that their use is not an either-or scenario, because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other, and that makes their pairing an extremely powerful solution for a variety of big data applications. Still, we can compare them on the points below:
- Data Processing Engine/Operators: Hadoop was originally designed to handle crawling and searching billions of web pages and collecting their information into a database. For this it uses MapReduce, which is a batch-processing engine. MapReduce operates in sequential steps: reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster, and so on. Spark is a cluster-computing framework that performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operations (filter/map/join/groupBy) on the data, and then writes the results back to the cluster.
- File System: Spark has no file management of its own and therefore must rely on Hadoop’s Distributed File System (HDFS) or some other solution like S3 or Tachyon (now called Alluxio).
- Speed/Performance: Spark’s in-memory processing makes it very fast (up to 100 times faster than Hadoop MapReduce). Spark can also perform batch processing, but it really excels at streaming workloads, interactive queries, and machine learning. The reason Spark is so fast is that it processes data in memory wherever possible, and it can also use disk for data that doesn’t all fit into memory. In other words, Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. Example: Internet of Things sensors, log monitoring and security analytics all benefit from Spark’s faster computation.
- Storage: MapReduce relies on persistent (disk) storage between steps, while Spark holds its working data in Resilient Distributed Datasets (RDDs).
- Ease of Use: Spark is well known for its performance, but it’s also somewhat well known for its ease of use, in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL. Spark has an interactive mode so that developers and users can run queries. MapReduce has no interactive mode, but add-ons such as Hive and Pig make working with MapReduce a little easier for developers.
- Costs: Both MapReduce and Spark are Apache projects, which means that they’re open source and free software products. While there’s no cost for the software, there are costs associated with running either platform, in personnel and in hardware. Both products are designed to run on commodity hardware, such as low-cost, so-called white box server systems. However, Spark systems cost more because of the large amounts of RAM required to run everything in memory. What’s also true is that Spark’s technology reduces the number of required systems, so you end up with significantly fewer systems that each cost more. There’s probably a point at which Spark actually reduces the cost per unit of computation, even with the additional RAM requirement.
- APIs: Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).
- Fault Tolerance: Hadoop uses replicated blocks of data to provide this feature. Each TaskTracker sends regular heartbeats to the JobTracker, so if a heartbeat is missed, the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This effectively provides fault tolerance. Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.
- Scalability: Both MapReduce and Spark are scalable using HDFS.
- Security: Hadoop supports Kerberos authentication. HDFS supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions. Spark’s security is a bit sparse, as it currently only supports authentication via a shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.
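The difference described under Data Processing Engine/Operators can be sketched with a toy model in plain Python. This is neither Hadoop nor Spark code and it is not a benchmark: `CountingStorage`, `run_step_by_step` and `run_in_memory` are hypothetical names made up for this illustration, and "storage" here is just a list in memory that counts how often it is read and written.

```python
class CountingStorage:
    """Hypothetical stand-in for cluster storage that counts its I/O calls."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, records):
        self.writes += 1
        self.data = list(records)


def run_step_by_step(storage, steps):
    """MapReduce-style: every step reads from storage and writes back to it."""
    for step in steps:
        records = storage.read()
        storage.write(step(records))
    return storage.data


def run_in_memory(storage, steps):
    """Spark-style: read once, chain the steps in memory, write once."""
    records = storage.read()
    for step in steps:
        records = step(records)
    storage.write(records)
    return storage.data


steps = [
    lambda rs: [r for r in rs if r % 2 == 0],  # filter: keep even numbers
    lambda rs: [r * 10 for r in rs],           # map: scale each record
    lambda rs: [sum(rs)],                      # aggregate: total of all records
]

disk_based = CountingStorage(range(1, 11))
in_memory = CountingStorage(range(1, 11))

result_a = run_step_by_step(disk_based, steps)
result_b = run_in_memory(in_memory, steps)

print(result_a, disk_based.reads, disk_based.writes)
print(result_b, in_memory.reads, in_memory.writes)
```

Both runs produce the same result, but the step-by-step version touches storage once per step in each direction, while the in-memory version reads and writes only once. On a real cluster those extra round trips go to disk, which is where much of the speed difference comes from.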
Learn Spark Architecture by clicking here.