Spark Framework is a simple Java web framework built for fast computation. It is a free and open-source software & an alternative to other Java web application frameworks such as JAX-RS and Spring MVC. It was started in 2009 at Berkeley.
To define, Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. It actually extends MR model to support more computation ways like interactive/iterative algos, queries, stream processing, graph processing etc. It is designed to be highly accessible, offering simple API in languages like Python, Java, Scala & SQL.One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
Is Spark a Hadoop module?
We see Spark is listed as a module on Hadoop’s project page, but Spark also has its own page because, while it can run in Hadoop clusters through YARN, it also has a standalone mode. So Spark is independent. By default there is no storage mechanism in Spark, so to store data, need fast and scalable file system. Hence uses S3 or HDFS or any other file system, but if you use Hadoop it’s very low cost.
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.
Hadoop vs Spark
A direct comparison of Hadoop and Spark is difficult because they do many of the same things, but are also non-overlapping in some areas.The most important thing to remember about Hadoop and Spark is that their use is not an either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other and that makes their pairing an extremely powerful solution for a variety of big data applications.So we can compare them on some below points:
- Data Processing Engine/Operators: Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. For this it uses Map reduce,which is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster and so on. But Spark is a cluster-computing framework,Which performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operation (Filter/map/join/groupby) on the data, and then writes it back to the cluster.
- File System: Spark has no file management and therefor must rely on Hadoop’s Distributed File System (HDFS) or some other solution like S3, Tachyon.
- Speed/Performance: Spark’s in-memory processing admit that Spark is very fast (Up to 100 times faster than Hadoop MapReduce), Spark can also perform batch processing, however, it really excels at streaming workloads, interactive queries, and machine-based learning.The reason that Spark is so fast is that it processes everything in memory. Yes, it can also use disk for data that doesn’t all fit into memory.Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. Example: Internet of Things sensors, log monitoring, security analytics all require Spark for faster computation.
- Storage: MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs)
- Ease of Use: Spark is well known for its performance, but it’s also somewhat well known for its ease of use in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL.Spark has an interactive mode so that developers and users can run queries.MapReduce has no interactive mode, but add-ons such as Hive and Pig to make working with MapReduce a little easier for developers.
- Costs :Both MapReduce and Spark are Apache projects, which means that they’re open source and free software products. While there’s no cost for the software, there are costs associated with running either platform in personnel and in hardware. Both products are designed to run on commodity hardware, such as low cost, so-called white box server systems. However Spark systems cost more because of the large amounts of RAM required to run everything in memory. But what’s also true is that Spark’s technology reduces the number of required systems. So, you have significantly fewer systems that cost more. There’s probably a point at which Spark actually reduces costs per unit of computation even with the additional RAM requirement.
- API’s: Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).
- Fault Tolerance: Hadoop uses Replicated blocks of data to maintain this feature. There is a link between TaskTrackers & JobTracker, so if its missed then the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This effectively provides fault tolerance.Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.
- Scalability: both MapReduce and Spark are scalable using the HDFS.
- Security: Hadoop supports Kerberos authentication. HDFS supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions.Spark’s security is a bit sparse by currently only supporting authentication via shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN giving it the capability of using Kerberos authentication.
Learn Spark Architecture by clicking here.