Category Archives: Hadoop

What is hdfs (Tutorial Day 9)

HDFS stands for Hadoop Distributed File System. It is designed to provide a fault-tolerant file system designed to run on commodity hardware. It uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). The HDFS cluster consists of a single NameNode and a master server manages the file system namespace and regulates access to files.

Namenode is the centerpiece or master node of  HDFS. It only stores the metadata of HDFS and no data – means the directory tree of all files in the file system, and tracks the files across the cluster. The data is actually stored in the DataNodes. NN knows the list of the blocks and its location for any given file in HDFS. With this information it knows how to construct the file from blocks. NN is so critical that if its down or has any fault, HDFS/Hadoop cluster is inaccessible. NN is a single point of failure in Hadoop cluster. NameNode is usually configured with a lot of memory (RAM). We generally have Secondary NN to cover up this kind of failures, but they are manual start-up.

DataNode is known as Slave nodes & is responsible for storing the actual data in HDFS. NN & DN are in constant communication. When a DataNode starts up it announce itself to the NN along with the list of blocks it is responsible for. When a DataNode is down, it does not affect the availability of data or the cluster. NN will arrange for replication for the blocks managed by the DN that is not available. DataNode is usually configured with a lot of hard disk space. Because the actual data is stored in the DataNode.

More features of HDFs:

  • Single Master node, along with secondary Namenode (in Hadoop 1.x, secondary NN is manual start up, whereas in Hadoop 2.x we have automated failure recovery using secondary namenode)
  • Multiple Data nodes cluster (also called slave deamons)
  • Every block has a fixed size
  • Example: We have a NYSE data file of 39 mb. So if my datanode has block size defined as 27mb each, then we will get 2 files created in HDFS .One of 27mb and one of 12mb


HDFS Architecture 



                                                                                             Image credit: Google
  • Client is a application running on our machines which is used to interact with NN and DN, Job tracker etc. It is used for User interaction and is called HDFS client.
  • A Hadoop Cluster is a collection of racks. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in rack is greater than bandwidth between two nodes on different racks. In other words, a rack is the hard-disk or storage area of HDFS.
  • Client interacts with NN using SSH and not http.
  • To maintain fault tolerance on Hadoop system, we maintain replicate data. Minimum no. of replicas required by HDFS is 3. It can configured by admin too, but redundancy is necessary to be done
  • When a file is written to HDFS, it is split up into big chucks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (default is 64MB). All blocks in a file except the last block are the same size . Each block is stored on one or more Data nodes, controlled by the parameter dfs.replication in the same file (in most of this post – set to 3, which is the default). Each copy of a block is called a replica.The blocks of a file are replicated for fault tolerance. Files in HDFS are write-once and have strictly one writer at any time.
  • Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data. In case, if one of the nodes is not responding, it is assumed to be failed. Only then, the required calculation will be done on the second replica.

Process to read/write file into hdfs

When writing data to an HDFS file, its first written to local cache. When the cache reaches a block size (default 64MB), the client request the list of DN from the NN. This list contains all the DN that will host a replica of that block. The number of DN replication is default to 3. The client then organizes a pipeline from DN-to-DN and flushes the data block to the first DN (as shown in image below). The first DataNode starts receiving the data in small portions (file system block size 4 KB), writes each portion to its local repository and transfers the same portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes the same portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DN can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. When the first block is filled, the client requests new DN to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. This flow continues till last block of the file.


                                                                                                    Image credit: google

Advantages of HDFS

  • Suitable for applications with large datasets only. If you have small data sets, then its too expensive to store them
  • Inexpensive as filesystem relies on commodity storage disks that are much less expensive than the storage media used for enterprise grade storage
  • Highly Fault Tolerant
  • High throughput ( time taken to access , read data. Similar to example explained below)

HDFS is optimized for MapReduce workloads. It provides very high performance for sequential reads and writes, which is the typical access pattern in MapReduce jobs.

Learn more about Hadoop Installation, here !!


Follow for more interesting Big Data stuff to read !!


Learn how Map Reduce Algorithm works with a simple example.. (Tutorial Day 7)

MapReduce Algorithm

MapReduce is a Distributed Data Processing Algorithm, introduced by Google.It is mainly inspired by Functional Programming model. MapReduce algorithm is mainly useful to process huge amount of data in parallel, reliable and efficient way in cluster environments. Its uses the technique “Divide and Conquer” algorithm to process large amount of data.It divides input task into smaller and manageable sub-tasks to execute them in-parallel.

MapReduce Algorithm Steps

MapReduce Algorithm uses the following three main steps:

Map Function : Map Function is the first step in MapReduce Algorithm. It takes input tasks (say DataSets. I have given only one DataSet in below diagram.) and divides them into smaller sub-tasks. Then perform required computation on each sub-task in parallel.

This step performs the following two sub-steps:

  • Splitting :Splitting step takes input DataSet from Source and divide into smaller Sub-DataSets.
  • Mapping :Mapping step takes those smaller Sub-DataSets and perform required action or computation on each Sub-DataSet.

The output of this Map Function is a set of key and value pairs as <Key, Value> as shown in the below diagram.

Sort & Shuffle Function :

It is the second step in MapReduce Algorithm. Shuffle Function is also know as “Combine Function”. It takes a list of outputs coming from “Map Function” and perform these two following sub-steps on each and every key-value pair:

  1. Merging
  2. Sorting
  • Merging step combines all key-value pairs which have same keys (that is grouping key-value pairs by comparing “Key”). This step returns <Key, List<Value>>.
  • Sorting step takes input from Merging step and sort all key-value pairs by using Keys. This step also returns <Key, List<Value>> output but with sorted key-value pairs.


Reduce Function :It is the final step in MapReduce Algorithm. It performs only one step : Reduce step. It takes list of <Key, List<Value>> sorted pairs from Shuffle Function and perform reduce operation as shown below.

Final step output looks like first step output. However final step <Key, Value> pairs are different than first step <Key, Value> pairs. Final step <Key, Value> pairs are computed and sorted pairs.

We can observe the difference between first step output and final step output with some simple example.

Example 1: Word Counting

Problem Statement:
Count the number of occurrences of each word available in a DataSet. So in below example, count how many times Fruits appear and what is the count ?

Input DataSet
Please find below example Input DataSet file. This is small set of data, but in real-time applications , they use very huge amount of Data.


Output: Counts of each fruit.

How it works:

Step1: MapReduce – Map Function (Split Step)


Step2: MapReduce – Map Function (Mapping Step)


Step 3: MapReduce – Sort & Shuffle Function


Step4: Reduce Function


So we can accumulate our learning from this example as :

  1. The input dataset can be divided into n number of chunks depending upon the amount of data and block size.
  2. In Map function, all the chunks are processed simultaneously at the same time, which embraces the parallel processing of data.
  3. In shuffling, happens which leads to aggregation of similar patterns.
  4. Finally, reducers combine them all to get a consolidated output as per the logic.


Now in next tutorial we will discuss what is Scope ?

What is Resilient Distributed Datasets (RDDs) ? (Day 3)

Spark’s primary core abstraction is called Resilient Distributed Dataset or RDD.It is designed to support in-memory data storage, distributed across a cluster in a manner that is Resilient,fault-tolerant and efficient. RDD are Resilient as it relies on lineage graph , whenever there is  a failure in system, they can recompute themselves using the prior information. Similarly Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes.

In a layman language, you can RDD is representation of the data that’s coming into your system in an object format & allows you to do computation on it.”

Spark RDD’s can reference to a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Also we can define it as, just a distributed collection of elements that is parallelized across the cluster. Once data is loaded into an RDD, two basic types of operation can be carried out.. Transformations and Actions.

Transformations are those that do not return a value. In fact, nothing is evaluated during the definition of these transformation statements. It just creates a new RDD by changing the original through processes such as mapping, filtering, and more. The fault tolerance aspect of RDDs allows Spark to reconstruct the transformations used to build the lineage to get back the lost data.

Actions are when the transformations get evaluated along with the action that is called for that RDD. Actions return values. For example, you can do a count on a RDD, to get the number of elements within and that value is returned.

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn are logged, and can be repeated in the event of data loss or the failure of a cluster node.RDDs can be persistent in order to cache a dataset in memory across operations. This allows future actions to be much faster, by as much as ten times.
Spark’s cache is fault-tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by using the original transformations. For example, let’s say a node goes offline. All it needs to do when it comes back online is to re-evaluate the graph to where it left off. Caching is provided with Spark to enable the processing to happen in memory. If it does not fit in memory, it will spill to disk.
Interesting thing about Spark is , it’s lazy evaluation.  This is because RDD are not loaded into system as in when the system encounters an RDD , but only done when an Action is supposed to be performed. So to understand this concept, lets take an example:
  • We read a text file and load the data into new created RDD ‘m’   {scala>  val m=sc.textfile (“abc.txt”)  } . This step is interpreted by Spark and an DAG is created that tells it to read data from file and push it in RDD format. An RDD is made of multiple partitions. By default, the minimum # of partitions in an RDD will be two. However, this is customizable and will be different in vendor distributions of Spark. For example, when creating an RDD out of an HDFS file, each block in the file feeds one RDD partition, so a file with 30 unique blocks will create an RDD with 30 partitions. Or in Cassandra, every 100,000 rows get loaded into one RDD partition. So, a Cassandra table with 1 million rows will generate an RDD with 10 partitions.
  • Next step is to display the first item in this RDD,  {scala>  m.first() }
  • Now lets use the .filter() transformation on the ‘m‘ RDD to return a new RDD named “linesWithApache“, which will contain a subset of the items in the file (only the ones containing the string “Apache”: {scala> val linesWithApache = m.filter(line => line.contains(“Apache”))}
  • Now lets use an Action to find no. of lines with Apache word.  {scala> linesWithApache.count()}
  • To further see these lines, you can use .collect()  Action.  {scala> linesWithApache.collect()  }
Learn How Spark actually works? Click here


Spark Components..Architecture (Day 2)

So here we are today…in Day 2  tutorial for Spark learning. As we all know, that Spark is a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages and on a variety of architectures. Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time.

To learn Spark easily and incorporate into existing applications as straightforwardly as possible., its developed to support many programming languages like Java, Python, Scala, SQL & R. Spark is easy to download and install on a laptop or virtual machine. Spark was built to be able to run in a couple different ways: standalone, or part of a cluster.For production workloads that are operating at scale, Spark will require to run on an big data cluster. These clusters are often also used for Hadoop jobs, and Hadoop’s YARN resource manager will generally be used to manage that Hadoop cluster (including Spark).  Spark can also run just as easily on clusters controlled by Apache Mesos.A series of scripts bundled with current releases of Spark simplify the process of launching Spark on Amazon Web Services’ Elastic Compute Cloud (EC2).

Spark Architecture

The Spark architecture or stack currently is comprised of Spark Core and four libraries that are optimized to address the requirements of four different use cases.Individual applications will typically require Spark Core and at least one of these libraries.

What are Spark Components?

Spark core: Its is a general-purpose system providing basic functionality like task scheduling, distributing,fault recovery, interacting with storage systems and monitoring of the applications across a cluster. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which is Spark’s main programming abstraction.

Then you have the components on top of the core that are designed to interoperate closely.Benefit of such a stack is that all the higher layer components will inherit the improvements made at the lower layers. Example: Optimization to the Spark Core will speed up the SQL, the streaming, the machine learning and the graph processing libraries as well.

  1. Spark Streaming : This module enables scalable and fault-tolerant processing of streaming data, and can integrate with established sources of data streams like Flume. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
  2. Spark SQL: This module is for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON.Spark SQL also supports JDBC and ODBC connections, enabling a degree of integration with existing databases, data warehouses and business intelligence tools.
  3. GRaphX : It supports analysis of and computation over graphs of data (e.g., a social network’s friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, it also extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. It provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
  4. Spark Mlib : Spark comes with a library containing common machine learning (ML) functionality, called MLlib. It provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.

What is Resilient Distributed Datasets (RDDs)? Click here to learn Day 3 tutorial 🙂

What is it replacing Hadoop ? (Day1)

Spark Framework is a simple Java web framework built for fast computation. It is a free and open-source software  & an alternative to other Java web application frameworks such as JAX-RS and Spring MVC. It was started in 2009 at Berkeley.


To define, Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. It actually extends MR model to support more computation ways like interactive/iterative algos, queries, stream processing, graph processing etc. It is designed to be highly accessible, offering simple API in languages like Python, Java, Scala & SQL.One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

Is Spark a Hadoop module?

We see Spark is listed as a module on Hadoop’s project page, but Spark also has its own page because, while it can run in Hadoop clusters through YARN, it also has a standalone mode. So Spark is independent. By default there is no storage mechanism in Spark, so to store data, need fast and scalable file system. Hence uses S3 or HDFS or any other file system, but if you use Hadoop it’s very low cost.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.

(Source: Internet)

Hadoop vs Spark

A direct comparison of Hadoop and Spark is difficult because they do many of the same things, but are also non-overlapping in some areas.The most important thing to remember about Hadoop and Spark is that their use is not an either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other and that makes their pairing an extremely powerful solution for a variety of big data applications.So we can compare them on some below points:

  1. Data Processing Engine/Operators: Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. For this it uses Map reduce,which is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster and so on. But Spark is a cluster-computing framework,Which performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operation (Filter/map/join/groupby) on the data, and then writes it back to the cluster.
  2. File System: Spark has no file management and therefor must rely on Hadoop’s Distributed File System (HDFS) or some other solution like S3, Tachyon.
  3. Speed/Performance: Spark’s in-memory processing admit that Spark is very fast (Up to 100 times faster than Hadoop MapReduce), Spark can also perform batch processing, however, it really excels at streaming workloads, interactive queries, and machine-based learning.The reason that Spark is so fast is that it processes everything in memory. Yes, it can also use disk for data that doesn’t all fit into memory.Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. Example: Internet of Things sensors, log monitoring, security analytics all require Spark for faster computation.
  4. Storage: MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs)
  5. Ease of Use: Spark is well known for its performance, but it’s also somewhat well known for its ease of use in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL.Spark has an interactive mode so that developers and users can run queries.MapReduce has no interactive mode, but add-ons such as Hive and Pig  to make working with MapReduce a little easier for developers.
  6. Costs :Both MapReduce and Spark are Apache projects, which means that they’re open source and free software products. While there’s no cost for the software, there are costs associated with running either platform in personnel and in hardware. Both products are designed to run on commodity hardware, such as low cost, so-called white box server systems. However Spark systems cost more because of the large amounts of RAM required to run everything in memory. But what’s also true is that Spark’s technology reduces the number of required systems. So, you have significantly fewer systems that cost more. There’s probably a point at which Spark actually reduces costs per unit of computation even with the additional RAM requirement.
  7. API’s: Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).
  8. Fault Tolerance: Hadoop uses Replicated blocks of data to maintain this feature. There is a link between TaskTrackers & JobTracker, so if its missed then the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This effectively provides fault tolerance.Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.
  9. Scalability: both MapReduce and Spark are scalable using the HDFS.
  10. Security: Hadoop supports Kerberos authentication. HDFS supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions.Spark’s security is a bit sparse by currently only supporting authentication via shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN giving it the capability of using Kerberos authentication.

Learn Spark Architecture by clicking here.

Learn Sqoop..!! (tutorial Day 8)

Being a part of Hadoop ecosystem, Scoop is an important interaction tool. When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into picture, they required a tool to interact with the relational database servers for importing and exporting the Big Data residing in them. Here, Sqoop provides feasible interaction between relational database server and Hadoop’s HDFS.

What is Scoop?

Sqoop is a open source tool designed to transfer data between Hadoop (Hive/HDFS or Hbase) and relational database servers. It is used to import data from structured Data stores or relational databases such as MySQL, Oracle to Hadoop related eco-systems like Hive or HDFS or HBase, and export from Hadoop file system back to relational databases, enterprise data warehouses. Sqoop only works with structured & relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres etc.IF your DB doesn’t lie in this category, we can still use Scoop, using Extension framework – Connectors. You can find connectors online, and modify there code or write your own code using framework. Generally, JDBC connectivity comes handy with maximum of databases. so this resolves your issue.

Sqoop: “SQL to Hadoop and Hadoop to SQL”

Why is Sqoop used?

Scoop doesn’t have any server,so it is a client library. So it doesn’t matter you run it from Data node or from anywhere. It will find the instalation locally or  you can define the hadoop installation and then it will find the name node and run from there.

Sqoop uses MapReduce framework to import and export the data, which provides parallel mechanism as well as fault tolerance. Sqoop makes developers life easy by providing command line interface. Developers just need to provide basic information like source, destination and database authentication details in the sqoop command. Sqoop takes care of remaining part.

Sqoop provides many salient features like:

  1. Full Load
  2. Incremental Load
  3. Parallel import/export
  4. Import results of SQL query
  5. Connectors for all major RDBMS Databases
  6. Kerberos Security Integration
  7. Load data directly into Hive/Hbase

What are Connectors?

Scoop has connectors, which is a pluggable component that uses extension framework to enable scoop to import or export the data between Hadoop and Data stores. The most basic connector that ships with Sqoop is Generic JDBC Connector, and as the name suggests, it uses only the JDBC interface for accessing metadata and transferring dataAvailable connectors include Oracle, DB2, MYSql, PostgresSQL, Teradata, JDBC.

Scoop Architecture


How is Sqoop used?What all can we import/export ??

Scoop can be used to import/export :

  1. Entire table
  2. Part of table or just data using Where clause
  3. all tables of a Database

or we can use Scoop’s few commands like Eval (Evaluate), Options-File (convert your file command into Scoop commands) , all-Databases, all-tables and many more.

Scoop reads the table row by row into HDFS. The output of this Import table process is set of files containing copy of imported table. Since import is a parallel process, hence output will be many files.These files may be delimited text files, or binary Avro or sequence files.It tries to fetch metadata from db table & calculates the max and min values of Primary key of tables to identify the data range (Amount of Data). This value helps Scoop to divide the load between mappers. Generally it uses 4 mappers & no reducers.

Scoop is built on Map-reduce logic & uses JDBC API’s to create Java/class files to process this metadata and at end create a JAR  file of it. So once the import is complete you will see 3 files created. For example:, employee.class and employee.jar

Let’s learn how to use Scoop to import tables. Lets assume you have a mysql database (RDBMS) and you are trying to import a Employee table from it into HDFS.

Command Syntax:

sqoop import –connect jdbc:mysql://localhost/databasename –username $USER_NAME –password $PASSWORD$ –table tablename –m 1


$ sqoop import –connect jdbc:mysql://localhost/scoop_db –username scp –password scp123 –table employee –m 1

Here we specify the:

  • database path (localhost)
  • database name (scoop_db)
  • connection protocol (jdbc:mysql:)
  • username (scp)
  • password  (There are many ways to provide the password like on command line, store it in a file and call it etc.)
  • Always use ‘- -‘ for all sub commands like CONNECT, USERNAME, PASSWORD
  • Use ‘-‘ for Generic commands like FILE

To verify the imported data in HDFS, use the following command

(syntax from internet).
$ $HADOOP_HOME/bin/hadoop fs -cat /employee/part-m-*

It will show you fields and data with comma separated.

Now lets see various syntax and examples:

  • Import an entire table:

sqoop import –connect jdbc:mysql://localhost/abc –table EMPLOYEE

  • Import a subset of the columns from a table:

sqoop import –connect jdbc:mysql://localhost/abc–table EMPLOYEES –columns   “employee_id,first_name,age,designation”

  • Import only the few records by specifying them with a WHERE clause

sqoop import –connect jdbc:mysql://localhost/abc –table EMPLOYEES  –where “designion=’ADVISOR’ “

  • If table has primary key defined, we can set Parallelism to command by explicitly set the number of mappers using --num-mappers. Sqoop evenly splits the primary key range of the source table, as mentioned above.

sqoop import –connect jdbc:mysql://localhost/abc –table EMPLOYEES –num-mappers 6

  • If there is not primary key defined in the table, the data import must be sequential. Specify a single mapper by using –num-mappers 1 or  give ‘-m 1′ option for import.Otherwise it gives error

 sqoop import –connect jdbc:mysql://localhost/db –username $USER_NAME –password $PASSWORD$ –table tablename –m 1


  • To try a sample query without importing data, use the eval option to print the results to the command prompt:

sqoop eval –connect jdbc:mysql://localhost/abc –query “SELECT * FROM employees LIMIT 10”


Follow for more…Read next article on What is Spark ?

Want to learn how HDFS works  …read here

Want to learn Hadoop Installation…click here !!

Hadoop Ecosystem Components contd…(Tutorial Day 6)

So continuing the old post, here we will discuss some more components of Hadoop ecosystem.

Data Integration or ETL Components of Hadoop Ecosystem

Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract bulk data from non-Hadoop  or relational databases (like MySQL, Oracle,Teradata, Postgre) , transform the data into a form usable by Hadoop, and then load the data into HDFS, Hbase or Hive also. This process is similar to Extract, Transform, and Load.It parallelizes data transfer for fast performance, copies data quickly from external system to Hadoop & makes data analysis more efficient.

It’s batch oriented and not suitable for low latency interactive queries. It provides a scalable processing environment for both structured and non-structured data.

Sqoop Import

The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.

Sqoop Export

The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called as rows in table. Those are read and parsed into a set of records and delimited with user-specified delimiter.

Sqoop Use Case- , Apollo Group uses Sqoop component of the Hadoop ecosystem to enable transmission of data between Hadoop & data warehouse .
  • Flume-

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming or log files data into the Hadoop Distributed File System (HDFS).  It is used for collecting data from its origin and sending it back to the resting location (HDFS).Flume accomplishes this by outlining data flows that consist of 3 primary structures channels, sources and sinks. The processes that run the dataflow with flume are known as agents and the bits of data that flow via flume are known as events.

Flume helps to collect data from a variety of sources, like logs, jms, Directory etc.
Multiple flume agents can be configured to collect high volume of data.
It scales horizontally & is stream oriented.It provides high throughput and low latency.It is fault tolerant.

Both Sqoop and Flume, pull the data from the source and push it to the sink. The main difference is Flume is event driven, while Sqoop is not.

Flume Use Case –

Twitter source connects through the streaming API and continuously downloads the tweets (called as events). These tweets are converted into JSON format and sent to the downstream Flume sinks for further analysis of tweets and retweets to engage users on Twitter.
Goibibo uses Flume to transfer logs from production system to HDFS.

Data Storage Component of Hadoop Ecosystem


Hbase is an open source, distributed, sorted map model.Its a column store-based NoSQL database solution & is similar to Google’s BigTable framework.It supports random reads and also batch computations using MapReduce. With HBase NoSQL database enterprise can create large tables with millions of rows and columns on hardware machine. The best practice to use HBase is when there is a requirement for random ‘read or write’ access to big datasets. HBase’s important advantage is that it supports updates on larger tables and faster lookup. The HBase data store supports linear and modular scaling. HBase stores data as a multidimensional map and is distributed. HBase operations are all MapReduce tasks that run in a parallel manner.

Its well integrated with Pig/Hive/Sqoop. It is consistent and partition tolerant system in CAP theorem.

HBase Use Case-

Facebook is one the largest users of HBase with its messaging platform built on top of HBase in 2010.


Apache Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers.This database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.Cassandra’s support for replicating across multiple data-centers is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Use Cases:

For Cassandra, Twitter is an excellent example. We know that, like most sites, user information (screen name, password, email address, etc), is kept for everyone and that those entries are linked to one another to map friends and followers. And, it wouldn’t be Twitter if it weren’t storing tweets, which in addition to the 140 characters of text are also associated with meta-data like timestamp and the unique id that we see in the URLs.

Monitoring, Management and Orchestration Components of Hadoop Ecosystem- Oozie and Zookeeper

  • Oozie-

Oozie is a workflow scheduler where the workflows are expressed as Directed Acyclic Graphs. Oozie runs in a Java servlet container Tomcat and makes use of a database to store all the running workflow instances, their states ad variables along with the workflow definitions to manage Hadoop jobs (MapReduce, Sqoop, Pig and Hive).The workflows in Oozie are executed based on data and time dependencies.

Oozie Use Case:

The American video game publisher Riot Games uses Hadoop and the open source tool Oozie to understand  the player experience.

  • Zookeeper-

Zookeeper is the king of coordination and provides simple, fast, reliable and ordered operational services for a Hadoop cluster. Zookeeper is responsible for synchronization service, distributed configuration service and for providing a naming registry for distributed systems.

Zookeeper Use Case-

Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high priority notifications and discovery. The entire service of Found built up of various systems that read and write to   Zookeeper.

Here is the recorded session from the IBM Certified Hadoop Developer Course at DeZyre about the components of Hadoop Ecosystem –
Several other common Hadoop ecosystem components include: Avro, Cassandra, Chukwa, Mahout, HCatalog, Ambari and Hama. By implementing Hadoop using one or more of the Hadoop ecosystem components, users can personalize their big data experience to meet the changing business requirements. The demand for big data analytics will make the elephant stay in the big data room for quite some time.

Data Serialisation (Data Interchange Protocols)

AVRO: Apache Avro is a language-neutral data serialization system, developed by  Apache Hadoop.Data serialization is a mechanism to translate data in computer environment (like memory buffer, data structures or object state) into binary or textual form that can be transported over network or stored in some persistent storage media.Java and Hadoop provides serialization APIs, which are java based, but Avro is not only language independent but also it is schema-based.Once the data is transported over network or retrieved from the persistent storage, it needs to be deserialized again. Serialization is termed as marshalling and deserialization is termed as unmarshalling.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.Avro has a schema-based system. A language-independent schema is associated with its read and write operations.

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section.Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

To serialize Hadoop data, there are two ways −

  • You can use the Writable classes, provided by Hadoop’s native library.
  • You can also use Sequence Files which store the data in binary format.

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API and they cannot be written or read in any other language.

Therefore any of the files created in Hadoop with above two mechanisms cannot be read by any other third language, which makes Hadoop as a limited box. To address this drawback, Doug Cutting created Avro, which is a language independent data structure.

Use Case:

Content credit :

Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.


 Thrift :

Thrift is a lightweight, language-independent software stack with an associated code generation mechanism for RPC. Thrift provides clean abstractions for data transport, data serialization, and application level processing. Thrift was originally developed by Facebook and now it is open sourced as an Apache project. Apache Thrift is a set of code-generation tools that allows developers to build RPC clients and servers by just defining the data types and service interfaces in a simple definition file. Given this file as an input, code is generated to build RPC clients and servers that communicate seamlessly across programming languages.

Thrift supports a variety of languages including C++, Java, Python, PHP, Ruby.

To learn more on Hadoop…keep on reading these tutorials…every day we try to get something new and interesting for all my readers !!

Now we will start learning all Ecosystem components in more detail. Click here to read about how MapReduce Algorithm works with an easy example.

%d bloggers like this: