
Hadoop Ecosystem & Architecture (Tutorial Day 4)

As we discussed in the last blog, Big Data is not just Hadoop. Similarly, Hadoop is not one monolithic thing but an ecosystem made up of various Hadoop components and an amalgamation of different technologies, such as HDFS (Hadoop Distributed File System), MapReduce, Pig, Hive, HBase, Flume and so on.

Hadoop Ecosystem

The Hadoop platform consists of many tools, but its two key services are the Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce.

Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM, and Amazon.

The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. We can combine various Hadoop ecosystem tools to serve business requirements in a cost-effective fashion.
The image below describes the Hadoop ecosystem.

[Image: Hadoop ecosystem]

In this view of the Hadoop ecosystem, prominence is given to the Hadoop core components (Hadoop Common, YARN, HDFS and MapReduce), which we will discuss first.

1) Hadoop Common refers to the collection of common utilities, libraries, necessary Java files and scripts that support the other Hadoop modules. It is an essential module of the Apache Hadoop framework.

2) Hadoop YARN is a cluster management framework that manages resources and schedules tasks. It is a great enabler of dynamic resource utilization on the Hadoop framework, as users can run various Hadoop applications without having to worry about increasing workloads.

3) HDFS is a distributed file system that runs on standard or low-end hardware. Developed as part of the Apache Hadoop project, HDFS works like a standard distributed file system but provides high data throughput for parallel processing engines such as MapReduce, high fault tolerance and native support for large data sets.

HDFS comprises three important components: the NameNode, the DataNode and the Secondary NameNode. HDFS operates on a master-slave architecture model in which the NameNode acts as the master node that keeps track of the storage cluster, and the DataNodes act as slave nodes running on the various machines within a Hadoop cluster.

It provides data reliability by replicating each block of data as three copies: two in one group (rack) and one in another. If a copy is lost to a failure, it can be re-created from the remaining copies.

The default replication count is 3:
• 1st replica on the local rack
• 2nd replica on the local rack but on a different machine
• 3rd replica on a different rack
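
The placement rule listed above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not HDFS's actual implementation, and the exact policy can differ between Hadoop versions. The function name place_replicas, the rack/node names and the choice to put the first replica on the writer's own machine are all assumptions made for the example.

```python
def place_replicas(writer_node, nodes_by_rack):
    """Choose three nodes: two on the writer's rack, one on another rack.

    nodes_by_rack is a hypothetical topology map: rack name -> list of nodes.
    """
    # Find the rack that contains the node writing the data.
    local_rack = next(
        rack for rack, nodes in nodes_by_rack.items() if writer_node in nodes
    )
    # 1st replica: on the local rack (here, assumed to be the writer itself).
    first = writer_node
    # 2nd replica: same rack, but a different machine.
    second = next(n for n in nodes_by_rack[local_rack] if n != writer_node)
    # 3rd replica: any machine on a different rack.
    remote_rack = next(rack for rack in nodes_by_rack if rack != local_rack)
    third = nodes_by_rack[remote_rack][0]
    return [first, second, third]
```

With a topology of two racks holding two machines each, a block written from a machine on the first rack ends up on that machine, on its rack neighbour, and on one machine of the second rack, so losing a whole rack still leaves at least one copy.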

The HDFS architecture consists of clusters, each of which is accessed through a single NameNode, a software service installed on a separate machine to monitor and manage that cluster’s file system and user access mechanism. Each of the other machines runs one instance of DataNode to manage the cluster's storage.
Because HDFS is written in Java, it has native support for Java application programming interfaces (APIs) for application integration and accessibility. It can also be browsed through a standard web browser.

[Image: HDFS architecture]

Namenode

The NameNode runs on commodity hardware and stores the metadata, such as file names and paths, the locations of each file's blocks and their replicas, and which DataNodes hold them. The system running the NameNode acts as the master server, and in this architecture there can be only one active NameNode. You can also set up a Secondary NameNode, but despite its name it is not a standby replica: it periodically merges the NameNode's edit log into the file system image (a checkpoint). The NameNode does the following tasks:

  • Manages the file system namespace.
  • Regulates clients’ access to files.
  • Executes file system operations such as renaming, closing, and opening files and directories.

Datanode

A DataNode stores and manages the data held in HDFS. A functional file system has more than one DataNode, with data blocks replicated across them. On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for file system operations.

Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations delegated to a TaskTracker instance near a DataNode can talk directly to that DataNode to access the files. DataNodes perform the following tasks:

  • DataNodes perform read-write operations on the file system, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
  • DataNode instances can talk to each other, which is what they do when they are replicating data.
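
The read path described above (ask the NameNode where the blocks are, then fetch the data directly from DataNodes) can be illustrated with a small in-memory model. This is a toy sketch, not the HDFS client API: the classes NameNode and DataNode, their methods and the function read_file are hypothetical names invented for this example.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        # path -> ordered list of (block_id, [names of DataNodes with a copy])
        self.block_map = {}

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Client-side read: metadata from the NameNode, bytes from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Talk directly to the first DataNode that holds this block.
        data += datanodes[holders[0]].read(block_id)
    return data
```

Note that in this model the file contents never pass through the NameNode; it only answers the question of where each block is stored, which matches the division of work described in this section.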

Block

A Hadoop block is stored as a file on the underlying file system. Since the underlying file system itself stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system. Blocks are large. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size was 64 MB in Hadoop 1.x and is 128 MB from Hadoop 2.x onward; it can be changed in the HDFS configuration, and most systems run with block sizes of 128 MB or larger.
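
To make the block arithmetic concrete, here is a short sketch of how a file size maps to block sizes. The helper name split_into_blocks is hypothetical, and the sketch assumes that the last block of a file only takes up as much space as the remaining data needs.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The final block holds only the leftover bytes.
        sizes.append(remainder)
    return sizes
```

For example, a 300 MB file with a 128 MB block size gives three blocks of 128 MB, 128 MB and 44 MB, while the same file with a 64 MB block size gives five blocks. With a replication count of 3, each of those blocks is then stored three times across the cluster.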

4) MapReduce is a programming model introduced by Google. It breaks a big data processing job down into smaller tasks, analyzes large data sets in parallel, and then reduces the intermediate results to produce the final output. It is highly scalable, and implementations are available for several programming languages, such as Java, C# and C++.

MapReduce executes in two stages:

  1. Map: The mapper’s job is to process the input data. Generally the input data is a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data (intermediate key-value pairs).
  2. Reduce: This stage is a combination of Shuffle and Reduce. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
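
The two stages above can be illustrated with the classic word-count example. The sketch below runs everything in a single Python process purely to show the data flow (map, then shuffle, then reduce); it is not Hadoop code, and the function names mapper, shuffle, reducer and run_job are chosen for this example only.

```python
from collections import defaultdict


def mapper(line):
    """Map: turn one input line into (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Feed the input line by line to the mapper, then shuffle and reduce."""
    pairs = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in grouped.items())
```

Calling run_job(["big data big", "data hadoop"]) counts "big" twice, "data" twice and "hadoop" once. In a real cluster the mapper calls would run in parallel on different nodes, close to the blocks they read, and the shuffle would move the grouped pairs across the network to the reducers.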

The main advantage of the MapReduce framework is its fault tolerance: each node in the cluster is expected to report periodically to the master node as it works.
If the master node notices that a node has been silent for a longer interval than expected, it reassigns that node's frozen or delayed tasks to another node.
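
The reassignment idea can be sketched as follows. This is a simplified model under stated assumptions, not how Hadoop schedules tasks internally: the function name reassign_silent_tasks, the timestamp dictionaries and the round-robin choice of a replacement node are all hypothetical.

```python
def reassign_silent_tasks(assignments, last_report, now, timeout):
    """Move tasks away from nodes that have been silent for too long.

    assignments: task -> node currently running it
    last_report: node -> time of its most recent report to the master
    """
    alive = [node for node, seen in last_report.items() if now - seen <= timeout]
    if not alive:
        raise RuntimeError("no responsive nodes left to take over tasks")

    new_assignments = {}
    next_index = 0
    for task, node in assignments.items():
        if node in alive:
            new_assignments[task] = node  # node is still reporting, keep it
        else:
            # Hand the task to a responsive node, cycling through them in turn.
            new_assignments[task] = alive[next_index % len(alive)]
            next_index += 1
    return new_assignments
```

For instance, if node n2 last reported 40 seconds ago and the timeout is 10 seconds, every task assigned to n2 is handed to one of the nodes that are still reporting, while tasks on healthy nodes stay where they are.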

As discussed above, several other Hadoop components form an integral part of the Hadoop ecosystem, making Hadoop faster or adding novel features and functionality. To learn more about some of the prominent Hadoop components, please read my next blog.

To learn more about the MapReduce algorithm and how it works, click here.

Pic/content credit: Google and the specific sources mentioned.

What is Big Data? Is it Only Hadoop? (Tutorial Day 1)

Big Data, the new buzzword in today’s technology, is gaining importance because of its high rewards. A systematic and focused approach toward the adoption of Big Data allows one to derive maximum value and utilize the power of Big Data.

It is nothing but a new framework or system for gaining insight into the different forms of existing data, increasing the power of researchers and analysts to get more out of existing systems.

As BG Univ says, “Big data is about the application of new tools to do MORE analytic on MORE data for More people.”

The lifecycle of data can be defined as:

 

People often confuse Big Data and Hadoop as being the same thing. But no, Big Data is not only Hadoop.

Big Data is not a tool or a single technique. It is actually a platform or framework with various components, such as data warehouses (providing OLAP/historical data), real-time data systems and Hadoop (providing insight into structured, semi-structured or unstructured data).

Examples of Big Data include traffic data, flight data, search engine data, etc.

Thus Big Data involves a huge volume, high velocity and an extensible variety of data. The data will be of three types:

a) Structured data: Relational data.
b) Semi Structured data: XML data.
c) Unstructured data: Word, PDF, Text, Media Logs.

Big Data can be characterized by 3 V’s:

1) Velocity -> Batch processing data, real time
2) Variety -> Structured, semi-structured, unstructured and polymorphic data
3) Volume -> Terabytes to Petabytes

Big Data puts existing traditional systems in trouble for many reasons: as the data grows, its complexity, security requirements, maintenance effort and processing time grow too. Big Data brings distributed processing systems into the picture, which use multiple systems and disks for parallel processing.

There are various tools and technologies in the market from different vendors, including IBM, Microsoft, etc., to handle big data. A few of them are:

1) NoSQL Big Data systems are designed to provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. They allow massive computations to be run inexpensively and efficiently, which makes operational big data workloads much easier to manage, cheaper and faster to implement. An example is MongoDB.
2) MPP and MapReduce systems provide analytical capabilities for complex analysis over large amounts of data. Hadoop, Hive, Pig and Impala are based on them.
3) Storage (HDFS, i.e. the Hadoop Distributed File System)
4) Servers (Google App Engine)

There are major challenges with Big Data.

Read the Day 2 tutorial to understand further, and bookmark this page for future reference.