
What is HDFS? (Tutorial Day 9)

HDFS stands for Hadoop Distributed File System. It is a fault-tolerant file system designed to run on commodity hardware. It uses a master/slave architecture, in which one device (the master) controls one or more other devices (the slaves). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files.

The NameNode is the centerpiece, or master node, of HDFS. It stores only the metadata of HDFS and no actual data, meaning the directory tree of all files in the file system, and it tracks those files across the cluster. The data itself is stored in the DataNodes. The NN knows the list of blocks, and their locations, for any given file in HDFS; with this information it knows how to reconstruct the file from its blocks. The NN is so critical that if it is down or faulty, the HDFS/Hadoop cluster is inaccessible: the NN is a single point of failure in a Hadoop cluster. The NameNode is usually configured with a lot of memory (RAM). We generally have a Secondary NN to recover from this kind of failure, but in Hadoop 1.x it has to be started manually.

The DataNodes are known as slave nodes and are responsible for storing the actual data in HDFS. The NN and DNs are in constant communication. When a DataNode starts up, it announces itself to the NN along with the list of blocks it is responsible for. When a DataNode goes down, it does not affect the availability of data or the cluster: the NN arranges for re-replication of the blocks managed by the DN that is no longer available. A DataNode is usually configured with a lot of hard disk space, because the actual data is stored there.

More features of HDFS:

  • Single master node, along with a secondary NameNode (in Hadoop 1.x the secondary NN has to be started manually, whereas Hadoop 2.x adds automated failure recovery using a standby NameNode)
  • Multiple DataNodes in the cluster (also called slave daemons)
  • Every block has a fixed size
  • Example: we have an NYSE data file of 39 MB. If the block size is defined as 27 MB, then 2 blocks will be created in HDFS: one of 27 MB and one of 12 MB (this arithmetic is sketched just below)
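
To make this arithmetic concrete, here is a tiny sketch in Scala (the same language used for the Spark examples later in this post). The 39 MB and 27 MB figures are just the numbers from the example above, not values read from a real cluster.

    // Hypothetical illustration of how HDFS splits a file into fixed-size blocks.
    val fileSizeMb  = 39L   // size of the NYSE data file in the example
    val blockSizeMb = 27L   // block size assumed in the example
    val fullBlocks  = fileSizeMb / blockSizeMb                   // 1 full block of 27 MB
    val lastBlockMb = fileSizeMb % blockSizeMb                   // a trailing block of 12 MB
    val totalBlocks = fullBlocks + (if (lastBlockMb > 0) 1 else 0)
    println(s"$totalBlocks blocks: $fullBlocks of $blockSizeMb MB and one of $lastBlockMb MB")
    // prints: 2 blocks: 1 of 27 MB and one of 12 MB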

 

HDFS Architecture 

 

[Figure: HDFS architecture diagram. Image credit: Google]
  • The client is an application running on our machine that is used to interact with the NN, DN, JobTracker, etc. It handles user interaction and is called the HDFS client.
  • A Hadoop cluster is a collection of racks. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks. In other words, a rack is the physical storage unit of HDFS.
  • The client interacts with the NN using Hadoop's RPC protocol over TCP, not HTTP.
  • To maintain fault tolerance, HDFS replicates data. The default replication factor is 3; it can be changed by the admin, but some redundancy is necessary.
  • When a file is written to HDFS, it is split up into big chunks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (the default is 64 MB). All blocks in a file except the last block are the same size. Each block is stored on one or more DataNodes, controlled by the parameter dfs.replication in the same file (in most of this post set to 3, which is the default). Each copy of a block is called a replica. The blocks of a file are replicated for fault tolerance. Files in HDFS are write-once and have strictly one writer at any time. (A small client-side sketch of inspecting block size, replication and block locations follows this list.)
  • Since there are 3 replicas, when we submit MapReduce programs, the calculations are done on the original data. The master node knows exactly which node has that particular data. If that node does not respond, it is assumed to have failed, and only then is the required calculation done on a replica.
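
As a hedged, client-side illustration of the points above, the following Scala sketch uses Hadoop's FileSystem API to ask the NameNode for a file's block size, replication factor and block locations. It assumes the Hadoop client libraries and cluster config files are on the classpath, and the path /data/nyse.txt is made up for the example.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Connect to the cluster described by core-site.xml / hdfs-site.xml.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    val path   = new Path("/data/nyse.txt")   // hypothetical file
    val status = fs.getFileStatus(path)
    println(s"Block size:  ${status.getBlockSize} bytes")
    println(s"Replication: ${status.getReplication}")

    // Ask the NameNode which DataNodes hold each block of the file.
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(", ")}")
    }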

Process to read/write a file into HDFS

When data is written to an HDFS file, it is first written to a local cache on the client. When the cache reaches a block size (default 64 MB), the client requests a list of DNs from the NN. This list contains all the DNs that will host a replica of that block; the replication factor defaults to 3. The client then organizes a pipeline from DN to DN and flushes the data block to the first DN (as shown in the image below). The first DataNode starts receiving the data in small portions (file system block size, 4 KB), writes each portion to its local repository and transfers the same portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes the same portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DN can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. When the first block is filled, the client requests new DNs to host replicas of the next block. A new pipeline is organized, and the client sends the remaining bytes of the file. This flow continues until the last block of the file.

[Figure: HDFS write pipeline. Image credit: Google]
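
From the client's point of view, the whole pipeline described above is hidden behind the HDFS output stream. As a rough sketch (again Scala over Hadoop's FileSystem API, with a made-up destination path), a write with an explicit replication factor and block size looks like this; the DN-to-DN pipelining happens inside the stream returned by fs.create:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Create a file with 3 replicas per block and a 64 MB block size (the defaults discussed above).
    val out = fs.create(
      new Path("/data/example.txt"),   // hypothetical destination
      true,                            // overwrite if it already exists
      4096,                            // client-side buffer size in bytes
      3.toShort,                       // replicas per block
      64L * 1024 * 1024                // block size in bytes
    )
    out.write("hello hdfs\n".getBytes("UTF-8"))
    out.close()                        // close() flushes the final block down the pipeline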

Advantages of HDFS

  • Suitable only for applications with large datasets; if you have small datasets, it is too expensive a way to store them
  • Inexpensive, as the filesystem relies on commodity storage disks that are much less expensive than the storage media used for enterprise-grade storage
  • Highly Fault Tolerant
  • High throughput (the rate at which data can be accessed and read; see the read/write flow described above)

HDFS is optimized for MapReduce workloads. It provides very high performance for sequential reads and writes, which is the typical access pattern in MapReduce jobs.

Learn more about Hadoop installation here!!

 

Follow for more interesting Big Data stuff to read !!

 


What are Resilient Distributed Datasets (RDDs)? (Day 3)

Spark’s primary core abstraction is called the Resilient Distributed Dataset, or RDD. It is designed to support in-memory data storage, distributed across a cluster in a manner that is resilient, fault-tolerant and efficient. RDDs are resilient because they rely on a lineage graph: whenever there is a failure in the system, they can recompute themselves using that prior information. Similarly, fault tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes.

In layman’s terms, an RDD is a representation of the data coming into your system in an object format, and it allows you to do computation on it.

Spark RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. We can also define an RDD simply as a distributed collection of elements that is parallelized across the cluster. Once data is loaded into an RDD, two basic types of operations can be carried out: transformations and actions.

Transformations are operations that do not return a value. In fact, nothing is evaluated when these transformation statements are defined; a transformation just creates a new RDD by changing the original through processes such as mapping, filtering, and more. The fault-tolerance aspect of RDDs allows Spark to replay the transformations recorded in the lineage to recover lost data.

Actions are when the transformations get evaluated, together with the action that is called for that RDD. Actions return values. For example, you can do a count on an RDD to get the number of elements within it, and that value is returned.

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged and can be repeated in the event of data loss or the failure of a cluster node. RDDs can be persisted in order to cache a dataset in memory across operations. This allows future actions to be much faster, by as much as ten times.
Spark’s cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the original transformations. For example, say a node goes offline; all Spark needs to do when the node comes back online is re-evaluate the lineage graph from where it left off. Caching is provided with Spark to enable processing to happen in memory; if the data does not fit in memory, it will spill to disk.
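
For example, persisting an RDD is a single call. The short sketch below assumes a SparkContext named sc (as in the spark-shell) and a hypothetical HDFS path; it keeps the data in memory and lets it spill to disk when it does not fit:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs:///data/logs.txt")      // hypothetical input file
    // logs.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)
    logs.persist(StorageLevel.MEMORY_AND_DISK)           // spill to disk if it does not fit in memory

    logs.count()   // the first action materializes and caches the partitions
    logs.count()   // later actions reuse the cached data and run much faster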
An interesting thing about Spark is its lazy evaluation. RDDs are not loaded into the system as soon as they are encountered, but only when an action is to be performed. To understand this concept, let's take an example:
  • We read a text file and load the data into a newly created RDD ‘m’  {scala> val m = sc.textFile("abc.txt")}. This step is interpreted by Spark and a DAG is created that tells it to read data from the file and push it into RDD format. An RDD is made of multiple partitions. By default, the minimum number of partitions in an RDD is two; however, this is customizable and will differ in vendor distributions of Spark. For example, when creating an RDD out of an HDFS file, each block in the file feeds one RDD partition, so a file with 30 unique blocks will create an RDD with 30 partitions. Or, in Cassandra, every 100,000 rows get loaded into one RDD partition, so a Cassandra table with 1 million rows will generate an RDD with 10 partitions.
  • The next step is to display the first item in this RDD: {scala> m.first()}
  • Now let's use the .filter() transformation on the ‘m’ RDD to return a new RDD named ‘linesWithApache’, which will contain a subset of the items in the file (only the ones containing the string "Apache"): {scala> val linesWithApache = m.filter(line => line.contains("Apache"))}
  • Now let's use an action to find the number of lines containing the word "Apache": {scala> linesWithApache.count()}
  • To further see these lines, you can use the .collect() action: {scala> linesWithApache.collect()}
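To see the lineage that Spark has recorded for this RDD (the information it would use to recompute lost partitions), you can print its debug string: {scala> println(linesWithApache.toDebugString)}. The output lists the chain of RDDs, from the filtered RDD back down to the RDD that reads abc.txt.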
Learn how Spark actually works? Click here