HDFS stands for Hadoop Distributed File System. It is designed to provide a fault-tolerant file system designed to run on commodity hardware. It uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). The HDFS cluster consists of a single NameNode and a master server manages the file system namespace and regulates access to files.
Namenode is the centerpiece or master node of HDFS. It only stores the metadata of HDFS and no data – means the directory tree of all files in the file system, and tracks the files across the cluster. The data is actually stored in the DataNodes. NN knows the list of the blocks and its location for any given file in HDFS. With this information it knows how to construct the file from blocks. NN is so critical that if its down or has any fault, HDFS/Hadoop cluster is inaccessible. NN is a single point of failure in Hadoop cluster. NameNode is usually configured with a lot of memory (RAM). We generally have Secondary NN to cover up this kind of failures, but they are manual start-up.
DataNode is known as Slave nodes & is responsible for storing the actual data in HDFS. NN & DN are in constant communication. When a DataNode starts up it announce itself to the NN along with the list of blocks it is responsible for. When a DataNode is down, it does not affect the availability of data or the cluster. NN will arrange for replication for the blocks managed by the DN that is not available. DataNode is usually configured with a lot of hard disk space. Because the actual data is stored in the DataNode.
More features of HDFs:
- Single Master node, along with secondary Namenode (in Hadoop 1.x, secondary NN is manual start up, whereas in Hadoop 2.x we have automated failure recovery using secondary namenode)
- Multiple Data nodes cluster (also called slave deamons)
- Every block has a fixed size
- Example: We have a NYSE data file of 39 mb. So if my datanode has block size defined as 27mb each, then we will get 2 files created in HDFS .One of 27mb and one of 12mb
HDFS Architecture
Image credit: Google
- Client is a application running on our machines which is used to interact with NN and DN, Job tracker etc. It is used for User interaction and is called HDFS client.
- A Hadoop Cluster is a collection of racks. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in rack is greater than bandwidth between two nodes on different racks. In other words, a rack is the hard-disk or storage area of HDFS.
- Client interacts with NN using SSH and not http.
- To maintain fault tolerance on Hadoop system, we maintain replicate data. Minimum no. of replicas required by HDFS is 3. It can configured by admin too, but redundancy is necessary to be done
- When a file is written to HDFS, it is split up into big chucks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (default is 64MB). All blocks in a file except the last block are the same size . Each block is stored on one or more Data nodes, controlled by the parameter dfs.replication in the same file (in most of this post – set to 3, which is the default). Each copy of a block is called a replica.The blocks of a file are replicated for fault tolerance. Files in HDFS are write-once and have strictly one writer at any time.
- Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data. In case, if one of the nodes is not responding, it is assumed to be failed. Only then, the required calculation will be done on the second replica.
Process to read/write file into hdfs
When writing data to an HDFS file, its first written to local cache. When the cache reaches a block size (default 64MB), the client request the list of DN from the NN. This list contains all the DN that will host a replica of that block. The number of DN replication is default to 3. The client then organizes a pipeline from DN-to-DN and flushes the data block to the first DN (as shown in image below). The first DataNode starts receiving the data in small portions (file system block size 4 KB), writes each portion to its local repository and transfers the same portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes the same portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DN can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. When the first block is filled, the client requests new DN to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. This flow continues till last block of the file.
Image credit: google
Advantages of HDFS
- Suitable for applications with large datasets only. If you have small data sets, then its too expensive to store them
- Inexpensive as filesystem relies on commodity storage disks that are much less expensive than the storage media used for enterprise grade storage
- Highly Fault Tolerant
- High throughput ( time taken to access , read data. Similar to example explained below)
HDFS is optimized for MapReduce workloads. It provides very high performance for sequential reads and writes, which is the typical access pattern in MapReduce jobs.
Learn more about Hadoop Installation, here !!
Follow for more interesting Big Data stuff to read !!