Hadoop Ecosystem Components contd…(Tutorial Day 5)

So continuing the old post, vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM, and Amazon. Here we will discuss more components of Hadoop ecosystem.

Data Access Components of Hadoop Ecosystem

  • Pig-

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. Apache Pig is a tool developed by Yahoo for analyzing huge data sets efficiently and easily. The high level data flow language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig’s language layer currently consists of a textual language called Pig Latin.Pig is an open source project under the Apache Software Foundation, so you can learn about it online

Pig Latin is basically used it to construct dataflows, to have a scheduled job to periodically crunch the massive data from HDFS and transfer the summarized data into a relational database for reporting, & ad-hoc analyses. Hive is used for simple ad-hoc analytical queries for the data in HDFS, as Hive queries are a lot faster to write for those types of queries. Its generally used by Yahoo, Twitter etc to process web logs,images,maps etc.

Usage of Apache Pig:

  • Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java, as it uses multi-query approach, thereby reducing the length of codes. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be easily done by typing as less as just 10 LoC in Apache Pig. Ultimately Apache Pig reduces the development time by almost 16 times.
  • Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.
  • Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Pig Use Case-

I am hereby using one of my fav use case of PIG Latin language, you can read here on Slideshare:

Scenario: You have a User data in one file ,website data in another. Now you want to find out the top 5 most visited pages by users of Age (18-25). For this scenario, MAp reduce program is full page length code, but in PIG Latin language its a small easily understandable code.

Code credit: Nick Dimiduk
  • Hive-

Hive is a Data warehouse system layer built on Hadoop. It allows to define a structure for unstructured big data and query the data using a SQL-like language called HiveQL. Its developed by Facebook & makes querying faster through indexing.

Hive Use Case-

Hive simplifies Hadoop at Facebook with the execution of 7500+ Hive jobs daily for Ad-hoc analysis, reporting and machine learning.


Read my next blog on more Hadoop ecosystem components (tutorial Day 6)


One thought on “Hadoop Ecosystem Components contd…(Tutorial Day 5)”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s