Learn Sqoop..!! (tutorial Day 8)

As part of the Hadoop ecosystem, Sqoop is an important interaction tool. When Big Data stores and analyzers such as MapReduce, Hive, HBase, Cassandra and Pig came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Sqoop provides that bridge between relational database servers and Hadoop's HDFS.

What is Sqoop?

Sqoop is an open-source tool designed to transfer data between Hadoop (HDFS, Hive or HBase) and relational database servers. It is used to import data from structured data stores or relational databases such as MySQL and Oracle into Hadoop-related ecosystems like HDFS, Hive or HBase, and to export data from the Hadoop file system back to relational databases and enterprise data warehouses. Out of the box, Sqoop works with structured, relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, etc. If your database doesn't fall into this category, you can still use Sqoop through its extension framework: connectors. You can find connectors online and modify their code, or write your own using the framework. Generally, JDBC connectivity is available for most databases, so this resolves the issue.

Sqoop: “SQL to Hadoop and Hadoop to SQL”

Why is Sqoop used?

Sqoop doesn't have a server of its own; it is a client-side library. So it doesn't matter whether you run it from a data node or from anywhere else: it will find the Hadoop installation locally, or you can point it to a Hadoop installation and it will locate the NameNode and run from there.

Sqoop uses the MapReduce framework to import and export data, which provides parallelism as well as fault tolerance. Sqoop makes developers' lives easy by providing a command-line interface: developers just need to provide basic information like the source, the destination and the database authentication details in the Sqoop command, and Sqoop takes care of the rest.
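For example, a minimal import command only needs the connection string, credentials, the source table and an HDFS destination. A hedged sketch (the host, database, table and target directory below are placeholders, not from this tutorial):

$ sqoop import --connect jdbc:mysql://dbhost/sales --username scp -P --table orders --target-dir /user/hadoop/orders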

Sqoop provides many salient features, such as the following (a couple of them are illustrated right after the list):

  1. Full Load
  2. Incremental Load
  3. Parallel import/export
  4. Import results of SQL query
  5. Connectors for all major RDBMS Databases
  6. Kerberos Security Integration
  7. Load data directly into Hive/HBase
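As a hedged illustration of the incremental-load and Hive-load features above (the check column, last value and Hive table name are placeholders, not taken from this article), the commands might look like this:

$ sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp -P --table employee --incremental append --check-column employee_id --last-value 1000

$ sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp -P --table employee --hive-import --hive-table employee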

What are Connectors?

Sqoop has connectors, which are pluggable components that use the extension framework to enable Sqoop to import or export data between Hadoop and data stores. The most basic connector that ships with Sqoop is the Generic JDBC Connector; as the name suggests, it uses only the JDBC interface for accessing metadata and transferring data. Available connectors include Oracle, DB2, MySQL, PostgreSQL, Teradata and generic JDBC.
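As a rough sketch of how a connector gets selected: Sqoop normally picks one from the JDBC URL, and you can fall back to the Generic JDBC Connector by naming a driver class explicitly with --driver (the PostgreSQL host, database and table below are placeholders):

$ sqoop import --connect jdbc:postgresql://dbhost/sales --driver org.postgresql.Driver --username scp -P --table orders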

Sqoop Architecture

(Sqoop architecture diagram)

How is Sqoop used? What can we import/export?

Sqoop can be used to import/export:

  1. An entire table
  2. Part of a table, or selected rows using a WHERE clause
  3. All tables of a database

We can also use several of Sqoop's other commands, like eval (evaluate a SQL query), --options-file (read your command's options from a file), list-databases, list-tables, import-all-tables and many more. A few of these are sketched below.
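For illustration, these commands might be invoked as follows (the databases, credentials and options-file path are placeholders):

$ sqoop list-databases --connect jdbc:mysql://localhost --username scp -P
$ sqoop list-tables --connect jdbc:mysql://localhost/scoop_db --username scp -P
$ sqoop import-all-tables --connect jdbc:mysql://localhost/scoop_db --username scp -P
$ sqoop --options-file /home/scp/import_employee.txt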

Sqoop reads the table row by row into HDFS. The output of this import process is a set of files containing a copy of the imported table. Since the import is a parallel process, the output will be spread across several files. These files may be delimited text files, or binary Avro or SequenceFiles. Sqoop first fetches metadata from the database table and calculates the minimum and maximum values of the table's primary key to identify the data range (the amount of data). This range helps Sqoop divide the load between mappers. By default it uses 4 mappers and no reducers.
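If you want to control how that range is split, for example when the primary key is missing or skewed, you can name the split column and the mapper count yourself. A minimal sketch (the column name is a placeholder; --split-by and --num-mappers are standard Sqoop options):

$ sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp -P --table employee --split-by employee_id --num-mappers 4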

Sqoop is built on MapReduce logic and uses the JDBC APIs to generate a Java class for the table, compile it into a .class file, and finally package it into a JAR, which is used to process the records. So once the import is complete you will see 3 files created, for example: Employee.java, Employee.class and Employee.jar.
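You can also generate these files without running an import, using Sqoop's codegen tool. A hedged sketch (the output directories are placeholders):

$ sqoop codegen --connect jdbc:mysql://localhost/scoop_db --username scp -P --table employee --outdir /tmp/sqoop-src --bindir /tmp/sqoop-bin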

Let's learn how to use Sqoop to import tables. Let's assume you have a MySQL database (RDBMS) and you are trying to import an Employee table from it into HDFS.

Command Syntax:

sqoop import --connect jdbc:mysql://localhost/databasename --username $USER_NAME --password $PASSWORD --table tablename -m 1

Example:

$ sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp --password scp123 --table employee -m 1

Here we specify the:

  • database host (localhost)
  • database name (scoop_db)
  • connection protocol (jdbc:mysql:)
  • username (scp)
  • password (there are many ways to provide the password: type it on the command line, prompt for it, or store it in a file and reference it; see the sketch after this list)
  • Always use '--' (a double hyphen) for tool options like --connect, --username, --password
  • Use a single '-' for short options like -m
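A quick sketch of the two password alternatives mentioned above: -P prompts for the password interactively, and --password-file reads it from a file (the file path below is a placeholder):

$ sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp -P --table employee -m 1
$ sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp --password-file /user/scp/.mysql.password --table employee -m 1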

To verify the imported data in HDFS, use the following command:
$ $HADOOP_HOME/bin/hadoop fs -cat /employee/part-m-*

It will show you the fields and data, separated by commas.

Now let's look at various syntaxes and examples:

  • Import an entire table:

sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEE

  • Import a subset of the columns from a table:

sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEES --columns "employee_id,first_name,age,designation"

  • Import only a subset of the records by specifying them with a WHERE clause:

sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEES --where "designation='ADVISOR'"

  • If the table has a primary key defined, we can add parallelism to the command by explicitly setting the number of mappers with --num-mappers. Sqoop evenly splits the primary-key range of the source table across the mappers, as mentioned above.

sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEES --num-mappers 6

  • If there is no primary key defined on the table, the data import must be sequential. Specify a single mapper by using --num-mappers 1 or the '-m 1' option for the import; otherwise Sqoop gives an error.

sqoop import --connect jdbc:mysql://localhost/db --username $USER_NAME --password $PASSWORD --table tablename -m 1

 

  • To try a sample query without importing data, use the eval option to print the results to the command prompt:

sqoop eval --connect jdbc:mysql://localhost/abc --query "SELECT * FROM employees LIMIT 10"
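The introduction also mentioned exporting data from Hadoop back into a relational database; a hedged sketch of that direction (the target table must already exist, and the export directory below is a placeholder):

$ sqoop export --connect jdbc:mysql://localhost/abc --username scp -P --table EMPLOYEES --export-dir /user/hadoop/employees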

 

Follow for more… Read the next article on What is Spark?

Want to learn how HDFS works? Read here.

Want to learn Hadoop installation? Click here!
