Being part of the Hadoop ecosystem, Sqoop is an important interaction tool. When the Big Data storage and analysis tools of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Cassandra, and Pig, came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. This is where Sqoop comes in: it provides the interaction between relational database servers and Hadoop's HDFS.
What is Sqoop?
Sqoop is an open source tool designed to transfer data between Hadoop (HDFS, Hive or HBase) and relational database servers. It is used to import data from structured data stores, i.e. relational databases such as MySQL and Oracle, into Hadoop-related ecosystems like Hive, HDFS or HBase, and to export data from the Hadoop file system back to relational databases and enterprise data warehouses. Out of the box, Sqoop works with structured, relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, etc. If your database doesn't fall into this category, you can still use Sqoop through its extension framework: connectors. You can find connectors online and modify their code, or write your own using the framework. In general, JDBC connectivity is available for most databases, so this usually resolves the issue.
Sqoop: “SQL to Hadoop and Hadoop to SQL”
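For a database without a dedicated connector, the plain JDBC route usually looks something like the sketch below; the driver class, host, database and credentials here are placeholders, so check your vendor's JDBC documentation for the real values:
sqoop import --connect jdbc:vendor://dbhost/somedb --driver com.vendor.jdbc.Driver --username someuser --password somepass --table sometable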
Why is Sqoop used?
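A basic import command looks roughly like the sketch below; the table name employee is taken from the verification step further down, and the password value is just a placeholder:
sqoop import --connect jdbc:mysql://localhost/scoop_db --username scp --password somepassword --table employee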
Here we specify the:
- database path (localhost)
- database name (scoop_db)
- connection protocol (jdbc:mysql)
- username (scp)
- password (there are many ways to provide the password, e.g. on the command line, or stored in a file and read from there)
- Always use '--' for Sqoop-specific sub-command arguments like --connect, --username and --password
- Use a single '-' for generic Hadoop arguments like -files
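For instance, generic Hadoop arguments such as -D take a single dash and must come before the Sqoop-specific '--' arguments; the job name property value below is purely illustrative:
sqoop import -D mapreduce.job.name=employee_import --connect jdbc:mysql://localhost/scoop_db --username scp --password somepassword --table employee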
To verify the imported data in HDFS, use the following command:
$ $HADOOP_HOME/bin/hadoop fs -cat /employee/part-m-*
It will show the fields and data separated by commas.
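For an employee table, the output might look something like this (the rows below are purely illustrative):
1,John,35,ADVISOR
2,Priya,29,ANALYST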
Now let's see various syntax options and examples:
- Import an entire table:
sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEE
- Import a subset of the columns from a table:
sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEES --columns "employee_id,first_name,age,designation"
- Import only a few records by specifying them with a WHERE clause:
sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEES --where "designation='ADVISOR'"
- If the table has a primary key defined, we can parallelize the import by explicitly setting the number of mappers with --num-mappers. Sqoop splits the primary key range of the source table evenly across the mappers.
sqoop import --connect jdbc:mysql://localhost/abc --table EMPLOYEES --num-mappers 6
- If there is no primary key defined in the table, the data import must be sequential. Specify a single mapper with --num-mappers 1 or the '-m 1' option for the import; otherwise the import fails with an error.
sqoop import --connect jdbc:mysql://localhost/db --username $USER_NAME --password $PASSWORD --table tablename -m 1
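Alternatively, if the table has a column with reasonably well-distributed values, --split-by lets Sqoop parallelize the import even without a primary key; the column name below is just an example:
sqoop import --connect jdbc:mysql://localhost/db --username $USER_NAME --password $PASSWORD --table tablename --split-by employee_id --num-mappers 4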
- To try a sample query without importing data, use the eval option to print the results to the command prompt:
sqoop eval --connect jdbc:mysql://localhost/abc --query "SELECT * FROM employees LIMIT 10"
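- Export data from HDFS back into a relational table. A rough sketch: the export directory is assumed to be where an earlier import or MapReduce job wrote its output, and the target table must already exist in the database:
sqoop export --connect jdbc:mysql://localhost/abc --table EMPLOYEES --export-dir /employee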
Follow for more… Read the next article on What is Spark?
Want to learn how HDFS works? Read here.
Want to learn about Hadoop installation? Click here!