A Big Data Sqoop of Hadoop

Posted January 26, 2017, in Big Data, Development.

This week I’m working with one of our partners on migrating some data from Microsoft SQL Server to Hadoop. Sqoop is the tool of choice because it can migrate the data and maintain data types inside of Hive. Sqoop is a very straightforward tool, but there were some serious gotchas, including driver issues and unsupported data types.

For those who think I am writing gibberish at this point, I have dedicated an entire post to defining some jargon. Once you know the jargon, the rest of this post will make more sense.

Driver Issues

By default, Sqoop does not include the JDBC driver for Microsoft SQL Server. Since I was using HDP 2.5, I needed the Java 7 JDBC jar file for SQL Server, so I grabbed sqljdbc_6.0.8112.100_enu.tar.gz. The archive contains the JDBC driver, which needs to be placed in the Sqoop /lib directory on your Hadoop server. Although this should be straightforward, Linux issues and security issues meant there were some fun hoops (not hadoops) to jump through.

In my case, I had to download a tar.gz file from Microsoft, place it on a web server, then use wget from the Linux command prompt to pull it from that server. For some reason, I couldn’t get it directly from Microsoft.com using wget.

Once I had it on the Linux box, I used the tar command (tar -xvzf sqljdbc_6.0.8112.100_enu.tar.gz) to extract everything (thanks to this article), then pulled the sqljdbc41.jar file out of the JRE7 directory and placed it in the Sqoop /lib directory.
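The extract-and-copy steps look roughly like this. The Sqoop /lib path and archive layout are assumptions based on the usual HDP install, so verify them on your own cluster; a mock archive stands in for the real Microsoft tarball here so the steps are runnable anywhere:

```shell
# Mock stand-in for sqljdbc_6.0.8112.100_enu.tar.gz so this sketch runs anywhere;
# in practice you would wget the real archive from your staging web server.
mkdir -p sqljdbc_6.0/enu/jre7
touch sqljdbc_6.0/enu/jre7/sqljdbc41.jar
tar -czf sqljdbc_6.0.8112.100_enu.tar.gz sqljdbc_6.0

# The actual steps: extract the tarball, then copy the Java 7 driver jar
# into Sqoop's lib directory (path is an assumption -- check your cluster,
# e.g. /usr/hdp/current/sqoop-client/lib on HDP).
SQOOP_LIB=./sqoop-lib
mkdir -p "$SQOOP_LIB"
tar -xzf sqljdbc_6.0.8112.100_enu.tar.gz
cp sqljdbc_6.0/enu/jre7/sqljdbc41.jar "$SQOOP_LIB/"
ls "$SQOOP_LIB"
```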

Data Type Issues

Once I had the connection working properly, I decided to take Sqoop for a little test run. I created a table in MSSQL that contained one column of every data type and then tried to pull the data into Hadoop. I used the Sqoop options to import the data into Hive so that I could maintain the source data types.
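A test-run import along those lines might look like the following. The server, database, table, and credential paths are placeholders, not details from my environment, and this sketch assumes it runs on a node with Sqoop and the JDBC driver installed:

```shell
# Hypothetical connection details -- replace with your own.
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=TestDB" \
  --username sqoop --password-file /user/sqoop/.pw \
  --table TypeTest \
  --hive-import --create-hive-table \
  --hive-table default.typetest \
  -m 1
```

The --hive-import and --create-hive-table options are what tell Sqoop to generate a Hive table definition from the source metadata rather than just dropping delimited files into HDFS.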

I knew that not all data types were supported, but didn’t see a good reference for what types were supported. Lucky for you, I created this table.

| SQL Data Type | Hive Data Type | Reduced Precision | Unsupported |
|---|---|---|---|
| BINARY | | | X |
| BIGINT | BIGINT | | |
| BIT | BOOLEAN | | |
| CHAR | STRING | | |
| DATE | STRING | X | |
| DATETIME | STRING | X | |
| DATETIME2 | STRING | X | |
| DATETIMEOFFSET | | | X |
| DECIMAL | DOUBLE | X | |
| FLOAT | DOUBLE | | |
| GEOGRAPHY | | | X |
| GEOMETRY | | | X |
| HIERARCHYID | | | X |
| IMAGE | | | X |
| INT | INT | | |
| MONEY | DOUBLE | X | |
| NCHAR | STRING | | |
| NTEXT | STRING | | |
| NUMERIC | DOUBLE | X | |
| NVARCHAR | STRING | | |
| REAL | DOUBLE | | |
| SMALLDATETIME | STRING | X | |
| SMALLINT | INT | | |
| SMALLMONEY | DOUBLE | X | |
| TEXT | STRING | | |
| TIME | STRING | X | |
| TINYINT | TINYINT | | |
| UNIQUEIDENTIFIER | STRING | | |
| VARBINARY | | | X |
| SQL_VARIANT | | | X |
| XML | STRING | | |
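For the problem types, Sqoop gives you a couple of escape hatches: --columns lets you leave unsupported columns out of the import entirely, and --map-column-hive lets you override the default Hive type for a column. A hedged sketch, with hypothetical table and column names:

```shell
# Hypothetical table/columns. BinaryBlob (an unsupported VARBINARY column)
# is simply excluded via --columns, and Amount (MONEY) is forced to STRING
# instead of Sqoop's default DOUBLE to sidestep the precision loss.
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=TestDB" \
  --username sqoop --password-file /user/sqoop/.pw \
  --table Orders \
  --columns "OrderID,OrderDate,Amount" \
  --map-column-hive "Amount=STRING" \
  --hive-import -m 1
```

Mapping money or decimal columns to STRING keeps every digit intact; you can always cast them back on the Hive side when you need arithmetic.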

Finding Problem Data Types

So the next obvious question is: how do I find the tables with columns that will cause me trouble? Thankfully, it is pretty straightforward to run a little SQL statement inside SQL Server Management Studio to find the tables and columns you need to worry about.

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, 'UnsupportedType' AS 'Issue'
FROM [Master Controls].INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('sql_variant', 'Binary', 'DateTimeOffset', 'Geography', 'Geometry', 'Hierarchyid', 'Image', 'VarBinary')
UNION
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, 'ReducedPrecision' AS 'Issue'
FROM [Master Controls].INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('Date', 'DateTime', 'DateTime2', 'Decimal', 'Money', 'Numeric', 'SmallDateTime', 'SmallMoney', 'Time')
