Big Data and Hadoop Jargon

I was starting to write a post on using Sqoop with Hadoop and realized that, to a normal person, it would sound like gibberish. So I decided to define as much of the jargon as possible in one place.

Defining Big Data Jargon

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets (Big Data) in a distributed computing environment. It is an Apache project, developed under the Apache Software Foundation.

Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Data migration is the process of transferring data between data storage systems, data formats or computer systems. A data migration project is usually undertaken to replace or upgrade servers or storage equipment, for a website consolidation, to conduct server maintenance or to relocate a data center.

A Data Type specifies the kind of data that an object (such as a database column or a variable) can hold: integer data, character data, monetary data, date and time data, binary strings, and so on.

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into the Hadoop Distributed File System (HDFS), and to export data from HDFS back to relational databases.
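To make the import direction a little more concrete, here is a minimal sketch in Python that only assembles a `sqoop import` command line and prints it; it does not run Sqoop. The flags shown (`--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import options, but the host, database, user, table, and HDFS path are hypothetical placeholders, and password handling is left out entirely.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, target_dir, num_mappers=1):
    """Return the argument list for a `sqoop import` command (nothing is executed)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,             # database user
        "--table", table,                   # relational table to copy
        "--target-dir", target_dir,         # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel import tasks
    ]


# All values below are made up for illustration.
command = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    username="report_user",
    table="orders",
    target_dir="/user/hadoop/orders",
)

# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

An export would go the other way, reading files from HDFS and writing rows into an existing database table.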

Hive is a data warehouse system built on top of Hadoop. It has three main functions: data summarization, query, and analysis. Queries are written in an SQL-like language called HiveQL, and Hive automatically translates them into MapReduce jobs executed on Hadoop. HiveQL also allows custom MapReduce scripts to be plugged into queries.

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
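The idea behind MapReduce is easier to see on a toy example. The sketch below is plain Python running on one machine, not Hadoop code, and the function names are my own. It counts words in three steps that mirror what the framework does across a cluster: map each line to (word, 1) pairs, group the pairs by word, then reduce each group to a total.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


# Two tiny "input splits"; on a real cluster these would be blocks of a large file.
lines = ["big data is big", "hadoop stores big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

On a real cluster, the map and reduce steps run on many nodes at once, and the framework handles the grouping, data movement, and retries when a node fails.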