Loading Data Into Pig

The first step in using Pig is to load data into a program. Pig provides a LOAD statement for this purpose. Its format is: result = LOAD 'filename' USING fn() AS (field1, field2, ...).

This statement returns a bag of values of all the data contained in the named file. Each record in the bag is a tuple, with the fields named by field1, field2, etc. The fn() is a user-provided function that reads in the data. Pig supports user-provided Java code throughout to handle the application-specific bits of parsing. Pig Latin itself is the "glue" that then holds these application-specific functions together, routing records and other data between them.

An example data loading command (taken from this paper on Pig) is:

queries = LOAD 'query_log.txt'
USING myLoad()
AS (userId, queryString, timestamp)

The user-defined functions to load data (e.g., myLoad()) do not need to be provided. A default function for loading data exists, which will parse tab-delimited records. If the programmer did not specify field names in the AS clause, they would be addressed by positional parameters: $0, $1, and so forth.

The default loader is called PigStorage(). This loader can read files containing character-delimited tuple records. These tuples must contain only atomic values; e.g., cat, turtle, fish. Other loaders are listed in the PigBuiltins page of the Pig wiki. PigStorage() takes as an argument the character to use to delimit fields. For example, to load a table of three tab-delimited fields, the following statement can be used:

data = LOAD 'tab_delim_data.txt' USING PigStorage('\t') AS (user, time, query)

A different argument could be passed to PigStorage() to read comma- or space-delimited fields.

2 comments:

  1. Great. But the interesting thing is to write a class to load data, not to just use PigStorage, as every tutorial seems to do...

    ReplyDelete