A cloud, for our purposes, is a shared-nothing, networked group of
computers that we can use to run some computation in parallel on a
massive dataset.
A typical application is search engine log analysis; see
www.google.com/trends
Example log file (terabytes per day):
query?Volcno Bat
query?Island palm tree
images?volcano
...
Workflow to analyze the log:
*) convert to lower case
*) map to correct spelling
*) aggregate and count
We need a dataflow language to manage each step.
Assume the words of each query form a list.
1) Map to lower case
volcno bat
2) Map to correct spelling
volcano bat
3) Aggregate and count
[1 volcano, 1 bat]
In functional programming, steps 1 and 2 are "map" operations and
step 3 is a "reduce."
Pig is a dataflow language, built on top of a map/reduce
architecture (Hadoop).
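As a preview, the three-step workflow above could be sketched in Pig roughly as follows. This is only a sketch under assumptions: the file names queries.txt and spelling.csv are hypothetical, the input is assumed to hold just the query text on each line, and spelling correction is approximated by a lookup table of (misspelled, corrected) pairs.

```pig
-- Hypothetical input: one query string per line, e.g. "Volcno Bat".
queries = LOAD 'queries.txt' AS (q:chararray);

-- Step 1 (map): lower-case each query and split it into one word per tuple.
words = FOREACH queries GENERATE FLATTEN(TOKENIZE(LOWER(q))) AS word;

-- Step 2 (map): correct spelling with a hypothetical lookup table.
-- Words with no entry in the table pass through unchanged.
fixes = LOAD 'spelling.csv' USING PigStorage(',') AS (bad:chararray, good:chararray);
joined = JOIN words BY word LEFT OUTER, fixes BY bad;
spelled = FOREACH joined GENERATE
    (fixes::good IS NULL ? words::word : fixes::good) AS word;

-- Step 3 (reduce): group identical words and count each group.
grouped = GROUP spelled BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(spelled) AS n;
DUMP counts;
```

The statements used here (LOAD, FOREACH, JOIN, GROUP) are the ones introduced one at a time in the rest of these notes.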
Kinds of objects
a relation is a bag
a bag is a collection of tuples (duplicates are allowed)
a tuple is an ordered list of fields
a field is a piece of data
An alias is a name bound to a relation.
The following loads some data from a comma-separated file.
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);
To look at the data.
DUMP A;
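Before dumping a large relation it can help to check its schema and print only a few tuples. A minimal sketch using the same actor.csv file; the alias Few and the limit of 10 are chosen here only for illustration.

```pig
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);
-- Print the schema bound to alias A (field names and types).
DESCRIBE A;
-- Keep at most 10 tuples so that DUMP does not print the whole file.
Few = LIMIT A 10;
DUMP Few;
```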
To store the data (the output directory name here is illustrative).
STORE A INTO 'actor_out';
Projection: use FOREACH to iterate over the tuples and generate a column.
B = FOREACH A GENERATE name;
Selection: use a filter.
C = FILTER A BY id < 20;
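Selection and projection are often chained. The following sketch combines the two; the aliases C2 and B2, the name pattern, and the derived column name_upper are illustrative choices, not part of the original example.

```pig
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);
-- Selection: keep tuples whose id is below 20 and whose name starts with 'A'.
-- MATCHES takes a Java regular expression that must match the whole string.
C2 = FILTER A BY (id < 20) AND (name MATCHES 'A.*');
-- Projection: keep the name plus a derived upper-case column.
B2 = FOREACH C2 GENERATE name, UPPER(name) AS name_upper;
DUMP B2;
```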
Join: combine two relations on a shared field.
E = LOAD 'address.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
D = JOIN A BY name, E BY name;
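The joined relation keeps the fields of both inputs, so it contains two fields called name. A sketch of how one might project the result, using the alias prefix to say which input a field comes from (the alias P is introduced here for illustration):

```pig
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);
E = LOAD 'address.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
D = JOIN A BY name, E BY name;
-- D has the fields A::id, A::name, E::name, E::address.
-- Both inputs define name, so the prefix is needed to pick one of them.
P = FOREACH D GENERATE A::id AS id, A::name AS name, E::address AS address;
DUMP P;
```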
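Grouping creates one tuple per group-by value. Each tuple has a field
named group holding the value and a bag, named after the input alias,
holding the matching tuples.
M = FOREACH A GENERATE id % 3 AS mod, name;
N = GROUP M BY mod;
X = FOREACH N GENERATE group AS mod, COUNT(M);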
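To see what grouping produces, it may help to trace a tiny input by hand. The four rows below are made up for illustration; the order of tuples inside a bag and in the output is not guaranteed.

```pig
-- Hypothetical contents of actor.csv:
--   1,Ann
--   2,Bob
--   3,Cy
--   4,Dee
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- M: (1,Ann) (2,Bob) (0,Cy) (1,Dee)
M = FOREACH A GENERATE id % 3 AS mod, name;

-- N has two fields: group (the mod value) and M (a bag of matching tuples).
-- N: (0,{(0,Cy)}) (1,{(1,Ann),(1,Dee)}) (2,{(2,Bob)})
N = GROUP M BY mod;

-- Count the tuples in each bag.
-- X: (0,1) (1,2) (2,1)
X = FOREACH N GENERATE group AS mod, COUNT(M) AS n;
DUMP X;
```

After GROUP, the grouping value is always referred to as group, and the bag carries the name of the relation that was grouped, which is why COUNT is applied to M rather than to name.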