A cloud, for our purposes, is a shared-nothing, networked group of
computers that we can use to run some computation in parallel.
A typical application is search engine log analysis.
An example log file holds terabytes per day of entries such as
query?Island palm tree
Workflow to analyze the log
We need a dataflow language to manage each step.
Assume each set of words is a list. The steps are
Map to lower case
Map to correct spelling
Aggregate and count
- Initially, the words
["Bat", "Volcno", "bat"]
1) Map to lower case
["bat", "volcno", "bat"]
2) Map to correct spelling
["bat", "volcano", "bat"]
3) Aggregate and count
[(1, "volcano"), (2, "bat")]
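The three steps above can be sketched in Python, used here only for illustration since the notes do not prescribe a language for this workflow. The `SPELLING` table is a hypothetical stand-in for a real spelling corrector, and `analyze` is an assumed name.

```python
from collections import Counter

# Hypothetical correction table; a real system would use a spelling corrector.
SPELLING = {"volcno": "volcano"}

def analyze(words):
    """Lower-case, spell-correct, then count a list of query words."""
    lowered = [w.lower() for w in words]               # step 1: map
    corrected = [SPELLING.get(w, w) for w in lowered]  # step 2: map
    counts = Counter(corrected)                        # step 3: aggregate
    # Return (count, word) pairs sorted by count, matching the example layout.
    return sorted((n, w) for w, n in counts.items())

print(analyze(["Bat", "Volcno", "bat"]))  # [(1, 'volcano'), (2, 'bat')]
```

Steps 1 and 2 treat each word independently, so they could run in parallel on different machines; only step 3 needs to bring equal words together.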
In functional programming, these steps are "map" and "reduce", which are
higher-order functions.
A higher-order function takes a function as a parameter (or produces
a function as a result). The classic example is map, which applies a
function to every element in a list. The higher-order functions map and
reduce (fold) are built into functional languages like Haskell.
map (\x -> x * x) [1, 2, 3, 4]
would result in
[1, 4, 9, 16]
reduce (*) 1 [1, 2, 3, 4]
would result in
1 * 1 * 2 * 3 * 4
So the factorial can also be defined as
factorial n = reduce (*) 1 [1..n]
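For comparison, the same map and reduce calls can be sketched in Python, assuming the standard `functools.reduce` and `operator.mul`. The names `squares`, `product` and `factorial` are chosen here for illustration.

```python
from functools import reduce
from operator import mul

# map applies the function to every element of the list.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce combines the elements with (*), starting from the initial value 1.
product = reduce(mul, [1, 2, 3, 4], 1)               # 1 * 1 * 2 * 3 * 4 = 24

def factorial(n):
    """Product of 1..n; the initial value 1 also covers n == 0."""
    return reduce(mul, range(1, n + 1), 1)

print(squares, product, factorial(5))
```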
Of course, if we lacked a specific function, we could always create it.
For instance, here is an implementation of the map function.
mapImpl f [] = []
mapImpl f (x:xs) = f x : mapImpl f xs
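The same recursive definition can be sketched in Python. The name `map_impl` is an assumption made for this example, and the list is split into a head and a tail to mirror the `(x:xs)` pattern.

```python
def map_impl(f, xs):
    """Apply f to every element of the list xs, recursively."""
    if not xs:                      # empty list: nothing to apply f to
        return []
    head, *tail = xs                # split into first element and the rest
    return [f(head)] + map_impl(f, tail)

print(map_impl(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```

This is meant to show the structure of the recursion, not to replace the built-in `map`; Python limits recursion depth, so it only suits short lists.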
Pig is a dataflow language, built on top of a map/reduce framework.
Kinds of objects
a relation is a bag
a bag is a collection of tuples
a tuple is a list of fields
a field is a piece of data
An alias is a name bound to an object.
The following loads some data and binds it to the alias A.
A = LOAD 'actor.csv' USING PigStorage(',') AS (id:int, name:chararray);
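To make the relation, tuple and field vocabulary concrete, here is a rough Python model of what the LOAD statement produces. This is an analogy only, not how Pig is implemented, and the CSV contents are hypothetical because the real actor.csv is not shown in these notes.

```python
import csv
import io

# Hypothetical contents of actor.csv (id, name); made up for this sketch.
ACTOR_CSV = "1,Alice\n25,Bob\n7,Carol\n"

def load(text):
    """Return a bag (here a list) of tuples (id:int, name:str), one per row."""
    return [(int(row[0]), row[1]) for row in csv.reader(io.StringIO(text))]

A = load(ACTOR_CSV)   # A plays the role of the alias bound to the relation
print(A)              # [(1, 'Alice'), (25, 'Bob'), (7, 'Carol')]
```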
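To look at the data, use DUMP (for example, DUMP A;).
To store the data, use STORE with an output location (for example, STORE A INTO 'output'; where 'output' is a placeholder path).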
Projection, use FOREACH to iterate over the tuples and generate a column.
B = FOREACH A GENERATE name;
Selection, use a filter.
C = FILTER A BY id < 20;
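Projection and selection correspond to list comprehensions over a list of tuples. The sketch below is a Python analogy with hypothetical data standing in for relation A, not Pig itself.

```python
# Hypothetical (id, name) tuples standing in for relation A.
A = [(1, "Alice"), (25, "Bob"), (7, "Carol")]

# Projection: keep only the name field of every tuple.
B = [(name,) for (_id, name) in A]

# Selection: keep only the tuples whose id is below 20.
C = [t for t in A if t[0] < 20]

print(B)  # [('Alice',), ('Bob',), ('Carol',)]
print(C)  # [(1, 'Alice'), (7, 'Carol')]
```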
Join, match the tuples of two relations on a shared field.
E = LOAD 'address.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
D = JOIN A BY name, E BY name;
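A join of this kind is an inner equijoin: each output tuple concatenates a tuple from A with a tuple from E that has the same name. A minimal Python sketch follows, with hypothetical data and an assumed helper named `join`. It builds a hash index on one side, which is one common way to do this and is not a description of how Pig executes the join.

```python
# Hypothetical data standing in for relations A (id, name) and E (name, address).
A = [(1, "Alice"), (25, "Bob"), (7, "Carol")]
E = [("Alice", "1 Main St"), ("Carol", "9 Elm Rd"), ("Dave", "3 Oak Ave")]

def join(left, left_key, right, right_key):
    """Inner equijoin: concatenate every pair of tuples whose keys match."""
    index = {}
    for rt in right:                       # index the right side by its key
        index.setdefault(rt[right_key], []).append(rt)
    return [lt + rt for lt in left for rt in index.get(lt[left_key], [])]

D = join(A, 1, E, 0)   # join A on field 1 (name) with E on field 0 (name)
print(D)  # [(1, 'Alice', 'Alice', '1 Main St'), (7, 'Carol', 'Carol', '9 Elm Rd')]
```

Bob and Dave are dropped because neither has a matching name on the other side.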
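Grouping creates, for each group-by value, a bag of the tuples with that value.
After GROUP, each tuple of N has two fields: group (the key) and M (the bag of matching tuples).
M = FOREACH A GENERATE id % 3 AS mod, name;
N = GROUP M BY mod;
X = FOREACH N GENERATE group, COUNT(M);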
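In Pig, GROUP produces one tuple per key, holding the key and a bag of the tuples that share it; counting then measures the size of each bag. The Python sketch below imitates this with hypothetical data, sorting by key only to give a predictable order.

```python
from collections import defaultdict

# Hypothetical (id, name) tuples standing in for relation A.
A = [(1, "Alice"), (5, "Bob"), (7, "Carol"), (9, "Dan")]

# M: replace each id by id mod 3.
M = [(i % 3, name) for (i, name) in A]

# N: one (key, bag) pair per distinct mod value.
groups = defaultdict(list)
for t in M:
    groups[t[0]].append(t)
N = sorted(groups.items())

# X: the key together with the number of tuples in its bag.
X = [(key, len(bag)) for key, bag in N]
print(X)  # [(0, 1), (1, 2), (2, 1)]
```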