Pig Program
A Pig Program is a software program composed of Pig Latin statements (statements in Apache Pig's dataflow language) that can be executed by a Pig Software System.
- Context:
- It can (typically) be a Data Processing Program.
- It can (often) be an Apache Pig Job.
- Example(s):
- Counter-Example(s):
- a Spark Job.
- a Hive Query.
- See: Data Analysis Program, Pig Software System, Pig Compiler, Map/Reduce Job.
References
2013
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a Pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Create a group for each word
word_groups = GROUP filtered_words BY word;
-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
The above program will generate parallel executable tasks which can be distributed across thousands of machines in a Hadoop cluster to count how many times each word occurs in a dataset such as "all the webpages on the internet".
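To make the dataflow of the program above easier to follow, here is a minimal single-machine sketch in Python that mirrors each Pig statement on an in-memory list of lines. It is an illustration only, not how Pig executes the script: the function name `word_count` is hypothetical, tokenization is simplified to whitespace splitting (Pig's TOKENIZE also splits on some punctuation), and ties in the final ordering are broken alphabetically, which is an assumption the Pig script does not make.

```python
import re
from collections import Counter


def word_count(lines):
    """Local analogue of the Pig word-count dataflow (illustrative sketch)."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    # Assumption: plain whitespace splitting stands in for TOKENIZE.
    words = [word for line in lines for word in line.split()]

    # FILTER ... BY word MATCHES '\\w+': the whole token must consist
    # of word characters, so a full match is required.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

    # GROUP ... BY word, then COUNT the entries in each group.
    counts = Counter(filtered_words)

    # ORDER ... BY count DESC, emitting (count, word) pairs.
    # Assumption: ties are broken by word to keep the output deterministic.
    return sorted(
        ((count, word) for word, count in counts.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )


if __name__ == "__main__":
    for count, word in word_count(["a b a", "b a !"]):
        print(count, word)
```

Each step here runs sequentially on one machine, whereas Pig compiles the corresponding statements into tasks that a Hadoop cluster can run in parallel.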