Apache Pig

Open-source data analytics software

Developers	Apache Software Foundation, Yahoo Research
Initial release	September 11, 2008
Stable release	0.17.0 / June 19, 2017
Repository	svn.apache.org/repos/asf/pig/
Operating system	Microsoft Windows, OS X, Linux
Type	Data analytics
License	Apache License 2.0
Website	pig.apache.org

Apache Pig^[1] is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.^[1] Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.^[2] Pig Latin abstracts programming away from the Java MapReduce idiom into a higher-level notation, much as SQL does for relational database management systems. Pig Latin can be extended with user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby, or Groovy^[3] and then call directly from the language.

History

Apache Pig was originally^[4] developed at Yahoo Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007,^[5] it was moved into the Apache Software Foundation.

Version	Original release date	Latest version	Release date^[6]	Status
0.1	2008-09-11	0.1.1	2008-12-05	Unsupported
0.2	2009-04-08	0.2.0	2009-04-08	Unsupported
0.3	2009-06-25	0.3.0	2009-06-25	Unsupported
0.4	2009-08-29	0.4.0	2009-08-29	Unsupported
0.5	2009-09-29	0.5.0	2009-09-29	Unsupported
0.6	2010-03-01	0.6.0	2010-03-01	Unsupported
0.7	2010-05-13	0.7.0	2010-05-13	Unsupported
0.8	2010-12-17	0.8.1	2011-04-24	Unsupported
0.9	2011-07-29	0.9.2	2012-01-22	Unsupported
0.10	2012-01-22	0.10.1	2012-04-25	Unsupported
0.11	2013-02-21	0.11.1	2013-04-01	Unsupported
0.12	2013-10-14	0.12.1	2014-04-14	Unsupported
0.13	2014-07-04	0.13.0	2014-07-04	Unsupported
0.14	2014-11-20	0.14.0	2014-11-20	Unsupported
0.15	2015-06-06	0.15.0	2015-06-06	Unsupported
0.16	2016-06-08	0.16.0	2016-06-08	Unsupported
0.17	2017-06-19	0.17.0	2017-06-19	Latest version

Naming

The name of the Pig programming language was chosen arbitrarily and stuck because it was memorable, easy to spell, and novel.^[7]^[8]^[9]

The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck. It is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.

— Alan Gates, Daniel Dai, "What Is Pig?", Programming Pig, 2nd Edition (November 2017)

Example

Below is an example of a "Word Count" program in Pig Latin:

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

The above program generates parallel executable tasks that can be distributed across multiple machines in a Hadoop cluster to count the words in a dataset such as all the webpages on the internet.
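
To make the dataflow concrete, the following is a minimal single-machine Python sketch of what each relation in the script holds. It is an illustration only, not how Pig executes the job: the function name `word_count` and the sample lines are invented here, `str.split()` stands in for TOKENIZE as a simplification, and ties in the ordering are broken alphabetically, which is an arbitrary choice of this sketch.

```python
import re
from collections import Counter


def word_count(lines):
    """Return (count, word) pairs ordered by descending count."""
    # words: one token per row, like FOREACH ... FLATTEN(TOKENIZE(line))
    words = [token for line in lines for token in line.split()]
    # filtered_words: keep tokens made only of word characters
    # (the sketch uses a full match, mirroring MATCHES '\\w+')
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]
    # word_groups and word_count: group by word, then count each group
    counts = Counter(filtered_words)
    # ordered_word_count: highest count first; ties broken by word (assumption)
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


# Hypothetical input standing in for the loaded file.
sample_lines = ["the quick brown fox", "the lazy dog", "the end ..."]
result = word_count(sample_lines)
print(result[0])  # prints (3, 'the')
```

The token `...` is dropped by the filter step, just as non-word tokens are dropped by the FILTER statement in the script.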

Pig vs SQL

In comparison to SQL, Pig

has a nested relational model,
uses lazy evaluation,
uses extract, transform, load (ETL),
is able to store data at any point during a pipeline,
declares execution plans,
supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines.
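
Lazy evaluation, listed above, means that defining a step does no work until a result is requested. The Python generators below are a loose single-process analogy rather than Pig's implementation; `load`, `keep_even`, and the `trace` list are hypothetical names invented for this sketch.

```python
trace = []  # records when rows are actually read


def load(rows):
    """Yield rows one at a time, noting each read in the trace."""
    for row in rows:
        trace.append(row)
        yield row


def keep_even(rows):
    """Pass through only the even rows."""
    for row in rows:
        if row % 2 == 0:
            yield row


# Building the pipeline only wires the steps together.
pipeline = keep_even(load([1, 2, 3, 4]))
reads_before = len(trace)  # nothing has been read yet

# Asking for the output is what triggers the work.
stored = list(pipeline)
reads_after = len(trace)  # every row was read on demand

print(reads_before, stored, reads_after)  # prints 0 [2, 4] 4
```

By analogy, a Pig script's statements describe the work, and the work is typically carried out only when output is requested, for example by a STORE statement.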

On the other hand, it has been argued that DBMSs are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.^[10]

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is declarative. In SQL, users specify that data from two tables must be joined but generally not which join implementation to use. Some database systems accept optimizer hints, but the choice is usually left to the engine, on the grounds that "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm." Pig Latin allows users to specify an implementation, or aspects of an implementation, to be used in executing a script in several ways.^[11] In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.^[12]

SQL is oriented around queries that produce a single result. SQL handles trees naturally but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.^[11]
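
The idea of splitting a stream and applying different operators to each sub-stream can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Pig code: `split_stream` is a hypothetical helper, and the `(user, bytes)` event records are invented sample data.

```python
def split_stream(records, predicate):
    """Route each record to one of two sub-streams based on a predicate."""
    matched, rest = [], []
    for record in records:
        (matched if predicate(record) else rest).append(record)
    return matched, rest


# Hypothetical input: (user, bytes_transferred) pairs.
events = [("ann", 120), ("bob", 0), ("ann", 30), ("cy", 0)]

# One input feeds two branches, so the flow is a small DAG, not a line.
active, idle = split_stream(events, lambda event: event[1] > 0)

# Branch 1: total bytes per user over the active sub-stream.
bytes_per_user = {}
for user, size in active:
    bytes_per_user[user] = bytes_per_user.get(user, 0) + size

# Branch 2: a different operator, collecting the distinct idle users.
idle_users = sorted({user for user, _ in idle})

print(bytes_per_user)  # prints {'ann': 150}
print(idle_users)      # prints ['bob', 'cy']
```

Both branches produce their own result from the same input, which is the shape a single-result query does not express directly.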

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.^[11]

References

1. "Hadoop: Apache Pig". Retrieved September 2, 2011.
2. "[PIG-4167] Initial implementation of Pig on Spark - ASF JIRA". issues.apache.org. Retrieved December 29, 2018.
3. "Pig user defined functions". Retrieved May 3, 2013.
4. "Yahoo Blog: Pig – The Road to an Efficient High-level language for Hadoop". Archived from the original on February 3, 2016. Retrieved May 23, 2015.
5. "Pig into Incubation at the Apache Software Foundation". Archived from the original on February 3, 2016. Retrieved May 23, 2015.
6. "Apache Pig Releases". Apache. Retrieved March 13, 2019.
7. "1. What Is Pig? - Programming Pig, 2nd Edition [Book]". www.oreilly.com. Retrieved August 1, 2021.
8. Gates, Alan; Dai, Daniel (2016). Programming Pig (Second ed.). Sebastopol, CA. ISBN 978-1-4919-3706-8. OCLC 964523786.
9. Gates, Alan (July 27, 2021). "Pig mascot questions". Pig User Mailing List (Mailing list). Archived from the original on August 1, 2021. Retrieved August 1, 2021.
10. "Communications of the ACM: MapReduce and Parallel DBMSs: Friends or Foes?" (PDF). Archived from the original (PDF) on July 1, 2015. Retrieved May 23, 2015.
11. "Yahoo Pig Development Team: Comparing Pig Latin and SQL for Constructing Data Processing Pipelines". Archived from the original on May 30, 2015. Retrieved May 23, 2015.
12. "ACM SigMod 08: Pig Latin: A Not-So-Foreign Language for Data Processing" (PDF). Retrieved May 23, 2015.

External links

Official website: pig.apache.org
