Apache DataFu™

Apache DataFu Pig - Guide

Set Operations

Apache DataFu provides UDFs for performing set operations on bags: SetIntersect, SetUnion, and SetDifference.

Set Intersection

Compute the set intersection with SetIntersect:

define SetIntersect datafu.pig.sets.SetIntersect();
-- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
intersected = FOREACH input {
 -- input bags must be sorted
 sorted_b1 = ORDER B1 by val;
 sorted_b2 = ORDER B2 by val;
 GENERATE SetIntersect(sorted_b1,sorted_b2);
}
-- produces: ({(1),(4),(5)})
DUMP intersected;
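
The intersection is an ordinary bag, so it can feed later steps. The following untested sketch keeps only the rows whose two bags share at least one value and counts the shared values. The jar path in REGISTER and the aliases `data`, `with_common`, `overlapping`, `common`, and `n_common` are illustrative assumptions, not names from DataFu; `IsEmpty` and `COUNT` are Pig built-ins.

```pig
-- Assumed jar location; adjust to wherever the DataFu Pig jar lives.
REGISTER 'datafu-pig.jar';
DEFINE SetIntersect datafu.pig.sets.SetIntersect();

-- Same input layout as above: two bags of ints per row.
data = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});

with_common = FOREACH data {
  -- SetIntersect expects sorted input bags
  sorted_b1 = ORDER B1 BY val;
  sorted_b2 = ORDER B2 BY val;
  GENERATE B1, B2, SetIntersect(sorted_b1, sorted_b2) AS common;
}

-- Drop rows whose bags have nothing in common.
overlapping = FILTER with_common BY NOT IsEmpty(common);

-- Number of shared values per remaining row.
counts = FOREACH overlapping GENERATE COUNT(common) AS n_common;
DUMP counts;
```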

Set Union

Compute the set union with SetUnion:

define SetUnion datafu.pig.sets.SetUnion();
-- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
unioned = FOREACH input GENERATE SetUnion(B1,B2);
-- produces: ({(3),(4),(1),(2),(7),(5),(6),(0),(10)})
DUMP unioned;

SetUnion can also operate on more than two bags (here a third bag, B3, is assumed to be present in the input):

unioned = FOREACH input GENERATE SetUnion(B1,B2,B3);
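
For completeness, a self-contained version of the multi-bag call might look like the sketch below. The third bag `B3`, the sample row in the comment, the alias `data`, and the jar path are assumptions made for illustration; the order of tuples in the resulting bag should not be relied on.

```pig
-- Assumed jar location; adjust to wherever the DataFu Pig jar lives.
REGISTER 'datafu-pig.jar';
DEFINE SetUnion datafu.pig.sets.SetUnion();

-- Hypothetical sample row: ({(1),(2)},{(2),(3)},{(3),(4)})
data = LOAD 'input' AS (B1:bag{T:tuple(val:int)},
                        B2:bag{T:tuple(val:int)},
                        B3:bag{T:tuple(val:int)});

-- Union of all three bags; duplicates across bags appear once.
unioned = FOREACH data GENERATE SetUnion(B1,B2,B3);

-- For the sample row the result contains the values 1, 2, 3 and 4.
DUMP unioned;
```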

Set Difference

Compute the set difference with SetDifference:

define SetDifference datafu.pig.sets.SetDifference();
-- ({(3),(4),(1),(2),(7),(5),(6)},{(1),(3),(5),(12)})
input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
differenced = FOREACH input {
 -- input bags must be sorted
 sorted_b1 = ORDER B1 by val;
 sorted_b2 = ORDER B2 by val;
 GENERATE SetDifference(sorted_b1,sorted_b2);
}
-- produces: ({(2),(4),(6),(7)})
DUMP differenced;
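
The operations above can be combined. As an untested sketch, a symmetric difference (values that appear in exactly one of the two bags) can be expressed as the union of the two one-sided differences. The alias names `data` and `sym_diff` and the jar path are illustrative assumptions.

```pig
-- Assumed jar location; adjust to wherever the DataFu Pig jar lives.
REGISTER 'datafu-pig.jar';
DEFINE SetDifference datafu.pig.sets.SetDifference();
DEFINE SetUnion datafu.pig.sets.SetUnion();

-- ({(3),(4),(1),(2),(7),(5),(6)},{(1),(3),(5),(12)})
data = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});

sym_diff = FOREACH data {
  -- SetDifference expects sorted input bags
  sorted_b1 = ORDER B1 BY val;
  sorted_b2 = ORDER B2 BY val;
  -- (B1 minus B2) united with (B2 minus B1)
  GENERATE SetUnion(SetDifference(sorted_b1, sorted_b2),
                    SetDifference(sorted_b2, sorted_b1));
}

-- For the sample row the result contains the values 2, 4, 6, 7 and 12.
DUMP sym_diff;
```
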
Copyright © 2011-2025 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.