Apache DataFu™

Apache DataFu Pig - Guide

PageRank

Run PageRank on a large number of independent graphs through the PageRank UDF:

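-- Define the UDF with the 'dangling_nodes' option set to 'true'.
define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');
-- Each input row is one weighted edge belonging to one topic (one graph per topic).
topic_edges = LOAD 'input_edges' as (topic:INT,source:INT,dest:INT,weight:DOUBLE);
-- Collect the outgoing edges of every (topic, source) pair into a bag.
topic_edges_grouped = GROUP topic_edges by (topic, source);
topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
 group.topic as topic,
 group.source as source,
 topic_edges.(dest,weight) as edges;
-- Gather all nodes of a topic together and run PageRank once per topic.
topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic;
topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
 group as topic,
 FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,pr);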
topic_ranks = FOREACH topic_ranks GENERATE
 topic, source, pr;

This implementation stores the nodes and edges (mostly) in memory, so each individual graph must fit within the memory available to a single task. It is therefore best suited to computing PageRank on many reasonably sized graphs in parallel.
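
To make the "many independent graphs, each held in memory" idea concrete, here is a minimal Python sketch of a per-topic weighted PageRank. It is not DataFu's implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the choice to spread the rank of dangling nodes uniformly over all nodes are assumptions made for illustration, and the resulting values are not expected to match the scale of the UDF's output. Edge weights are assumed to be positive.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank for one graph held entirely in memory.

    edges: iterable of (source, dest, weight) with positive weights.
    Returns a dict mapping node -> rank; the ranks sum to 1.
    damping and iterations are illustrative defaults, not DataFu's.
    """
    out = defaultdict(list)  # source -> [(dest, weight), ...]
    nodes = set()
    for source, dest, weight in edges:
        out[source].append((dest, weight))
        nodes.update((source, dest))
    if not nodes:
        return {}

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges (dangling nodes);
        # assumed here to be redistributed uniformly over all nodes.
        dangling = sum(rank[v] for v in nodes if v not in out)
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {v: base for v in nodes}
        for source, targets in out.items():
            total = sum(w for _, w in targets)
            for dest, w in targets:
                # Each node passes rank along its edges in proportion to weight.
                nxt[dest] += damping * rank[source] * w / total
        rank = nxt
    return rank


def pagerank_by_topic(rows):
    """rows: iterable of (topic, source, dest, weight), like 'input_edges'.

    Splits the rows into one edge list per topic and ranks each graph
    independently, mirroring the GROUP ... BY topic step in the script.
    """
    by_topic = defaultdict(list)
    for topic, source, dest, weight in rows:
        by_topic[topic].append((source, dest, weight))
    return {topic: pagerank(edges) for topic, edges in by_topic.items()}
```

Because each topic's graph is ranked without reference to any other, the per-topic calls can run in parallel, which is the role the final GROUP and FOREACH play in the Pig script. The memory cost is driven by the largest single graph, not by the number of graphs.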

Copyright © 2011-2025 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
