dbpedia-gsoc

From: Nilesh C. <ni...@ni...> - 2014-03-05 13:01:06
Hi Andrea, Dimitris and everyone!
I'm a senior year B.Tech undergraduate majoring in Computer Science.
Machine learning and data science excite me like nothing else. I've got
quite some experience with Hadoop, and after studying the details of
the *Extraction
using Map Reduce* project idea I figured that this would be a good match
for my skillset and should be fun for me too.
First, a bit of background:
Among other things, I worked on peta-scale graph centrality computation
using MapReduce during a research internship a year ago - I built a Hadoop
implementation for computing PageRank on huge graphs (it's ongoing, some
WIP code at [1]), aiming for significantly better performance than
Pegasus [2].
Last year I was hacking on an entity suggester for Wikidata, so I have a
good idea of the structure of wiki dumps. I needed to build a feature
matrix to feed into a collaborative filtering engine - so in essence I had
to generate (row, col, value) tuples (sparse matrix data) from the big
Wikidata dumps. You can find the Hadoop Streaming Python scripts at [4]. I
used the lxml and json libraries in the Python code to parse the raw dumps,
which let us parallelize the task easily without needing to run a separate
MediaWiki instance on a LAMP stack.
The Hadoop code I just mentioned also has a custom InputFormat to split the
wiki dump XML into <page>...</page> chunks. An even better idea would be to
use the wikihadoop [5] project - it provides custom InputFormats that split
Wikipedia dumps into per-page chunks. The splitting happens inside Hadoop
itself, automated and parallelized, and we don't even need to extract the
bz2 files: wikihadoop decompresses bz2 on the fly.
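Just to make the <page>-chunk idea concrete, here is a minimal Scala sketch
of a mapper over whole pages. The PageMapper name and the (LongWritable,
Text) input types are my own assumptions; they would depend on whichever
InputFormat we end up configuring:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper
    import scala.xml.XML

    // Sketch only: assumes the configured InputFormat hands each map() call one
    // complete <page>...</page> element as the value; the (LongWritable, Text)
    // input types are an assumption and depend on the InputFormat actually used.
    class PageMapper extends Mapper[LongWritable, Text, Text, Text] {
      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
        val page = XML.loadString(value.toString)
        val title = (page \ "title").text
        val wikitext = (page \ "revision" \ "text").text
        // Emit (title, wikitext) so downstream jobs can parse the wikitext further.
        context.write(new Text(title), new Text(wikitext))
      }
    }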
Writing MapReduce jobs for extracting redirects would be trivial; the
reducer gives us a lot of flexibility here, e.g. building redirect lists as
adjacency lists and storing them on HDFS in a format that is useful for the
next step. Parsing each page and generating RDF triples in the mappers,
then aggregating/joining them via the reducers - the whole thing should be
doable in 2-3 MapReduce jobs.
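As a rough illustration of the redirect step, here is a sketch that assumes
an upstream job has already produced (title, wikitext) pairs (e.g. as a
SequenceFile with Text keys and values). RedirectMapper/RedirectReducer are
just names I made up, and the regex is a simplification of MediaWiki's
redirect syntax:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.{Mapper, Reducer}
    import scala.collection.JavaConverters._

    // Emits (redirect target, redirect source) for every page whose wikitext
    // contains a #REDIRECT [[...]] directive (simplified matching).
    class RedirectMapper extends Mapper[Text, Text, Text, Text] {
      private val Redirect = """(?is)#REDIRECT\s*\[\[([^\]\|#]+)""".r.unanchored
      override def map(title: Text, wikitext: Text,
                       context: Mapper[Text, Text, Text, Text]#Context): Unit =
        wikitext.toString match {
          case Redirect(target) => context.write(new Text(target.trim), title)
          case _                => // not a redirect page, emit nothing
        }
    }

    // Groups all sources pointing at the same target into one adjacency-list
    // style line, which can be stored on HDFS for the next job to consume.
    class RedirectReducer extends Reducer[Text, Text, Text, Text] {
      override def reduce(target: Text, sources: java.lang.Iterable[Text],
                          context: Reducer[Text, Text, Text, Text]#Context): Unit = {
        val list = sources.asScala.map(_.toString).mkString("\t")
        context.write(target, new Text(list))
      }
    }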
Also, since the data isn't peta-scale and we're still talking GBs here, I
think using Spark instead of plain Hadoop MapReduce could be a good option
too. Spark has a native Scala API and keeps working sets in memory,
spilling to disk when needed, which is often faster. In any case, we'd
stick to Scala or Java.
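Here is a minimal sketch of how the same two steps might look in Spark.
loadPages and extractTriples are stand-ins I made up for illustration -
loadPages would wrap something like sc.newAPIHadoopFile with a dump-splitting
InputFormat, and extractTriples would wrap the actual extraction logic:

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkPipelineSketch {
      private val Redirect = """(?is)#REDIRECT\s*\[\[([^\]\|#]+)""".r

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dbpedia-extraction-sketch"))

        // Hypothetical loader yielding (title, wikitext) pairs; in practice this
        // would wrap sc.newAPIHadoopFile plus a <page>-splitting InputFormat.
        val pages = loadPages(sc, args(0)).cache()   // cached: reused by both steps below

        // Step 1: redirect pairs, stored for the next job.
        pages.flatMap { case (title, text) =>
          Redirect.findFirstMatchIn(text).map(m => (title, m.group(1).trim))
        }.map { case (src, dst) => s"$src\t$dst" }
          .saveAsTextFile(args(1) + "/redirects")

        // Step 2: placeholder for the real extraction logic - one page in,
        // N-Triples lines out.
        pages.flatMap { case (title, text) => extractTriples(title, text) }
          .saveAsTextFile(args(1) + "/triples")
      }

      // Both helpers are stand-ins, not part of any existing API.
      def loadPages(sc: SparkContext, path: String) =
        sc.objectFile[(String, String)](path)
      def extractTriples(title: String, text: String): Seq[String] = Seq.empty
    }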
It would be great if you could help me out with these questions:
 - It looks like org.dbpedia.extraction.mappings.AbstractExtractor calls the
 API of a local MediaWiki instance. We could have a single MW instance on a
 LAMP server answer the API queries from all the mappers and do all the
 processing (the stuff that AbstractExtractor does) in the mappers. That
 would keep things parallel and would be faster than simple sequential
 extraction, but the slow MediaWiki node may turn out to be a bottleneck.
 Thoughts? Also, what would be the problem with having an automated script
 set up a MediaWiki+MySQL instance on each of the Hadoop machines? (A rough
 sketch of the mapper-side API call follows after these questions.)
 - Could you give me some pointers as to what my next steps should be?
 Should I start working on a prototype, draft my proposal on Google Melange,
 or share it here first?
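To illustrate the mapper-side API call I mean in the first question - purely
a sketch, the host name is a placeholder and I haven't checked which API
call AbstractExtractor actually makes:

    import java.net.URLEncoder
    import scala.io.Source

    // Hypothetical mapper-side helper; "mw-host" is a placeholder for the shared
    // (or per-node) MediaWiki instance, and action=parse is simply the standard
    // MediaWiki API call for rendering a page - I haven't checked which call
    // AbstractExtractor actually makes.
    object AbstractFetcher {
      def renderedPage(title: String, host: String = "mw-host"): String = {
        val url = s"http://$host/w/api.php?action=parse&format=json&prop=text&page=" +
          URLEncoder.encode(title, "UTF-8")
        val in = Source.fromURL(url, "UTF-8")
        try in.mkString finally in.close()   // returns the raw JSON response
      }
    }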
Please let me know if you have any questions and I'll be glad to clarify
them for you. :)
Cheers,
Nilesh
[1] : https://github.com/nilesh-c/graphfu
[2] : http://www.cs.cmu.edu/~ukang/papers/PegasusICDM2009.pdf
[3] : http://www10.org/cdrom/papers/pdf/p577.pdf
[4] : https://github.com/nilesh-c/wes/tree/master/wikiparser
[5] : https://github.com/whym/wikihadoop
A quest eternal, a life so small! So don't just play the guitar, build one.
You can also email me at co...@ni... or visit my
website <http://www.nileshc.com/>
From: Dimitris K. <ji...@gm...> - 2014-03-06 10:33:02
Hi Nilesh & welcome to the DBpedia community,
You seem familiar with Wikipedia dump processing, and wikihadoop might fit
here too. However, we already have our own library that can download,
process and split (zipped) dumps, so let's see if this fits the requirements.
As you have probably already noticed in previous threads, we are waiting
for a mentor with MapReduce experience to join, but the main idea and
workflow are described here:
http://wiki.dbpedia.org/gsoc2014/ideas/ExtractionwithMapReduce/
The rationale for choosing Spark sounds reasonable, but I don't have the
experience to comment on it.
To answer your questions:
- The abstract extractor is already the bottleneck of the extraction
process on a single machine and it will not scale. It might work (quite
slowly) with a limited number of parallel nodes (2-3), but if we want to
go full speed we must skip it.
- As for the next tasks, you can get a little familiar with the extraction
process, run some sample extractions to get to know how things work, and
then focus on your proposal.
You can share it privately through the Melange system or publicly through
the mailing list; it's up to you.
Cheers,
Dimitris
-- 
Kontokostas Dimitris