Saturday, September 13, 2014

Easy parallel corpora from Wikipedia

We're off to a busy start to the semester, and between co-teaching a new (for me) class, proposals, project work, and students returning from internships, I haven't had much capacity for extracurricular writing.

But I wanted to post a link to some scripts I just pushed to GitHub that build a parallel corpus by extracting the titles from the interlingual links on Wikipedia. I've found Wikipedia title pairs to be a surprisingly useful resource on a number of occasions (great coverage of interesting languages and scripts, and a good license for data use and distribution), and I imagine others will as well.
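To give a rough idea of what extracting title pairs involves, here is a minimal sketch in Python. It is not the code from the scripts mentioned above, which may well work from the Wikipedia dumps instead. It assumes you already have a JSON response from the MediaWiki API (`action=query` with `prop=langlinks` and `formatversion=2`), and the function names `extract_title_pairs` and `to_tsv`, as well as the sample data, are made up for illustration.

```python
def extract_title_pairs(response, target_lang):
    """Collect (source_title, target_title) pairs from one API response.

    Assumes the response shape of the MediaWiki API with formatversion=2:
    a list of pages under query.pages, each with an optional "langlinks"
    list whose entries carry "lang" and "title" keys.
    """
    pairs = []
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            if link.get("lang") == target_lang:
                pairs.append((page["title"], link["title"]))
    return pairs


def to_tsv(pairs):
    """Render pairs as tab-separated lines, one title pair per line."""
    return "\n".join(f"{src}\t{tgt}" for src, tgt in pairs)


# Hand-written sample in the assumed response shape (not real API output).
sample = {
    "query": {
        "pages": [
            {
                "title": "Moscow",
                "langlinks": [
                    {"lang": "fr", "title": "Moscou"},
                    {"lang": "ru", "title": "Москва"},
                ],
            },
            {
                "title": "Tokyo",
                "langlinks": [{"lang": "ru", "title": "Токио"}],
            },
            # A page with no interlanguage links has no "langlinks" key.
            {"title": "Some page without links"},
        ]
    }
}

russian_pairs = extract_title_pairs(sample, "ru")
print(to_tsv(russian_pairs))
```

Keeping the extraction separate from the fetching means the same function can be pointed at responses saved to disk, which is kinder to the servers when building a large corpus.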


1 comment:

Unknown said...

This could be a potential resource for our multilingual entity project!

September 29, 2014 at 11:19 PM
