aboSamoor/lydia

Add the following to your ~/.bashrc:
# Add custom Python modules to the Python path.
PYTHONPATH=$PYTHONPATH:/path/to/your/lydia/
export PYTHONPATH
Replace /path/to/your/lydia/ with the directory where you keep the code.
To get a dump of the Arabic Wikipedia, browse
http://download.wikimedia.org/arwiki/
The dump that I downloaded:
 |wget -c http://download.wikimedia.org/arwiki/20110121/arwiki-20110121-pages-articles.xml.bz2
To fragment the Wikipedia dump into separate XML files, each containing one article:
 |$wikipedia/fragmenter.py wikiDump.xml outputDirectory
outputDirectory will contain all the files. Each file is named after the id of its Wikipedia article.
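The splitting step described above can be sketched roughly as follows. This is not the code in fragmenter.py, only a minimal illustration that assumes the standard MediaWiki dump layout (a <page> element whose direct <id> child is the article id). The names fragment and local_name are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that MediaWiki dumps put on tags."""
    return tag.rsplit("}", 1)[-1]


def fragment(dump, output_dir):
    """Write each <page> of the dump to output_dir/<page id>.xml; return the ids."""
    os.makedirs(output_dir, exist_ok=True)
    ids = []
    # iterparse reads the dump incrementally instead of loading it in one go.
    for _event, elem in ET.iterparse(dump, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # The article id is the direct <id> child; revision ids are nested deeper.
        page_id = next(
            (child.text for child in elem if local_name(child.tag) == "id"), None
        )
        if page_id:
            path = os.path.join(output_dir, page_id + ".xml")
            ET.ElementTree(elem).write(path, encoding="utf-8")
            ids.append(page_id)
        elem.clear()  # drop the contents of the page that was just written
    return ids
```

Calling fragment("wikiDump.xml", "outputDirectory") would, under these assumptions, leave one file per article in outputDirectory.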
In each XML file, the article text is contained between the <text></text> tags and is still written in wiki markup.
doStrip contains most of the formatting logic needed to remove the wiki markup.
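To illustrate the kind of work involved, here is a small sketch that pulls the text out of a fragment and removes a handful of common wiki constructs. It is not doStrip: the function names article_text and strip_markup and the choice of rules are assumptions, and real articles (nested templates, tables, files) need many more rules.

```python
import re
import xml.etree.ElementTree as ET


def article_text(xml_string):
    """Return the contents of the first <text> element, or '' if there is none."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "text":
            return elem.text or ""
    return ""


def strip_markup(wikitext):
    """Remove a few common wiki constructs (illustrative, far from complete)."""
    # Self-closing references such as <ref name="a" />
    text = re.sub(r"<ref[^>]*/>", "", wikitext)
    # Paired references: <ref>...</ref>
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Non-nested templates: {{...}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Links: [[target]] -> target, [[target|label]] -> label
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold and italic quote runs: ''x'' and '''x'''
    text = re.sub(r"'{2,}", "", text)
    # Headings: == Title == -> Title
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()
```

The order of the rules matters: references and templates are removed before links so that links inside them do not leave stray labels behind.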
To extract the text and remove the markup:
 |$wikipedia/formatter.py inputFile
This will generate an output file called inputFile.f
To run NER over a whole folder:
 |$aner/NER.py folderName
It runs over all the files that have the .f extension and deletes all the temp files except the .bw.post.NER file, which has five columns. The stats will be stored in folderName/stats
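As a rough picture of which files the folder run picks up, the selection of .f files might look like the sketch below. This is an assumption about the behavior, not code from NER.py; in particular it looks only at files directly inside the folder, and the name formatted_files is hypothetical.

```python
from pathlib import Path


def formatted_files(folder):
    """Return the files directly inside folder whose name ends in '.f', sorted."""
    return sorted(
        p for p in Path(folder).iterdir() if p.is_file() and p.suffix == ".f"
    )
```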
To run NER on a single file:
 |$aner/NER filename
It runs on the file passed and produces a .bw.ner file. It does not delete any temp files.

Arabic NER pipeline for Lydia system
