$ pip install -r python_requirements.txt
$ cd scripts
$ python crawl_page.py CATEGORY # specify article category
The collected articles are stored in the data directory under the current directory.
Merge the article data into one CSV file per category. The CSV files are stored in GClassifier/dataset/row.
$ cd scripts
$ python make_single_file.py all # specify article category or all
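To illustrate the merge step, here is a minimal sketch of grouping articles by category and writing one CSV per category. It is not the code in make_single_file.py; the column names (`category`, `title`, `body`), the sample articles, and the helper names are assumptions made for illustration, and in-memory buffers stand in for real files.

```python
import csv
import io
from collections import defaultdict


def group_by_category(articles):
    """Group article dicts by their 'category' field."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article["category"]].append(article)
    return grouped


def write_category_csv(rows, out_file):
    """Write one category's articles to a CSV file-like object."""
    # Assumed column layout; the real CSV schema may differ.
    writer = csv.DictWriter(out_file, fieldnames=["category", "title", "body"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample data, not real crawled articles.
articles = [
    {"category": "sports", "title": "Match report", "body": "..."},
    {"category": "tech", "title": "New phone", "body": "..."},
    {"category": "sports", "title": "Transfer news", "body": "..."},
]

grouped = group_by_category(articles)
buffers = {}
for category, rows in grouped.items():
    buffers[category] = io.StringIO()  # stand-in for one CSV file per category
    write_category_csv(rows, buffers[category])
```

With real files, each buffer would be replaced by a file opened with `open(path, "w", newline="")`, as the csv module recommends.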
Tokenize the data with wakati-gaki (word segmentation) and format it. The output CSV file is stored in GClassifier/dataset/preprocess.
$ cd GClassifier
$ python g_preprocess.py all --wakati_type mecab-noun
# if you use word-level n-gram
$ python g_preprocess.py all --wakati_type word-ngram --ngram_n 2
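For readers unfamiliar with word-level n-grams, the sketch below shows what `--ngram_n 2` conceptually produces from an already segmented sentence. The function name `word_ngrams` and the sample tokens are hypothetical; the actual logic in g_preprocess.py may differ.

```python
def word_ngrams(tokens, n=2):
    """Return word-level n-grams as space-joined strings."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Slide a window of n tokens over the sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokens, as a word segmenter such as MeCab might produce them.
tokens = ["今日", "は", "良い", "天気"]
bigrams = word_ngrams(tokens, n=2)
```

A sequence shorter than `n` yields an empty list, so very short articles contribute no features under this scheme.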
Train a Naive Bayes model on the tokenized data and dump it to GClassifier/naive_bayes_model.pkl.
$ cd GClassifier
$ python dump_classifier.py mecab-noun_all # or n-gram_all
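The train-and-dump step can be sketched with a small multinomial Naive Bayes written in plain Python and serialized with `pickle`. This is an illustration only: the class `TinyNaiveBayes`, the add-one smoothing choice, and the toy documents are assumptions, not the implementation in dump_classifier.py.

```python
import math
import pickle
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:                # ignore unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, already tokenized training documents.
docs = [
    ["goal", "match", "team"],
    ["team", "win", "match"],
    ["phone", "chip", "app"],
    ["app", "update", "phone"],
]
labels = ["sports", "sports", "tech", "tech"]

model = TinyNaiveBayes().fit(docs, labels)
blob = pickle.dumps(model)       # bytes that could be written to a .pkl file
restored = pickle.loads(blob)
prediction = restored.predict(["match", "team"])
```

Writing to disk would use `pickle.dump(model, f)` with a file opened in binary mode. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.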
Run the server, open http://localhost:8000/predict_category/ in a browser, and enter a Gunosy article URL.
$ python manage.py runserver
We evaluated the classifier using 5-fold cross-validation. The result is here
$ cd GClassifier
$ python train_cross_validation.py mecab-noun_all --kfold 5
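As a rough illustration of what `--kfold 5` means, the sketch below splits sample indices into k contiguous folds, each used once as the test set. The function `kfold_indices` is hypothetical and does no shuffling; train_cross_validation.py may split the data differently.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Ten samples, five folds: each fold holds out two samples for testing.
folds = list(kfold_indices(10, 5))
```

The reported score is then typically the mean of the per-fold scores, with the model retrained from scratch on each training split.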