This is the official codebase for the following paper, implemented in PyTorch:
Hareesh Bahuleyan and Layla El Asri. Diverse Keyphrase Generation with Neural Unlikelihood Training. COLING 2020. https://arxiv.org/pdf/2010.07665.pdf
- Create and activate a Python 3.7.5 virtual environment using conda:

  ```bash
  conda create --name keygen python=3.7.5
  source activate keygen
  ```

- Install the necessary packages using pip:

  ```bash
  pip install -r requirements.txt
  # Download the spacy model
  python -m spacy download en_core_web_sm
  ```
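To verify the environment before moving on (a quick sanity check, not part of the original setup steps):

```bash
python -c "import torch, spacy; spacy.load('en_core_web_sm'); print(torch.__version__)"
```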
Sent2Vec Installation

Sent2Vec is used in the evaluation script. Please install sent2vec from https://github.com/epfml/sent2vec using the steps below:

- Clone/download the repository:

  ```bash
  git clone https://github.com/epfml/sent2vec
  ```

- Go to the sent2vec directory and check out the pinned commit:

  ```bash
  cd sent2vec/
  git checkout f827d014a473aa22b2fef28d9e29211d50808d48
  ```

- Run `make`
- Run `pip install cython`
- Inside the `src` folder, build and install the Python bindings:

  ```bash
  cd src/
  python setup.py build_ext
  pip install .
  ```

- Download a pre-trained sent2vec model. For example, we used `sent2vec_wiki_unigrams`. Finally, copy it to `data/sent2vec/wiki_unigrams.bin`.
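Once installed, you can sanity-check the embedding model from Python (a minimal sketch; `Sent2vecModel`, `load_model`, and `embed_sentence` are part of the sent2vec bindings built above):

```python
import sent2vec

# Load the pre-trained unigram model copied into data/sent2vec/ above.
model = sent2vec.Sent2vecModel()
model.load_model('data/sent2vec/wiki_unigrams.bin')

# Embed a phrase; wiki_unigrams produces 600-dimensional vectors.
emb = model.embed_sentence("diverse keyphrase generation")
print(emb.shape)
```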
Data Download

Download the pre-processed data files in JSON format by visiting this link. Unzip the file and copy it to `data/`. The data folder should now have the following structure:

```
data/
├── kp20k_sorted/
├── KPTimes/
│   └── kptimes_sorted/
├── sample_testset/
├── sent2vec/
│   └── wiki_unigrams.bin
└── stackexchange/
    └── se_sorted/
```
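To get a feel for the data, you can peek at one of the files (a sketch only: the filename and record layout here are hypothetical, since the exact JSON schema of the pre-processed files is not documented above):

```python
import json

# Hypothetical filename; list data/kp20k_sorted/ to see what is actually there.
with open('data/kp20k_sorted/train.json') as f:
    record = json.loads(f.readline())  # assumes one JSON object per line

# Print whatever fields the record carries (e.g., source text and keyphrases).
print(record.keys())
```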
To train a DivKGen model using one of the configurations provided under `configurations/`:

```bash
# Specify the dataset
export DATASET=kp20k
# Specify the configuration name
export EXP=copy_seq2seq_attn_mle_greedy.tgt_15.0.copy_18.0
# Run the training script
allennlp train configurations/$DATASET/$EXP.jsonnet -s output/$DATASET/$EXP/ -f --include-package keyphrase_generation -o '{ "trainer": {"cuda_device": 0} }'
```
The outputs (training logs, model checkpoints, tensorboard logs) will be stored under `output/$DATASET/$EXP`.
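The `-o` flag passes a JSON string of overrides that AllenNLP merges into the jsonnet configuration. For example, to train on CPU instead of GPU 0, set `cuda_device` to `-1`:

```bash
allennlp train configurations/$DATASET/$EXP.jsonnet \
    -s output/$DATASET/$EXP/ -f \
    --include-package keyphrase_generation \
    -o '{ "trainer": {"cuda_device": -1} }'
```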
Notes:
- If your loss collapses to NaN during training, this could be due to numerical underflow. To fix it, edit the function `masked_log_softmax()` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/nn/utils.py` and change the line `vector = vector + (mask + 1e-45).log()` to `vector = vector + (mask + 1e-35).log()`.
- Similarly, replace all instances of `1e-45` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py` with `1e-35`.
- During validation after every epoch, if a type mismatch error is thrown (`RuntimeError: "argmax_cuda" not implemented for 'Bool'`), it can be fixed by an explicit type cast: in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py`, change the line `matches = (expanded_source_token_ids == expanded_target_token_ids)` to `matches = (expanded_source_token_ids == expanded_target_token_ids).int()`. These patches are summarized below.
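For reference, the edits above amount to the following one-line changes (line positions vary with the installed allennlp version):

```python
# allennlp/nn/utils.py, inside masked_log_softmax():
# before: vector = vector + (mask + 1e-45).log()
vector = vector + (mask + 1e-35).log()  # larger epsilon avoids underflow

# allennlp/models/encoder_decoders/copynet_seq2seq.py:
# before: matches = (expanded_source_token_ids == expanded_target_token_ids)
matches = (expanded_source_token_ids == expanded_target_token_ids).int()  # CUDA argmax is not implemented for Bool tensors
```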
Finally, the evaluation script can be run as follows:

- Go to `run_eval.sh` and set the `HOME_PATH` variable. This corresponds to the `absolute/path/to/keyphrase-generation/` folder.
- Set the datasets. For instance, if we set both `EVALSET` and `DATASET` to `kp20k`, then we use the best model trained on `kp20k` to evaluate on `kp20k`. Setting them to different values is useful when you would like to evaluate a model trained on dataset A on dataset B. A sketch follows this list.
- Next, `bash run_eval.sh` will print the quality and diversity results and also save them to `output/$DATASET/$EXP`.
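For example, a cross-dataset evaluation might be configured like this (a sketch of the variables named above with hypothetical values; check `run_eval.sh` for how they are actually defined):

```bash
# Inside run_eval.sh (hypothetical values; adjust to your setup):
HOME_PATH=/absolute/path/to/keyphrase-generation
DATASET=kptimes   # dataset the model was trained on
EVALSET=kp20k     # dataset to evaluate on
```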
Note: In the paper, we present EditDist as a diversity evaluation metric, for which we initially used a different fuzzy string matcher. However, this codebase uses an alternative library, rapidfuzz, which offers similar functionality.
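For illustration, rapidfuzz exposes fuzzy string matching like this (a minimal sketch of the library's `fuzz.ratio` scorer, not the exact EditDist computation in the evaluation script):

```python
from rapidfuzz import fuzz

# Fuzzy similarity in [0, 100] between two generated keyphrases;
# 100 - ratio serves as a normalized edit-distance-style diversity signal.
print(fuzz.ratio("keyphrase generation", "key phrase generation"))
```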
If you found this code useful in your research, please cite:
```bibtex
@inproceedings{divKeyGen2020,
  title={Diverse Keyphrase Generation with Neural Unlikelihood Training},
  author={Bahuleyan, Hareesh and El Asri, Layla},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},
  year={2020}
}
```