Open
@neurlang
Description
abbreviations.tsv is currently not implemented. Create or borrow an open-source dataset (covering various languages) that ideally looks like this:
GPS tab Global Positioning System tab ["technology"]
USA tab United States of America tab ["geography"]
The full expansions are needed so that the dataset admin knows what each abbreviation stands for. Without them, the dataset admin will have a hard time deleting or correcting abbreviations. Tags are optional.
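A minimal sketch of reading one row of the proposed format. It assumes the columns are separated by real tab characters and that the optional tags column is a JSON array, as the two example rows suggest. The names `AbbreviationRow` and `parse_row` are hypothetical and not part of any existing code.

```python
import json
from typing import List, NamedTuple


class AbbreviationRow(NamedTuple):
    abbreviation: str
    expansion: str
    tags: List[str]


def parse_row(line: str) -> AbbreviationRow:
    """Parse one tab-separated row: abbreviation, expansion, optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 columns, got {len(fields)}")
    abbreviation, expansion = fields[0], fields[1]
    # Assumption: tags are a JSON array; a missing or empty column means no tags.
    has_tags = len(fields) > 2 and fields[2].strip() != ""
    tags = json.loads(fields[2]) if has_tags else []
    return AbbreviationRow(abbreviation, expansion, tags)
```

For example, `parse_row('GPS\tGlobal Positioning System\t["technology"]')` yields a row whose `tags` field is `["technology"]`, while a row without the third column yields an empty tag list.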
Training phase:
- Only the first column will be used.
- Build a bigram/trigram model that compares abbreviations with the language's ordinary words.
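One possible reading of the training step, sketched under several assumptions: the n-grams are character-level, case is preserved, words are padded with `^` and `$` boundary markers, and the two classes are compared with an add-one-smoothed log-odds score. None of these choices is specified in this issue, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def char_ngrams(word: str, n: int) -> List[str]:
    """Return character n-grams of a word padded with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_counts(words: Iterable[str], sizes: Sequence[int] = (2, 3)) -> Counter:
    """Count bigrams and trigrams over one class of words."""
    counts: Counter = Counter()
    for word in words:
        for n in sizes:
            counts.update(char_ngrams(word, n))
    return counts


def log_odds(word: str, abbr_counts: Counter, word_counts: Counter,
             sizes: Sequence[int] = (2, 3)) -> float:
    """Positive score: n-grams look more like abbreviations than ordinary words."""
    abbr_total = sum(abbr_counts.values())
    word_total = sum(word_counts.values())
    vocab = len(set(abbr_counts) | set(word_counts))
    score = 0.0
    for n in sizes:
        for gram in char_ngrams(word, n):
            # Add-one smoothing (an assumption) keeps unseen n-grams finite.
            p_abbr = (abbr_counts[gram] + 1) / (abbr_total + vocab)
            p_word = (word_counts[gram] + 1) / (word_total + vocab)
            score += math.log(p_abbr / p_word)
    return score
```

Here `abbr_counts` would be trained on the first column of abbreviations.tsv and `word_counts` on a dictionary of ordinary words. The decision threshold (zero in this sketch) would need tuning on real data.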
Inference phase:
For every non-dictionary word:
- Check whether the word is short and has at least 2 uppercase letters. If not, it is a regular word.
- Check it using the bigram/trigram model.
- Spell it out if the model thinks it is an abbreviation.
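The inference steps above could be sketched roughly as follows. The length limit of 5, the exact-match dictionary lookup, and spelling out as space-separated letters are all assumptions made for illustration. The n-gram check is passed in as a callable so the sketch does not depend on any particular model, and every name here is hypothetical.

```python
from typing import Callable, Set

MAX_ABBREVIATION_LENGTH = 5  # assumption: the issue only says "short"


def is_candidate(word: str, max_len: int = MAX_ABBREVIATION_LENGTH) -> bool:
    """Cheap pre-filter: short and at least 2 uppercase letters."""
    uppercase = sum(1 for ch in word if ch.isupper())
    return len(word) <= max_len and uppercase >= 2


def process_word(word: str, dictionary: Set[str],
                 model_says_abbreviation: Callable[[str], bool]) -> str:
    """Return the word unchanged, or spelled out letter by letter."""
    if word in dictionary:
        return word  # dictionary words are never touched
    if not is_candidate(word):
        return word  # fails the length/uppercase check, so it is a word
    if not model_says_abbreviation(word):
        return word  # the bigram/trigram check rejected it
    return " ".join(word)  # e.g. "GPS" becomes "G P S"
```

Running the cheap length and uppercase filter before the model keeps the n-gram lookup off the common path, since most non-dictionary words fail the first check.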