Normalizer tool for user-generated content (Brazilian Portuguese)

avanco/UGCNormal

UGCNormal

This is a normalizer tool for user-generated content (UGC) in Brazilian Portuguese. To use it as a service, see ugcnormal_interface. A dockerized service exposing the UGCNormal features is also available: ugcnormal-microservice.


                                UGC-Normalizer
INPUT
  |
  |    -----------------------------      -------------      -----------
  ---> | SentenceBoundaryDetection | ---> | tokenizer | ---> | speller | ----
       -----------------------------      -------------      -----------     |
                                                                             |
                                                                             |
  ----------------------------------------------------------------------------
  |
  |    --------------      ------------------      ----------
  ---> | siglas_map | ---> | internetes_map | ---> | np_map | ---> OUTPUT
       --------------      ------------------      ----------
>>> HOW TO USE:
Before anything else, run the ./configure.sh script to check and resolve all
dependencies. After that you can run the normalizer.
The main script is ugc_norm.sh, which applies the normalization pipeline. Run
it with two parameters, INPUT_dir and OUTPUT_dir. INPUT_dir must contain all
the text files to be processed.
You can test the normalizer using the data in directory "test":
./ugc_norm.sh ./test/input/ ./test/output/
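If you prefer to drive the normalizer from Python, the call above can be wrapped roughly as in the sketch below. It assumes the code runs from the repository root after ./configure.sh has completed. The helper names build_command and run_normalizer are hypothetical and are not part of UGCNormal.

```python
import os
import subprocess


def build_command(input_dir, output_dir, script="./ugc_norm.sh"):
    """Return the argument list for one ugc_norm.sh run (hypothetical helper)."""
    # os.path.join(path, "") appends a trailing separator if it is missing,
    # mirroring the README example. Whether the script actually requires the
    # trailing slash is an assumption.
    return [script, os.path.join(input_dir, ""), os.path.join(output_dir, "")]


def run_normalizer(input_dir, output_dir):
    """Run the pipeline; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(input_dir, output_dir), check=True)


# Example (run from the repository root):
# run_normalizer("./test/input", "./test/output")
```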
>>> MORE INFO:
******************************* test
Input and output directories for testing the normalizer. The output directory
tree holds the output produced by each step of the pipeline (sent -> tok ->
checked -> siglas -> internetes -> nomes). The deepest directory ('nomes') holds
the result of the full pipeline (probably the only result you are interested in).
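To pick up the final result programmatically, one option is to look for the most deeply nested directory of the output tree. The sketch below assumes that each step writes into a subdirectory of the previous one, which is how the description above reads but is not verified here. The function name deepest_directory is hypothetical.

```python
import os


def deepest_directory(root):
    """Return the most deeply nested directory under root (root if none)."""
    best_path, best_depth = root, 0
    for current, _dirs, _files in os.walk(root):
        if current == root:
            continue
        # Depth 1 for direct children of root, 2 for grandchildren, and so on.
        depth = os.path.relpath(current, root).count(os.sep) + 1
        if depth > best_depth:
            best_path, best_depth = current, depth
    return best_path
```

Calling deepest_directory("./test/output") after a test run should then point at the directory holding the fully normalized files, provided the nesting assumption holds.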
******************************* internetes_map.pl
Perl script that translates web language using a dictionary
******************************* np_map.pl
Perl script that normalizes NPs using ./resources/np_data.txt. It just
capitalizes the first letter
******************************* siglas_map.pl
Script that puts all letters of a word in upper case if the word is listed in
./resources/lexico_siglas.txt (siglas = acronyms)
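The three map steps are all dictionary lookups, so their combined effect can be illustrated with a short Python sketch. This is not the Perl code from the repository. The sample entries are made up for illustration and are not taken from the files in ./resources, and the case-insensitive lookup is an assumption.

```python
def siglas_map(tokens, siglas):
    """Upper-case every token found in the acronym lexicon."""
    return [t.upper() if t.lower() in siglas else t for t in tokens]


def internetes_map(tokens, dictionary):
    """Replace web-language tokens with their dictionary translation."""
    return [dictionary.get(t.lower(), t) for t in tokens]


def np_map(tokens, names):
    """Capitalize the first letter of every token found in the NP list."""
    return [t[:1].upper() + t[1:] if t.lower() in names else t for t in tokens]


# Illustrative data only (hypothetical entries).
SIGLAS = {"usp", "cpf"}
INTERNETES = {"vc": "você", "tb": "também"}
NOMES = {"brasil", "maria"}

tokens = "vc tb mora no brasil e estuda na usp".split()
# Same order as the pipeline: siglas_map -> internetes_map -> np_map.
result = np_map(internetes_map(siglas_map(tokens, SIGLAS), INTERNETES), NOMES)
print(" ".join(result))  # você também mora no Brasil e estuda na USP
```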
******************************* upper_handler.py
Checks whether a text file is entirely in uppercase. If it is, only words after
punctuation are capitalized; all the others are set to lowercase
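The behaviour described for upper_handler.py can be sketched as follows. This is an independent illustration and not the script's actual code. It assumes that "punctuation" means ., ! or ? and that the first word of the text is capitalized as well. The function name handle_uppercase is hypothetical.

```python
import re


def handle_uppercase(text):
    """Re-case a text only if all of its cased characters are uppercase."""
    if not text.isupper():
        return text  # mixed-case or caseless text is left untouched
    lowered = text.lower()
    # Capitalize the first character of the text and the first character
    # after ., ! or ? followed by whitespace.
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )


print(handle_uppercase("ADOREI O PRODUTO. CHEGOU RÁPIDO!"))
# Adorei o produto. Chegou rápido!
```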
******************************* upper_periods.py
Capitalizes words that follow periods
******************************* README.txt
This file
******************************* resources
Directory with dictionaries for NPs and web language
******************************* SentenceBoundaryDetection
Sentence boundary detection tool. It appends an <S> tag at the end of each sentence
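To give an idea of what this step produces, here is a deliberately naive Python sketch that appends an <S> tag after sentence-final punctuation. The real tool is likely more careful (for example with abbreviations), and the exact spacing around the tag is an assumption. The function name tag_sentences is hypothetical.

```python
import re


def tag_sentences(text):
    """Append ' <S>' after each run of ., ! or ? that ends a sentence."""
    # A sentence is taken to end at punctuation followed by whitespace
    # or by the end of the text.
    return re.sub(r"([.!?]+)(\s+|$)", r"\1 <S>\2", text)


print(tag_sentences("Chegou rápido. Gostei!"))
# Chegou rápido. <S> Gostei! <S>
```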
******************************* speller
Speller tool directory
******************************* tokenizer
Tokenizer tool directory. You can change the lex rules in webtok.lex and
rebuild the tokenizer by running make
******************************* utils
- ./utils/extract.sh
This script extracts all opinions (text files) from a corpus spread over many
subdirectories
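The same idea, gathering text files from a nested corpus into one flat directory, can be sketched in Python. This is not a port of extract.sh. The .txt pattern, the renaming scheme and the function name extract_text_files are assumptions made for this illustration.

```python
import shutil
from pathlib import Path


def extract_text_files(corpus_dir, dest_dir, pattern="*.txt"):
    """Copy every file matching pattern under corpus_dir into dest_dir."""
    corpus = Path(corpus_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(corpus.rglob(pattern)):
        if not src.is_file():
            continue
        # Join the relative path parts with "_" so that files with the same
        # name in different subdirectories do not overwrite each other.
        target = dest / "_".join(src.relative_to(corpus).parts)
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

The destination should live outside the corpus directory, otherwise a second run would pick up the files copied by the first one. The resulting flat directory can then serve as INPUT_dir for ugc_norm.sh.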

References

Duran, M. S.; Avanço, L. V.; Nunes, M. G. V. (2015). A Normalizer for UGC in Brazilian Portuguese. In: ACL 2015, Workshop on Noisy User-generated Text - WNUT, 2015, Beijing, China, p. 38-47. http://aclanthology.info/papers/W15-4305/a-normalizer-for-ugc-in-brazilian-portuguese
