
Let's say I have thousands of PDFs, each about 30k words of conversational English. Each PDF contains the name(s) of one or more people who snowboard, along with many other names. I need to extract the snowboarders' names from any future PDFs. What tools or methods could I use to approach this problem?

I started learning about Natural Language Processing and Machine Learning a couple of weeks ago. I have been using Python's NLTK to filter my data, and scikit-learn for classification and multilabel classification on other questions I want to answer about the same data set, but this snowboarder example is not a classification problem. I know I could use a purely NLP solution, but I want to try having an ML model recognize the patterns in the text, because all the documents are formatted similarly (and I have a lot of documents to train with and am willing to label them manually).

I had some success training a word2vec model on each individual document. I then checked the similarity (model.wv.similarity(HUMAN_NAME, 'snowboard')) between each name in a list of human names and the word 'snowboard', and took the most similar name as my answer. I know there has to be a more elegant solution. Sequence-to-sequence models and topic modeling might be my next steps. Can someone point me in the right direction if they have a better idea?
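For readers following along, the ranking step described above might look roughly like the sketch below. It is not the asker's actual code: `rank_candidates`, `toy_scores`, and `toy_similarity` are hypothetical names, and the lookup table stands in for a trained model. With gensim you would pass `model.wv.similarity` as the `similarity` argument; the sketch assumes that function raises `KeyError` for words missing from the vocabulary.

```python
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    names: Iterable[str],
    keyword: str,
    similarity: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar to keyword."""
    scored = []
    for name in names:
        try:
            scored.append((name, float(similarity(name, keyword))))
        except KeyError:
            # Assumption: out-of-vocabulary names raise KeyError and are skipped.
            continue
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for a trained word2vec model: made-up similarity scores.
toy_scores = {
    ("alice", "snowboard"): 0.12,
    ("bob", "snowboard"): 0.81,
    ("carol", "snowboard"): 0.33,
}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores[(a, b)]  # raises KeyError for unknown names


ranking = rank_candidates(
    ["alice", "bob", "carol", "dave"], "snowboard", toy_similarity
)
print(ranking[0][0])  # prints "bob"
```

Returning the full ranking rather than only the top name makes it easier to handle documents with more than one snowboarder, for example by keeping every name above a score threshold.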

asked Feb 1, 2018 at 18:05
  • So it dawned on me today (idk why I didn't think of this earlier) that I could just combine each input document with each human name I extract from it, and output whether or not that person snowboards. E.g., Document A in my training set has human names "a", "b", and "c", where "b" is the snowboarder. I can create the inputs Document A + "a", Document A + "b", and Document A + "c", which would output "no", "yes", "no" respectively. I will post an answer to this question when I have a full solution. Commented Feb 5, 2018 at 15:42
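
The idea in the comment above turns the extraction task into binary classification over (document, name) pairs. A minimal sketch of building those training pairs follows. It makes two assumptions that are not in the original: each name is a single token, and only the words near each mention are kept instead of the whole document. The names `context_windows` and `build_pairs` and the sample data are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def context_windows(text: str, name: str, width: int = 5) -> str:
    """Join the words within `width` tokens of each mention of `name`."""
    tokens = re.findall(r"\w+", text.lower())
    target = name.lower()
    pieces: List[str] = []
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - width)
            # Words before and after the mention, excluding the name itself.
            pieces.extend(tokens[start:i] + tokens[i + 1:i + 1 + width])
    return " ".join(pieces)


def build_pairs(
    docs: Dict[str, str],
    names_by_doc: Dict[str, List[str]],
    snowboarders_by_doc: Dict[str, Set[str]],
) -> List[Tuple[str, str, str, int]]:
    """Create one (doc_id, name, context, label) row per candidate name."""
    rows = []
    for doc_id, text in docs.items():
        for name in names_by_doc[doc_id]:
            label = 1 if name in snowboarders_by_doc[doc_id] else 0
            rows.append((doc_id, name, context_windows(text, name), label))
    return rows


# Made-up sample data for illustration only.
docs = {
    "A": "Anna said she stayed home. Ben went snowboarding all weekend. "
         "Cara cooked dinner."
}
names_by_doc = {"A": ["Anna", "Ben", "Cara"]}
snowboarders_by_doc = {"A": {"Ben"}}

pairs = build_pairs(docs, names_by_doc, snowboarders_by_doc)
labels = [row[3] for row in pairs]
print(labels)  # prints [0, 1, 0]
```

The context strings could then be vectorized (for example with a bag-of-words or TF-IDF representation) and passed to any binary classifier, such as the scikit-learn models mentioned in the question. At prediction time, every candidate name in a new document gets its own row and its own yes/no output.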
