I'm not sure if this is a really dumb question, but here goes.

import spacy

text_corpus = '''Insurance bosses plead guilty\n\nAnother three US insurance executives have pleaded guilty to fraud charges stemming from an ongoing investigation into industry malpractice.\n\nTwo executives from American International Group (AIG) and one from Marsh & McLennan were the latest. The investigation by New York attorney general Eliot Spitzer has now obtained nine guilty pleas. The highest ranking executive pleading guilty on Tuesday was former Marsh senior vice president Joshua Bewlay.\n\nHe admitted one felony count of scheming to defraud and faces up to four years in prison. A Marsh spokeswoman said Mr Bewlay was no longer with the company. Mr Spitzer\'s investigation of the US insurance industry looked at whether companies rigged bids and fixed prices. Last month Marsh agreed to pay $850m ($415m) to settle a lawsuit filed by Mr Spitzer, but under the settlement it "neither admits nor denies the allegations".\n'''

def get_entities(document_text, model):
    # Run the pipeline and keep only people, organisations and places.
    analyzed_doc = model(document_text)
    # English pipelines such as en_core_web_sm label people "PERSON", not "PER".
    entities = [entity for entity in analyzed_doc.ents if entity.label_ in ["PERSON", "ORG", "LOC", "GPE"]]
    return entities

model = spacy.load("en_core_web_sm")
entities_1 = get_entities(text_corpus, model)
entities_2 = get_entities(text_corpus, model)

But when I run the following,

entities_1[0] in entities_2

The output is False.

Why is that? The objects in both the entity lists are the same. Yet an item from one list is not in the other one. That's extremely odd. Can someone please explain why that is so to me?

1 Answer

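This is due to the way entities are represented in spaCy. Each entity is a Span object rather than a plain string, so even entities_2[0] == entities_1[0] will evaluate to False. At first glance the Span class does not appear to define __eq__, but the comparison is actually implemented in Cython (see the edit below), and it never treats spans that come from two different Doc objects as equal. Here each call to get_entities builds a new Doc, so the two lists hold spans from different documents.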

If you print entities_2[0] you will see US, but that is only because the Span class implements a __repr__ method that returns the span's text. If you want a boolean comparison, one way is to use the text attribute of Span and do something like:

entities_1[0].text in [e.text for e in entities_2]
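Comparing on text alone treats every mention of the same string as identical. If that is too loose, a slightly stricter variant is to compare a small tuple of attributes instead. The sketch below is only an illustration: entity_key and contains_entity are hypothetical helper names, and StubEntity is a stand-in class (not part of spaCy) so the snippet runs without downloading a model. The attributes it reads (text, label_, start_char, end_char) are ones a spaCy Span also exposes, and the offsets are made up for the example.

```python
def entity_key(entity):
    """Return a document-independent key for a spaCy-style entity."""
    # start_char/end_char are character offsets into the source text,
    # label_ is the string form of the entity label.
    return (entity.start_char, entity.end_char, entity.label_, entity.text)


def contains_entity(entity, entities):
    """Return True if any item in `entities` has the same key as `entity`."""
    return entity_key(entity) in {entity_key(e) for e in entities}


class StubEntity:
    """Hypothetical stand-in exposing the attributes used above (not a spaCy class)."""

    def __init__(self, text, label_, start_char, end_char):
        self.text = text
        self.label_ = label_
        self.start_char = start_char
        self.end_char = end_char


# Two separate "parses" of the same text; the offsets are illustrative only.
first_parse = [StubEntity("US", "GPE", 45, 47)]
second_parse = [StubEntity("US", "GPE", 45, 47)]

# The stub defines no __eq__, so `in` falls back to object identity.
print(first_parse[0] in second_parse)
# The key-based check looks only at the attribute values.
print(contains_entity(first_parse[0], second_parse))
```

With real spaCy output the call would be contains_entity(entities_1[0], entities_2). Including the offsets in the key distinguishes two mentions of US at different positions; drop them if any mention of the same string should count as a match.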

edit:

As @aab pointed out, Span does implement __richcmp__. However, the equality check can only succeed for spans taken from the same Doc, since it compares where each span sits within that document.
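To make that rule concrete, here is a toy pure-Python model of it, based on the description in the comments on this answer (same doc, same start_char, end_char, label and kb_id). ModelSpan is a hypothetical class written for this illustration. It is not spaCy's Span and not a copy of its source, and plain object() instances stand in for Doc objects.

```python
class ModelSpan:
    """Toy model of the equality rule discussed here; NOT spaCy's Span."""

    def __init__(self, doc, start_char, end_char, label, kb_id=0):
        self.doc = doc
        self.start_char = start_char
        self.end_char = end_char
        self.label = label
        self.kb_id = kb_id

    def __eq__(self, other):
        if not isinstance(other, ModelSpan):
            return NotImplemented
        # Spans are equal only if they belong to the very same document
        # object and agree on position, label and knowledge-base id.
        return (
            self.doc is other.doc
            and self.start_char == other.start_char
            and self.end_char == other.end_char
            and self.label == other.label
            and self.kb_id == other.kb_id
        )


doc_a = object()  # stands in for the Doc from the first model(...) call
doc_b = object()  # stands in for the Doc from the second model(...) call

span_1 = ModelSpan(doc_a, 45, 47, "GPE")
span_2 = ModelSpan(doc_a, 45, 47, "GPE")
span_3 = ModelSpan(doc_b, 45, 47, "GPE")

print(span_1 == span_2)    # same document, same position and label
print(span_1 == span_3)    # identical values, but a different document
print(span_1 in [span_3])  # list membership relies on the same comparison
```

Under this model the last two checks are False even though every value matches, which mirrors the situation in the question: two calls to the model produce two documents.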

answered Apr 23, 2020 at 20:50

5 Comments

Thank you so much! This makes sense to me now.
@AvijeetKartikay, if this solution worked for you, then please upvote and accept the answer.
(Span implements __eq__ in its cython __richcmp__ method, but spans from different documents are never equal regardless of the contents. If they're from the same document, they need to have the same start/end/label.)
Thanks @aab, somehow I completely missed __richcmp__ while I was looking at it. Nice catch!
I looked at the source code of the __richcmp__ method and it seems that more than the three attributes mentioned above need to match: doc, start_char, end_char, label and kb_id must all be equal. Source: github.com/explosion/spaCy/blob/…
