Build Embeddings from Top 10 Terms Only? · JohnSnowLabs/spark-nlp · Discussion #13717

rwitmer
Mar 28, 2023

I have a two-fold question, conceptual and technical.

Conceptual: In Spark NLP, the way to build BERT embeddings is to create a Pipeline with a DocumentAssembler, SentenceDetector, Tokenizer, BertEmbeddings.pretrained and EmbeddingsFinisher. Building embeddings for every word in a document takes a lot of time and resources. If I wanted to build embeddings for only SOME of the words in the document, for example the most salient words based on TF-IDF, would that be sensible? I think the answer to this question will take one of two forms:

No, that is not sensible. You must build embeddings for each word in the sentence to build an embedding for the particular word in question. Without first building the embeddings for each previous word in its context, BERT can't build the embedding for the word in question. Each embedding influences the others, so it's not sensible to build only a few in isolation.
OR
Yes, that is sensible. You can just use the pretrained embedding already available in the BERT model and build your particular word's embedding based on those standard embeddings for that word and the words surrounding it in its context. You don't have to refine the pretrained embedding for every word. You can still get a reasonable amount of context information from those pretrained embeddings.
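
To make the difference between the two answers concrete, here is a toy sketch in plain Python. It is not BERT and not Spark NLP code: the `STATIC` table and the `contextualize` function are hypothetical stand-ins, assuming a fixed per-word lookup vector and a single simplified self-attention step. It only illustrates that a context-dependent vector for one word needs the other words of the sentence as input, while the lookup vector does not.

```python
import math

# Hypothetical lookup table standing in for pretrained, context-free word vectors.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(tokens):
    """One toy self-attention step: every output vector mixes all input vectors."""
    vectors = [STATIC[t] for t in tokens]
    outputs = []
    for query in vectors:
        # Dot-product similarity between this token and every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        # Weighted average of all token vectors, one dimension at a time.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


# The same word gets a different vector depending on its neighbour.
bank_near_river = contextualize(["river", "bank"])[1]
bank_near_money = contextualize(["money", "bank"])[1]
print(STATIC["bank"], bank_near_river, bank_near_money)
```

In this toy, the vector for "bank" cannot be produced without also feeding in "river" or "money", which is the situation the first answer describes. The second answer corresponds to reading `STATIC["bank"]` directly, which is cheap but identical in every sentence.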

Technical: Given how the pipeline is set up, we don't see the intermediate steps (for example, the pipeline doesn't output a list of the tokens found by the tokenizer before moving on to the next stage). If I wanted to build embeddings for only SOME of the words in a document, how would I go about doing that? Maybe someone has already done this and an example exists somewhere.
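
One possible shape for this, sketched in plain Python rather than Spark NLP: compute the embeddings for every token as usual, then keep only the vectors whose token is among the top TF-IDF terms of the document. The helper names `top_k_terms` and `keep_salient` are hypothetical, the TF-IDF formula is the basic `tf * log(N / df)` variant, and the sketch assumes the (token, vector) pairs have already been pulled out of the pipeline output. Note that this reduces what has to be stored or processed afterwards, not the cost of computing the embeddings.

```python
import math
from collections import Counter


def top_k_terms(docs, doc_index, k):
    """Return the k highest-scoring TF-IDF terms of docs[doc_index]."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    target = docs[doc_index]
    counts = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return set(ranked[:k])


def keep_salient(token_vectors, salient):
    """Keep only the (token, vector) pairs whose token is a salient term."""
    return [(tok, vec) for tok, vec in token_vectors if tok in salient]


# Tiny tokenized corpus; the vectors below are made-up placeholders.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
token_vectors = [("the", [0.1, 0.2]), ("cat", [0.3, 0.4]), ("sat", [0.5, 0.6])]

salient = top_k_terms(docs, doc_index=0, k=2)
filtered = keep_salient(token_vectors, salient)
print(filtered)
```

For the question in the title, `k` would be 10, and `token_vectors` would come from whatever per-token output the embeddings stage produces.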

Thanks for your input!
