Build Embeddings from Top 10 Terms Only? #13717

Unanswered
rwitmer asked this question in Q&A

I have a two-fold question, conceptual and technical.

Conceptual: In Spark NLP, the usual way to build BERT embeddings is a Pipeline made up of DocumentAssembler, SentenceDetector, Tokenizer, BertEmbeddings.pretrained, and EmbeddingsFinisher. Building embeddings for every word in a document takes a lot of time and resources. If I wanted to build embeddings for only SOME of the words in the document, for example the most salient words by TF-IDF, would that be sensible? I think the answer will take one of two forms:

  1. No, that is not sensible. You must build embeddings for each word in the sentence to build an embedding for the particular word in question. Without first building the embeddings for each previous word in its context BERT can't build the embedding for the word in question. Each embedding influences the others. It's not sensible to build only a few in isolation.
    OR
  2. Yes, that is sensible. You can just use the pretrained embedding already available in the BERT model and build your particular word's embedding based on those standard embeddings for that word and the words surrounding it in its context. You don't have to refine the pretrained embedding for every word. You can still get a reasonable amount of context information from those pretrained embeddings.

Technical: Given how the pipeline is set up, we don't see the intermediate steps (for example, the pipeline doesn't output the list of tokens found by the tokenizer before moving on to the next stage). If I wanted to build embeddings for only SOME of the words in a document, how would I go about doing that? Maybe someone has already done this and an example exists somewhere.

Thanks for your input!
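
For illustration, here is a minimal sketch of one way the "top terms only" idea could be approached: let the model produce a vector for every token, then keep only the vectors whose tokens are among each document's top-k TF-IDF terms. This is plain Python, not Spark NLP API. The helper names `top_tfidf_terms` and `filter_embeddings` are hypothetical, the two-dimensional vectors are made-up placeholders, and the TF-IDF variant (term frequency times log(N / df)) is just one common choice. Since BERT normally computes the contextual vectors for a whole sentence in a single forward pass, filtering this way would mainly reduce storage and downstream work, not the cost of the forward pass itself.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, k=10):
    """For each tokenized document, return the set of its k highest TF-IDF terms.

    docs is a list of token lists, assumed to be already lowercased.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by score (highest first); break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(set(ranked[:k]))
    return result


def filter_embeddings(token_vectors, keep_terms):
    """Keep only the (token, vector) pairs whose lowercased token is in keep_terms."""
    return [(tok, vec) for tok, vec in token_vectors if tok.lower() in keep_terms]


# Toy corpus of three tokenized documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]
top_terms = top_tfidf_terms(docs, k=2)

# Placeholder per-token vectors for the first document (not real BERT output).
token_vectors = [
    ("the", [0.1, 0.2]),
    ("cat", [0.3, 0.4]),
    ("sat", [0.5, 0.6]),
    ("on", [0.7, 0.8]),
    ("the", [0.9, 1.0]),
    ("mat", [1.1, 1.2]),
]
kept = filter_embeddings(token_vectors, top_terms[0])
```

In a real pipeline the (token, vector) pairs would have to be read from the embeddings output rather than written by hand; how best to do that in Spark NLP is exactly the open question above.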

Replies: 0 comments
