11751 Week3 Digest


This digest contains two components: the concepts I failed to make sense of in class and important sections.

Out-of-Vocabulary (OOV)

  • Definition: In plain language, a word is out-of-vocabulary (OOV) when it appears in the test data but does not appear in the training data. More technically speaking, OOV terms are terms that are not part of the normal lexicon of a natural language processing system.
  • How to handle OOV:
    • <unk> token
    • Spell check: only works for misspelled words; it cannot handle genuinely new words.
    • Subwords: use Facebook's fastText library, or sklearn.feature_extraction.text with analyzer set to char or char_wb.
    • BPE (byte pair encoding): the more recommended option.
      How does it work?

      BPE ensures that the most common words are represented in the vocabulary as a single token, while rare words are broken down into two or more subword tokens. This is exactly what a subword-based tokenization algorithm is meant to do.

      Another thing about BPE is that its granularity sits somewhere between words (too many, $|\mathcal{V}|$ can be 100k) and characters (too few, only 26 letters in English). BPE's vocabulary size is a good middle point: you choose the target vocabulary size, and the algorithm builds a subword lexicon of that size.
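To make the merge idea concrete, here is a minimal sketch of BPE in plain Python. It is a toy version under stated assumptions, not the implementation used by any particular library: the corpus, the `</w>` end-of-word marker, and the function names `merge_pair`, `learn_bpe`, and `segment` are all made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dict (toy sketch)."""
    # Start from characters, plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))  # "lowest" never occurs in the corpus
```

The number of merges plays the role of the vocabulary-size knob: more merges give longer subwords and a larger lexicon. An unseen word never becomes `<unk>` here, because in the worst case it falls back to single characters.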

Alignment

  • Soft Alignment: The assignment of frames to phonemes is a probability distribution, so a frame can belong to several phonemes with different weights. Attention-based ASR is based on soft alignment.

    Soft Alignment Example

  • Hard Alignment: No probability distributions. Each frame belongs to exactly one phoneme.

    Hard Alignment Example
    We can use a trellis to align phonemes and frames. Below is an example where $N=3$ and $T=5$.

    Trellis Example
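The two kinds of alignment can be sketched in a few lines of Python. This is only an illustration under assumptions: I take $N=3$ to be the number of phonemes and $T=5$ the number of frames, the scores in `probs` are invented, and `soft_align` and `viterbi_align` are hypothetical helper names. The hard alignment is found with a Viterbi search over the trellis, where each step either stays on the same phoneme or moves to the next one.

```python
import math


def soft_align(log_probs):
    """Soft alignment: per frame, a probability distribution over phonemes (softmax)."""
    rows = []
    for row in log_probs:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def viterbi_align(log_probs):
    """Hard alignment: best monotonic path through a T x N trellis (assumes T >= N)."""
    T, N = len(log_probs), len(log_probs[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][0]  # the path must start at the first phoneme
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                score[t][n], back[t][n] = move + log_probs[t][n], n - 1
            else:
                score[t][n], back[t][n] = stay + log_probs[t][n], n
    # Backtrack from the last phoneme at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[T - 1][N - 1]


# Invented per-frame phoneme probabilities: 5 frames (rows) x 3 phonemes (columns).
probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in row] for row in probs]

soft = soft_align(log_probs)            # each row sums to 1
path, best = viterbi_align(log_probs)
print(path)  # [0, 0, 1, 2, 2]
```

In the soft case every frame keeps its whole row of weights; in the hard case the path assigns frames 1-2 to the first phoneme, frame 3 to the second, and frames 4-5 to the third.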

Acoustic Model

Unlike the newer attention-based end-to-end ASR, traditional ASR is HMM-based. Understanding it helps with the basics of ASR. A traditional HMM-based ASR system is composed of four components, as shown below.

Acoustic Model in HMM-based ASR Pipeline

We have already talked about the first component, feature extraction, and will now try to factorize the acoustic model. The features and the phonemes in the lexicon are represented by $O$ and $L$ respectively.
$$O=(O_t\in \mathbb{R}^D \mid t=1,\cdots,T)$$

$$L=(l_j\in\{/AA/,/AE/,\cdots\} \mid j=1,\cdots,J)$$
Assume that the alignment information is given; then the acoustic model can be written as

$$\begin{split}
p(O|L)&=p(O_{1:T_1},O_{T_1+1:T_2},\cdots|l_1,l_2,\cdots)\\
& = p(O_{1:T_1}|O_{T_1+1:T_2},\cdots,l_1,l_2,\cdots)\,p(O_{T_1+1:T_2},\cdots|l_1, l_2,\cdots)\\
& = p(O_{1:T_1}|l_1)\,p(O_{T_1+1:T_2},\cdots|l_1, l_2,\cdots)\\
& \vdots\\
& = p(O_{1:T_1}|l_1)\,p(O_{T_1+1:T_2}|l_2)\cdots\\
& = \prod_{j=1}^{J}p(O_{T_{j-1}+1:T_j}|l_j), \quad T_0=0
\end{split}$$

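Two rules have been applied: the chain rule of probability (second line) and the conditional independence assumption that each segment of features depends only on its own phoneme (third line).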
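As a rough numerical illustration of the factorization, the sketch below computes $\log p(O|L)$ as a sum of per-segment log-likelihoods given the alignment boundaries $T_1,\cdots,T_J$. Everything specific here is an assumption made for the example: the features are one-dimensional ($D=1$), each phoneme is modeled by a single Gaussian with i.i.d. frames, and the names `acoustic_log_likelihood` and `models` are hypothetical. A real HMM-based acoustic model is considerably richer.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def segment_log_likelihood(frames, mean, var):
    """log p(segment | phoneme), assuming i.i.d. frames under one Gaussian."""
    return sum(gaussian_log_pdf(x, mean, var) for x in frames)


def acoustic_log_likelihood(observations, phonemes, boundaries, models):
    """Sum over j of log p(O_{T_{j-1}+1 : T_j} | l_j), with T_0 = 0.

    `boundaries` holds T_1, ..., T_J (segment end indices, exclusive in Python slicing).
    """
    if len(phonemes) != len(boundaries) or boundaries[-1] != len(observations):
        raise ValueError("alignment does not cover the observation sequence")
    total, start = 0.0, 0
    for phoneme, end in zip(phonemes, boundaries):
        mean, var = models[phoneme]
        total += segment_log_likelihood(observations[start:end], mean, var)
        start = end
    return total


# Invented example: 3 frames, 2 phonemes, alignment T_1 = 2, T_2 = 3.
models = {"AA": (0.0, 1.0), "AE": (1.0, 1.0)}  # phoneme -> (mean, variance)
observations = [0.0, 0.0, 1.0]
log_p = acoustic_log_likelihood(observations, ["AA", "AE"], [2, 3], models)
print(log_p)
```

The product over segments in the derivation becomes a sum in the log domain, which is why the loop simply accumulates one term per phoneme.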

Feature Extraction

Author

Ziang Zhou

Posted on

September 19, 2022

Updated on

September 19, 2022
