11642 Lecture 01


Note Information

Title Introduction
Course Name 11642: Search Engines
Lecturer Jamie Callan
Date 01/17/2023
Note Author Ziang Zhou
Description Course Logistics, Heap’s Law and Zipf’s Law

Searching Concept: Diversity

Not everyone agrees on which results should top a given search query. For instance, consider a search for "Michael Jordan": different users may be looking for different people with that name, so a good result list should be diverse.

Administrative Info

  • Office Hours

    | Day       | Time        | Room     | Instructor |
    | --------- | ----------- | -------- | ---------- |
    | Monday    | 4:00 - 5:30 | GHC 5417 | Meghna     |
    | Tuesday   | 1:00 - 2:30 | GHC 5417 | Guanzhong  |
    | Wednesday | 2:00 - 3:30 | GHC 5417 | Anubha     |
    | Thursday  | 2:00 - 3:30 | GHC 5417 | Siddharth  |
    | Friday    | 4:00 - 5:30 | GHC 5417 | Raghav     |
  • Textbook: Introduction to Information Retrieval

No Java Pickup

This course is not the place to learn Java from scratch. If you are new to the language, work through a Java crash course first and make sure you can solve small programming problems in it before the assignments begin.

Two laws describe how words are used in human language

Heaps’ Law

  • Heaps’ Law: the size of the vocabulary

    Heaps’ Law is a statistical law that describes the relationship between the number of distinct words (vocabulary size) and the total number of words in a text corpus. It states that the vocabulary size, $V,ドル of a text corpus grows at a slower rate than the number of words, $N,ドル in the corpus.

    $$V=KN^{\beta}$$

    Typically there are two ways to estimate the parameters $K$ and $\beta$:

    • Use a small set to fit parameters like $K$ and $\beta$
    • Maximum likelihood estimation (MLE)
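The first approach above (fitting on a small sample) can be sketched in Python. This is a hypothetical illustration, not course code: a toy corpus is sampled at increasing sizes, and $K$ and $\beta$ are fit by least squares on $\log V = \log K + \beta \log N$.

```python
import math

# Illustrative toy corpus (assumed for this sketch, not from the course).
text = ("to be or not to be that is the question "
        "whether tis nobler in the mind to suffer ") * 5
tokens = text.split()

# Collect (N, V) samples: total words seen vs. distinct words seen.
samples = []
seen = set()
for i, tok in enumerate(tokens, start=1):
    seen.add(tok)
    if i % 10 == 0:
        samples.append((i, len(seen)))

# Least-squares fit of log V = log K + beta * log N.
xs = [math.log(n) for n, _ in samples]
ys = [math.log(v) for _, v in samples]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
       sum((x - mean_x) ** 2 for x in xs)
K = math.exp(mean_y - beta * mean_x)

print(f"K = {K:.2f}, beta = {beta:.2f}")  # beta is typically well below 1
```

On realistic English corpora the fitted $\beta$ usually lands around 0.4 to 0.6; the repeated toy text here saturates quickly, so its fitted $\beta$ is even smaller.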

Zipf’s Law

  • Zipf’s Law: the term frequency distribution

    Zipf’s Law is a statistical law that describes the distribution of word frequencies in a text corpus. It states that the frequency of a word, $f,ドル is inversely proportional to its rank, $r,ドル in the frequency distribution of words.
    $$f=\frac{k}{r}$$ Where $k$ is a constant.

    Zipf’s Law relates a term’s frequency to its rank.

  • Empirical Observation: $$P(t_R)=\frac{ctf_{t_R}}{N}\approx\frac{A}{R}$$ where $A\approx 0.1$

    Therefore, the predicted probability of the most frequent word is $P(t_1)=\frac{0.1}{1}=0.1,ドル and that of the second most frequent word is $P(t_2)=\frac{0.1}{2}=0.05$.

    • (tf stands for term frequency)
    • (ctf stands for collection term frequency)
  • Rank $\times$ Frequency = Constant (Constant = 0.1 $\times$ N), therefore,

    • Rank = $\frac{0.1 \times N}{\text{Frequency}}$
    • Frequency = $\frac{0.1 \times N}{\text{Rank}}$
      where $N$ is the total number of word occurrences in the collection
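The two rearrangements above can be sketched as small Python helpers. The collection size $N$ below is an assumed illustrative value, and $A \approx 0.1$ is taken from the empirical observation above.

```python
# Empirical Zipf approximation: P(t_R) ~= A / R, with A ~= 0.1.
A = 0.1
N = 1_000_000  # assumed total number of word occurrences (illustrative)

def predicted_frequency(rank: int) -> float:
    """Predicted frequency of the rank-th most common term: (A * N) / rank."""
    return A * N / rank

def predicted_rank(frequency: float) -> float:
    """Predicted rank of a term with the given frequency: (A * N) / frequency."""
    return A * N / frequency

print(predicted_frequency(1))  # 100000.0 (most frequent term)
print(predicted_frequency(2))  # 50000.0  (half as frequent)
print(predicted_rank(100.0))   # 1000.0
```

Note that the two functions are inverses of each other, which is exactly the "Rank $\times$ Frequency = Constant" statement.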

Terminologies

  • $tf_{t,d}$: Term frequency
  • $ctf_t$: collection term frequency
  • $df_{t}$: # of documents that contain a term
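As a hypothetical illustration of these three counts, here is a sketch on a two-document toy collection (the documents are made up for this example):

```python
from collections import Counter

# Assumed toy collection of two tokenized documents.
docs = {
    "d1": "search engines rank search results".split(),
    "d2": "search is hard".split(),
}

# tf_{t,d}: per-document term counts.
tf = {d: Counter(toks) for d, toks in docs.items()}
# ctf_t: total occurrences of each term across the whole collection.
ctf = Counter(tok for toks in docs.values() for tok in toks)
# df_t: number of documents containing each term (count each doc once).
df = Counter(tok for toks in docs.values() for tok in set(toks))

print(tf["d1"]["search"])  # 2 (occurrences of "search" within d1)
print(ctf["search"])       # 3 (occurrences across the collection)
print(df["search"])        # 2 (documents that contain "search")
```

The distinction matters later: ranking functions like tf-idf use $tf$ and $df,ドル while language-model approaches typically use $ctf$.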

Summary

  • Heaps’ Law states that the vocabulary size of a text corpus grows at a slower rate than the number of words in the corpus, described by a power law function $V=KN^b,ドル where K and b are constants.
  • Zipf’s Law states that the frequency of a term in a text corpus is inversely proportional to its rank in the frequency distribution of words, described by the equation $f=k/r,ドル where k is a constant. In particular, the most frequent term is predicted to appear about twice as often as the second most frequent term.

Next Class Reading

  • Chapter 1
  • Chapter 5.1.1 - 5.1.2

Posted on

January 17, 2023

Updated on

January 23, 2023
