
Problem 34780 - LLMs (multilingual)

Time limit: 0.5 seconds
Memory limit: 2048 MB
Submissions: 1 / Accepted: 1 / Solvers: 1 / Acceptance rate: 100.000%

Problem

Bruno recently became interested in large language models (LLMs). One of the first things he learned was that one of the most fundamental components of an LLM is its token prediction system (tokens that, for simplicity, we will treat as words). The idea behind this kind of system is simple to state: given an "incomplete" sentence, the system must suggest the next word of the sentence. Such systems rely heavily on linear algebra and neural networks, require large training datasets, and have at least millions (often billions or even trillions) of parameters, which makes it infeasible for an individual to train their own model.

Due to these limitations (and also because he doesn't need an LLM at its full potential), Bruno decided to test new, potentially computationally cheaper ideas that need less input data and even eliminate the training phase. He doesn't expect to get something as good as commercial models, but he thinks he can find a much cheaper and sufficiently good model. He needs help testing the word prediction system he has just conceived.

His system, named System for Bruno's Chat, or simply SBC, starts similarly to traditional LLMs: it receives as input a mapping from a set of words, called a "dictionary", into a Euclidean space, which tends to encode similar words at nearby points. The dictionary is ordered from the most common to the least common word in the dataset used to build it. For simplicity, Bruno projected this space onto a plane and rounded the coordinates, so that each word $p$ is represented by a point (or vector) $v(p)$ with integer coordinates in $\mathbb{R}^2$. In addition to this mapping, Bruno will use an input text as his "knowledge base", that is, the source from which his queries will be answered.

For each query, he will try to predict the next word: he looks at the last words of the query and, using the mapping and the text, decides which word should come next.

Bruno's system works as follows. Initially, an integer $K$ is chosen, which will be the size of the "context window". We search the knowledge base for every place where the last $K$ words of the query occur in the text, in exactly the same order and consecutively. At each occurrence, we record the word $c$ that appears in the knowledge base immediately after those $K$ words. After scanning the whole text, we obtain a sequence $c_1, \dots, c_r$ of words, which we will call the "candidates".
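To make the candidate-collection step concrete, here is a minimal Python sketch (an illustration only, not part of the problem statement; the function name candidates and the list-of-words representation are assumptions):

def candidates(context, text):
    # `context` holds the last K query words; `text` is the knowledge
    # base as a list of words. Only occurrences followed by at least
    # one more word contribute, hence the scan stops at len(text) - k.
    k = len(context)
    return [text[i + k] for i in range(len(text) - k) if text[i:i + k] == context]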

Next, for each word $d$ in the dictionary, the similarity of $d$ to the candidates is computed. Since words are associated with vectors, and the inner product of two vectors can be interpreted as a measure of similarity between them, Bruno decided to use it as the similarity metric. By abuse of notation, we identify each word with its vector: $d \equiv v(d)$ and $c_i \equiv v(c_i)$. If a candidate word $c_i$ is not present in the dictionary, it is represented by the vector $(0, 0)$. The inner product of two vectors $v = (v_x, v_y)$ and $w = (w_x, w_y)$ is denoted $v \cdot w$ and defined as $v \cdot w = v_x w_x + v_y w_y$. The similarity of a dictionary word $d$ to the candidates is given by $$S(d) = \displaystyle\sum_{i=1}^r{d \cdot c_i}\text{.}$$
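A small observation, not stated in the problem but immediate from the definition: the inner product is linear, so $$S(d) = \sum_{i=1}^{r} d \cdot c_i = d \cdot \left(\sum_{i=1}^{r} c_i\right)\text{,}$$ which means the candidate vectors can be summed once, and each dictionary word then scored with a single inner product.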

SBC then chooses the word with the highest value of $S(d)$, and this will be the next word of the text. In case of a tie, SBC chooses the most common word, that is, the one that appears first in the dictionary. If there are no candidate words, $K$ is decreased by one and the process is repeated. If no candidate words are found even for $K = 1$, the process is aborted.
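Putting the pieces together, here is a hedged Python sketch of one full SBC prediction, reusing the candidates helper above; the name predict and the representation of the dictionary as an ordered list of (word, (x, y)) pairs are assumptions of this sketch:

def predict(query, K, text, dictionary):
    # `dictionary` is ordered from most to least common word.
    vec = dict(dictionary)                      # word -> vector lookup
    for k in range(K, 0, -1):                   # shrink the window until candidates appear
        cs = candidates(query[-k:], text)
        if not cs:
            continue
        # Sum the candidate vectors; words absent from the dictionary count as (0, 0).
        sx = sum(vec.get(c, (0, 0))[0] for c in cs)
        sy = sum(vec.get(c, (0, 0))[1] for c in cs)
        # max() returns the first maximum in iteration order, so ties resolve
        # to the more common (earlier) dictionary word, as required.
        word, _ = max(dictionary, key=lambda p: p[1][0] * sx + p[1][1] * sy)
        return word
    return None                                 # no candidates even for k = 1: abort

On the sample below this matches the expected behavior: for the query "star star star" with $K = 3$, "star star star" never occurs in the text, but at $k = 2$ the context "star star" occurs once, followed by "wars"; the candidate sum is $(10, 12)$, and "wars" itself maximizes the score with $S(\text{wars}) = 10 \cdot 10 + 12 \cdot 12 = 244$.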

Although in Bruno’s intuition this whole process makes sense, he has great difficulty implementing it and needs your help.

Input

The input contains 3ドル$ parts: the first describing the dictionary, the second describing the knowledge base and the third containing the queries.

The first line of input contains an integer $N$ ($2 \le N \le 10^3$), indicating the number of words in the dictionary. The $i$-th of the following $N$ lines will contain a word $P_i$ composed only of lowercase letters of the alphabet, with at most $12$ characters, followed by two integers $X_i$ and $Y_i$ ($-10^3 \le X_i, Y_i \le 10^3$), which indicate the vector $(X_i, Y_i)$ associated with the $i$-th word in the dictionary.

The next line contains an integer $M$ ($2 \le M \le 10^3$), the number of words in the text corresponding to the knowledge base. The following lines will contain the knowledge base, with $8$ words per line, separated by spaces (except possibly the last line).

The next line contains two integers $Q$ ($1 \le Q \le 10$) and $K$ ($1 \le K \le 5$), indicating the number of queries and the size of the context window. Each of the following $Q$ lines describes a query: an integer $F$ ($K \le F \le 8$), followed by $F$ words.
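Because the knowledge base wraps at $8$ words per line, reading the whole input as a single token stream is simpler than line-by-line parsing. A minimal driver sketch under that assumption (it relies on the hypothetical predict helper above):

import sys

def main():
    toks = iter(sys.stdin.read().split())
    n = int(next(toks))
    dictionary = [(next(toks), (int(next(toks)), int(next(toks)))) for _ in range(n)]
    m = int(next(toks))
    text = [next(toks) for _ in range(m)]
    q, K = int(next(toks)), int(next(toks))
    for _ in range(q):
        f = int(next(toks))
        query = [next(toks) for _ in range(f)]
        nxt = predict(query, K, text, dictionary)
        print(" ".join(query + [nxt if nxt is not None else "*"]))

if __name__ == "__main__":
    main()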

Output

For each query, print a line containing the words of the query followed by the next word predicted by SBC, if it exists, or by "*" (an asterisk) otherwise.

Sample Input 1

6
the -15 0
world 7 6
star -4 2
wars 10 12
peace 10 -11
trek 13 1
12
the peace the war the trek the star
star wars star peace
4 3
3 i love star
3 this is the
4 star trek is very
3 star star star

Sample Output 1

i love star trek
this is the peace
star trek is very *
star star star wars

Source

ICPC > Regionals > Latin America > Sub-Regional Brasil do ACM ICPC > Maratona SBC de Programação 2025, Problem L
