| Time Limit | Memory Limit | Submissions | Accepted | Solvers | Acceptance Rate |
|---|---|---|---|---|---|
| 3 초 | 2048 MB | 2 | 2 | 2 | 100.000% |
You are given a text dataset. Your task is to train an LLM (Large Language Model) and find the minimum possible loss. No kidding.
A text dataset is an array of texts $t_1, t_2, \ldots, t_n$. Each text $t_i$ is a sequence of tokens. We define the set of tokens $T$ as the set of all tokens that appear in at least one text $t_i$. Additionally, for each text $t_i$ there is a set of positions $L_i \subseteq \{1, 2, \ldots, |t_i|\}$: the token $t_i[j]$ is generated by the LLM if $j \in L_i$ and is written by the user if $j \notin L_i$.
Let us define an LLM with context size $k$ as a probabilistic model $P_k$ that defines the probability distribution of the next token of a sequence, depending on a context $w$ --- a sequence of length between 0ドル$ and $k$ (inclusive) whose elements are from $T$. Thus the probabilistic model $P_k$ is a large table of probabilities $P_k(\text{next} | w),ドル defined for any context $w \in T^{*},ドル 0ドル \leq |w| \leq k,ドル and any token $\text{next} \in T$. The conditions 0ドル \leq P_k(\text{next} | w) \leq 1$ and $\sum\limits_{\text{next} \in T} P_k(\text{next} | w) = 1$ must be satisfied.
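Concretely, such a model is nothing more than a lookup table. A minimal sketch in Python (the nested-dict layout and all names here are illustrative, not mandated by the statement):

```python
# A toy P_k with k = 1 over T = {"a", "b"}: one probability row per
# context of length 0 or 1.
P1 = {
    (): {"a": 0.5, "b": 0.5},        # empty context (length 0)
    ("a",): {"a": 0.9, "b": 0.1},
    ("b",): {"a": 0.25, "b": 0.75},
}
# Every row must be a probability distribution over T.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in P1.values())
```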
The loss function of an LLM with context size $k$ is the following function of $P_k$:
$$ \mathcal{L}_k(P_k) ,円,円 = ,円,円 \sum_{i=1}^{n} ,円,円 \sum_{j\in L_i} ,円 -\log_2 P_k\!\left( \underbrace{t_i[j]}_{\text{next token}} \ \middle|\ \underbrace{t_i[\max(1,j-k),円..,円j-1]}_{\text{context}} \right) $$
Here $t_i[l,円..,円r] = t_i[l] t_i[l+1] \ldots t_i[r]$ is the substring from the $l$-th to the $r$-th token, and $t_i[1,円..,0円]$ is the empty string. So, for each text and for each token generated by the LLM, we add to the loss the negative logarithm (base 2) of the probability that this token is generated, conditioned on the previous $k$ tokens (or on the whole prefix, if it is shorter than $k$). If the probability is zero, we take the negative logarithm to be $+\infty$. This loss function is known as the (base 2) Cross-Entropy Loss over the LLM-generated positions. The smaller the loss value $\mathcal{L}_k(P_k),ドル the better the LLM $P_k$ is.
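The loss is then a direct sum over the generated positions. A sketch of the definition in Python (0-based indices; `texts`, `gen`, and the dict-based model `P` follow the toy layout above and are my own naming):

```python
import math

def loss(texts, gen, P, k):
    """Cross-entropy loss (base 2) of model P with context size k.
    texts: list of token lists; gen: list of sets of 0-based generated
    positions; P: dict mapping a context tuple to {token: probability}."""
    total = 0.0
    for t, L in zip(texts, gen):
        for j in L:
            ctx = tuple(t[max(0, j - k):j])   # at most k preceding tokens
            p = P.get(ctx, {}).get(t[j], 0.0)
            total += math.inf if p == 0.0 else -math.log2(p)
    return total
```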
For each 0ドル \leq k < \max\limits_{i=1..n} |t_i|,ドル calculate the minimum possible loss $\mathcal{L}_k(P_k)$ over all LLMs $P_k$ with context size $k$. It can be proved that this minimum is attainable and finite.
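One way to see why the minimum is finite and attainable (an informal maximum-likelihood argument, not part of the official statement; the counts $n_{w,c}$ and $N_w$ are notation introduced here): the loss decomposes over contexts, and for a fixed context the optimal distribution is the empirical one. Let $n_{w,c}$ be the number of LLM-generated positions whose context equals $w$ and whose token is $c,ドル and let $N_w = \sum_{c} n_{w,c}$. Then

$$ \min_{P_k} \mathcal{L}_k(P_k) ,円=,円 \sum_{w} \min_{p} \Big( -\!\sum_{c \in T} n_{w,c} \log_2 p(c) \Big) ,円=,円 \sum_{w} \sum_{c ,円:,円 n_{w,c} > 0} n_{w,c} \log_2 \frac{N_w}{n_{w,c}}, $$

where the inner minimum is attained at $p(c) = n_{w,c} / N_w$ by Gibbs' inequality. Since every generated token then receives positive probability, the minimum is finite.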
The first line contains a single integer $n$ (1ドル \leq n \leq 10^5$) --- the number of texts in the dataset. Text descriptions follow.
The first line of the $i$-th text description contains a single integer $m_i$ (1ドル \leq m_i \leq 3 \cdot 10^5$) --- the length of $t_i$ ($m_i = |t_i|$).
The next line contains $m_i$ strings $t_{i}[1],ドル $t_{i}[2],ドル $\ldots,ドル $t_{i}[m_i]$ (1ドル \leq |t_{i}[j]| \leq 5$) --- the tokens of the text $t_i$. Each token consists of characters with ASCII codes from 33ドル$ to 126ドル$ (printable non-space characters).
The next line contains a string $\ell_i$ of $m_i$ letters $\texttt{U}$ and $\texttt{L},ドル which encodes the set $L_i$: all positions marked $\texttt{L}$ are generated by the LLM, and all positions marked $\texttt{U}$ are written by the user, so $L_i = \{\,j \mid \ell_i[j] = \texttt{L}\,\}$. It is guaranteed that the last token is generated by the LLM, i.e. $\ell_i[m_i] = \texttt{L}$.
It is guaranteed that the sum of $m_i$ for all $i$ (1ドル \le i \le n$) does not exceed 3ドル \cdot 10^5$.
Print $M = \max\limits_{i=1..n} m_i$ real numbers: for each $k = 0, 1, \ldots, M-1,ドル print the minimum possible loss $\mathcal{L}_k(P_k)$ over all possible LLMs $P_k$ with context size $k$.
Your answers will be accepted if their absolute or relative errors do not exceed 10ドル^{-6}$; formally, if $p$ is your answer, and $q$ is the jury's answer, this should hold: $\frac{|p - q|}{\max\{1, |q|\}} \le 10^{-6}$.
Sample Input 1

```
4
5
1 + 1 = 2
UUUUL
5
1 + 2 = 3
UUUUL
5
2 + 1 = 3
UUUUL
5
2 + 2 = 4
UUUUL
```

Sample Output 1

```
6.000000000000 6.000000000000 4.000000000000 4.000000000000 0.000000000000
```
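A quick check of the first sample (not part of the official statement): for $k = 0$ the generated tokens are 2,ドル 3, 3, 4,ドル so the optimal context-free distribution is the empirical one, $P_0(2) = P_0(4) = \frac{1}{4}$ and $P_0(3) = \frac{1}{2},ドル giving loss 2ドル + 1 + 1 + 2 = 6$. For $k = 2$ the context `1 =` is followed by 2ドル$ or 3ドル,ドル and the context `2 =` by 3ドル$ or 4ドル$; each behaves like a fair coin, giving loss 4ドル \cdot 1 = 4$.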
Sample Input 2

```
4
4
N E F <EOS>
LLLL
5
N E R C <EOS>
LLLLL
6
N E E R C <EOS>
LLLLLL
5
I C P C <EOS>
LLLLL
```

Sample Output 2

```
55.683674395584 12.490224995673 8.000000000000 8.000000000000 8.000000000000 8.000000000000
```
Sample Input 3

```
1
16
a b a c a b a d b a b d a b a c
ULLULLLLLLULLLLL
```

Sample Output 3

```
22.595941331507 12.464393446710 5.245112497837 2.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000
```
Sample Input 4

```
2
4
WA WA WA AC
LULL
4
AC AC WA AC
LLUL
```

Sample Output 4

```
5.509775004327 4.754887502163 4.000000000000 2.000000000000
```
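For intuition, here is a brute-force sketch of the whole task in Python, combining the two observations above (group the generated positions by context, then charge each group its empirical entropy). It runs in $O(M \cdot \sum m_i)$ time, far too slow for the stated limits, so it is only a checker, not a model solution; all names are mine:

```python
import math
from collections import Counter, defaultdict

def min_losses(texts, gen):
    """Minimum loss for every k = 0 .. M-1, where M = max text length."""
    M = max(len(t) for t in texts)
    answers = []
    for k in range(M):
        by_ctx = defaultdict(Counter)          # context -> next-token counts
        for t, L in zip(texts, gen):
            for j in L:                        # 0-based generated position
                by_ctx[tuple(t[max(0, j - k):j])][t[j]] += 1
        total = 0.0
        for cnt in by_ctx.values():            # empirical entropy per context
            N = sum(cnt.values())
            total += sum(n * math.log2(N / n) for n in cnt.values())
        answers.append(total)
    return answers

# The first sample (0-based generated positions):
texts = [["1", "+", "1", "=", "2"], ["1", "+", "2", "=", "3"],
         ["2", "+", "1", "=", "3"], ["2", "+", "2", "=", "4"]]
gen = [{4}, {4}, {4}, {4}]
print(" ".join("%.12f" % x for x in min_losses(texts, gen)))
# -> 6.000000000000 6.000000000000 4.000000000000 4.000000000000 0.000000000000
```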