| Time Limit | Memory Limit | Submissions | Accepted | Solvers | Acceptance Rate |
|---|---|---|---|---|---|
| 3 초 | 2048 MB | 2 | 2 | 2 | 100.000% |
You are given a text dataset. Your task is to train an LLM (Large Language Model) and find the minimum possible loss. No kidding.
A text dataset is an array of texts $t_1, t_2, \ldots, t_n$. Each text $t_i$ is a sequence of tokens. We define the set of tokens $T$ as the set of all tokens that appear in at least one text $t_i$. Additionally, for each text $t_i$ there is a set of positions $L_i \subseteq \{1, 2, \ldots, |t_i|\}$: the token $t_i[j]$ is generated by the LLM if $j \in L_i$ and is written by the user if $j \notin L_i$.
Let us define an LLM with context size $k$ as a probabilistic model $P_k$ that defines the probability distribution of the next token of a sequence, depending on a context $w$ --- a sequence of length between 0ドル$ and $k$ (inclusive) whose elements are from $T$. Thus the probabilistic model $P_k$ is a large table of probabilities $P_k(\text{next} | w),ドル defined for any context $w \in T^{*},ドル 0ドル \leq |w| \leq k,ドル and any token $\text{next} \in T$. The conditions 0ドル \leq P_k(\text{next} | w) \leq 1$ and $\sum\limits_{\text{next} \in T} P_k(\text{next} | w) = 1$ must be satisfied.
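Concretely, such a model is nothing more than a lookup table. A minimal sketch in Python (the nested-dict layout and all names here are illustrative, not mandated by the statement):

```python
# A toy P_k with k = 1 over T = {"a", "b"}: one probability row per
# context of length 0 or 1.
P1 = {
    (): {"a": 0.5, "b": 0.5},        # empty context (length 0)
    ("a",): {"a": 0.9, "b": 0.1},
    ("b",): {"a": 0.25, "b": 0.75},
}
# Every row must be a probability distribution over T.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in P1.values())
```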
The loss function of an LLM with context size $k$ is the following function of $P_k$:
$$ \mathcal{L}_k(P_k) ,円,円 = ,円,円 \sum_{i=1}^{n} ,円,円 \sum_{j\in L_i} ,円 -\log_2 P_k\!\left( \underbrace{t_i[j]}_{\text{next token}} \ \middle|\ \underbrace{t_i[\max(1,j-k),円..,円j-1]}_{\text{context}} \right) $$
Here $t_i[l,円..,円r] = t_i[l] t_i[l+1] \ldots t_i[r]$ is the substring from the $l$-th to the $r$-th token, and $t_i[1,円..,0円]$ is the empty string. So, for each text and for each token generated by the LLM, we add to the loss the negative logarithm (base 2) of the probability that this token is generated, conditioned on the previous $k$ tokens (or on the whole prefix, if it is shorter than $k$). If the probability is zero, we take the negative logarithm to be $+\infty$. This loss function is known as the (base 2) Cross-Entropy Loss over the LLM-generated positions. The smaller the loss value $\mathcal{L}_k(P_k),ドル the better the LLM $P_k$ is.
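The loss is then a direct sum over the generated positions. A sketch of the definition in Python (0-based indices; `texts`, `gen`, and the dict-based model `P` follow the toy layout above and are my own naming):

```python
import math

def loss(texts, gen, P, k):
    """Cross-entropy loss (base 2) of model P with context size k.
    texts: list of token lists; gen: list of sets of 0-based generated
    positions; P: dict mapping a context tuple to {token: probability}."""
    total = 0.0
    for t, L in zip(texts, gen):
        for j in L:
            ctx = tuple(t[max(0, j - k):j])   # at most k preceding tokens
            p = P.get(ctx, {}).get(t[j], 0.0)
            total += math.inf if p == 0.0 else -math.log2(p)
    return total
```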
For each 0ドル \leq k < \max\limits_{i=1..n} |t_i|,ドル calculate the minimum possible loss $\mathcal{L}_k(P_k)$ over all LLMs $P_k$ with context size $k$. It can be proved that this minimum is attainable and finite.
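One way to see why the minimum is finite and attainable (an informal maximum-likelihood argument, not part of the official statement; the counts $n_{w,c}$ and $N_w$ are notation introduced here): the loss decomposes over contexts, and for a fixed context the optimal distribution is the empirical one. Let $n_{w,c}$ be the number of LLM-generated positions whose context equals $w$ and whose token is $c,ドル and let $N_w = \sum_{c} n_{w,c}$. Then

$$ \min_{P_k} \mathcal{L}_k(P_k) ,円=,円 \sum_{w} \min_{p} \Big( -\!\sum_{c \in T} n_{w,c} \log_2 p(c) \Big) ,円=,円 \sum_{w} \sum_{c ,円:,円 n_{w,c} > 0} n_{w,c} \log_2 \frac{N_w}{n_{w,c}}, $$

where the inner minimum is attained at $p(c) = n_{w,c} / N_w$ by Gibbs' inequality. Since every generated token then receives positive probability, the minimum is finite.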
The first line contains a single integer $n$ (1ドル \leq n \leq 10^5$) --- the number of texts in the dataset. Text descriptions follow.
The first line of the $i$-th text description contains a single integer $m_i$ (1ドル \leq m_i \leq 3 \cdot 10^5$) --- the length of $t_i$ ($m_i = |t_i|$).
The next line contains $m_i$ strings $t_{i}[1],ドル $t_{i}[2],ドル $\ldots,ドル $t_{i}[m_i]$ (1ドル \leq |t_{i}[j]| \leq 5$) --- the tokens of the text $t_i$. Each token consists of characters with ASCII codes from 33ドル$ to 126ドル$ (printable non-space characters).
The next line contains a string $\ell_i$ of $m_i$ letters $\texttt{U}$ and $\texttt{L},ドル which encodes the set $L_i$: all positions marked $\texttt{L}$ are generated by the LLM, and all positions marked $\texttt{U}$ are written by the user, so $L_i = \{\,j \mid \ell_i[j] = \texttt{L}\,\}$. It is guaranteed that the last token is generated by the LLM, i.e. $\ell_i[m_i] = \texttt{L}$.
It is guaranteed that the sum of $m_i$ for all $i$ (1ドル \le i \le n$) does not exceed 3ドル \cdot 10^5$.
Print $M = \max\limits_{i=1..n} m_i$ real numbers: for each $k = 0, 1, \ldots, M-1,ドル print the minimum possible loss $\mathcal{L}_k(P_k)$ over all possible LLMs $P_k$ with context size $k$.
Your answers will be accepted if their absolute or relative errors do not exceed 10ドル^{-6}$; formally, if $p$ is your answer, and $q$ is the jury's answer, this should hold: $\frac{|p - q|}{\max\{1, |q|\}} \le 10^{-6}$.
Sample Input 1

```
4
5
1 + 1 = 2
UUUUL
5
1 + 2 = 3
UUUUL
5
2 + 1 = 3
UUUUL
5
2 + 2 = 4
UUUUL
```

Sample Output 1

```
6.000000000000 6.000000000000 4.000000000000 4.000000000000 0.000000000000
```
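A quick check of the first sample (not part of the official statement): for $k = 0$ the generated tokens are 2,ドル 3, 3, 4,ドル so the optimal context-free distribution is the empirical one, $P_0(2) = P_0(4) = \frac{1}{4}$ and $P_0(3) = \frac{1}{2},ドル giving loss 2ドル + 1 + 1 + 2 = 6$. For $k = 2$ the context `1 =` is followed by 2ドル$ or 3ドル,ドル and the context `2 =` by 3ドル$ or 4ドル$; each behaves like a fair coin, giving loss 4ドル \cdot 1 = 4$.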
Sample Input 2

```
4
4
N E F <EOS>
LLLL
5
N E R C <EOS>
LLLLL
6
N E E R C <EOS>
LLLLLL
5
I C P C <EOS>
LLLLL
```

Sample Output 2

```
55.683674395584 12.490224995673 8.000000000000 8.000000000000 8.000000000000 8.000000000000
```
Sample Input 3

```
1
16
a b a c a b a d b a b d a b a c
ULLULLLLLLULLLLL
```

Sample Output 3

```
22.595941331507 12.464393446710 5.245112497837 2.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000 0.000000000000
```
Sample Input 4

```
2
4
WA WA WA AC
LULL
4
AC AC WA AC
LLUL
```

Sample Output 4

```
5.509775004327 4.754887502163 4.000000000000 2.000000000000
```
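For intuition, here is a brute-force sketch of the whole task in Python, combining the two observations above (group the generated positions by context, then charge each group its empirical entropy). It runs in $O(M \cdot \sum m_i)$ time, far too slow for the stated limits, so it is only a checker, not a model solution; all names are mine:

```python
import math
from collections import Counter, defaultdict

def min_losses(texts, gen):
    """Minimum loss for every k = 0 .. M-1, where M = max text length."""
    M = max(len(t) for t in texts)
    answers = []
    for k in range(M):
        by_ctx = defaultdict(Counter)          # context -> next-token counts
        for t, L in zip(texts, gen):
            for j in L:                        # 0-based generated position
                by_ctx[tuple(t[max(0, j - k):j])][t[j]] += 1
        total = 0.0
        for cnt in by_ctx.values():            # empirical entropy per context
            N = sum(cnt.values())
            total += sum(n * math.log2(N / n) for n in cnt.values())
        answers.append(total)
    return answers

# The first sample (0-based generated positions):
texts = [["1", "+", "1", "=", "2"], ["1", "+", "2", "=", "3"],
         ["2", "+", "1", "=", "3"], ["2", "+", "2", "=", "4"]]
gen = [{4}, {4}, {4}, {4}]
print(" ".join("%.12f" % x for x in min_losses(texts, gen)))
# -> 6.000000000000 6.000000000000 4.000000000000 4.000000000000 0.000000000000
```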