Given an alphabet $A$ and a sequence $S,ドル determine the optimal dictionary $D$ of maximum word length $m$ whose words can compose the sequence.
Optimality is measured by a two-part cost $L = L(D) + L(S|D)$: a dictionary cost plus an encoding cost. The dictionary cost is
$$L(D) = \log_2(|A|) \cdot \sum_{w \in D} |w|,$$
and the encoding cost $L(S|D)$ is the length of a Huffman coding of $S$ as a sequence of words from $D,ドル with word probabilities learned from the sequence.
Typical parameters in my case: $|A| = 10,ドル $|S| = 180,ドル $m = 21$.
The goal is to find structure in the sequence, encoded in the words $w \in D$. In some cases it is known that a word of length $m^*$ exists. Please point to any references or libraries (Python preferred) for such an algorithm.
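To make the cost concrete, here is a minimal Python sketch of the two-part cost, assuming $S$ has already been parsed into words of $D$ (the dictionary, parse, and function names below are illustrative, not from any particular library):

```python
import heapq
import math
from collections import Counter

def huffman_lengths(freqs):
    """Codeword lengths of a Huffman code for a {symbol: count} map."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items: (total count, unique tiebreaker, {symbol: depth}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one level to every leaf below them.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def two_part_cost(alphabet_size, dictionary, parse):
    """L(D) + L(S|D) for a parse of S as a list of words from dictionary."""
    l_dict = math.log2(alphabet_size) * sum(len(w) for w in dictionary)
    lengths = huffman_lengths(Counter(parse))  # probabilities from S itself
    l_seq = sum(lengths[w] for w in parse)
    return l_dict + l_seq

# Example with |A| = 10: S parsed as four words over a 3-word dictionary.
D = ["12", "345", "1"]
parse = ["12", "345", "12", "1"]
cost = two_part_cost(10, D, parse)
```

Note this only scores a *given* dictionary and parse; the hard part of the problem is searching over dictionaries and parses jointly.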
greybeard (Oct 7 at 20:57): I do not see a question, just a hard problem. Tag reference-request?
D.W. (Oct 20 at 11:11): For defining $L(S|D),ドル what probabilities do you propose to use when generating the Huffman code? Do you want to assume a uniform distribution on $D$?
samrat (Oct 20 at 17:16): @D.W. For $L(S|D),ドル the probabilities should be learnt from the sequence. For $L(D),ドル I am assuming $\log_2(|A|)$ bits for each alphabet symbol. I made an error in the question above: it should be $\log_2(|A|)$ instead of $\log_2(|D|)$.
D.W. (Oct 20 at 17:45): Don't you need about $|D| \log_2 |D| + \log_2(|A|) \sum_{w \in D} |w|$ bits to encode $D$? Should $L(D)$ be defined as that larger value?
D.W. (Oct 20 at 17:45): I'm not sure I understand what "$L(S|D)$ is Huffman coding" means. Can you define it more precisely? Do you mean the number of bits needed to represent $S$ as a sequence of words from $D,ドル using a Huffman code for $D$? But then you cannot use probabilities derived from the frequency of each word of $D$ found in $S,ドル unless you include those frequencies explicitly in the encoding. So should $L(S|D)$ be the number of bits needed to encode the Huffman code and those frequencies, and then to represent $S$ using that Huffman code? It would help to be more explicit.
1 Answer
I suspect finding the exact optimum is NP-hard, but you might be interested in the Sequitur algorithm, the Re-Pair algorithm, and byte-pair encoding. A literature search for related research papers might turn up other techniques.
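To illustrate the flavor of these methods, here is a minimal byte-pair-style sketch in Python. It is not the full Re-Pair algorithm (which uses priority queues and careful bookkeeping for efficiency), just the core greedy idea: repeatedly merge the most frequent adjacent pair of symbols, growing dictionary words bottom-up. All names and parameters here are illustrative:

```python
from collections import Counter

def greedy_pair_merges(seq, max_word_len=21, min_count=2):
    """Greedily merge the most frequent adjacent pair until no pair
    occurs at least min_count times. Returns (final sequence, merged words)."""
    seq = list(seq)  # symbols start as single characters
    words = []
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count or len(a) + len(b) > max_word_len:
            break
        merged = a + b
        words.append(merged)
        # Rewrite the sequence, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, words

s = "1231231245124512"
new_seq, words = greedy_pair_merges(s)  # words grow like "12", "123", ...
```

This greedy scheme does not optimize the two-part cost from the question directly; one could instead accept a merge only when it lowers $L(D) + L(S|D),ドル at extra computational expense.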