Given an alphabet $A$ and a sequence $S,ドル determine the optimal dictionary $D$ of maximum word length $m$ whose words can compose the sequence.
Optimality is measured by a two-part cost $L = L(D) + L(S|D)$: a dictionary cost plus an encoding cost. The dictionary cost is
$$L(D) = \log_2(|A|) \cdot \sum_{w \in D} |w|,$$
and the encoding cost $L(S|D)$ is the length of a Huffman coding of $S$ as a sequence of words from $D,ドル with word probabilities learned from the sequence.
Typical parameters in my case: $|A| = 10,ドル $|S| = 180,ドル $m = 21$.
The goal is to find structure in the sequence, encoded in the words $w \in D$. In some cases it is known that a word of length $m^*$ exists. Please point to any references or libraries (Python preferred) for such an algorithm.
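To make the cost concrete, here is a minimal Python sketch of the two-part cost, assuming $S$ has already been parsed into words of $D$ (the dictionary, parse, and function names below are illustrative, not from any particular library):

```python
import heapq
import math
from collections import Counter

def huffman_lengths(freqs):
    """Codeword lengths of a Huffman code for a {symbol: count} map."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items: (total count, unique tiebreaker, {symbol: depth}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one level to every leaf below them.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def two_part_cost(alphabet_size, dictionary, parse):
    """L(D) + L(S|D) for a parse of S as a list of words from dictionary."""
    l_dict = math.log2(alphabet_size) * sum(len(w) for w in dictionary)
    lengths = huffman_lengths(Counter(parse))  # probabilities from S itself
    l_seq = sum(lengths[w] for w in parse)
    return l_dict + l_seq

# Example with |A| = 10: S parsed as four words over a 3-word dictionary.
D = ["12", "345", "1"]
parse = ["12", "345", "12", "1"]
cost = two_part_cost(10, D, parse)
```

Note this only scores a *given* dictionary and parse; the hard part of the problem is searching over dictionaries and parses jointly.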
greybeard (Oct 7 at 20:57): I do not see a question, just a hard problem. Tag reference-request?
D.W. (Oct 20 at 11:11): For defining $L(S|D),ドル what probabilities do you propose to use when generating the Huffman code? Do you want to assume a uniform distribution on $D$?
samrat (Oct 20 at 17:16): @D.W. For $L(S|D),ドル the probabilities should be learnt from the sequence. For $L(D),ドル I am assuming $\log_2(|A|)$ bits for each alphabet symbol. I made an error in the question above: it should be $\log_2(|A|)$ instead of $\log_2(|D|)$.
D.W. (Oct 20 at 17:45): Don't you need about $|D| \log_2 |D| + \log_2(|A|) \sum_{w \in D} |w|$ bits to encode $D$? Should $L(D)$ be defined as that larger value?
D.W. (Oct 20 at 17:45): I'm not sure I understand what "$L(S|D)$ is Huffman coding" means. Can you define it more precisely? Do you mean the number of bits needed to represent $S$ as a sequence of words from $D,ドル using a Huffman code for $D$? But then you cannot use probabilities derived from the frequency of each word of $D$ found in $S,ドル unless you include those frequencies explicitly in the encoding. So should $L(S|D)$ be the number of bits needed to encode the Huffman code and those frequencies, and then to represent $S$ using that Huffman code? It would help to be more explicit.
1 Answer
I suspect finding the exact optimum is NP-hard, but you might be interested in the Sequitur algorithm, the Re-Pair algorithm, and byte-pair encoding. A literature search for related research papers might turn up other techniques.
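To illustrate the flavor of these methods, here is a minimal byte-pair-style sketch in Python. It is not the full Re-Pair algorithm (which uses priority queues and careful bookkeeping for efficiency), just the core greedy idea: repeatedly merge the most frequent adjacent pair of symbols, growing dictionary words bottom-up. All names and parameters here are illustrative:

```python
from collections import Counter

def greedy_pair_merges(seq, max_word_len=21, min_count=2):
    """Greedily merge the most frequent adjacent pair until no pair
    occurs at least min_count times. Returns (final sequence, merged words)."""
    seq = list(seq)  # symbols start as single characters
    words = []
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count or len(a) + len(b) > max_word_len:
            break
        merged = a + b
        words.append(merged)
        # Rewrite the sequence, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, words

s = "1231231245124512"
new_seq, words = greedy_pair_merges(s)  # words grow like "12", "123", ...
```

This greedy scheme does not optimize the two-part cost from the question directly; one could instead accept a merge only when it lowers $L(D) + L(S|D),ドル at extra computational expense.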