0
$\begingroup$

Given alphabet $A$ and a sequence $S$, determine the optimal dictionary $D$ of max word length $m$ that can compose the sequence.

Optimality can be given as a two-part cost $(L)$ = dictionary cost $(L(D))$ + encoding cost $(L(S|D))$, e.g.

$L(D) = \log_2(|A|) * \sum_{w \in D} |w| $
$L(S|D)$ is Huffman coding, using probabilities learned from the sequence.

Typical scenarios for my case: |A|=10, length of S=180, m=21

The goal is to find structure in the sequence, being encoded in words $w\in D$. In some cases, it is given there exists a word of length $m^*$. Please point to any references, libraries (python preferred) for the algorithm.

D.W.
169k23 gold badges236 silver badges519 bronze badges
asked Oct 7 at 20:17
$\endgroup$
6
  • $\begingroup$ I do not see a question, Just a hard problem - tag reference request?. $\endgroup$ Commented Oct 7 at 20:57
  • $\begingroup$ For defining $L(S|D),ドル what probabilities do you propose to use when generating the Huffman code? Do you want to assume a uniform distribution on $D$? $\endgroup$ Commented Oct 20 at 11:11
  • $\begingroup$ @D.W. for $L(S/D),ドル the probabilities should be learnt from the sequence. for $L(D)$. i am assuming, $log_2 (|A|) $ bits for each alphabet. I made an error, in the question above, it should be $log_2 (|A|)$ instead of $log_2(|D|)$ $\endgroup$ Commented Oct 20 at 17:16
  • $\begingroup$ Don't you need about $|D| \log_2 |D| + \log_2 |A| \sum_{w \in D} |w|$ bits to encode $D$? Should $L(D)$ be defined as that larger values? $\endgroup$ Commented Oct 20 at 17:45
  • $\begingroup$ I'm not sure I understand what "$L(S|D)$ is Huffman coding" means. Can you define it more precisely? Do you mean the number of bits needed to represent $S$ as a sequence of words from $D$ using a a Huffman code for $D$? But then you cannot use probabilities derived from the frequency of each word of $D$ found in $S,ドル unless you include those frequencies explicitly in the encoding. So should $L(S|D)$ be the number of bits needed to encode the Huffman code and those frequencies, and then to represent $S$ using that Huffman code? It would help to be more explicit. $\endgroup$ Commented Oct 20 at 17:45

1 Answer 1

0
$\begingroup$

I suspect finding the exact optimum will likely be NP-hard, but you might be interested in the Sequitur algorithm, the Re-Pair algorithm, and methods for byte-pair encoding. Doing a literature search for related research papers might find other techniques.

answered Oct 20 at 11:17
$\endgroup$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.