I am curious as to how one might very compactly compress the domain of an arbitrary IDN hostname (as defined by RFC5890) and suspect this could become an interesting challenge. A Unicode host or domain name (U-label) consists of a string of Unicode characters, typically constrained to one language depending on the top-level domain (e.g. Greek letters under .gr), which is encoded into an ASCII string beginning with xn-- (the corresponding A-label).
One can build data models not only from the formal requirements that
each non-Unicode label be a string matching
^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$;
each A-label be a string matching ^xn--[a-z\d]([a-z\d\-]{0,57}[a-z\d])?$; and
the total length of the entire domain (A-labels and non-IDN labels concatenated with '.' delimiters) not exceed 255 characters
but also from various heuristics, including:
lower-order U-labels are often lexically, syntactically and semantically valid phrases in some natural language including proper nouns and numerals (unpunctuated except hyphen, stripped of whitespace and folded per Nameprep), with a preference for shorter phrases; and
higher-order labels are drawn from a dictionary of SLDs and TLDs and provide context for predicting which natural language is used in the lower-order labels.
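For concreteness, the formal requirements listed above can be checked with a short sketch (syntax and length only; Nameprep, IDNA bidi rules, etc. are out of scope, and the example domains are illustrative):

```python
import re

# Formal label requirements quoted above (RFC 5890 / RFC 1035 syntax).
LABEL_RE = re.compile(r"^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$")
A_LABEL_RE = re.compile(r"^xn--[a-z\d]([a-z\d\-]{0,57}[a-z\d])?$")

def is_valid_domain(domain: str) -> bool:
    """Total length <= 255 and every label syntactically valid."""
    if len(domain) > 255:
        return False
    return all(LABEL_RE.match(label) for label in domain.split("."))

assert A_LABEL_RE.match("xn--jxalpdlp")         # an A-label
assert is_valid_domain("example.xn--jxalpdlp")  # mixed labels
assert not is_valid_domain("-bad-.example")     # labels may not start with '-'
```

Note that every A-label also matches the generic label pattern, so the A-label regex only adds the xn-- prefix constraint and the correspondingly shorter body.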
I fear that achieving good compression of such short strings will be difficult without exploiting these specific features of the data and, furthermore, that existing libraries will produce unnecessary overhead in order to accommodate their more general use cases.
Reading Matt Mahoney's online book Data Compression Explained, it is clear that a number of existing techniques could be employed to take advantage of the above (and/or other) modelling assumptions which ought to result in far superior compression versus less specific tools.
By way of context, this question is an offshoot from a previous one on SO.
Initial thoughts
It strikes me that this problem is an excellent candidate for offline training and I envisage a compressed data format along the following lines:
A Huffman coding of the "public suffix", with probabilities drawn from some published source of domain registration or traffic volumes;
A Huffman coding of which (natural language) model is used for the remaining U-labels, with probabilities drawn from some published source of domain registration or traffic volumes given context of the domain suffix;
The application of some dictionary-based transforms from the selected natural language model; and
An arithmetic coding of each character in the U-labels, with probabilities drawn from contextually adaptive natural language models derived from offline training (and perhaps online too, although I suspect the data may well be too short to provide any meaningful insight?).
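As a sketch of the first step above, here is a minimal Huffman coder over a public-suffix frequency table; the frequencies are made up for illustration, whereas a real deployment would draw them from published registration or traffic data:

```python
import heapq
from itertools import count

# Hypothetical registration frequencies for a few public suffixes.
SUFFIX_FREQ = {"com": 50, "net": 11, "org": 10, "de": 9, "gr": 2, "co.uk": 8}

def huffman_code(freqs):
    """Build a Huffman code {symbol: bitstring} from a frequency table."""
    tiebreak = count()  # keeps heap comparisons well-defined
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_code(SUFFIX_FREQ)
# More frequent suffixes get shorter codewords:
assert len(codes["com"]) < len(codes["gr"])
```

The same machinery would serve for the second step (choosing the language model given the suffix), with the conditional probabilities forming a separate table per suffix.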
1 Answer
Huffman coding is optimal for letters and can certainly be adapted to sequences. For instance, if the sequence "ab" results in fewer bits than the bits for "a" and "b", then simply add it to the tree ...and so on.
...you can also probably use some simple library which does all that for you with near-optimal performance, so that you won't gain much using your custom-made super fancy compression algorithm.
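The sequence idea in the answer can be illustrated with a toy sketch: treating "ab" as an extra dictionary symbol shortens the total encoding of a string in which "ab" is common. The corpus and the greedy tokeniser are illustrative assumptions, not part of the answer:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_lengths(freqs):
    """Codeword length per symbol for a Huffman code over `freqs`."""
    tie = count()
    heap = [(f, next(tie), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie),
                              {s: d + 1 for s, d in {**a, **b}.items()}))
    return heap[0][2]

def encoded_bits(tokens):
    freqs = Counter(tokens)
    lengths = huffman_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

text = "abababababacadab"
per_char = encoded_bits(list(text))

# Tokenise greedily with "ab" added to the alphabet:
tokens, i = [], 0
while i < len(text):
    if text.startswith("ab", i):
        tokens.append("ab"); i += 2
    else:
        tokens.append(text[i]); i += 1

assert encoded_bits(tokens) < per_char  # the bigram symbol pays for itself
```

Deciding *which* sequences to promote to symbols is the hard part; that is essentially what dictionary coders (LZ78/LZW) automate.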
I think Huffman is not quite optimal (it rounds to the nearest bit): arithmetic coding should always outperform it. And unless one applies an accurate model of the data being compressed, one is always going to achieve suboptimal results... so if every bit matters, generic libraries cannot suffice. – eggyal, Oct 21, 2011 at 10:54
Huffman coding is asymptotically optimal if you ignore correlations between letters (e.g., if you see a q, then the next letter is a lot more likely to be a u than it otherwise would be). But that's not a realistic assumption. In practice, those correlations are huge and enable one to do a lot better than naive Huffman coding. – Dec 31, 2013 at 19:49
@D.W. do you have any recommendations for how one could do better? Would it perhaps help to allow pairs or triples of contiguous characters to be encoded via Huffman? – ryan, Apr 18, 2019 at 4:13
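The headroom the comments describe can be quantified on a toy corpus: the order-1 conditional entropy of English-like text is well below its order-0 entropy, and that gap is exactly what a context model (pairs/triples, or context-conditioned arithmetic coding) can exploit. The repeated pangram is an arbitrary stand-in for real training data:

```python
import math
from collections import Counter

text = "the quick brown fox jumps over the lazy dog " * 20

# Order-0 entropy: cost per letter if each letter is coded independently.
uni = Counter(text)
n = len(text)
h0 = -sum(c / n * math.log2(c / n) for c in uni.values())

# Order-1 conditional entropy: cost per letter given the previous letter.
pairs = Counter(zip(text, text[1:]))
ctx = Counter(text[:-1])
m = len(text) - 1
h1 = -sum(c / m * math.log2(c / ctx[a]) for (a, b), c in pairs.items())

assert h1 < h0  # conditioning on one letter of context already helps
```

On real natural-language corpora the gap keeps widening as the context order grows, which is why PPM/CM-style coders dominate naive per-symbol Huffman.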