I am curious as to how one might very compactly compress the domain of an arbitrary IDN hostname (as defined by RFC5890) and suspect this could become an interesting challenge. A Unicode host or domain name (U-label) consists of a string of Unicode characters, typically constrained to one language depending on the top-level domain (e.g. Greek letters under .gr), which is encoded into an ASCII string beginning with xn-- (the corresponding A-label).
One can build data models not only from the formal requirements that
each non-Unicode label be a string matching
^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$;
each A-label be a string matching ^xn--[a-z\d]([a-z\d\-]{0,57}[a-z\d])?$; and
the total length of the entire domain (A-labels and non-IDN labels concatenated with '.' delimiters) not exceed 255 characters
but also from various heuristics, including:
lower-order U-labels are often lexically, syntactically and semantically valid phrases in some natural language including proper nouns and numerals (unpunctuated except hyphen, stripped of whitespace and folded per Nameprep), with a preference for shorter phrases; and
higher-order labels are drawn from a dictionary of SLDs and TLDs and provide context for predicting which natural language is used in the lower-order labels.
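For concreteness, the formal requirements listed above can be checked with a short sketch (syntax and length only; Nameprep, IDNA bidi rules, etc. are out of scope, and the example domains are illustrative):

```python
import re

# Formal label requirements quoted above (RFC 5890 / RFC 1035 syntax).
LABEL_RE = re.compile(r"^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$")
A_LABEL_RE = re.compile(r"^xn--[a-z\d]([a-z\d\-]{0,57}[a-z\d])?$")

def is_valid_domain(domain: str) -> bool:
    """Total length <= 255 and every label syntactically valid."""
    if len(domain) > 255:
        return False
    return all(LABEL_RE.match(label) for label in domain.split("."))

assert A_LABEL_RE.match("xn--jxalpdlp")         # an A-label
assert is_valid_domain("example.xn--jxalpdlp")  # mixed labels
assert not is_valid_domain("-bad-.example")     # labels may not start with '-'
```

Note that every A-label also matches the generic label pattern, so the A-label regex only adds the xn-- prefix constraint and the correspondingly shorter body.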
I fear that achieving good compression of such short strings will be difficult without exploiting these specific features of the data and, furthermore, that existing libraries will produce unnecessary overhead in order to accommodate their more general use cases.
Reading Matt Mahoney's online book Data Compression Explained, it is clear that a number of existing techniques could be employed to take advantage of the above (and/or other) modelling assumptions which ought to result in far superior compression versus less specific tools.
By way of context, this question is an offshoot from a previous one on SO.
Initial thoughts
It strikes me that this problem is an excellent candidate for offline training and I envisage a compressed data format along the following lines:
A Huffman coding of the "public suffix", with probabilities drawn from some published source of domain registration or traffic volumes;
A Huffman coding of which (natural language) model is used for the remaining U-labels, with probabilities drawn from some published source of domain registration or traffic volumes given context of the domain suffix;
The application of some dictionary-based transforms from the selected natural language model; and
An arithmetic coding of each character in the U-labels, with probabilities drawn from contextually adaptive natural language models derived from offline training (and perhaps online too, although I suspect the data may well be too short to provide any meaningful insight?).
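As a sketch of the first step above, here is a minimal Huffman coder over a public-suffix frequency table; the frequencies are made up for illustration, whereas a real deployment would draw them from published registration or traffic data:

```python
import heapq
from itertools import count

# Hypothetical registration frequencies for a few public suffixes.
SUFFIX_FREQ = {"com": 50, "net": 11, "org": 10, "de": 9, "gr": 2, "co.uk": 8}

def huffman_code(freqs):
    """Build a Huffman code {symbol: bitstring} from a frequency table."""
    tiebreak = count()  # keeps heap comparisons well-defined
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_code(SUFFIX_FREQ)
# More frequent suffixes get shorter codewords:
assert len(codes["com"]) < len(codes["gr"])
```

The same machinery would serve for the second step (choosing the language model given the suffix), with the conditional probabilities forming a separate table per suffix.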
1 Answer
Huffman coding is optimal for letters and can certainly be adapted to sequences. For instance, if the sequence "ab" results in fewer bits than the bits for "a" and "b", then simply add it to the tree ...and so on.
...you can also probably use some simple library which does all that for you with near-optimal performance, so that you won't gain much using your custom-made super fancy compression algorithm.
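The sequence idea in the answer can be illustrated with a toy sketch: treating "ab" as an extra dictionary symbol shortens the total encoding of a string in which "ab" is common. The corpus and the greedy tokeniser are illustrative assumptions, not part of the answer:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_lengths(freqs):
    """Codeword length per symbol for a Huffman code over `freqs`."""
    tie = count()
    heap = [(f, next(tie), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie),
                              {s: d + 1 for s, d in {**a, **b}.items()}))
    return heap[0][2]

def encoded_bits(tokens):
    freqs = Counter(tokens)
    lengths = huffman_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

text = "abababababacadab"
per_char = encoded_bits(list(text))

# Tokenise greedily with "ab" added to the alphabet:
tokens, i = [], 0
while i < len(text):
    if text.startswith("ab", i):
        tokens.append("ab"); i += 2
    else:
        tokens.append(text[i]); i += 1

assert encoded_bits(tokens) < per_char  # the bigram symbol pays for itself
```

Deciding *which* sequences to promote to symbols is the hard part; that is essentially what dictionary coders (LZ78/LZW) automate.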
I think Huffman is not quite optimal (it rounds to the nearest bit): arithmetic coding should always outperform it. And unless one applies an accurate model of the data being compressed, one is always going to achieve suboptimal results... so if every bit matters, generic libraries cannot suffice. – eggyal, Oct 21, 2011 at 10:54
Huffman coding is asymptotically optimal if you ignore correlations between letters (e.g., if you see a q, then the next letter is a lot more likely to be a u than it otherwise would be). But that's not a realistic assumption. In practice, those correlations are huge and enable one to do a lot better than naive Huffman coding. – Dec 31, 2013 at 19:49
@D.W. do you have any recommendations for how one could do better? Would it perhaps help to allow pairs or triples of contiguous characters to be encoded via Huffman? – ryan, Apr 18, 2019 at 4:13
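The headroom the comments describe can be quantified on a toy corpus: the order-1 conditional entropy of English-like text is well below its order-0 entropy, and that gap is exactly what a context model (pairs/triples, or context-conditioned arithmetic coding) can exploit. The repeated pangram is an arbitrary stand-in for real training data:

```python
import math
from collections import Counter

text = "the quick brown fox jumps over the lazy dog " * 20

# Order-0 entropy: cost per letter if each letter is coded independently.
uni = Counter(text)
n = len(text)
h0 = -sum(c / n * math.log2(c / n) for c in uni.values())

# Order-1 conditional entropy: cost per letter given the previous letter.
pairs = Counter(zip(text, text[1:]))
ctx = Counter(text[:-1])
m = len(text) - 1
h1 = -sum(c / m * math.log2(c / ctx[a]) for (a, b), c in pairs.items())

assert h1 < h0  # conditioning on one letter of context already helps
```

On real natural-language corpora the gap keeps widening as the context order grows, which is why PPM/CM-style coders dominate naive per-symbol Huffman.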