[Python-Dev] 2.1 alpha: what about the unicode name database?

2001年1月14日 20:31:06 +0100

finn wrote:
> As a result I invented my own compression format for the ucnhash for
> jython. I managed to achive ~100k but that probably have different
> performance properties.

here's the description:
---
From: "Fredrik Lundh" <effbot@telia.com>
Date: 2000年7月16日 20:40:46 +0200
/.../
 The unicodenames database consists of two parts: a name
 database which maps character codes to names, and a code
 database, mapping names to codes.
* The Name Database (getname)
 First, the 10538 text strings are split into 42193 words,
 and combined into a 4949-word lexicon (a 29k array).
 Each word is given a unique index number (common words get
 lower numbers), and there's a "lexicon offset" table mapping
 from numbers to words (10k).
 To get back to the original text strings, I use a "phrase
 book". For each original string, the phrase book stores a a
 list of word numbers. Numbers 0-127 are stored in one byte,
 higher numbers (less common words) use two bytes. At this
 time, about 65% of the words can be represented by a single
 byte. The result is a 56k array.
 The final data structure is an offset table, which maps code
 points to phrase book offsets. Instead of using one big
 table, I split each code point into a "page number" and a
 "line number" on that page.
 offset = line[ (page[code>>SHIFT]<<SHIFT) + (code&MASK) ]
 Since the unicode space is sparsely populated, it's possible
 to split the code so that lots of pages gets no contents. I
 use a brute force search to find the optimal SHIFT value.
 In the current database, the page table has 1024 entries
 (SHIFT is 6), and there are 199 unique pages in the line
 table. The total size of the offset table is 26k.
* The code database (getcode)
 For the code table, I use a straight-forward hash table to store
 name to code mappings. It's basically the same implementation
 as in Python's dictionary type, but a different hash algorithm.
 The table lookup loop simply uses the name database to check
 for hits.
 In the current database, the hash table is 32k.
/.../
</F>