compacting _Jv_Utf8Const

Wed May 5 19:23:00 GMT 2004

_Jv_Utf8Const names take up a fair amount of space. (However, I don't 
have numbers on this. Does anyone?) Some of it is "overhead": the 
length (2 bytes), hash code (2 bytes), final '\0' (1 byte), and 
alignment (0-1 bytes). How about this more compact encoding:
struct _Jv_Utf8Const
{
 unsigned char hash;
 /* The data length is split into 7-bit chunks. The chunks appear in
 * little-endian order, low-order chunk first (because that is easier
 * to generate). The final chunk is in a byte with a 0 high-order bit,
 * while the preceding ones have the high-order bit set. */
 /* unsigned char length[?]; -extra low-order 7-bit chunks as needed */
 unsigned char length0; /* final (high-order) chunk of length of data, in bytes */
 char data[1];		/* In Utf8 format; no final '\0'. */
};
The hash code is reduced to 1 byte, saving one byte, under the 
assumption that clashes will be rare. We reduce the length field to a 
single byte in all normal cases, saving another byte. We get rid of the 
final useless '\0', saving a third byte. And we remove the requirement 
for short-alignment, saving on average half a byte.
So the savings would be 3.5 bytes per name. Is that enough to be 
worthwhile?
We would also be removing the restriction to a maximum of 0xFFFF bytes.
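
To make the length encoding concrete, here is a rough sketch of how the 
7-bit chunks could be written and read back. The function names are 
made up for illustration and are not existing libgcj code; the sketch 
assumes only the layout described in the struct comment above.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of libgcj. */

/* Write LEN as 7-bit chunks, low-order chunk first.  Every byte except
   the last has its high-order bit set.  Returns the bytes written. */
static size_t
utf8const_put_length (unsigned char *out, size_t len)
{
  size_t n = 0;
  while (len > 0x7F)
    {
      out[n++] = (unsigned char) ((len & 0x7F) | 0x80);
      len >>= 7;
    }
  out[n++] = (unsigned char) len;	/* final chunk, high bit clear */
  return n;
}

/* Read a length written by utf8const_put_length.  Stores the value in
   *LEN and returns the number of bytes consumed. */
static size_t
utf8const_get_length (const unsigned char *in, size_t *len)
{
  size_t n = 0;
  size_t value = 0;
  unsigned shift = 0;
  while (in[n] & 0x80)
    {
      value |= (size_t) (in[n++] & 0x7F) << shift;
      shift += 7;
    }
  value |= (size_t) in[n++] << shift;
  *len = value;
  return n;
}
```

With this scheme any name of up to 127 bytes needs a single length 
byte, and a length of 0x10000, which the current 2-byte field cannot 
hold, takes three.
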
Disadvantages: Slightly slower comparisons. More complex code. More 
awkward to print out _Jv_Utf8Const from gdb. Breaking binary 
compatibility. But the biggest is the actual work of changing the code.
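
As a rough illustration of where the extra comparison cost would come 
from: with the compact layout, the data no longer starts at a fixed 
offset, so an equality test has to decode the length before it can 
compare bytes. The sketch below treats a name as a raw byte sequence 
(hash byte, length chunks, data); the function names are hypothetical 
and this is not how libgcj compares names today.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: P points at the hash byte of a compact name.
   Stores the data length in *LEN and returns a pointer to the data. */
static const unsigned char *
compact_name_data (const unsigned char *p, size_t *len)
{
  size_t value = 0;
  unsigned shift = 0;
  p++;				/* skip the 1-byte hash */
  while (*p & 0x80)		/* low-order chunks, high bit set */
    {
      value |= (size_t) (*p++ & 0x7F) << shift;
      shift += 7;
    }
  value |= (size_t) *p++ << shift;	/* final chunk, high bit clear */
  *len = value;
  return p;
}

/* Returns nonzero if the two compact names are equal. */
static int
compact_name_equal (const unsigned char *a, const unsigned char *b)
{
  size_t alen, blen;
  const unsigned char *adata, *bdata;

  if (a[0] != b[0])		/* cheap reject on the 1-byte hash */
    return 0;
  adata = compact_name_data (a, &alen);
  bdata = compact_name_data (b, &blen);
  return alen == blen && memcmp (adata, bdata, alen) == 0;
}
```

The 1-byte hash still rejects most mismatches up front; the added work 
is the chunk loop, which runs zero times for names under 128 bytes.
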
There is also the issue of whether this change is compatible with the 
plans for the new ABI.
Finally, one could compress the actual characters, i.e. use a more 
compact special-purpose encoding than UTF-8. 6 bits per character, with 
some escape mechanism, should be enough, but saving 25% is probably not 
enough to justify the complexity.
-- 
	--Per Bothner
per@bothner.com http://per.bothner.com/