Hello,

Maybe a little off-topic, but here we go:

On Thu, 7 Dec 2006 08:55:32 -0800 "Ken Smith" <kgsmith@gmail.com> wrote:
> On 12/7/06, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
> > If I understand correctly, even Asian languages use ASCII
> > punctuation (dots, spaces, newlines, commas, etc.), which takes 1
> > byte in UTF-8 but 2 in UTF-16. So even for these languages UTF-8
> > is not as much less compact as it seems.
>
> I don't know about other Asian languages, but Japanese has special
> punctuation characters. There is even a wide character for space.
> Here are some of them with their ASCII equivalents; I hope your mail
> reader groks them.
>
> .   = 。
> ,   = 、
> " " = 「 」 (note the wide space within the Japanese-style quotes)
>
> I believe newline is the same in Japanese character sets as it is in
> ASCII, and I presume this extends to UTF-8.

I started learning Japanese (日本語) one month ago in my spare time, so I
set up a fairly complete Unicode environment, and as far as I know newline
is indeed the same. Also, I'm very happy using UTF-8 for everything; for
example, matching words with grep still works (as someone pointed out,
"traditional" tools keep working to some extent):

$ echo 'こんいちーわ' > foo
$ echo 'すし' >> foo
$ grep 'し' foo
すし

(Yes, I checked this on recent Linux/BSD and older Solaris systems, and it
still works; the same goes for most text utilities. But don't expect
character ranges in regexps like '[あ-う]' to work, because most apps
assume one byte per glyph. See the Lua sketch in the P.S. below.)

Also, expect Japanese scripts (esp. hiragana and katakana) to take about
half the glyphs used by their transliteration into the Latin alphabet
(romaji). This is not true for every word, but the average saving is around
50%. Here is an example with some random words:

word                    romaji       romaji glyphs  kana glyphs  hiragana
----------------------  -----------  -------------  -----------  --------------
tree                    ki                 2             1       き
sushi                   sushi              5             2       すし
camera                  kamera             6             3       かめら
I                       watashi            7             3       わたし
to be                   desu               4             2       です
newspaper               shinbun            7             4       しんぶん
superficial knowledge   icchihankai       11             7       いっちはんかい

As you can see, Japanese sometimes uses fewer glyphs than English for the
same concept (as in the last example), and even comparing Japanese romaji
to Japanese hiragana, the latter uses about half the glyphs =)

> However, as some of the other readers have pointed out, many of the
> multibyte characters express denser ideas, so the ideas-per-byte ratio is
> probably not too different from European languages. Here are
> some characters the Japanese use frequently, with their English
> equivalents. I have chosen non-Sino characters to try to make my
> point more relevant to the English-speaking readership.
>
> ☎ or ℡ = Tel (when listing telephone numbers)
> a 〜 b  = a to b, or from a to b

Totally agree here; "superficial knowledge" may even be written as 一知半解,
which is only 4 glyphs compared to the 21 used in English!

Just my two cents. I would really appreciate Unicode support in Lua. I vote
for enforcing UTF-8 as the encoding for source files. Python's approach is
somewhat hackish: it tries to detect the encoding from a special comment on
one of the first two lines of code, like '# -*- coding: utf-8 -*-'. It
works, but I think it's quite awkward...

Cheers,

--
User: I'm having problems with my text editor.
Help desk: Which editor are you using?
User: I don't know, but it's version VI (pronounced: 6).
Help desk: Oh, then you should upgrade to version VIM (pronounced: 994).
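P.S. In case it helps the discussion, here is a minimal Lua 5.1 sketch of
how I count the glyphs in the table above and why byte-oriented patterns
misbehave on multibyte text. The utf8_len name is just for illustration,
and the file must itself be saved as UTF-8 for the literals to survive:

-- Count UTF-8 code points ("glyphs") by counting the bytes that are
-- NOT continuation bytes (continuation bytes are 0x80-0xBF = 128-191).
local function utf8_len(s)
  local _, n = s:gsub("[^\128-\191]", "")
  return n
end

print(utf8_len("sushi"))      --> 5
print(utf8_len("すし"))       --> 2   (while #"すし" is 6 bytes)
print(utf8_len("しんぶん"))   --> 4

-- Plain (non-pattern) matching works, because it is just a byte search:
print(("すし"):find("し", 1, true))   --> 4  6   (byte offsets, not glyphs)

-- But a character class only sees single bytes, so a "range" like
-- [あ-ん] does not match whole hiragana glyphs:
print(("すし"):match("[あ-ん]"))      --> a single stray byte, not a glyph

The plain find succeeds because UTF-8 substrings are just byte strings; the
character class fails because Lua patterns operate on single bytes, which
is exactly the "one byte per glyph" assumption mentioned above.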