The Cover Pages: The Online Resource for Markup Language Technologies
Language Identifiers in the Markup Context

Introduction

[August 29, 2001] Since machines first began processing digitized text, computer users have understood that the machine needed to know what language a text was "in" so as to perform intelligent processing on the text: for spell-checking, indexing, searching, multilingual-context word wrapping, computer-synthesized speech, hyphenation, transliteration, sorting/collation, grammar checking, thesaurus building, machine translation, etc. The computer needs to know about both language and script (writing system) to do the right thing in a multilingual setting. Thus, the use of language codes to assist in machine processing of text is documented in a wide range of specifications, including markup metalanguages (SGML, XML), markup language applications, and software operating systems. Similarly, descriptive cataloging at the subject/metadata level needs to assign labels for linguistic properties of data/text in order to help users restrict their research to appropriate content. In support of interoperable computing solutions and information longevity, it is desirable to use standardized language codes inserted directly into marked-up documents.

As the mass of networked digital information grows ever larger and becomes easily accessible, demand increases for a taxonomy of human languages adequate to support language data classification, categorization, and linguistic annotation. It is now widely recognized that the ISO standards providing "codes for the representation of names of languages" (ISO 639, ISO/DIS 639-1, ISO 639-2) are inadequate to meet the application requirements being levied by users in a growing number of domains. Librarians and archivists cataloging written and aural language materials from minority languages may find that the 136 codes of ISO 639:1988, or even the 400+ codes of ISO 639-2:1998, are too few to support metadata description. Linguists applying language codes at a low level within natural language texts may discover that the ISO codes do not sufficiently distinguish regional, social, or dialectal variation. Data providers in general fields may find that the code identifiers used in the two largest projects -- with classification for 7,000 or 70,000 languages/dialects -- are too heavy for their purposes. Granularity, genetic reconstruction, and language groupings are but a few of the challenges facing design teams in an endeavor to create a (sic!) theory-neutral language code vocabulary.

Despite these problems inherent to any classification endeavor, petitions are now being heard in many quarters for collaborative effort toward the creation of better language identification formalisms that account for the richness of human language -- increasing the number of language codes and their descriptiveness along several language-property axes. A new work item approved by ISO earlier in 2001, for example, addresses the need for an International Standard with mechanisms for encoding language variation through time, geography, dialectal variation, writing system, and so forth. An initial proposal calls for codes supporting representation of the language along at least five axes: "geog (geographical specification), script (writing system), temp (temporal specification), socli (sociolinguistic specification), and style (stylistic specification)."

This document supplies a collection of references to publications and projects relating to language identification. The goal is multipurpose: (1) to save time for readers who wish to know more about language identification in the markup context; (2) to raise awareness of the importance of language identification; (3) to urge support for standards efforts which will be required to continue the process of requirements gathering and database design reflecting a rigorous intellectual approach to the problems.

Please send corrections/additions via email. -- Robin Cover

Language Code Listings

ANSI/NISO Codes for the Representation of Languages for Information

Codes for the Representation of Languages for Information Interchange. 'ANSI/NISO Z39.53-2001.' Revision of ANSI/NISO Z39.53-1994. An American National Standard Developed by the National Information Standards Organization. Approved August 31, 2001 by the American National Standards Institute. Published by the National Information Standards Organization: NISO Press, Bethesda, Maryland, U.S.A. Maintenance Agency: US Library of Congress. ISSN: 1041-5653. 24 pages. "A standardized 3-character code to indicate language in the exchange of information is defined. Codes are given for languages, contemporary and historical." Source URL as of 2003-09; see also the reference URL.

[March 13, 2001] Codes for the Representation of Languages for Information Interchange. ANSI/NISO Z39.53-200X. ISSN:1041-5653, Revision of ANSI/NISO Z39.53-1994. 24 pages. A Draft American National Standard Developed by the National Information Standards Organization. Status: For Ballot February 9, 2001 - March 23, 2001. [see preceding; broken link removed]

The specification states: "A standardized 3-character code to indicate language in the exchange of information is defined. Codes are given for languages, contemporary and historical. The purpose of this standard is to provide libraries, information services, and publishers a standardized code to indicate language in the exchange of information. This standard for language codes is not a prescriptive device for the definition of language and dialects but rather a list reflecting the need to distinguish recorded information by language." From the Foreword: "This standard was originally prepared by Standards Committee C, Language Codes, which was organized in 1979. Charged with 'providing a standard code for indicating languages for information interchange purposes,' the committee produced a standard based on the list of MARC language codes developed by the Library of Congress in cooperation with the National Agricultural Library and the National Library of Medicine. This code list is now published as the MARC Code List for Languages. Practical application of the MARC language codes has shown that in order to serve as an appropriate retrieval device for information, a standard list of language codes must reflect the linguistic content of the universal collection to which it is applied, with language codes assigned as needed to distinguish information in a given language or group of languages. The MARC language codes constitute such a list.
The committee's decision to base the standard on the existing MARC list took into account these contributing factors: (a) several years' successful application of the MARC language codes resulting in many millions of bibliographic records containing the accepted MARC codes, (b) the mnemonic relationship of the MARC codes to the English language names of the languages with English being the operational language of most American libraries, information services, and publishers, and (c) the flexibility inherent in a three-character code. The MARC list may be consulted for references from alternative forms of language names, as well as for the assignments to collective codes of languages for which individual codes have not been established. This revised edition reflects a thorough review of the document and includes changes which are a result of requests and demonstrated need from users and implementors. In addition, it includes numerous changes necessary for compatibility with bibliographic language codes in ISO 639-2 (Codes for the representation of names of languages: Alpha-3 code). The MARC code list is kept consistent with both ANSI/NISO Z39.53 and ISO 639-2/B." See the main description, the comment form, and a cover memorandum. Contact: NISO, 4733 Bethesda Ave, Suite 300, Bethesda, MD 20814; Fax: 301-654-1721; Email: nisohq@niso.org.

Description: "The language codes are designed to be used: (1) To designate the languages in which documents are or have been written or recorded; (2) To designate the languages in which document handling records (order records, bibliographic records, and the like) have been created. Language codes are not designed to be used: (1) To designate machine programming languages (FORTRAN, BASIC, and the like); (2) To distinguish languages from dialects. The dialect of a language is usually represented by the same language code as that used for the language... Each code comprises three roman alphabet characters. Codes generally were created using three characters usually based on an English form of the language name or, in some cases, a vernacular form of the corresponding language name. Future development of language codes will be based, whenever possible, on the vernacular form of the language, unless another language code is requested by the country or countries using the language. The codes are varied where necessary to resolve conflicts... Language codes are assigned either to individual languages or to related groups of languages. The level of specificity of the language code assigned is determined in each case to be the level necessary to maintain the utility of the standard based on the volume of documents or document handling records that have been or are expected to be written, recorded, or created. Levels of specificity represented by the language codes include: (1) Language codes for individual languages; (2) Collective language codes for linguistically or otherwise related groups of languages; (3) Collective language codes for linguistically or otherwise related groups of languages having individual language codes for some but not all languages so related. This standard does not indicate which level of specificity is represented by each code.
The word 'languages' or 'other' as part of a descriptor may be taken to indicate that a language code is a collective language code. A collective language code is not intended to be used when an individual language code or another more specific collective language code is available." [cache]
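As a rough illustration of the description quoted above, the following Python sketch checks that a candidate code has the three-roman-letter shape and applies the descriptor heuristic for spotting collective codes. The function names and the sample descriptors are hypothetical; they are not taken from the standard's code list, and treating codes as case-insensitive ASCII is an assumption made here for simplicity.

```python
import re

# Shape described in the standard: three roman alphabet characters.
# Assumption: ASCII letters only, compared case-insensitively.
CODE_SHAPE = re.compile(r"[a-z]{3}")


def has_code_shape(code: str) -> bool:
    """Return True if the string has the three-letter shape of a language code."""
    return CODE_SHAPE.fullmatch(code.lower()) is not None


def looks_collective(descriptor: str) -> bool:
    """Heuristic from the description: the word 'languages' or 'other'
    in a descriptor suggests a collective language code."""
    words = re.findall(r"[a-z]+", descriptor.lower())
    return "languages" in words or "other" in words


if __name__ == "__main__":
    # Hypothetical codes and descriptors, for illustration only.
    print(has_code_shape("abc"), has_code_shape("ab1"))
    print(looks_collective("Example languages"), looks_collective("Examplish"))
```

The check is purely syntactic: it says nothing about whether a given three-letter string has actually been assigned in the code list.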

[1994] National Information Standards Organization. Codes for the Representation of Languages for Information Interchange (ANSI/NISO Z39.53-1994). Bethesda, MD: NISO Press [for NISO], 1994. ISBN: 1-880124-10-6. ISSN: 1041-5653. Overview: "The National Information Standards Organization (NISO) has published a revised standard for language codes. Codes for the Representation of Languages for Information Interchange (ANSI/NISO Z39.53-1994) is used by libraries, information services, and publishers as the standard for designating languages in which documents or document handling records (such as order records or bibliographic records) have been created. The revised standard reflects a thorough review of the 1987 edition and includes many changes requested by users. Codes have been added for 28 languages or language groups previously not represented. The list codifies names for 399 languages. Numerous minor changes also have been made to reflect current accepted usage in language names. The USMARC Code List for Languages is kept consistent with ANSI/NISO Z39.53 and will be revised to incorporate the changes in this new edition." [from a NISO-L news announcement; see the complete text for details.] The standard was approved on September 21, 1994, by the American National Standards Institute. It was developed for NISO by an ad hoc working group composed of John Byrum (Chair), Rebecca Guenther, Sally H. McCallum, and Millicent Wewerka. It is a revision of ANSI Z39.53-1987. The 399 language codes are for contemporary and historical languages. The codes are based (largely) upon an existing MARC list of language names, where the MARC language codes have been used in the cataloging of millions of bibliographic works in a library setting. See unofficially: NISO 3-character language codes (Z39.53-1994), [mirror copy]. 
Also, for several proposed additions and deletions to Z39.53-1994, approved as of January 1997: see the update to USMARC Code List for Languages from November 15, 1996: "Any changes listed below [in this MARC code list] that were not included in Z39.53 will be incorporated at the next revision of that standard"; [mirror copy].

Ethnologue

[February 13, 2002] Ethnologue resources for language codes, announced by Peter Constable:

Ethnologue: Languages of the World, Fourteenth Edition (by Barbara and Joseph Grimes) is available in print, CD-ROM, and online web formats. Published by SIL International, this work represents anthropological and linguistic survey work conducted over many years, resulting in a collection of some 6,809 language descriptions listed by country, 41,791 alternate names and dialect names, 109 language family trees, together with 345 overviews of language situations. The work includes available alternate names, dialects, number of speakers, multilingualism, and other demographic and sociolinguistic information. The relevance of this language code index and database is described in the paragraphs below. See the Ethnologue language code index and description of the print version.

[September 07, 2000] "Language identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale." Paper presented by Peter Constable and Gary Simons (SIL International) at the Seventeenth International Unicode Conference (IUC17), September 07, 2000, San Jose, CA. "Information technologies, particularly the internet, are rapidly becoming more global in focus. At the same time, and partly as a result, economic development is quickly expanding in many previously lesser-developed regions of the world. One of the implications of this is that IT systems are being confronted with the challenges of the world's ethno-linguistic diversity. Considerable and productive effort is being made to create adequate I18N infrastructures for issues such as text encoding and processing in IT systems. Yet at the same time, infrastructures for dealing with issues of language and locale identification are lagging behind user needs. The connection between how text is encoded and how it should be processed cannot be properly closed until the language identification problem is solved, since so many aspects of text processing (like collating and spell-checking) are language specific. At present we are confronted with an issue of scale. The leading standard for addressing language identification, ISO 639-2, offers codes to identify approximately 450 languages. In fact, the number of languages spoken in the world today exceeds 6000, as is documented in SIL's online catalogue of the world's languages. The problem is that the world's linguistic diversity is at the same time very complex but well understood by relatively few. In this paper, we will explore the world's ethno-linguistic diversity, its challenges for IT, and some directions in which we can move forward toward solutions.
In particular, we will (1) give an overview of the world's ethno-linguistic diversity; (2) discuss some of the inherent difficulties in devising systems of language and locale identification; (3) examine some existing IT practices and their successes and limitations; and (4) present work that SIL is doing in relation to language identification that can provide at least part of a needed solution for global IT systems."

[September 28, 2000] "Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale." By Peter Constable and Gary Simons. In SIL Electronic Working Papers. Reference: SILEWP 2000-001. September 2000. 22 pages. Keywords: ISO 639, RFC 1766, internationalization, I18N, linguistic diversity, web development, XML, language identification, information technology (IT). [A revised version of a paper that was presented at the 17th International Unicode Conference in San José, California in September, 2000, and which appears in the conference proceedings.] "Many processes used within information technology need to be customized to work for specific languages. For this purpose, systems of tags are needed to identify the language in which information is expressed. Various systems exist and are commonly used, but all of them cover only a minor portion of languages used in the world today, and technologies are being applied to an increasingly diverse range of languages that go well beyond those already covered by these systems. Furthermore, there are several other problems that limit these systems in their ability to cope with these expanding needs. This paper examines five specific problem areas in existing tagging systems for language identification and proposes a particular solution that covers all the world's languages while addressing all five problems." [...] "The information technology (IT) industry has been driven in recent years to address problems of multilingualism and internationalization. This has been driven to a significant extent by the growth of the Internet. Rapidly increasing economic development throughout the world, together with the growth of the 'Net, has actually resulted in a significant increase in the number of languages that technologies need to support.
In many parts of the world, speakers of previously 'unknown' languages (that is, unknown to speakers of 'major' languages) are beginning to make their mark on the World Wide Web, and are using their own languages to do so. Even apart from the Internet, communities of speakers of lesser-known languages are using technology to pursue linguistic development of their communities through literacy, literature development and other means. In addition, researchers such as linguists and anthropologists, development and relief organizations, and governments are pursuing interests involving thousands of different linguistic and ethnic communities around the world. In this work, they are seeking to make use of current information technologies, such as Unicode and XML. . . [Problem of scale:] The need for systems to cover thousands of languages is real, not merely hypothetical. For instance, SIL has been involved in projects in some 1,600 different languages, of which about 1,100 are current, and new projects are begun regularly. Thus, just within SIL, we have an immediate need for over 1,600 identifiers that conform to RFC 1766 for use within XML documents. We are aware of several other agencies that have similar, vastly multilingual needs, such as the Linguistics Data Consortium, the Linguist List, the Endangered Language Fund, UNESCO, various departments of the U.S. and other governments, and others. When we add the work of other institutions, individual linguists and the language communities themselves, the existing needs for language identifiers are considerably greater, and are only continuing to grow. As stated earlier, every language in the world represents a real need for a unique language identifier. When confronted with needs for thousands of language identifiers, we find that some existing systems do not scale well. There is the obvious problem of devising several thousand new tags. 
There are other problems with scaling, however, due either to the mechanism that a system uses for tags, or to the procedures for extending the coverage of a system. We will consider each of these in turn..." Also in PDF format. [cache]
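To make the phrase "identifiers that conform to RFC 1766" concrete, here is a minimal Python sketch of the RFC 1766 tag grammar, in which a tag is a primary tag of one to eight ASCII letters followed by zero or more hyphen-separated subtags of the same shape, and the primary tag "x" is reserved for private use. The helper `private_use_tag`, the namespace `example`, and the three-letter code `aaa` are hypothetical placeholders and are not drawn from the paper.

```python
import re

# RFC 1766 grammar: Language-Tag = Primary-tag *( "-" Subtag ),
# where Primary-tag and Subtag are each 1 to 8 ASCII letters.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def is_rfc1766_tag(tag: str) -> bool:
    """Return True if the tag is well-formed under the RFC 1766 grammar."""
    return RFC1766_TAG.fullmatch(tag) is not None


def private_use_tag(namespace: str, code: str) -> str:
    """Build a private-use tag of the form x-<namespace>-<code>.

    Hypothetical helper: the namespace and code are whatever the
    tagging party chooses; nothing here consults a registry.
    """
    tag = f"x-{namespace}-{code}".lower()
    if not is_rfc1766_tag(tag):
        raise ValueError(f"not a well-formed RFC 1766 tag: {tag!r}")
    return tag


if __name__ == "__main__":
    print(is_rfc1766_tag("en-US"))            # well-formed
    print(is_rfc1766_tag("de-1996"))          # digits are not allowed in RFC 1766 subtags
    print(private_use_tag("example", "AAA"))  # placeholder namespace and code
```

Well-formedness is only a syntactic property; whether a tag is registered or meaningful to a receiving application is a separate question, which is part of the scaling problem the paper describes.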

[August 27, 2001] "Mapping Between ISO 639 and the SIL Ethnologue. Principles Used and Lessons Learned." By Peter Constable and Gary Simons (SIL International). August 9, 2001. 17 pages. "There is a growing consensus that ISO standards for language identification are not meeting current and future industry needs, and that new work should be done to enhance these standards. Various extensions have been considered, including the following: (1) Provide more comprehensive coverage for the world's languages, including the thousands of lesser-known languages that have been attested. (2) Provide more comprehensive coverage for language collections, specifically collections based on genetic language relationships. (3) Provide systems for extending language identifiers to create identifiers for paralinguistic categories, such as writing system, or identifiers for language varieties based on factors such as style, geographic region, or time period... We have endeavoured to provide a definitive statement of how the ISO 639-1 and ISO 639-2 codes map to and from the SIL Ethnologue. We consider it acceptable to use the Ethnologue for this purpose. The Ethnologue is not a perfect representation of all the world's languages. Indeed, such a goal is impossible in principle. The Ethnologue is, nevertheless, among the most complete and generally reliable compilations of information on the world's languages available today. The Ethnologue has identified languages with some form of operational definition for language in mind, one based on a primary criterion of mutual non-intelligibility, and this definition has been applied with at least some level of consistency across languages.
In spite of its limitations, the Ethnologue has become a de facto standard among many users because of its completeness of coverage, because the complete inventory of languages and the wealth of supporting information is readily accessible on the Web, and because it has been deemed by these users to warrant a sufficient level of their confidence... The Ethnologue's inventory and identifiers have been used in a number of research efforts and publications conducted by various agencies. They have also been adopted as the basis for language identification by the Linguist List, the Open Language Archive Community [OLAC], and the Rosetta Project... The Ethnologue assigns a unique three-letter code for each language within its scope. Three features that make it particularly useful are that it is a single source providing comprehensive coverage of all modern, natural languages; that each of its identifiers represents the same type of category (namely, a language, as understood in terms of the operational definition it assumes); and that the denotation of each identifier is well documented and readily accessible on a public Web site. By presenting a thorough and detailed mapping of ISO code elements to languages enumerated in the Ethnologue, we can effectively provide an explicit statement as to what type of category each of the ISO code elements represents and what they denote... We have presented our proposed mappings in HTML pages that are available online, along with an analysis ["Analysis of ISO 639-2 to Ethnologue Mappings"]. We acknowledge, though, that definitive mappings can only be specified by the owners of the ISO 639-x standards since they are the ones who determine what normative definitions apply to the standards... In this paper, we outline the principles by which we determined how to map ISO 639-x code elements to languages listed in the Ethnologue. 
In the course of our work, it was necessary to make judgments regarding what the ISO code elements denote, and in so doing we were able to compile in specific detail a number of issues that need to be considered in relation to the ISO standards as they exist at present." See similarly "An Analysis of ISO 639: Preparing the Way for Advancements in Language Identification Standards," presented at the Twentieth International Unicode Conference (IUC20) (January 28-31, 2002, Washington DC, USA). [source]

IETF RFCs (RFC 5646, 5645, 4646, 4647, 3066, 1766)

IETF Working Group and Discussion List:

[September 09, 2009] Tags for Identifying Languages. Edited by Addison Phillips (Lab126) and Mark Davis (Google). IETF RFC 5646, BCP 47. Precursors of this document include RFC 4646, RFC 4647, RFC 3066, and RFC 1766. Source text, HTML. Credits to Stephane Bortzmeyer, Karen Broome, Peter Constable, John Cowan, Martin Duerst, Frank Ellerman, Doug Ewell, Deborah Garside, Marion Gunn, Alfred Hoenes, Kent Karlsson, Chris Newman, Randy Presuhn, Stephen Silver, Shawn Steele, and many, many others... "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user-defined extensions for private interchange..." See also Update to the Language Subtag Registry (RFC 5645). Comment: see the blog article "New Language Tag Specification, RFC 5646, Published" by Richard Ishida (W3C).
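The "structure" of a language tag mentioned in the abstract can be illustrated with a deliberately simplified Python sketch. It splits a tag into language, script, region, and variant subtags using the subtag lengths given in RFC 5646, and applies the conventional casing (lowercase language, title-case script, uppercase region). It is an assumption-laden subset, not a conformant parser: extended language subtags, extensions, private-use sequences, and grandfathered tags are not handled, and no registry lookup is performed. The names `LangTag` and `parse_simple_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]


# Simplified subset of the RFC 5646 syntax (no extlang, extensions,
# private use, or grandfathered tags).
_SIMPLE_TAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3})                  # 2- or 3-letter language subtag
    (?:-(?P<script>[a-z]{4}))?                # optional 4-letter script subtag
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?       # optional 2-letter or 3-digit region
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # zero or more variants
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_simple_tag(tag: str) -> Optional[LangTag]:
    """Split a tag into its subtags, or return None if it falls outside the subset."""
    m = _SIMPLE_TAG.fullmatch(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    variants = tuple(v.lower() for v in m.group("variants").split("-") if v)
    return LangTag(
        language=m.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
        variants=variants,
    )


if __name__ == "__main__":
    for sample in ("zh-Hant-TW", "sl-rozaj", "de-1996", "EN-us"):
        print(sample, "->", parse_simple_tag(sample))
```

A production implementation would validate each subtag against the IANA Language Subtag Registry rather than relying on shape alone.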

[July 07, 2006] Matching of Language Tags. Edited by Addison Phillips (Yahoo! Inc) and Mark Davis (Google). Produced by members of the Language Tag Registry Update (LTRU) Working Group, in the IETF Applications Area. See the (unofficial) announcement of the IESG's approval for publication. In November 2005, the IESG approved the Tags for Identifying Languages document as a BCP and Initial Language Subtag Registry as an Informational RFC. Martin Duerst (Aoyama Gakuin University), co-chair of IETF's Language Tag Registry Update (LTRU) Working Group, announced that the IETF has approved version 15 of the "Matching of Language Tags" draft for publication. This document, together with version 14 of the companion "Tags for Identifying Languages" (now in RFC Ed Queue) will be published as an RFC and replace RFC 3066 ("Tags for the Identification of Languages"), which replaced RFC 1766. Currently, RFC 3066 or its successor is referenced normatively by XML 1.1 and other markup standards for constructing language identification tags. Knowledge about the particular language used by some piece of information content might be useful or even required by some types of processing; for example spell-checking, computer-synthesized speech, Braille transcription, or high-quality print renderings. One means of indicating the language used is by labeling the information content with an identifier or 'tag'. The IETF document 'Tags for Identifying Languages' describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user defined extensions for private interchange.
The document 'Matching of Language Tags' defines a syntax (called a language range) for specifying items in the user's list of language preferences (called a language priority list), as well as several schemes for selecting or filtering sets of language tags by comparing the language tags to the user's preferences. Applications, protocols, or specifications will have varying needs and requirements that affect the choice of a suitable matching scheme. It describes: how to indicate a user's preferences using language ranges; three schemes for matching these ranges to a set of language tags; and the various practical considerations that apply to implementing and using these schemes...
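Two of the matching schemes can be sketched briefly in Python: basic filtering, which returns every tag that a language range matches, and lookup, which walks a language priority list and progressively truncates each range until exactly one best tag is found. This is a minimal sketch under stated assumptions, not a complete implementation: extended filtering is omitted, the function names are hypothetical, and tags are assumed to be well-formed already.

```python
from typing import Iterable, List, Optional


def basic_filter(language_range: str, tags: Iterable[str]) -> List[str]:
    """Basic filtering: a range matches a tag if it equals the tag or is a
    prefix of it ending at a subtag boundary; '*' matches every tag."""
    rng = language_range.lower()
    return [
        t for t in tags
        if rng == "*" or t.lower() == rng or t.lower().startswith(rng + "-")
    ]


def lookup(priority_list: Iterable[str], tags: Iterable[str],
           default: Optional[str] = None) -> Optional[str]:
    """Lookup: for each range in priority order, drop trailing subtags until
    a tag matches exactly; fall back to the default if nothing matches."""
    available = {t.lower(): t for t in tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard range is skipped in lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            subtags.pop()
            # A single-character subtag (such as "x") cannot end a tag,
            # so it is removed along with the subtag that followed it.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


if __name__ == "__main__":
    print(basic_filter("de", ["de", "de-CH", "den", "en-US"]))
    print(lookup(["fr-CA", "en"], ["en", "fr", "de"]))
    print(lookup(["ja"], ["en"], default="en"))
```

Filtering suits cases where a set of results is wanted (for example, listing all available documents in a language), while lookup suits cases where exactly one item must be chosen (for example, picking a single localized resource).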

"Matching Language Identifiers." Edited by Addison Phillips (Quest Software) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-ietf-ltru-matching-00'. May 13, 2005, expires November 14, 2005. 20 pages. "This document describes different mechanisms for comparing and matching the tags for the identification of languages defined by RFC 3066bis."

"Tags for Identifying Languages." Edited by Addison P. Phillips (Quest Software) and Mark Davis (IBM). IETF Network Working Group. Internet Draft, reference 'draft-ietf-ltru-registry-00'. March 10, 2005, expires September 11, 2005. 44 pages.

[February 28, 2005] IESG Announces Proposed IETF Working Group for Language Tag Registry Update. The Internet Engineering Steering Group (IESG) has announced the submission of a proposal for a new IETF Working Group for 'Language Tag Registry Update' in the IETF Applications Area. The Steering Group requests comment on this proposal through March 2, 2005; it is expected that the creation of the Working Group will be discussed at the IESG teleconference on March 3, 2005. The proposed Working Group would continue technical work on matters related to RFC 1766/RFC 3066 language tags, currently under discussion in the 'ietf-languages' list. RFC 3066, published in 2001, "describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags." RFC 3066 language tags are used in a wide range of computing applications, and particularly in (meta-) markup languages (XML, HTML), to provide language attributes. Computing machines need to know what language a text is "in" so as to perform intelligent processing on encoded text: for spell-checking, indexing, searching, multilingual-context word wrapping, computer-synthesized speech, hyphenation, transliteration, sorting/collation, grammar checking, thesaurus building, machine translation, etc. The computer needs to know about both language and script (writing system) to do the right thing in a multilingual setting. Several individual Internet Drafts have been prepared as a successor to RFC 3066, including the February 14, 2005 two-part version composed of Tags for Identifying Languages and Matching Language Identifiers, edited by Addison P. Phillips and Mark Davis. Review by various parties in the IETF context has pointed out a number of remaining complications stemming from dependencies upon other standards bodies and maintenance agencies (scripts, countries). 
These would be addressed within the proposed IETF Working Group.
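As an illustration of how language tags serve as language attributes in markup, the following Python sketch uses the standard library's ElementTree to report the language in effect for each element of a small XML document, with a child element inheriting xml:lang from its nearest labeled ancestor. The sample document and the function name `iter_with_language` are hypothetical and are shown only to make the mechanism concrete.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document.
SAMPLE = (
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="fr-CA">Bonjour</p>'
    "</doc>"
)


def iter_with_language(
    elem: ET.Element, inherited: Optional[str] = None
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (element name, language tag in effect) in document order.

    An element without its own xml:lang inherits the value from its parent.
    """
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from iter_with_language(child, lang)


if __name__ == "__main__":
    for name, lang in iter_with_language(ET.fromstring(SAMPLE)):
        print(name, lang)
```

A processor such as a spell-checker or speech synthesizer could use the resolved tag for each element to select language-specific behavior.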

[February 14, 2005] "Tags for Identifying Languages." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). Also available in HTML format with hyperlinks. IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-10'. February 14, 2005, expires August 15, 2005. 45 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user defined extensions for private interchange. This document obsoletes RFC 3066 (which replaced RFC 1766)."

[February 14, 2005] "Matching Language Identifiers." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langmatching-00'. February 14, 2005, expires August 15, 2005. 15 pages. "This document describes different mechanisms for comparing and matching the language identifiers defined by RFC3066bis. Possible algorithms for language negotiation and content selection are described. Portions of this document obsolete RFC 3066."

[December 08, 2004] "IESG Announcement: Last Call for 'Tags for Identifying Languages' to BCP." - "The IESG has been considering 'Tags for Identifying Languages' [draft-phillips-langtags-08.txt] as a BCP. There have been considerable changes to the document since the initial last call, and the IESG would like the community to consider the changes. In addition, the authors have prepared text describing why this mechanism is needed as a replacement for the existing procedure... The IESG plans to make a decision in the next few weeks, and solicits final comments on this action." Reasons for Enhancing RFC 3066: "RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences. This specification proposes enhancements to RFC 3066. Because revisions to RFC 3066 therefore have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement. This specification, the proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years: (1) Stability of the underlying ISO standards; (2) Accessibility of the underlying ISO standards for implementers; (3) Ambiguity of the tags defined by these ISO standards; (4) Difficulty with registrations and their acceptance; (5) Identification of script where necessary; (6) Extensibility. The stability, accessibility, and ambiguity issues are crucial..."

[November 15, 2004] "Tags for Identifying Languages." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). Also available in HTML format with hyperlinks. IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-08'. November 9, 2004, expires May 10, 2005. 46 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)." Editor's note: "You should note that we think that this will be very near to the final version of this document. As such we have created an external document describing in very broad terms the design and design decisions made in hopes of better documenting the whys-and-wherefores for potential implementers. This document is available for public comment..." See the announcement for Draft-08. IETF ephemeral source: http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt.

[November 15, 2004] "Reasons for Enhancing RFC 3066." Addison P. Phillips (ed). Inter-Locale. Document for Public Review. "RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences. This specification proposes enhancements to RFC 3066. Because revisions to RFC 3066 therefore have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement. The proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years: (1) Stability of the underlying ISO standards; (2) Accessibility of the underlying ISO standards for implementers; (3) Ambiguity of the tags defined by these ISO standards; (4) Difficulty with registrations and their acceptance; (5) Identification of script where necessary; (6) Extensibility. The stability, accessibility, and ambiguity issues are crucial. Currently, because of changes in underlying ISO standards, a valid RFC 3066 language tag may become invalid (or have its meaning change) at a later date. With much of the world's computing infrastructure dependent on language tags, this is simply unacceptable: it invalidates content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever... The authors of this specification have worked for the past year with a wide range of experts in the language tagging community to build consensus on a design for language tags that meets the needs and requirements of the user community. 
Language tags form a basic building block for natural language support in computer systems and content. The revision proposed in this specification addresses the needs of this community of users with a minimal impact on existing content and implementations, while providing a stable basis for future development, expansion, and improvement..."

[October 17, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-07'. October 7, 2004, expires April 7, 2005. 46 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)." See Inter-Locale Home Page and the HTML format.

[September 10, 2004] "Tags for Identifying Languages." Reference: 'draft-phillips-langtags-06'. "Version -06 has one substantive modification: the ABNF for variant subtags was modified to make four-digit year subtags (such as '1996' and '1901') legal. This change was implemented so that variant subtags that start with a digit can be four characters in length." Also in HTML format.

[August 16, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-05'. August 9, 2004, expires February 7, 2005. 47 pages. 18 references. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)..." See also the HTML version with links. Editor's notes from the announcement: "This document's changes section details the specific alterations in this version of the document. There are not that many substantive changes in this version. The majority of the changes are related to specific comments we received during the last two rounds of review. Also substantial work on the prototype registry between Doug Ewell and the authors (Mark and I) has resulted in a few tweaks to the examples and some rewriting in sections 3.1 and 3.2 (whose order has been swapped). Please review the changes section for specifics. We feel that this draft addresses all of the comments on this list from prior drafts (within the goals we set — which we enumerate now in the changes section). Absent the question of whether there should be a subtag registry at all, we feel that this document is very near its final form. Of course we welcome comments from the community, including vigorous debate where it is necessary, but sincerely hope that we can move forward with this draft with a new Last Call very soon..."

[June 30, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-04'. June 24, 2004, expires December 23, 2004. 42 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)... The language tag is composed of one or more parts: A primary language subtag and a (possibly empty) series of subsequent subtags. Subtags are distinguished by their length, position in the subtag sequence, and content, so that each type of subtag can be recognized solely by these features. This makes it possible to construct a parser that can extract and assign some semantic information to the subtags, even if specific subtag values are not recognized. Thus a parser need not have an up-to-date copy of the registered subtag values to perform most searching and matching operations..." Note: Mark Davis said in v04 "we provide for way for programs to really validate IDs by providing a complete list of all valid subtags... The most substantive issue I'd like to get feedback on is that we still allow in this draft subtags of up to 15 long (for readability), whereas RFC 3066 has a maximum of 8. The question is whether that would cause enough of a problem for older parsers that we should pull back to a maximum of 8..."
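The idea that subtags can be recognized "solely" by length and position can be illustrated with a minimal Python sketch. The exact length rules of draft-04 are not quoted on this page, so the sketch borrows the conventions quoted in the November 2003 entry (four letters for a script, two or three letters for a region, region after script and before variants or extensions). The function name `classify_subtags` and the labels it returns are hypothetical, and no registry lookup is attempted.

```python
def classify_subtags(tag: str) -> list:
    """Label each subtag of a hyphen-separated tag using only length and position.

    Assumption: 4 letters = script, 2 or 3 letters = region, as quoted in the
    November 2003 entry; this is not the full ABNF of any draft.
    """
    subtags = tag.split("-")
    labelled = [(subtags[0], "primary language")]
    past_region = False  # becomes True once a region or later subtag is seen
    for subtag in subtags[1:]:
        letters_only = subtag.isascii() and subtag.isalpha()
        if not past_region and letters_only and len(subtag) == 4:
            kind = "script"
        elif not past_region and letters_only and len(subtag) in (2, 3):
            kind = "region"
            past_region = True
        else:
            kind = "variant or extension"
            past_region = True
        labelled.append((subtag, kind))
    return labelled


if __name__ == "__main__":
    for subtag, kind in classify_subtags("en-Latn-US"):
        print(subtag, "->", kind)
```

Because the labels come from shape alone, the sketch never needs a copy of the registered values, which is the property the draft highlights for searching and matching.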

[June 21, 2004] "Supplementary Codes for RFC 3066bis." By Doug Ewell [WWW]. Announcement posted 2004-06-21. The web page "discusses the use of 'deprecated' ISO 639 codes, 'formerly used' ISO 3166 codes, and United Nations M.49 numeric geographical codes in RFC 3066bis. RFC 3066bis provides a great deal of flexibility, and along with it, some potential for confusion. This page describes the two different sets of region codes, explains the rules on deprecated ISO codes, and shows why the freely available, official code lists aren't enough by themselves to answer all questions. I hope this page will solidify the issues in my own head, explain them for anyone who is still puzzled, and eventually turn into a useful reference for language tag users once the new RFC is approved..."

[June 02, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-03'. June 02, 2004, expires December 1, 2004. 35 pages. Also in PDF format. IETF Source: http://www.ietf.org/internet-drafts/draft-phillips-langtags-03.txt. See the news story: "Tags for Identifying Languages: IESG Issues Last Call Review for IETF BCP."

[April 09, 2004] "Tags for Identifying Languages." By Addison Phillips (webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-02'. April 8, 2004, expires October 7, 2004. 31 pages. "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags, including user defined extensions for private interchange." AP note: "This version contains a few changes based on discussion on this list['ietf-languages@alvestrand.no'], notably it more closely defines the rules for using UN M49 identifiers to resolve ambiguity. It also contains semi-substantial wordsmithing in section 2 which is not substantive, but which does make the rules (we think) clearer and easier to understand..." See also Inter-Locale Home (internationalization content and demos written by Addison Phillips). [PDF]

[February 14, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-01'. February 10, 2004, expires August 11, 2004. [date issues, need to clarify]. See the note explaining what's new: "(1) We removed the key.value structure from extensions. These are now 2 to 32 character alphanum subtags with no defined structure. (2) We added the concept of 'extended language' subtags, to handle the comments by Peter Constable about language relationships stemming from future adoption of ISO639-3. These are also explicitly reserved for future use. (3) We reserved single character subtags explicitly -- these were implicitly reserved by the syntax previously. (4) We revised the ABNF. Note that we have now unified all private use subtags with the same rules. That is the rules are the same for x-gabble and en-Latn-US-x-gabble. (5) We added support for UN country ID numbers (as suggested by John Cowan and others). These were made the 'ambiguity resolution mechanism' of choice for country IDs..." [Addison P. Phillips]

[January 05, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-02'. December 17, 2003; expires June 16, 2004. 29 pages. [See also: draft-phillips-langtags-00 and the announcement.] "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags, including user defined extensions for private interchange..." The ABNF formally specifies the syntax in which "the language tag is composed of one or more parts: A primary language subtag and a (possibly empty) series of subsequent subtags. The sequence of subtags has a specific structure that depends on the length of the subtag to distinguish each tag type." This Internet draft is based upon the earlier RFC 3066: "The main goals were to maintain backward compatibility (so that all previous codes would remain valid); reduce the need for large numbers of registrations; to provide a more formal structure to allow parsing into subtags even where software does not have the latest registrations; to provide stability in the face of potential instability in ISO 639, 3166, and 15924 codes (demonstrated instability in the case of ISO 3166); and to allow for external extension mechanisms. [The specification;] (1) Allows ISO15924 script code subtags and allows them to be used generatively. (2) Adds the concept of a variant subtag and allows variants to be used generatively. (3) Adds an extension mechanism which does not require registration to use. (4) Defines the private use tags in ISO639, ISO15924, and ISO3166 as the mechanism for creating private use language, script, and region subtags respectively. (5) Defines a syntax for private use variant subtags which can be used without registration. 
(6) Defines a process for handling reuse of values by ISO639, ISO15924, and ISO3166 in the event that they register a previously used value for a new purpose. (7) Changes the IANA language tag registry to a language subtag registry..." Note on the ISO 3166 "demonstrated instability": see the entry "Stability of ISO 3166 and other infrastructure standards" under Unicode Technical Committee Public Positions and UTC Resolution 96-M5 (August 26, 2003): "The recent decision by the maintenance agency for ISO 3166 to re-assign 'cs' (formerly Czechoslovakia) to Serbia and Montenegro can cause severe problems. Country codes are a fundamental component of modern computing infrastructure: major operating systems, postal services, business applications, identification and security systems, to name a few. Their stability must be guaranteed. Data that is identified by these codes has a shelf life of decades, not five years. [Recommended corrective actions to take include:] (1) Rescind the re-assignment of the code 'cs' to Serbia and Montenegro at the earliest opportunity available, to minimize the impact; (2) Change the policy to allow the re-use of codes only after a long period of time, such as 100 years..." Davis wrote (2003-08-05) "The major computer systems and standards around the world, including most operating systems, use the two letter country codes. These codes must be stable and unique or data corruption will occur. Simply because a country ceases to exist does not mean that data for that country ceases to exist, nor that new data referring to that previous country cannot be created..." [Note: This document 'Tags for Identifying Languages' updates references given in the following news item 'IETF Draft on Language Tags Defines Mechanism for Private Use Extension'.]

[November 14, 2003] IETF Draft on Language Tags Defines Mechanism for Private Use Extension. An initial public draft of Tags for Languages presented to the IETF Network Working Group builds upon the current IETF RFC 3066 Tags for the Identification of Languages and defines additional mechanisms for private use extension. The Internet Draft also clarifies how private use, registered values, and matching interact. Identifiers known as language tags are authorized for use in XML and many related computing technologies that need to support language-sensitive and locale-based processing. Current practice regarding the creation, registration, and use of language tags is in a considerable state of confusion and "mess," in the experience of localization experts and software engineers. The goal of the new draft is to work toward a new IETF RFC that replaces RFC 3066. The proposed syntax for construction of a language tag provides for designation of language, script, region, variant, and arbitrary extension (using name/value pairs). Under the new proposal, "all 4-letter subtags are interpreted as ISO 15924 alpha-4 script codes from ISO 15924, or subsequently assigned by the ISO 15924 maintenance agency or governing standardization bodies, denoting the script or writing system used in conjunction with this language. All 2-letter and 3-letter subtags are interpreted as ISO 3166 alpha-2 (or alpha-3) country codes from ISO 3166, or subsequently assigned by the ISO 3166 maintenance agency or governing standardization bodies, denoting the area to which this language variant relates. Region tags must occur after any script tags and before any variants or extensions." A further goal of the new RFC is to provide for stable language tags even in the face of ISO instability. 
"To maintain backwards compatibility, there are two provisions to account for instabilities in ISO 639, 3166, and 15924 codes: (1) Ambiguity - in the event that one of these ISO standards reassigns a code that was previously assigned to a different value, the new use of the code will not be permitted and the IANA registry, as soon as practical, will register a surrogate value for the new code, based on the year that the new code assignment was made. (2) Stability - all other ISO codes are valid, even if they have been deprecated; where a new equivalent code has been defined, implementations should treat these tags as identical."

IETF RFC 3066 Tags for the Identification of Languages. IETF Network Working Group. Request for Comments [RFC]: 3066. January 2001. Obsoletes RFC 1766. Category: Best Current Practice. Harald Tveit Alvestrand (Cisco Systems, Weidemanns vei 27, 7043 Trondheim, Norway. Phone: +47 73 50 33 52; Email: Harald@Alvestrand.no). "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags... Meaning of the language tag: The language tag always defines a language as spoken (or written, signed or otherwise signaled) by human beings for communication of information to other human beings. Computer languages such as programming languages are explicitly excluded. There is no guaranteed relationship between languages whose tags begin with the same series of subtags; specifically, they are NOT guaranteed to be mutually intelligible, although it will sometimes be the case that they are. The relationship between the tag and the information it relates to is defined by the standard describing the context in which it appears. [For example:] In markup languages, such as HTML and XML, language information can be added to each part of the document identified by the markup structure (including the whole document itself). For example, one could write <span lang="FR">C'est la vie.</span> inside a Norwegian document; the Norwegian-speaking user could then access a French-Norwegian dictionary to find out what the marked section meant. If the user were listening to that document through a speech synthesis interface, this information could be used to signal the synthesizer to appropriately apply French text-to-speech pronunciation rules to that span of text, instead of misapplying the Norwegian rules..." [source]
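The abstract mentions "a construct for matching such language tags" without quoting it. As a minimal sketch, assuming the language-range rule of RFC 3066 (a range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by a hyphen, with "*" matching any tag), the comparison could look like the following. The function name `range_matches` is hypothetical, and the case-insensitive comparison reflects the RFC's statement that tags are not case sensitive.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Return True if a language range matches a language tag.

    Sketch of prefix matching: "en" matches "en" and "en-US" but not "eng".
    The wildcard "*" is treated as matching every tag.
    """
    wanted = language_range.lower()
    candidate = tag.lower()
    if wanted == "*":
        return True
    return candidate == wanted or candidate.startswith(wanted + "-")


if __name__ == "__main__":
    print(range_matches("fr", "FR"))     # exact match, ignoring case
    print(range_matches("en", "en-US"))  # prefix followed by a hyphen
    print(range_matches("en", "eng"))    # prefix not followed by a hyphen
```

Note that a match says nothing about mutual intelligibility; as the quoted text stresses, tags sharing a prefix are not guaranteed to denote related languages.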

RFC 3066 language tag sources.

"The following rules apply to the primary subtag:

  • All 2-letter subtags are interpreted according to assignments found in ISO standard 639, 'Code for the representation of names of languages' [ISO 639], or assignments subsequently made by the ISO 639 part 1 maintenance agency or governing standardization bodies. (Note: A revision is underway, and is expected to be released as ISO 639-1:2000)
  • All 3-letter subtags are interpreted according to assignments found in ISO 639 part 2, 'Codes for the representation of names of languages -- Part 2: Alpha-3 code [ISO 639-2]', or assignments subsequently made by the ISO 639 part 2 maintenance agency or governing standardization bodies.
  • The value "i" is reserved for IANA-defined registrations
  • The value "x" is reserved for private use. Subtags of "x" shall not be registered by the IANA.
  • Other values shall not be assigned except by revision of this standard.

The reason for reserving all other tags is to be open towards new revisions of ISO 639; the use of "i" and "x" is the minimum we can do here to be able to extend the mechanism to meet our immediate requirements." [cache]
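The primary-subtag rules quoted above amount to a small decision table, sketched here in Python. The function name `classify_primary_subtag` and the returned labels are hypothetical; the sketch only reports which rule applies and does not check whether a given code is actually assigned in ISO 639 or registered with IANA.

```python
def classify_primary_subtag(tag: str) -> str:
    """Report which RFC 3066 primary-subtag rule applies to a language tag."""
    primary = tag.split("-")[0].lower()
    if primary == "i":
        return "IANA-defined registration"
    if primary == "x":
        return "private use"
    letters_only = primary.isascii() and primary.isalpha()
    if letters_only and len(primary) == 2:
        return "ISO 639 two-letter code"
    if letters_only and len(primary) == 3:
        return "ISO 639-2 three-letter code"
    # Anything else is reserved until the standard itself is revised.
    return "reserved"


if __name__ == "__main__":
    # "i-example" is a made-up tag used only to exercise the "i" rule.
    for sample in ("FR", "nor", "i-example", "x-gabble", "abcd"):
        print(sample, "->", classify_primary_subtag(sample))
```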

IETF RFC 1766 Tags for the Identification of Languages. IETF Network Working Group. Request for Comments: 1766. March 1995. Category: Standards Track. By Harald Tveit Alvestrand (UNINETT). "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object. It also defines a Content-language: header, for use in the case where one desires to indicate the language of something that has RFC-822-like headers, like MIME body parts or Web documents, and a new parameter to the Multipart/Alternative type, to aid in the usage of the Content-Language: header." On 'meaning of the language tag': "It would be possible to define (for instance) an SGML DTD that defines a <LANG xx> tag for indicating that following or contained text is written in this language, such that one could write "<LANG FR>C'est la vie</LANG>"; the Norwegian-speaking user could then access a French-Norwegian dictionary to find out what the quote meant... In the primary language tag, all 2-letter tags are interpreted according to ISO standard 639, 'Code for the representation of names of languages' [ISO 639]." RFC 3066, which supersedes and obsoletes RFC 1766, allows 3-letter language tags from ISO 639-2:1998; see preceding. [cache] Contact: ietf-languages@iana.org.

"RFC 3066 Language code assignments." By Michael Everson (Dublin). 2001-08-07 or later. "As language-tag reviewer for RFC 3066, I am maintaining the following table to help users access the codes and information on them. Clicking on the name of the code itself will open the registration document from the IANA website. You can also view the IANA languages directory..."

ISO 3166-1: The Code List. RFC 3066 describes the construction of language tags with ISO 3166 country codes. See the English or the French language version of the country names and Alpha-2 (i.e., two-letter) code elements of ISO 3166-1. See also the ISO 3166 Maintenance Agency (ISO 3166/MA) Home Page.

Two Alternative Proposals for Language Taging in ACAP. IETF Internet Draft. Reference: 'draft-ietf-acap-langtag-00.txt'. June 1997. By Martin J. Dürst (Multimedia-Laboratory, Department of Computer Science University of Zurich). Abstract: "For various computing applications, it is helpful to know the language of the text being processed. This can be the case even if otherwise only pure character sequences (so-called plain text) are handled. From several sides, the need for such a scheme for ACAP has been claimed. One specific scheme, called MLSF, has also been proposed, see 'draft-ietf-acap-mlsf-01.txt' for details. This document proposes two alternatives to MLSF. One alternative is using text/enriched-like markup. The second alternative is using a special tag-introduction character. Advantages and disadvantages of the various proposals are discussed. Some general comments about the topic of language tagging are given in the introduction... Option 1: A Text/Enriched-like Notation for Language Tags (TELT)... "specifies a text/enriched-like notation for language tags, leading to a format similar to text/enriched. It can be used with any character encoding that contains the necessary subset of the US-ASCII character repertoire. Language tags are of the form '<LANG=xxxxx>' where xxxxx is a language tag as defined in [RFC1766], with all letters written in upper case. No whitespace of any kind is allowed between '<' and '>'. Language alternatives are started by '<ALTLANG>'. Again, no whitespace is allowed between '<' and '>'. The use of the character sequences '<LANG=' and '<ALTLANG>' is not allowed in the text itself. Code to convert from this notation to MLSF and back and to test for false positives in plain text search is given in an appendix... Option 2: Language Tags using a Start Tag Character (STLT)... as a method of language tagging is only useable with character encodings that can represent the BMP of the Universal Character Set [ISO10646]. 
For the purpose of illustration, the character PILCROW SIGN (paragraph sign, U+00B6) is used as the tag start character..." Note: The UTR #7 report was reported to be "the result of an intense email discussion regarding language tagging and related issues, occasioned by the review of draft-ietf-acap-mlsf-01.txt and of draft-ietf-acap-langtag-00.txt, which proposed different mechanisms for language tagging in plain text..."
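The TELT notation quoted above (Option 1) is simple enough to sketch a reader for. The following Python is a minimal illustration, not the appendix code the draft refers to. It assumes that a '<LANG=xxxxx>' marker applies to the text that follows it until the next marker (the excerpt does not state the scope), that tags follow the RFC 1766 shape of hyphen-separated groups of one to eight letters, and it ignores '<ALTLANG>' entirely. The names `LANG_MARKER` and `split_telt` are hypothetical.

```python
import re

# Upper-case RFC 1766-style tag: groups of 1 to 8 letters joined by hyphens.
LANG_MARKER = re.compile(r"<LANG=([A-Z]{1,8}(?:-[A-Z]{1,8})*)>")


def split_telt(text: str, default_lang: str = "") -> list:
    """Split TELT-style text into (language tag, text run) pairs."""
    runs = []
    lang = default_lang
    pos = 0
    for marker in LANG_MARKER.finditer(text):
        if marker.start() > pos:
            runs.append((lang, text[pos:marker.start()]))
        lang = marker.group(1)  # assumed to stay in force until the next marker
        pos = marker.end()
    if pos < len(text):
        runs.append((lang, text[pos:]))
    return runs


if __name__ == "__main__":
    for lang, run in split_telt("<LANG=EN>Hello <LANG=FR>C'est la vie"):
        print(repr(lang), repr(run))
```

A marker containing whitespace or lower-case letters is left in the text untouched, which mirrors the draft's requirement that no whitespace appear between '<' and '>' and that tags be written in upper case.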

Multi-Lingual String Format (MLSF). IETF Internet Draft. Reference: 'draft-ietf-acap-mlsf-01.txt'. June 1997. Author: Chris Newman (Innosoft International, Inc.). Abstract: "The IAB charset workshop concluded that for human readable text there should always be a way to specify the natural language. Many protocols are designed with an attribute-value model (including RFC 822, HTTP, LDAP, SNMP, DHCP, and ACAP) which stores many small human readable text strings. The primary function of an attribute-value model is to simplify both extensibility and searchability. A solution is needed to provide language tags in these small human readable text strings, which does not interfere with these primary functions. This specification defines MLSF (Multi-Lingual String Format) which applies another layer of encoding on top of UTF-8 to permit the addition of language tags anywhere within a text string. In addition, it defines an alternate form which can be used to include alternative representations of the same text in different character sets. MLSF has the property that UTF-8 is a proper subset of MLSF. This preserves the searchability requirement of the attribute-value model. Appendix F of this document includes a brief discussion of the background behind MLSF and why some other potential solutions were rejected for this purpose..." [cache]

ISO 639

Overview. The ISO 639 standard provides an official list of the "names of languages" and related language information. ISO 639:1988 presented a set of 136 two-character language codes, while the current revision effort toward ISO 639-1 focuses upon additional two-letter language identifiers. ISO/FDIS 639-1:2001 (Final Draft International Standard) has been completed, and includes about 190 language identifiers; see the note of July 23, 2001 from the TC convener and the provisional listing from the WG web site. ISO 639-2 includes three-letter language codes. From the introduction to ISO 639-2: "ISO 639 provides two sets of language codes, one as a two-letter code set (639-1) and another as a three-letter code set for the representation of names of languages. ISO 639-1 was devised primarily for use in terminology, lexicography and linguistics. ISO 639-2 represents all languages contained in ISO 639-1 and in addition any other language as well as language groups as they may be coded for special purposes when more specificity in coding is needed. The languages listed in ISO 639-1 are a subset of the languages listed in ISO 639-2; every language code in the two-letter code set has a corresponding language code in the alpha-3 list, but not necessarily vice versa. Both code lists are to be considered as open lists. The codes were devised for use in terminology, lexicography, information and documentation (i.e., for libraries, information services, and publishers) and linguistics." ISO 639-2:1998 provides identifiers for about 450 languages.

Update 2004-04: The ISO 639 family of standards is being extended by work in several working groups. See the summary from Håvard Hjulstad as of November 2003, referencing the following:

  • 639-3 Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages
  • 639-4 Codes for the representation of names of languages -- Part 4: Implementation guidelines and general principles for language coding
  • 639-5 Codes for the representation of names of languages -- Part 5: Alpha-3 code for language families and groups
  • 639-6 Codes for the representation of names of languages -- Part 6: Alpha-? code (Possible NWIP)

[September 20, 2004] Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages [Codes pour la représentation de noms de langues -- Partie 3: Code alpha-3 pour un traitement exhaustif des langues]. Prepared by Technical Committee ISO/TC 37, Terminology and other language resources, Subcommittee SC 2, Terminography and lexicography. Draft working document: "an ISO International Standard; it is distributed for review and comment; it is subject to change without notice and may not be referred to as an International Standard." References: ISO TC 37/SC 2 xxx. Date: 2004-09-20. ISO/DIS 639-3.5. ISO TC 37/SC 2/WG 1. From the Scope statement: "This part of ISO 639 provides a code consisting of language code elements comprising three-letter language identifiers for the representation of languages. The language identifiers according to this part of ISO 639 were devised for use in a wide range of applications, especially in computer systems, where there is potential need to support a large number of the languages that are known to have ever existed. Whereas ISO 639-1 and ISO 639-2 are intended to focus on the major languages of the world that are most frequently represented in the total body of the world's literature, this part of ISO 639 attempts to provide as complete an enumeration of languages as possible, including living, extinct, ancient and constructed languages, whether major or minor. As a result, this part of ISO 639 lists a very large number of lesser-known languages. Languages designed exclusively for machine use, such as computer-programming languages, and reconstructed languages are not included in this code. Knowledge of the world's languages at any given time is never complete or perfect. 
Additional language identifiers may be created for this list when it becomes apparent that there is a linguistic variety that is deemed to be distinct from other languages in accordance with the definitions in clause 3 and their elaboration in clause 4. In addition, the denotation of existing identifiers may be revised or identifiers may become deprecated when it becomes apparent that they do not accurately reflect actual language distinctions. In all such changes, careful consideration is given to ensure existing implementations are not adversely affected..." Note also from the Introduction: "The three-letter codes in ISO 639-2 and ISO 639-3 are complementary and compatible. The two codes have been devised for different purposes. The set of individual languages listed in ISO 639-2 is a subset of those listed in ISO 639-3. The codes differ in that ISO 639-2 includes code elements representing some individual languages and also collections of languages, while ISO 639-3 includes code elements for all known individual languages but not for collections of languages. Overall, the set of individual languages listed in ISO 639-3 is much larger than the set of individual languages listed in ISO 639-2..." Source reference: posting of 14-December-2004 by Peter Constable to the IETF-languages mailing list [ietf-languages@alvestrand.no] and to 'unicore@unicode.org'. "Yesterday, I wrote mentioning the DIS ballot for ISO 639-3. Someone asked me offline whether the draft could be obtained somewhere publicly. Unfortunately, ISO TC 37/SC 2 doesn't have a public document repository. However, I did post a draft online on SIL's site: Look for the link to the document at the bottom of the page to ISO_DIS_639-3.5 [Draft 5 of ISO 639-3]. This copy contains the complete draft code tables (one sorted by ID and one by name), but not the French translation. 
I mentioned this link back in August; that was prior to the TC 37/SC 2/WG 1 meeting, and some changes made to the draft after the meeting..." See the posting and the follow-up comment from Håvard Hjulstad: "The code table of the FINAL 639-3 will be made freely available..."

Update 2002-02-27: "Codes for the representation of names of languages -- Part 1: Alpha-2 code. [Codes pour la représentation des noms de langue -- Partie 1: Code alpha-2.]" From ISO/TC 37/SC 2 (Secretariat: SCC). International Standard ISO/FDIS 639-1. Reference: ISO/FDIS 639-1:2002(E/F). Final Draft. 48 pages. Voting begins on 2002-02-28. Voting terminates on 2002-04-28.

ISO 639:1988. ISO 639:1988 was published as the successor to and technical revision of ISO/R 639:1967 Symbols for languages, countries and authorities, withdrawn by ISO TC 37 on 1988-03-01. ISO 639, now sometimes referenced proleptically as 639-1 to distinguish it from ISO 639-2, was produced by the "Terminology (principles and co-ordination)" Technical Committee 37 of the International Organization for Standardization (ISO). ISO/TC 37 began to operate in 1952, and was chartered for "standardization of methods for creating, compiling and co-ordinating terminologies. The objective of ISO/TC 37 was to prepare standards specifying principles and methods for terminology work and terminography within the framework of standardization and related activities." "Its technical work results in International Standards and Technical Reports covering terminological principles and methods as well as various aspects of computer-assisted terminography." More specifically, ISO 639:1988 was produced within Subcommittee 2, "ISO/TC 37/SC 2 Layout of vocabularies," which was tasked "to prepare International Standards concerning terminology work, preparation and layout of terminology standards, coding and codes in the field of terminology, translation-oriented terminography, and terminology management." The ISO 639 standard Code for the representation of names of languages as published in 1988 was said to be "devised primarily for use in terminology, lexicography and linguistics, but they may be used for any application requiring the expression of languages in coded form."

Bibliographic reference: ISO 639:1988 (E/F). Code for the Representation of Names of Languages. First edition, 1988-04-01. Reference number: ISO 639:1988 (E/F). Geneva: International Organization for Standardization, 1988. iii + 17 pages. ISO 639:1988 Code for the representation of names of Languages is under revision to become ISO 639-1. [Current 2001-08] ISO/DIS 639-1 Code for the Representation of Names of Languages - Part 1: Alpha-2 Code is thus a revision of ISO 639:1988.

ISO 639-1. The revision of ISO 639:1988 is ISO 639-1 Code for the representation of names of languages - Part 1: Alpha 2 code / Code pour la représentation des noms de langue - Partie 1: Code alpha-2. ISO/FDIS 639-1:2001 was sent to the ISO Central Secretariat in late June, 2001. ISO 639-1 "consists of language code elements comprising two-letter language identifiers and the respective names of languages represented by these identifiers. The language identifiers according to this standard were devised originally for use in terminology, lexicography and linguistics, but may be adopted for any application requiring the expression of language in two-letter coded form, especially in computerized systems. The alpha-2 code was devised for practical use for most of the major languages of the world that are not only most frequently represented in the total body of the world's literature, but which also comprise a considerable volume of specialized languages and terminologies. Additional language identifiers are created when it becomes apparent that a significant body of documentation written in specialized languages and terminologies exists in a language, for which an alpha-2 identifier is needed, but does not exist yet... Languages designed exclusively for machine use, such as computer programming languages, are not included in this code." The ISO 639-1 Project Leader is Mr. Håvard Hjulstad (RTT - Rådet for teknisk terminologi, Norway). Note 2002-02-27: Balloting on the revised specification ends on 2002-04-28 [HH].

ISO 639-1 Registration Authority. Infoterm [International Information Centre for Terminology] "has been designated the ISO 639-1/RA (Registration Authority) for the purpose of maintaining a register of 2-letter coded names of languages comprised in the International Standard ISO 639-1, Code for the representation of names of languages - Part 1: Alpha 2 code / Code pour la représentation des noms de langues - Partie 1: Code alpha-2. The ISO 639-1/RA receives and reviews applications for the registration of new and for the change of existing language identifiers. See the Criteria for requesting new language codes. The development of the list is carried out by the ISO 639 Joint Advisory Committee (ISO 639/RAs-JAC) in cooperation with the Library of Congress, which functions as the Registration Authority for ISO 639-2, Code for the representation of names of languages - Part 2: Alpha-3 code / Codes pour la représentation des noms de langue - Partie 2: Code alpha-3 (ISO 639-2/RA). A list of information associated with registered language identifiers and updates of registered language codes is maintained by Håvard Hjulstad, Convener of ISO/TC 37/SC 2/WG 1, 'Coding systems'."

ISO 639-1 Registration Authority Contact: International Information Centre for Terminology (Infoterm), Heinerstr. 38, P.O. Box 130, A-1021 Vienna, Austria. Phone: +43-1-74040-442 or 441; Telefax: +43-1-74040-444; Email: infopoint@infoterm.or.at; WWW: http://www.infoterm.org. The SC 2 "Layout of vocabularies" - ISO/TC 37/SC 2 Secretariat may be reached through Ms. Helen Hutcheson [Secretary], Terminology and Standardization, Directorate/Translation Bureau, Public Works and Government Services, Canada; Phone: +1-819-994-5934; Telefax: +1-819-953-9691.

ISO 639 online references:

ISO 639-2:1998 International Standard ISO 639-2 was prepared jointly by Technical Committees ISO/TC 37, Terminology (principles and coordination), Subcommittee SC 2, Layout of vocabularies and ISO/TC 46, Information and documentation, Subcommittee SC 4, Computer applications in information and documentation. Technical Committee 46/Subcommittee 4 (TC46/SC4) is the International Organization for Standardization (ISO) Subcommittee "responsible for technical standards used to facilitate interoperability of information services such as libraries, information centers, indexing and abstracting services, archives, and publishers. These technical standards include standards for information retrieval and interlibrary loan, applications of SGML, data elements directories, data formats, character sets, codes and user commands."

ISO 639-2:1998 has about 400 language codes, depending upon how one counts the "collective" codes that partially duplicate some individual codes. The most noticeable feature is the bibliographic and terminological variants, as explained below; the variants evidently represent the differing needs of the terminologists (TC 37, ISO 639-2/T) and bibliographers (TC 46, ISO 639-2/B).

On [re-]interpretation of B/T variants as "synonyms": Note from Rebecca S. Guenther (Chair, ISO 639 Joint Advisory Committee) on November 09, 2001 relative to the work done in the "10+ years of development of a 3-character code in having alternative codes... The twenty-one (21) alternative codes were necessary to satisfy both constituencies and to deal with the issue of millions of existing records using already established codes. It is always difficult to satisfy everyone in the development of ISO 639, but we are making a valiant attempt. More to the point is that the existing ISO 639-2 list (and ISO 639-1) has been developed for use with written languages, and accommodating variations in spoken languages is a matter for further discussion because of the now broader use of the list... A recent message (I think from John Clews) made a comparison between various codes and listed ISO 639-2/B and ISO 639-2/T codes separately. In discussions at the ISO 639 registration authorities meeting in conjunction with TC37 in August [2001], we all agreed that the few cases where there are alternative codes (only 21 out of 450+) should be considered synonyms, rather than different code sets. Thus the distinction between the bibliographic and terminologic is essentially unimportant. We have since updated the code lists on our Web site to discontinue the use of separate columns for the /B and /T, but rather to list them as synonyms."

From the 'Normative Text' of ISO 639-2:1998:

ISO 639-2 "provides two sets of three-letter alphabetic codes for the representation of names of languages, one for terminology applications and the other for bibliographic applications. The code sets are the same except for twenty-three languages that have variant language codes because of the criteria used for formulating them. The language codes were devised originally for use by libraries, information services, and publishers to indicate language in the exchange of information, especially in computerized systems. These codes have been widely used in the library community and may be adopted for any application requiring the expression of language in coded form by terminologists and lexicographers. The alpha-2 code set was devised for practical use for most of the major languages of the world that are most frequently represented in the total body of the world's literature. Additional language codes are created when it becomes apparent that a significant body of literature in a particular language exists. Languages designed exclusively for machine use, such as computer programming languages, are not included in this code."

Form of the language codes: "The language codes consist of three Latin-alphabet characters in lowercase. No diacritical marks or modified characters are used. Implementors should be aware that these codes are not intended to be an abbreviation for the language, but to serve as a device to identify a given language or group of languages. The language codes are derived from the language name. Two code sets are provided, one for bibliographic applications (ISO 639-2/B), and one for terminology applications (ISO 639-2/T). Criteria for selecting the form of a language code for code set B were: (1) preference of the countries using the language; (2) established usage of codes in national and international bibliographic databases, and; (3) the vernacular or English form of the language. Code set T was based on: (1) the vernacular form of the language, or (2) preference of the countries using the language. There are twenty-three language names that have variant codes assigned depending on the code set chosen..."
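The form rule quoted above (three lowercase Latin letters, no diacritics) and the treatment of B/T variants as synonyms can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the mapping lists a handful of well-known B/T pairs rather than the full set given in the standard.

```python
import re

# Illustrative subset of bibliographic (B) -> terminology (T) variant pairs.
# The complete list is published with ISO 639-2; this table is NOT complete.
B_TO_T = {
    "ger": "deu",  # German
    "fre": "fra",  # French
    "chi": "zho",  # Chinese
    "dut": "nld",  # Dutch
    "gre": "ell",  # Greek, Modern
    "cze": "ces",  # Czech
}

# Three Latin-alphabet characters in lowercase, no modified characters.
ALPHA3_FORM = re.compile(r"[a-z]{3}")


def is_alpha3_form(code: str) -> bool:
    """Check only the surface form of a code, not whether it is assigned."""
    return ALPHA3_FORM.fullmatch(code) is not None


def to_terminology_code(code: str) -> str:
    """Map a B variant to its T synonym; other well-formed codes pass through."""
    if not is_alpha3_form(code):
        raise ValueError(f"not a three-letter lowercase code: {code!r}")
    return B_TO_T.get(code, code)
```

Normalizing to one variant before comparison is one way to honor the "synonyms" reading described by the Joint Advisory Committee; an application holding legacy bibliographic records might equally well normalize in the other direction.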

The US Library of Congress "has been designated the ISO 639-2 Registration Authority for the purpose of processing requests for alpha-3 language codes comprising the International Standard, Codes for the representation of names of languages -- Part 2: alpha-3 code. The ISO 639-2/RA receives and reviews applications for requesting new language codes and for the change of existing ones according to criteria indicated in the standard. It maintains an accurate list of information associated with registered language codes, processes updates of registered language codes, and distributes them on a regular basis to subscribers and other parties."

The LOC web site provides ISO 639 code lists. Codes for the Representation of Names of Languages-Part 2: Alpha-3 Code. With introductions and normative text. The normative text provides the ISO 639-2 Language Codes (with corresponding ISO 639-1 code) arranged alphabetically by:

Other LOC ISO 639/RA resources:

ISO 639 Joint Advisory Committee. The ISO 639 Joint Advisory Committee (ISO 639/JAC) has been established [from ISO TC37/SC2 and ISO TC46/SC4] to advise both the ISO 639-1/RA Registration Authority and the ISO 639-2/RA Registration Authority on the two parts of the International Standard on language codes, Codes for the representation of names of languages -- Part 1: alpha-2 code and Codes for the representation of names of languages -- Part 2: alpha-3 code. The ISO 639/JAC guides the application of the coding rules as laid down in both standards. If you have questions concerning ISO 639 please contact us at: Library of Congress; Network Development and MARC Standards Office; Washington, DC 20540-4402; Email: iso639-2@loc.gov; Phone: +1 202 707 6237; FAX: +1 202 707 0115.

ISO 639 Joint Advisory Committee: Rules of procedure for conducting business. Document ISO 639/JAC N2R. 10-March-2000. "The following documents rules of procedure for the conduct of meetings and email business by the ISO 639 Joint Advisory Committee. It repeats some information that is in ISO 639-2:1998 in the normative Annex A and elaborates where necessary for clarification of procedures. In particular it details how business is run in the absence of regular meetings... ISO 639/JAC is composed of: (1) one representative of the International Information Centre for Terminology (Infoterm; representing ISO 639-1/RA); (2) one representative of the Library of Congress (LC; representing ISO 639-2/RA); (3) three representatives of ISO/TC 37 (nominated by ISO/TC 37); (4) three representatives of ISO/TC 46 (nominated by ISO/TC46). Both ISO/TCs may nominate substitute representatives for a meeting..."

ISO 639 Joint Advisory Committee: Working principles for ISO 639 maintenance. Document ISO 639/JAC N3R. 8-March-2000. "The following documents working principles for the maintenance of language codes by the ISO 639 Joint Advisory Committee both in ISO 639-1 (Alpha-2 code) and ISO 639-2 (Alpha-3 code). It repeats some information that is in ISO 639-2:1998 in section 4 (Language codes) and the normative Annex A. In addition, it gives further details as to how language code changes that are submitted are considered and how the two parts of ISO 639 are related."

Historical [earlier than 2000-04-28; some links may be broken]:

Linguasphere Project

Developed and maintained for many years by David Dalby, Linguasphere "provides a very detailed listing of all the world's languages and many of its dialects, using a new system of classification." Dalby's classification puts the emphasis on verifiable immediate relationships rather than on distant and often hypothetical ones. It features a stable framework of 100 referential zones, each identified by two leading digits, successively refined into six layers of relationship, coded alphabetically, broadly reflecting proportions of shared basic vocabulary. Language/dialect identifiers are constructed as alpha-numeric codes, offering a coded classification of language groups ('sets, chains, nets') and idioms ('outer languages, inner languages, and dialects'). Linguasphere is comparable in some respects to SIL's Ethnologue project; both provide a system of language identifiers with codes for almost all the living languages of the world. [adapted from a book review by Philip Baker]

Web site description:

"The Linguasphere Register of the World's Languages and Speech Communities is the first attempt at a comprehensive and transnational classification of the modern languages and dialects of the world -- and of the communities of humankind. Compiled over several decades by David Dalby (Linguasphere Observatory, London School of Oriental and African Studies and University of Wales, Cardiff), the Register classifies all known languages and dialects on the basis of their closest linguistic relationships, and includes a theoretical and practical discussion and presentation of the linguasphere. A complete index of linguistic and ethnolinguistic names has been prepared by Michael Mann (School of Oriental and African Studies)."

Including over 21,000 'inner languages' and dialects, and a classified index of over 70,000 linguistic and ethnic names, the Register was compiled for the Observatoire Linguistique, an independent non-profit research network created in Quebec in 1983, established in France and coordinated from Wales.

The Linguasphere is composed of two interlocking and evolving strata of human conventions: (1) the total lexical repertoire of humankind, made up of the overlapping and shifting repertoires of all spoken and recorded languages, (2) the global distribution of the overlapping and shifting phonological and grammatical patterns which serve to structure those repertoires. The Register provides a referential framework for monitoring the future welfare of individual communities and their languages, across all national frontiers. Work in the SOAS Departments of Geography and Africa has created an interface between the classificational system of the Register and the language map of Africa, the most complex of all continents. This GIS work, funded by Leverhulme, has been the first step towards creating a global Linguasphere Mapbase, to be compiled and consultable online as a cartographic extension of the Register.

The work of the Observatoire is presented through complementary websites. The main site is www.linguasphere.org, a forum for information on the current development of the world's languages, available in English and partially in French, with further options planned in Chinese and other languages. The Linguasphere Press has a parallel website www.linguasphere.net through which both the Linguasphere Register and the licensed version of Linguasphere Online may now be obtained.

References:

E-MELD Language Codes Workgroup

For general information, see the main topic page: "Electronic Metastructure for Endangered Languages Data (EMELD)."

EMELD Language Lookup Pages:

An E-MELD Language Codes mailing list was set up in July 2001 to support the Electronic Metastructure for Endangered Languages Data Project: 'e-meld-codes@listserv.linguistlist.org'. The E-MELD Project Family of Lists includes the E-MELD Superlist, E-MELD Sublist on Language Codes, and E-MELD Sublist on Language Markup. See the Linguist web site archives.

See the announcement of November 05, 2001 for an initial version of a language database facility which allows one to look up a language, family, or language code; to show all ancient languages in the LINGUIST Database; and to show all constructed languages in the LINGUIST Database.

E-MELD: "To combat the decrease in the number and diversity of languages and to capitalize on a growing store of digitized linguistic data, a team of National Science Foundation (NSF)-funded researchers led by Anthony Aristar at Wayne State University is developing an endangered languages database and a central information server that will allow users to access the material remotely by computer. A $2 million NSF grant to Aristar and his colleagues at Eastern Michigan University, the University of Pennsylvania and the University of Arizona will be used to create this public digital archive. The goals of the Electronic Metastructure for Endangered Languages Data (E-MELD) project are to collect data on endangered languages and to devise a Web-based protocol so that new and existing data will be accessible to researchers and native speakers everywhere..." [From the NSF report of Peter West, cited below. Similarly: "[Wayne State] University receives $2 million grant for endangered languages."]

Input to the E-MELD Language Codes List is recorded in Gene Gragg's "Report on Language Codes Workgroup Recommendations" from the Linguist List workshop. See details in the summary.

References:

Linguist List Genetic Classification Coding Scheme

The Linguist List community "provides a forum where academic linguists can discuss linguistic issues and exchange linguistic information." From the Linguist List, Language Classification Working Group: A Proposed LINGUIST Coding Scheme for Genetic Classification. "One proposed LINGUIST coding scheme for language classification works as follows. Each language family is assigned a 2-letter code. Then, subgroups which attach directly to the highest node are assigned a letter of the alphabet, in alphabetical order, which is appended to the family code. Any subgroups which belong to a lower subgroup are assigned codes in the same way, until all subgroups have been assigned codes. For example, the code for Altaic is AT. The Mongolian node is the first in alphabetical order, and is assigned the letter A. Its code is thus ATA. Tungus is the next in order, and is assigned the letter B, and its code is thus ATB. Turkic is next, and is assigned the code ATC. The subgroups under each of these nodes are assigned codes in the same way. Mongolian has two subgroups, Eastern and Western. Eastern is assigned the code ATAA, Western the code ATAB. Tungus has two groups, Northern and Southern. Northern is assigned ATBA, and Southern ATBB. Each subgroup is assigned a code in the same way. To see the result for some actual data, select one of the families in our database from the form below [see the online document]... Each language is now categorized by two codes: its own unique code (in this case its Ethnologue code) and the code of its immediate subgroup."
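The proposed scheme is a simple recursive labeling of a classification tree. A minimal sketch in Python follows, assuming the tree is given as nested dictionaries; the function name and data layout are illustrative and not part of the LINGUIST proposal. The sample data is the Altaic example from the quoted text.

```python
from string import ascii_uppercase


def assign_codes(family_name, family_code, subgroups):
    """Return a dict mapping each node's path (a tuple of names) to its code.

    `subgroups` is a nested dict: subgroup name -> dict of its own subgroups.
    Siblings receive letters A, B, C... in alphabetical order of their names.
    """
    codes = {(family_name,): family_code}

    def walk(path, prefix, children):
        names = sorted(children)
        if len(names) > len(ascii_uppercase):
            # The quoted description does not say what happens past "Z".
            raise ValueError("more than 26 subgroups under one node")
        for letter, name in zip(ascii_uppercase, names):
            code = prefix + letter
            codes[path + (name,)] = code
            walk(path + (name,), code, children[name])

    walk((family_name,), family_code, subgroups)
    return codes


# The Altaic example from the quoted description.
ALTAIC = {
    "Mongolian": {"Eastern": {}, "Western": {}},
    "Tungus": {"Northern": {}, "Southern": {}},
    "Turkic": {},
}

codes = assign_codes("Altaic", "AT", ALTAIC)
for path, code in sorted(codes.items(), key=lambda item: item[1]):
    print(code, " > ".join(path))
```

Paths are used as keys because subgroup names such as "Eastern" or "Northern" recur under different parents. Note that the codes depend on alphabetical position, so inserting a new subgroup would shift the letters of its later siblings.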

Examples [snapshot 2001-08]:

MARC Code List for Languages

MARC Code List for Languages. Web Version of the 2000 Edition. Description 2001-08-27: "This document contains a list of languages and their associated three-character alphabetic codes. The purpose of this list is to allow the designation of the language or languages in MARC records. The list contains 437 discrete codes, of which 54 are used for groups of languages... The list includes all valid codes and code assignments as of February 2000. This revised edition includes numerous changes necessary for compatibility with the newly approved ISO 639-2 in 1998 (Codes for the Representation of Names of Languages Part 2: Alpha-3 Code). There are 26 code changes, 1 deletion, and 35 additions in this revision. The language codes are three-character lowercase alphabetic strings usually based on the first three letters of the English form or, in some cases, vernacular of the corresponding language name. The codes are varied where necessary to resolve conflicts. In the case of modern and older forms of some languages, the initial letters of each part of the language name are used to form the code, e.g., gmh for German, Middle High, and goh for German, Old High. When the name of a language is changed in the list, the original code is generally retained. The list includes individual codes for most of the major languages of the modern and ancient world, e.g., Arabic, Chinese, English, Hindi, Latin, Tagalog, etc. These are the languages that are most frequently represented in the total body of the world's literature. Additional codes for individual languages are created from time to time when it becomes apparent that a significant body of literature in a particular language already exists, or when it is determined that the amount of material in a language is growing. Usually only one code is provided for a given language, even if that language can be written in more than one set of characters. 
In a few cases however, separate codes are provided for the same spoken language written in different characters... In addition to codes for individual languages, the list also contains a number of codes for language groups. While some individual languages are given their own unique code, although linguistically they are part of a language group, many individual languages are assigned a group code, because it is not considered practical to establish a separate code for each." See also notes on the Code sequence, Changes in 2000 edition, Changes since 2000 edition, and ASCII version.

Published Subjects for Geography and Languages TC

[February 02, 2002] OASIS Technical Committee to Define Published Subjects for Geography and Languages. A proposal has been submitted to OASIS for the creation of a new technical committee on 'Published Subjects for Geography and Languages'. The TC will define sets of published subjects "for language, country, and region subjects, in accordance with the guidelines for published subjects to be laid down by the OASIS Published Subjects TC. Languages, countries, and regions are subjects that occur frequently across a wide range of topic maps. In order to promote maximum reusability, interchangeability and mergeability, standardised sets of published subjects are required to cover these domains. Two such PSI sets (for country and language) were published as part of the XML Topic Map 1.0 Specification; the task of this TC will be to update and extend those PSI sets using existing code sets defined by recognised standards bodies such as the ISO and the UN." Published subjects will be created for languages according to ISO 639 and USMARC codes; published subjects for countries and regions will be based upon ISO 3166; PSI sets for countries, regions, and geographic areas will also be created for USMARC codes; another set of published subjects for regions will be based upon the UNSD Standard Country or Area Codes. Published subjects are a form of controlled vocabulary allowing "unambiguous indication of the identity of a subject"; they are defined in the ISO 13250 Topic Maps standard and further refined in the XML Topic Maps (XTM) 1.0 Specification. [Full context]

Use of Standard Code Lists

SGML (Standard Generalized Markup Language)

The SGML standard (ISO 8879:SGML, page 36, section 10.2.2.3) references ISO 639 language codes in connection with the specification for "public text language" used in a "text identifier." A text identifier (Section 10.2.2, production 84) is most commonly encountered within a "formal public identifier" (production 79). An FPI might occur in a document type declaration, external entity specification, notation identifier, or link type declaration. The ISO standard [10.2.2.3, production 88 for public text language] says that the "public text language must be a name [viz., an SGML 'name' per production 55], entered with upper-case letters. The name should be the two-character language code from ISO 639 that defines the principal natural language used in the public text. Notes: (1) The natural language will affect the usability of some public text classes more than others; (2) The portions of text most likely to be influenced by a natural language include the data, defined names, and comments; (3) A system can use the public text language to facilitate automatic language translation."
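To make the position of the public text language concrete, here is a rough Python sketch that pulls the language field out of a formal public identifier. It assumes the common layout in which fields are separated by "//" (an optional "+" or "-" registration prefix, the owner identifier, the public text class and description, the language, and an optional display version). The function name is illustrative, and the check is simplified: it accepts ASCII letters only, and it does not handle the CHARSET class, where a designating sequence takes the place of the language.

```python
def public_text_language(fpi: str) -> str:
    """Return the public text language field of a formal public identifier."""
    fields = [field.strip() for field in fpi.split("//")]

    # Registered ("+") and unregistered ("-") owner identifiers carry a prefix
    # field; ISO owner identifiers such as "ISO 8879:1986" do not.
    if fields and fields[0] in ("+", "-"):
        fields = fields[1:]

    if len(fields) < 3:
        raise ValueError(f"too few fields in identifier: {fpi!r}")

    # fields[0] is the owner, fields[1] the class and description.
    language = fields[2]

    # "entered with upper-case letters"; two characters is a "should",
    # so length is deliberately not enforced here.
    if not (language.isascii() and language.isalpha() and language.isupper()):
        raise ValueError(f"unexpected public text language: {language!r}")
    return language


print(public_text_language("-//W3C//DTD HTML 4.01//EN"))
print(public_text_language("ISO 8879:1986//ENTITIES Added Latin 1//EN"))
```

Because the language sits on the identifier for a whole entity, it describes the DTD or entity set itself (its comments and names), not the language of any document instance that uses it.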

Since most SGML applications anticipate the use of language tagging at the level of the element (frequently not at the level of the entity, where an FPI would be used), ISO 639 language codes are often used in SGML DTDs within attribute definition lists. Elements which require language tagging are given a special attribute such as lang or language, declared as NMTOKEN, IDREF, or CDATA, with the requirement that the attribute value be drawn from the list of ISO 639 language codes. Using IDREF or an enumerated list [name token group] allows the SGML parser to validate the nomination of an authorized language code in the attribute value.

XML (Extensible Markup Language)

The W3C XML 1.0 Recommendation (second edition) references language codes in two places: (1) Section 1.1 reads: "This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it." (2) More specific mention occurs in Section 2.12 'Language Identification': "In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by IETF RFC 3066, 'Tags for the Identification of Languages', or its successor on the IETF Standards Track. [Note: IETF RFC 3066 tags are constructed from two-letter language codes as defined by ISO 639 [International Organization for Standardization. ISO 639:1988 (E). Code for the representation of names of languages. Geneva: International Organization for Standardization, 1988.], from two-letter country codes as defined by ISO 3166, or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES - Internet Assigned Numbers Authority, Registry of Language Tags, ed. Keld Simonsen et al.; see http://www.isi.edu/in-notes/iana/assignments/languages/ ]... The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content."

Note the XML 1.0 second edition amendment characterized as substantive: E11 in "XML 1.0 Second Edition Specification Errata" records that the next-to-last paragraph in Section 1.1 was amended to read "RFC 3066" in place of "RFC 1766". Similarly, other references throughout were changed to read [at the level of surface text] "RFC 3066" -- since RFC 3066 updates and obsoletes RFC 1766. The same errata listing says to "Remove the last sentence of the Note [in Section 2.12, which read]: 'It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].'" However, no change was made in the second edition of XML 1.0 to explicitly allow for three-letter codes as values for xml:lang, even though RFC 3066 allows the composition of a language tag using the 3-letter codes from ISO 639 part 2, "Codes for the representation of names of languages -- Part 2: Alpha-3 code." It appears that the intent was to allow the 3-letter codes.
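The surface syntax of an RFC 3066 tag makes the point about three-letter codes easy to see. The sketch below, in Python, checks a tag against the RFC 3066 grammar (a primary subtag of one to eight letters, followed by any number of hyphen-separated subtags of one to eight letters or digits) and reports how the primary subtag is to be interpreted. The function name and the wording of the labels are illustrative; the check is purely syntactic and does not consult the ISO code lists or the IANA registry.

```python
import re

# RFC 3066: Primary-subtag = 1*8ALPHA, Subtag = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def describe_tag(tag: str):
    """Return (primary subtag in lowercase, interpretation of that subtag)."""
    if RFC3066_TAG.fullmatch(tag) is None:
        raise ValueError(f"not a well-formed RFC 3066 tag: {tag!r}")

    primary = tag.split("-")[0].lower()
    if len(primary) == 2:
        kind = "ISO 639 two-letter code"
    elif len(primary) == 3:
        kind = "ISO 639-2 three-letter code"
    elif primary == "i":
        kind = "IANA registration"
    elif primary == "x":
        kind = "private use"
    else:
        kind = "not assignable under RFC 3066"
    return primary, kind


for example in ("en-US", "haw", "i-klingon", "x-local"):
    print(example, describe_tag(example))
```

Under this grammar a value such as xml:lang="haw" is as well formed as xml:lang="en", which is consistent with the reading that three-letter codes were meant to be allowed.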

For historical purposes, note the E73 [Substantive] listing from the XML 1.0 Specification Errata. E73 "Obsoletes E31, E60, and part of E38."

"Section 2.12: Change the last sentence of the first paragraph to: <corr>The values of the attribute are language identifiers as defined by [IETF RFC 1766], "Tags for the Identification of Languages" or its successor on the IETF Standards Track.</corr> Replace productions [33] to [38] and all the following text, down to but excluding the sentence "For example" just before the examples, with the following: <corr>Note: RFC 1766 tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166] or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].</corr> Rationale The XML processor does not deal with the value of xml:lang, it just passes it on to the application. Checking its correctness at this level has no benefit and hurts with updates to RFC1766 (forthcoming). The spec must still impose the semantics of xml:lang by pointing to RFC 1766.

Evaluation of xml:lang:

The implied inheritance of a language property (viz., the xml:lang value) by subelements in the instance hierarchy may be considered a very useful feature. However, the prescribed semantic "is considered to apply to all attributes and content of the element where it is specified" may be regarded (arguably) as suboptimal for tagging multilingual text, or even for annotating a text in a single language "foreign" to the markup specialist. In some settings, xml:lang may simply be unusable if the semantic prescription of the XML 1.0 specification is to be honored. Details follow.

Section 2.12 describes the use and meaning of xml:lang as follows:

  • ... to specify the language used in the contents and attribute values of any element...
  • ...The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.
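
The inheritance rule quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anything prescribed by the XML specification: the function name effective_langs and the sample document are hypothetical, and only element-to-element inheritance is modeled (which attribute values the declaration covers is left aside).

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_langs(element, inherited=None):
    """Yield (element, language) pairs, applying xml:lang inheritance.

    Hypothetical helper: an element uses its own xml:lang if present,
    otherwise the value in force on its parent.
    """
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_langs(child, lang)

# Hypothetical sample: an English document with one German quotation.
doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hans said '
    '<q xml:lang="de" type="spoken">bei mir</q>.</p><note/></doc>'
)
langs = {el.tag: lang for el, lang in effective_langs(doc)}
```

In this sample the q element resolves to "de" while p and note inherit "en". Under the Section 2.12 wording the value "spoken" on q would also count as German, which is the difficulty discussed in the examples that follow.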

DTD authors will naturally want to design markup constructs (e.g., element type names, attribute names, attribute value name-tokens in an enumerated attribute type) for their users in terms of the users' native language. That is: users want markup labels (XML "names") to be in their first language. Even more critically: if users are required to supply a short phrase-level descriptor as CDATA content for an attribute, they naturally want to think and write in their own language. The XML specification seems not to allow this in cases where the element content is declared to be in some other language. The phrase "all attributes and content" seems to require that a global language assertion would be made by the use of xml:lang in any element.

Example #1: The TEI (P4) DTD defines a <q> element for quoted speech; this element has two CDATA attributes ('who' and 'type') as well as an enumerated-type attribute 'direct' with attribute type and default value (y | n | unspecified) "unspecified". Using the TEI P4 'lang' attribute (a global IDREF attribute indicating the language, writing system, and character set associated with a given element), the following <q>...</q> encoding would be sensible for an English-speaking student wishing to mark up a German quoted phrase: <q lang="de" who="Hans" type="spoken" direct="unspecified">bei mir</q>. The following would not: <q xml:lang="de" who="Hans" type="spoken" direct="unspecified">bei mir</q>. The prescribed meaning of xml:lang seems to require a declaration that the terms "spoken" and "unspecified" (at least) are in German, as well as "bei mir." This is not a boundary case, as the TEI DTD has dozens or maybe hundreds of CDATA attributes which invite substrings "in" the native language of the encoder, which would conflict with the semantic for xml:lang in a bilingual or multilingual encoding environment. It is unclear how the TEI editors could entertain a proposal to substitute the xml:lang attribute of XML 1.0 for the TEI P4 lang attribute in the P4 XML DTD, given the scope specification for xml:lang.

Example #2: Suppose the DTD is "in English" for the benefit of the English speaking users; assume declarations like <!ELEMENT quotation (#PCDATA) > <!ATTLIST quotation speaker CDATA #REQUIRED type (direct | indirect) "direct" context CDATA #IMPLIED xml:lang NMTOKEN #REQUIRED >. How then might one use this DTD to mark up an English language document containing occasional quotations in Spanish uttered by José, as in <quotation speaker="Jos&eacute;" xml:lang="ES" type="indirect" context="whispered to his girlfriend seated at the bar">probablemente</quotation> -- if the markup is to embody an assertion that the text portions "quotation," "speaker," "ES," "type," "indirect," "whispered to his girlfriend seated at the bar," and "probablemente" are all in Spanish.

Data directly affected would seem to include all CDATA [StringType] 'content' in attribute values, all name tokens [Enumeration type] used in attribute values, and all character data [#PCDATA] between the start-tag and end-tag. Depending upon the XML 1.0 meaning of "contents" and "all attributes," the declaration would also apply to all (other) attribute names, to the language of the element type name, entity names, ID/IDREF name, etc.

[Please send email if you think I have misunderstood the meaning of the phrase "all attributes" and "contents" or if you think xml:lang is useful for multilingual documents.]

XHTML (Extensible HyperText Markup Language)

Authoring Techniques for XHTML & HTML Internationalization: Specifying the Language of Content 1.0. Edited by Richard Ishida (W3C). W3C Working Draft. 15-October-2004. Produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). Latest version URL: http://www.w3.org/TR/i18n-html-tech-lang/. "Specifying the language of content is useful for a wide number of applications, from linguistically sensitive searching to applying language-specific display properties. In some cases the full application is still awaiting full development, whereas in others, such as detection of language by voice browsers, it is a necessity today. Marking up language meta information is something that can and should be done today. Without it, none of these applications can be taken advantage of. This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about specifying the language of content..."

"Text processing language: When specifying the text processing language you are declaring the language in which a particular range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text. The text processing language is usually best declared using attributes on elements. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, eg. a French word in an English paragraph. The default text processing language is not necessarily the same as metadata about the primary language of a document..."

Tutorial: Using Language Information in XHTML, HTML and CSS (DRAFT). "This tutorial provides advice in the following areas: (1) guidelines for declaring the language of documents and text; (2) how to specify language attribute values; (3) applicability of the language tag to apply language-specific CSS styling; (4) a brief introduction to the concept of server-based language negotiation. The tutorial additionally attempts to provide explanations of the basic concepts needed to understand the advice given..."

Other references:

HTML

See "Language Tagging in HTML and XML." By Martin J. Dürst, W3C i18n Coordinator. Updated 2001-08-30 or later. "Language codes as defined in RFC 3066 can be (and should be) used to indicate the language of text in HTML and XML documents. For HTML 4, language codes are specified with the lang attribute. For XML, language codes are given in the xml:lang attribute. In both cases, language information is inherited along the document hierarchy, i.e., it has to be given only once if the whole document is in one language, and language information nests, i.e., inner attributes overwrite outer attributes... Language codes starting with i- are defined in the IANA registry of language codes. Language codes starting with x- denote experimental codes without guarantee for uniqueness... Many other W3C and Web-related specifications use language codes [for example], (1) XHTML 1.0, reformulating HTML in terms of XML, which advises to use both the HTML lang attribute and the XML xml:lang attribute, with the later taking precedence in case there should be any differences. (2) HTTP uses language codes in the Accept-Language and Content-Language headers. (3) SMIL and SVG can use language codes in the <switch> statement. (4) CSS and XSL use language codes for detailed style control..."

On HTTP, see Hypertext Transfer Protocol -- HTTP/1.1 = IETF RFC 2616. From Section 3.10 'Language Tags': "A language tag identifies a natural language spoken, written, or otherwise conveyed by human beings for communication of information to other human beings. Computer languages are explicitly excluded. HTTP uses language tags within the Accept-Language [14.4] and Content-Language [14.12] fields. The syntax and registry of HTTP language tags is the same as that defined by RFC 1766. In summary, a language tag is composed of 1 or more parts: A primary language tag and a possibly empty series of subtags: language-tag = primary-tag *( "-" subtag ) // primary-tag = 1*8ALPHA // subtag = 1*8ALPHA. White space is not allowed within the tag and all tags are case-insensitive. The name space of language tags is administered by the IANA. Example tags include: en, en-US, en-cockney, i-cherokee, x-pig-latin where any two-letter primary-tag is an ISO-639 language abbreviation and any two-letter initial subtag is an ISO-3166 country code. (The last three tags above are not registered tags; all but the last are examples of tags which could be registered in future.)"
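
The grammar quoted from RFC 2616 Section 3.10 is small enough to check with a regular expression. The following Python sketch is illustrative only: the names LANGUAGE_TAG and parse_language_tag are hypothetical, and lowercasing the result is just one convenient way to honor the rule that tags are case-insensitive. It checks syntax only and does not consult the IANA registry.

```python
import re

# language-tag = primary-tag *( "-" subtag ); every part is 1*8ALPHA.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")

def parse_language_tag(tag):
    """Return (primary_tag, subtags) in lowercase, or None if the syntax is invalid."""
    if not LANGUAGE_TAG.fullmatch(tag):
        return None  # white space, digits, empty parts, or parts longer than 8 letters
    primary, *subtags = tag.lower().split("-")
    return primary, subtags
```

For example, parse_language_tag("en-US") gives the primary tag "en" with the single subtag "us", while a string containing white space is rejected.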

For the W3C HTML 4.01 Specification [W3C Recommendation 24-December-1999], 'International considerations for text' are presented in Section 8, "Language information and text direction." Excerpts: "Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include: (1) Assisting search engines; (2) Assisting speech synthesizers; (3) Helping a user agent select glyph variants for high quality typography; (4) Helping a user agent choose a set of quotation marks; (5) Helping a user agent make decisions about hyphenation, ligatures, and spacing; (6) Assisting spell checkers and grammar checkers. The lang attribute specifies the language of element content and attribute values; whether it is relevant for a given attribute depends on the syntax and semantics of the attribute and the operation involved. The intent of the lang attribute is to allow user agents to render content more meaningfully based on accepted cultural practice for a given language. This does not imply that user agents should render characters that are atypical for a particular language in less meaningful ways; user agents must make a best attempt to render all characters, regardless of the value specified by lang. The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes. RFC1766 defines and explains the language codes that must be used in HTML documents... An element inherits language code information according to the following order of precedence (highest to lowest): #1: The lang attribute set for the element itself; #2: The closest parent element that has the lang attribute set -- i.e., the lang attribute is inherited; #3: The HTTP 'Content-Language' header, which may be configured in a server. 
Table cells may inherit lang values not from its parent but from the first cell in a span; please consult the section on alignment inheritance for details... In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value of 'en-US' is set for the HTML element, a user agent should prefer style information that matches 'en-US' first, then the more general value 'en'."
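
The two rules in the HTML 4.01 excerpt above (the order of precedence for finding an element's language, and the "exact match first, then primary code" comparison) can be sketched as follows. This is a hedged illustration in Python: the function names and the styles dictionary are hypothetical, the table-cell exception is ignored, and walking through every intermediate prefix of a longer code is my generalization rather than something the specification spells out.

```python
def effective_lang(own=None, inherited=None, content_language=None):
    """Apply the quoted precedence: own lang, then nearest ancestor, then HTTP header."""
    for value in (own, inherited, content_language):
        if value:
            return value
    return None

def candidate_matches(lang):
    """Return the code itself followed by progressively more general prefixes."""
    tokens = lang.lower().split("-")
    return ["-".join(tokens[:n]) for n in range(len(tokens), 0, -1)]

def pick_style(lang, styles):
    """Return the style for the most specific matching code, or None.

    Assumes the keys of `styles` are lowercase language codes.
    """
    for code in candidate_matches(lang):
        if code in styles:
            return styles[code]
    return None
```

With styles registered for both "en" and "en-us", an element whose language is 'en-US' gets the "en-us" style, and one whose language is 'en-GB' falls back to "en".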

XHTML 1.0: The Extensible HyperText Markup Language addresses HTML and XHTML compatibility in Section C.7, 'The lang and xml:lang Attributes': "Use both the lang and xml:lang attributes when specifying the language of an element. The value of the xml:lang attribute takes precedence..." HTML used the attribute lang with language tag values constructed according to RFC 1766.

On language support in the Scalable Vector Graphics (SVG) 1.0 Specification [W3C Recommendation], see Section 5.8.5 The systemLanguage attribute and Section 5.8.2 The 'switch' element as part of conditional processing. "SVG contains a 'switch' element along with attributes requiredFeatures, requiredExtensions, and systemLanguage to provide an ability to specify alternate viewing depending on the capabilities of a given user agent or the user's language... The attribute value for the systemLanguage attribute is a comma-separated list of language names as defined in RFC3066..."
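
A rough Python sketch of how a user agent might evaluate systemLanguage inside a 'switch' follows. The excerpt above only states that the value is a comma-separated list of language names; the matching rule used here (a user preference matches if it equals a listed name, or equals a prefix of a listed name that is followed by "-") is my reading of the SVG conditional-processing text and should be treated as an assumption. The function names and the (attribute, payload) representation of child elements are hypothetical.

```python
def system_language_matches(attr_value, user_langs):
    """True if any user-preferred language matches a name in the comma-separated list."""
    offered = [name.strip().lower() for name in attr_value.split(",") if name.strip()]
    prefs = [pref.lower() for pref in user_langs]
    return any(o == p or o.startswith(p + "-") for o in offered for p in prefs)

def choose_switch_child(children, user_langs):
    """Return the payload of the first child whose test passes.

    `children` is a list of (systemLanguage value or None, payload) pairs;
    a child with no systemLanguage attribute is treated as always passing.
    """
    for system_language, payload in children:
        if system_language is None or system_language_matches(system_language, user_langs):
            return payload
    return None
```

Placing a child with no systemLanguage attribute last gives a fallback for users whose preferred languages match none of the alternatives.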

TEI (Text Encoding Initiative Guidelines)

The TEI DTD provides a global lang attribute, applicable to all elements in the DTD, which names the language in which the content of an element is written. For example: <p lang="EN">...</p>. The TEI global attribute lang (language) "indicates the language of the element content, usually using a two- or three-letter code from ISO 639. Its datatype is IDREF. The value must be the identifier specified for a writing system declaration declared in the TEI header, as described in section 5. The default is %INHERITED;: If no value is specified for lang, the lang value for the immediately enclosing element is inherited; for this reason, a value should always be specified on the outermost element." A typical declaration is thus <!ATTLIST foo lang IDREF %INHERITED; >.

The language element in the TEI Writing System Declaration has an attribute iso639 providing additional functionality. See the TEI WSD in the TEI Guidelines [P4] Chapter 25, "Writing System Declaration": "The writing system declaration or WSD is an auxiliary document which provides information on the methods used to transcribe portions of text in a particular language and script. We use the term writing system to mean a given method of representing a particular language, in a particular script or alphabet; the WSD specifies one method of representing a given writing system in electronic form. A single WSD thus links three distinct objects: (1) the language in question; (2) the writing system (script, alphabet, syllabary) used to write the language; (3) the coded character set, entity names, or transliteration scheme used to represent the graphic characters of the writing system. Different natural languages thus have different writing system declarations, even if they use the same script. Different methods used to write the same language (e.g., Cyrillic or Latin encoding of Serbo-Croatian), and different methods of representing the same script in electronic form (e.g., different coded character sets such as ASCII or EBCDIC, or different transliteration schemes) similarly must use different writing system declarations..." See WSD examples (TEI P3).

The TEI DTD also defines elements <langUsage> and <language>, as described in P4 Section 5.4.2, 'Language Usage'. "The <langUsage> element is used within the <profileDesc> element to describe the languages, sublanguages, registers, dialects, etc. represented within a text. It contains one or more <language> elements, each of which takes attributes specifying the writing system used and the quantity of that language present in the text. Following the <language> elements, prose description may also be added to specify further relevant information..."

The TEI design for lang="" is not entirely happy, however, as reflected in a note from Lou Burnard, Summer 2001: "Every now and then this returns to bite us... In the TEI, the global LANG attribute is supposed to identify both natural language and writing system, using a single code. So 'Hebrew written in Greek characters' gets a unique code, rather than a code for Greek and a code for Hebrew writing system (though the WSD does allow us to say what the ISO639 language is). I believe the rationale for this has something to do with the fact that our LANG attribute in fact identifies a transliteration scheme, which is precisely the union of a particular language and a writing system. But everyone -- not unreasonably -- assumes it means 'language' and so wonders why we think Hebrew stops being Hebrew when you write it in a different writing system. Every now and then people asked why we persist in this folly when there are distinct ISO standards for natural language (ISO 639) and for writing system (ISO 15924). EAD, LOC, and CES all distinguish between them accordingly..." [For details contact Lou Burnard, European TEI Editor]

Note that 'ISO 15924' was sent out for ballot as DIS [Draft International Standard] apparently in October 2000. The closing date for ballots was 2001-01-10. Code for the representation of names of scripts. 'DIS' ISO 15924:2000 (E/F). 2000-05-18. "... provides a code for the presentation of names of scripts. The codes were devised for use in terminology, lexicography, bibliography, and linguistics, but they may be used for any application requiring the expression of scripts in coded form. This standard also includes guidance on the use of script codes in some of these applications... The alphabetic script codes are created from the original script name in the language commonly used for it, transliterated or transcribed into Latin letters. If a country, where the script concerned has the status of a national script, requests a certain script code, preference is given to this code whenever possible. The four-letter codes shall be written with an initial capital Latin letter and final small Latin letters (taken from the range Aaaa to Zzzz). This serves to help differentiate script codes from language codes and country codes: so, for example, Mong mon MON or Mong mn MN would refer to a book in the Mongolian script, in the Mongolian language, originating in Mongolia... The numeric script codes have been assigned to provide some measure of mnemonicity to the codes used..." Note the HTML-style syntax, provided as an example of usage in markup: <META HTTP-EQUIV="Content-Language" CONTENT="ga, ru"> <META NAME="Content-Script" CONTENT="Latg, Cyrl">. See also the ISO 15924 document list. Details: contact Michael Everson, Editor, DIS 15924.
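
The casing convention described in the DIS text (script codes written as one capital letter followed by three small letters) makes the three kinds of code distinguishable by shape alone. The Python sketch below illustrates this. The function names are hypothetical, and the lowercase-language / uppercase-country convention is inferred from the "Mong mon MON" example rather than stated as a rule in the excerpt. No registry lookup is performed, so a well-formed but unassigned code is still accepted.

```python
import re

def normalize_script_code(code):
    """Return a four-letter script code in the 'Aaaa' form, e.g. 'latn' -> 'Latn'."""
    if not re.fullmatch(r"[A-Za-z]{4}", code):
        raise ValueError(f"not a four-letter script code: {code!r}")
    return code.capitalize()

def classify_code(code):
    """Guess the kind of code from its shape alone (an illustrative heuristic)."""
    if re.fullmatch(r"[A-Z][a-z]{3}", code):
        return "script"
    if re.fullmatch(r"[a-z]{2,3}", code):
        return "language"
    if re.fullmatch(r"[A-Z]{2,3}", code):
        return "country"
    return "unknown"
```

Applied to the triple "Mong mn MN", the heuristic labels the parts as script, language, and country in that order.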

Encoded Archival Description (EAD)

The DTD of the Encoded Archival Description (EAD) supplies a standard for encoding archival finding aids using SGML and XML. The <eadheader> comprises a set of metadata about the finding aid that serves to identify unambiguously each particular EAD instance by providing a unique identification code for the document; by stating bibliographic information such as the author, title, and publisher of the finding aid; and by tracking significant revisions to the EAD file... Language encoding for EAD instances utilizes ISO 639-2 Codes for the Representation of Names of Languages, and the LANGENCODING attribute always should be 'ISO 639-2'." Spring 2001, EAD new DTD item adds script name: Scripts/symbol systems vs. languages (2001 EAD DTD Revision Suggestions, 46). "According to the 2nd ed. of ISAD(G) [General International Standard Archival Description], the data element 3.4.3 Language/scripts of material can contain 'the language(s) and/or script(s) of the materials. Optionally, also include the appropriate ISO codes for language(s) (ISO 639-1 and ISO 639-2) or script(s) (ISO 15924). EAD has no specific element to tag scripts, symbol systems, abbreviations employed...' [We should] create a new element to tag the scripts, symbol systems or abbreviations employed. This element could contain a new attribute to include the appropriate code(s) for scripts (ISO 15924)." Apparently resolved: "... One of the proposals that went through with little or no dissent [EAD Working Group meeting, Spring 2001] was the proposal to set up a separate attribute for 'script codes' (which is analogous to what we are calling writing system). There is a place to say something about the language of the materials in one tag, and then the ability to tie in language codes (ISO 639) and also the 'script codes' which are also defined by an ISO standard (ISO 15924). So it then becomes possible to say 'Latinized Hebrew' or what have you. I think in EAD, this would look like: <language langcode="heb" scriptcode="latn">Latinized Hebrew</language>..." [Based on a note from Merrilee Proffitt.]
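
Keeping language and script in separate attributes lets a consumer read each code independently instead of decoding a combined identifier. A minimal Python sketch follows, using the element exactly as given in the note above; the dictionary keys and the variable names are hypothetical, and this is not an EAD processing API.

```python
import xml.etree.ElementTree as ET

# The fragment quoted in the note above, parsed on its own for illustration.
fragment = '<language langcode="heb" scriptcode="latn">Latinized Hebrew</language>'
element = ET.fromstring(fragment)

description = {
    "language": element.get("langcode"),   # ISO 639-2 code
    "script": element.get("scriptcode"),   # ISO 15924 code, lowercase as quoted
    "label": element.text,                 # human-readable description
}
```

A search restricted to Hebrew-language materials can then test description["language"] alone, regardless of the script in which they are written.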

Corpus Encoding Standard

The Corpus Encoding Standard (CES) [formal model in XML as well as SGML] has facility for encoding linguistic annotation. The 'cesAna DTD' has five global attributes, three of which pertain to language and writing system; lang and type store information about the (natural) language of an element's content; the type attribute supplies a two-letter code or three-letter code from ISO 639; wsd provides a means of indicating that the element's content is encoded in a specified character set. See also Section 3.5.2 on the 'langUsage' element and Section 3.5.3 on the CES 'wsdUsage' element for writing system declaration which identifies a character set.

Common Locale Data Repository (CLDR)

"The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format. CLDR Version 1.2, with the Locale Data Markup Language specification (LDML 1.2), provides key building blocks for software to support the world's languages. This new release contains data for 232 locales, covering 72 languages and 108 territories. There are also 63 draft locales in the process of being developed, covering an additional 27 languages and 28 territories." A CLDR locale id contains a language_code, as defined in the Locale Data Markup Language (LDML), where the language_code is drawn from RFC 3066 [or its successor] or from the set of 2-character ISO 639 language codes.

References:

Language Tagging in Unicode

Unicode 3.0 provided for language codes, but the use of tag characters is [now] deprecated, except in very limited special cases. See the references and discussion below.

Unicode Version 3.0 Section 5.11 "Language Tagging in Plain Text" said: "For interchange purposes, it is becoming common to use tagged information, which is embedded in the text. Unicode Technical Report #7, 'Plane 14 Characters for Language Tags,' which is found on the CD-ROM or in its up-to-date version on the Unicode Web site, provides a proposed mechanism for representing language tags. Like most tagging mechanisms, these language tags are stateful: a start tag establishes an attribute for the text, and an end tag concludes it..." This paragraph has been deleted in version 3.1.

UTR #7 likewise has been superseded by the publication of Unicode Version 3.1. See for background: Unicode Technical Report #7. Plane 14 Characters for Language Tags. By Ken Whistler and Glenn Adams. Version 4.0. 2001-03-23. Text in 7-3.1. The Plane 14 Technical Report represented the consensus of a meeting of the UTC Working Group on Tagging and Annotation and of IETF representatives which took place on June 24, 1997. Rationale was offered (approximately) as follows: "... The difficulty of using general in-band text markup for simple protocols derives from the fact that some characters are used both for textual content and for the text markup; this makes it more difficult to write simple, fast algorithms to find only the textual content and ignore the tags, or vice versa. (Think of this as the algorithmic equivalent of the difficulty the human reader has attempting to read just the content of raw HTML source text without a browser interpreting all the markup tags.) The Plane 14 technical report addresses the recurrent and persistent call for a lighter-weight mechanism for text tagging than typical text markup mechanisms in Unicode. It proposes a special set of characters used only for tagging. These tag characters can be embedded into plain text and can be identified and/or ignored with trivial algorithms, since there is no overloading of usage for these tag characters--they can only express tag values and never textual content itself. Tag characters are not intended for general annotation of text..."

Unicode 3.1. See now the new Section 13.7 on "Tag Characters" in Unicode Standard Annex #27. Unicode 3.1, dated 2001-05-16. Tag Characters: U+E0000-U+E007F, sub 'Block Descriptions'. "The characters in this block provide a mechanism for language tagging in Unicode plain text. However, the use of these characters is strongly discouraged. The characters in this block are reserved for use with special protocols. They are not to be used in the absence of such protocols, or with any protocols that provide alternate means for language tagging, such as HTML or XML. The requirement for language information embedded in plain text data is often overstated. See Section 5.11, Language Information in Plain Text in The Unicode Standard, Version 3.0. This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based string tags using characters which can be strictly separated from ordinary text content characters in Unicode. These tag characters can be embedded by protocols into plain text. They can be identified and/or ignored by implementations with trivial algorithms because there is no overloading of usage for these tag characters -- they can only express tag values and never textual content itself. In addition to these 95 characters, one language tag identification character and one cancel tag character are also encoded. The language tag identification character identifies a tag string as a language tag; the language tag itself makes use of RFC 3066 (or its successors) language tag strings spelled out using the tag characters from this block... Tags of the same type cannot be nested in any way. For example, if a new embedded language tag occurs following text which was already language tagged, the tagged value for subsequent text simply changes to that specified in the new tag... Tags of different types can have interdigitating scope, but not hierarchical scope. 
In effect, tags of different types completely ignore each other, so that the use of language tags can be completely asynchronous with the use of future tag types... Avoiding Language Tags: Because of the extra implementation burden, language tags should be avoided in plain text unless language information is required and it is known that the receivers of the text will properly recognize and maintain the tags. However, where language tags must be used, implementers should consider the following implementation issues involved in supporting language information with tags and decide how to handle tags where they are not fully supported. This discussion applies to any mechanism for providing language tags in a plain text environment. Language tags should also be avoided wherever higher-level protocols, such as a rich-text format, HTML or MIME, provide language attributes. This practice prevents cases where the higher-level protocol and the language tags disagree..." See also the announcement for version 3.1 and the document "XML and Unicode."
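
To make the mechanism concrete, here is a small Python sketch of spelling out a language tag with Plane 14 tag characters, reading the stateful (non-nesting) runs back, and filtering the tags out. It is for illustration only, given that the standard discourages the characters. The function names are hypothetical, and the code point assignments used (U+E0001 as the language tag identification character, U+E0020 to U+E007E as the 95 tag characters mirroring printable ASCII, U+E007F as the cancel tag) are the block's published values as I understand them. Cancel handling is simplified: any cancel character clears the current language.

```python
TAG_BASE = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # language tag identification character
CANCEL_TAG = "\U000E007F"    # cancel tag character

def encode_language_tag(tag):
    """Spell out an ASCII language tag such as 'fr' using tag characters."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in tag):
        raise ValueError("language tags must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(ch)) for ch in tag)

def strip_tag_characters(text):
    """The 'trivial algorithm': drop everything in U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

def tagged_runs(text):
    """Return [(language or None, text run), ...]; a new tag replaces the old one."""
    runs, lang, pending = [], None, None
    for ch in text:
        cp = ord(ch)
        if ch == LANGUAGE_TAG:
            pending = []                      # start collecting a tag value
        elif ch == CANCEL_TAG:
            lang, pending = None, None        # simplified: cancel clears the language
        elif 0xE0020 <= cp <= 0xE007E:
            if pending is not None:
                pending.append(chr(cp - TAG_BASE))
        else:
            if pending is not None:           # first ordinary character ends the tag
                lang, pending = "".join(pending), None
            if runs and runs[-1][0] == lang:
                runs[-1] = (lang, runs[-1][1] + ch)
            else:
                runs.append((lang, ch))
    return runs

sample = (encode_language_tag("fr") + "bonjour "
          + encode_language_tag("en") + "hello" + CANCEL_TAG + "!")
```

Because the tag characters never overlap with ordinary text, strip_tag_characters needs no parsing state at all, which is the property the quoted text emphasizes.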

UTR #20. Unicode language tag characters are also discussed in "Unicode in XML and other Markup Languages." Unicode Technical Report #20. W3C Note 15 December 2000. Revision #5. By Martin Dürst (duerst@w3.org) and Asmus Freytag (asmus@unicode.org). [The document "contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML."] See Section 3.8 on 'Language Tag Characters': "A proposed series of characters from U+E0000 .. U+E007F for expressing language tags, based on existing standards for language tags using the rules in [UTR7]. Reason for inclusion: These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were specifically included for the benefit of Internet protocols such as ACAP, which require a standard mechanism for marking language in UTF-8 strings and to avoid the use of other schemes that relied on specific details of the encoding form used. Problems when used in markup: These characters duplicate information that can be expressed in markup. Problems with other uses: Their special code range allows them to be easily filtered, but applications that don't expect them will treat them as garbage characters. Replacement markup: Replace with equivalent language markup [e.g., <xhtml:lang>]. What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove and/or replace them by equivalent markup..."

Related proposal. Adding Embedded Language Identifiers to Plain Unicode Text. By Daniel Wood, Mark W. Davis, and Mark Leisher. (Computing Research Lab New Mexico State University). September 22, 1995. "...This approach basically consists of two parts: [A.] Allocation of sixteen codepoints from the Private Use area of the Unicode Standard character set and identification of their properties (in the Unicode Character Properties sense). [B.] A technique for constructing a language identifier when some contiguous subset of these sixteen codepoints is encountered in a Unicode text stream..." Of this proposal, Mark Leisher wrote: "And speaking of 'A Bad Idea(tm),' I proposed a different sort of bad idea a while back with a language tag idea that is technically more elegant than the Plane 14 tags, in my humble opinion. But it has its own problems as well..."

Language Tags and Operating Systems

This section has little to do with "markup" in the traditional sense. I include it as a reminder that a lot is at stake as designs are crafted for language identifiers: operating systems and applications software need to be customizable and language-property-aware, based upon the rapidly-changing world of language-smart data begging to be respected for linguistic properties they declare to be relevant.

Note: This section is taken substantially (wholesale) from Peter Constable of SIL, with permission; I have not checked it in any respect. See Peter's article which explains language identifiers in relation to locales.

Universal Locales for Linux. The Universal Locales for Linux provides 140 or more Unicode based locales for Linux. The site contains a list of locale identifiers that illustrates this use of ISO 639-1 two-letter codes in Linux and Unix locale identifiers.
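
As an illustration of where the language code sits in such identifiers, the following Python sketch splits a locale name of the common language[_TERRITORY][.codeset][@modifier] shape into its parts. The pattern reflects the usual POSIX-style naming convention as I understand it, not anything taken from the site described above; the names LOCALE_NAME and parse_locale and the sample values are hypothetical.

```python
import re

# language[_TERRITORY][.codeset][@modifier], e.g. "de_DE.UTF-8" or "sr_YU@cyrillic"
LOCALE_NAME = re.compile(
    r"(?P<language>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?"
)

def parse_locale(name):
    """Return the parts of a locale name as a dict, or None if it does not fit the pattern."""
    match = LOCALE_NAME.fullmatch(name)
    return match.groupdict() if match else None
```

Special names such as "C" or "POSIX" carry no language code and are deliberately rejected by this pattern.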

About MS Win32 LANGIDs. A description of MS Win32 LANGIDs, taken from Developing International Software for Windows 95 and Windows NT, by Nadine Kano.

List of Locale Ids and Language Groups. LANGIDs from MS Global Software Development site. See also the list of LANGIDs in the Platform SDK documentation. They are language identifiers composed of a primary language identifier and a sublanguage identifier.
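
The composition of a LANGID from a primary language identifier and a sublanguage identifier can be shown with a little bit arithmetic. The Python sketch below mirrors what the Win32 MAKELANGID, PRIMARYLANGID, and SUBLANGID macros do as I understand them (primary identifier in the low 10 bits, sublanguage in the bits above); the Python function names are hypothetical stand-ins for those macros, and the constants shown are the commonly documented values for English.

```python
LANG_ENGLISH = 0x09          # primary language identifier
SUBLANG_ENGLISH_US = 0x01    # sublanguage identifiers
SUBLANG_ENGLISH_UK = 0x02

def make_langid(primary, sub):
    """Combine primary and sublanguage identifiers into a 16-bit LANGID."""
    return (sub << 10) | primary

def primary_langid(langid):
    """Extract the primary language identifier (low 10 bits)."""
    return langid & 0x3FF

def sub_langid(langid):
    """Extract the sublanguage identifier (bits 10 and up)."""
    return langid >> 10
```

Combining LANG_ENGLISH with SUBLANG_ENGLISH_US gives 0x0409, and with SUBLANG_ENGLISH_UK gives 0x0809.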

Mac OS language constants (Carbon). Lists of constants used for language identification in the Mac OS Carbon interfaces. (This page points to other lists of constants as well.)

Mac OS language and locale identifiers (Cocoa). Describes the mechanisms used for language and locale identification in the Mac OS Cocoa interfaces, taken from Inside Mac OS X: System Overview.

Locales API Preliminary Documentation (Mac OS 8.6 and later). A draft document describing proposed mechanisms for language identification for use in new Mac OS technologies (Cocoa). This does not conform precisely to what is used in Mac OS X, but presents some interesting discussion of language identification issues.

Creating a Locale. Language identifiers in Java. Section of Java tutorial describing the use of language identifiers in creation of locale identifiers in Java. "To create a Locale object, you typically specify the language code and the country code; The first argument is the language code, a pair of lowercase letters that conform to ISO-639."

General References

Last modified: December 11, 2009

Hosted By
OASIS - Organization for the Advancement of Structured Information Standards
Document URI: http://xml.coverpages.org/languageIdentifiers.html
Robin Cover, Editor: robin@oasis-open.org