Wiktionary future
This page aims to collect ideas for Wiktionary's future. With the introduction of Wikidata, many possibilities emerge. Moreover, the proposal to adopt OmegaWiki adds to the current effervescence around projects with overlapping goals.
The first goal is to set out what is specific to each project and which parts overlap. Then suggestions can be discussed on how to divide up the non-overlapping goals and avoid wasting resources on redundant work. Finally, this page should be used to present and debate long-term plans concerning Wiktionaries.
Targets, business-model of the proposals presented here?
Anyone adding a proposal here should first ask themselves: Who will benefit from my proposal?
- Who are my potential customers, the final users of my proposal?
- What is the business model behind it?
- Will it lead to a win-win-situation
- for me (only for me?),
- for the worldwide WT Project,
- for the worldwide (hopefully expanding) user group: Reading users, contributing users or even (very essential) donating users?
Here the win-win-situation can exist between two parties: The WT Project itself and its users.
WT users
The WT project will prosper in the long term only if it correctly assesses the true language needs of a broad spectrum of users and keeps evolving to satisfy them, now and in the future.
This user group will only be willing to pay for it (in the form of contributions and, absolutely necessary for the WT infrastructure, donations, and hopefully not by enduring ads) if the WTs are of high and rising quality and increasingly easy to use (easy to understand AND easy to contribute to).
This leads to the question, posed in every business: Who are the WT Project users (world-wide)?
I, personally, see the following user groups:
- The monolingual WT user
- The bilingual WT user.
Within these groups, you have to distinguish between common users (merely looking for a word; learning a foreign language) and language experts (translators, people who studied the language). Surely the common users are the bigger user base. A smaller part of the common users will be able to contribute. The easier WT is to use, the more (mass) input they can produce. The WT Project should also fulfill the needs of the language experts. Do the interests of language experts match the interests of common users? I think: No. But they can coexist in a good way. Do the language experts understand the real needs of the common users? Sometimes my feeling is: No.
How large are these user bases? This is needed to estimate the real, general benefit of a good proposal.
NoX (talk) 12:22, 29 March 2013 (UTC)
Proposal of an improved Wiktionary data model
Abstract: The following proposal aims at improving the internal Wiktionary data structures. The current structures, mostly based on text and mark-up, with more or less accidental quality of data content, should be moved step by step into a better suited, more easily understandable and usable data model that represents the real long-term needs of the subject. The view is that of a cross-Wiktionary user. The target is described, not the road map.
Currently: One Wiktionary project contains m Wiktionaries (m = many). Each Wiktionary contains m Wikis. Each Wiki contains one (and only one) text block containing text intermixed with mark-up. Such a text block can contain 1 : m words of the same or different languages (not necessarily the language of the Wiktionary itself). Each word can contain the word-token itself (string of single or double byte characters), its type of word (noun, verb etc.), its gender and other characteristics (if it has any), 1 : m pronunciations, 1 : m word-meanings, 0 : m (word) expressions, and 0 : m references (links) to derived words. Each word-meaning has 1 : m translations into 1 : m languages. Each translation contains 2 links (French viewpoint): one into the same Wiktionary, to a Wiki containing a word of the language into which it is translated, and a second link to the Wiktionary project of the language referenced, to a Wiki containing the translated word. So much for the well-known stuff. See also [WIKI, INTERWIKI].
Recommendations:
The Wiktionary project is an admirable worldwide project that greatly helps to improve worldwide communication. The recommendations presented are therefore those of a cross-language-boundary user, a user who is not focused on only one language within one Wiktionary.
1. Use only ONE tag-standard, ONE model-standard and ONE model-sequence-standard across ALL Wiktionaries.
German Wiktionary, French Wiktionnaire and English Wiktionary use different tags and model structures.
e.g.:
- (German) Translation mark-up: *{{fr}}: [1] {{Ü|fr|maison}} {{f}}, ''Normannisch:'' maisoun
- (French) Translation mark-up: * {{T|de}} : {{trad+|de|Haus}} {{n}} or * {{T|de}} : {{trad+|de|Haus|n}}
- (English) Translation mark-up: * French: {{t+|fr|maison|f}}
Some mark-up tags exist in parallel.
- e.g.: the gender mark-up '|f' (feminine) at a defined place and {{f}} (anywhere).
At first glance this seems to be a little narrow-minded.
But the creation of tools to update the tag-standard version would be facilitated, and such tools could be broadly used across Wiktionary boundaries. The lack of standardization prevents the reuse of functions that were created only once, e.g. to improve automated data transfer into another Wiktionary to automatically update translations. In the same way, all automated functions could be used in one version for all Wiktionaries.
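To illustrate the tooling cost of the three parallel standards, here is a minimal Python sketch that reads the translation examples quoted above into one common record. It is only an assumption-laden illustration: the names `Translation`, `TEMPLATE_NAMES` and `parse_translation` are hypothetical, and the patterns cover nothing beyond the example lines shown in this section.

```python
import re
from typing import NamedTuple, Optional


class Translation(NamedTuple):
    lang: str               # language code of the translated word
    word: str               # the translated word itself
    gender: Optional[str]   # "m", "f", "n", or None if not given


# Translation template name per Wiktionary, taken from the examples above.
TEMPLATE_NAMES = {"de": "Ü", "fr": "trad+", "en": "t+"}


def _pattern(template):
    """Build a regex for one translation template (hypothetical helper)."""
    return re.compile(
        r"\{\{" + re.escape(template) + r"\|(?P<lang>[^|}]+)\|(?P<word>[^|}]+)"
        r"(?:\|(?P<inner>[mfn]))?\}\}"       # gender as a template parameter
        r"(?:\s*\{\{(?P<outer>[mfn])\}\})?"  # gender as a separate {{f}} tag
    )


PATTERNS = {wiki: _pattern(name) for wiki, name in TEMPLATE_NAMES.items()}


def parse_translation(wiki, line):
    """Return the first translation found on a mark-up line, or None."""
    match = PATTERNS[wiki].search(line)
    if match is None:
        return None
    return Translation(match["lang"], match["word"],
                       match["inner"] or match["outer"])


print(parse_translation("de", "*{{fr}}: [1] {{Ü|fr|maison}} {{f}}, ''Normannisch:'' maisoun"))
print(parse_translation("fr", "* {{T|de}} : {{trad+|de|Haus|n}}"))
print(parse_translation("en", "* French: {{t+|fr|maison|f}}"))
```

Each Wiktionary needs its own template name and has to handle both gender notations; with a single tag standard the lookup table and the second optional group would not be needed.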
I propose to use the English tag-, model- and model-sequence standard as general standard.
In this case the current English HELP description of the mark-up language should be the authoritative one. Those for other Wiktionaries should merely be translations of it. Currently (in the French Wiktionnaire) the partly poor, partly outdated and inaccessible model descriptions hamper broad usage by users who are willing to contribute but lack know-how.
2. Avoid duplicates in different Wiktionaries.
[NoX: Modified 28 March 2013.]
In my opinion, it’s an incredible waste of time and effort that the same word with the same language code exists in each Wiktionary (WT).
All words of a foreign language (other than the host language of the WT, e.g. non-French words in the French WT) should be eliminated. I mean all words shown in pink in image 1. In my opinion they are misused as the best representation of the word in this language. They are redundant, and in some respects their production is an unnecessary waste of time for contributing users. This effort could be used for better purposes: to improve bilingual translation examples.
In the 3-Wiktionary view of image 1 this means:
- Haus in the English [1] and French Wiktionary [2] should be eliminated.
- Maison in the German and English Wiktionary should be eliminated.
- house in the German and French Wiktionary should be eliminated
and substituted by another ENTITY as described in the next chapter (see the TransEx-Entity).
To explain this, let me first take the perspective of a reading (as opposed to contributing) user, mother language English, strongly interested in French (cross WT view). If he looks for a French word representing house, not knowing maison, he could use English WT house, take the translation reference type 2 (see image 3), cross over to the French WT by clicking on that link and get full and best information concerning maison. I’m sure the language information concerning maison he finds there is the best he can find in ANY WT of the WT project. After having enriched his knowledge concerning maison, why should he add this word to the English WT (if he is a contributing user)? The other way round (user mother language French interested in English) it works the same way.
So my argument is:
- The best (language) Information concerning house you will find in the English WT.
- The best information concerning maison you will find in the French WT.
- The best information concerning Haus you will find in the German WT etc.
Generally, the best information concerning a word in a specific language is found in its host WT (blue words in image 1). All representations of these words in non-host WTs (those represented in pink in image 1) are generally of lower quality.
BUT: I know this is not the whole truth. In reality these pink WIKI words do not only contain redundant information concerning the word itself (e.g. pronunciation, gender, declension, conjugation, etc.).
They also provide:
- A synonym reference to synonyms in host language.
Synonym references are currently of better quality within the pink words (image 1) than in the blue ones. E.g. a contributor with mother language English better knows the English synonyms for maison. But why should he add them to maison in the English WT as he does today? Wouldn’t it be better to improve the translations of maison in the French WT? The blue words could then be used as synonym references. (I know this touches the problem of differences in WT mark-up which hampers updating other WTs: different translation mark-up tags. See my recommendation 1: Use only ONE tag-standard across all WTs.)
- A bilingual representation of the usage of a word in a defined bilingual meaning/sense context (translation examples).
Today these are, in my opinion, the only strong reason for the existence of the words in pink (which I propose to remove). The true sense of these bilingual representations is that they give the reading user a hint, a helping hand, on how a word is typically and properly used in a (bilateral) language context: in the form of examples, translated into the other (!?!) language. This decisive information must be kept and survive. I propose to transfer it into the TransEx-Entity (Translation examples). Its content should be composed (in my example) from house in the French WT and from maison in the English WT. Bilaterally.
NoX (talk) 14:36, 28 March 2013 (UTC) (NoX)
3. Improve the data model. Introduce IDs and attributes.
The data model generally proposed for Wiktionaries is represented as an ERM diagram in crow's foot notation. Data modelling and data representation (as shown to the user or edited) are two different things.
Primary Entity-Type shown is Word. It should only contain the basic information representing one single word in a defined language (Wiktionary language code), and word type (verb, substantive etc.).
To understand the intention of the data model, it is necessary to look at the identifier of Word. The unique identifier, the key of Word, should be composed of the following attributes:
- Word-token. String of single or double byte characters representing the written word in any typeface.
- Language code.
- Word-type. Defined permissible word-types in relation to language code. E.g. substantive, verb, adjective etc.
- Homonym-token. String of single or double byte characters in any typeface to discern homonyms (words of the same word-token, same language and same word type). Normally empty.
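The composite key described above could be sketched as follows in Python. This is only an illustration under assumptions: the class name `WordKey`, the field names and the homonym tokens "1" and "2" are hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class WordKey:
    """Composite key of the Word entity (field names are assumptions)."""
    token: str         # the written word
    lang: str          # language code
    word_type: str     # e.g. "noun", "verb", "adjective"
    homonym: str = ""  # normally empty; only set to tell homonyms apart


# A frozen dataclass is hashable, so the whole key can index a dictionary.
entries: Dict[WordKey, str] = {}
entries[WordKey("Haus", "de", "noun")] = "building, dwelling"
entries[WordKey("maison", "fr", "noun")] = "building, dwelling"

# German "Bank" is a noun both for a bench and for a financial institution,
# so only the homonym token separates the two entries.
entries[WordKey("Bank", "de", "noun", "1")] = "bench"
entries[WordKey("Bank", "de", "noun", "2")] = "financial institution"

print(len(entries))
```

Because all four attributes take part in equality and hashing, changing any one of them yields a different key, which matches the remark further down that such a change amounts to moving the word.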
Each Word has one or more Meanings. Meanings are currently represented in Wikis by
- (example DE:) :[1] [[Unterkunft]], [[Gebäude]],
- (example FR:) # {{architecture|fr}} [[bâtiment|Bâtiment]] [[servir|servant]] de [[logis]], d’habitation, de demeure.
- (example EN:) # {{senseid|en|abode}} A structure serving as an [[abode]] of human beings..
Each Meaning-Entity should contain one and only one meaning and one or more characteristic sentences (repetition is not shown in image 2), using the word under this specific meaning (as currently). Meanings should be ordered by relevance.
Each Meaning has 0 to n Translations. Each Translation has two references.
I propose to change reference type 1 (see image 3) so that it refers to the TransEx-Entity (bilingual translation examples). This seems to me to be a decisive change. Reference type 2 should be kept. It needs to point to a word in the language referenced by the translation.
Establishing the TransEx-Entity (bilingual translation examples) avoids all redundancies currently found in foreign-language words within one Wiktionary. In data-modelling terms it is a relationship-type entity that resolves the many-to-many relationships between specific word meanings in different languages. Its content is bilingual. It does not belong to one Wiktionary. It is a bridge between two Wiktionaries.
What could its content look like? Example: TransEx between the German first meaning of DE: Haus and the French first meaning of FR: maison. (Examples taken from (DE: maison) and (FR: Haus).)
- Signification – Bedeutung
- DE: Haus im Sinne von: [1] Unterkunft, Gebäude
- FR: Maison en sens de: (Architecture) Bâtiment servant de logis, d’habitation, de demeure.
- Exemples – Beispiele
- FR: [1] Dans quelle maison est-ce que tu habites?
- DE: In welchem Haus wohnst du?
- FR: Sa maison se trouvait seule sur une colline. De là, on avait une vue sur les toits des autres maisons du village.
- DE: Sein Haus stand einsam auf einem Hügel. Von dort blickte man über die Dächer der anderen Häuser der Stadt.
All other information in TransEx should be avoided, e.g. word-type, gender, pronunciation, translation. See the German example of (DE: maison). It is superfluous and redundant at this place. This information consists of attributes of other entities, mostly of the entity Word itself.
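As a rough illustration of TransEx as a bridge entity between two meanings, here is a minimal Python sketch. The names `MeaningRef`, `TransEx` and `key` are hypothetical, and identifying a meaning by language, word and sense number is an assumption made only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True, order=True)
class MeaningRef:
    """Points at one meaning of one word (simplified, assumed identifier)."""
    lang: str
    token: str
    sense_no: int


@dataclass
class TransEx:
    """Bilingual translation examples linking exactly two meanings."""
    first: MeaningRef
    second: MeaningRef
    # Sentence pairs only; no word-type, gender or pronunciation here.
    sentence_pairs: List[Tuple[str, str]] = field(default_factory=list)

    def key(self) -> Tuple[MeaningRef, MeaningRef]:
        # Sort the two ends so the entity belongs to neither Wiktionary:
        # (Haus, maison) and (maison, Haus) give the same key.
        a, b = sorted((self.first, self.second))
        return (a, b)


haus = MeaningRef("de", "Haus", 1)
maison = MeaningRef("fr", "maison", 1)

bridge = TransEx(haus, maison)
bridge.sentence_pairs.append(
    ("In welchem Haus wohnst du?", "Dans quelle maison est-ce que tu habites?")
)

print(bridge.key() == TransEx(maison, haus).key())
```

The order-independent key reflects the point that the entity is a bridge between two Wiktionaries rather than part of either one.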
Other entities: They seem to be self-explanatory. Not all of them are detailed.
Comparison between the current data model and the proposed one.
ID of Word.
Currently the ID of a WIKI is only the character-string, representing a word.
The proposed ID of a Word consists of several attributes that should be put into separate (database) data fields, not into text mark-up. (A change of one of these ID attributes would result in a database move of the word.)
Other common attributes.
They could be put into a text-container, containing the well-known, hopefully standardised mark-up. Such a container could also contain (as currently) repeating groups, e.g. translated sentence pairs in TransEx. The same could be the case with entities like See, Expression, Derived word etc. Another possibility could be to put them into separate database elements.
TransEx entity.
As described, this entity represents the deep and broad Jordan River, which has to be crossed.
Future presentation and editing.
Looking at the English style of presentation, I do not see big differences (besides the not-yet-extant TransEx entity).
One thing that really needs improvement in the presentation area is the display of translations. The roll-in/roll-out mode currently available seems to me to be simple-minded. An experienced cross-language user is interested in only two or three languages. It should be possible for users to define a roll-out mode that shows only the translations into the languages they have selected.
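The user-defined roll-out could be as simple as filtering the translation table by a per-user language list. The following Python sketch is only an illustration; the function name `visible_translations` and the data layout (language code mapped to a list of words) are assumptions.

```python
from typing import Dict, Iterable, List


def visible_translations(
    translations: Dict[str, List[str]], preferred: Iterable[str]
) -> Dict[str, List[str]]:
    """Keep only the translations into the languages the user selected."""
    wanted = set(preferred)
    return {lang: words for lang, words in translations.items() if lang in wanted}


# Translations of English "house", keyed by language code.
house = {"de": ["Haus"], "fr": ["maison"], "it": ["casa"]}

# A user interested only in French and German.
shown = visible_translations(house, ["fr", "de"])
print(shown)
```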
A big challenge seems to me to be the future editing process. It needs to be greatly improved. Preferably window based pop-up sequences, oriented at the entity structure, containing input fields that do not require the knowledge of the mark-up, except perhaps in an expert mode.
Advantages of the proposal.
- One general, commonly usable data structure.
- The limits between the Wiktionaries could be demolished. All Wiktionaries could be put into one worldwide Wiktionary-Project data pot.
- One general, worldwide usable mark-up language would be established.
- Functions need to be developed only once and can be used in all Wiktionaries. (I know this will kill the beloved babies of many a Wiktionary power user.)
- Automation processes (automated content controlling, automated Word-stub creation, automated translation transfer, mark-up upgrade etc.) could be greatly improved.
- No word redundancies, less editing effort.
Disadvantages of the proposal.
- Strong effort is needed. The TransEx entity is a broad and deep Jordan River to be crossed.
- The theme in its entirety is difficult to communicate between single-language users, database experts focused on the WIKI data structure, language experts and those who need to create a Wiktionary-project data representation style guide.
NoX (talk) 19:22, 17 March 2013 (UTC). (NoX)
Your opinion?
NoX (talk) 19:22, 17 March 2013 (UTC). (NoX)
- Wow. That's a lot to digest.
- Allow me to say for now that the underlying idea, of better data portability, is a good one.
- However, your proposal here calls for unifying many many things that really have no business being unified -- many of these variant aspects are different at least in part because they meet the needs of different user communities. For instance, the wikitext markup used in translation tables reflects the languages of the host language of each Wiktionary. The DE WT uses {{Ü}} to stand in for Übersetzung; the FR WT uses {{trad}} as shorthand for traduction, since {{t}} there is used to stand in for transitif; the EN WT uses {{t}} as shorthand for translation, since there wasn't the same name collision as on the FR WT. Requiring that all Wiktionaries use {{t}} for translation table items might make English speakers happy, but it would be a poor mnemonic for editors of languages where the relevant term for translation does not start with a "t". It would also require renaming any other existing templates already at {{t}}, and then going through all entries that referenced the previous name to update with the new name.
- This does not even begin to address the more complicated issue that different Wiktionaries employ sometimes very different entry structures because of the different ideas about grammar and linguistics held by the different user communities. If you are serious about moving forward with this proposal, I strongly recommend that you do some grass-roots building by addressing each Wiktionary user community directly. I'm pretty much never over here on Meta, and the only reason I learned of your proposal was thanks to another editor who posted on wiktionary:Wiktionary:Beer_Parlor about this. I suspect that I am not alone in having missed this post earlier. -- Eiríkr Útlendi │ Tala við mig 22:05, 19 March 2013 (UTC).
- I think your worry about the translated templates is not as much of a problem as you think. It wouldn't be difficult to have a "code-common" where templates have a name in English, with each local chapter offering a wrapper template which translates it (as well as the documentation). But I think that you are nonetheless right about the importance of building such a structure with the community. We must communicate about this project and work with the whole community, so that every regular contributor will know about it and, hopefully, will be enthusiastic about getting involved, so that their current specificities can be preserved by making the structure flexible enough for every use case. --Psychoslave (talk) 13:57, 22 March 2013 (UTC).
- Re: localized labels for various features, your idea of a wrapper is probably a good one. Now that we have Lua, that might be less of a concern, though I have read that editors are running into possible performance concerns when a single Lua module is being called multiple times all at once. Collapsing all translation item templates for all Wiktionaries into a single Lua module might wind up creating a severe bottleneck. -- Eiríkr Útlendi │ Tala við mig 21:52, 22 March 2013 (UTC).
- Well, I don't know the details of this specific load-balancing problem, but it seems resolvable to me. A simple solution would be to duplicate the code automatically, so you keep the central editable version, but the executed code is distributed. Now if there really is a load-balancing problem with Lua modules, there should be a serious investigation to resolve it. I just don't have a proper picture of the technical infrastructure to give an appropriate answer just like that, but I have no doubt this can be resolved. --Psychoslave (talk) 08:55, 23 March 2013 (UTC).
- Excellent work. Now to my mind the word entity should not have a single orthography, because that doesn't reflect reality: even if you restrict yourself to well-known and widely used spellings, there are words which have several accepted ones. For example, in French you may write clef or clé to refer to a key. So orthography should be another entity, just like meaning, and one word may have one or more orthographies (for a given language). One orthography can correspond to one or more words. Moreover, an orthography should be categorizable, so you can say whether it's considered a correct orthography, a misspelled word, or a special case like the all your base are belong to us locution and the word l33t. The proposal should also be extended to include synonyms, hypernyms, and so on, as well as etymology. Etymology should have its own ERM part, I think, because we can surely establish a well-structured schema of how words shifted from one form to another (I'm not a specialist, but I know there is specific vocabulary for many supposed transformations, like an l shifting to r). --Psychoslave (talk) 10:41, 22 March 2013 (UTC).
- Also each spelling could be attached to examples, and examples could have 0 or m translations, as well as a well defined reference (url/document with isbn...). --Psychoslave (talk) 13:14, 22 March 2013 (UTC).
- Re: orthographies, different spellings often carry different connotations, sometimes different enough that they should be considered different entities in their own rights, even if the underlying concept referred to by the terms is the same thing. English thru and through are two different labels for one concept, but the labels themselves carry sufficiently different semantic information that dictionaries often treat these two different spellings as separate entries.
- Even with your French example, I see that clé has a secondary sense of "wrench, spanner", that seems to be missing from the clef entry. Assuming that this difference in meaning is valid and not just an accidental omission by Wiktionary editors, then these two spellings carry different semantic information, and deserve to be treated as different entries, at least for Wiktionary purposes.
- Japanese gets much more complicated due to the extremely visually rich nature of the written language. The hiragana spelling つく (tsuku) can mean "to arrive; to turn on; to stab", among other meanings. Meanwhile, the kanji spelling 着く (tsuku) is limited to "to arrive"; 付く (tsuku) is limited to "to turn on"; and 突く (tsuku) is limited to "to stab". (Simplified examples; all of these entries have additional senses.) Whether to use the more-specific kanji spellings is a matter of style and preference, not to mention clarity and disambiguation; which kanji spelling to use depends on semantic context. The hiragana spellings of many short verbs have similar one-to-many correlations to kanji spellings, where the kanji spellings are generally more specific than the hiragana spellings, and often the hiragana spellings are in common use right alongside the kanji spellings.
- The data model must ostensibly account for all of this variation. Separating the spelling from the concept, which I think is what @Psychoslave here is proposing, is probably necessary for this. Some commercial terminology management tools that I have used take the concept as the top level of the data structure. One concept may have multiple terms, and one term may point to multiple concepts. One serious potential shortfall of such software is clarity --
- how are concepts identified within the data model?
- how does one add a synonym (such as a new orthography) to a concept?
- when looking at a single term, how are different concepts identified for the user?
- is each individual sense of any given term (implemented now in Wiktionary as a numbered definition line) to be transformed into a "concept" in the data model?
- how does one manage different "concept" data objects, to do things such as find potential duplicates (possibly differing only by minor wording choices)?
- how does one manage "concept" data objects, for purposes of splitting a sense into multiple separate senses when more specific meanings are identified?
- etc., etc.
- This is an enormously complicated problem, even when limited to looking at just one language. Expanding the problem scope to include all languages is both insanely ambitious and deliciously challenging. Good luck to all! :) -- Eiríkr Útlendi │ Tala við mig 21:52, 22 March 2013 (UTC).
- Ok, let me begin with the simplest point (from a French native speaker's point of view): clef and clé are exactly the same "word", orthography being the only difference. They have the same meaning, and you pronounce them in the same way. To understand why, you can begin with fr:w:Rectifications orthographiques du français if you want to know more about it (some equivalent articles are available in other chapters). But to stay both on topic and on French specificities (or at least, linguistic phenomena which may not happen in all languages), there are words that you write in the same way but pronounce differently according to their meaning. See wikt:Catégorie:Homographes non homophones en français. I have no doubt each language has its own curiosities, so indeed, we are speaking of a daunting task here. Fortunately (and hopefully) this task can rely on a global community (or a global set of communities if you prefer). Probably no single human could afford the time and experience needed to accomplish such a task, but I believe that together we can do it.
- As for how we can deal with identification and, more than that, which elements should form the key to a unique entry in our database, I would personally be interested in the opinion of the OmegaWiki contributors, because they probably have interesting analysis to share that they gained through their experience.
- We also surely have to gather information to make sure we can establish a model flexible enough to take account of the specificities of all languages and communities, but how do we decide that we have gathered enough information to freeze a structure? Ideally, to my mind, we should come up with a solid, extensible basic structure. --Psychoslave (talk) 09:57, 23 March 2013 (UTC).
NoX: You (Psychoslave, Eirikr) are absolutely right. My ERM is only a sketch. Relevant Entities and Relationships are missing. In IT database projects it’s a good idea to begin with a simple ERM. Its purpose is to initiate a discussion between IT experts and (in this case) language experts about the necessary and relevant things (Entities) and their relationships to each other. Later in the database design process they are mapped, not necessarily 1 : 1, into specific (MS SQL, ORACLE, DB2, WIKI) databases and tables. In an IT project your contribution, for example, would lead to (a discussion and) an extension of the ERM by adding Entities (not seen by me, or left out in the discussion-provoking start-up process). A good ERM on language and language translation would reflect the long-lasting nature and the essence of all things (Entities) and their relationships in this environment.
But our current problem is different. We have a multipurpose WIKI database with shortcomings in the language area (WT Project) and big advantages in other areas (e.g. WIKIPEDIA). So my idea was, looking at the current French WT (knowing also the English, German and Italian WT): what could its ERM look like, and what could be improved? I didn’t write anything about HOW to do it. The change could be made in an evolutionary way (I’m not sure if this can work because I’m not a WIKI database expert), or it could be made in a revolutionary way. Simply said: 1. Harmonize the mark-up. 2. Export the current WT content (e.g. into an agreed XML structure). 3. Reload it into a better suited database (see the following proposals by others).
NoX (talk) 21:35, 24 March 2013 (UTC).
@NoX?: Since I look upon all Wiktionaries as one big Wiktionary and also share your interest in cross-Wiktionary questions, I believe that your approach is basically praiseworthy; however, some of your suggestions above raise questions. Before I start criticizing the proposals too severely I therefore want to ask you a question concerning the section "Avoid duplicates in different Wiktionaries". What exactly do you mean by that? To make a comparison: Do you want all Chinese users to throw away their French-Chinese dictionaries, all Swedes to throw away their French-Swedish dictionaries and so forth, and that they should all start using Larousse's French-French dictionary, to get the exact meaning of a French word? Is that the idea you have for Wiktionary, or have I completely misunderstood your vision? Lars Gardenius (diskurs) 13:21, 26 March 2013 (UTC)
- NoX: Hi Lars. I partly rewrote chap 2 concerning duplicates. I hope this answers your questions. If not, let me know. What remains unanswered is where (in which WT) to put the TransEx-Entity if established. NoX (talk) 14:46, 28 March 2013 (UTC)
Thank You for the new and extended version of "Avoid duplicates in different Wiktionaries". However I am still very critical. I would like to stress again that I find Your initiative and approach praiseworthy, and I hope that You will not find what I write below as an attack on Your proposal as a whole. However I believe that You have overlooked some very basic facts about languages in that specific section.
I have worked as a professional translator (from Chinese) during a short period of my life. I, as many others, recommend that you start using monolingual dictionaries as soon as possible, like Oxford Dictionary for English, Larousse for French or 新华字典 for Chinese. The reason is the one you give above, the best explanation you can find is probably in this kind of dictionary. Since it is very costly to produce a (paper) dictionary you have to limit the space given to explanations in bilingual dictionaries.
However, this recommendation is easier to give than to follow. To be able to handle e.g. a monolingual Chinese dictionary, you have to study Chinese at least a couple of years. An effort that perhaps not everybody is ready to make. I don't think it is reasonable to believe that any average user can understand an explanation written in Chinese, however good it is.
You could of course propose a translation of the Chinese articles (on Chinese words) to all other languages but then you are back in the situation you wanted to avoid, and how many can translate from Chinese to Finnish, Romanian, Quechua etc., and keep them updated?
There is also another major reason why this is not a good approach.
Every monolingual dictionary is written in a social and linguistic context. A Chinese monolingual dictionary is written in a Chinese social and linguistic context, that you have to know to really understand the explanations. All languages also have different ways to solve different grammatical and linguistic problems. So what perhaps is not at all mentioned in a Chinese monolingual dictionary, because it is considered trivial to everybody having Chinese as their mother tongue, is perhaps very difficult to understand, and necessary to treat in a dictionary, if you e.g. just speak Portuguese.
These are some of the reasons why I think every serious translator uses all kinds of monolingual and bilingual (both ways) dictionaries when translating. If you are a Swedish translator you simply need a dictionary explaining the word from e.g. a Chinese point of view as well as from a Swedish point of view.
So both the professional translator and the average layman need both monolingual and bilingual dictionaries, now and in the future.
It should also be said that the biggest problem in Wiktionary obviously lies with these bilingual parts of the dictionary. If you for instance want to create a Chinese-Romanian Wiki, you will need at least ten people working on it for several years before it reaches an acceptable level of quality and usability. This number of people is obviously often lacking. But throwing these bilingual parts out doesn't solve the problem; it just hides it.
I believe this problem is also partly linked to the translation part of the articles. The space devoted to translations is very small (in all Wiktionaries). I once made a comparison with an ordinary (paper) dictionary: while it devoted 40 lines to translating a German word (into a certain language), Wiktionary devoted half a line, which is about as much as you can find in an ordinary cheap pocket dictionary!
I believe that Wiktionary has to find a whole new way to present translations and to link them to the articles.
Lars Gardenius (diskurs) 18:13, 28 March 2013 (UTC)
- (-sche here:) I don't have time right now to respond to everything that has been said, but: it's true that many (e.g.) English Wikt entries for (e.g.) French words are currently smaller than French dictionaries' entries for those words, because Wiktionary is incomplete. However, because Wiktionary is not paper, it has the ability to cover all words in all languages in greater detail than any paper dictionary. wikt:de:life (as a result of my work) and wikt:de:be, for example, provide German-language coverage of the English words life and be that is as expansive and detailed as a monolingual English dictionary's. wikt:en:-ak provides English-language coverage of -ak more detailed than any Abenaki-language dictionary's, not that there are (m)any Abenaki-language dictionaries! That's what each Wiktionary can do at its best, and it's what would be lost or made more difficult by proposals to centralise foreign-language content either on Wikidata or on OmegaWiki (cf. my comments on the proposal to adopt OmegaWiki). -sche (talk) 23:34, 28 March 2013 (UTC)
- @Lars, one difference in the treatment of translations is that the English Wiktionary, for instance, links straight through to the translated term entry pages. The full treatment is available there, but not right in the "Translations" table.
- @Nox, your rewrite quite concerns me, particularly this paragraph:
To explain this, let me first take the perspective of a (reading (opposite to contributing)) user, mother language English, strongly interested in French (cross WT view). If he looks for a French word representing house, not knowing maison, he could use English WT house, take the translation reference type 2 (see image 3), cross over to the French WT by clicking on that link and get full and best information concerning maison. I’m sure, the language information concerning maison he finds there is the best one he can find in ANY other WT of WT project. After having enriched his knowledge concerning maison, why should he add this word to English WT (if he is a contributing user)? The other way round (user mother language English interested in French) it works same way.
- You assume that this hypothetical English reader is also capable of fully understanding the French Wiktionary entry at wiktionary:fr:maison. This is a seriously flawed assumption. As Lars notes, each Wiktionary represents thousands of hours of work by host-language contributors, writing in the host language.
- I am also concerned about some of your operating assumptions about applicable data models. The only commonalities in entry structure and data that I have seen across all Wiktionaries are the presence of the lemma term itself, and possibly lists such as translations, derived terms, and descendant terms. It is not even safe to assume common parts of speech for term categorization, as not all host languages treat parts of speech in the same way. For instance, what English grammarians think of as an "adjective" roughly maps to at least three different parts of speech in Japanese (形容詞 [keiyōshi], 形容動詞 [keiyō dōshi], and 連体詞 [rentaishi]). What Japanese grammarians think of as a 語素 (goso) roughly maps to two different parts of speech in English ("prefix" or "suffix"). Meanwhile, it seems that the Russian Wiktionary forgoes such labeling entirely and instead uses running text to describe the morphology of each term. (NB: I'm not a Russian reader; this comes to me as second-hand information.)
- Since each Wiktionary describes each term using the host language, there is no guarantee at all that the labels used in the Russian Wiktionary match the labels used in the French Wiktionary match the labels used in the English Wiktionary match the labels used in the Japanese Wiktionary... all for any single given term.
- I certainly wish you luck in your research. However, I think this problem is much more complicated, and much more intractable, than your description above suggests. -- Eiríkr Útlendi │ Tala við mig 23:59, 28 March 2013 (UTC)
- @Eirikr et al.: Since I have worked quite a lot in the Russian Wiktionary (Викисловарь), I can assure you that you can find that kind of labelling there too. The reason you don't see it when you just look at a page like this is that you don't know where to look. I think this illustrates very well one of the problems you face if you want all Wiktionaries to function as one big Wiktionary (and if you want to be able to contribute to many Wiktionaries without starting from scratch every time): all the unnecessary differences. Here an initiative like this could make a difference.
- Lars Gardenius (diskurs) 09:14, 29 March 2013 (UTC)
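To make the part-of-speech mismatch raised in this thread concrete, here is a minimal Python sketch. It is only an illustration: the table `POS_EQUIVALENTS` and the helpers `map_pos` and `is_lossless` are hypothetical names, not part of any Wiktionary or Wikidata data model, and the only correspondences listed are the two given in the discussion above.

```python
# Hypothetical table of rough part-of-speech correspondences.
# Only the two examples from the discussion are included.
POS_EQUIVALENTS = {
    ("en", "adjective"): {("ja", "形容詞"), ("ja", "形容動詞"), ("ja", "連体詞")},
    ("ja", "語素"): {("en", "prefix"), ("en", "suffix")},
}


def map_pos(lang, label, target_lang):
    """Return every label in target_lang that roughly corresponds to (lang, label)."""
    candidates = POS_EQUIVALENTS.get((lang, label), set())
    return sorted(name for (code, name) in candidates if code == target_lang)


def is_lossless(lang, label, target_lang):
    """True only when the label maps to exactly one label in the target language."""
    return len(map_pos(lang, label, target_lang)) == 1


if __name__ == "__main__":
    print(map_pos("en", "adjective", "ja"))      # three candidate labels
    print(is_lossless("en", "adjective", "ja"))  # False: no single equivalent
```

Any shared data model would have to decide what to do in the cases where `is_lossless` is false, which is exactly the difficulty described above.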
Thinking out of the classical online Wiktionary format and reading usage
Our goal is not only to build dictionaries that are as complete as possible; we also want the result to be as useful as possible, which means it should be easy to integrate them elsewhere and, more generally, to use them in innovative ways.
In this part, contributors are encouraged to describe the kinds of usage that could be made easier if taken into account at the design stage rather than as an afterthought.
Generating standard dictionary output.
Currently, the dumps that are generated are not directly usable in offline applications, for example GNOME Dictionary. As far as I know, we don't provide a standard way to consult them, such as through the w:DICT protocol. It would also be convenient to be able to download Wiktionary for e-ink devices. --Psychoslave (talk) 13:40, 22 March 2013 (UTC).
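As a rough illustration of what consulting Wiktionary through the DICT protocol (RFC 2229) would involve, the Python sketch below builds a DEFINE command and extracts definition bodies from a raw server reply. It makes no network connection. The database name `wikt` and the sample reply are invented for the example, since no such Wiktionary DICT service is described here.

```python
def build_define(word, database="*"):
    """Build a DICT DEFINE command; "*" asks the server to search all databases."""
    # Quote the word so that multi-word entries are sent as one argument.
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'DEFINE {database} "{escaped}"\r\n'


def parse_definitions(response):
    """Return the text of each definition found in a raw DICT response."""
    definitions = []
    current = None  # lines of the definition being read, or None between definitions
    for line in response.splitlines():
        if current is None:
            if line.startswith("151 "):  # status 151 introduces one definition
                current = []
        elif line == ".":  # a lone dot ends the definition text
            definitions.append("\n".join(current))
            current = None
        else:
            # Undo dot-stuffing: a leading dot in the text is sent doubled.
            current.append(line[1:] if line.startswith(".") else line)
    return definitions


# Invented sample reply, for illustration only.
SAMPLE = (
    '150 1 definitions retrieved\r\n'
    '151 "maison" wikt "Hypothetical Wiktionary database"\r\n'
    'maison\r\n'
    '   house (noun, feminine)\r\n'
    '.\r\n'
    '250 ok\r\n'
)

if __name__ == "__main__":
    print(build_define("maison"), end="")
    print(parse_definitions(SAMPLE))
```

A real service would additionally need the dumps converted into whatever database format the chosen DICT server expects, which is outside the scope of this sketch.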
Voice recognition.
One way to access an entry in Wiktionary is to pronounce the word or phrase. As smartphones become more common, more people own a device that can take voice input. On the other hand, people sometimes meet a word they can't spell. For example, two people from distant native cultures become friends and like to share their respective knowledge through conversation. Sometimes one of them will use a specific word of their native language, but the other won't understand it and won't even be able to pronounce it, because it contains sounds s/he doesn't know (or maybe it's a tonal language and s/he doesn't know any tonal language). So they take a smartphone, run the wikipronounce app, and voilà: the original spelling, an IPA transcription (possibly a romanized transcription if relevant), and a definition in the user's native language. --08:45, 23 March 2013 (UTC).
Speech synthesis.
Alongside IPA (and X-SAMPA), Wiktionary also offers pronunciation samples. Currently these sounds need to be recorded and uploaded by contributors, one by one. This solution is better than nothing, and should probably even be kept to give real-world examples of a word's pronunciation. That said, it has many disadvantages. First of all, not all words have such a sample. Some nonetheless have IPA, but most people probably won't be able to read it easily, given that, as far as I know, no primary school in the world teaches it. So it would be very helpful for many readers to have speech synthesis using the IPA data (when present): not only would people get at least a minimal idea of how to pronounce the word, they would also be able to learn IPA through habit. Another advantage is that it would provide a unified pronunciation voice across all words (possibly customizable in preferences), whereas recordings vary and reflect the diversity of contributors. This last sentence should not be taken as a criticism of diversity: as said above, recordings should be considered of great value because they provide real-world examples, and should remain as complementary data to speech synthesis. --Psychoslave (talk) 08:11, 25 March 2013 (UTC).
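As a small illustration that the IPA data is machine-readable, and so could in principle feed a synthesizer, here is a Python sketch converting a handful of IPA symbols to their X-SAMPA (ASCII) equivalents. The table is deliberately partial, the function name `ipa_to_xsampa` is made up for the example, and unknown symbols are passed through unchanged. It is not a synthesizer, only a possible first processing step.

```python
# Partial IPA -> X-SAMPA table; a complete converter would need the full chart.
IPA_TO_XSAMPA = {
    "ə": "@", "ʃ": "S", "ʒ": "Z", "θ": "T", "ð": "D", "ŋ": "N",
    "ɛ": "E", "ɔ": "O", "ɪ": "I", "ʊ": "U", "æ": "{", "ɑ": "A",
    "ɡ": "g",   # IPA uses a dedicated letter for the voiced velar stop
    "ː": ":",   # length mark
    "ˈ": '"',   # primary stress
    "ˌ": "%",   # secondary stress
}


def ipa_to_xsampa(ipa):
    """Convert an IPA transcription to X-SAMPA, symbol by symbol."""
    # Strip the slashes or brackets that usually surround a transcription.
    bare = ipa.strip("/[]")
    # Symbols missing from the table (e.g. plain a-z) are kept as they are.
    return "".join(IPA_TO_XSAMPA.get(symbol, symbol) for symbol in bare)


if __name__ == "__main__":
    print(ipa_to_xsampa("/ˈdɪkʃən/"))
```

A symbol-by-symbol mapping like this ignores combining diacritics and tone marks, so a real implementation would have to handle those separately.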
Helping to avoid or create neologisms.
The primary purpose of language is to communicate and share ideas. Often people know no specific word for what they are thinking and want to communicate. Usually one can use a sentence made of several words that expresses the thought more or less accurately. But when a new concept is central to a thought, one may decide to create a new word for it. Different strategies may be used to coin such a word, each with its pros and cons:
- Use etymological knowledge of the given language to build a word which adds no new roots and is both short and, hopefully, understandable to someone with a good knowledge of the language. The advantage here is that it extends the language in a way more or less familiar to speakers, possibly with a word they will understand even if they have never heard it before. For many words of this kind no advanced knowledge is really needed, as many (all?) languages have affixes which make it possible to coin such ad hoc words. But sometimes such a construction requires a high level of linguistic skill, especially in specialized fields such as science, where people may be competent in their specialty but not in linguistics.
- Make an acronym. The clear advantage here is that you need no linguistic competence to coin the word. The evident disadvantage is that the coined word is completely opaque, and native speakers won't be able to use their lexical knowledge to deduce its meaning. An acronym is not necessarily used because no specific expression exists; it's often a matter of brevity. Thus DNA, which stands for deoxyribonucleic acid, trades ten syllables for three.
- Use a loanword. The advantages are that the word already exists, possibly needing only some pronunciation adjustment, and that it probably has a known, meaningful etymology. The disadvantage is that the word may be opaque to native speakers of the target language, so they can't establish semantic relations based on the lexical and semantic network they have already acquired.
Here Wiktionaries should help by:
- first, avoiding unwanted[1] redundant neologisms, by making it easy to find existing expressions for a given concept,
- making it easier to create neologisms that are as relevant as possible given the existing lexicon of the target language.
Notes and references
- ↑ The goal is not to prevent people from creating new words or languages if they want to, just to let them know when an expression already exists, in case they would like to avoid coining a redundant one.