Farsi #41

Open
@gilgameshjw

Description

Farsi

Transliteration in Farsi

With Mahdi, we have identified several challenges specific to Farsi:

  1. Persian writers may use several different characters for the same letter, which requires "normalisation" work, probably with mapping tables.
  2. In practice, Persian writers are not strict about spacing: the same Farsi word can appear with or without spaces between its characters, or with a ZWNJ (zero-width non-joiner) character instead.
  3. Transliteration of single words:
    • Mahdi has found large dictionaries of Farsi words with transliterations for their various parts of speech (N, V, ...).
    • These dictionaries are quite extensive and could be used.
    • Research shows that transliteration can be learned better with neural networks than with rules.
    • The resulting transliteration seems NOT to be aligned with the interscript one (probably requiring maps).
  4. Transliteration of several words
    • In Farsi, words take prefixes and suffixes depending on their position and role in a sentence.
    • Consequently, we are considering part-of-speech (PoS) tagging.
    • PoS tagging: algorithms for Farsi exist; we need to survey the available software and possibly compare tools or even train our own.
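Challenges 1 and 2 above can be sketched as a small preprocessing step. This is only a minimal sketch, not the project's normaliser: the table `CHAR_MAP` and the helpers `normalise_chars` and `spacing_key` are hypothetical names, and the two mapped pairs (Arabic yeh and kaf versus their Persian forms) are just the most familiar examples; a real table would need many more entries.

```python
import re

# Hypothetical, incomplete table: Arabic code points that are often typed
# in place of the corresponding Persian letters.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TRANSLATE = str.maketrans(CHAR_MAP)

ZWNJ = "\u200C"  # zero-width non-joiner


def normalise_chars(text: str) -> str:
    """Replace variant characters with one canonical form."""
    return text.translate(_TRANSLATE)


def spacing_key(word: str) -> str:
    """Comparison key that ignores whitespace and ZWNJ inside a single word.

    Intended for dictionary lookups of one word; applying it to a whole
    sentence would merge neighbouring words.
    """
    return re.sub(r"[\s\u200C]+", "", normalise_chars(word))


if __name__ == "__main__":
    with_zwnj = "\u0645\u06CC" + ZWNJ + "\u0631\u0648\u0645"
    with_space = "\u0645\u064A \u0631\u0648\u0645"  # typed with Arabic yeh
    print(spacing_key(with_zwnj) == spacing_key(with_space))
```

Keeping the spacing-insensitive form as a lookup key, rather than rewriting the text itself, leaves the original spacing available for later steps such as PoS tagging.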

Ideas (good and bad)

  • speech to text data?
  • learn Farsi $\Rightarrow$ interscript-like transliteration

Plan

  1. Look for mappings: Farsi $\Rightarrow$ (approximately) Latin
    Done
  2. Stats of collisions and concept validation
    952 collisions in a 50k-word dictionary, 0.5% at word level.
    Done, Validated
  3. Create git branch so that Mahdi+Jair can collaborate
    Done
  4. Run simplest possible transliteration:
    • Mahdi provides dataset
    • Jair builds a naive map and transliterates (model 0)
    • Ronald, Mahdi, Jair: feedback
  5. Review NLP libraries, codebases and research in Farsi.
  6. Improve (char normalisation, preprocessing and PoS)
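Steps 2 and 4 of the plan could look roughly like the sketch below. It assumes that "model 0" is a plain character-by-character lookup and that a "collision" means two distinct Farsi words producing the same Latin string; the issue does not define either term precisely. The table `NAIVE_MAP` and the functions `transliterate` and `find_collisions` are hypothetical and deliberately tiny; they are not the project's mapping.

```python
from collections import defaultdict

# Hypothetical, deliberately incomplete character table (illustration only).
NAIVE_MAP = {
    "\u0627": "a",  # alef
    "\u0628": "b",  # beh
    "\u062A": "t",  # teh
    "\u0637": "t",  # tah: same Latin letter as teh, so collisions can occur
    "\u0631": "r",  # reh
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    "\u0648": "v",  # waw
    "\u06CC": "i",  # farsi yeh
}


def transliterate(word, char_map=NAIVE_MAP):
    """Map each character independently; unknown characters pass through."""
    return "".join(char_map.get(ch, ch) for ch in word)


def find_collisions(words, char_map=NAIVE_MAP):
    """Group distinct words by their transliteration and keep shared outputs."""
    groups = defaultdict(set)
    for word in words:
        groups[transliterate(word, char_map)].add(word)
    return {latin: sorted(ws) for latin, ws in groups.items() if len(ws) > 1}


if __name__ == "__main__":
    sample = ["\u062A\u0627\u0631", "\u0637\u0627\u0631", "\u0628\u0627\u0631"]
    collisions = find_collisions(sample)
    colliding_words = sum(len(ws) for ws in collisions.values())
    print(collisions, colliding_words / len(sample))
```

Running `find_collisions` over the full dictionary gives a collision count and a word-level rate, which can then be compared against the figures reported in step 2.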

Metadata

    Status

    High priority
