What kind of preprocessing is being done? #154
-
During morphological analysis, it seems that certain preprocessing steps are taken. For instance, half-width spaces are converted to full-width spaces, and line breaks are eliminated.
I'm using rhoknp for the analysis.
import rhoknp

juman = rhoknp.Jumanpp()

text = " これは半角スペースです。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['\u3000', 'これ', 'は', '半角', 'スペース', 'です', '。']

text = "\nこれは改行です。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['これ', 'は', '改行', 'です', '。']
What other kinds of preprocessing might be done?
-
jumanpp does not perform any preprocessing on the input; it analyses the input data as is.
-
Which version are you referring to?
In Juman++ 2.0.0-rc3, half-width spaces are substituted with full-width spaces as demonstrated below:
$ jumanpp -version
Juman++ Version: 2.0.0-rc3 / Dictionary: 20190731-356e143 / LM: K:20190430-7d143fb L:20181122-b409be68 F:20171214-9d125cb
$ jumanpp
New York
New York
New New New 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"
特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"
York York York 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"
EOS
I'm uncertain whether this substitution was performed as part of the preprocessing, though.
-
Ah, yes, this is postprocessing.
This is only done for JUMAN-style output, though; otherwise the output would be unparseable.
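To illustrate why a half-width space would break JUMAN-style output, here is a minimal sketch. It assumes the twelve space-delimited columns seen in the output above (surface, reading, lemma, part of speech and its ID, and so on); `split_juman_line` is a hypothetical helper written for this example, not part of Juman++ or rhoknp.

```python
def split_juman_line(line: str) -> list[str]:
    """Split one JUMAN-style morpheme line into 12 space-delimited fields.

    Hypothetical, simplified parser: it assumes that the surface, reading
    and lemma columns never contain a half-width space.
    """
    # maxsplit=11 keeps the quoted feature column (which may contain spaces) whole.
    return line.split(" ", 11)


# Columns 4-12 of the whitespace morpheme from the output above.
TAIL = '特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"'

# Surface, reading and lemma written as full-width spaces (U+3000).
full_width = split_juman_line(" ".join(["\u3000"] * 3 + [TAIL]))

# The same line if the half-width space were emitted unchanged.
half_width = split_juman_line(" ".join([" "] * 3 + [TAIL]))

print(full_width[3])        # part-of-speech column is where a parser expects it
print(repr(half_width[3]))  # column boundaries are lost
```

With the full-width space, the part-of-speech column lands at index 3 as expected. With a raw half-width space, the leading columns collapse into empty strings and every later column shifts.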
-
Got it.
Is it possible to treat a line break (\n) as part of a sentence? As far as I know, Juman++ requires input in which each line is a complete sentence, so it cannot treat a line break as part of a sentence.
$ echo '改行\n文字' | jumanpp
改行 かいぎょう 改行 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:改行/かいぎょう カテゴリ:抽象物"
EOS
文字 もじ 文字 名詞 6 普通名詞 1 * 0 * 0 "代表表記:文字/もじ カテゴリ:抽象物"
EOS
-
Not via the CLI. In any case, \n is always a token boundary, so it won't change the token stream.
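Since a line break is a token boundary either way, one possible workaround on the caller's side is to split the text on \n, analyze each line separately, and concatenate the results. The sketch below is only an illustration of that idea: `analyze_multiline` is a hypothetical helper, and the tokenizer is passed in as a plain callable so the example runs without Juman++ installed.

```python
from typing import Callable, Iterable, List


def analyze_multiline(text: str, tokenize: Callable[[str], Iterable[str]]) -> List[str]:
    """Tokenize text line by line and concatenate the token streams.

    Hypothetical helper: `tokenize` is any callable that maps one line
    (one sentence) to its tokens. Line breaks themselves are dropped.
    """
    tokens: List[str] = []
    for line in text.split("\n"):
        line = line.rstrip("\r")  # tolerate Windows-style line endings
        if not line:
            continue  # skip empty lines instead of sending them to the analyzer
        tokens.extend(tokenize(line))
    return tokens


# Stand-in tokenizer for demonstration only: one token per character.
print(analyze_multiline("改行\n文字", list))

# With rhoknp, the tokenizer could wrap the calls used in the question, e.g.:
#   juman = rhoknp.Jumanpp()
#   tokenize = lambda s: [m.surf for m in juman.apply_to_sentence(s).morphemes]
```

This does not keep the line break as a token. If the position of each break matters, the caller would have to record it separately.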
-
Maybe I should add "\n" and "\t" tokens to the dictionary so they would be usable as an input text without doing any weird tokenization stuff.