What kind of preprocessing is being done? · ku-nlp/jumanpp · Discussion #154

tealgreen0503
May 12, 2023

During morphological analysis, it seems that certain preprocessing steps are taken. For instance, half-width spaces are converted to full-width spaces, and line breaks are eliminated.

I'm using rhoknp for the analysis.

import rhoknp
juman = rhoknp.Jumanpp()
text = " これは半角スペースです。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['\u3000', 'これ', 'は', '半角', 'スペース', 'です', '。']
text = "\nこれは改行です。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['これ', 'は', '改行', 'です', '。']

What other kinds of preprocessing might be done?
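
One way to look for other changes is to diff the input string against the concatenated surfaces. The following is only a minimal sketch using the standard-library `difflib`; the helper name `surface_changes` is made up for illustration, and the surface lists are copied from the outputs above rather than produced by calling Juman++.

```python
import difflib


def surface_changes(text, surfaces):
    """Return (input_chunk, output_chunk) pairs where the joined surfaces differ from the input."""
    joined = "".join(surfaces)
    matcher = difflib.SequenceMatcher(None, text, joined, autojunk=False)
    return [
        (text[i1:i2], joined[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Surfaces copied from the two outputs shown above.
space_surfs = ["\u3000", "これ", "は", "半角", "スペース", "です", "。"]
newline_surfs = ["これ", "は", "改行", "です", "。"]

print(surface_changes(" これは半角スペースです。", space_surfs))
print(surface_changes("\nこれは改行です。", newline_surfs))
```

An empty list would mean the surfaces reproduce the input exactly.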

Replies: 1 comment 5 replies

eiennohito
May 12, 2023
Collaborator

jumanpp does not perform any preprocessing on the input; it analyses the input data as is.

5 replies

hkiyomaru May 18, 2023
Maintainer

Which version are you referring to?

In Juman++ 2.0.0-rc3, half-width spaces are substituted with full-width spaces as demonstrated below:

$ jumanpp -version
Juman++ Version: 2.0.0-rc3 / Dictionary: 20190731-356e143 / LM: K:20190430-7d143fb L:20181122-b409be68 F:20171214-9d125cb
$ jumanpp
New York
New York
New New New 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"
   特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"
York York York 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"
EOS

I'm uncertain whether this substitution was performed as part of the preprocessing, though.

eiennohito May 18, 2023
Collaborator

Ah, yes, this is postprocessing.
It is only done for JUMAN-style output though; otherwise the output would be unparsable.
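
If the original half-width space is needed downstream, it can probably be recovered from the JUMAN-style line. The following is a rough sketch, not part of Juman++ or rhoknp: the helper name `original_surface` is hypothetical, and it assumes that the surface field is a full-width space (U+3000) and that the last field carries the 元半角 marker, as in the example output above.

```python
HALF_WIDTH_MARKER = "元半角"  # marker taken from the example output above


def original_surface(juman_line):
    """Return the surface of one JUMAN-style line, undoing the space substitution."""
    # A JUMAN-style morpheme line has 12 space-separated fields; the last
    # (quoted) field may itself contain spaces, so split at most 11 times.
    fields = juman_line.split(" ", 11)
    if len(fields) < 12:
        return fields[0]  # e.g. "EOS"; callers may want to skip such lines
    surf, features = fields[0], fields[11]
    if surf == "\u3000" and HALF_WIDTH_MARKER in features:
        return " "
    return surf


space_line = '\u3000 \u3000 \u3000 特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"'
word_line = 'New New New 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"'
print(repr(original_surface(space_line)))
print(repr(original_surface(word_line)))
```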

hkiyomaru May 19, 2023
Maintainer

Got it.

Is it possible to treat a line break (\n) as part of a sentence? As far as I know, Juman++ expects each input line to be a complete sentence and cannot treat a line break as part of a sentence.

$ echo '改行\n文字' | jumanpp
改行 かいぎょう 改行 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:改行/かいぎょう カテゴリ:抽象物"
EOS
文字 もじ 文字 名詞 6 普通名詞 1 * 0 * 0 "代表表記:文字/もじ カテゴリ:抽象物"
EOS

eiennohito May 19, 2023
Collaborator

Not via the CLI. In any case, \n is definitely a token break, so it won't change the token stream.
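
Since a line break always ends a token, splitting the text on line breaks and analysing each line separately should give the same tokens. A minimal sketch of that workaround follows; `analyze_by_line` is a hypothetical helper, and `analyze` stands for any per-sentence analyser, for example `juman.apply_to_sentence` from the original post.

```python
def analyze_by_line(text, analyze):
    """Run `analyze` on each non-empty line of `text` and collect the results."""
    results = []
    # Note: str.splitlines() also splits on \r\n and a few other separators.
    for line in text.splitlines():
        if line:  # skip empty lines, which carry no tokens
            results.append(analyze(line))
    return results


# Stand-in analyser for demonstration only: splits a line into characters.
print(analyze_by_line("改行\n文字", list))
```

This loses the line breaks themselves, so their positions have to be tracked separately if they matter.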

eiennohito May 19, 2023
Collaborator

Maybe I should add "\n" and "\t" tokens to the dictionary so they would be usable in input text without doing any weird tokenization stuff.
