What kind of preprocessing is being done? #154

tealgreen0503 started this conversation in General

During morphological analysis, it seems that certain preprocessing steps are taken. For instance, half-width spaces are converted to full-width spaces, and line breaks are eliminated.

I'm using rhoknp for the analysis.

import rhoknp

juman = rhoknp.Jumanpp()

# Leading half-width space; the sentence means "This is a half-width space."
text = " これは半角スペースです。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['\u3000', 'これ', 'は', '半角', 'スペース', 'です', '。']

# Leading line break; the sentence means "This is a line break."
text = "\nこれは改行です。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['これ', 'は', '改行', 'です', '。']

What other kinds of preprocessing might be done?


Replies: 1 comment 5 replies


Juman++ does not perform any preprocessing on the input; it analyzes the input data as is.


Which version are you referring to?

In Juman++ 2.0.0-rc3, half-width spaces are substituted with full-width spaces as demonstrated below:

$ jumanpp -version
Juman++ Version: 2.0.0-rc3 / Dictionary: 20190731-356e143 / LM: K:20190430-7d143fb L:20181122-b409be68 F:20171214-9d125cb
$ jumanpp
New York
New York
New New New 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"
   特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"
York York York 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"
EOS

I'm uncertain whether this substitution was performed as part of the preprocessing, though.


Ah, yes, this is postprocessing.
It is done only for JUMAN-style output, though; otherwise the output would be unparseable.
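
For readers who need the original spacing back, the substitution can be undone on the consumer side. The following is a minimal sketch, not part of Juman++ or rhoknp: `restore_half_width_spaces` is a hypothetical helper, and it assumes that the `元半角` ("originally half-width") tag in the feature string, as seen in the output above, reliably marks a substituted space. It also assumes that the surface field of such a line is a single full-width space (U+3000). Line types other than plain morpheme lines and `EOS` are not handled.

```python
FULL_WIDTH_SPACE = "\u3000"
HALF_WIDTH_TAG = "元半角"  # tag seen in the JUMAN-style output quoted above


def restore_half_width_spaces(juman_output: str) -> list:
    """Return surface forms, mapping tagged full-width spaces back to ' '."""
    surfaces = []
    for line in juman_output.split("\n"):
        if not line or line == "EOS":
            continue
        # A morpheme line has 11 space-separated fields followed by a
        # feature string, which may itself contain spaces.
        fields = line.split(" ", 11)
        if len(fields) < 12:
            continue  # not a plain morpheme line; skipped in this sketch
        surf, features = fields[0], fields[11]
        if surf == FULL_WIDTH_SPACE and HALF_WIDTH_TAG in features:
            surf = " "
        surfaces.append(surf)
    return surfaces


# Sample modeled on the "New York" output above (space fields assumed U+3000).
sample = "\n".join([
    'New New New 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"',
    '\u3000 \u3000 \u3000 特殊 1 空白 6 * 0 * 0 "代表表記:S/* 元半角"',
    'York York York 未定義語 15 アルファベット 3 * 0 * 0 "未知語:ローマ字 品詞推定:名詞"',
    "EOS",
])
print(restore_half_width_spaces(sample))
# ['New', ' ', 'York']
```

Joining the returned list with `"".join(...)` gives back `New York` for this sample.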


Got it.

Is it possible to treat a line break (\n) as part of a sentence? As far as I know, Juman++ requires input in which each line is a complete sentence, so it cannot treat a line break as part of a sentence.

$ echo '改行\n文字' | jumanpp
改行 かいぎょう 改行 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:改行/かいぎょう カテゴリ:抽象物"
EOS
文字 もじ 文字 名詞 6 普通名詞 1 * 0 * 0 "代表表記:文字/もじ カテゴリ:抽象物"
EOS

Not via the CLI. In any case, \n is definitely a token break, so treating it as part of the sentence would not change the token stream.


Maybe I should add "\n" and "\t" tokens to the dictionary so that they could be used in input text without any weird tokenization workarounds.
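
Until such tokens exist, one possible workaround is to split the text on line breaks yourself, analyze each line separately, and put a "\n" token back between the results. Since \n is a token break anyway, this should give the same tokens as a single pass would. The sketch below is only an illustration: `analyze_with_line_breaks` is a hypothetical helper, and the analyzer is passed in as a callable so that the example runs without Juman++ installed.

```python
from typing import Callable, List


def analyze_with_line_breaks(
    text: str, analyze: Callable[[str], List[str]]
) -> List[str]:
    """Tokenize text line by line, keeping each line break as its own token."""
    tokens: List[str] = []
    for index, line in enumerate(text.split("\n")):
        if index > 0:
            tokens.append("\n")  # re-insert the break that split() consumed
        if line:  # do not send empty lines to the analyzer
            tokens.extend(analyze(line))
    return tokens


# Stand-in analyzer for demonstration only: it returns the line as one token.
def whole_line(line: str) -> List[str]:
    return [line]


print(analyze_with_line_breaks("改行\n文字", whole_line))
# ['改行', '\n', '文字']
```

With rhoknp, the stand-in could be replaced by a function that returns `[m.surf for m in juman.apply_to_sentence(line).morphemes]`, as in the original question.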

This discussion was converted from issue #153 on May 12, 2023 23:31.
