In my project I needed to import .txt files without knowing their encoding, but knowing they would most likely be in Czech or Slovak. Sadly, there is a bunch of possible encodings, so I decided to write code that tries to detect the encoding based on the content of the ByteArray.
The idea is to:
- Convert all characters whose encoding differs across charsets (in this case, accented characters in Czech and Slovak) to their ByteArray representation.
- Count their occurrences in the original ByteArray for each Charset and calculate its rating based on how many characters were found in the ByteArray (weighted by character byte size).
- Choose the Charset with the highest rating and priority (priority breaks ties between charsets).
My current code looks like this (I removed some util conversion functions to focus on the important bits; sketches of those helpers follow the listing):
const val CZECH_CHARS_LOWER = "říšžťčýůňúěďáéó"
const val CZECH_CHARS_UPPER = "ŘÍŠŽŤČÝŮŇÚĚĎÁÉÓ"
const val CZECH_CHARS = CZECH_CHARS_LOWER + CZECH_CHARS_UPPER
const val SLOVAK_CHARS = "ÄäÔôĹĽĺ" // 'ľ' causes problems - it has the same byte value as 'ž' in another encoding
data class CharCountResult(val charset: Charset, val score: Double, val priority: Int = 0) :
    Comparable<CharCountResult> {

    // Sort descending by score, then descending by priority
    override fun compareTo(other: CharCountResult): Int {
        return if (this.score == other.score) {
            other.priority.compareTo(this.priority)
        } else {
            other.score.compareTo(this.score)
        }
    }
}
class UTFCharsetCharCounter(text: String, priority: Int = 10) :
    CharsetCharCounter(charsToDetect = text, charset = UTF_8, priority = priority) {

    // Reject when the bytes contain sequences that are invalid in UTF-8
    override fun autoReject(text: ByteArray): Boolean {
        return text.countSubArray(UTF8_INVALID_CHARACTER_BYTES) > 0
    }

    // Accept immediately when the data starts with a UTF-8 BOM
    override fun autoAccept(text: ByteArray): Boolean {
        return text.startsWith(UTF8_BOM)
    }
}
fun String.toCharsToBytesMap(charset: Charset): Map<Char, ByteArray> {
    return this.map { it to charset.encode(it.toCharBuffer()).toByteArray() }.toMap()
}
open class CharsetCharCounter(
    val charsToDetect: String = CZECH_CHARS + SLOVAK_CHARS,
    val charset: Charset,
    val charsToBytes: Map<Char, ByteArray> = charsToDetect.toCharsToBytesMap(charset),
    val priority: Int = 0
) {
    fun count(text: ByteArray): CharCountResult {
        if (autoReject(text)) {
            return CharCountResult(charset, 0.0)
        }
        if (autoAccept(text)) {
            return CharCountResult(charset, 100.0)
        }
        return CharCountResult(charset, (100.0 * foundCharacters(text) / text.size), priority = priority)
    }

    // Sum of bytes covered by detected characters (occurrences * byte size)
    private fun foundCharacters(text: ByteArray): Int {
        return charsToBytes.values.sumOf { text.countSubArray(it) * it.size }
    }

    open fun autoAccept(text: ByteArray): Boolean = false

    open fun autoReject(text: ByteArray): Boolean = false
}
class CharsetDetector(
    val charCounters: Array<CharsetCharCounter>,
    val defaultCharset: Charset = UTF_8
) {
    // Ratings for all charsets, best match first
    fun detect(rawData: ByteArray): List<CharCountResult> {
        return charCounters.map { it.count(rawData) }.sorted()
    }

    // Decode the data with the best-rated charset (or the default if none)
    fun smartRead(rawData: ByteArray): String {
        val charset = detect(rawData).firstOrNull()?.charset ?: defaultCharset
        return String(rawData, charset)
    }

    companion object {
        const val CHARS_TO_DETECT = CZECH_CHARS + SLOVAK_CHARS

        val CZECH_CHAR_COUNTERS = arrayOf(
            UTFCharsetCharCounter(CHARS_TO_DETECT),
            CharsetCharCounter(CHARS_TO_DETECT, charset = CP1250, priority = 5),
            CharsetCharCounter(CHARS_TO_DETECT, charset = ISO_8859_2, priority = 4),
            CharsetCharCounter(CHARS_TO_DETECT, charset = IBM852, priority = 3)
        )

        val CZECH_CHARSET_DETECTOR = CharsetDetector(
            charCounters = CZECH_CHAR_COUNTERS
        )
    }
}
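For reference, the omitted util functions and constants look roughly like this. These are simplified sketches, not verbatim from the repo; in particular, the exact contents of UTF8_INVALID_CHARACTER_BYTES and the charset lookups are my reconstruction here:

import java.nio.ByteBuffer
import java.nio.CharBuffer
import java.nio.charset.Charset
import kotlin.text.Charsets.UTF_8

// Charset constants used by the counters
val CP1250: Charset = Charset.forName("windows-1250")
val ISO_8859_2: Charset = Charset.forName("ISO-8859-2")
val IBM852: Charset = Charset.forName("IBM852")

// UTF-8 byte order mark; the "invalid" marker is assumed here to be the
// UTF-8 encoding of U+FFFD (the replacement character)
val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())
val UTF8_INVALID_CHARACTER_BYTES = byteArrayOf(0xEF.toByte(), 0xBF.toByte(), 0xBD.toByte())

fun Char.toCharBuffer(): CharBuffer = CharBuffer.wrap(charArrayOf(this))

fun ByteBuffer.toByteArray(): ByteArray = ByteArray(remaining()).also { get(it) }

fun ByteArray.startsWith(prefix: ByteArray): Boolean =
    size >= prefix.size && prefix.indices.all { this[it] == prefix[it] }

// Counts non-overlapping occurrences of sub in this array
fun ByteArray.countSubArray(sub: ByteArray): Int {
    if (sub.isEmpty()) return 0
    var count = 0
    var i = 0
    while (i <= size - sub.size) {
        if (sub.indices.all { this[i + it] == sub[it] }) {
            count++
            i += sub.size
        } else {
            i++
        }
    }
    return count
}

With these in place, usage is a one-liner: CharsetDetector.CZECH_CHARSET_DETECTOR.smartRead(file.readBytes()).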
I tried to write it so that anyone can use this class for a different group of characters and a different set of charsets. My questions are (feel free to answer only the ones you like):
- What would you change semantically in the code in general?
- How would you improve the algorithm?
- A disadvantage of this approach is that you need to load the whole byte array and try to convert it into a String to detect the encoding. That could be an issue for big data. I was thinking of creating an InputStream that would detect the encoding and change it "on the fly". Any points on that?
- Currently this code is JVM-dependent. How would you approach making it multiplatform?
You can see the full code on GitHub:
There is also a test class:
2 Answers
What would you change semantically in the code in general?
- I'd use the functionality provided by CharsetEncoder and CharsetDecoder, rather than performing a map to the characters themselves.
- Adding to that, I would also allow 16-bit encodings such as UTF-16 LE and BE, especially since UTF-16LE (under the stupid class name Unicode) is the stupid default for .NET.
- I'd clearly disallow invalid characters, and quickly decide that this is not the charset when they are present (a sketch follows this list).
How would you improve the algorithm?
- I'd use some kind of frequency analysis on top of just looking for specific characters.
- Even if specific characters are detected, I would weight them according to how frequently they appear in typical texts (a sketch follows this list).
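An illustrative sketch of such weighting; the weights below are made-up placeholders, not real Czech letter statistics:

// Hypothetical per-character weights derived from a corpus;
// common accented letters count more than rare ones.
val charWeights: Map<Char, Double> = mapOf(
    'ě' to 1.5, // common in Czech (placeholder value)
    'ř' to 1.3, // placeholder value
    'ó' to 0.2  // rare in Czech (placeholder value)
)

// Scores a map of character -> occurrence count using the weights,
// falling back to 1.0 for characters without an explicit weight.
fun weightedScore(counts: Map<Char, Int>): Double =
    counts.entries.sumOf { (ch, n) -> n * (charWeights[ch] ?: 1.0) }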
A disadvantage of this approach is that you need to load the whole byte array and try to convert it into a String to detect the encoding. That could be an issue for big data. I was thinking of creating an InputStream that would detect the encoding and change it "on the fly". Any points on that?
I guess this is similar to the first point I made. The encoders and decoders use ByteBuffer (and CharBuffer for encoding), and that buffer does not need to cover the whole file (they have specific methods for handling the end of a buffer). A sketch of such chunked decoding follows.
Currently this code is JVM-dependent. How would you approach making it multiplatform?
That I cannot answer, as it depends on what functionality is present on the other platforms. However, you can always define a generic interface to your methods and then implement and test it on each platform.
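A minimal sketch of that idea using Kotlin multiplatform's expect/actual mechanism; the declarations live in different source sets, and the function name is illustrative:

// commonMain: declare the capability without any JVM dependency
expect fun decodeBytes(bytes: ByteArray, charsetName: String): String

// jvmMain: provide the JVM implementation
actual fun decodeBytes(bytes: ByteArray, charsetName: String): String =
    String(bytes, java.nio.charset.Charset.forName(charsetName))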
As a completely out-of-the-box comment, I'd say that this kind of thing is also a good candidate for machine learning. But I guess that depends on the developer having to understand AI to a rather large extent.
- But note that I don't think the code is that bad either. What it does need is documentation and comments, of course. Currently we need to guess the purpose of constructions. – Maarten Bodewes, Dec 14, 2019 at 17:52
- Thank you for the suggestions. I will definitely add the other Unicode encodings, that's a good idea. I am already disallowing (auto-rejecting) based on some invalid characters and auto-accepting based on the BOM. I get your point about AI and frequency analysis, but this should rather stay a small utility with few dependencies. And by better documentation do you mean documentation outside of the code? I am trying to document mostly by naming things correctly, and I have a feeling it could be done better here (I have already renamed those classes/methods a bunch of times). – K.H., Dec 15, 2019 at 3:20
I had a hard time finding out how to actually call this code, because there is no obvious single entry point.
Instead of starting the code with the character constants, you should start it with the main function to be called.
In Kotlin you can define top-level functions, therefore you should not hide the entry point as a method on a constant.
fun loadAutodetectCharset(file: File): String {
    // for example, delegating to the detector from the question:
    return CharsetDetector.CZECH_CHARSET_DETECTOR.smartRead(file.readBytes())
}
The remaining classes should be made private, as far as they are implementation details.
How do you intend that other people can plug their own language and encoding detectors into your framework? If it's only by defining the list of characters, they don't need the complicated API of deriving from an open class.
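One possible shape for that, sketched against the question's own types; the factory function name is illustrative:

import java.nio.charset.Charset

// Builds a detector from plain data instead of requiring subclassing;
// earlier charsets in the list get higher tie-breaking priority.
fun charsetDetectorFor(chars: String, charsets: List<Charset>): CharsetDetector =
    CharsetDetector(
        charCounters = charsets.mapIndexed { index, cs ->
            CharsetCharCounter(charsToDetect = chars, charset = cs, priority = charsets.size - index)
        }.toTypedArray()
    )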
- I take it by "hard time finding out" you mean by just looking at the code, without looking at the tests? I never really thought of organizing my code so that the public API is the first thing to see. The closest thing to what you would call a "top level" entry point is probably CharsetDetector.CZECH_CHARSET_DETECTOR. I don't like overly restricting visibility; I'd rather make it more clear what the public API is, but you mentioned that it isn't clear, fair point. Plugging in one's own language and detectors is shown in the CharsetDetector companion object, e.g. CZECH_CHAR_COUNTERS. – K.H., Dec 15, 2019 at 3:24
- Yes, exactly. I first tried to only look at the code. – Roland Illig, Dec 15, 2019 at 9:38
- Charset is undeclared, which is easy to fix. But when I load the code into my IDE, it cannot determine any import for CP1252. Please post the complete compilable code next time. --- There's a rule on this site that once a question has answers, you must not modify your question anymore, to make sure that the answers stay valid. Since none of the answers has mentioned this so far, I'd say it's OK to add the missing imports.
- Could you link not to the master branch but to a fixed commit? Then it would be perfect. Thanks for adding the links.