In my project I needed to import .txt files without knowing their encoding, but knowing they would most likely be in Czech or Slovak. Sadly, there is a bunch of possible encodings, so I decided to write code that tries to detect the encoding based on the content of the ByteArray.
The idea is to:
- Convert all characters whose encoding differs across charsets (in this case, accented characters in Czech and Slovak) to their ByteArray representation.
- Count their occurrences in the original ByteArray for each Charset and calculate its rating based on how many characters were found in the ByteArray (weighted by character byte size).
- Choose the Charset with the highest rating and priority (priority breaks ties between charsets).
My current code looks like this (I removed some util conversion functions to focus on the important bits; sketches of those helpers follow the listing):
const val CZECH_CHARS_LOWER = "říšžťčýůňúěďáéó"
const val CZECH_CHARS_UPPER = "ŘÍŠŽŤČÝŮŇÚĚĎÁÉÓ"
const val CZECH_CHARS = CZECH_CHARS_LOWER + CZECH_CHARS_UPPER
const val SLOVAK_CHARS = "ÄäÔôĹĽĺ" // 'ľ' causes problems - it has the same byte value as 'ž' in another encoding
data class CharCountResult(val charset: Charset, val score: Double, val priority: Int = 0) :
    Comparable<CharCountResult> {

    // Sort descending by score, then descending by priority
    override fun compareTo(other: CharCountResult): Int {
        return if (this.score == other.score) {
            other.priority.compareTo(this.priority)
        } else {
            other.score.compareTo(this.score)
        }
    }
}
class UTFCharsetCharCounter(text: String, priority: Int = 10) :
    CharsetCharCounter(charsToDetect = text, charset = UTF_8, priority = priority) {

    // Reject when the bytes contain sequences that are invalid in UTF-8
    override fun autoReject(text: ByteArray): Boolean {
        return text.countSubArray(UTF8_INVALID_CHARACTER_BYTES) > 0
    }

    // Accept immediately when the data starts with a UTF-8 BOM
    override fun autoAccept(text: ByteArray): Boolean {
        return text.startsWith(UTF8_BOM)
    }
}
fun String.toCharsToBytesMap(charset: Charset): Map<Char, ByteArray> {
    return this.map { it to charset.encode(it.toCharBuffer()).toByteArray() }.toMap()
}
open class CharsetCharCounter(
    val charsToDetect: String = CZECH_CHARS + SLOVAK_CHARS,
    val charset: Charset,
    val charsToBytes: Map<Char, ByteArray> = charsToDetect.toCharsToBytesMap(charset),
    val priority: Int = 0
) {
    fun count(text: ByteArray): CharCountResult {
        if (autoReject(text)) {
            return CharCountResult(charset, 0.0)
        }
        if (autoAccept(text)) {
            return CharCountResult(charset, 100.0)
        }
        return CharCountResult(charset, (100.0 * foundCharacters(text) / text.size), priority = priority)
    }

    // Sum of bytes covered by detected characters (occurrences * byte size)
    private fun foundCharacters(text: ByteArray): Int {
        return charsToBytes.values.sumOf { text.countSubArray(it) * it.size }
    }

    open fun autoAccept(text: ByteArray): Boolean = false

    open fun autoReject(text: ByteArray): Boolean = false
}
class CharsetDetector(
    val charCounters: Array<CharsetCharCounter>,
    val defaultCharset: Charset = UTF_8
) {
    // Ratings for all charsets, best match first
    fun detect(rawData: ByteArray): List<CharCountResult> {
        return charCounters.map { it.count(rawData) }.sorted()
    }

    // Decode the data with the best-rated charset (or the default if none)
    fun smartRead(rawData: ByteArray): String {
        val charset = detect(rawData).firstOrNull()?.charset ?: defaultCharset
        return String(rawData, charset)
    }

    companion object {
        const val CHARS_TO_DETECT = CZECH_CHARS + SLOVAK_CHARS

        val CZECH_CHAR_COUNTERS = arrayOf(
            UTFCharsetCharCounter(CHARS_TO_DETECT),
            CharsetCharCounter(CHARS_TO_DETECT, charset = CP1250, priority = 5),
            CharsetCharCounter(CHARS_TO_DETECT, charset = ISO_8859_2, priority = 4),
            CharsetCharCounter(CHARS_TO_DETECT, charset = IBM852, priority = 3)
        )

        val CZECH_CHARSET_DETECTOR = CharsetDetector(
            charCounters = CZECH_CHAR_COUNTERS
        )
    }
}
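For reference, the omitted util functions and constants look roughly like this. These are simplified sketches, not verbatim from the repo; in particular, the exact contents of UTF8_INVALID_CHARACTER_BYTES and the charset lookups are my reconstruction here:

import java.nio.ByteBuffer
import java.nio.CharBuffer
import java.nio.charset.Charset
import kotlin.text.Charsets.UTF_8

// Charset constants used by the counters
val CP1250: Charset = Charset.forName("windows-1250")
val ISO_8859_2: Charset = Charset.forName("ISO-8859-2")
val IBM852: Charset = Charset.forName("IBM852")

// UTF-8 byte order mark; the "invalid" marker is assumed here to be the
// UTF-8 encoding of U+FFFD (the replacement character)
val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())
val UTF8_INVALID_CHARACTER_BYTES = byteArrayOf(0xEF.toByte(), 0xBF.toByte(), 0xBD.toByte())

fun Char.toCharBuffer(): CharBuffer = CharBuffer.wrap(charArrayOf(this))

fun ByteBuffer.toByteArray(): ByteArray = ByteArray(remaining()).also { get(it) }

fun ByteArray.startsWith(prefix: ByteArray): Boolean =
    size >= prefix.size && prefix.indices.all { this[it] == prefix[it] }

// Counts non-overlapping occurrences of sub in this array
fun ByteArray.countSubArray(sub: ByteArray): Int {
    if (sub.isEmpty()) return 0
    var count = 0
    var i = 0
    while (i <= size - sub.size) {
        if (sub.indices.all { this[i + it] == sub[it] }) {
            count++
            i += sub.size
        } else {
            i++
        }
    }
    return count
}

With these in place, usage is a one-liner: CharsetDetector.CZECH_CHARSET_DETECTOR.smartRead(file.readBytes()).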
I tried to write it so that anyone can use this class for a different group of characters and a different set of charsets. My questions are (feel free to answer only the ones you like):
- What would you change semantically in the code in general?
- How would you improve the algorithm?
- A disadvantage of this approach is that you need to load the whole byte array and try to convert it into a String to detect the encoding. That could be an issue for big data. I was thinking of creating an InputStream that would detect the encoding and change it "on the fly". Any points on that?
- Currently this code is JVM-dependent. How would you approach making it multiplatform?
You can see the full code on GitHub:
There is also a test class:
2 Answers
What would you change semantically in the code in general?
- I'd use the functionality provided by CharsetEncoder and CharsetDecoder, rather than performing a map to the characters themselves.
- Adding to that, I would also allow 16-bit encodings such as UTF-16 LE and BE, especially since UTF-16LE (under the stupid class name Unicode) is the stupid default for .NET.
- I'd clearly disallow invalid characters, and quickly decide that this is not the charset when they are present (a sketch follows this list).
How would you improve the algorithm?
- I'd use some kind of frequency analysis on top of just looking for specific characters.
- Even if specific characters are detected, I would weight them according to how frequently they appear in typical texts (a sketch follows this list).
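An illustrative sketch of such weighting; the weights below are made-up placeholders, not real Czech letter statistics:

// Hypothetical per-character weights derived from a corpus;
// common accented letters count more than rare ones.
val charWeights: Map<Char, Double> = mapOf(
    'ě' to 1.5, // common in Czech (placeholder value)
    'ř' to 1.3, // placeholder value
    'ó' to 0.2  // rare in Czech (placeholder value)
)

// Scores a map of character -> occurrence count using the weights,
// falling back to 1.0 for characters without an explicit weight.
fun weightedScore(counts: Map<Char, Int>): Double =
    counts.entries.sumOf { (ch, n) -> n * (charWeights[ch] ?: 1.0) }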
A disadvantage of this approach is that you need to load the whole byte array and try to convert it into a String to detect the encoding. That could be an issue for big data. I was thinking of creating an InputStream that would detect the encoding and change it "on the fly". Any points on that?
I guess this is similar to the first point I made. The encoders and decoders use ByteBuffer (and CharBuffer for encoding), and that buffer does not need to cover the whole file (they have specific methods for handling the end of a buffer). A sketch of such chunked decoding follows.
Currently this code is JVM-dependent. How would you approach making it multiplatform?
That I cannot answer, as it depends on what functionality is present on the other platforms. However, you can always define a generic interface to your methods and then implement and test it on each platform.
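A minimal sketch of that idea using Kotlin multiplatform's expect/actual mechanism; the declarations live in different source sets, and the function name is illustrative:

// commonMain: declare the capability without any JVM dependency
expect fun decodeBytes(bytes: ByteArray, charsetName: String): String

// jvmMain: provide the JVM implementation
actual fun decodeBytes(bytes: ByteArray, charsetName: String): String =
    String(bytes, java.nio.charset.Charset.forName(charsetName))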
As a completely out-of-the-box comment, I'd say that this kind of thing is also a good candidate for machine learning. But I guess that depends on the developer having to understand AI to a rather large extent.
- But note that I don't think the code is that bad either. What it does need is documentation and comments, of course. Currently we need to guess the purpose of constructions. – Maarten Bodewes, Dec 14, 2019 at 17:52
- Thank you for the suggestions. I will definitely add the other Unicode encodings, that's a good idea. I am already disallowing (auto-rejecting) based on some invalid characters and auto-accepting based on the BOM. I get your point about AI and frequency analysis, but this should rather stay a small utility with few dependencies. And by better documentation do you mean documentation outside of the code? I am trying to document mostly by naming things correctly, and I have a feeling it could be done better here (I have already renamed those classes/methods a bunch of times). – K.H., Dec 15, 2019 at 3:20
I had a hard time finding out how to actually call this code, because there is no obvious single entry point.
Instead of starting the code with the character constants, you should start it with the main function to be called.
In Kotlin you can define top-level functions, therefore you should not hide the entry point as a method on a constant.
fun loadAutodetectCharset(file: File): String {
    // for example, delegating to the detector from the question:
    return CharsetDetector.CZECH_CHARSET_DETECTOR.smartRead(file.readBytes())
}
The remaining classes should be made private, as far as they are implementation details.
How do you intend that other people can plug their own language and encoding detectors into your framework? If it's only by defining the list of characters, they don't need the complicated API of deriving from an open class.
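One possible shape for that, sketched against the question's own types; the factory function name is illustrative:

import java.nio.charset.Charset

// Builds a detector from plain data instead of requiring subclassing;
// earlier charsets in the list get higher tie-breaking priority.
fun charsetDetectorFor(chars: String, charsets: List<Charset>): CharsetDetector =
    CharsetDetector(
        charCounters = charsets.mapIndexed { index, cs ->
            CharsetCharCounter(charsToDetect = chars, charset = cs, priority = charsets.size - index)
        }.toTypedArray()
    )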
- I take it by "hard time finding out" you mean by just looking at the code, without looking at the tests? I never really thought of organizing my code so that the public API is the first thing to see. The closest thing to what you would call a "top level" entry point is probably CharsetDetector.CZECH_CHARSET_DETECTOR. I don't like overly restricting visibility; I'd rather make it more clear what the public API is, but you mentioned that it isn't clear, fair point. Plugging in one's own language and detectors is shown in the CharsetDetector companion object, e.g. CZECH_CHAR_COUNTERS. – K.H., Dec 15, 2019 at 3:24
- Yes, exactly. I first tried to only look at the code. – Roland Illig, Dec 15, 2019 at 9:38
- Charset is undeclared, which is easy to fix. But when I load the code into my IDE, it cannot determine any import for CP1252. Please post the complete compilable code next time. --- There's a rule on this site that once a question has answers, you must not modify your question anymore, to make sure that the answers stay valid. Since none of the answers has mentioned this so far, I'd say it's OK to add the missing imports.
- Could you link not to the master branch but to a fixed commit? Then it would be perfect. Thanks for adding the links.