
Here is the current code in my application:

String[] ids = str.split("/");

When profiling the application, a non-negligible amount of time is spent splitting strings. Also, the split method takes a regular expression, which is superfluous here.

What alternative can I use in order to optimize the string splitting? Is StringUtils.split faster?

(I would've tried and tested myself but profiling my application takes a lot of time.)

Dave Jarvis
asked Jun 12, 2012 at 17:02
  • N.B. most answers are out of date because the JDK's String.split() receives frequent optimisations, while most of the alternatives do not. Commented May 27 at 15:39

10 Answers


String.split(String) won't compile a regex if your pattern is only one character long (and that character is not a regex metacharacter). When splitting by a single character, it uses specialized code which is pretty efficient. StringTokenizer is not much faster in this particular case.

This was introduced in OpenJDK7/OracleJDK7. Here's a bug report and a commit. I've made a simple benchmark here.


$ java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
$ java Split
split_banthar: 1231
split_tskuzzy: 1464
split_tskuzzy2: 1742
string.split: 1291
StringTokenizer: 1517
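As the comments below point out, the single-character fast path only applies when that character is not a regex metacharacter. A quick sanity check of that caveat, using only the JDK:

```java
public class SplitFastPath {
    public static void main(String[] args) {
        // '/' is not a regex metacharacter, so split takes the fast path
        String[] ids = "1/2/3".split("/");
        System.out.println(ids.length); // 3

        // '.' IS a metacharacter: unescaped, it matches every character,
        // so every token is empty and trailing empty tokens are removed
        System.out.println("a.b.c".split(".").length); // 0

        // escaping restores the expected behaviour (and, per the javadoc
        // fastpath comment, a "\\x" two-char pattern also avoids the regex)
        System.out.println("a.b.c".split("\\.").length); // 3
    }
}
```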
answered Jun 12, 2012 at 18:13
  • thanks for this benchmark. Your code is "unfair" though, since the StringTokenizer part avoids creating a List and converting it to an array... great starting point though! Commented Dec 28, 2016 at 17:56
  • to avoid regex creation inside the split method, having a one-char pattern isn't enough. That char also must not be one of the regex metacharacters ".$|()[{^?*+\\", e.g. split(".") will create/compile a regex pattern. (verified on jdk8 at least) Commented Jun 21, 2018 at 23:55
  • In my version of Java 8 it does. From the split implementation comment: fastpath if the regex is (1) a one-char String and this character is not one of the regex's metacharacters ".$|()[{^?*+\\", or (2) a two-char String where the first char is the backslash and the second is not an ASCII digit or ASCII letter. Commented Mar 7, 2021 at 12:49
  • Adding a qualification: if you just put in, say, "|", that's going to be treated as a regular expression, but "\\|" is not. That confused me a bit at first. Commented Mar 7, 2021 at 13:39
  • At least split_banthar (tested with copy/paste code) does NOT have the same behaviour as the Java split... Commented Sep 15, 2021 at 3:29

If you can use third-party libraries, Guava's Splitter doesn't incur the overhead of regular expressions when you don't ask for it, and is very fast as a general rule. (Disclosure: I contribute to Guava.)

Iterable<String> split = Splitter.on('/').split(string);

(Also, Splitter is, as a rule, much more predictable than String.split.)

Evdzhan Mustafa
answered Jun 12, 2012 at 17:10
  • This made a very significant difference for me while using it on the lines from a large file. Commented Dec 11, 2014 at 17:14
  • This post recommends not using Iterable; even Guava's team lead says so... alexruiz.developerblogs.com/?p=2519 Commented Jan 7, 2015 at 1:33
  • The blog entry has vanished, but there is a snapshot available in the Internet Archive. Commented Jan 29, 2021 at 6:24
  • @sirvon the point that blog post makes does not apply here... Commented May 27 at 15:24

StringTokenizer is much faster for simple parsing like this (I did some benchmarking a while back and you get huge speedups).

StringTokenizer st = new StringTokenizer("1/2/3", "/");
String[] arr = new String[st.countTokens()];
for (int i = 0; st.hasMoreTokens(); i++) {
    arr[i] = st.nextToken();
}

If you want to eke out a little more performance, you can do it manually as well:

String s = "1/2/3";
char[] c = s.toCharArray();
LinkedList<String> ll = new LinkedList<String>();
int index = 0;
for (int i = 0; i < c.length; i++) {
    if (c[i] == '/') {
        ll.add(s.substring(index, i));
        index = i + 1;
    }
}
ll.add(s.substring(index)); // don't forget the trailing token
String[] arr = new String[ll.size()];
Iterator<String> iter = ll.iterator();
for (int i = 0; iter.hasNext(); i++)
    arr[i] = iter.next();
Marko Zajc
answered Jun 12, 2012 at 17:04
  • StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead. Commented Jun 12, 2012 at 17:07
  • Just because it's legacy doesn't mean it's not useful. In fact, this particular class is very useful for that extra performance boost, so I am actually against this "legacy" label. Commented Jun 12, 2012 at 17:09
  • The split method of String and the java.util.regex package incur the significant overhead of using regexes. StringTokenizer does not. Commented Jun 12, 2012 at 17:09
  • @tskuzzy it doesn't matter whether you are against the "legacy" label or not; as the javadoc says, its use is discouraged. Commented Jun 12, 2012 at 17:17
  • @NandkumarTekale You apparently did not understand my point. But if you want to avoid using "legacy" classes in favor of "slow" ones, that is your choice. Commented Jan 22, 2014 at 16:00

Seeing as I am working at a large scale, I thought it would help to provide some more benchmarking, including a few of my own implementations (I split on spaces, but this should illustrate how long it takes in general):

I'm working with a 426 MB file with 2,622,761 lines. The only whitespace characters are normal spaces (" ") and newlines ("\n").

First I replace all newlines with spaces, and benchmark parsing one huge line:

.split(" ")
Cumulative time: 31.431366952 seconds
.split("\\s")
Cumulative time: 52.948729489 seconds
splitStringChArray()
Cumulative time: 38.721338004 seconds
splitStringChList()
Cumulative time: 12.716065893 seconds
splitStringCodes()
Cumulative time: 1 minutes, 21.349029036000005 seconds
splitStringCharCodes()
Cumulative time: 23.459840685 seconds
StringTokenizer
Cumulative time: 1 minutes, 11.501686094999997 seconds

Then I benchmark splitting line by line (meaning that the functions and loops are done many times, instead of all at once):

.split(" ")
Cumulative time: 3.809014174 seconds
.split("\\s")
Cumulative time: 7.906730124 seconds
splitStringChArray()
Cumulative time: 4.06576739 seconds
splitStringChList()
Cumulative time: 2.857809996 seconds
Bonus: splitStringChList(), but creating a new StringBuilder every time (the average difference is actually more like .42 seconds):
Cumulative time: 3.82026621 seconds
splitStringCodes()
Cumulative time: 11.730249921 seconds
splitStringCharCodes()
Cumulative time: 6.995555826 seconds
StringTokenizer
Cumulative time: 4.500008172 seconds

Here is the code:

// Use a char array, and count the number of instances first.
public static String[] splitStringChArray(String str, StringBuilder sb) {
    char[] strArray = str.toCharArray();
    int count = 0;
    for (char c : strArray) {
        if (c == ' ') {
            count++;
        }
    }
    String[] splitArray = new String[count + 1];
    int i = 0;
    for (char c : strArray) {
        if (c == ' ') {
            splitArray[i++] = sb.toString();
            sb.setLength(0);
        } else {
            sb.append(c);
        }
    }
    splitArray[i] = sb.toString(); // trailing word
    sb.setLength(0);
    return splitArray;
}
// Use a char array but create an ArrayList, and don't count beforehand.
public static ArrayList<String> splitStringChList(String str, StringBuilder sb) {
    ArrayList<String> words = new ArrayList<String>();
    words.ensureCapacity(str.length() / 5);
    char[] strArray = str.toCharArray();
    for (char c : strArray) {
        if (c == ' ') {
            words.add(sb.toString());
            sb.setLength(0);
        } else {
            sb.append(c);
        }
    }
    if (sb.length() > 0) {
        words.add(sb.toString()); // trailing word
        sb.setLength(0);
    }
    return words;
}
// Using an iterator through code points and returning an ArrayList.
public static ArrayList<String> splitStringCodes(String str) {
    ArrayList<String> words = new ArrayList<String>();
    words.ensureCapacity(str.length() / 5);
    IntStream is = str.codePoints();
    PrimitiveIterator.OfInt it = is.iterator();
    int cp;
    StringBuilder sb = new StringBuilder();
    while (it.hasNext()) {
        cp = it.nextInt();
        if (cp == ' ') {
            words.add(sb.toString());
            sb.setLength(0);
        } else {
            sb.appendCodePoint(cp); // append(int) would append the numeric value, not the character
        }
    }
    words.add(sb.toString()); // trailing word
    return words;
}
// This one is for compatibility with supplementary (surrogate-pair) characters, via Character.codePointAt().
public static ArrayList<String> splitStringCharCodes(String str, StringBuilder sb) {
    char[] strArray = str.toCharArray();
    ArrayList<String> words = new ArrayList<String>();
    words.ensureCapacity(str.length() / 5);
    int len = strArray.length;
    for (int i = 0; i < len; ) {
        int cp = Character.codePointAt(strArray, i);
        i += Character.charCount(cp); // skip both chars of a surrogate pair
        if (cp == ' ') {
            words.add(sb.toString());
            sb.setLength(0);
        } else {
            sb.appendCodePoint(cp);
        }
    }
    words.add(sb.toString()); // trailing word
    sb.setLength(0);
    return words;
}

This is how I used StringTokenizer:

StringTokenizer tokenizer = new StringTokenizer(file.getCurrentString());
words = new String[tokenizer.countTokens()];
int i = 0;
while (tokenizer.hasMoreTokens()) {
    words[i] = tokenizer.nextToken();
    i++;
}
answered Mar 10, 2017 at 21:46
  • splitStringChList discards the last string. Add before return: if (sb.length() > 0) words.add(sb.toString()); Also: replace sb.delete(0, sb.length()); with sb.setLength(0); and remove the unused int i=0; Commented May 31, 2018 at 21:47
  • Also, you should just make a string from a range in the char array rather than use a StringBuilder. I don't find your implementation to be faster than split on Java 11. Commented Jan 20, 2020 at 21:11

java.util.StringTokenizer(String str, String delim) is about twice as fast according to this post.

However, unless your application is of a gigantic scale, split should be fine for you (cf. the same post, which cites thousands of strings split in a few milliseconds).
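For reference, a minimal side-by-side sketch of the two approaches on a string like the question's (note one behavioural difference: StringTokenizer silently drops empty tokens, so "a//b" would split differently):

```java
import java.util.Arrays;
import java.util.StringTokenizer;

public class TokenizerVsSplit {
    public static void main(String[] args) {
        String str = "a/b/c";

        // regex-based split (single-char fast path on newer JDKs)
        String[] viaSplit = str.split("/");

        // StringTokenizer: no regex machinery at all
        StringTokenizer st = new StringTokenizer(str, "/");
        String[] viaTokenizer = new String[st.countTokens()];
        for (int i = 0; st.hasMoreTokens(); i++) {
            viaTokenizer[i] = st.nextToken();
        }

        System.out.println(Arrays.equals(viaSplit, viaTokenizer)); // true
    }
}
```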

answered Jun 12, 2012 at 17:05
  • it doesn't take a gigantic-scale application; a split in a tight loop such as a document parser is enough, and frequent. Think about typical routines for parsing twitter links, emails, hashtags... They are fed MBs of text to parse. The routine itself can be a few dozen lines but will be called hundreds of times per second. Commented Aug 19, 2014 at 13:55

Guava has a Splitter which is more flexible than the String.split() method, and doesn't (necessarily) use a regex. OTOH, String.split() has been optimized in Java 7 to avoid the regex machinery when the separator is a single char, so the performance should be similar in Java 7.

answered Jun 12, 2012 at 17:11
  • Oh OK I'm using Java 5 (unfortunately yeah, can't change that) Commented Jun 13, 2012 at 9:04

StringTokenizer is faster than any other splitting method, but getting the tokenizer to return the delimiters along with the tokenized string improves performance by something like 50%. That is achieved by using the constructor java.util.StringTokenizer.StringTokenizer(String str, String delim, boolean returnDelims). Here are some other insights on the matter: Performance of StringTokenizer class vs. split method in Java
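A short sketch of the three-argument constructor mentioned above; with returnDelims = true, the delimiters come back as tokens of their own, so the caller has to filter them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ReturnDelims {
    public static void main(String[] args) {
        StringTokenizer st = new StringTokenizer("a/b/c", "/", true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        // delimiters are interleaved with the tokens
        System.out.println(tokens); // [a, /, b, /, c]
    }
}
```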

answered Aug 29, 2013 at 1:50

Use Apache Commons Lang 3.0's

StringUtils.splitByWholeSeparator("ab-!-cd-!-ef", "-!-") = ["ab", "cd", "ef"]

If you need a non-regex split and want the results in a String array, use StringUtils. I compared StringUtils.splitByWholeSeparator with Guava's Splitter and Java's String.split, and found StringUtils to be faster.

  1. StringUtils - 8ms
  2. String - 11ms
  3. Splitter - 1ms (but it returns an Iterable/Iterator, and converting that to a string array takes 54ms in total)
answered Sep 21, 2017 at 12:06

The String split method is probably the safer choice. As of at least Java 6 (though the API reference quoted here is for 7), the docs basically say that use of StringTokenizer is discouraged. Their wording is quoted below.

"StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."

answered Jun 12, 2012 at 17:11

You can write the split function yourself, and it will be the fastest. Here is a link that proves it; it worked for me too, and sped my code up by 6×:

StringTokenizer - reading lines with integers

  • Split: 366 ms
  • IndexOf: 50 ms
  • StringTokenizer: 89 ms
  • GuavaSplit: 109 ms
  • IndexOf2 (a super-optimised solution given in the above question): 14 ms
  • CsvMapperSplit (mapping row by row): 326 ms
  • CsvMapperSplit_DOC (building one doc and mapping all rows in one go): 177 ms
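The indexOf approach from the linked question boils down to something like the following (a sketch of the idea, not the exact code from that answer): walk the string with indexOf and cut substrings directly, with no regex and no tokenizer object.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSplit {
    // Split on a single character using indexOf/substring; no regex involved.
    static List<String> split(String s, char sep) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = s.indexOf(sep, start)) >= 0) {
            parts.add(s.substring(start, idx));
            start = idx + 1;
        }
        parts.add(s.substring(start)); // trailing segment
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("1/2/3", '/')); // [1, 2, 3]
    }
}
```

Unlike String.split with the default limit, this keeps empty trailing segments, so verify the behaviour matches what your code expects before swapping it in.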

answered Nov 4, 2016 at 0:56
