Parse floats from a string

Question 1

While fixing some old code I came up with a class that parses floats from a string.

This is used for parsing svg files in xml format. Floats can be separated by comma, space or anything at all in case the next number starts with a sign. They can also use exponential notation.

Sample:

<path d="
 M245.792,247.275c0.828,1.036,0.794,2.333,0.924,3.53c0.161,1.442,0.122,2.924,0.016,4.377c-0.109,1.469-0.365,2.93-0.626,4.382
 c-0.353,1.946-0.965,3.814-1.914,5.563c-0.365,0.67-0.647,1.397-1.086,2.019c-0.804,1.14-1.705,2.212-2.548,3.326
 c-1.484,1.967-3.566,3.112-5.648,4.304c-1.356,0.778-2.707,1.553-4.194,1.98c-2.577,0.742-5.196,1.243-7.904,0.672
 c-1.02-0.213-2.065-0.286-3.099-0.44c-2.901-0.436-4.463-2.366-5.446-4.917c-0.84-2.179-0.598-4.317,0.067-6.526
 c0.899-2.974,2.905-5.095,4.931-7.278c1.564-1.688,3.005-3.498,4.384-5.339c1.188-1.581,0.951-2.972-0.513-4.263
 c-1.453-0.238-1.456-0.238-2.139,1.176c-1.119,2.31-2.463,4.513-4.409,6.169c-2.983,2.536-6.378,4.271-10.45,4.197
 c-3.49-0.062-6.985-0.234-10.473-0.174c-1.821,0.03-3.679,0.295-5.442,0.754c-1.948,0.508-3.091,2.061-3.743,3.915
 c-0.615,1.764-0.973,3.642-1.755,5.324c-1.615,3.472-3.705,6.573-7.654,7.836c-0.857,0.273-1.614,0.861-2.471,1.143
 c-2.482,0.814-5.011,0.832-7.585,0.362c-1.683-0.308-3.407-0.384-4.915-1.309c-1.85-1.138-3.589-2.405-4.922-4.163
 c-0.35-0.46-0.777-0.859-1.171-1.287"/>

This is the class:

private class ParserData
{
 String s;
 int cur = 0;
 int len = 0;
 float prev_x = .0f;
 float prev_y = .0f;
 float prev_cx = .0f;
 float prev_cy = .0f;
 float start_x = .0f;
 float start_y = .0f;
 char prev_curve;
 char next_curve = 0;
 public void setData(String data)
 {
 s = data;
 len = s.length();
 cur = 0;
 prev_x = 0;
 prev_y = 0;
 prev_cx = .0f;
 prev_cy = .0f;
 start_x = .0f;
 start_y = .0f;
 prev_curve = 0;
 next_curve = 0; 
 }
 public boolean hasNext()
 {
 return cur < len;
 }
 public char getNext()
 {
 return s.charAt(cur++);
 }
 public char peekNext()
 {
 return s.charAt(cur);
 }
 public void skipSpaces()
 {
 while(hasNext())
 {
 char c = s.charAt(cur);
 switch(c)
 {
 case ' ':
 case ',':
 case '\n':
 case '\t':
 case '\r':
 break;
 default:
 return;
 }
 cur++;
 }
 }
 private float parseFloat()
 {
 boolean exp = false;
 boolean last = false;
 skipSpaces();
 int j = cur;
 if(s.charAt(j) == '-')
 {
 getNext();
 }
 while(hasNext() && !last)
 {
 char c = getNext();
 switch(c)
 {
 case '0':
 case '1':
 case '2':
 case '3':
 case '4':
 case '5':
 case '6':
 case '7':
 case '8':
 case '9':
 case '+':
 case '.':
 exp = false;
 break;
 case '-':
 if(exp)
 {
 exp = false;
 }
 else
 {
 cur--;
 return Uti.safeParseFloat(s.substring(j, cur));
 }
 break;
 case 'e':
 case 'E':
 exp = true;
 break;
 default:
 cur--;
 return Uti.safeParseFloat(s.substring(j, cur));
 }
 }
 return Uti.safeParseFloat(s.substring(j, cur));
 }
}

While hasNext is true getNext is used to read the type of element, then parseFloat is called to get all the points and so on. For example, "c" is a cubic bezier curve and after it there are 3 control points (3 comma separated pairs, or 6 floats) to read.

I'm looking for some ways to speed up the process.

Here is the full code of the content handler.

Question 2

Can you add a code example for how you use this class?

Question 3

It is used in the handler of a sax parser, here is the code: pastebin.com/r7VM7NSp

Question 4

At least for now, I did not look at improving performance or at the general approach, but at some more basic things I think would improve the readability of the code.

setData vs Constructor

It seems to me that setData takes the role of a constructor here. In that case, I would just make it a constructor instead of making it a method. Sure, this would result in the creation of a new object, but it would result in cleaner code (the ParserData would always be a valid object).

Naming

First of all, in Java CamelCase is used instead of underscores. It's always good to follow the conventions of a language, so prev_curve for example should be prevCurve.

Variables should always have expressive names, but this is especially important at class level. Your names wouldn't tell me the difference between prev_x and prev_cx for example. s is also not a very expressive name (I cannot think of a name that's a lot better, but even string or inputString would be better. For consistency, data should then be changed to the same name).

len could be length, cur currentPosition, and c could be currentChar.

exp should probably be exponent and j also should get a more expressive name (something like startFloatString).

access modifiers

Your fields should be private.

Your skipSpaces should probably be private as well.

The parseFloat method is private, and not called from within the class. But it is the main parse method. This seems wrong, I would make it public.

Unused Fields

Defining fields and then not using them is not good practice. You should remove next_curve, prev_curve, start_y, start_x, prev_cy, prev_cx, prev_y and prev_x. I see that you do use these fields in a different context, but not in your parsing code. So I would create a new class for them instead of tagging them onto this one.

You should also remove last and change

while(hasNext() && !last)

to

while(hasNext())

Use methods that you have defined

You have getNext, so why not use it in skipSpaces instead of getting the character yourself and then increasing cur.

Same with peekNext in parseFloat.

Formating

In Java, it is customary to put the { on the same line, not the next.

Question 5

Thank you for the feedback, I updated my question with the updated code.

Question 6

@キキジキ sure, glad to help :) But I have to role back your edit (please see What you can and cannot do after receiving answers). You could wait for some more answers, and then post a follow-up question.

Question 7

Oh sorry, got it.

Question 8

Parsing character by character is going to be slow. Perhaps you could consider alternative implementations:

Split the string with a regular expressions
Use a regular expression to match the numbers and iterate over the matches

Here's an example:

public static List<Float> parse(String data) {
 String[] parts = data.replaceAll("[^\\d.-]|(?=-)", " ").trim().split(" +");
 List<Float> numbers = new ArrayList<>(parts.length);
 for (String s : parts) {
 numbers.add(Float.parseFloat(s));
 }
 return numbers;
}

For the sample you gave, this method will return a list with numbers:

 245.792, 247.275, -0.828, 1.036, 0.794, 2.333, 0.924, 3.53, 0.161, 1.442, ...

For other approaches using regular expressions, see this related question on stackoverflow (I asked just for this question ;-)

Question 9

I also need to get the letters that define curves inside the path, like "c" "C" "s" "S" ... For example, when a "c" comes I read the next 6 floats and generate a cubic spline, "v" I read one value and add a vertical line and so on.

Question 10

It's probably not a good test but, I tried ignoring the problem above and feeding a file with only numbers to both versions. With regexp I get 2.5x worse performance. Source and data file -> pastebin.com/nV9eaaFW pastebin.com/ZUtgdvEY

tim tim 25.3k3 gold badges31 silver badges76 bronze badges · Answer 1 · 2014-10-06 19:38:36Z

At least for now, I did not look at improving performance or at the general approach, but at some more basic things I think would improve the readability of the code.

setData vs Constructor

It seems to me that setData takes the role of a constructor here. In that case, I would just make it a constructor instead of making it a method. Sure, this would result in the creation of a new object, but it would result in cleaner code (the ParserData would always be a valid object).

Naming

First of all, in Java CamelCase is used instead of underscores. It's always good to follow the conventions of a language, so prev_curve for example should be prevCurve.

Variables should always have expressive names, but this is especially important at class level. Your names wouldn't tell me the difference between prev_x and prev_cx for example. s is also not a very expressive name (I cannot think of a name that's a lot better, but even string or inputString would be better. For consistency, data should then be changed to the same name).

len could be length, cur currentPosition, and c could be currentChar.

exp should probably be exponent and j also should get a more expressive name (something like startFloatString).

access modifiers

Your fields should be private.

Your skipSpaces should probably be private as well.

The parseFloat method is private, and not called from within the class. But it is the main parse method. This seems wrong, I would make it public.

Unused Fields

Defining fields and then not using them is not good practice. You should remove next_curve, prev_curve, start_y, start_x, prev_cy, prev_cx, prev_y and prev_x. I see that you do use these fields in a different context, but not in your parsing code. So I would create a new class for them instead of tagging them onto this one.

You should also remove last and change

while(hasNext() && !last)

to

while(hasNext())

Use methods that you have defined

You have getNext, so why not use it in skipSpaces instead of getting the character yourself and then increasing cur.

Same with peekNext in parseFloat.

Formating

In Java, it is customary to put the { on the same line, not the next.

Thank you for the feedback, I updated my question with the updated code.
@キキジキ sure, glad to help :) But I have to role back your edit (please see What you can and cannot do after receiving answers). You could wait for some more answers, and then post a follow-up question.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Answer 2 · 2014-10-07 16:32:28Z

Parsing character by character is going to be slow. Perhaps you could consider alternative implementations:

Split the string with a regular expressions
Use a regular expression to match the numbers and iterate over the matches

Here's an example:

public static List<Float> parse(String data) {
 String[] parts = data.replaceAll("[^\\d.-]|(?=-)", " ").trim().split(" +");
 List<Float> numbers = new ArrayList<>(parts.length);
 for (String s : parts) {
 numbers.add(Float.parseFloat(s));
 }
 return numbers;
}

For the sample you gave, this method will return a list with numbers:

 245.792, 247.275, -0.828, 1.036, 0.794, 2.333, 0.924, 3.53, 0.161, 1.442, ...

For other approaches using regular expressions, see this related question on stackoverflow (I asked just for this question ;-)

I also need to get the letters that define curves inside the path, like "c" "C" "s" "S" ... For example, when a "c" comes I read the next 6 floats and generate a cubic spline, "v" I read one value and add a vertical line and so on.
It's probably not a good test but, I tried ignoring the problem above and feeding a file with only numbers to both versions. With regexp I get 2.5x worse performance. Source and data file -> pastebin.com/nV9eaaFW pastebin.com/ZUtgdvEY

Stack Exchange Network

Parse floats from a string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parse floats from a string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions