While fixing some old code I came up with a class that parses floats from a string.
This is used for parsing svg files in xml format. Floats can be separated by comma, space or anything at all in case the next number starts with a sign. They can also use exponential notation.
Sample:
<path d="
M245.792,247.275c0.828,1.036,0.794,2.333,0.924,3.53c0.161,1.442,0.122,2.924,0.016,4.377c-0.109,1.469-0.365,2.93-0.626,4.382
c-0.353,1.946-0.965,3.814-1.914,5.563c-0.365,0.67-0.647,1.397-1.086,2.019c-0.804,1.14-1.705,2.212-2.548,3.326
c-1.484,1.967-3.566,3.112-5.648,4.304c-1.356,0.778-2.707,1.553-4.194,1.98c-2.577,0.742-5.196,1.243-7.904,0.672
c-1.02-0.213-2.065-0.286-3.099-0.44c-2.901-0.436-4.463-2.366-5.446-4.917c-0.84-2.179-0.598-4.317,0.067-6.526
c0.899-2.974,2.905-5.095,4.931-7.278c1.564-1.688,3.005-3.498,4.384-5.339c1.188-1.581,0.951-2.972-0.513-4.263
c-1.453-0.238-1.456-0.238-2.139,1.176c-1.119,2.31-2.463,4.513-4.409,6.169c-2.983,2.536-6.378,4.271-10.45,4.197
c-3.49-0.062-6.985-0.234-10.473-0.174c-1.821,0.03-3.679,0.295-5.442,0.754c-1.948,0.508-3.091,2.061-3.743,3.915
c-0.615,1.764-0.973,3.642-1.755,5.324c-1.615,3.472-3.705,6.573-7.654,7.836c-0.857,0.273-1.614,0.861-2.471,1.143
c-2.482,0.814-5.011,0.832-7.585,0.362c-1.683-0.308-3.407-0.384-4.915-1.309c-1.85-1.138-3.589-2.405-4.922-4.163
c-0.35-0.46-0.777-0.859-1.171-1.287"/>
This is the class:
private class ParserData
{
String s;
int cur = 0;
int len = 0;
float prev_x = .0f;
float prev_y = .0f;
float prev_cx = .0f;
float prev_cy = .0f;
float start_x = .0f;
float start_y = .0f;
char prev_curve;
char next_curve = 0;
public void setData(String data)
{
s = data;
len = s.length();
cur = 0;
prev_x = 0;
prev_y = 0;
prev_cx = .0f;
prev_cy = .0f;
start_x = .0f;
start_y = .0f;
prev_curve = 0;
next_curve = 0;
}
public boolean hasNext()
{
return cur < len;
}
public char getNext()
{
return s.charAt(cur++);
}
public char peekNext()
{
return s.charAt(cur);
}
public void skipSpaces()
{
while(hasNext())
{
char c = s.charAt(cur);
switch(c)
{
case ' ':
case ',':
case '\n':
case '\t':
case '\r':
break;
default:
return;
}
cur++;
}
}
private float parseFloat()
{
boolean exp = false;
boolean last = false;
skipSpaces();
int j = cur;
if(s.charAt(j) == '-')
{
getNext();
}
while(hasNext() && !last)
{
char c = getNext();
switch(c)
{
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
case '+':
case '.':
exp = false;
break;
case '-':
if(exp)
{
exp = false;
}
else
{
cur--;
return Uti.safeParseFloat(s.substring(j, cur));
}
break;
case 'e':
case 'E':
exp = true;
break;
default:
cur--;
return Uti.safeParseFloat(s.substring(j, cur));
}
}
return Uti.safeParseFloat(s.substring(j, cur));
}
}
While hasNext
is true getNext
is used to read the type of element, then parseFloat
is called to get all the points and so on. For example, "c" is a cubic bezier curve and after it there are 3 control points (3 comma separated pairs, or 6 floats) to read.
I'm looking for some ways to speed up the process.
Here is the full code of the content handler.
-
\$\begingroup\$ Can you add a code example for how you use this class? \$\endgroup\$tim– tim2014年10月06日 19:03:18 +00:00Commented Oct 6, 2014 at 19:03
-
\$\begingroup\$ It is used in the handler of a sax parser, here is the code: pastebin.com/r7VM7NSp \$\endgroup\$キキジキ– キキジキ2014年10月06日 19:17:16 +00:00Commented Oct 6, 2014 at 19:17
2 Answers 2
At least for now, I did not look at improving performance or at the general approach, but at some more basic things I think would improve the readability of the code.
setData
vs Constructor
It seems to me that setData
takes the role of a constructor here. In that case, I would just make it a constructor instead of making it a method. Sure, this would result in the creation of a new object, but it would result in cleaner code (the ParserData
would always be a valid object).
Naming
First of all, in Java CamelCase is used instead of underscores. It's always good to follow the conventions of a language, so prev_curve
for example should be prevCurve
.
Variables should always have expressive names, but this is especially important at class level. Your names wouldn't tell me the difference between prev_x
and prev_cx
for example. s
is also not a very expressive name (I cannot think of a name that's a lot better, but even string
or inputString
would be better. For consistency, data
should then be changed to the same name).
len
could be length
, cur
currentPosition
, and c
could be currentChar
.
exp
should probably be exponent
and j
also should get a more expressive name (something like startFloatString
).
access modifiers
Your fields should be private
.
Your skipSpaces
should probably be private
as well.
The parseFloat
method is private, and not called from within the class. But it is the main parse method. This seems wrong, I would make it public.
Unused Fields
Defining fields and then not using them is not good practice. You should remove next_curve
, prev_curve
, start_y
, start_x
, prev_cy
, prev_cx
, prev_y
and prev_x
. I see that you do use these fields in a different context, but not in your parsing code. So I would create a new class for them instead of tagging them onto this one.
You should also remove last
and change
while(hasNext() && !last)
to
while(hasNext())
Use methods that you have defined
You have getNext
, so why not use it in skipSpaces
instead of getting the character yourself and then increasing cur
.
Same with peekNext
in parseFloat
.
Formating
In Java, it is customary to put the {
on the same line, not the next.
-
\$\begingroup\$ Thank you for the feedback, I updated my question with the updated code. \$\endgroup\$キキジキ– キキジキ2014年10月06日 20:24:04 +00:00Commented Oct 6, 2014 at 20:24
-
1\$\begingroup\$ @キキジキ sure, glad to help :) But I have to role back your edit (please see What you can and cannot do after receiving answers). You could wait for some more answers, and then post a follow-up question. \$\endgroup\$tim– tim2014年10月06日 20:25:54 +00:00Commented Oct 6, 2014 at 20:25
-
\$\begingroup\$ Oh sorry, got it. \$\endgroup\$キキジキ– キキジキ2014年10月06日 20:30:06 +00:00Commented Oct 6, 2014 at 20:30
Parsing character by character is going to be slow. Perhaps you could consider alternative implementations:
- Split the string with a regular expressions
- Use a regular expression to match the numbers and iterate over the matches
Here's an example:
public static List<Float> parse(String data) {
String[] parts = data.replaceAll("[^\\d.-]|(?=-)", " ").trim().split(" +");
List<Float> numbers = new ArrayList<>(parts.length);
for (String s : parts) {
numbers.add(Float.parseFloat(s));
}
return numbers;
}
For the sample you gave, this method will return a list with numbers:
245.792, 247.275, -0.828, 1.036, 0.794, 2.333, 0.924, 3.53, 0.161, 1.442, ...
For other approaches using regular expressions, see this related question on stackoverflow (I asked just for this question ;-)
-
\$\begingroup\$ I also need to get the letters that define curves inside the path, like "c" "C" "s" "S" ... For example, when a "c" comes I read the next 6 floats and generate a cubic spline, "v" I read one value and add a vertical line and so on. \$\endgroup\$キキジキ– キキジキ2014年10月08日 22:40:12 +00:00Commented Oct 8, 2014 at 22:40
-
\$\begingroup\$ It's probably not a good test but, I tried ignoring the problem above and feeding a file with only numbers to both versions. With regexp I get 2.5x worse performance. Source and data file -> pastebin.com/nV9eaaFW pastebin.com/ZUtgdvEY \$\endgroup\$キキジキ– キキジキ2014年10月08日 23:12:22 +00:00Commented Oct 8, 2014 at 23:12
Explore related questions
See similar questions with these tags.