Performance tuning on a text file to object conversion

Question 1

I'm using an API which returns text in the following format:

#start
#p 09060 20131010
#p 09180 AK
#p 01001 19110212982
#end
#start
#p 09060 20131110
#p 09180 AB
#p 01001 12110212982
#end

I'm converting this to a list of objects:

var result = data.match(/#start[\s\S]+?#end/ig).map(function(v){
    var lines = v.split('\n'),
        ret = {};
    $.each(lines, function(_, v2){
        var split = v2.split(' ');
        if(split[1] && split[2])
            ret[split[1]] = split[2];
    });
    return ret;
});

My concern is that the API returns quite a lot of data, so I would like some feedback on how to improve the performance.

For instance, is there any way to reduce the mapping complexity from O(N²) to O(N)?

Also, please suggest regex improvements :)

Question 2

If you use regular expressions for parsing, then I would recommend using them for everything. Here's a solution that proceeds line by line, using capturing parentheses to see what the line contained.

function parse(data) {
    var re = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;
    var results = [], match, obj;
    while (match = re.exec(data)) {
        if (match[1]) {             // #start
            obj = {};
        } else if (match[2]) {      // #end
            results.push(obj);
            obj = null;             // ← Prevent accidental reuse if input is malformed
        } else {                    // #p something something
            obj[match[3]] = match[4];
        }
    }
    return results;
}
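For reference, a small usage sketch of the function above, run against the sample from the question. The function is repeated only so that the snippet runs on its own; the `sample` and `parsed` names are just for illustration.

```javascript
function parse(data) {
    var re = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;
    var results = [], match, obj;
    while (match = re.exec(data)) {
        if (match[1]) {             // #start
            obj = {};
        } else if (match[2]) {      // #end
            results.push(obj);
            obj = null;
        } else {                    // #p key value
            obj[match[3]] = match[4];
        }
    }
    return results;
}

// The two records from the question, joined with newlines
var sample = [
    '#start',
    '#p 09060 20131010',
    '#p 09180 AK',
    '#p 01001 19110212982',
    '#end',
    '#start',
    '#p 09060 20131110',
    '#p 09180 AB',
    '#p 01001 12110212982',
    '#end'
].join('\n');

var parsed = parse(sample);
console.log(parsed.length);          // 2
console.log(parsed[0]['09180']);     // AK
console.log(parsed[1]['01001']);     // 12110212982
```

Each record becomes a plain object keyed by the second field of its `#p` lines, which is the same shape the code in the question produces.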

Question 3

Followup question: If a line looks like #p 09180 instead of #p 09180 AK, I get #p in match[4]. I've played around a bit with the regex but I can't get it right. Could you please help me adjust it?

Question 4

Changing the regular expression to /(#start)|(#end)|#p\s+(\S+)\s*(\S*)/ig would let the third field be optionally empty.
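One caveat that seems worth checking before relying on this: `\s` also matches line breaks, so when the value is missing, `\s*` can run on to the next line and `(\S*)` can still pick up the following `#p`. Below is a hedged sketch of a variant that only allows spaces and tabs between fields. The `[ \t]` classes are my substitution, not part of the comment above, and `looseRe`, `strictRe` and `input` are illustrative names.

```javascript
var input = '#p 09180\n#p 01001 19110212982';

// Pattern from the comment above: \s* is free to consume the newline.
var looseRe = /(#start)|(#end)|#p\s+(\S+)\s*(\S*)/ig;
var loose = looseRe.exec(input);
console.log(JSON.stringify(loose[4]));   // "#p"

// Variant restricted to horizontal whitespace (space or tab).
var strictRe = /(#start)|(#end)|#p[ \t]+(\S+)[ \t]*(\S*)/ig;
var first = strictRe.exec(input);
var second = strictRe.exec(input);
console.log(JSON.stringify(first[4]));   // ""
console.log(second[3], second[4]);       // 01001 19110212982
```

With the stricter separators, a line without a value yields an empty string and the next line is still parsed as its own key and value.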

Question 5

Oh, nice, thanks. A final question which might be a bit trickier: Is there any way to allow the last group to contain spaces e.g. #p 09180 AK could also be #p 09180 AK 2 or #p 09180 AK B 2 where I would like to catch AK 2 and AK B 2. I guess you could say that anything between the third match and each line break would be the value I'm looking for.

Question 6

I know that I can do something like (\S*\s*\S*), but can it be more generic?

Question 7

Changing the regular expression to /(#start)|(#end)|#p\s+(\S+)\s*(.*)/ig would let the third field contain anything, including an empty string or a string containing spaces.
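Putting the pieces together, here is a minimal sketch of the parser with a free-form value. Two choices in it are assumptions of mine rather than part of the thread: `[ \t]` instead of `\s` between fields, so that a missing value cannot spill onto the next line, and a guard that ignores `#p` lines outside a `#start`/`#end` block. `parseRecords` is a hypothetical name.

```javascript
function parseRecords(data) {
    // Without the s flag, . never matches a line terminator in JavaScript,
    // so (.*) captures the rest of the current line only.
    var re = /(#start)|(#end)|#p[ \t]+(\S+)[ \t]*(.*)/ig;
    var results = [], match, obj = null;
    while ((match = re.exec(data)) !== null) {
        if (match[1]) {                 // #start
            obj = {};
        } else if (match[2]) {          // #end
            if (obj) results.push(obj);
            obj = null;
        } else if (obj) {               // #p key [value with spaces]
            obj[match[3]] = match[4];
        }
    }
    return results;
}

var records = parseRecords(
    '#start\n#p 09060 20131010\n#p 09180 AK B 2\n#p 01001\n#end\n'
);
console.log(records[0]['09180']);                   // AK B 2
console.log(JSON.stringify(records[0]['01001']));   // ""
```

Trailing spaces at the end of a line would end up in the value; if that matters, trimming `match[4]` before storing it is one option.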

Question 8

I dislike regexes with a passion ;) Especially because sometimes they beat solutions that ought to be faster.

I would counter-propose a solution that keeps using indexOf while tracking the current position in the data, so that you go through the data only once. I would also name the magic indexes 1 and 2, so that the reader immediately knows what they mean. Furthermore, given that your script is horizontally quite short, I would spell out your variable names; I am not a big fan of v, v2, _ and the like. Finally, if speed is important, a plain loop will generally beat forEach-style iteration.

function parseResults( data )
{
    var index = -1,
        lastIndex = 0,
        objects = [],
        object,
        line,
        parts,
        KEY = 1,    // parts[0] is the '#p' marker
        VALUE = 2;
    // ~x is 0 only when x is -1, so the loop stops once no newline is left.
    // Note: a final line without a trailing '\n' is therefore not processed.
    while( ~ (index = data.indexOf( '\n', index + 1 ) ) )
    {
        line = data.substring( lastIndex, index );
        if( line == '#start' )
            object = {};
        else if( line == '#end' )
            objects.push( object );
        else
        {
            parts = line.split(' ');
            if( parts[KEY] && parts[VALUE] )
                object[ parts[KEY] ] = parts[VALUE];
        }
        // The next line starts right after the newline
        lastIndex = index + 1;
    }
    return objects;
}

I would be most curious which one performs better on large data sets: this version or 200_success's version.
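A rough way to compare the two approaches is to time them on generated input. This is only a sketch: `makeData`, `timeIt`, the record count, the run count and the condensed stand-in parsers are all assumptions for illustration, and timings vary a lot between engines and between runs.

```javascript
// Build `count` synthetic records in the format from the question.
function makeData(count) {
    var chunks = [];
    for (var i = 0; i < count; i++) {
        chunks.push('#start', '#p 09060 20131010', '#p 09180 AK', '#p 01001 ' + i, '#end');
    }
    return chunks.join('\n') + '\n';
}

// Condensed regex-based parser (stand-in for the regex answer).
function parseWithRegex(data) {
    var re = /(#start)|(#end)|#p[ \t]+(\S+)[ \t]+(\S+)/ig;
    var results = [], match, obj;
    while ((match = re.exec(data)) !== null) {
        if (match[1]) obj = {};
        else if (match[2]) results.push(obj);
        else obj[match[3]] = match[4];
    }
    return results;
}

// Condensed indexOf-based parser (stand-in for the indexOf answer).
function parseWithIndexOf(data) {
    var results = [], obj, start = 0, end, line, parts;
    while ((end = data.indexOf('\n', start)) !== -1) {
        line = data.substring(start, end);
        if (line === '#start') obj = {};
        else if (line === '#end') results.push(obj);
        else {
            parts = line.split(' ');
            if (parts[1] && parts[2]) obj[parts[1]] = parts[2];
        }
        start = end + 1;
    }
    return results;
}

// Average wall-clock milliseconds per call over `runs` calls.
function timeIt(fn, data, runs) {
    var began = Date.now();
    for (var i = 0; i < runs; i++) fn(data);
    return (Date.now() - began) / runs;
}

var data = makeData(20000);
console.log('regex   ms/run:', timeIt(parseWithRegex, data, 10));
console.log('indexOf ms/run:', timeIt(parseWithIndexOf, data, 10));
```

Running each parser several times and averaging smooths out some noise, but real API data is still the better test.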

Question 9

Thank you for an alternative solution. Yes, regex is always my very last resort. I think this will be slower than @200's version due to the use of indexOf. Regarding index++ + 1; why not do index += 2? I only have dummy data to test with at the moment, but I'll try to remember to post some performance results here once I get some real data to play around with.

Question 10

As for ++ + 1: because I am an idiot ;) As for indexOf: because of the start position argument, I am not yet convinced that it will be slower.

Question 11

Hehe, well I'll try to remember to post the results ;) Thanks again

Question 12

Hi! Just FYI: I would say that the performance difference was negligible (± a few ms).

The accepted answer, the regex-based parse function above, was posted by 200_success on 2014-04-04 at 11:47:25Z and edited at 12:24 the same day.

The comments on the accepted answer were all posted on Apr 9, 2014: Johan at 15:53, 200_success at 16:29, Johan at 16:36, Johan at 16:44, and 200_success at 17:27.
