6
\$\begingroup\$

I'm using an API which returns text in the following format:

#start
#p 09060 20131010
#p 09180 AK
#p 01001 19110212982
#end
#start
#p 09060 20131110
#p 09180 AB
#p 01001 12110212982
#end

I'm converting this to a list of objects:

var result = data.match(/#start[\s\S]+?#end/ig).map(function(v){
    var lines = v.split('\n'),
        ret = {};
    $.each(lines, function(_, v2){
        var split = v2.split(' ');
        if(split[1] && split[2])
            ret[split[1]] = split[2];
    });
    return ret;
});

My concern is that the API returns quite a lot of data, so I would like some feedback on how to improve the performance.

For instance, is there any way to reduce the mapping complexity from O(N²) to O(N)?

Also, please suggest regex improvements :)

Quill
asked Apr 4, 2014 at 10:32
\$\endgroup\$

2 Answers

3
\$\begingroup\$

If you use regular expressions for parsing, then I would recommend using them for everything. Here's a solution that proceeds line by line, using capturing parentheses to see what the line contained.

function parse(data) {
    var re = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;
    var results = [], match, obj;
    while (match = re.exec(data)) {
        if (match[1]) {                 // #start
            obj = {};
        } else if (match[2]) {          // #end
            results.push(obj);
            obj = null;                 // Prevent accidental reuse if input is malformed
        } else {                        // #p something something
            obj[match[3]] = match[4];
        }
    }
    return results;
}
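
One caveat about the loop above: if a #p line appears before any #start (or after an #end), obj is undefined or null and the assignment throws a TypeError, and a stray #end pushes an empty entry into the results. Here is a minimal sketch of a more forgiving variant, assuming that silently skipping such lines is acceptable for your data. parseTolerant is a hypothetical name, and the sample input is made up for illustration:

```javascript
// Sketch only: parseTolerant is a hypothetical name, not part of the API or the answer above.
function parseTolerant(data) {
    var re = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;
    var results = [], match, obj = null;
    while ((match = re.exec(data))) {
        if (match[1]) {                 // #start: open a new record
            obj = {};
        } else if (match[2]) {          // #end: close the record, if one is open
            if (obj) {
                results.push(obj);
            }
            obj = null;
        } else if (obj) {               // #p key value, only inside a record
            obj[match[3]] = match[4];
        }
        // Assumption: a #p line outside #start/#end is skipped instead of throwing.
    }
    return results;
}

// Made-up input with an orphan #p line and a stray #end.
var sample = [
    '#p 00000 orphan',
    '#start',
    '#p 09060 20131010',
    '#p 09180 AK',
    '#end',
    '#end'
].join('\n');

console.log(parseTolerant(sample));
```

Whether to skip malformed lines or fail loudly depends on how much you trust the API; throwing a descriptive error in those branches would be the stricter choice.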
answered Apr 4, 2014 at 11:47
\$\endgroup\$
5
  • \$\begingroup\$ Followup question: If a line looks like #p 09180 instead of #p 09180 AK, I get #p in match[4]. I've played around a bit with the regex but I can't get it right. Could you please help me adjust it? \$\endgroup\$ Commented Apr 9, 2014 at 15:53
  • 1
    \$\begingroup\$ Changing the regular expression to /(#start)|(#end)|#p\s+(\S+)\s*(\S*)/ig would let the third field be optionally empty. \$\endgroup\$ Commented Apr 9, 2014 at 16:29
  • \$\begingroup\$ Oh, nice, thanks. A final question which might be a bit trickier: Is there any way to allow the last group to contain spaces e.g. #p 09180 AK could also be #p 09180 AK 2 or #p 09180 AK B 2 where I would like to catch AK 2 and AK B 2. I guess you could say that anything between the third match and each line break would be the value I'm looking for. \$\endgroup\$ Commented Apr 9, 2014 at 16:36
  • \$\begingroup\$ I know that I can do something like (\S*\s*\S*), but can it be more generic? \$\endgroup\$ Commented Apr 9, 2014 at 16:44
  • 1
    \$\begingroup\$ Changing the regular expression to /(#start)|(#end)|#p\s+(\S+)\s*(.*)/ig would let the third field contain anything, including an empty string or a string containing spaces. \$\endgroup\$ Commented Apr 9, 2014 at 17:27
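
A caveat on the patterns suggested in the comments above: \s also matches newlines, so when a #p line has no value, the whitespace match can run on into the next line and the last group captures text from it (which is how #p ended up in match[4]). Below is a hedged sketch that restricts the separators to spaces and tabs so that a match cannot cross a line break. parseLines is a hypothetical name, and trimming trailing whitespace from the value is an assumption about what is wanted:

```javascript
// Sketch only: parseLines is a hypothetical name; [ \t] is swapped in for \s.
function parseLines(data) {
    // [ \t] matches only spaces and tabs, and . never matches a line break,
    // so the value group always stops at the end of the current line.
    var re = /(#start)|(#end)|#p[ \t]+(\S+)[ \t]*(.*)/ig;
    var results = [], match, obj = null;
    while ((match = re.exec(data))) {
        if (match[1]) {                 // #start
            obj = {};
        } else if (match[2]) {          // #end
            if (obj) {
                results.push(obj);
            }
            obj = null;
        } else if (obj) {               // #p key [value with spaces]
            obj[match[3]] = match[4].trim();   // assumption: trailing blanks are noise
        }
    }
    return results;
}

// Made-up input: one key without a value, one value containing spaces.
var sample = '#start\n#p 09060\n#p 09180 AK B 2\n#end';
console.log(parseLines(sample));
```

With this pattern a missing value comes back as an empty string rather than as the start of the following line.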
3
\$\begingroup\$

I dislike regexes with a passion ;) Especially because sometimes they beat solutions that ought to be faster.

I would counter-propose a solution that keeps calling indexOf while tracking where you are in the data. This way you only go through the data once. I would also give names to the array indexes you use, so that the reader instinctively knows what you are doing. Furthermore, given that your script is horizontally quite short, I would spell out your variable names; I am not a big fan of v, v2, _, etc. Finally, if speed is important, a good old loop will generally beat forEach-style iteration such as $.each, because it avoids a function call per element.

function parseResults( data )
{
    var index = -1,
        lastIndex = 0,
        objects = [],
        object,
        line,
        parts,
        KEY = 1,    //parts[0] is the '#p' marker
        VALUE = 2;
    //make sure the last line is terminated, otherwise a final '#end' is never seen
    if( data.charAt( data.length - 1 ) != '\n' )
        data += '\n';
    //~ is a short circuit for comparing to -1
    while( ~ (index = data.indexOf( '\n', lastIndex ) ) )
    {
        line = data.substring( lastIndex , index );
        if( line == '#start' )
            object = {};
        else if( line == '#end' )
            objects.push( object );
        else
        {
            parts = line.split(' ');
            if( parts[KEY] && parts[VALUE] )
                object[ parts[KEY] ] = parts[VALUE];
        }
        //resume the search just past the newline that was found
        lastIndex = index + 1;
    }
    return objects;
}

I would be most curious to know which performs better on large sets of data: this version or 200_success's version.
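
For anyone who wants to measure this, here is a rough timing sketch. makeSampleData, timeParser and countRecords are hypothetical helpers, the synthetic records only mimic the format in the question, and Date.now() has millisecond resolution, so treat the numbers as a rough indication rather than a proper benchmark:

```javascript
// Sketch only: all three helpers below are hypothetical and exist just for this comparison.

// Build recordCount synthetic records in the format shown in the question.
function makeSampleData(recordCount) {
    var lines = [];
    for (var i = 0; i < recordCount; i++) {
        lines.push('#start', '#p 09060 20131010', '#p 09180 AK', '#p 01001 ' + i, '#end');
    }
    return lines.join('\n') + '\n';
}

// Run parser(data) several times and return the average milliseconds per run.
function timeParser(parser, data, runs) {
    var started = Date.now();
    for (var i = 0; i < runs; i++) {
        parser(data);
    }
    return (Date.now() - started) / runs;
}

// Stand-in parser so the sketch runs by itself; it only counts records.
function countRecords(data) {
    return data.split('#start').length - 1;
}

var data = makeSampleData(10000);
console.log('avg ms per run:', timeParser(countRecords, data, 20));
```

Pass parseResults or the regex-based parse in place of countRecords to compare them on the same data, and run each a few times first so the engine has a chance to optimise both.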

answered Apr 4, 2014 at 14:44
\$\endgroup\$
4
  • \$\begingroup\$ Thank you for an alternative solution. Yes, regex is always my very last resort. I think this will be slower than @200's version due to the use of indexOf. Regarding index++ + 1; why not do index += 2? I only have dummy data to test with at the moment, but I'll try to remember to post some performance results here once I get some real data to play around with. \$\endgroup\$ Commented Apr 4, 2014 at 15:03
  • \$\begingroup\$ for ++ +1, because I am an idiot ;) Also, for indexOf because of the startPosition, I am not convinced yet that it will be slower \$\endgroup\$ Commented Apr 4, 2014 at 15:50
  • \$\begingroup\$ Hehe, well I'll try to remember to post the results ;) Thanks again \$\endgroup\$ Commented Apr 4, 2014 at 15:52
  • 1
    \$\begingroup\$ Hi! Just FYI: I would say that the performance difference was negligible (± a few ms). \$\endgroup\$ Commented Apr 9, 2014 at 18:16
