Parsing function is 50 lines long

Question 1

This is a parsing function that will add tildes (~) to the end of search terms in certain circumstances.

Example inputs and outputs:

Input:                       Output:
name:(john doe)              name:(john~ doe~)
name:[andy TO charlie]       name:[andy TO charlie]
john doe                     john~ doe~
james NOT jane               james~ NOT jane
james NOT (james smith)      james~ NOT (james smith)
james NOT jane smith         james~ NOT jane smith~
name:"john doe" australia    name:"john doe" australia~

function addTilde(string) {
    if (!/[\[\[\]~"(NOT)\-\!\d\(\)(OR)(AND)\&\|\: ]/.test(string)) {
        string = string.concat("~");
    }
    return string;
};
function fuzzQuery(rawQuery) {
    /*split the string into spaces, brackets, double quotes and words*/
    re = /(?=[()\[\] "])|(?=[^\W])\b/;
    strSplit = rawQuery.split(re);
    newQuery = "";
    for (var i = 0; i < strSplit.length; i++) {
        var s = strSplit[i];
        var newElement = "";
        /*if it contains a [ or "*/
        if (s.indexOf("\x22") != -1 || s.indexOf("[") != -1) {
            /*determine closing symbol*/
            var closingSymbol;
            if (s == "\x22") {
                closingSymbol = "\x22";
                newElement = newElement.concat(strSplit[i++]); /*need to skip opening one for double quotes*/
            } else closingSymbol = "]";
            /*concat elements together until closing element found)*/
            do {
                newElement = newElement.concat(strSplit[i]);
            }
            while (strSplit[i++] != closingSymbol)
        }
        /*if it contains a NOT*/
        else if (s.indexOf("NOT") != -1) {
            newElement = strSplit[i++]; /*concat the NOT*/
            /*concat any spaces*/
            while (strSplit[i] == " ") {
                newElement = newElement.concat(strSplit[i++]);
            }
            if (strSplit[i] == "(") {
                do {
                    newElement = newElement.concat(strSplit[i]);
                }
                while (strSplit[i++] != ")")
            } else newElement = newElement.concat(strSplit[i++]);
        } else(newElement = strSplit[i]);
        newElement = addTilde(newElement);
        newQuery = newQuery.concat(newElement);
    }
    return newQuery;
};

Now fuzzQuery is quite a long method. It essentially has five parts:

1. Split the initial query out into elements.
2. Loop through each element:
   a) concatenate square-bracket and double-quote groups; else
   b) concatenate NOTs.
3. Add a tilde to the element if appropriate.
4. Join the elements back together and return the new query.

What I was thinking is that you could pass off steps two and three to their own methods, so that the whole query looks something like (but not exactly like!):

function fuzzQuery(rawQuery)
{
    strSplit = splitQuery(rawQuery);
    concatSqrAndDblQuotes(strSplit);
    concatNots(strSplit);
    return putBackTogether(strSplit);
}

i.e.

function doSquareAndDblQuotes(strSplit, i) {
    if (s.indexOf("\x22") != -1 || s.indexOf("[") != -1) {
        /*determine closing symbol*/
        var closingSymbol;
        if (s == "\x22") {
            closingSymbol = "\x22";
            newElement = newElement.concat(strSplit[i++]); /*need to skip opening one for double quotes*/
        } else closingSymbol = "]";
        /*concat elements together until closing element found)*/
        do {
            newElement = newElement.concat(strSplit[i]);
        }
        while (strSplit[i++] != closingSymbol)
    }
    return newElement;
}

But the problem here is that we'd need to keep track of a few variables that are changed inside this function, i.e. the i counter and whether or not that if statement was executed. So you could start using globals (is that even a thing in JavaScript?)... and it gets messy.

So possibly another way would be to create an object that you pass in and return, which keeps track of these variables.
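
A minimal sketch of that idea, under a few assumptions: the helper names takeThrough and readQuotedOrRange are invented here, the tokens are taken to be already split into words, spaces, quotes and brackets, and each helper returns a small object holding the text it consumed plus the index to continue from, so the caller never has to share the i counter.

```javascript
// Hypothetical helper: starting at tokens[start], glue tokens together up to
// and including the first token equal to `closer` (or the end of the array).
function takeThrough(tokens, start, closer) {
    var text = "";
    var i = start;
    while (i < tokens.length) {
        var token = tokens[i];
        text += token;
        i += 1;
        if (token === closer) {
            break;
        }
    }
    return { text: text, next: i };
}

// Hypothetical stand-in for doSquareAndDblQuotes: returns null when the token
// at index i does not open a quoted string or a [range], so the caller can
// tell whether the branch was taken without any shared flag.
function readQuotedOrRange(tokens, i) {
    var token = tokens[i];
    if (token === "\x22") {
        // Step past the opening quote before searching for the closing one.
        var rest = takeThrough(tokens, i + 1, "\x22");
        return { text: token + rest.text, next: rest.next };
    }
    if (token.indexOf("[") !== -1) {
        return takeThrough(tokens, i, "]");
    }
    return null;
}
```

The main loop could then call readQuotedOrRange(strSplit, i), and when the result is not null, append result.text and set i to result.next.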

What do you think?

Question 2

What would be an example of rawQuery? Can you provide input and output? I don't fully understand what you're doing here...

Question 3

What's newString? You should also probably deal with the following global variables: re, strSplit, newQuery. Also, using String.prototype.concat is a poor convention (use str1 += str2).

Question 4

@elclanrs Have updated that now.

Question 5

@dwjohnston what is newString?

Question 6

@megawac Sorry, that's newQuery. I refactored it when I pasted it in and missed that one... :S

200_success · Answer 1 · 2014-01-21 19:43:11Z

You aren't using regular expressions to your advantage. Capture, don't split. Capturing helps you analyze the tokens you are interested in. Splitting just gets you the location of the delimiters.

function fuzzQuery(rawQuery) {
    "use strict";
    // Capture groups: (1) NOT, (2) qualifier, (3) quoted string,
    // (4) parenthesised group, (5) bracketed range, (6) bare word
    var re = /\s*(?:(NOT)\s+)?([a-z]+:)?(?:("[^"]*")|(\([^)]*\))|(\[[^\]]*\])|([a-z]+))\s*/g;
    var matches;
    // Start at 0 so that an empty query is not reported as junk.
    var lastIndex = 0;
    while (matches = re.exec(rawQuery)) {
        var relOp = matches[1],
            qualifier = matches[2],
            quotedStr = matches[3],
            parensStr = matches[4],
            bracketStr = matches[5],
            bareWord = matches[6];
        lastIndex = re.lastIndex;
        console.log("relOp=" + relOp +
                    ", qualifier=" + qualifier +
                    ", quotedStr=" + quotedStr +
                    ", parensStr=" + parensStr +
                    ", bracketStr=" + bracketStr +
                    ", bareWord=" + bareWord);
    }
    // Anything left after the last match could not be tokenized.
    if (lastIndex != rawQuery.length) {
        console.log("Junk=" + rawQuery.substring(lastIndex));
    }
}

Examples:

name:(john doe)

relOp=undefined, qualifier=name:, quotedStr=undefined, parensStr=(john doe), bracketStr=undefined, bareWord=undefined

name:[andy TO charlie]

relOp=undefined, qualifier=name:, quotedStr=undefined, parensStr=undefined, bracketStr=[andy TO charlie], bareWord=undefined

john doe

relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=john
relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=doe

james NOT jane

relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=james
relOp=NOT, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=jane

james NOT (james smith)

relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=james
relOp=NOT, qualifier=undefined, quotedStr=undefined, parensStr=(james smith), bracketStr=undefined, bareWord=undefined

james NOT jane smith

relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=james
relOp=NOT, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=jane
relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=smith

name:"john doe" australia

relOp=undefined, qualifier=name:, quotedStr="john doe", parensStr=undefined, bracketStr=undefined, bareWord=undefined
relOp=undefined, qualifier=undefined, quotedStr=undefined, parensStr=undefined, bracketStr=undefined, bareWord=australia
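
One way to carry the capturing approach through to the tilde output is to hand a similar pattern to String.prototype.replace with a callback. The sketch below is an illustration under stated assumptions rather than part of the answer's code: it drops the surrounding \s* so that whitespace between matches is left untouched, it assumes lowercase search terms only (as [a-z]+ does above), and it assumes that negated terms, quoted phrases and bracketed ranges are never fuzzed, which is what the example table in the question suggests.

```javascript
function fuzzQuery(rawQuery) {
    "use strict";
    // Same six capture groups as the logging version, minus the outer \s*.
    var re = /(?:(NOT)\s+)?([a-z]+:)?(?:("[^"]*")|(\([^)]*\))|(\[[^\]]*\])|([a-z]+))/g;
    return rawQuery.replace(re, function (match, relOp, qualifier, quotedStr,
                                          parensStr, bracketStr, bareWord) {
        // Assumption: anything after NOT, any "phrase" and any [range]
        // is passed through unchanged.
        if (relOp || quotedStr || bracketStr) {
            return match;
        }
        var prefix = qualifier || "";
        if (parensStr) {
            // Fuzz the terms inside the parentheses by recursing on the contents.
            return prefix + "(" + fuzzQuery(parensStr.slice(1, -1)) + ")";
        }
        return prefix + bareWord + "~";
    });
}
```

Because the pattern cannot match nested parentheses, a query such as a group within a group would need a real parser instead.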

dwjohnston · Answer 2 · 2025-04-19 02:39:28Z

I would have asked this question right at the start of my career.

It just happened to come up as a notification, so here are my thoughts as someone with 12 years' more experience.

For this kind of regex/pattern-matching problem in particular, unless you're writing this kind of code day in and day out (e.g. your job is analysing logs or other unstructured data), the code is always going to look ugly, and it's going to take some effort to understand what each line is doing. And the reality is that once the code is written, unless it has a bug or the requirements change, you're not going to need to read it again.

200_success's answer has good advice on using regexes; take that to heart.

But my key bit of advice is that the way to make this code easy to use is to basically do what you've done:

Have a clear simple interface. String in, string out.
Have tests. You've got the example transformations, they make for good test cases. As the developer, I find reading the tests is how you can understand what a piece of code does. That's a sign of a good test.
Documentation. The example transformations would also make for good documentation for the function. It would be good to include the rationale/explanation for the scenarios where fuzzing is included and where it is not.
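
The example transformations from the question translate almost directly into a table-driven check. This is only a minimal sketch: EXAMPLES and checkExamples are names invented here, and the function under test is passed in as an argument so the check does not depend on any particular implementation.

```javascript
// Example transformations from the question, as [input, expected] pairs.
var EXAMPLES = [
    ["name:(john doe)", "name:(john~ doe~)"],
    ["name:[andy TO charlie]", "name:[andy TO charlie]"],
    ["john doe", "john~ doe~"],
    ["james NOT jane", "james~ NOT jane"],
    ["james NOT (james smith)", "james~ NOT (james smith)"],
    ["james NOT jane smith", "james~ NOT jane smith~"],
    ["name:\"john doe\" australia", "name:\"john doe\" australia~"]
];

// Runs `fuzz` over every example and returns a list of readable failure
// messages; an empty list means every example passed.
function checkExamples(fuzz) {
    var failures = [];
    EXAMPLES.forEach(function (pair) {
        var input = pair[0];
        var expected = pair[1];
        var actual = fuzz(input);
        if (actual !== expected) {
            failures.push(JSON.stringify(input) + ": expected " +
                JSON.stringify(expected) + ", got " + JSON.stringify(actual));
        }
    });
    return failures;
}
```

Calling checkExamples(fuzzQuery) after each change and printing the returned messages gives a quick regression check, and the EXAMPLES table doubles as documentation of the intended behaviour.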
