Trying to improve a working regex

Question 1

I am extracting data from a text file. Some of the lines I want to extract data from consist of a text description with single spaces, followed by a multiple-space gap and then four data fields, each separated by multiple spaces. A field contains either the indicator "N/A" or an integer < 10,000 (possibly with a thousands comma) such as 15 or 7,151, followed by a valid percentage in parentheses. The percentage always has exactly one decimal place; for example, (0.0%) or (19.8%) or (100.0%). If the field contains "N/A", I want to write out "NA,NA"; if it contains a number and a percentage, I want to write those two values out separated by a comma.

At the moment, I use the following regex to describe a single field:

$naNumberGroup = qr/(N\/A|(([0-9]{1,3}(,[0-9]{3})*) \(([0-9]{1,2}\.[0-9])\%\)))/;

and then the following code to get the various pieces from the current line, which is $line :

if ($line =~ m/$naNumberGroup +$naNumberGroup +$naNumberGroup +$naNumberGroup/) {
    if ("N/A" eq $1) {
        print "NA,NA";
    } else {
        print "$3,$5";
    }
    if ("N/A" eq $6) {
        print ",NA,NA";
    } else {
        print ",$8,$10";
    }
    if ("N/A" eq $11) {
        print ",NA,NA";
    } else {
        print ",$13,$15";
    }
    if ("N/A" eq $16) {
        print ",NA,NA";
    } else {
        print ",$18,$20";
    }
    print "\n";
}

It seems horribly clumsy; for example, it's easy to make a mistake in counting the parentheses and getting the pairs of fields correctly referenced ... but I am unsure of even what sorts of things I should be looking at to improve it (assuming that's possible). I would appreciate some guidance or comment. Even, "it seems fine" would at least indicate that I shouldn't waste time on improving it!

An example line of text is:

Adults who actively pursue work opportunities    197 (82.8%)    30 (12.6%)    N/A    N/A

The description at the beginning of the line changes depending on the data. The output that I want for this line is:

197,82.8,30,12.6,NA,NA,NA,NA

Similarly, if the line were:

Adults who actively pursue work opportunities    197 (82.8%)    N/A    30 (12.6%)    N/A

then I want the output:

197,82.8,NA,NA,30,12.6,NA,NA

Question 2

Can you provide sample input and expected output? Also, one simple way to make your regex shorter would be to use \d instead of [0-9].

Question 3

Can't you just split the string on the /\s{2,}+(?!\()/ pattern, then work with the fields as a normal array? As for mismatched group numbers, you can use named capture groups (with the (?<name>...) notation).
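A minimal sketch of the split approach suggested here, combined with named captures. It assumes Perl 5.10 or later, that the line starts with the description (no leading whitespace), and that any field not shaped like "number (percent%)" should be reported as NA,NA. The helper name fields_via_split is hypothetical, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: split the line on runs of 2+ whitespace, drop the
# description, and turn each remaining field into a (number, percent) pair.
sub fields_via_split {
    my ($line) = @_;
    my @parts = split /\s{2,}+(?!\()/, $line;
    shift @parts;    # assumption: the first element is the free-text description
    my @out;
    for my $field (@parts) {
        if ($field =~ m{\A (?<number> [0-9,]+ ) [ ] \( (?<percent> [0-9]+\.[0-9] ) %\) \z}x) {
            my ($number, $percent) = ($+{number}, $+{percent});
            $number =~ tr/,//d;          # strip thousands commas
            push @out, $number, $percent;
        }
        else {
            push @out, 'NA', 'NA';       # "N/A"; other text is not validated in this sketch
        }
    }
    return join ',', @out;
}

print fields_via_split(
    'Adults who actively pursue work opportunities    197 (82.8%)    30 (12.6%)    N/A    N/A'
), "\n";
```

Because each field is matched on its own, the capture names stay the same for every field and there is no group counting across the whole line.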

Question 4

Using "split" is an option as other people have mentioned. However, using a regex has the added benefit of validating the input data while parsing, so is still worth considering depending on your use-case.

A regex is better used in a loop here, since we're matching the same pattern repeatedly. You should also use non-capturing parentheses for the bits you're not interested in capturing. E.g., changing nothing else, your code would look like this:

$naNumberGroup = qr/(N\/A|(?:([0-9]{1,3}(?:,[0-9]{3})*) \(([0-9]{1,2}\.[0-9])\%\)))/;
my @outFields;
while ($line =~ m/\s\s+$naNumberGroup/g) {
    if ("N/A" eq $1) {
        push @outFields, 'NA', 'NA';
    } else {
        push @outFields, $2, $3;
    }
}
print join(',', @outFields), "\n";

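It's worth noting that your code as-is preserves any thousands commas from the input (e.g. 7,151), which breaks the comma-separated output. Also, "100.0%" isn't handled, because [0-9]{1,2} allows at most two digits before the decimal point.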
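One possible way to address both points while keeping the shape of the pattern above is sketched here. It assumes that 100.0 is the only three-digit percentage that should be accepted; the sub name parse_fields and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Same shape as the pattern above, but the percentage is either 100.0 or
# one to two digits before the decimal point.
my $naNumberGroup = qr/(N\/A|(?:([0-9]{1,3}(?:,[0-9]{3})*) \((100\.0|[0-9]{1,2}\.[0-9])%\)))/;

# Hypothetical helper: collect number/percent pairs from one line.
sub parse_fields {
    my ($line) = @_;
    my @outFields;
    while ($line =~ m/\s\s+$naNumberGroup/g) {
        if ($1 eq 'N/A') {
            push @outFields, 'NA', 'NA';
        }
        else {
            (my $number = $2) =~ tr/,//d;    # drop thousands commas
            push @outFields, $number, $3;
        }
    }
    return join ',', @outFields;
}

print parse_fields('Some description    7,151 (100.0%)    N/A    15 (0.0%)    N/A'), "\n";
```

Copying $2 into a lexical before the tr is needed because the capture variables are read-only.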

If you want to improve the readability and maintainability of your regexes, here are some additional things worth changing:

Use the /x modifier to improve readability/maintainability.
Use more intermediate variables to build your regexes.
Avoid having to escape slashes by using qr{...} instead of qr/.../

E.g.

my $numberGroup = qr{
    (?<number> [0-9,]+ )                # Number with optional commas
    [ ]                                 # Single space
    \( (?<percent> [0-9]+\.[0-9] ) %\)  # Percentage in parens
}x;
my $naNumberGroup = qr{
    [ ]{2,}                             # Two or more spaces
    (?: $numberGroup | N/A )            # No need to capture "N/A"
}x;
my @outFields;
while ($line =~ m/$naNumberGroup/xg) {
    my $number  = $+{number}  // 'NA';
    my $percent = $+{percent} // 'NA';
    $number =~ tr/,//d;                 # Strip commas
    push @outFields, $number, $percent;
}
if (scalar @outFields == 8) {
    print join(',', @outFields), "\n";
} else {
    # Description line, or invalid line. You may be able to use
    # another regex to determine which.
}

Question 5

That's what longer regexes should always look like! By the way, % doesn't need backslashing.

Question 6

Thanks - good point re % - I've updated the answer.

Question 7

Many thanks for all the suggestions. I knew about the 'split' command but have never used it, so the suggestions about how to use it in this context are very educational as well as directly helpful. The suggestions about how to improve the formulation and use of the regex itself are more along the lines of my original expectation, and also very helpful. Thanks again.

Question 8

Splitting is a much better solution than a regex, as someone already mentioned.

my $line = "197 (82.8%) N/A 30 (12.6%) N/A";
my $result =
    join ",",
    map {
        tr|()%||d;
        $_ eq "N/A" ? qw(NA NA) : $_;
    }
    split /\s+/, $line;
print "$result\n";

gives

197,82.8,NA,NA,30,12.6,NA,NA
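Note that the snippet above starts from a $line that no longer contains the description, and it would leave thousands commas such as 7,151 in the output. A hedged sketch of one way to extend it to full input lines follows, assuming the description ends at the first run of two or more spaces and the line has no leading whitespace. The sub name fields_from_full_line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the description, then split and clean the fields.
sub fields_from_full_line {
    my ($line) = @_;
    # Assumption: the description ends at the first run of 2+ spaces.
    (my $data = $line) =~ s/\A.*?[ ]{2,}//;
    return join ',',
        map {
            (my $f = $_) =~ tr/()%,//d;      # also drops thousands commas
            $f eq 'N/A' ? qw(NA NA) : $f;
        }
        split ' ', $data;
}

print fields_from_full_line(
    'Adults who actively pursue work opportunities    197 (82.8%)    N/A    30 (12.6%)    N/A'
), "\n";
```

Like the original split answer, this does not validate the fields; malformed input passes through unchanged.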

Question 9

Nice solution - simplifying the problem to: delete the extraneous characters and replace N/A with NA NA. Really nice.

Simon Poole · Accepted Answer · 2013-05-18 20:34:07Z (the answer shown under Question 4 above)

