2
\$\begingroup\$

I am extracting data from a text file. Some of the lines I want to extract data from consist of a text description with single spaces, followed by a multiple-space gap and then four data fields, each separated by multiple spaces. A field contains either the indicator "N/A" or an integer < 10,000 (possibly with a comma) such as 15 or 7,151, followed by a valid percentage in parentheses. The percentage always has a single decimal place; for example, (0.0%) or (19.8%) or (100.0%). If the field contains "N/A", I want to write out "NA,NA"; if it contains a number and a percentage, I want to write those two values out separated by a comma.

At the moment, I use the following regex to describe a single field:

$naNumberGroup = qr/(N\/A|(([0-9]{1,3}(,[0-9]{3})*) \(([0-9]{1,2}\.[0-9])\%\)))/

and then the following code to get the various pieces from the current line, which is $line :

if( $line =~ m/$naNumberGroup +$naNumberGroup +$naNumberGroup +$naNumberGroup/ ) {
    if("N/A" eq $1) {
        print "NA,NA";
    } else {
        print "$3,$5";
    }
    if("N/A" eq $6) {
        print ",NA,NA";
    } else {
        print ",$8,$10";
    }
    if("N/A" eq $11) {
        print ",NA,NA";
    } else {
        print ",$13,$15";
    }
    if("N/A" eq $16) {
        print ",NA,NA";
    } else {
        print ",$18,$20";
    }
    print "\n";
}

It seems horribly clumsy; for example, it's easy to make a mistake in counting the parentheses and getting the pairs of fields correctly referenced ... but I am unsure of even what sorts of things I should be looking at to improve it (assuming that's possible). I would appreciate some guidance or comment. Even, "it seems fine" would at least indicate that I shouldn't waste time on improving it!

An example line of text is:

Adults who actively pursue work opportunities   197 (82.8%)   30 (12.6%)   N/A   N/A

The description at the beginning of the line changes depending on the data. The output that I want for this line is:

197,82.8,30,12.6,NA,NA,NA,NA

Similarly, if the line were:

Adults who actively pursue work opportunities   197 (82.8%)   N/A   30 (12.6%)   N/A

then I want the output:

197,82.8,NA,NA,30,12.6,NA,NA
asked May 18, 2013 at 11:10
\$\endgroup\$
  • Can you provide sample input and expected output? Also, a simple improvement that would make your regex shorter is to use \d instead of [0-9]. Commented May 18, 2013 at 11:14
  • Can't you just split the string on the pattern /\s{2,}+(?!\()/ and then work with the fields as a normal array? As for the group mismatch, you can use named capture groups (with the ?<name> notation). Commented May 18, 2013 at 11:18
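
The split and named-capture suggestions in the comments above could be combined roughly as follows. This is only a sketch under assumptions: the helper name fields_via_split is hypothetical, the sample line uses made-up spacing, and it assumes that every data line has exactly one description followed by four fields.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers and named captures need Perl 5.10+

# Hypothetical helper: returns the CSV row for a data line, or undef
# when the line does not look like a data line.
sub fields_via_split {
    my ($line) = @_;
    $line =~ s/\s+\z//;    # drop the trailing newline and spaces

    # Split on runs of two or more whitespace characters that are not
    # followed by "(" (the pattern suggested in the comment).
    my @parts = split /\s{2,}+(?!\()/, $line;
    return if @parts < 5;    # need a description plus four fields

    my @out;
    for my $field (@parts[-4 .. -1]) {
        if ($field eq 'N/A') {
            push @out, 'NA', 'NA';
        }
        elsif ($field =~ /\A(?<number>[0-9,]+) \((?<percent>[0-9]+\.[0-9])%\)\z/) {
            (my $number = $+{number}) =~ tr/,//d;    # strip thousands separators
            push @out, $number, $+{percent};
        }
        else {
            return;    # unexpected field, so reject the whole line
        }
    }
    return join ',', @out;
}

my $sample = 'Adults who actively pursue work opportunities     197 (82.8%)   30 (12.6%)   N/A   N/A';
print fields_via_split($sample), "\n";
```

With named captures there are no group numbers to count, and the per-field regex still validates each field after the split.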

2 Answers

4
\$\begingroup\$

Using "split" is an option as other people have mentioned. However, using a regex has the added benefit of validating the input data while parsing, so is still worth considering depending on your use-case.

A regex is better used in a loop here, since we're matching the same pattern repeatedly. You should also use non-capturing parentheses for the parts you're not interested in capturing. E.g., changing nothing else, your code would look like this:

$naNumberGroup = qr/(N\/A|(?:([0-9]{1,3}(?:,[0-9]{3})*) \(([0-9]{1,2}\.[0-9])\%\)))/;
my @outFields;
while ($line =~ m/\s\s+$naNumberGroup/g) {
    if("N/A" eq $1) {
        push @outFields, 'NA', 'NA';
    } else {
        push @outFields, $2, $3;
    }
}
print join(',', @outFields),"\n";

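It's worth noting that your code as-is would copy any thousands separator from the input straight into the output, so a count such as 7,151 would be split across two CSV columns. Also, "100.0%" isn't handled, because [0-9]{1,2} allows at most two digits before the decimal point.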
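
One possible way to address both points for a single populated field is sketched below. The helper name count_and_percent and the exact pattern are assumptions for illustration, not part of the original code. The pattern still does not range-check the percentage beyond its digit count, so "100.9%" would be accepted.

```perl
use strict;
use warnings;

# Assumed pattern for one populated field: a count with optional
# thousands separators, one space, then a percentage in parentheses.
# The alternation (?:100|[0-9]{1,2}) lets "100.0%" through.
my $field = qr/([0-9]{1,3}(?:,[0-9]{3})*) \(((?:100|[0-9]{1,2})\.[0-9])%\)/;

# Hypothetical helper: returns (count, percent), or nothing on a mismatch.
sub count_and_percent {
    my ($text) = @_;
    return unless $text =~ /\A$field\z/;
    my ($count, $percent) = ($1, $2);
    $count =~ tr/,//d;    # "7,151" becomes "7151" so it stays one CSV value
    return ($count, $percent);
}

print join(',', count_and_percent('7,151 (100.0%)')), "\n";    # prints 7151,100.0
```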

If you're wanting to improve readability and maintainability of your regexes, here are some additional things worth changing:

  1. Use the /x modifier to improve readability/maintainability.
  2. Use more intermediate variables to build your regexes.
  3. Avoid having to escape slashes by using qr{...} instead of qr/.../

E.g.

# Named captures and the // operator need Perl 5.10 or later
my $numberGroup = qr{
    (?<number> [0-9,]+ )                  # Number with optional commas
    [ ]                                   # Single space
    \( (?<percent> [0-9]+\.[0-9] ) %\)    # Percentage in parens
}x;
my $naNumberGroup = qr{
    [ ]{2,}                               # Two or more spaces
    (?: $numberGroup | N/A )              # No need to capture "N/A"
}x;
my @outFields;
while ($line =~ m/$naNumberGroup/xg) {
    my $number  = $+{number}  // 'NA';
    my $percent = $+{percent} // 'NA';
    $number =~ tr/,//d;                   # Strip commas
    push @outFields, $number, $percent;
}
if (scalar @outFields == 8) {
    print join(',', @outFields),"\n";
} else {
    # Description line, or invalid line. You may be able to use
    # another regex to determine which.
}
answered May 18, 2013 at 20:34
\$\endgroup\$
  • That's what a longer regex should always look like! By the way, % doesn't need backslashing. Commented May 18, 2013 at 20:49
  • Thanks - good point re % - I've updated the answer. Commented May 18, 2013 at 20:55
  • Many thanks for all the suggestions. I knew about the 'split' command but have never used it, so the suggestions about how to use it in this context are very educational as well as directly helpful. The suggestions about how to improve the formulation and use of the regex itself are more along the lines of my original expectation, and also very helpful. Thanks again. Commented May 19, 2013 at 5:51
3
\$\begingroup\$

Splitting is a much better solution than a regex, as someone already mentioned.

my $line = "197 (82.8%) N/A 30 (12.6%) N/A";
my $result =
    join ",",
    map {
        tr|()%||d;
        $_ eq "N/A" ? qw(NA NA) : $_;
    }
    split /\s+/, $line;
print "$result\n";

gives

197,82.8,NA,NA,30,12.6,NA,NA
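
The sample above contains only the data fields. A hedged variation for full lines is sketched below. It assumes that the description ends at the first gap of two or more spaces and that counts may contain thousands separators, and the helper name csv_from_data_fields is hypothetical. Like the code above, it does not validate the fields.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the CSV row, or nothing when the line
# has no multi-space gap (for example a pure description line).
sub csv_from_data_fields {
    my ($line) = @_;

    # Assumption: the description is everything before the first run
    # of two or more whitespace characters.
    my (undef, $data) = split /\s{2,}/, $line, 2;
    return unless defined $data;

    return join ',',
        map {
            tr/(),%//d;                      # also delete thousands separators
            $_ eq 'N/A' ? qw(NA NA) : $_;
        }
        split ' ', $data;                    # split on any run of whitespace
}

print csv_from_data_fields(
    'Adults who actively pursue work opportunities   7,151 (82.8%)   N/A   30 (12.6%)   N/A'
), "\n";
```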
answered May 18, 2013 at 12:17
\$\endgroup\$
  • Nice solution - simplifying the problem to: delete the extraneous characters and replace N/A with NA NA. Really nice. Commented May 18, 2013 at 18:16
