I am extracting data from a text file. Some of the lines from which I want to extract the data consist of a text description with single spaces, followed by a multiple-space gap preceding four fields containing the data, each separated by multiple spaces. A field either contains the indicator "N/A" or an integer below 10,000 (possibly with a thousands comma), such as 15 or 7,151, followed by a valid percentage in parentheses. The percentage always has exactly one digit after the decimal point; for example, (0.0%) or (19.8%) or (100.0%). If the field contains "N/A", then I want to write out "NA,NA", and if the field contains a number and percentage, then I want to write those two values out separated by a comma.
At the moment, I use the following regex to describe a single field:
$naNumberGroup = qr/(N\/A|(([0-9]{1,3}(,[0-9]{3})*) \(([0-9]{1,2}\.[0-9])\%\)))/;
and then the following code to get the various pieces from the current line, which is $line:
if( $line =~ m/$naNumberGroup +$naNumberGroup +$naNumberGroup +$naNumberGroup/ ) {
    if("N/A" eq $1) {
        print "NA,NA";
    } else {
        print "$3,$5";
    }
    if("N/A" eq $6) {
        print ",NA,NA";
    } else {
        print ",$8,$10";
    }
    if("N/A" eq $11) {
        print ",NA,NA";
    } else {
        print ",$13,$15";
    }
    if("N/A" eq $16) {
        print ",NA,NA";
    } else {
        print ",$18,$20";
    }
    print "\n";
}
It seems horribly clumsy; for example, it's easy to make a mistake in counting the parentheses and getting the pairs of fields correctly referenced ... but I am unsure of even what sorts of things I should be looking at to improve it (assuming that's possible). I would appreciate some guidance or comment. Even, "it seems fine" would at least indicate that I shouldn't waste time on improving it!
An example line of text is:
Adults who actively pursue work opportunities    197 (82.8%)    30 (12.6%)    N/A    N/A
The description at the beginning of the line changes depending on the data. The output that I want for this line is:
197,82.8,30,12.6,NA,NA,NA,NA
Similarly, if the line were:
Adults who actively pursue work opportunities    197 (82.8%)    N/A    30 (12.6%)    N/A
then I want the output:
197,82.8,NA,NA,30,12.6,NA,NA
2 Answers
Using "split" is an option as other people have mentioned. However, using a regex has the added benefit of validating the input data while parsing, so is still worth considering depending on your use-case.
A regex is better used in a loop here, since we're matching the same pattern repeatedly. You should also use non-capturing parentheses for the bits you're not interested in capturing. For example, changing nothing else, your code would look like this:
$naNumberGroup = qr/(N\/A|(?:([0-9]{1,3}(?:,[0-9]{3})*) \(([0-9]{1,2}\.[0-9])\%\)))/;
my @outFields;
while ($line =~ m/\s\s+$naNumberGroup/g) {
    if("N/A" eq $1) {
        push @outFields, 'NA', 'NA';
    } else {
        push @outFields, $2, $3;
    }
}
print join(',', @outFields),"\n";
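It's worth noting that your code as-is preserves any thousands commas in the input (for example 7,151), which breaks your comma-separated output. "100.0%" isn't handled either, because [0-9]{1,2} allows at most two digits before the decimal point.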
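As a minimal sketch of one way to address both points while keeping the loop above: the percentage is widened to `[0-9]{1,3}` and the commas are removed with `tr/,//d`. The sub name `parse_fields` and the sample line are hypothetical, and the wider pattern also lets through values such as 999.9%, so it validates slightly less strictly.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the eight output values for a data line,
# or an empty list if the line does not yield exactly four fields.
sub parse_fields {
    my ($line) = @_;

    # Same pattern as above, except that the percentage now allows up to
    # three digits before the decimal point, so "100.0%" matches.
    my $naNumberGroup = qr/(N\/A|(?:([0-9]{1,3}(?:,[0-9]{3})*) \(([0-9]{1,3}\.[0-9])%\)))/;

    my @outFields;
    while ($line =~ m/\s\s+$naNumberGroup/g) {
        if ("N/A" eq $1) {
            push @outFields, 'NA', 'NA';
        } else {
            my ($number, $percent) = ($2, $3);
            $number =~ tr/,//d;    # "7,151" becomes "7151"
            push @outFields, $number, $percent;
        }
    }
    return @outFields == 8 ? @outFields : ();
}

# Made-up sample line; the two-space gaps are an assumption about the input.
my $sample = "Some description  7,151 (100.0%)  N/A  30 (12.6%)  N/A";
print join(',', parse_fields($sample)), "\n";   # 7151,100.0,NA,NA,30,12.6,NA,NA
```

Copying `$2` and `$3` into lexicals before calling `tr` matters here, because the capture variables themselves are read-only.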
If you're wanting to improve readability and maintainability of your regexes, here are some additional things worth changing:
- Use the /x modifier to improve readability/maintainability.
- Use more intermediate variables to build your regexes.
- Avoid having to escape slashes by using qr{...} instead of qr/.../
E.g.
my $numberGroup = qr{
    (?<number> [0-9,]+ )                 # Number with optional commas
    [ ]                                  # Single space
    \( (?<percent> [0-9]+\.[0-9] ) %\)   # Percentage in parens
}x;
my $naNumberGroup = qr{
    [ ]{2,}                              # Two or more spaces
    (?: $numberGroup | N/A )             # No need to capture "N/A"
}x;
my @outFields;
while ($line =~ m/$naNumberGroup/xg) {
    my $number = $+{number} // 'NA';
    my $percent = $+{percent} // 'NA';
    $number =~ tr/,//d; # Strip commas
    push @outFields, $number, $percent;
}
if (scalar @outFields == 8) {
    print join(',', @outFields),"\n";
} else {
    # Description line, or invalid line. You may be able to use
    # another regex to determine which.
}
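For illustration, here is a sketch of the same approach wrapped in a sub and run over the two example lines from the question. The sub name `fields_to_csv` is hypothetical, and the three-space gaps in the sample strings are an assumption about the real input. Named captures and the `//` operator both need Perl 5.10 or later.

```perl
use strict;
use warnings;

my $numberGroup = qr{
    (?<number> [0-9,]+ )                 # Number with optional commas
    [ ]                                  # Single space
    \( (?<percent> [0-9]+\.[0-9] ) %\)   # Percentage in parens
}x;

my $naNumberGroup = qr{
    [ ]{2,}                              # Two or more spaces
    (?: $numberGroup | N/A )             # Either a number group or N/A
}x;

# Hypothetical wrapper: returns the CSV string for a data line,
# or undef when the line does not yield exactly four fields.
sub fields_to_csv {
    my ($line) = @_;
    my @outFields;
    while ($line =~ m/$naNumberGroup/g) {
        # When the N/A alternative matched, both named captures are undefined.
        my $number  = $+{number}  // 'NA';
        my $percent = $+{percent} // 'NA';
        $number =~ tr/,//d;              # Strip commas
        push @outFields, $number, $percent;
    }
    return @outFields == 8 ? join(',', @outFields) : undef;
}

for my $line (
    "Adults who actively pursue work opportunities   197 (82.8%)   30 (12.6%)   N/A   N/A",
    "Adults who actively pursue work opportunities   197 (82.8%)   N/A   30 (12.6%)   N/A",
) {
    my $csv = fields_to_csv($line);
    print defined $csv ? "$csv\n" : "skipped\n";
}
# Prints:
# 197,82.8,30,12.6,NA,NA,NA,NA
# 197,82.8,NA,NA,30,12.6,NA,NA
```

Returning undef for anything other than four fields leaves it to the caller to decide whether such a line is a plain description line or bad input.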
- That's what a longer regex should always look like! By the way, % doesn't need backslashing. – mpapec, May 18, 2013 at 20:49
- Thanks - good point re % - I've updated the answer. – Simon Poole, May 18, 2013 at 20:55
- Many thanks for all the suggestions. I knew about the 'split' command but have never used it, so the suggestions about how to use it in this context are very educational as well as directly helpful. The suggestions about how to improve the formulation and use of the regex itself are more along the lines of my original expectation, and also very helpful. Thanks again. – user02814, May 19, 2013 at 5:51
Splitting is a much better solution than a regex, as someone already mentioned.
my $line = "197 (82.8%) N/A 30 (12.6%) N/A";
my $result =
    join ",",
    map {
        tr|()%||d;
        $_ eq "N/A" ? qw(NA NA) : $_;
    }
    split /\s+/, $line;
print "$result\n";
gives
197,82.8,NA,NA,30,12.6,NA,NA
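The snippet above starts from a string that holds only the four fields. A minimal sketch of the same idea for a full line, assuming the description is separated from the fields by two or more spaces and the line has no leading whitespace, might look like this. The sub name `split_fields` is hypothetical, and unlike the regex approach this does not validate the fields.

```perl
use strict;
use warnings;

# Hypothetical helper: converts one full input line to the CSV output.
sub split_fields {
    my ($line) = @_;

    # Splitting on two or more whitespace characters keeps the single spaces
    # inside the description and inside "197 (82.8%)" intact.
    my ($description, @fields) = split /\s{2,}/, $line;

    return join ",", map {
        my $field = $_;          # work on a copy of the field
        $field =~ tr/()%,//d;    # "7,151 (82.8%)" becomes "7151 82.8"
        $field eq "N/A" ? qw(NA NA) : split(' ', $field);
    } @fields;
}

print split_fields("Adults who actively pursue work opportunities   197 (82.8%)   N/A   30 (12.6%)   N/A"), "\n";
# Prints: 197,82.8,NA,NA,30,12.6,NA,NA
```

Deleting the comma along with the parentheses and percent sign also takes care of numbers such as 7,151.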
- Nice solution - simplifying the problem to: delete excessive characters and replace N/A with NA NA. Really nice. – clt60, May 18, 2013 at 18:16
- [...] \d instead of [0-9].
- [...] /\s{2,}+(?!\()/ pattern, then work with fields as normal arrays? As for groups mismatch, well, you can use named capture groups (with ?<name> notation).