regex matching and substring extraction

Question 1

I have files coming in that were created manually by many different people. The formatting, although it follows a certain rule, is not uniform.

Think of these three lines:

"erroneous_data_F08R16_recordeded_by_tech21"
"erroneous_data_F8R16_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"

They all point to the same thing: F008, F08 or F8 means file number 8, and R16 or R016 (or R[single-digit] where possible) means row number 16.

There can be any number of these lines in a given file, which will be scanned using a while read line loop.

What I want to do is make the file and row number section uniform, such as F008R016 for the three example lines above. My file numbers are no more than 3 digits (the counter rolls over after reaching 999), and the number of rows is never more than a handful in each file, but for the sake of consistency, let's say it is 3 digits as well. The file I need to process also contains unstructured comments. So the first order of business is detecting these lines and separating them into a different temporary file, then making them uniform.

In order to accomplish this, my plan is to echo the line and grep for regex matching the pattern. Unfortunately, regex is not my strong point.

So far I am stuck at detection of the file#row# structure in the line

cat InputFile | while read line
do
 echo $line | grep '[F,f]\d\d[R,r]\d\d' >/dev/null #this is assuming two digit file number and 2 digit row number 
 result=$?
 if [ $result -eq 0 ]
 then
 echo $line >tempfile
 fi
done

This regex in the grep command fails every time, even if the line contains the F08R16 pattern.

After accomplishing this, I want to extract this substring into a variable, analyze its structure, and add leading zeros where necessary to make it uniform.

Any suggestions for correcting my regex and accomplishing my higher goal of extracting into a variable are greatly appreciated.

For what it is worth, I am working on a CentOS release 6.7 box at the moment, but I have other distros at my disposal.

Question 2

Your 3 lines of sample data are FnnRnn, but your grep regex has SnnEnn; is one of them wrong? Otherwise, that's why you're not matching.

Question 3

Are you trying to match a comma? Inside a bracket expression, [F,f] matches F, f or a literal comma. And I don't think \d means anything to grep by default; match [0-9] instead.

Question 4

Sorry, I copied that line from an online regex builder web page and modified it on my Linux box, but did not reflect the changes here. It is now corrected and still not matching. I tried both single and double quotes around grep's pattern, but the result was no different.

Question 5

@mikeserv: the pattern I am trying to match came from an online regex builder, not from my own knowledge.

Question 6

I'd suggest a whole different approach - perhaps a perl one-liner using sprintf with suitable zero-padded formats for the digits

Question 7

I will assume you want to match an f or an F, then 1, 2 or 3 digits followed by an r or R, and then 1, 2 or 3 digits again, up to a _. If so, you can do (with GNU grep):

grep -iP 'f\d{1,3}r\d{1,3}_' InputFile > tmpfile

Or, with non-GNU grep:

grep -iE 'f[0-9]{1,3}r[0-9]{1,3}_' InputFile > tmpfile
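
As a quick check of the pattern against the sample lines from the question, here is a minimal sketch. The original attempt fails for two reasons: [F,f] also accepts a literal comma, and \d is Perl syntax that grep's default basic regular expressions do not treat as a digit class. The helper name has_file_row is made up for illustration.

```shell
# Hypothetical helper: succeeds if the line contains an F<digits>R<digits>_ tag.
has_file_row() {
    printf '%s\n' "$1" | grep -qiE 'f[0-9]{1,3}r[0-9]{1,3}_'
}

for line in \
    '"erroneous_data_F08R16_recordeded_by_tech21"' \
    '"erroneous_data_F8R16_recordeded_by_tech021"' \
    '"erroneous_data_F008R016_recordeded_by_tech21"' \
    'just an unstructured comment'
do
    if has_file_row "$line"; then
        echo "match:    $line"
    else
        echo "no match: $line"
    fi
done
```

The three data lines should be reported as matches and the comment line as a non-match. For whole files, running grep directly on the file as shown above is still the simpler route.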

However, this is almost certainly an XY problem. You really don't want to be doing this sort of thing in the shell. For example, this perl one-liner will format all the relevant lines correctly:

$ perl -pe 's/_f(\d+)r(\d+)_/sprintf("_F%03dR%03d_",$1,$2)/ei' file
"erroneous_data_F008R016_recordeded_by_tech21"
"erroneous_data_F008R016_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"

That's just to give you an idea of the sort of tricks you can use to avoid this type of issue.
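
If the normalized tag is still needed in a shell variable, as the question asks, one possible sketch uses sed to pull out the two numbers and printf to pad them. The function name file_row_tag and the variable tag_nums are made up for illustration, and the sketch assumes each line holds at most one tag of interest (the last one wins if there are several).

```shell
# Hypothetical helper: print the normalized FnnnRnnn tag found in a line,
# or return non-zero if the line has no such tag.
file_row_tag() {
    # Capture both numbers with leading zeros stripped, so printf does not
    # try to read a value such as 08 as an (invalid) octal number.
    tag_nums=$(printf '%s\n' "$1" |
        sed -n 's/.*_[Ff]0*\([0-9][0-9]*\)[Rr]0*\([0-9][0-9]*\)_.*/\1 \2/p')
    [ -n "$tag_nums" ] || return 1
    # tag_nums is "<file> <row>"; split it on the space and zero-pad each part.
    printf 'F%03dR%03d\n' "${tag_nums% *}" "${tag_nums#* }"
}

tag=$(file_row_tag '"erroneous_data_F8R16_recordeded_by_tech021"')
echo "$tag"    # prints F008R016
```

Stripping the leading zeros before calling printf matters: many printf implementations interpret a numeric argument with a leading 0 as octal, and 08 is not a valid octal number.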

Question 8

Don't echo it into grep like that - that's crazy.

<infile grep -iE '([fr][0-9]+){2}' >outfile

...should get you the lines you're asking about. Calling cat to write a file to your shell over a pipe that you then read byte for byte which you then copy out over another pipe after interpreting and eliding various shell syntax characters with echo byte for byte so that you can silently grep those bytes for success... well...

grep will just write the matches out to you. If you want a count of matching lines or something use -c. If you want the line numbers for matching lines use -n. If you want case-insensitive matches use -i. Maybe try man grep for more.
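
To illustrate those options, here is a small sketch that runs the same pattern with -c and -n. The temporary file and the variable names (sample, pattern, count, numbered) are assumptions for the demo; the data lines are the three from the question plus one made-up comment line.

```shell
# Hypothetical sample file: the three lines from the question plus a comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"erroneous_data_F08R16_recordeded_by_tech21"
some unstructured comment
"erroneous_data_F8R16_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"
EOF

pattern='([fr][0-9]+){2}'
count=$(grep -ciE "$pattern" "$sample")      # number of matching lines
numbered=$(grep -niE "$pattern" "$sample")   # matching lines prefixed with their line number

echo "$count"
echo "$numbered"
rm -f "$sample"
```

With this input, the count should be 3 and the numbered output should list lines 1, 3 and 4.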

To live-edit the stream you might use sed:

sed -Ee:t -e's/((_)[Ff]|[0-9]{3,}[Rr])([0-9]{1,2}(\2|[Rr]))/\10\3/g;tt'

You'll need a GNU/BSD/AST sed for that to work. But it works pretty well:

sed -Ee:t -e's/((_)[Ff]|[0-9]{3,}[Rr])([0-9]{1,2}(\2|[Rr]))/\10\3/g;tt' \
<<""
"erroneous_data_F08R16_recordeded_by_tech21"
"erroneous_data_F8R16_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"

"erroneous_data_F008R016_recordeded_by_tech21"
"erroneous_data_F008R016_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"

You're not the first guy to come here complaining about that tech 21, either. Somebody should straighten that guy out.

Question 9

terdon's perl answer is certainly elegant, and I agree: if the goal is to make all the data formatted uniformly/consistently, there's no need to separate out the lines that need to be changed. In case you don't like perl (or in the unlikely event that you don't have it), here's a sed solution:

sed -re 's/_[Ff]([0-9]+)[Rr]([0-9]+)_/_F00\1R00\2_/' \
 -e 's/_F0*([0-9]{3})R0*([0-9]{3})_/_F\1R\2_/'

This can be typed as all one line (leave out the \ at the end of the first line). I admit, this is not as elegant as the perl solution. It works in two steps:

1. Add 00 after every F or R (or f or r) in the _ F file_number R row_number _ pattern. This changes single-digit 8 to 008, double-digit 08 to 0008, and triple-digit 008 to 00008. (The first step also capitalizes f or r.)
2. After every F or R in the _ F file_number R row_number _ pattern, delete however many zeroes appear before the last three digits. So 008 is left alone, while 0008 and 00008 are changed to 008.
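
The two steps can be watched separately in a small sketch like the one below. It uses basic regular expression syntax so it does not depend on -r, and the variable names (sample, step1, step2) are made up for the demo; the input is one of the question's lines with the tag lowercased to show the capitalization.

```shell
sample='"erroneous_data_f8r16_recordeded_by_tech021"'

# Step 1: capitalize F/R and prepend 00 to each number (8 -> 008, 16 -> 0016).
step1=$(printf '%s\n' "$sample" |
    sed -e 's/_[Ff]\([0-9][0-9]*\)[Rr]\([0-9][0-9]*\)_/_F00\1R00\2_/')

# Step 2: drop surplus leading zeros, keeping the last three digits of each number.
step2=$(printf '%s\n' "$step1" |
    sed -e 's/_F0*\([0-9]\{3\}\)R0*\([0-9]\{3\}\)_/_F\1R\2_/')

echo "$step1"    # "erroneous_data_F008R0016_recordeded_by_tech021"
echo "$step2"    # "erroneous_data_F008R016_recordeded_by_tech021"
```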

If your version of sed doesn't support the -r (use extended regular expressions) option, use

sed -e 's/_[Ff]\([0-9]*\)[Rr]\([0-9]*\)_/_F00\1R00\2_/' \
 -e 's/_F0*\([0-9]\{3\}\)R0*\([0-9]\{3\}\)_/_F\1R\2_/'

using \(...\) instead of (...), \{3\} instead of {3}, and * instead of +. (* and + don't mean the same thing, but they're close enough in this case, unless there are lines with strings like _FR42_ or _F17R_. In fact, you could use * instead of + in the first command, too.)

How to use these

sed option(s) scripts InputFile
or
sed option(s) scripts < InputFile
to process the input file and see the results on the screen.
sed option(s) scripts InputFile > output_file
or
sed option(s) scripts < InputFile > output_file
to process the input file and send the results to a new file.
sed -i option(s) scripts InputFile
to process the file and modify it in-place; i.e., send the results back into the original file.

Accepted answer: terdon, 2016-01-06 16:15:27Z (the grep and perl answer above)

