I have files coming in that were created manually by many different people. The formatting, although it follows a certain rule, is not uniform.
Consider these three lines:
"erroneous_data_F08R16_recordeded_by_tech21"
"erroneous_data_F8R16_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"
They all point to the same thing: F008, F08 or F8 means file number 8, and R16 or R016 (or R followed by a single digit, where possible) means row number 16.
There can be any number of these lines in a given file, which will be scanned using a while read line loop.
What I want to do is make the file and row number section uniform, such as F008R016 for the three example lines above. My file numbers are never more than 3 digits (they roll over after reaching 999), and the number of rows is never more than a handful in each file, but for the sake of consistency, let's say it is also 3 digits. The file I need to process also contains unstructured comments, so the first order of business is detecting the structured lines and separating them into a different temporary file, then making them uniform.
In order to accomplish this, my plan is to echo the line and grep for regex matching the pattern. Unfortunately, regex is not my strong point.
So far I am stuck at detection of the file#row# structure in the line
cat InputFile | while read line
do
echo $line | grep '[F,f]\d\d[R,r]\d\d' >/dev/null #this is assuming two digit file number and 2 digit row number
result=$?
if [ $result -eq 0 ]
then
echo $line >tempfile
fi
done
This regex match in the grep command fails every time, even if the line contains the F08R16 pattern.
After accomplishing this, I want to extract this substring into a variable, analyze its structure, and add leading zeros where necessary to make it uniform.
Any suggestions to correct my regex and accomplish my higher goal of extracting into a variable are greatly appreciated.
For what it is worth, I am working on a CentOS release 6.7 box at the time but I have other distros at my disposal.
3 Answers
I will assume you want to match an f or an F, then 1, 2 or 3 digits followed by an r or R and then 1, 2 or 3 digits again until a _. If so, you can do (with GNU grep):
grep -iP 'f\d{1,3}r\d{1,3}_' InputFile > tmpfile
Or, with non-GNU grep:
grep -iE 'f[0-9]{1,3}r[0-9]{1,3}_' InputFile > tmpfile
However, this is almost certainly an XY problem. You really don't want to be doing this sort of thing in the shell. For example, this perl one-liner will format all the relevant lines correctly:
$ perl -pe 's/_f(\d+)r(\d+)_/sprintf("_F%03dR%03d_",$1,$2)/ei' file
"erroneous_data_F008R016_recordeded_by_tech21"
"erroneous_data_F008R016_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"
That's just to give you an idea of the sort of tricks you can use to avoid this type of issue.
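If the normalized value really does have to pass through a shell variable, as the question asks, the same sprintf idea can be sketched in bash alone. This is only a rough sketch under assumptions: it requires bash (for [[ =~ ]] and BASH_REMATCH), the helper name normalize_line is made up for illustration, and it expects at most one _F...R..._ tag per line.

```bash
#!/usr/bin/env bash
# normalize_line is a hypothetical helper name, not something from the question.
normalize_line() {
    local line=$1
    # prefix, file digits, row digits, suffix
    local re='^(.*_)[Ff]([0-9]{1,3})[Rr]([0-9]{1,3})(_.*)$'
    if [[ $line =~ $re ]]; then
        # 10# forces base 10, so 08 is not rejected as an invalid octal number.
        printf '%sF%03dR%03d%s\n' \
            "${BASH_REMATCH[1]}" \
            "$((10#${BASH_REMATCH[2]}))" \
            "$((10#${BASH_REMATCH[3]}))" \
            "${BASH_REMATCH[4]}"
    else
        # Unstructured comment lines pass through unchanged.
        printf '%s\n' "$line"
    fi
}

normalize_line '"erroneous_data_F8R16_recordeded_by_tech021"'
normalize_line 'some free-form comment'
# In a loop: while IFS= read -r line; do normalize_line "$line"; done < InputFile
```

For large files the perl one-liner is still likely to be much faster, since the shell loop processes one line at a time.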
Don't echo it into grep like that - that's crazy.
<infile grep -iE '([fr][0-9]+){2}' >outfile
...should get you the lines you're asking about. Calling cat to write a file to your shell over a pipe that you then read byte for byte which you then copy out over another pipe after interpreting and eliding various shell syntax characters with echo byte for byte so that you can silently grep those bytes for success... well...
grep will just write the matches out to you. If you want a count of matching lines or something use -c. If you want the line numbers for matching lines use -n. If you want case-insensitive matches use -i. Maybe try man grep for more.
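As a quick illustration of those options, here is a small sketch that runs the pattern above against a throwaway sample file. The sample lines (including the free-form comment) and the temporary file are assumptions made up for the demo; substitute your real input file.

```sh
# Build a small sample: three structured lines and one unstructured comment.
input=$(mktemp)
cat > "$input" <<'EOF'
"erroneous_data_F08R16_recordeded_by_tech21"
operator note: rerun F8 after lunch
"erroneous_data_f8r16_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"
EOF

# -c: count matching lines; -i: ignore case; -E: extended regex
count=$(grep -ciE '([fr][0-9]+){2}' "$input")
# -n: prefix each matching line with its line number
numbered=$(grep -niE '([fr][0-9]+){2}' "$input")

printf 'matching lines: %s\n' "$count"
printf '%s\n' "$numbered"
rm -f "$input"
```

grep reads the file directly here, so no cat, while read, or echo is involved.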
To live-edit the stream you might use sed:
sed -Ee:t -e's/((_)[Ff]|[0-9]{3,}[Rr])([0-9]{1,2}(\2|[Rr]))/\10\3/g;tt'
You'll need a GNU/BSD/AST sed for that to work. But it works pretty well:
sed -Ee:t -e's/((_)[Ff]|[0-9]{3,}[Rr])([0-9]{1,2}(\2|[Rr]))/\10\3/g;tt' \
<<""
"erroneous_data_F08R16_recordeded_by_tech21"
"erroneous_data_F8R16_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"

"erroneous_data_F008R016_recordeded_by_tech21"
"erroneous_data_F008R016_recordeded_by_tech021"
"erroneous_data_F008R016_recordeded_by_tech21"
You're not the first guy to come here complaining about that tech 21, either. Somebody should straighten that guy out.
terdon's perl answer is certainly elegant, and I agree: if the goal is to make all the data formatted uniformly/consistently, there's no need to separate out the lines that need to be changed.
In case you don't like perl (or in the unlikely event that you don't have it), here's a sed solution:
sed -re 's/_[Ff]([0-9]+)[Rr]([0-9]+)_/_F00\1R00\2_/' \
    -e 's/_F0*([0-9]{3})R0*([0-9]{3})_/_F\1R\2_/'
This can be typed as all one line (leave out the \ at the end of the first line).
I admit, this is not as elegant as the perl solution.
It works in two steps:
- Add 00 after every F or R (or f or r) in the _ F file_number R row_number _ pattern. This changes single-digit 8 to 008, double-digit 08 to 0008, and triple-digit 008 to 00008. (The first step also capitalizes f or r.)
- After every F or R in the _ F file_number R row_number _ pattern, delete however many zeroes appear before the last three digits. So 008 is left alone, while 0008 and 00008 are changed to 008.
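The two steps can be watched separately by running each substitution on its own. The sketch below is only an illustration: the sample line is made up, and it uses sed -E rather than -r on the assumption that your sed accepts it (current GNU and BSD versions do).

```sh
line='"erroneous_data_f8R16_recordeded_by_tech021"'

# Step 1: prepend 00 to both numbers and capitalize the letters.
step1=$(printf '%s\n' "$line" |
    sed -E 's/_[Ff]([0-9]+)[Rr]([0-9]+)_/_F00\1R00\2_/')

# Step 2: drop the zeroes in front of the last three digits of each number.
step2=$(printf '%s\n' "$step1" |
    sed -E 's/_F0*([0-9]{3})R0*([0-9]{3})_/_F\1R\2_/')

printf 'after step 1: %s\n' "$step1"
printf 'after step 2: %s\n' "$step2"
```

After step 1 the tag reads F008R0016; step 2 trims it to F008R016. The trailing tech021 is untouched because it is not followed by an underscore.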
If your version of sed doesn't support the -r (use extended regular expressions) option, use
sed -e 's/_[Ff]\([0-9]*\)[Rr]\([0-9]*\)_/_F00\1R00\2_/' \
    -e 's/_F0*\([0-9]\{3\}\)R0*\([0-9]\{3\}\)_/_F\1R\2_/'
using \(...\) instead of (...), \{...\} instead of {...}, and * instead of +.
(* and + don't mean the same thing, but they're close enough in this case, unless there are lines with strings like _FR42_ or _F17R_. In fact, you could use * instead of + in the first command, too.)
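To see where * and + actually differ, here is a small sketch that feeds the first substitution a degenerate tag with no file number. The sample string is invented for the demo, and sed -E is assumed to be available.

```sh
line='note_FR42_pending'

# With +, at least one digit is required after F, so nothing matches.
with_plus=$(printf '%s\n' "$line" |
    sed -E 's/_[Ff]([0-9]+)[Rr]([0-9]+)_/_F00\1R00\2_/')

# With *, an empty file number is accepted and still gets 00 added.
with_star=$(printf '%s\n' "$line" |
    sed -E 's/_[Ff]([0-9]*)[Rr]([0-9]*)_/_F00\1R00\2_/')

printf 'with +: %s\n' "$with_plus"
printf 'with *: %s\n' "$with_star"
```

The + version leaves the line alone, while the * version turns it into note_F00R0042_pending. The second substitution then skips that line, because F00 does not contain three digits.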
How to use these
- sed option(s) scripts InputFile
  or
  sed option(s) scripts < InputFile
  to process the input file and see the results on the screen.
- sed option(s) scripts InputFile > output_file
  or
  sed option(s) scripts < InputFile > output_file
  to process the input file and send the results to a new file.
- sed -i option(s) scripts InputFile
  to process the file and modify it in-place; i.e., send the results back into the original file.
\d has no special meaning in grep's basic regular expressions; match [0-9] instead.
A perl one-liner using sprintf with suitable zero-padded formats for the digits handles the padding.