\$\begingroup\$

I have the code below, which works. It parses and cleans very large log files and splits the output into smaller files. The output filename is the first 2 characters of each line. Any special character among those 2 characters is replaced with a '_', so that the filename contains no illegal characters.

It takes about 12-14 minutes to process 1 GB of logs on my laptop. Can this be made faster?

For example, would it help to run this in parallel? I am aware I could do }' "$FILE" &. However, I tested that and it does not help much. Perhaps AWK itself could run in parallel (the equivalent of print $0 >> Fpath &)?

Sample log file

[email protected]:datahere2 
[email protected]:datahere2
[email protected] datahere2
[email protected];dtat'ah'ere2 
wrongemailfoo.com
[email protected];data.is.junk-Œœ
[email protected]:datahere2

Expected Output

# cat em 
[email protected]:datahere2 
[email protected]:datahere2
[email protected]:datahere2
[email protected]:dtat'ah'ere2 
[email protected]:datahere2
# cat errorfile
wrongemailfoo.com
[email protected];data.is.junk-Œœ

Code:

#!/bin/bash
pushd "_test2" > /dev/null
for FILE in *
do
    awk '
    BEGIN {
        FS=":"
    }
    {
        # Strip leading/trailing blanks and quote characters
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        # Normalize the first separator run to ":" (gensub requires gawk)
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            # Output file: first 2 characters, lowercased and sanitized
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
        }
        else
            print $0 >> "errorfile"
    }' "$FILE"
done
popd > /dev/null
Mast
asked Jun 18, 2020 at 2:48
\$\endgroup\$
  • To the reviewers: I think this question has been edited into shape good enough that it's within the scope of the site. Please leave a comment if you disagree. Commented Jul 3, 2020 at 6:07
  • Mast: i disagree. your edits wont make the code run any faster. in future pls post your answer separately, and not edit the original ques Commented Jul 3, 2020 at 13:56
  • @rogerwhite You may find this blog post helpful: aosabook.org/en/posa/… .. Commented Jul 5, 2020 at 11:45
  • @akki - sorry for the slow revert. all good, but i need help with the code !! Commented Jul 25, 2020 at 14:17

1 Answer

\$\begingroup\$

Perhaps you can avoid switching the output file for each line, either by sorting the input files first or by making some changes to your script:

# Store the 2-letter filenames
outfile[Fpath];
# Store the highest index for the given output file
i[Fpath]++;
# Store the current line in the output array for that file
# (arrays of arrays require gawk 4 or later)
a[Fpath][i[Fpath]]=$0
# And in the END block, print the array for each output file
for (out in outfile) {
    for (j=1;j<=i[out]; j++) {
        print a[out][j] >> out;
    }
}

This results in

awk '
    BEGIN {
        FS=":"
    }
    {
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        $0=gensub("[,|;: \t]+",":",1,$0)
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath);
            # Buffer the line in memory instead of writing it immediately
            outfile[Fpath];
            i[Fpath]++;
            a[Fpath][i[Fpath]]=$0
        }
        else
            print $0 >> "errorfile"
    }
    END {
        # Write each output file in one run
        for (out in outfile) {
            for (j=1;j<=i[out]; j++) {
                print a[out][j] >> out;
            }
        }
    } ' "$FILE"
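
The other option mentioned above, sorting first, might look roughly like the sketch below. It is untested against your real logs and makes several assumptions: the lines are already cleaned and validated (that step is left out here), the directory name split_demo and the sample lines are made up for the demo, and your sort supports the -s (stable) flag, as GNU and BSD sort do. The first awk prefixes each line with its output key, sort groups equal keys together, and the second awk only switches files when the key changes, so nothing has to be held in memory.

```shell
#!/bin/sh
# Sketch only: assumes lines are already cleaned and validated.
# The directory name and the sample data are hypothetical.
set -eu
outdir=split_demo
rm -rf "$outdir" && mkdir "$outdir"

# Hypothetical sample input, already in "address:data" form
printf '%s\n' 'emma@example.com:d1' 'bob@example.com:d2' \
              'EMil@example.com:d3' 'b.ob@example.com:d4' > "$outdir/input.log"

tab=$(printf '\t')

# Pass 1: prefix every line with its output-file key and a tab.
# Sort: stable sort on the key only, so lines keep their order per file.
# Pass 2: append to the current file; switch files only when the key changes.
awk '{
        key = tolower(substr($0, 1, 2))
        gsub(/[^a-z0-9]/, "_", key)      # replace characters unsafe in filenames
        print key "\t" $0
     }' "$outdir/input.log" |
LC_ALL=C sort -s -t "$tab" -k1,1 |
awk -F '\t' -v outdir="$outdir" '{
        if ($1 != prev) {
            if (path != "") close(path)  # done with the previous output file
            prev = $1
            path = outdir "/" $1
        }
        # Drop the key and the tab, keep the rest of the line untouched
        print substr($0, length($1) + 2) >> path
     }'
```

With the sample lines this should leave three files in split_demo: b_, bo and em. Whether the extra sort pays off against the saved file switching is something you would have to measure on your own data.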
answered Apr 18, 2021 at 18:13
\$\endgroup\$
