I have the code below, which works successfully. It parses and cleans log files (very large in size) and splits the output into smaller files. The output filename is the first 2 characters of each line; if either of these 2 characters is a special character, it is replaced with a '_', which ensures there is no illegal character in the filename.
This takes about 12-14 minutes to process 1 GB of logs (on my laptop). Can this be made faster?
For example, would it help to run this in parallel? I am aware I could background each awk invocation by appending & after the closing }' "$FILE". However, I tested that and it does not help much. Perhaps awk itself could run in parallel (the equivalent of print 0ドル >> Fpath &)?
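For reference, the backgrounded variant looks roughly like this (a sketch: the same awk program as in the code below, one process per input file, and wait to block until all of them finish):
pushd "_test2" > /dev/null
for FILE in *
do
 awk ' ... same awk program as below ... ' "$FILE" &
done
wait
popd > /dev/null
One caveat with this: if two input files feed the same output file (or both write to errorfile), the concurrent appends can interleave.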
Sample log file
[email protected]:datahere2
[email protected]:datahere2
[email protected] datahere2
[email protected];dtat'ah'ere2
wrongemailfoo.com
[email protected];data.is.junk-Œœ
[email protected]:datahere2
Expected Output
# cat fo
[email protected]:datahere2
[email protected]:datahere2
[email protected]:datahere2
[email protected]:dtat'ah'ere2
[email protected]:datahere2
# cat errorfile
wrongemailfoo.com
[email protected];data.is.junk-Œœ
Code:
#!/bin/bash
# Note: gensub() below requires GNU awk (gawk)
pushd "_test2" > /dev/null
for FILE in *
do
 awk '
 BEGIN {
 FS=":"
 }
 {
 # Trim leading/trailing whitespace and quotes
 gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
 # Normalize the first delimiter run to ":"
 0ドル=gensub("[,|;: \t]+",":",1,0ドル)
 if (NF>1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && 0ドル ~ /^[\x00-\x7F]*$/)
 {
 # Output file: first 2 chars of the address, specials replaced by "_"
 Fpath=tolower(substr(1,1,2ドル))
 Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
 print 0ドル >> Fpath
 }
 else
 print 0ドル >> "errorfile"
 }' "$FILE"
done
popd > /dev/null
1 Answer
Perhaps switching the output file for each line can be avoided by sorting the input files first, or with some changes to your script:
# Remember each 2-character output filename
outfile[Fpath]
# Count the lines destined for that output file
i[Fpath]++
# Buffer the current line in an array for that file
a[Fpath][i[Fpath]]=0ドル

# Then, in the END block, print the buffered lines per output file
for (out in outfile) {
 for (j=1; j<=i[out]; j++) {
 print a[out][j] >> out
 }
}
This results in:
awk '
BEGIN {
 FS=":"
}
{
 gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
 0ドル=gensub("[,|;: \t]+",":",1,0ドル)
 if (NF>1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && 0ドル ~ /^[\x00-\x7F]*$/)
 {
 Fpath=tolower(substr(1,1,2ドル))
 Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
 outfile[Fpath]
 i[Fpath]++
 a[Fpath][i[Fpath]]=0ドル
 }
 else
 print 0ドル >> "errorfile"
}
END {
 for (out in outfile) {
 for (j=1; j<=i[out]; j++) {
 print a[out][j] >> out
 }
 # Flush and release the descriptor before the next output file
 close(out)
 }
}' "$FILE"
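Be aware that this variant buffers every accepted line in memory until the END block, which for 1 GB of input can make the awk process very large. The other option mentioned above, sorting the input first, avoids that: lines with the same 2-character prefix become (mostly) adjacent, so the script can write one output file at a time and close the previous one whenever the prefix changes. A minimal sketch of that idea, assuming GNU awk and reusing the same validation, used in place of the plain awk call inside the loop:
sort "$FILE" | awk '
BEGIN {
 FS=":"
}
{
 gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
 0ドル=gensub("[,|;: \t]+",":",1,0ドル)
 if (NF>1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && 0ドル ~ /^[\x00-\x7F]*$/)
 {
 Fpath=tolower(substr(1,1,2ドル))
 Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
 # Sorted input: a prefix change means the previous file is (almost
 # always) finished, so close it to keep open descriptors to a minimum
 if (Fpath != prev) {
 if (prev != "") close(prev)
 prev=Fpath
 }
 print 0ドル >> Fpath
 }
 else
 print 0ドル >> "errorfile"
}'
Since every write uses >>, a prefix that reappears after its file was closed simply reopens that file in append mode, so the result stays correct even when sorting does not make a prefix perfectly contiguous (for example, lines that differ only in case or leading quotes).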