Parse and clean large log files

I have the code below, which works correctly. It parses and cleans very large log files and splits them into smaller output files. The output filename is the first 2 characters of each line; however, if either of those 2 characters is a special character, it is replaced with a '_' so that the filename contains no illegal characters (for example, a line starting with "a!" would be written to a file named "a_").

This takes about 12-14 minutes to process 1 GB of logs (on my laptop). Can it be made faster?

For example, would it help to run this in parallel? I am aware I could do }' "$FILE" &, but I tested that and it does not help much. Perhaps AWK itself could run in parallel (the equivalent of print 0ドル >> Fpath & )?
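For reference, the backgrounded variant I tested looks roughly like the sketch below. It assumes the AWK program from the Code section is saved as clean.awk (a hypothetical filename) next to the script. Note that all the background awk processes append to the same output files (em, errorfile, ...), and concurrent appends from separate processes can interleave.

#!/bin/bash
pushd "_test2" > /dev/null
for FILE in *
do
 # launch one awk per input file in the background
 awk -f ../clean.awk "$FILE" &
done
# block until every background job has finished
wait
popd > /dev/null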

Sample log file

email1@foo.com:datahere2 
email2@foo.com:datahere2
email3@foo.com datahere2
email4@foo.com;dtat'ah'ere2 
wrongemailfoo.com
email5@foo.com;data.is.junk-Œœ
email6@foo.com:datahere2

Expected Output

# cat em 
email1@foo.com:datahere2 
email2@foo.com:datahere2
email3@foo.com:datahere2
email4@foo.com:dtat'ah'ere2 
email6@foo.com:datahere2
# cat errorfile
wrongemailfoo.com
email5@foo.com;data.is.junk-Œœ

Code:

#!/bin/bash
pushd "_test2" > /dev/null
for FILE in *
do
 awk '
 BEGIN {
 FS=":"
 }
 {
 # strip leading/trailing whitespace and quotes
 gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
 # normalise the first run of separator characters to a single ":"
 0ドル=gensub("[,|;: \t]+",":",1,0ドル)
 # keep ASCII-only lines whose first field looks like an email address
 if (NF>1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && 0ドル ~ /^[\x00-\x7F]*$/)
 {
 # output file: first 2 characters of the line, lower-cased,
 # with any non-alphanumeric character replaced by "_"
 Fpath=tolower(substr(1,1,2ドル))
 Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
 print 0ドル >> Fpath
 }
 else
 print 0ドル >> "errorfile"
 }' "$FILE"
done
popd > /dev/null