Extracting emails from log files

I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format; there is a lot of legacy data, and plenty of junk and garbled text. I want to check every row of these files for a valid email address and, if one is found, write the row to a file named after the first character of that address. In this way, multiple text files get parsed and organized into files named a-z and 0-9. If the email address starts with a special character, the row is written to a file called "_" (underscore). The script also trims whitespace from each row and replaces single and double quotes (an application requirement).

My script works fine; there are no errors or bugs in it. But it is incredibly slow. My question: is there a more efficient way to achieve this? Parsing 30 GB of logs takes me about 12 hours, which is far too long. Would grep, cut, sed, or another tool be any faster?

Sample text file:

[email protected],address
#[email protected];address
[email protected];address μÖ
[email protected];username;address
[email protected];username
 [email protected],username;address [spaces at the start of the row]
 [email protected]|username|address [tabs at the start of the row]

My Code:

awk -F'[,|;: \t]+' '{
 # Trim leading/trailing whitespace from the whole record.
 gsub(/^[ \t]+|[ \t]+$/, "")
 if (NF > 1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/) {
 # Replace double and single quotes with placeholders (application requirement).
 # 047円 is the octal escape for a single quote, which cannot appear
 # literally inside the single-quoted awk program.
 gsub(/"/, "DQUOTES")
 gsub("047円", "SQUOTES")
 # Normalise the first delimiter run after the email address to ":".
 r = gensub("[,|;: \t]+", ":", 1, 0ドル)
 # Bucket the row by the first character of the email address.
 a = tolower(substr(r, 1, 1))
 if (a ~ /^[[:alnum:]]/)
 print r > a
 else
 print r > "_"
 } else {
 print 0ドル > "ErrorFile"
 }
}' *.txt
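
For example, would pre-filtering along these lines help? This is only a rough, untested sketch: it uses a cheap fixed-string grep so that awk only sees rows containing an "@", and forces the C locale, which often speeds up regex matching on large text. Note that it changes behaviour slightly: rows without an "@" are discarded by grep instead of landing in ErrorFile, and under LC_ALL=C the [[:alnum:]] classes match ASCII only.

# Untested sketch: pre-filter candidate rows and avoid locale overhead.
# grep -h suppresses file-name prefixes so the delimiter parsing is unchanged.
LC_ALL=C grep -h '@' -- *.txt | LC_ALL=C awk -F'[,|;: \t]+' '{
 # Same logic as above, but reading the pre-filtered rows from stdin.
 gsub(/^[ \t]+|[ \t]+$/, "")
 if (NF > 1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/) {
 gsub(/"/, "DQUOTES")
 gsub("047円", "SQUOTES")
 r = gensub("[,|;: \t]+", ":", 1, 0ドル)
 a = tolower(substr(r, 1, 1))
 if (a ~ /^[[:alnum:]]/)
 print r > a
 else
 print r > "_"
 } else {
 # Only malformed rows that do contain an "@" end up here now.
 print 0ドル > "ErrorFile"
 }
}'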