Better way - Parsing Large Log Files
NOTE: This question is about making my script better/faster. This is not about extracting emails. So, please do not mark it as a duplicate
I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format; there is a lot of legacy data in here, and a lot of junk and garbled text. I want to check all of these files for rows containing a valid email ID and, if one exists, print the row to a file named after the first character of the email ID. The multiple text files thus get parsed and organized into files named a-z and 0-9. If the email address starts with a special character, the row gets written to a file called "_" (underscore). The script also trims whitespace from each row and replaces single and double quotes (this is an application requirement).
My script works fine. There are no errors or bugs in it. But it is incredibly slow. My question: is there a more efficient way to achieve this? Parsing 30 GB of logs takes me about 12 hours - way too long! Would grep/cut/sed/anything else be any faster?
Sample txt File
[email protected],address
#[email protected];address
[email protected];address μÖ
[email protected];username;address
[email protected];username
[email protected],username;address [spaces at the start of the row]
[email protected]|username|address [tabs at the start of the row]
My Code:
awk -F'[,|;: \t]+' '{
 # trim leading/trailing whitespace; modifying 0ドル also re-splits the fields
 gsub(/^[ \t]+|[ \t]+$/, "")
 # only keep rows whose first field looks like a valid email address
 if (NF > 1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/) {
 # replace double and single quotes (application requirement); 047円 is the octal escape for a single quote
 gsub(/"/, "DQUOTES")
 gsub("047円", "SQUOTES")
 # replace the first run of delimiters with ":"
 r = gensub("[,|;: \t]+", ":", 1, 0ドル)
 # the first character of the email decides the output file: a-z / 0-9, or "_" for anything else
 a = tolower(substr(r, 1, 1))
 if (a ~ /^[[:alnum:]]/)
 print r > a
 else
 print r > "_"
 }
 else
 # rows without a valid email go to ErrorFile
 print 0ドル > "ErrorFile"
}' *.txt
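As a minimal illustration of the intended routing (the addresses and the sample.txt file name below are hypothetical, made up purely for illustration), running the awk program above against sample.txt instead of *.txt should behave like this:

# hypothetical test rows, just to illustrate the intended routing
printf '%s\n' \
 ' [email protected],some address' \
 '[email protected]|username|address' \
 'no email on this row' > sample.txt

# expected contents after running the awk program on sample.txt:
# j -> [email protected]:some address (whitespace trimmed, first delimiter run replaced by ":")
# _ -> [email protected]:username|address (email starts with "_", so it goes to the underscore file)
# ErrorFile -> no email on this row (first field is not a valid email address)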