Better way - Parsing Large Log Files
NOTE: This question is about making my script better/faster. This is not about extracting emails. So, please do not mark it as a duplicate
I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format; there is a lot of legacy data in here, and a lot of junk and garbled text. I want to check all of these files for rows containing a valid email ID and, if one exists, print the row to a file named after the first character of the email ID. The multiple text files thus get parsed and organized into files named a-z and 0-9. If the email address starts with a special character, the row gets written to a file called "_" (underscore). The script also trims whitespace from each row and replaces single and double quotes (this is an application requirement).
My script works fine. There are no errors or bugs in it. But it is incredibly slow. My question: is there a more efficient way to achieve this? Parsing 30 GB of logs takes me about 12 hours - way too long! Would grep/cut/sed/anything else be any faster?
Sample txt File
[email protected],address
#[email protected];address
[email protected];address μÖ
[email protected];username;address
[email protected];username
[email protected],username;address [spaces at the start of the row]
[email protected]|username|address [tabs at the start of the row]
My Code:
awk -F'[,|;: \t]+' '{
 # trim leading/trailing whitespace; modifying 0ドル also re-splits the fields
 gsub(/^[ \t]+|[ \t]+$/, "")
 # only keep rows whose first field looks like a valid email address
 if (NF > 1 && 1ドル ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/) {
 # replace double and single quotes (application requirement); 047円 is the octal escape for a single quote
 gsub(/"/, "DQUOTES")
 gsub("047円", "SQUOTES")
 # replace the first run of delimiters with ":"
 r = gensub("[,|;: \t]+", ":", 1, 0ドル)
 # the first character of the email decides the output file: a-z / 0-9, or "_" for anything else
 a = tolower(substr(r, 1, 1))
 if (a ~ /^[[:alnum:]]/)
 print r > a
 else
 print r > "_"
 }
 else
 # rows without a valid email go to ErrorFile
 print 0ドル > "ErrorFile"
}' *.txt
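As a minimal illustration of the intended routing (the addresses and the sample.txt file name below are hypothetical, made up purely for illustration), running the awk program above against sample.txt instead of *.txt should behave like this:

# hypothetical test rows, just to illustrate the intended routing
printf '%s\n' \
 ' [email protected],some address' \
 '[email protected]|username|address' \
 'no email on this row' > sample.txt

# expected contents after running the awk program on sample.txt:
# j -> [email protected]:some address (whitespace trimmed, first delimiter run replaced by ":")
# _ -> [email protected]:username|address (email starts with "_", so it goes to the underscore file)
# ErrorFile -> no email on this row (first field is not a valid email address)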