Inside a folder named txt I have 138 text files (totaling 349 MB) full of email addresses. I have no idea (yet) how many addresses there are; they are separated from one another by line breaks. I created the following script to read all of these files into an array, dismiss the duplicates, sort alphabetically, and save the result in groups of 10K addresses per CSV file. It works correctly, but it has also been running for over 8 hours (dual-core i3 with 4 gigabizzles of RAM, SATA 7200 RPM HDD), which seems excessive to me. top also tells me that my program's CPU usage is 100%, and it's been like that the whole time it's been running. Give my script a looksie and advise me on where I've gone so terribly wrong.
function writeFile($fileName, $fileData)
{
    $writeFileOpen = fopen('csv/' . $fileName, 'w');
    fwrite($writeFileOpen, $fileData) or die('Unable to write file: ' . $fileName);
    fclose($writeFileOpen);
}
function openFiles()
{
    $addressList = array();
    $preventRepeat = array();
    if ($handle = opendir('txt')) {
        while (false !== ($file = readdir($handle))) {
            if ($file != '.' && $file != '..') {
                $newList = explode("\n", trim(file_get_contents('txt/' . $file)));
                foreach ($newList as $key => $val) {
                    $val = str_replace(array(',', '"'), '', $val);
                    if (in_array($val, $preventRepeat) || !strpos($val, '@') || !$val) {
                        unset($newList[$key]);
                    }
                    $preventRepeat[] = $val;
                }
                if (empty($addressList)) {
                    $addressList = $newList;
                } else {
                    $addressList = array_merge($addressList, $newList);
                }
                unset($newList);
            }
        }
        closedir($handle);
    } else {
        echo 'Unable to Read Directory';
    }
    $lineNum = 1;
    $fileNum = 1;
    $fileData = '"Email Address"' . "\n";
    sort($addressList);
    $lastKey = count($addressList) - 1;
    foreach ($addressList as $key => $val) {
        if ($lineNum > 10000) {
            writeFile('emailList-' . $fileNum . '.csv', trim($fileData));
            $lineNum = 1;
            $fileNum++;
            $fileData = '"Email Address"' . "\n";
        } elseif ($key == $lastKey) {
            writeFile('emailList-' . $fileNum . '.csv', trim($fileData));
            echo 'Complete';
        }
        $fileData .= '"' . trim($val) . '"' . "\n";
        $lineNum++;
    }
}

openFiles();
P.S. I wrote this script in haste to accomplish a task, so I can understand if you wouldn't consider it distribution-worthy.
EDIT: I arrived at work today to find that my script had at some point exceeded 536 MB of memory and quit with a fatal error. I had already increased the memory_limit parameter in my ini file from 128 MB to 512 MB. This doesn't necessarily reflect an issue with the script itself; I just wanted to share my frustration with the task.
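(For reference, the same limit can also be raised from within the script itself rather than in php.ini; this is a standard PHP call, shown only as an aside:)

// Per-script alternative to editing php.ini; affects only the current run.
ini_set('memory_limit', '512M');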
3 Answers
It's been a long time since I've written PHP, but I think I can give you a few pointers anyhow.
I believe the following line is problematic speed-wise:
if (in_array($val, $preventRepeat) || !strpos($val, '@') || !$val) {
...
The in_array function does a sequential, case-sensitive lookup across all currently stored addresses. This call will become slower and slower the more addresses you store.
The solution is to use a hash-table-like lookup. I'm not entirely sure, but I believe the PHP equivalent can be achieved by using the keys of the array instead of the values.
// Store a certain address.
$addressList[$val] = true;
Checking whether a value is present for a given key indicates whether it has already been stored. Notice how $preventRepeat can be removed and everything is stored in $addressList. This removes the need for the array_merge, again resulting in better performance.
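Concretely, your inner loop could then shrink to something like this (an untested sketch that keeps your original cleanup and '@' check):

foreach ($newList as $val) {
    $val = str_replace(array(',', '"'), '', $val);
    if ($val && strpos($val, '@')) {
        // The address-as-key doubles as the duplicate check:
        // setting the same key twice is a harmless overwrite.
        $addressList[$val] = true;
    }
}
// Later, sort by key and pull the addresses back out:
ksort($addressList);
$addresses = array_keys($addressList);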
Beware, this is speculation, so I'm hoping someone who is certain can verify this. :)
Relating to my earlier comment:
You could probably make the script twice as fast by utilizing both cores: separate the processing across two processes, or use two threads. I find it strange that it reports CPU usage of 100% at the moment; shouldn't it run on just one core?
PHP doesn't seem to support multithreading, so the only option would be to logically split the script and separate the work into two different executions. If my previous comments don't improve the speed much, it's probably advisable to use a different language than PHP for these purposes.
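A rough sketch of that split (hypothetical worker script; run "php worker.php 0" and "php worker.php 1" side by side, and each instance handles every other file):

$slice = isset($argv[1]) ? (int) $argv[1] : 0;
$i = 0;
foreach (scandir('txt') as $file) {
    if ($file == '.' || $file == '..') {
        continue;
    }
    if ($i++ % 2 == $slice) {
        // ... process txt/$file as in the original script ...
    }
}

The two partial results would still need a final merge and sort, so this only helps if the per-file work dominates.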
- Because I'm using PHP's explode() function to generate my arrays, I have no way to know or guarantee what position or key each specific value will take. Great point on having double sets of the same data, though. I removed my $addressList array and instead am storing those values in a temporary text file. I've also made sure to unset $preventRepeat before importing and array-ifying my temp file. I think my ultimate issue is that I'm trying to process way too much data at once (I'm guessing close to a billion addresses). I want them all in alphabetical order, though! – 65Fbef05, Mar 23, 2011 at 15:01
- PHP allows strings to be used as keys in its associative arrays. What I intended with my answer is to store the addresses (so, $val in your code) as keys. Keys are unique, so you don't even have to check whether the key is already present; just set the value of the key to true or some other relevant value. Worry about sorting later. If I were you, though, I would really prefer not processing a billion addresses in PHP. – Steven Jeuris, Mar 23, 2011 at 15:16
- Ahh... so I should if ($preventRepeat[$val]) { unset($newList[$key]); } and then $preventRepeat[$val] = TRUE; instead of in_array($val, $preventRepeat) and $preventRepeat[] = $val;? That's brilliant! – 65Fbef05, Mar 23, 2011 at 15:31
- I believe unset isn't even required. I would just use one list of which you set the keys, and skip the merging. Be sure to try this out on just one small file before executing it on all of the entries. ;p – Steven Jeuris, Mar 23, 2011 at 15:38
- I'm picking up what you're putting down. Store all values in the array as keys (which doubles as duplicate prevention), forget about making a temp file, and later reverse the keys out into array values for processing. – 65Fbef05, Mar 23, 2011 at 15:48
This will be much more efficient: it reads each file line by line with fgets() instead of loading entire files into memory, deduplicates by using the lowercased addresses as array keys, and sorts the keys once at the end:
$result = array();

if (($handle = opendir('./txt/')) !== false)
{
    set_time_limit(0);
    ini_set('memory_limit', -1);

    while (($file = readdir($handle)) !== false)
    {
        if (($file != '.') && ($file != '..'))
        {
            if (is_resource($file = fopen('./txt/' . $file, 'rb')) === true)
            {
                while (($email = fgets($file)) !== false)
                {
                    $email = trim(str_replace(array(',', '"'), '', $email));

                    if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false)
                    {
                        $result[strtolower($email)] = true;
                    }
                }

                fclose($file);
            }
        }
    }

    closedir($handle);

    if (empty($result) !== true)
    {
        ksort($result);

        foreach (array_chunk($result, 10000, true) as $key => $value)
        {
            file_put_contents('./emailList-' . ($key + 1) . '.csv', implode("\n", array_keys($value)), LOCK_EX);
        }
    }

    echo 'Done!';
}
- +1 for using filter_var() (never occurred to me), but why the identical comparison operator? Not-equal operators test against bool just fine without the extra overhead of type matching. – 65Fbef05, Mar 23, 2011 at 20:14
- @65Fbef05: Force of habit, ignore it. @Steven Jeuris: Thank you! =) – Alix Axel, Mar 24, 2011 at 1:08
Definitely use the command-line sort tool, and look into sed and grep as well. You will find that it is generally easier to use fast, well-tested, pre-built Unix tools to perform any large text operation than to write a higher-level program to do the same.
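For instance, something along these lines (an untested sketch; split's -d flag assumes GNU coreutils, and the file names are illustrative) deduplicates, sorts, and splits in two commands:

# Merge, dedupe, and sort every address in one pass.
sort -u txt/* > emails.sorted
# Cut the result into 10,000-line chunks: emailList-00, emailList-01, ...
split -l 10000 -d emails.sorted emailList-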
If you are just getting into Unix, also check out:
- the ImageMagick set of utilities, which provide fantastic image processing
- the file tool (based on libmagic), which provides proper file type checking for uploaded files accepted from the public
Also, just in case those emails are going to be used to distribute unsolicited information... don't spam people: it's bad karma.
- I appreciate the advice. I'm working my way through an LPIC-1 study guide in an attempt to quickly get a thorough overview (oxymoron, right?) of Linux features. Also, worry not about your e-mail inbox; I don't want to spam people. I won't go into detail, but I was using the addresses to "test" an account feature on a music/social networking site. This feature turned out to be not completely developed, or just broken. :) – 65Fbef05, Apr 12, 2011 at 12:10
- Isn't every address stored twice, once in $addressList and once in $preventRepeat? Assuming all addresses are unique, this would already take up 349 MB * 2 of memory.
- time LC_ALL=C sort -u txt/* > email.sorted. Bonus features: still fast if the files don't fit in RAM (with in-memory compression and temp files), and multicore since 8.6.
. Bonus features: still fast if the files don't fit in ram (with in-memory compression and tempfiles), and multicore since 8.6. \$\endgroup\$