Inside a folder named txt I have 138 text files (totaling 349 MB) full of email addresses. I have no idea (yet) how many addresses there are; they are separated from one another by line breaks. I created the following script to read all of these files into an array, dismiss the duplicates, sort alphabetically, and save the result in groups of 10K addresses per CSV file. It works correctly, but it has also been running for over 8 hours (dual-core i3 with 4 gigabizzles of RAM, SATA 7200 RPM HDD), which seems excessive to me. top also tells me that my program's CPU usage is 100%, and it's been like that the whole time it's been running. Give my script a looksie and advise me on where I've gone so terribly wrong.
function writeFile($fileName, $fileData)
{
    $writeFileOpen = fopen('csv/' . $fileName, 'w');
    fwrite($writeFileOpen, $fileData) or die('Unable to write file: ' . $fileName);
    fclose($writeFileOpen);
}
function openFiles()
{
    $addressList = array();
    $preventRepeat = array();
    if ($handle = opendir('txt')) {
        while (false !== ($file = readdir($handle))) {
            if ($file != '.' && $file != '..') {
                $newList = explode("\n", trim(file_get_contents('txt/' . $file)));
                foreach ($newList as $key => $val) {
                    $val = str_replace(array(',', '"'), '', $val);
                    if (in_array($val, $preventRepeat) || !strpos($val, '@') || !$val) {
                        unset($newList[$key]);
                    }
                    $preventRepeat[] = $val;
                }
                if (empty($addressList)) {
                    $addressList = $newList;
                } else {
                    $addressList = array_merge($addressList, $newList);
                }
                unset($newList);
            }
        }
        closedir($handle);
    } else {
        echo 'Unable to Read Directory';
    }
    $lineNum = 1;
    $fileNum = 1;
    $fileData = '"Email Address"' . "\n";
    sort($addressList);
    $lastKey = count($addressList) - 1;
    foreach ($addressList as $key => $val) {
        if ($lineNum > 10000) {
            writeFile('emailList-' . $fileNum . '.csv', trim($fileData));
            $lineNum = 1;
            $fileNum++;
            $fileData = '"Email Address"' . "\n";
        } elseif ($key == $lastKey) {
            writeFile('emailList-' . $fileNum . '.csv', trim($fileData));
            echo 'Complete';
        }
        $fileData .= '"' . trim($val) . '"' . "\n";
        $lineNum++;
    }
}

openFiles();
P.S. I wrote this script in haste to accomplish a task, so I can understand if you wouldn't consider it distribution-worthy.
EDIT: I arrived at work today to find that my script had at some point exceeded 536 MB of memory and quit with a fatal error. I had already increased the memory_limit parameter in my ini file from 128 MB to 512 MB. This doesn't necessarily reflect an issue with the script itself; I just wanted to share my frustration with the task.
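(For reference, the same limit can also be raised from within the script itself rather than in php.ini; this is a standard PHP call, shown only as an aside:)

// Per-script alternative to editing php.ini; affects only the current run.
ini_set('memory_limit', '512M');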
3 Answers
It's been a long time since I've written PHP, but I think I can give you a few pointers anyhow.
I believe the following line is problematic speed-wise:
if (in_array($val, $preventRepeat) || !strpos($val, '@') || !$val) {
...
The in_array function does a sequential, case-sensitive lookup across all currently stored addresses. This call will become slower and slower the more addresses you store.
The solution is to use a hash-table-like lookup. I'm not entirely sure, but I believe the PHP equivalent can be achieved by using the keys of the array instead of the values.
// Store a certain address.
$addressList[$val] = true;
Checking whether a value is present for a given key indicates whether it has already been stored. Notice how $preventRepeat can be removed and everything is stored in $addressList. This removes the need for the array_merge, again resulting in better performance.
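Concretely, your inner loop could then shrink to something like this (an untested sketch that keeps your original cleanup and '@' check):

foreach ($newList as $val) {
    $val = str_replace(array(',', '"'), '', $val);
    if ($val && strpos($val, '@')) {
        // The address-as-key doubles as the duplicate check:
        // setting the same key twice is a harmless overwrite.
        $addressList[$val] = true;
    }
}
// Later, sort by key and pull the addresses back out:
ksort($addressList);
$addresses = array_keys($addressList);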
Beware, this is speculation, so I'm hoping someone who is certain can verify this. :)
Relating to my earlier comment:
You could probably make the script twice as fast by utilizing both cores: separate the processing across two processes, or use two threads. I find it strange that it reports CPU usage of 100% at the moment; shouldn't it run on just one core?
PHP doesn't seem to support multithreading, so the only option would be to logically split the script and separate the work into two different executions. If my previous comments don't improve the speed much, it's probably advisable to use a different language than PHP for these purposes.
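A rough sketch of that split (hypothetical worker script; run "php worker.php 0" and "php worker.php 1" side by side, and each instance handles every other file):

$slice = isset($argv[1]) ? (int) $argv[1] : 0;
$i = 0;
foreach (scandir('txt') as $file) {
    if ($file == '.' || $file == '..') {
        continue;
    }
    if ($i++ % 2 == $slice) {
        // ... process txt/$file as in the original script ...
    }
}

The two partial results would still need a final merge and sort, so this only helps if the per-file work dominates.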
- Because I'm using PHP's explode() function to generate my arrays, I have no way to know or guarantee what position or key each specific value will take. Great point on having double sets of the same data, though. I removed my $addressList array and instead am storing those values in a temporary text file. I've also made sure to unset $preventRepeat before importing and array-ifying my temp file. I think my ultimate issue is that I'm trying to process way too much data at once (I'm guessing close to a billion addresses). I want them all in alphabetical order, though! – 65Fbef05, Mar 23, 2011 at 15:01
- PHP allows strings to be used as keys in its associative arrays. What I intended with my answer is to store the addresses (so, $val in your code) as keys. Keys are unique, so you don't even have to check whether the key is already present; just set the value of the key to true or some other relevant value. Worry about sorting later. If I were you, though, I would really prefer not processing a billion addresses in PHP. – Steven Jeuris, Mar 23, 2011 at 15:16
- Ahh... so I should if ($preventRepeat[$val]) { unset($newList[$key]); } and then $preventRepeat[$val] = TRUE; instead of in_array($val, $preventRepeat) and $preventRepeat[] = $val;? That's brilliant! – 65Fbef05, Mar 23, 2011 at 15:31
- I believe unset isn't even required. I would just use one list of which you set the keys, and skip the merging. Be sure to try this out on just one small file before executing it on all of the entries. ;p – Steven Jeuris, Mar 23, 2011 at 15:38
- I'm picking up what you're putting down. Store all values in the array as keys (which doubles as duplicate prevention), forget about making a temp file, and later reverse the keys out into array values for processing. – 65Fbef05, Mar 23, 2011 at 15:48
This will be much more efficient: it reads each file line by line with fgets() instead of loading entire files into memory, deduplicates by using the lowercased addresses as array keys, and sorts the keys once at the end:
$result = array();

if (($handle = opendir('./txt/')) !== false)
{
    set_time_limit(0);
    ini_set('memory_limit', -1);

    while (($file = readdir($handle)) !== false)
    {
        if (($file != '.') && ($file != '..'))
        {
            if (is_resource($file = fopen('./txt/' . $file, 'rb')) === true)
            {
                while (($email = fgets($file)) !== false)
                {
                    $email = trim(str_replace(array(',', '"'), '', $email));

                    if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false)
                    {
                        $result[strtolower($email)] = true;
                    }
                }

                fclose($file);
            }
        }
    }

    closedir($handle);

    if (empty($result) !== true)
    {
        ksort($result);

        foreach (array_chunk($result, 10000, true) as $key => $value)
        {
            file_put_contents('./emailList-' . ($key + 1) . '.csv', implode("\n", array_keys($value)), LOCK_EX);
        }
    }

    echo 'Done!';
}
- +1 for using filter_var() (never occurred to me), but why the identical comparison operator? Not-equal operators test against bool just fine without the extra overhead of type matching. – 65Fbef05, Mar 23, 2011 at 20:14
- @65Fbef05: Force of habit, ignore it. @Steven Jeuris: Thank you! =) – Alix Axel, Mar 24, 2011 at 1:08
Definitely use the command-line sort tool, and look into sed and grep as well. You will find that it is generally easier to use fast, well-tested, pre-built Unix tools to perform any large text operation than to write a higher-level program to do the same.
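For instance, something along these lines (an untested sketch; split's -d flag assumes GNU coreutils, and the file names are illustrative) deduplicates, sorts, and splits in two commands:

# Merge, dedupe, and sort every address in one pass.
sort -u txt/* > emails.sorted
# Cut the result into 10,000-line chunks: emailList-00, emailList-01, ...
split -l 10000 -d emails.sorted emailList-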
If you are just getting into Unix, also check out:
- the ImageMagick set of utilities, which provide fantastic image processing
- the file tool (based on libmagic), which provides proper file type checking for uploaded files accepted from the public
Also, just in case those emails are going to be used to distribute unsolicited information... don't spam people: it's bad karma.
- I appreciate the advice. I'm working my way through an LPIC-1 study guide in an attempt to quickly get a thorough overview (oxymoron, right?) of Linux features. Also, worry not about your e-mail inbox; I don't want to spam people. I won't go into detail, but I was using the addresses to "test" an account feature on a music/social networking site. This feature turned out to be not completely developed, or just broken. :) – 65Fbef05, Apr 12, 2011 at 12:10
- Isn't every address stored twice, once in $addressList and once in $preventRepeat? Assuming all addresses are unique, this would already take up 349 MB * 2 of memory.
- time LC_ALL=C sort -u txt/* > email.sorted. Bonus features: still fast if the files don't fit in RAM (with in-memory compression and temp files), and multicore since 8.6.
. Bonus features: still fast if the files don't fit in ram (with in-memory compression and tempfiles), and multicore since 8.6. \$\endgroup\$