I have the following code which works great to do what it is meant to do. It scrapes the bodies of all e-mails in the inbox at an e-mail address using IMAP and retrieves the unique code sent inside the body and stores it in a database with the amount paid. I'm hoping to make it so when a user purchases something they can send an Interac e-transfer and then enter the code that both of us receive via e-mail in the website and it will apply the credit of the e-transfer to their account/purchase.
However, after setting it up on a cron job to cycle every few minutes so the content in the database stays fresh it eventually exceeds the bandwidth for the account within a day or so (not positive on how long it took but it didn't take long). Now, like I said the code works it's just very resource intensive apparently.
I do not pipe the script because the e-mail account has already been set up a while ago and we do use the account for other e-transfers which I review manually for other transactions.
- Is there any way to clean it up so it works the same but uses less resources/bandwidth?
- Would it be better to run it when a user enters their payment code? Depending on the number of users running it this could also lead to bandwidth troubles.
- Are there any improvements or what ideas do you have?
define("MAX_EMAIL_COUNT", $_POST['maxcount']);
/* took from https://gist.github.com/agarzon/3123118 */
function extractEmail($content) {
$regexp = '/([a-z0-9_\.\-])+\@(([a-z0-9\-])+\.)+([a-z0-9]{2,4})+/i';
preg_match_all($regexp, $content, $m);
return isset($m[0]) ? $m[0] : array ();
}
function getAddressText(&$emailList, &$nameList, $addressObject) {
$emailList = '';
$nameList = '';
foreach ($addressObject as $object) {
$emailList .= ';';
if (isset($object->personal)) {
$emailList .= $object->personal;
}
$nameList .= ';';
if (isset($object->mailbox) && isset($object->host)) {
$nameList .= $object->mailbox . "@" . $object->host;
}
}
$emailList = ltrim($emailList, ';');
$nameList = ltrim($nameList, ';');
}
function processMessage($mbox, $messageNumber) {
global $db;
// get imap_fetch header and put single lines into array
$header = imap_rfc822_parse_headers(imap_fetchheader($mbox, $messageNumber));
$timestamp = strtotime($header->Date);
$fromEmailList = '';
$fromNameList = '';
if (isset($header->from)) {
getAddressText($fromEmailList, $fromNameList, $header->from);
}
$toEmailList = '';
$toNameList = '';
if (isset($header->to)) {
getAddressText($toEmailList, $toNameList, $header->to);
}
$body = imap_fetchbody($mbox, $messageNumber, 1);
//echo "<pre>".print_r($body,true)."</pre>";
/* Find Reference Number */
//echo "<pre style='background-color: #A2A2A2; border: 1px solid black'>$body</pre>";
$searchfor = 'Reference Number';
// get the file contents, assuming the file to be readable (and exist)
$contents = $body;
// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";
// search, and store all matching occurences in $matches
if(preg_match_all($pattern, $contents, $matches)){
$reference = trim(str_replace('Reference Number : ','',$matches[0][0]));
}
else{
}
/* Find Amount Paid */
//echo "<pre style='background-color: #A2A2A2; border: 1px solid black'>$body</pre>";
$searchfor = 'has sent you a money transfer for the amount of';
// get the file contents, assuming the file to be readable (and exist)
$contents = $body;
// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";
// search, and store all matching occurences in $matches
if(preg_match_all($pattern, $contents, $matches)){
$amount = trim(preg_replace("/[^0-9\.]/", "",$matches[0][0]),'.');
}
else{
}
$bodyEmailList = implode(';', extractEmail($body));
// Delete all messages older than one year (31557600 seconds). Divide it by two for six months.
if($timestamp < time()-31557600) {
if(imap_delete($mbox,$messageNumber)) {
/*echo "<strong>";
print_r($messageNumber . ' , ' . date("F j, Y g:i A",$timestamp).' , ' . 'Deleted' . "\n");
echo "</strong>";*/
}
}
else {
if(!empty($reference) && !empty($amount)) {
if($fromNameList == "[email protected]" && $toNameList!='[email protected]') {
$query = "SELECT * FROM `".$db->prefix."payments_etransfer` WHERE `reference_id` = '".$reference."'";
$select = $db->select($query);
if($db->num_rows($select) > 0) {
}
else {
$do = $db->insert_sql("INSERT INTO `".$db->prefix."payments_etransfer` SET
`email_id` = '".$messageNumber."',
`timestamp` = '".$timestamp."',
`reference_id` = '".$reference."',
`amount` = '".$amount."',
`sender` = '".$fromEmailList."'");
if($do) {
}
else {
echo "Error<br><blockquote><pre>";
print_r($messageNumber . ',' . $timestamp. ',' . $reference . ',$' . $amount .
',' . $fromEmailList . ',' . $fromNameList
. ',' . $toEmailList . ',' . $toNameList
. ',' . $bodyEmailList . "\n"
);
echo "</pre></blockquote>";
}
}
}
}
}
}
// imap_timeout(IMAP_OPENTIMEOUT, 300);
// Open pop mailbox
if (!$mbox = imap_open($_POST['mailbox'], $_POST['login'], $_POST['password'])) {
die('Cannot connect/check pop mail! Exiting');
}
if ($hdr = imap_check($mbox)) {
$msgCount = $hdr->Nmsgs;
} else {
echo "Failed to get mail";
exit;
}
/* echo "<pre>";
echo 'emails count=' . $msgCount . "\n\n\n";
echo "record number,from emails list,from names list,to emails list, to names list,extracted from body\n";
*/
/* might improve performance according to
http://www.php.net/manual/en/function.imap-headerinfo.php#98809
imap_headers($mbox);
*/
for ($X = $msgCount; $X > 0; $X--) {
if($X > 0) {
processMessage($mbox, $X);
}
}
/*echo "</pre>";*/
imap_expunge($mbox);
imap_close($mbox);
/*
-
\$\begingroup\$ I have no idea what you are actually using this for, but to scrape financial details from within an e-mail box, which you do not own, seems a recipe for disaster. It sounds like it doesn't matter that this will be highly unreliable? \$\endgroup\$KIKO Software– KIKO Software2021年09月12日 11:23:17 +00:00Commented Sep 12, 2021 at 11:23
-
\$\begingroup\$ @mdfst13 This is the entire code ran by cron. It circulates through all the e-mails in the inbox so yes, it reads all of them multiple times. Marks any that are 1 year old set for deletion. I could set that limit lower to a week or so. I could also do it so it deletes it right away but as I want a chance to view the actual e-mail before it removes it from the system. \$\endgroup\$Dawson Irvine– Dawson Irvine2021年09月13日 02:33:12 +00:00Commented Sep 13, 2021 at 2:33
-
\$\begingroup\$ @KIKOSoftware I do own the e-mail box and the e-mail box is on my server as part of my website (my domain). It is reliable and works great. It does exactly what I want except for running up the bandwidth and then shutting down the website due to that fact; which is why I posted here to see how I can improve on it. There are a few unreliable areas that could potentially cause problems but at the moment I am not worried about those. \$\endgroup\$Dawson Irvine– Dawson Irvine2021年09月13日 02:38:27 +00:00Commented Sep 13, 2021 at 2:38
2 Answers 2
Only read new messages
From a comment:
It circulates through all the e-mails in the inbox so yes, it reads all of them multiple times.
I think that this is the source of your bandwidth problem. Either
- Turn down the cron frequency (easy but least effective).
- Track which emails you've already read and don't read them again.
- Change the program to only look at emails that arrived since you last looked (most difficult but also most efficient in terms of bandwidth).
Tracking seen messages
So in this code:
for ($X = $msgCount; $X > 0; $X--) { if($X > 0) { processMessage($mbox, $X); } }
First, it could just be
for ($X = $msgCount; $X > 0; $X--) {
processMessage($mbox, $X);
}
The if
doesn't do anything.
But more importantly, you want to recognize if you've seen a message previously.
$seen = false;
for ($X = $msgCount; !$seen && ($X > 0); $X--) {
$seen = processMessage($mbox, $X);
}
Then change processMessage
to return false;
at the end. And earlier, when you process an email, save the email identifier so that the next time you see that email, you can return true
.
$header = imap_rfc822_parse_headers(imap_fetchheader($mbox, $messageNumber));
This could be simplified to
$header = imap_headerinfo($mbox, $messageNumber);
That will also give you access to $header->message_id
, which is the thing that you can store to see if you've seen a message previously. Alternately, imap_uid
also gives the message ID.
Move the message
Another simple option would be to move the message out of the inbox and into a separate folder. So messages in inbox need to be processed. Messages in the folder have already been processed. The imap_mail_move
function moves messages.
extractEmail()
is actually generating an array of emails, so the function name should be pluralized./[a-z0-9_]/i
is more simply expressed as\w
. Inside of a character class, a dot doesn't need to be escaped. If a hyphen is the first or final character in a character class or if it immediately follows a range of characters it does not need to be escaped. An@
symbol is not a special character in regex, so it doesn't need escaping. You aren't using these captured substrings, so it seems logical to omit them or make them non-capturing where appropriate. Just so you know, there will be valid emails (fringe cases) that will not be matched by your email pattern. If this is sufficient for your project scope, then I suppose you don't need to spend anymore time on it -- just be careful where else you use this pattern.function extractEmails(string $content): array { $regexp = '/[\w.-]+@(?:[a-z\d-]+\.)+[a-z\d]{2,4}+/i'; return preg_match_all($regexp, $content, $m) ? $m[0] : []; }
I find function name
getAddressText()
to be semantically misleading -- it is not "getting" anything, it is mutating the first two arguments which are reference variables. Rather than unconditionally wiping the data clean on the first two lines, then doing a bunch of string concatenation, then trimming the excess, I find it tidier to create arrays, then implode them. I suppose, if I am being honest, you never need to respect the data coming in via the first two parameters, so I think I'd just pass in$addressObject
and return a two-element array.function getAddressEmailsAndLists(array $addressObjects): array { $emails = []; $names = []; foreach ($addressObject as $object) { $emails[] = $object->personal ?? ''; $names[] = isset($object->mailbox, $object->host) ? $object->mailbox . "@" . $object->host ''; } return [implode(';', $emails), implode(';', $names)], }
One call of it would look like:
[$fromEmailList, $fromNameList] = getAddressText($header->from);
I am concerned by both occurrences of
$pattern = "/^.*$pattern.*\$/m";
. First, the escaping slash before$
means that you are matching a literal dollar sign -- but your comment suggests you mean to mark the end of the line. Do you mean to remove the slash or match a literal dollar sign?[^0-9\.]
can be simplified to[^\d.]+
-- this will match the same list of characters, but it will make potentially longer matches and therefore fewer replacements.Your sql does not look stable or secure to me because you are not using prepared statements with placeholders. This looks like a major problem for your database query wrappers.
I agree with @mdfst13's assessment of your
for()
loop logic.