I have a use case where I need to process about 80,000 emails per day. The client has an existing process that takes about 1/sec per email and is running out of time before the next batch comes in. I implemented a solution using the MailKit library and I am trying to implement some parallel tasks using TPL. I tried both a Pop3 client and IMAP client and the IMAP client seems to be faster. Plus, since IMAP supports multiple clients connecting at the same time so I figured that would be better for TPL.
I tested the code with a small sample of 100 emails and trying different numbers of parallel tasks. The results are below.
I'd appreciate any advice on my code below or other improvement suggestions on how to make it faster.
private static int _num_of_threads = 1;
private static int _num_of_messages = 100;
public override async void DoWork()
{
try
{
IList<UniqueId> uids = new List<UniqueId>();
uids = GetUids();
//shrink the list for now
uids = uids.Take(_num_of_messages).ToList();
//create {_num_of_threads} lists
List<List<UniqueId>> list_of_list_of_uids = new List<List<UniqueId>>();
list_of_list_of_uids = SplitUids(uids);
// create the TPL tasks
Task[] tasks = new Task[_num_of_threads];
for (int index = 0; index < tasks.Length; index++)
{
tasks[index] = Task.Factory.StartNew((Object obj) =>
{
int i = (int)obj;
GetMessages(list_of_list_of_uids[i]);
}, index);
}
Task.WaitAll(tasks);
}
catch (Exception ex)
{
}
}
private List<UniqueId> GetUids()
{
IList<UniqueId> uids = new List<UniqueId>();
using (var client = new ImapClient())
{
try
{
client.Connect("<emailserver>", 993, SecureSocketOptions.SslOnConnect);
client.Authenticate(<username>, <password>);
client.Inbox.Open(FolderAccess.ReadOnly);
uids = client.Inbox.Search(SearchQuery.All);
}
finally
{
if (client.IsConnected)
{
client.Disconnect(true);
}
}
}
return uids.ToList();
}
private List<List<UniqueId>> SplitUids(IList<UniqueId> uids)
{
List<List<UniqueId>> list_of_list_of_uids = new List<List<UniqueId>>();
for (int list_index = 0; list_index < _num_of_threads; list_index++)
{
List<UniqueId> list_of_uids = new List<UniqueId>();
list_of_list_of_uids.Add(list_of_uids);
}
// should have {_num_of_threads} lists now
//now spread the uids across the lists
int list_iterator = 0;
for (int uid_index = 0; uid_index < uids.Count(); uid_index++)
{
list_of_list_of_uids[list_iterator].Add(uids[uid_index]);
list_iterator++;
if (list_iterator == _num_of_threads)
{
list_iterator = 0;
}
}
return list_of_list_of_uids;
}
private async Task<int> GetMessages(IList<UniqueId> uids)
{
int count = 0;
using (var client = new ImapClient())
{
try
{
client.Connect("<emailserver>", 993, SecureSocketOptions.SslOnConnect);
client.Authenticate(<username>, <password>);
client.Inbox.Open(FolderAccess.ReadOnly);
foreach (UniqueId uid in uids)
{
var message = client.Inbox.GetMessage(uid);
if (message != null)
{
// process the email
// process aggregates a bunch of data and sends the result off to a web service to be put into a DB.
count++;
}
}
}
finally
{
if (client.IsConnected)
{
client.Disconnect(true);
}
}
}
return count;
}
Best Result:
1 Thread = ~54 seconds
2 Threads = ~21 seconds
3 Threads = ~15 seconds
4 Threads = ~14 seconds
10 Threads = ~14 seconds
-
\$\begingroup\$ Thanks for doing the legwork to figure out how to efficiently parallelize this, Jay! \$\endgroup\$jstedfast– jstedfast2016年07月26日 17:54:35 +00:00Commented Jul 26, 2016 at 17:54
-
1\$\begingroup\$ 80k emails per day? It looks like someone is abusing emails as an error log, isn't he? \$\endgroup\$t3chb0t– t3chb0t2016年07月26日 18:27:38 +00:00Commented Jul 26, 2016 at 18:27
-
\$\begingroup\$ I suggest you do some load testing before you run with it. A sample of 100 when you get 80K a day may not be significant. 1M might be a better test. \$\endgroup\$pacmaninbw– pacmaninbw ♦2016年07月26日 19:46:13 +00:00Commented Jul 26, 2016 at 19:46
-
\$\begingroup\$ @-t3chb0t - Not an error log. The client has 80,000+ devices deployed in the field and they are sending status updates every day. Data like counters, measurements, etc. It's unfortunate they weren't designed to just send the data to a web service to be processed, but it is what it is for now. \$\endgroup\$Jay Mathis– Jay Mathis2016年07月27日 19:59:43 +00:00Commented Jul 27, 2016 at 19:59
-
\$\begingroup\$ @pacmaninbw - thanks, will do some testing on larger data sets as well. \$\endgroup\$Jay Mathis– Jay Mathis2016年07月27日 20:00:49 +00:00Commented Jul 27, 2016 at 20:00
2 Answers 2
My suggestions
- Use
var
. For example in this piece of codeList<List<UniqueId>> list_of_list_of_uids = new List<List<UniqueId>>();
is will be much more readable I think the code under is redundant. Because
using
keyword (when it is compiled into IL code) already containsfinally
statement withDispose
inside. I am pretty sure thatDispose
ofImapClient
class will disconnect(if needed) itself, so you don't need to do it once again.finally { if (client.IsConnected) { client.Disconnect(true); } }
I do not see how to significantly improve performance of your code. Only minor suggestion. The capacity of your lists in method
SplitUids
can be calculated in advance, so use constructor with capacity hereIList<UniqueId> uids = new List<UniqueId>();
It will improve performance a little bit
If performance is critical in your case you should paste your for this // process aggregates a bunch of data and sends the result off to a web service to be put into a DB.
I suppose these operations are quite time-consuming(because of database and web service) and might be improved.
-
\$\begingroup\$ Thanks! The code behind the '// process' is actually a black box DLL call for now. I can't change it (yet), but I do think there is a lot of opportunity there as well. \$\endgroup\$Jay Mathis– Jay Mathis2016年07月27日 20:04:09 +00:00Commented Jul 27, 2016 at 20:04
-
\$\begingroup\$ @Disappointed Do you have a source for the
finally/ImapClient.Dispose
suggestion? I can't seem to find anything that explicitly states it disconnects. I would naturally assume this myself if it weren't for theDisconnect(true)
parameter which sends a LOGOUT command. Which I would guessDispose
probably doesn't do that. mimekit.net/docs/html/… \$\endgroup\$Shelby115– Shelby1152016年07月27日 20:41:33 +00:00Commented Jul 27, 2016 at 20:41 -
\$\begingroup\$ @Shelby115 No, i haven't source. \$\endgroup\$Disappointed– Disappointed2016年07月27日 20:51:03 +00:00Commented Jul 27, 2016 at 20:51
Just some minor suggestions to add.
Double Initialization
You're initializing your lists twice. First on declaration, then when you call the function on the next line. Just initialize to the functions. e.g. var uids = GetUids();
var
(As Disappointed suggested) This is mostly preference but List<UniqueId> list_of_uids = new List<UniqueId>();
is less keystrokes and simpler as var list_of_uids = new List<UniqueId>();
without reducing any information/readability.
Naming
Typically in C# it is convention to use PascalCasing for class names and members and camelCasing for local variables.
Consistency is always helpful. So since you're already calling a(n) IList<UniqueId>
'uids' then why include those ugly underscores?
Explore related questions
See similar questions with these tags.