
For example, if I have the following data table layout:

dt.Columns.Add("ticketid", typeof(Int32));
dt.Columns.Add("createtime", typeof(DateTime));
dt.Columns.Add("creator", typeof(string));
dt.Columns.Add("ticketText", typeof(string));

If I want to add a new entry, I need to make sure that no entry already exists with the same primary key (ticketid, createtime, creator).

The check is as follows:

if (
    dt.Rows.Count <= 0 ||
    (
        dt.Rows.Count > 0 &&
        null == dt.AsEnumerable().Where
        (
            r => r.Field<int>("ticketid") == newEntry.TicketId
              && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
              && r.Field<string>("creator") == newEntry.Creator
        )
        .FirstOrDefault()
    ))

The code does what it is meant to, BUT my problem is that it is SLOW. I need to read 2 million entries from the file and put them into the database as fast as possible, and this code fails at that: as soon as I use it, the runtime goes from about 1 minute to over 30 minutes. Is there any faster way to do this?

Side note: the datatable is read from a file, and the above is a check to make sure the data is OK before it is bulk inserted into the db.

Example in context:

if (
    dt.Rows.Count <= 0 ||
    (
        dt.Rows.Count > 0 &&
        null == dt.AsEnumerable().Where
        (
            r => r.Field<int>("ticketid") == newEntry.TicketId
              && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
              && r.Field<string>("creator") == newEntry.Creator
        )
        .FirstOrDefault()
    ))
{
    // No duplicate found, so insert the data into the data table
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
asked Sep 29, 2015 at 9:14
  • Do you mean like the addition I added at the bottom? Commented Sep 29, 2015 at 9:24
  • OK, if any more info is needed or could help, just let me know. I wasn't sure how much was needed, so I included the info I thought was relevant, but as we have seen, that was not necessarily a 100% hit on what is needed. Commented Sep 29, 2015 at 9:26
  • One alternative approach could be to use a stream-based solution instead of a data table. That would, at the very least, prevent you from having to store all of the records in memory at once, which could potentially increase the performance of your code. You may also want to consider an enumerable instead if the data size is not important, as an enumerable would ensure that all operations on the data (filter/map/contains) are executed in a single iteration. These would both drastically improve performance. Don't use DataTables for data comprehensions. Commented Sep 29, 2015 at 9:29
  • Hmm, would that not postpone parts of the original problem? At the moment I'm using the datatable to make bulk inserts into a database where ticketid, createtime and creator are primary keys, which naturally fails for identical entries. If I use a stream approach, then I would have fewer entries to transmit per insert, and additionally I would have to let the database tell me which entries are duplicates and thus should not be transmitted to it. Or am I mistaken? Commented Sep 29, 2015 at 9:33
  • It looks like you're only checking that the newEntry primary key isn't in the data table - is that right? Commented Sep 29, 2015 at 9:36

2 Answers


Introduce a new object:

public class PrimaryKey
{
    public int TicketId { get; set; }
    public DateTime CreateTime { get; set; }
    public string Creator { get; set; }

    public PrimaryKey(int ticketId, DateTime createTime, string creator)
    {
        TicketId = ticketId;
        CreateTime = createTime;
        Creator = creator;
    }

    public override int GetHashCode()
    {
        int hash = 13;
        hash = (hash * 7) + TicketId.GetHashCode();
        hash = (hash * 7) + CreateTime.GetHashCode();
        hash = (hash * 7) + Creator.GetHashCode();
        return hash;
    }

    public override bool Equals(object value)
    {
        var other = value as PrimaryKey;
        return other != null
            && other.TicketId == TicketId
            && other.CreateTime == CreateTime
            && other.Creator == Creator;
    }
}

Use this in a HashSet:

// right at the beginning...
var dedupeList = new HashSet<PrimaryKey>();
// Populate the HashSet from the data table here...

if (dedupeList.Add(new PrimaryKey(newEntry.TicketId, newEntry.CreateTime, newEntry.Creator)))
{
    // New
}
else
{
    // Already added.
}

You won't have to search through the whole data table each time, which should save a lot of time when the data table is large. However, you will use more memory.
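The "populate hashset from data table" step left as a comment above could look something like this sketch. For a self-contained example it uses a value tuple as the key instead of the PrimaryKey class; value tuples have structural equality built in, so no Equals/GetHashCode overrides are needed, but the same loop works with the class too:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

static class Dedupe
{
    // Build the dedupe set from the rows already in the table.
    // The column names match the layout from the question.
    public static HashSet<(int, DateTime, string)> FromTable(DataTable dt)
    {
        var seen = new HashSet<(int, DateTime, string)>();
        foreach (DataRow r in dt.Rows)
        {
            seen.Add((
                r.Field<int>("ticketid"),
                r.Field<DateTime>("createtime"),
                r.Field<string>("creator")));
        }
        return seen;
    }
}
```

After that, `seen.Add((newEntry.TicketId, newEntry.CreateTime, newEntry.Creator))` returns false for a duplicate, so the data-table insert can be guarded by that single call.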

answered Sep 29, 2015 at 9:50
  • The datatable can be about 200k entries large. So in essence the hashset does the whole "already exists" calculation and skips all entries that already exist. What about the bulk insert? A hashset per se can't be bulk inserted, can it? Does that mean in the end I would have to transform it back into a datatable for a bulk insert? Commented Sep 29, 2015 at 9:56
  • @Thomas - Well, you'd keep the hashset in addition to the datatable. Heslacher's answer using the PrimaryKey property of the datatable is a much better idea. Commented Sep 29, 2015 at 10:12
  • Is there any possible reason for this variant to fail? I tried to do it the same way, putting the add to the dt into the "new" part, but a few times I still suddenly get duplicates. Commented Sep 29, 2015 at 11:44
  • I even tried: return ( _TicketId.GetHashCode().ToString() + _CreateTime.GetHashCode().ToString() + _Creator.ToString().GetHashCode().ToString() ).GetHashCode(); but a few times I still get into the "new" part despite the entry already existing in the datatable. Commented Sep 29, 2015 at 12:00
  • If I use instead: myHashSet.Add(newEntry.TicketId.ToString()+newEntry.CreateTime.ToString()+newEntry.Creator.ToString()) in the if, it works. Any idea there? Commented Sep 29, 2015 at 12:18

If we rewrite this as a more classic if..else if.. construct, we can see that part of the check is simply not needed:

if (dt.Rows.Count <= 0)
{
    // No duplicate found, so insert the data into the data table
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
else if (dt.Rows.Count > 0 &&
    null == dt.AsEnumerable().Where
    (
        r => r.Field<int>("ticketid") == newEntry.TicketId
          && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
          && r.Field<string>("creator") == newEntry.Creator
    )
    .FirstOrDefault())
{
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}

So no matter whether the datatable contains zero rows or more than zero rows, a new row should be added whenever the duplicate check passes; the Rows.Count checks are redundant. The only penalty is the very first call, when the datatable contains no rows at all. Skipping the row-count check gets you a micro-optimization like so:

if (null == dt.AsEnumerable().Where
    (
        r => r.Field<int>("ticketid") == newEntry.TicketId
          && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
          && r.Field<string>("creator") == newEntry.Creator
    )
    .FirstOrDefault())
{
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}

Setting the PrimaryKey property of the DataTable dt and using the Contains() method of the DataRowCollection will at least make your code clearer, and maybe faster too, but you need to measure this yourself.
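A sketch of what that could look like, using the column layout from the question (the specific values are made up for illustration). With the PrimaryKey property set, the table maintains an index over those columns, so Contains() is a key lookup rather than a full scan:

```csharp
using System;
using System.Data;

class Program
{
    static void Main()
    {
        var dt = new DataTable();
        var ticketId = dt.Columns.Add("ticketid", typeof(int));
        var createTime = dt.Columns.Add("createtime", typeof(DateTime));
        var creator = dt.Columns.Add("creator", typeof(string));
        dt.Columns.Add("ticketText", typeof(string));

        // Declare the composite primary key once, up front.
        dt.PrimaryKey = new[] { ticketId, createTime, creator };

        // Values must be in the same order as the PrimaryKey columns.
        var key = new object[] { 42, new DateTime(2015, 9, 29), "alice" };
        if (!dt.Rows.Contains(key))
        {
            dt.Rows.Add(42, new DateTime(2015, 9, 29), "alice", "some text");
        }
    }
}
```

Note that once PrimaryKey is set, Rows.Add throws a ConstraintException on a duplicate key anyway, so the Contains() guard is what lets you skip duplicates silently instead of failing.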

answered Sep 29, 2015 at 9:59
  • It looks like the data table creates an index of sorts over the primary key columns - this is a really good answer. Commented Sep 29, 2015 at 10:10
