For example, if I have the following data table layout:
dt.Columns.Add("ticketid", typeof(Int32));
dt.Columns.Add("createtime", typeof(DateTime));
dt.Columns.Add("creator", typeof(string));
dt.Columns.Add("ticketText", typeof(string));
If I want to add a new entry, I need to make sure that no entry with the same primary key (ticketid, createtime, creator) already exists.
The check is as follows:
if (
    dt.Rows.Count <= 0 ||
    (
        dt.Rows.Count > 0 &&
        null == dt.AsEnumerable().Where
        (
            r => r.Field<int>("ticketid") == newEntry.TicketId
                && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
                && r.Field<string>("creator") == newEntry.Creator
        )
        .FirstOrDefault()
    ))
The code does what it is meant to do, BUT my problem is that it is SLOW. I need to read 2 million entries from the file and put them into the database as fast as possible, and this is where the code fails: as soon as I use the check, the runtime goes from about 1 minute to over 30 minutes. Is there any faster way to do this?
Side note: the datatable is read from a file, and the above is a check to make sure the data is OK before it is bulk inserted into the DB.
A full example with the insert included:
if (
    dt.Rows.Count <= 0 ||
    (
        dt.Rows.Count > 0 &&
        null == dt.AsEnumerable().Where
        (
            r => r.Field<int>("ticketid") == newEntry.TicketId
                && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
                && r.Field<string>("creator") == newEntry.Creator
        )
        .FirstOrDefault()
    ))
{
    // No duplicate found, so insert the data into the data table
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
2 Answers
Introduce a new object:
public class PrimaryKey
{
public int TicketId { get; set; }
public DateTime CreateTime { get; set; }
public string Creator { get; set; }
public PrimaryKey(int ticketId, DateTime createTime, string creator)
{
TicketId = ticketId;
CreateTime = createTime;
Creator = creator;
}
public override int GetHashCode()
{
int hash = 13;
hash = (hash * 7) + TicketId.GetHashCode();
hash = (hash * 7) + CreateTime.GetHashCode();
hash = (hash * 7) + Creator.GetHashCode();
return hash;
}
    public override bool Equals(object value)
    {
        var other = value as PrimaryKey;
        return other != null
            && other.TicketId == TicketId
            && other.CreateTime == CreateTime   // compare the remaining key fields too
            && other.Creator == Creator;
    }
}
Use this in a HashSet:
// right at the beginning...
var dedupeList = new HashSet<PrimaryKey>();

// Populate hashset from data table, then for each new entry:
if (dedupeList.Add(new PrimaryKey(newEntry.TicketId, newEntry.CreateTime, newEntry.Creator)))
{
    // New
}
else
{
    // Already added.
}
You won't have to search through the whole data table each time, which should save a lot of time when the data table is large. However, you will use more memory.
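For the "populate hashset from data table" step, a minimal sketch could look like the loop below. It is illustrative rather than part of the original answer, and it assumes the column names from the question:

// Illustrative sketch: seed the dedupe set from rows already in the table.
// Assumes the columns "ticketid", "createtime" and "creator" from the question.
foreach (DataRow r in dt.Rows)
{
    dedupeList.Add(new PrimaryKey(
        r.Field<int>("ticketid"),
        r.Field<DateTime>("createtime"),
        r.Field<string>("creator")));
}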
- The datatable can be about 200k entries large. So in essence the hash set does the whole "already exists" calculation and ignores all possible entries if they already exist. What about the bulk insert? A hash set per se can't be bulk inserted, can it? Does that mean in the end I would have to transform it back into a datatable for a bulk insert? – Thomas, Sep 29, 2015 at 9:56
- @Thomas - Well, you'd keep the hash set in addition to the datatable. Heslacher's answer using the PrimaryKey property of the datatable is a much better idea. – RobH, Sep 29, 2015 at 10:12
- Is there any possible reason for this variant to fail? I tried to do it the same way, putting the add to the dt into the "new" part, but a few times I still suddenly get duplicates. – Thomas, Sep 29, 2015 at 11:44
- I even tried: return ( _TicketId.GetHashCode().ToString() + _CreateTime.GetHashCode().ToString() + _Creator.ToString().GetHashCode().ToString() ).GetHashCode(); but a few times I still get into the "new" part despite the entry already existing in the datatable. – Thomas, Sep 29, 2015 at 12:00
- If I use instead: myHashSet.Add(newEntry.TicketId.ToString() + newEntry.CreateTime.ToString() + newEntry.Creator.ToString()) in the if, it works. Any idea there? – Thomas, Sep 29, 2015 at 12:18
If we rewrite this as a more classic if..else if construct, we can see that something is simply not needed:
if (dt.Rows.Count <= 0)
{
    // No duplicate found so insert the data into the data table
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
else if (dt.Rows.Count > 0 &&
    null == dt.AsEnumerable().Where
    (
        r => r.Field<int>("ticketid") == newEntry.TicketId
            && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
            && r.Field<string>("creator") == newEntry.Creator
    )
    .FirstOrDefault())
{
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
So no matter whether the datatable contains zero rows or more, the same row is added whenever no duplicate is found, and the LINQ query returns null on an empty table anyway. The only penalty would be for the very first entry, when the datatable contains no rows at all. So skipping the check on the row count gets you a micro-optimization, like so:
if (null == dt.AsEnumerable().Where
    (
        r => r.Field<int>("ticketid") == newEntry.TicketId
            && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
            && r.Field<string>("creator") == newEntry.Creator
    )
    .FirstOrDefault())
{
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
Using the Contains() method of the DataRowCollection and setting the PrimaryKey property of the DataTable dt will at least make your code clearer, and maybe faster too, but you need to measure this yourself.
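A minimal sketch of what that could look like, assuming the column layout from the question (and, as said above, performance should be measured, not assumed):

// Declare the composite primary key once, right after the columns are added.
dt.PrimaryKey = new[]
{
    dt.Columns["ticketid"],
    dt.Columns["createtime"],
    dt.Columns["creator"]
};

// Contains() on the row collection does a keyed lookup over the primary key columns.
if (!dt.Rows.Contains(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator }))
{
    dt.Rows.Add(newEntry.TicketId, newEntry.CreateTime, newEntry.Creator);
}

Note that once PrimaryKey is set, adding a row with a duplicate key throws a ConstraintException, so the Contains() guard (or a try/catch) is still needed.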
- It looks like the data table creates an index of sorts over the primary key columns - this is a really good answer. – RobH, Sep 29, 2015 at 10:10
- [...] newEntry primary key isn't in the data table - is that right?