For example, if I have the following data table layout:
dt.Columns.Add("ticketid", typeof(Int32));
dt.Columns.Add("createtime", typeof(DateTime));
dt.Columns.Add("creator", typeof(string));
dt.Columns.Add("ticketText", typeof(string));
If I want to add a new entry, I need to make sure that no entry with the same primary key (ticketid, createtime, creator) already exists.
The check is as follows:
if (
    dt.Rows.Count <= 0 ||
    (
        dt.Rows.Count > 0 &&
        null == dt.AsEnumerable().Where
        (
            r => r.Field<int>("ticketid") == newEntry.TicketId
                && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
                && r.Field<string>("creator") == newEntry.Creator
        )
        .FirstOrDefault()
    ))
The code does what it is meant to do, BUT my problem is that it is SLOW. I need to read 2 million entries from the file and put them into the database as fast as possible, and this is where the code fails: as soon as I use the check, the runtime goes from about 1 minute to over 30 minutes. Is there any faster way to do this?
Side note: the datatable is read from a file, and the above is a check to make sure the data is OK before it is bulk inserted into the DB.
A full example with the insert included:
if (
    dt.Rows.Count <= 0 ||
    (
        dt.Rows.Count > 0 &&
        null == dt.AsEnumerable().Where
        (
            r => r.Field<int>("ticketid") == newEntry.TicketId
                && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
                && r.Field<string>("creator") == newEntry.Creator
        )
        .FirstOrDefault()
    ))
{
    // No duplicate found, so insert the data into the data table
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
2 Answers
Introduce a new object:
public class PrimaryKey
{
public int TicketId { get; set; }
public DateTime CreateTime { get; set; }
public string Creator { get; set; }
public PrimaryKey(int ticketId, DateTime createTime, string creator)
{
TicketId = ticketId;
CreateTime = createTime;
Creator = creator;
}
public override int GetHashCode()
{
int hash = 13;
hash = (hash * 7) + TicketId.GetHashCode();
hash = (hash * 7) + CreateTime.GetHashCode();
hash = (hash * 7) + Creator.GetHashCode();
return hash;
}
    public override bool Equals(object value)
    {
        var other = value as PrimaryKey;
        return other != null
            && other.TicketId == TicketId
            && other.CreateTime == CreateTime   // compare the remaining key fields too
            && other.Creator == Creator;
    }
}
Use this in a HashSet:
// right at the beginning...
var dedupeList = new HashSet<PrimaryKey>();

// Populate hashset from data table, then for each new entry:
if (dedupeList.Add(new PrimaryKey(newEntry.TicketId, newEntry.CreateTime, newEntry.Creator)))
{
    // New
}
else
{
    // Already added.
}
You won't have to search through the whole data table each time, which should save a lot of time when the data table is large. However, you will use more memory.
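For the "populate hashset from data table" step, a minimal sketch could look like the loop below. It is illustrative rather than part of the original answer, and it assumes the column names from the question:

// Illustrative sketch: seed the dedupe set from rows already in the table.
// Assumes the columns "ticketid", "createtime" and "creator" from the question.
foreach (DataRow r in dt.Rows)
{
    dedupeList.Add(new PrimaryKey(
        r.Field<int>("ticketid"),
        r.Field<DateTime>("createtime"),
        r.Field<string>("creator")));
}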
- The datatable can be about 200k entries large. So in essence the hash set does the whole "already exists" calculation and ignores all possible entries if they already exist. What about the bulk insert? A hash set per se can't be bulk inserted, can it? Does that mean in the end I would have to transform it back into a datatable for a bulk insert? – Thomas, Sep 29, 2015 at 9:56
- @Thomas - Well, you'd keep the hash set in addition to the datatable. Heslacher's answer using the PrimaryKey property of the datatable is a much better idea. – RobH, Sep 29, 2015 at 10:12
- Is there any possible reason for this variant to fail? I tried to do it the same way, putting the add to the dt into the "new" part, but a few times I still suddenly get duplicates. – Thomas, Sep 29, 2015 at 11:44
- I even tried: return ( _TicketId.GetHashCode().ToString() + _CreateTime.GetHashCode().ToString() + _Creator.ToString().GetHashCode().ToString() ).GetHashCode(); but a few times I still get into the "new" part despite the entry already existing in the datatable. – Thomas, Sep 29, 2015 at 12:00
- If I use instead: myHashSet.Add(newEntry.TicketId.ToString() + newEntry.CreateTime.ToString() + newEntry.Creator.ToString()) in the if, it works. Any idea there? – Thomas, Sep 29, 2015 at 12:18
If we rewrite this as a more classic if..else if construct, we can see that something is simply not needed:
if (dt.Rows.Count <= 0)
{
    // No duplicate found so insert the data into the data table
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
else if (dt.Rows.Count > 0 &&
    null == dt.AsEnumerable().Where
    (
        r => r.Field<int>("ticketid") == newEntry.TicketId
            && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
            && r.Field<string>("creator") == newEntry.Creator
    )
    .FirstOrDefault())
{
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
So no matter whether the datatable contains zero rows or more, the same row is added whenever no duplicate is found, and the LINQ query returns null on an empty table anyway. The only penalty would be for the very first entry, when the datatable contains no rows at all. So skipping the check on the row count gets you a micro-optimization, like so:
if (null == dt.AsEnumerable().Where
    (
        r => r.Field<int>("ticketid") == newEntry.TicketId
            && DateTime.Equals(r.Field<DateTime>("createtime"), newEntry.CreateTime)
            && r.Field<string>("creator") == newEntry.Creator
    )
    .FirstOrDefault())
{
    dt.Rows.Add(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator });
}
Using the Contains() method of the DataRowCollection and setting the PrimaryKey property of the DataTable dt will at least make your code clearer, and maybe faster too, but you need to measure this yourself.
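A minimal sketch of what that could look like, assuming the column layout from the question (and, as said above, performance should be measured, not assumed):

// Declare the composite primary key once, right after the columns are added.
dt.PrimaryKey = new[]
{
    dt.Columns["ticketid"],
    dt.Columns["createtime"],
    dt.Columns["creator"]
};

// Contains() on the row collection does a keyed lookup over the primary key columns.
if (!dt.Rows.Contains(new object[] { newEntry.TicketId, newEntry.CreateTime, newEntry.Creator }))
{
    dt.Rows.Add(newEntry.TicketId, newEntry.CreateTime, newEntry.Creator);
}

Note that once PrimaryKey is set, adding a row with a duplicate key throws a ConstraintException, so the Contains() guard (or a try/catch) is still needed.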
- It looks like the data table creates an index of sorts over the primary key columns - this is a really good answer. – RobH, Sep 29, 2015 at 10:10
- [...] newEntry primary key isn't in the data table - is that right?