I have a class called Customer that has several string properties such as firstName, lastName, email, etc.

I read the customer information from a CSV file into an array of that class:

Customer[] customers 

I need to remove the duplicate customers having the same email address, leaving only 1 customer record for each particular email address.

I have done this using 2 loops but it takes nearly 5 minutes as there are usually 50,000+ customer records. Once I am done removing the duplicates, I need to write the customer information to another csv file (no help needed here).

If I used Distinct on the email addresses in a loop, how would I also remove the other string properties that belong to that particular customer?

Thanks, Andrew

Arghya C
asked Dec 7, 2015 at 20:21
  • Is the idea to run this daily/weekly/quarterly? Frequency of this task will likely dictate the permanence of a solution. Commented Dec 7, 2015 at 20:25
  • Distinct will not work for custom types without a custom equality comparer. Use DistinctBy from MoreLinq. By the way, this operation should not take much time for 50k items, since Distinct is O(n). Commented Dec 7, 2015 at 20:25
  • My choice would probably be to sort the input file by duplicate key (email in your case) and do a simple previous to current value comparison before adding to your object. Commented Dec 7, 2015 at 20:27
  • I'd use a KeyedCollection (in System.Collections.ObjectModel). Let the eMail be the key and insert after checking with Contains. This is very fast... Commented Dec 7, 2015 at 20:28
  • Possibly related / helpful: stackoverflow.com/questions/2537823/… Commented Dec 7, 2015 at 20:32
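
The comment above about Distinct needing an equality comparer can be illustrated with a minimal sketch. The Customer class here is a hypothetical stand-in for the one in the question, and CustomerEmailComparer and DistinctDemo are names made up for this example. It assumes emails are compared exactly (ordinal, case-sensitive), which may not be what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Customer class described in the question.
public class Customer
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

// Treats two customers as equal when their email addresses match exactly.
public sealed class CustomerEmailComparer : IEqualityComparer<Customer>
{
    public bool Equals(Customer x, Customer y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return string.Equals(x.Email, y.Email, StringComparison.Ordinal);
    }

    public int GetHashCode(Customer obj)
    {
        // Customers with a null email all hash to 0, so they count as
        // duplicates of each other under this comparer.
        return obj?.Email == null ? 0 : StringComparer.Ordinal.GetHashCode(obj.Email);
    }
}

public static class DistinctDemo
{
    // Returns one customer per distinct email address.
    public static Customer[] RemoveDuplicates(Customer[] customers)
    {
        return customers.Distinct(new CustomerEmailComparer()).ToArray();
    }
}
```

Calling DistinctDemo.RemoveDuplicates(customers) returns whole Customer objects, so the other properties travel with the email that was kept.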

2 Answers

With LINQ, you can do this in O(n) time (a single pass over the data) with GroupBy:

    var uniquePersons = persons.GroupBy(p => p.Email)
                               .Select(grp => grp.First())
                               .ToArray();

Update

A bit on the O(n) behavior of GroupBy.

GroupBy is implemented in LINQ (Enumerable.cs) as follows:

The IEnumerable is iterated only once to create the groupings. A hash of the provided key (e.g. "Email" here) is used to find unique keys, and each element is added to the Grouping that corresponds to its key.

Please see the GetGrouping code, and some older posts, for reference.

Select is also O(n), so the code above is O(n) overall, assuming the usual average-case constant-time hash lookups.
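
To make the single-pass hashing idea concrete, here is a hand-written sketch that keeps the first element per email using a HashSet. It is not the actual GroupBy implementation, only an illustration of the same idea; the Person class and the HashDedup helper are hypothetical names for this example.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the element type used in the answer.
public class Person
{
    public string Name { get; set; }
    public string Email { get; set; }
}

public static class HashDedup
{
    // Walks the input once and keeps the first Person seen for each email.
    public static List<Person> FirstPerEmail(IEnumerable<Person> persons)
    {
        var seen = new HashSet<string>();
        var result = new List<Person>();
        foreach (var p in persons)
        {
            // HashSet<T>.Add returns false when the key is already present,
            // so each element costs one hash lookup on average.
            if (seen.Add(p.Email))
            {
                result.Add(p);
            }
        }
        return result;
    }
}
```

Each element is visited once and each lookup is constant time on average, which is where the overall O(n) estimate comes from.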

Update 2

Handling empty/null values.

If some objects have a null or empty Email, the simple GroupBy keeps only one object with a null Email and one with an empty Email.

One quick way to keep all of those objects is to give each of them a unique key at run time, like this:

    var tempEmailIndex = 0;
    var uniqueNullAndEmpty = persons
        .GroupBy(p => string.IsNullOrEmpty(p.Email)
            ? (++tempEmailIndex).ToString() : p.Email)
        .Select(grp => grp.First())
        .ToArray();
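
As a possible alternative to the counter-based keys above, the following sketch keeps every record whose email is null or empty and de-duplicates only the rest. It swaps GroupBy for a HashSet filter, so it is a different technique from the one in the answer; it avoids the (unlikely) case where a generated key such as "1" matches a real Email value. Person and BlankEmailDemo are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the element type used in the answer.
public class Person
{
    public string Name { get; set; }
    public string Email { get; set; }
}

public static class BlankEmailDemo
{
    // Keeps all persons with a null/empty email, and the first person
    // for every non-empty email, in their original order.
    public static Person[] UniqueKeepingBlank(IEnumerable<Person> persons)
    {
        var seen = new HashSet<string>();
        return persons
            // The predicate has a side effect (seen.Add), so the query is
            // materialized immediately with ToArray and never re-enumerated.
            .Where(p => string.IsNullOrEmpty(p.Email) || seen.Add(p.Email))
            .ToArray();
    }
}
```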
answered Dec 7, 2015 at 20:43

11 Comments

"As Linq is using Reflection" - do you have a reference for this?
@Shnugo - LINQ doesn't use reflection.
Can you please tell us how you know that this will execute in O(n) ?
This is working extremely well, but for some reason it is not writing anything to the CSV file now. I see that after deleting the duplicates it returns the new array of customers and has data in it, and I have not changed any of the code that writes to the file...
@MikeNakis please see the updated answer, which now includes an explanation and references.

I'd do it like this:

    public class Person {
        public Person(string eMail, string Name) {
            this.eMail = eMail;
            this.Name = Name;
        }
        public string eMail { get; set; }
        public string Name { get; set; }
    }
    public class eMailKeyedCollection : System.Collections.ObjectModel.KeyedCollection<string, Person> {
        protected override string GetKeyForItem(Person item) {
            return item.eMail;
        }
    }
    public void testIt() {
        var testArr = new Person[5];
        testArr[0] = new Person("[email protected]", "Jon Mullen");
        testArr[1] = new Person("[email protected]", "Jane Cullen");
        testArr[2] = new Person("[email protected]", "Jon Cullen");
        testArr[3] = new Person("[email protected]", "John Mullen");
        testArr[4] = new Person("[email protected]", "Test Other"); //same eMail as index 0...
        var targetList = new eMailKeyedCollection();
        foreach (var p in testArr) {
            if (!targetList.Contains(p.eMail))
                targetList.Add(p);
        }
    }

If the item is already in the collection, you can easily retrieve it (and modify it if needed) with:

    if (!targetList.Contains(p.eMail))
        targetList.Add(p);
    else {
        var currentPerson = targetList[p.eMail];
        //modify Name, Address whatever...
    }
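
The "modify" branch above could look something like the following sketch. The merge rule (fill in a blank Name from a later duplicate) is purely an assumption for illustration, and MergeDemo.AddOrMerge is a hypothetical helper; Person and eMailKeyedCollection are repeated from the answer so the block stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

public class Person
{
    public Person(string eMail, string name)
    {
        this.eMail = eMail;
        Name = name;
    }
    public string eMail { get; set; }
    public string Name { get; set; }
}

public class eMailKeyedCollection : KeyedCollection<string, Person>
{
    protected override string GetKeyForItem(Person item)
    {
        return item.eMail;
    }
}

public static class MergeDemo
{
    // Adds each person once per eMail; duplicates are merged into the
    // record that is already stored instead of being added.
    // Note: Contains throws ArgumentNullException for a null key, so this
    // sketch assumes every eMail is non-null.
    public static eMailKeyedCollection AddOrMerge(IEnumerable<Person> people)
    {
        var target = new eMailKeyedCollection();
        foreach (var p in people)
        {
            if (!target.Contains(p.eMail))
            {
                target.Add(p);
            }
            else
            {
                var current = target[p.eMail];
                // Hypothetical merge rule: keep the stored Name unless it is blank.
                if (string.IsNullOrEmpty(current.Name))
                {
                    current.Name = p.Name;
                }
            }
        }
        return target;
    }
}
```

Only non-key properties are changed here; changing the eMail of a stored item would also require updating its key in the collection.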
answered Dec 7, 2015 at 20:39
