Pythonic way to find duplicate objects in a set

Question 1

I have tons of location points with several attributes like time_stamp, latitude, longitude, accuracy, velocity etc. Now I want to remove duplicate location points on these conditions:

Latitude is same
Longitude is same
Month is same
Day is same

The other attributes dont matter at all for comparison. My strategy is to modify __hash__, __eq__ and __ne__ methods to include the above conditions and feed them into set function to remove the duplicates.

item1 = Location(1510213074679, 286220203, 772454413, 1414, None, None, None, 78, None)
item2 = Location(1510213074679, 286220203, 772454413, 5, 6, 80, 226, None, None)
item3 = Location(1523620644975, 286265651, 772427842, 65, None, None, 193, 10, None)
x = set()
x.add(item1)
x.add(item2)
x.add(item3)
print(x)
{<__main__.Location at 0x7fd725559eb8>, <__main__.Location at 0x7fd725604dd8>}

Length of set is as expected i.e 2 (two elements with same attributes).

The result which I expect, can be obtained as follows

[f.data() for f in x]

Is there any other pythonic way to achieve the same result?

class Location(object):
 def __init__(self, time_stamp, latitude, longitude, accuracy, velocity,
 heading, altitude, vertical_accuracy, activity):
 self.time_stamp = float(time_stamp) / 1000
 self.latitude = float(latitude) / 10000000
 self.longitude = float(longitude) / 10000000
 self.accuracy = accuracy
 self.velocity = velocity
 self.heading = heading
 self.altitude = altitude
 self.vertical_accuracy = vertical_accuracy
 self.activity = activity
 self.timestamp, self.year, self.month = month_aware_time_stamp(
 self.time_stamp)
 self.hash = self.hashed()
 def data(self):
 return self.__dict__
 def hashed(self):
 string = str(self.latitude) + str(self.longitude)
 return hashlib.sha3_224(string.encode()).hexdigest()
 def __hash__(self):
 string = str(self.latitude) + str(self.longitude)
 return hash(string.encode())
 def __eq__(self, other):
 """Override the default Equals behavior"""
 if isinstance(other, self.__class__):
 return self.hash == other.hash and self.month == other.month and self.year == other.year
 return False
 def __ne__(self, other):
 """Override the default Unequal behavior"""
 return self.hash != other.hash or self.month != other.month or self.year != other.year

Question 2

Where are your data coming from? Is the Location object used for any other purpose in your code, or is it solely for deduplication?

Question 3

The data is in JSON file, each JSON will have more than 100000 entries. My task at hand is to find the most visited locations in a month of a year.

Question 4

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers .

Question 5

`hash`

Since tuples are hashable if their elements are hashable, you can just do

def __hash__(self):
 return hash((self.latitude, self.longitude, self.year, self.month))

If you want to have locations that are near each other be set as the same, you might have to round the coordinates

`repr`

for debugging, adding a repr can be handy:

def __repr__(self):
 return (
 "Position("
 f"lat: {self.latitude}, "
 f"lon: {self.longitude}, "
 f"year: {self.year}, "
 f"month: {self.month}, "
 ")"
 )

Counting

To count, you can use a collections.Counter, and just feed it an iterable of Positions.

counter = Counter(locations)
most_visited = counter.most_common(<n>)

You will need one Counter per year/month, but this can be done with a defaultdict(Counter)

Question 6

I exactly did it like this, rounding off latitude and longitude to three decimal places, thank you.

Question 7

First - you are expected to provide complete and working code. To fully understand your code month_aware_time_stamp is missing (I'll try to guess what it probably returns).

Second - I think your approach and your class Location is broken by design. Also the implementation is a mess but let's stick to the design first. What your class does

>>> item1 = Location(1510213074679, 286220203, 772454413, 1414, None, None, None, 78, None)
>>> item2 = Location(1510213074679, 286220203, 772454413, 5, 6, 80, 226, None, None)
>>> item1 == item2
True

This is weird behavior. Nobody would expect that. Your code is not maintainable and will cause surprising behavior when being maintained.

If you do not need all the other attributes - do not store them at all. Just store the attributes you want to compare and do your stuff without strange hash implementations
If you need all that attributes in your class, then your implementation will render the class useless for any other use. Do not overwrite the comparison behavior but either extract the relevant data to a second data structure (tuple, ...) or provide an external comparison function for calling functions like sort(). You can always do comprehension to extract relevant attributes to a tuple

.

some_set = {(s.latitude, s.longitude, s.year, s.month) for s in x}

or you immediately count with collections.Counter

import collections
c = collections.Counter((s.latitude, s.longitude, s.year, s.month) for s in x)

I will skip the review of the concrete implementetion as the design has to change. But

never implement some homebrew hash if the builtin hash() is sufficient.
if you think you have to implement __hash__() and/or `eq() please follow the guidelines in [https://docs.python.org/3/reference/datamodel.html#object.hash]
never keep hash values unless you can guarantee consistency

Question 8

I have changed my code, never keep hash values unless you can guarantee consistency but couldnt understand your statement , Thanks.

Maarten Fabré Maarten Fabré 9,3901 gold badge15 silver badges27 bronze badges · Accepted Answer · 2019-05-10 08:43:50Z

`hash`

Since tuples are hashable if their elements are hashable, you can just do

def __hash__(self):
 return hash((self.latitude, self.longitude, self.year, self.month))

If you want to have locations that are near each other be set as the same, you might have to round the coordinates

`repr`

for debugging, adding a repr can be handy:

def __repr__(self):
 return (
 "Position("
 f"lat: {self.latitude}, "
 f"lon: {self.longitude}, "
 f"year: {self.year}, "
 f"month: {self.month}, "
 ")"
 )

Counting

To count, you can use a collections.Counter, and just feed it an iterable of Positions.

counter = Counter(locations)
most_visited = counter.most_common(<n>)

You will need one Counter per year/month, but this can be done with a defaultdict(Counter)

I exactly did it like this, rounding off latitude and longitude to three decimal places, thank you.

Stack Exchange Network

Pythonic way to find duplicate objects in a set

2 Answers 2

`hash`

`repr`

Counting

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Pythonic way to find duplicate objects in a set

2 Answers 2

hash

repr

Counting

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`hash`

`repr`