I have tons of location points with several attributes like time_stamp
, latitude
, longitude
, accuracy
, velocity
etc. Now I want to remove duplicate location points on these conditions:
- Latitude is same
- Longitude is same
- Month is same
- Day is same
The other attributes dont matter at all for comparison. My strategy is to modify __hash__
, __eq__
and __ne__
methods to include the above conditions and feed them into set
function to remove the duplicates.
item1 = Location(1510213074679, 286220203, 772454413, 1414, None, None, None, 78, None)
item2 = Location(1510213074679, 286220203, 772454413, 5, 6, 80, 226, None, None)
item3 = Location(1523620644975, 286265651, 772427842, 65, None, None, 193, 10, None)
x = set()
x.add(item1)
x.add(item2)
x.add(item3)
print(x)
{<__main__.Location at 0x7fd725559eb8>, <__main__.Location at 0x7fd725604dd8>}
Length of set is as expected i.e 2 (two elements with same attributes).
The result which I expect, can be obtained as follows
[f.data() for f in x]
Is there any other pythonic way to achieve the same result?
class Location(object):
def __init__(self, time_stamp, latitude, longitude, accuracy, velocity,
heading, altitude, vertical_accuracy, activity):
self.time_stamp = float(time_stamp) / 1000
self.latitude = float(latitude) / 10000000
self.longitude = float(longitude) / 10000000
self.accuracy = accuracy
self.velocity = velocity
self.heading = heading
self.altitude = altitude
self.vertical_accuracy = vertical_accuracy
self.activity = activity
self.timestamp, self.year, self.month = month_aware_time_stamp(
self.time_stamp)
self.hash = self.hashed()
def data(self):
return self.__dict__
def hashed(self):
string = str(self.latitude) + str(self.longitude)
return hashlib.sha3_224(string.encode()).hexdigest()
def __hash__(self):
string = str(self.latitude) + str(self.longitude)
return hash(string.encode())
def __eq__(self, other):
"""Override the default Equals behavior"""
if isinstance(other, self.__class__):
return self.hash == other.hash and self.month == other.month and self.year == other.year
return False
def __ne__(self, other):
"""Override the default Unequal behavior"""
return self.hash != other.hash or self.month != other.month or self.year != other.year
2 Answers 2
hash
Since tuples are hashable if their elements are hashable, you can just do
def __hash__(self):
return hash((self.latitude, self.longitude, self.year, self.month))
If you want to have locations that are near each other be set as the same, you might have to round the coordinates
repr
for debugging, adding a repr can be handy:
def __repr__(self):
return (
"Position("
f"lat: {self.latitude}, "
f"lon: {self.longitude}, "
f"year: {self.year}, "
f"month: {self.month}, "
")"
)
Counting
To count, you can use a collections.Counter
, and just feed it an iterable of Positions.
counter = Counter(locations)
most_visited = counter.most_common(<n>)
You will need one Counter per year/month, but this can be done with a defaultdict(Counter)
-
\$\begingroup\$ I exactly did it like this, rounding off latitude and longitude to three decimal places, thank you. \$\endgroup\$GraphicalDot– GraphicalDot2019年05月10日 08:47:45 +00:00Commented May 10, 2019 at 8:47
First - you are expected to provide complete and working code. To fully understand your code month_aware_time_stamp
is missing (I'll try to guess what it probably returns).
Second - I think your approach and your class Location
is broken by design. Also the implementation is a mess but let's stick to the design first. What your class does
>>> item1 = Location(1510213074679, 286220203, 772454413, 1414, None, None, None, 78, None)
>>> item2 = Location(1510213074679, 286220203, 772454413, 5, 6, 80, 226, None, None)
>>> item1 == item2
True
This is weird behavior. Nobody would expect that. Your code is not maintainable and will cause surprising behavior when being maintained.
If you do not need all the other attributes - do not store them at all. Just store the attributes you want to compare and do your stuff without strange hash implementations
If you need all that attributes in your class, then your implementation will render the class useless for any other use. Do not overwrite the comparison behavior but either extract the relevant data to a second data structure (tuple, ...) or provide an external comparison function for calling functions like
sort()
. You can always do comprehension to extract relevant attributes to a tuple
.
some_set = {(s.latitude, s.longitude, s.year, s.month) for s in x}
or you immediately count with collections.Counter
import collections
c = collections.Counter((s.latitude, s.longitude, s.year, s.month) for s in x)
I will skip the review of the concrete implementetion as the design has to change. But
- never implement some homebrew hash if the builtin
hash()
is sufficient. - if you think you have to implement
__hash__()
and/or `eq() please follow the guidelines in [https://docs.python.org/3/reference/datamodel.html#object.hash] - never keep hash values unless you can guarantee consistency
-
\$\begingroup\$ I have changed my code, never keep hash values unless you can guarantee consistency but couldnt understand your statement , Thanks. \$\endgroup\$GraphicalDot– GraphicalDot2019年05月10日 21:23:24 +00:00Commented May 10, 2019 at 21:23
Location
object used for any other purpose in your code, or is it solely for deduplication? \$\endgroup\$