I am trying to write a Python script to find duplicate files on a USB flash drive.

The process I am following is to create a list of the file names, hash each file, and then build an inverse dictionary. However, somewhere in the process I get a UnicodeDecodeError. Could someone help me understand what's going on?

from os import listdir
from os.path import isfile, join
from collections import defaultdict
import hashlib
my_path = r"F:/"
files_in_dir = [ file for file in listdir(my_path) if isfile(join(my_path, file)) ]
file_hashes = dict()
for file in files_in_dir:
 file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()
inverse_dict = defaultdict(list)
for file, file_hash in file_hashes.iteritems():
 inverse_dict[file_hash].append(file)
inverse_dict.items()

The error that I face is:

Traceback (most recent call last):
 File "C:\Users\Fotis\Desktop\check_dup.py", line 12, in <module>
 file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()
 File "C:\Python33\lib\encodings\cp1253.py", line 23, in decode
 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 2227: character maps to <undefined>
asked Dec 3, 2012 at 18:10
  • @Martijn Pieters It's Python 3. I will retag the question appropriately. Commented Dec 3, 2012 at 18:12

1 Answer

You are trying to read a file that is not encoded in the platform's default encoding (cp1253). When you open a file in text mode (r), Python 3 tries to decode the file contents to Unicode. You didn't specify an encoding, so the platform's preferred encoding is used, and the byte 0xff named in the traceback has no mapping in that code page.
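To see the failure in isolation, here is a minimal sketch that decodes a few bytes by hand instead of going through open(). The byte string is made up for illustration; only the 0xff byte and the cp1253 codec come from the traceback.

```python
import hashlib

# Made-up sample data; the 0xff byte is the one named in the traceback.
raw = b"some binary data \xff\xd8"

# Text mode effectively does this, using the platform encoding (cp1253 here).
try:
    raw.decode("cp1253")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Hashing needs no decoding at all: md5 accepts the raw bytes directly.
digest = hashlib.md5(raw).hexdigest()
```

The decode step fails while the hash step does not, which is why skipping the decoding is enough here.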

Open the files in binary mode instead, using rb as the mode. Since you are only calculating the MD5 hash (a function that expects bytes), you should not be using text mode anyway.
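A possible rewrite of the script along those lines is sketched below. The helper names file_md5 and find_duplicates and the chunked reading are my own additions, not part of the original code; reading in chunks is only there so large files are not loaded into memory at once. It also uses .items(), since dict.iteritems() does not exist in Python 3.

```python
import hashlib
from collections import defaultdict
from os import listdir
from os.path import isfile, join


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it as raw bytes."""
    digest = hashlib.md5()
    # 'rb' returns bytes, so no decoding (and no UnicodeDecodeError) happens.
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Map each hash to the file names sharing it, keeping groups of 2 or more."""
    by_hash = defaultdict(list)
    for name in listdir(directory):
        path = join(directory, name)
        if isfile(path):
            by_hash[file_md5(path)].append(name)
    return {h: sorted(names) for h, names in by_hash.items() if len(names) > 1}
```

Calling find_duplicates(r"F:/") would then return a dictionary whose values are lists of file names with identical contents. Like the original script, it only looks at the top-level directory.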

jfs
answered Dec 3, 2012 at 18:16