Return to Question

deleted 6 characters in body

Source Link

edited Oct 6, 2016 at 18:21

Mahmood Muhammad Nageeb

edited Oct 6, 2016 at 18:21

Mahmood Muhammad Nageeb

Here's its description:Here's its description:

Here's my solution:Here's my solution:

Notes:Notes:

Here's its description:

Here's my solution:

Notes:

Here's its description:

Here's my solution:

Notes:

added 2 characters in body

Source Link

edited Oct 5, 2016 at 22:29

Mahmood Muhammad Nageeb

edited Oct 5, 2016 at 22:29

Mahmood Muhammad Nageeb

It wasn't like that, but it worked. After reading the author's solution I refactored it and edited a lot of it.

How can it be refactored and optimized?

It wasn't like that, but it worked. After reading the author's solution I refactored it and edited a lot of it.
I'm no expert at MD5, just got the basic idea.
I'm no expert at Unix or Linux. I use Windows, and I've tested this script using a virtual machine running Ubuntu.
I'm a hobbyist and beginner in Python and programming.

It wasn't like that, but it worked. After reading the author's solution I refactored it and edited a lot of it.

How can it be refactored and optimized?

I'm no expert at MD5, just got the basic idea.
I'm no expert at Unix or Linux. I use Windows, and I've tested this script using a virtual machine running Ubuntu.
I'm a hobbyist and beginner in Python and programming.

How can it be refactored and optimized?

It wasn't like that, but it worked. After reading the author's solution I refactored it and edited a lot of it.
I'm no expert at MD5, just got the basic idea.
I'm no expert at Unix or Linux. I use Windows, and I've tested this script using a virtual machine running Ubuntu.
I'm a hobbyist and beginner in Python and programming.

Source Link

asked Oct 5, 2016 at 21:42

Mahmood Muhammad Nageeb

asked Oct 5, 2016 at 21:42

Mahmood Muhammad Nageeb

Finding duplicate files using md5sum Unix command

This is an exercise from Think Python: How to Think Like a Computer Scientist

Here's its description:

In a large collection of MP3 files, there may be more than one copy of the same song, stored in different directories or with different filenames. The goal of this exercise is to search for duplicates.

Write a program that searches a directory and all of its subdirectories, recur‐ sively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating fileand path names.

To recognize duplicates, you can use md5sum to compute a "checksum" for each files. If two files have the same checksum, they probably have the same contents.

To double-check, you can use the Unix command diff.

Here's my solution:

import os
def run_command(cmd):
 """Runs a command in a shell.
 cmd: a string specifies a Unix command.
 Returns: a string specifies the result
 of executing the command.
 """
 filepipe = os.popen(cmd)
 result = filepipe.read()
 status = filepipe.close()
 return result
def md5_checksum(filepath):
 """Returns a string specifies the MD5 checksum of
 a given file using md5sum Unix command.
 filepath: a string specifies a file.
 """
 command = 'md5sum ' + filepath
 return run_command(command)
def md5_checksum_table(dirname, suffix):
 """Searches a directory for files with a given
 file format (a suffix) and computes their
 MD5 checksums.
 dirname: a string specifies a directory.
 suffix: a file format (e.g. .pdf or .mp3).
 
 Returns: a dictionary mapping from string
 works as a MD5 checksum to list of strings
 work as pathes of files have this checksum.
 """
 table = {}
 for root, sub, files in os.walk(dirname):
 for file in files:
 if file.endswith(suffix):
 filepath = os.path.join(root, file)
 checksum, filename = md5_checksum(filepath).split()
 table.setdefault(checksum, []).append(filename)
 
 return table
def are_identical(files_names):
 """Returns whether files in files_names
 are identical using diff Unix command.
 files_names: a list of strings specify pathes of files.
 """
 index = 1
 
 for filename1 in files_names:
 for filename2 in files_names[index:]:
 command = 'diff %s %s' % (filename1, filename2)
 result = run_command(command)
 if result:
 return False
 
 index += 1
 return True
 
def print_duplicates(checksums):
 """Prints pathes of files have the same MD5 checksum.
 checksums: a dictionary mapping from MD5 checksum (string) to
 list of pathes of files (strings) have share this checksum.
 """
 for checksum, filepathes in list(checksums.items()):
 if len(filepathes) > 1:
 print('Files have the checksum %s %s' % (checksum, 'are: '))
 for filepath in filepathes:
 print(filepath)
 if are_identical(filepathes):
 print('\nThey are indentical. \n')
def main():
 table = md5_checksum_table('/media/sf_Shared/', '.pdf')
 print_duplicates(table)
if __name__ == '__main__':
 main()

It wasn't like that, but it worked. After reading the author's solution I refactored it and edited a lot of it.

How can it be refactored and optimized?

Notes:

I'm no expert at MD5, just got the basic idea.
I'm no expert at Unix or Linux. I use Windows, and I've tested this script using a virtual machine running Ubuntu.
I'm a hobbyist and beginner in Python and programming.

lang-py