program to generate data helpful in finding duplicate large files

Peter Otten __peter__ at web.de
Thu Sep 18 16:23:18 EDT 2014


David Alban wrote:
> sep = ascii_nul
> print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
> dev, sep, ino, sep, nlink, sep, size, sep, file_path )

file_path may contain newlines, therefore you should probably use "\0" to
separate the records. The other fields cannot contain whitespace, so it's
safe to use " " as the field separator. When you deserialize a record you
can prevent the file_path from being broken apart by passing maxsplit to
the str.split() method:

for record in infile.read().split("\0"):
    print(record.split(" ", 6))
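For instance, a record whose file_path contains both spaces and a newline
survives the round trip, because maxsplit=6 leaves everything after the
sixth space untouched (the host name, hash, and path below are made-up
sample values, not from the original program):

```python
sep = " "
# thishost, md5sum, dev, ino, nlink, size, file_path -- seven fields
record = sep.join(["myhost", "d41d8cd98f00b204e9800998ecf8427e",
                   "2049", "131072", "1", "0",
                   "/tmp/odd name\nwith newline"])
fields = record.split(" ", 6)
# the seventh field is the complete path, spaces and newline included
assert fields[-1] == "/tmp/odd name\nwith newline"
```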
Splitting into records without reading the whole file into memory is left
as an exercise ;)
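One way to do that exercise is a small generator that accumulates chunks
and yields only complete "\0"-terminated records; the helper name
iter_records and the chunksize parameter are my own choices, not part of
the original post:

```python
def iter_records(infile, sep="\0", chunksize=4096):
    """Yield sep-separated records from a file object without
    reading the whole file into memory (illustrative sketch)."""
    buffer = ""
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(sep)
        # the last piece may be an incomplete record; keep it back
        buffer = parts.pop()
        for record in parts:
            yield record
    if buffer:  # final record when the file lacks a trailing separator
        yield buffer
```

Each yielded record can then be fed to record.split(" ", 6) exactly as in
the in-memory version above.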


