I have comma-delimited files like these, where the first field is sorted in increasing order:
Case 1 (first file):
abcd,1
abcd,21
abcd,122
abce,12
abcf,13
abcf,21
Case 2 (another file like this):
abcd,1
abcd,21
abcd,122
What I want to do is convert the first file to this:
abcd 1,21,122
abce 12
abcf 13,21
And similarly, convert the second file to this:
abcd 1,21,122
I wrote some very ugly code with a lot of ifs that checks whether the string before the comma on the next line is the same as the one on the current line, and if it is, then do ....
It's so badly written that, even though I wrote it myself about 6 months ago, it took me 3-4 minutes to understand why I did what I did. In short, it's ugly. In case you would like to see it, here it is. (There is also a bug in it, which I didn't fix because I wanted a better approach than this whole code. For the curious: it doesn't print anything for the second case mentioned above, and I know why.)
    def clean_file(filePath, destination):
        f = open(filePath, 'r')
        data = f.read()
        f.close()
        curr_string = current_number = next_string = next_number = ""
        current_numbers = ""
        final_payload = ""
        lines = data.split('\n')[:-1]
        for i in range(len(lines)-1):
            print(lines[i])
            curr_line = lines[i]
            next_line = lines[i+1]
            curr_string, current_number = curr_line.split(',')
            next_string, next_number = next_line.split(',')
            if curr_string == next_string:
                current_numbers += current_number + ","
            else:
                current_numbers += current_number   # check to avoid ',' in the end
                final_payload += curr_string + " " + current_numbers + "\n"
                current_numbers = ""
        print(final_payload)
        # For last line
        if curr_string != next_string:
            # Directly add it to the final_payload
            final_payload += next_line + "\n"
        else:
            # Remove the newline, add a comma and then finally add a newline
            final_payload = final_payload[:-1] + ","+next_number+"\n"
        with open(destination, 'a') as f:
            f.write(final_payload)
Any better solutions?
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. – Mast ♦, Jan 13, 2019 at 19:02
2 Answers
- To solve the grouping problem, use itertools.groupby.
- To read files with comma-separated fields, use the csv module.
- In almost all cases, open() should be called using a with block, so that the files will be automatically closed for you, even if an exception occurs within the block:

      with open(file_path) as in_f, open(destination, 'w') as out_f:
          data = csv.reader(in_f)
          # code goes here

- filePath violates Python's official style guide (PEP 8), which recommends underscores, like your curr_line.
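As a rough illustration of how these suggestions could fit together, here is a minimal sketch. It assumes that every non-empty row has exactly two fields and that the file is already sorted by the first field, as the question states; itertools.groupby only merges adjacent rows with equal keys. The function name clean_file is reused from the question, and the variable names are illustrative only.

```python
import csv
from itertools import groupby
from operator import itemgetter


def clean_file(file_path, destination):
    """Collapse 'key,number' rows into 'key n1,n2,...' lines.

    Assumes the input is sorted by the first field, because groupby
    only groups consecutive rows that share a key.
    """
    with open(file_path, newline='') as in_f, open(destination, 'w') as out_f:
        # Skip blank lines, which csv.reader yields as empty lists.
        rows = (row for row in csv.reader(in_f) if row)
        for key, group in groupby(rows, key=itemgetter(0)):
            numbers = ','.join(row[1] for row in group)
            out_f.write(f'{key} {numbers}\n')
```

Because each group is written as soon as it is complete, this version never holds the whole file in memory, and the single-key file from Case 2 needs no special handling.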
While @200_success's answer is very good (always use libraries that solve your problem), I'm going to give an answer that illustrates how to think about more general problems in case there isn't a perfect library.
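Use with to automatically close files when you're done
You risk leaving a file open if an exception is raised and file.close() is never called. A with block closes the file for you when the block is exited, whether normally or through an exception.

    with open(input_file) as in_file:
        ...  # work with in_file here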
Use the object to iterate, not indices
Most collections and objects can be iterated over directly, so you don't need indices. Iterating over a file object yields one line at a time:

    with open(input_file) as in_file:
        for line in in_file:
            line = line.strip()   # get rid of '\n' at end of line
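If a position is still needed alongside each item (for an error message, say), enumerate provides it without manual indexing. The following is a small sketch; io.StringIO stands in for an open file so the example is self-contained, and the sample rows are taken from the question.

```python
import io

# io.StringIO behaves like an open text file; used here only for illustration.
in_file = io.StringIO("abcd,1\nabcd,21\nabce,12\n")

numbered_lines = []
# enumerate pairs each line with a counter, starting at 1 here.
for line_number, line in enumerate(in_file, start=1):
    numbered_lines.append((line_number, line.strip()))
```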
Use data structures to organize your data
In the end, you want to associate a letter-string with a list of numbers. In Python, a dict maps keys (any hashable value, such as a string) to arbitrary values, so we'll use one to associate each letter-string with a list of numbers.

    with open(input_file) as in_file:
        data = dict()
        for line in in_file:
            line = line.strip()   # get rid of '\n' at end of line
            letters, number = line.split(',')
            data[letters].append(number)
Now, this doesn't quite work since, if a letters entry hasn't been seen yet, the call to data[letters] won't have anything to return and will raise a KeyError exception. So, we have to account for that:

    with open(input_file) as in_file:
        data = dict()
        for line in in_file:
            line = line.strip()   # get rid of '\n' at end of line
            letters, number = line.split(',')
            try:   # there might be an error
                data[letters].append(number)   # append new number if letters has been seen before
            except KeyError:
                data[letters] = [number]   # create new list with one number for a new letter-string
Now, all of the file is stored in a convenient form in the data object. To output, just loop through the data:

    with open(input_file) as in_file:
        data = dict()
        for line in in_file:
            line = line.strip()   # get rid of '\n' at end of line
            letters, number = line.split(',')
            try:   # there might be an error
                data[letters].append(number)   # append new number if letters has been seen before
            except KeyError:
                data[letters] = [number]   # create new list with one number for a new letter-string

    with open(output_file, 'w') as out_file:
        for letters, number_list in data.items():   # iterate over all entries
            out_file.write(letters + ' ' + ','.join(number_list) + '\n')
The .join() method creates a string from a list such that the entries of the list are separated by the string the method is called on (',' in this case).
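A tiny standalone sketch of that behaviour, using values from the question. One detail worth knowing: join only accepts strings, so numeric values would have to be converted first (the measurements list below is a made-up example of that case).

```python
number_list = ['1', '21', '122']

# The separator string goes in front; the list is the argument.
joined = ','.join(number_list)
line = 'abcd' + ' ' + joined + '\n'

# join raises TypeError on non-string items, so convert them first.
measurements = [1, 21, 122]  # hypothetical integer data
joined_ints = ','.join(str(n) for n in measurements)
```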
- Instead of trying to append and catching the error, you can use setdefault: data.setdefault(letters, []).append(number) – KarelPeeters, Jan 13, 2019 at 23:04
- @ToddSewell Neat! That'll be useful in the future. – Mark H, Jan 13, 2019 at 23:08
- Or use collections.defaultdict of course. – Graipher, Jan 14, 2019 at 14:24
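For comparison, the two alternatives mentioned in these comments might look roughly like this. It is only a sketch: the lines list stands in for the stripped lines of the input file, and the variable names follow the answer above.

```python
from collections import defaultdict

lines = ["abcd,1", "abcd,21", "abce,12"]  # sample rows from the question

# Variant 1: dict.setdefault returns the existing list for the key,
# or stores and returns a new empty list if the key is missing.
data = {}
for line in lines:
    letters, number = line.split(',')
    data.setdefault(letters, []).append(number)

# Variant 2: defaultdict(list) creates the empty list automatically
# the first time a missing key is accessed.
grouped = defaultdict(list)
for line in lines:
    letters, number = line.split(',')
    grouped[letters].append(number)
```

Both variants build the same mapping as the try/except version, without the explicit KeyError handling.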