I am given pasted data from a table, so I have spaces as delimiters and in some of the fields I don't care about. I want the first field and the last three fields, and got them using this code:
testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""
rows = testdata.split('\n')
tupls = [row.split(' ') for row in rows]
dates = [tupl[0] for tupl in tupls]
custids, locids, amounts = ([tupl[i] for tupl in tupls] for i in range (-3, 0))
print(dates, custids, locids, amounts)
# ['8/5/15', '8/9/15'] ['custid', 'custid'] ['locid', 'otherlocid'] ['55.00', '79.00']
I just thought there might be a more elegant way to do things, maybe capturing the data in the middle as a single field.
Edit: I have attempted to add delimiters using re.finditer, but I can't replace the matches easily.
2 Answers
After you have tupls:
data = [(t[0], t[-3], t[-2], t[-1]) for t in tupls]  # Or use range...
print(list(zip(*data)))
Gives:
[('8/5/15', '8/9/15'), ('custid', 'custid'), ('locid', 'otherlocid'), ('55.00', '79.00')]
So this:
testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""
rows = testdata.split('\n')
tupls = [row.split(' ') for row in rows]
dates = [tupl[0] for tupl in tupls]
custids, locids, amounts = ([tupl[i] for tupl in tupls] for i in range (-3, 0))
print(dates, custids, locids, amounts)
Becomes:
testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""
rows = testdata.split('\n')
tupls = [row.split(' ') for row in rows]
data = [(t[0], t[-3], t[-2], t[-1]) for t in tupls]  # Or use range...
print(list(zip(*data)))
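If separate names are wanted, as in the original code, the transposed result can be unpacked directly. A minimal sketch, assuming every row has at least four space-separated fields and the input is not empty (unpacking an empty zip would raise a ValueError):

```python
testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""

rows = testdata.split('\n')
tupls = [row.split(' ') for row in rows]

# Keep only the first field and the last three fields of every row.
data = [(t[0], t[-3], t[-2], t[-1]) for t in tupls]

# zip(*data) transposes rows into columns; each column comes back as a tuple.
dates, custids, locids, amounts = zip(*data)
print(dates, custids, locids, amounts)
```

Note that the columns are tuples here rather than lists; wrap each in list() if they need to be mutable.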
-
So basically your first line drops the irregular data so we have a matrix whose columns are the data we want. Then you split it into tuples with argument unpacking, then you zip the tuples together so you have the columns as iterators. – Noumenon, Sep 4, 2016 at 1:14
-
Yup. That's what is going on. – Dair, Sep 4, 2016 at 1:16
-
It's amazing how much more readable the code is just from using "t in tupls" instead of "tupl in tupls". – Noumenon, Sep 4, 2016 at 1:41
There are more succinct ways of writing this, but depending on the size of the input they can make performance worse. A lot of work is repeated even though it isn't strictly necessary, so more efficient approaches certainly exist. Compare this StackOverflow post in particular for ways to iterate over the string input without building the intermediate list.
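One possible way to avoid the intermediate list of lines (not necessarily the approach the linked post describes) is a small generator built on re.finditer. This is only a sketch; the name iter_rows is made up for illustration, and it skips empty lines:

```python
import re

testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""

def iter_rows(text):
    """Yield one non-empty line at a time without building a list of all lines."""
    for match in re.finditer(r"[^\n]+", text):
        yield match.group()

# Each row is split and reduced to the wanted fields as it is produced.
picked = [(f[0], f[-3], f[-2], f[-1])
          for f in (row.split() for row in iter_rows(testdata))]
print(picked)
```

For input this small the difference is negligible; it only starts to matter when the pasted text is large.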
Apart from that I'd consider doing just one iteration and accumulating values that way. Instead of cutting off a character from a name (as in tupl), I'd consider adding an underscore (tuple_) to make it more readable without overriding e.g. the built-in name tuple.
Lastly you could still just look backwards from the end for the third to last space and just split that substring.
A more verbose version could thus be:
testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""
dates, custids, locids, amounts = [], [], [], []
for row in testdata.splitlines():
    tuples = row.split()
    dates.append(tuples[0])
    custids.append(tuples[-3])
    locids.append(tuples[-2])
    amounts.append(tuples[-1])
print(dates, custids, locids, amounts)
# ['8/5/15', '8/9/15'] ['custid', 'custid'] ['locid', 'otherlocid'] ['55.00', '79.00']
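The idea of looking backwards from the end for the third-to-last space can be sketched with str.rsplit and a maxsplit argument, which also keeps the unwanted middle part as a single field, as the question asked. The helper name split_row is hypothetical, and the sketch assumes every row has a date, at least one middle word, and the three trailing fields:

```python
testdata = """8/5/15 stuffidontneed custid locid 55.00
8/9/15 stuff i really dont need with extra spaces custid otherlocid 79.00"""

def split_row(row):
    """Split a row into (date, middle, custid, locid, amount)."""
    # Split once from the left to peel off the date.
    date, rest = row.split(None, 1)
    # Split at most three times from the right; whatever is left is the middle.
    middle, custid, locid, amount = rest.rsplit(None, 3)
    return date, middle, custid, locid, amount

for row in testdata.splitlines():
    print(split_row(row))
```

A row with fewer than five fields would raise a ValueError during unpacking, so real input may need a check before the split.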
-
split() and splitlines() are definitely more elegant than what I used. – Noumenon, Sep 4, 2016 at 1:21