3
\$\begingroup\$

I have some data in a .csv file, which looks roughly like this:

[fragment1, peptide1, gene1, replicate1, replicate2, replicate3]
[fragment1, peptide2, gene1, replicate1, replicate2, replicate3]
[fragment2, peptide1, gene2, replicate1, replicate2, replicate3]
[fragment2, peptide2, gene2, replicate1, replicate2, replicate3]
[fragment3, peptide1, gene2, replicate1, replicate2, replicate3]

And the problem is this - I need to use this data (the three replicates) in several different manners:

  1. Over each row (i.e. just replicate1-3 for each row)
  2. Over each replicate column for each fragment (i.e. replicate1 from peptides1 and 2 from fragment1, and the same for replicate2 and 3)
  3. Over each replicate column for each gene (i.e. same as (2), but using genes instead of fragments

The data files all have the same columns, but the rows (i.e. number of fragments/peptides/genes) vary, so I have to read the data without specifying row numbers. What I need, essentially, is statistics (coefficients of variation) across each row, across each fragment and across each gene.

The variant across rows just uses the three replicates (always three values from one row), and is of course very simple to get to. Both the variants across fragments and across genes first calculates statistics for using first statistics from every applicable replicate1, then every replicate2, then replicate3, (i.e. unknown number of values from unknown number of rows) and after that do the same statistics using the values previously calculated (i.e. always three values).

I have a script that does this, almost, but it's very long and (I think) overly complicated. I basically read the file three times, each time gathering the data in the different manners described, mostly in lists and sometimes numpy.arrays.

with open('Data/MS - PrEST + Sample/' + data_file,'rU') as in_file:
 reader = csv.reader(in_file,delimiter=';')
 x = -1
 data = numpy.array(['PrEST ID','Genes','Ratio H/L ' + cell_line + '_1','Ratio H/L ' + cell_line + '_2',\
 'Ratio H/L ' + cell_line + '_3'])
 current_PrEST = ''
 max_CN = []
 for row in reader:
 # First (headers) row
 if x == -1:
 for n in range(len(row)):
 if row[n] == 'PrEST ID':
 PrEST_column = n
 continue
 if row[n] == 'Gene names':
 Gene_column = n
 continue
 if row[n] == 'Ratio H/L ' + cell_line + '_1':
 Ratio_R1_column = n
 continue
 if row[n] == 'Ratio H/L ' + cell_line + '_2':
 Ratio_R2_column = n
 continue
 if row[n] == 'Ratio H/L ' + cell_line + '_3':
 Ratio_R3_column = n
 continue
 if row[n] == 'Sequence':
 Sequence_column = n
 continue
 x += 1
 continue
 # Skips combined / non-unique PrESTs
 if row[PrEST_column].count(';') == 1:
 continue
 # Collects and writes data for first (non-calculated) data set
 MC_count = row[Sequence_column].count('R') + row[Sequence_column].count('K') - 1
 write = (row[PrEST_column],row[Gene_column],row[Ratio_R1_column],row[Ratio_R2_column],\
 row[Ratio_R3_column],row[Sequence_column],MC_count)
 writer_1.writerow(write)
 # Plots to figure 1 (copy numbers for peptides), but only if there is some data to plot
 if current_PrEST != row[PrEST_column]:
 colour = cycle(['k','#A9F5A9','#6699FF','#A9F5F2','#9370DB','#FF3333'])
 current_PrEST = row[PrEST_column]
 x += 1
 # Checks if data for at least one replicate exists
 if row[Ratio_R1_column] != 'NaN' or row[Ratio_R2_column] != 'NaN' or row[Ratio_R3_column] != 'NaN':
 ccolour = next(colour)
 plt.figure(1)
 CN1 = (spike[row[PrEST_column]] / float(row[Ratio_R1_column]) * (10**-12) * (6.022*10**23) / (1*10**6))
 CN2 = (spike[row[PrEST_column]] / float(row[Ratio_R2_column]) * (10**-12) * (6.022*10**23) / (1*10**6))
 CN3 = (spike[row[PrEST_column]] / float(row[Ratio_R3_column]) * (10**-12) * (6.022*10**23) / (1*10**6))
 plt.plot(x,CN1,marker='o',color=ccolour)
 plt.plot(x,CN2,marker='o',color=ccolour)
 plt.plot(x,CN3,marker='o',color=ccolour)
 if CN1 > Copy_Number_Cutoff or CN2 > Copy_Number_Cutoff or CN3 > Copy_Number_Cutoff:
 plt.plot(x,Copy_Number_Cutoff*0.97,marker='^',color='red')
 max_CN.append(max(CN1,CN2,CN3))
 # Collects data for downstream calculations
 row_data = numpy.array([row[PrEST_column],row[Gene_column],row[Ratio_R1_column],row[Ratio_R2_column],row[Ratio_R3_column]])
 data = numpy.vstack((data,row_data))
# Prints largest copy number above CN cutoff (if applicable)
if max_CN != []:
 print('Largest copy number: ' + str(round(max(max_CN))))
# Gathers PrEST/Gene names
PrEST_list = []
Gene_list = []
for n in range(len(data) - 1):
 PrEST = data[n+1][0]
 if PrEST not in PrEST_list:
 PrEST_list.append(PrEST)
 Gene_list.append(data[n+1][1])
# Analyses data and writes to file
collected_PrESTs = []
collected_Genes = []
collected_CNs = []
collected_CVs = []
collected_counts = []
collected_medians = []
while len(PrEST_list) != 0:
 PrEST = PrEST_list[0]
 PrEST_list.remove(PrEST)
 Gene = Gene_list[0]
 Gene_list.remove(Gene)
 Peptide_count = 0
 # Collects H/L ratios and calculate copy numbers / statistics
 R1 = []
 R2 = []
 R3 = []
 for n in range(len(data) - 1):
 if data[n+1][0] == PrEST:
 if data[n+1][2] != 'NaN':
 R1.append((spike[PrEST] / float(data[n+1][2])) * (10**-12) * (6.022*10**23) / (1*10**6))
 Peptide_count += 1
 if data[n+1][3] != 'NaN':
 R2.append((spike[PrEST] / float(data[n+1][3])) * (10**-12) * (6.022*10**23) / (1*10**6))
 Peptide_count += 1
 if data[n+1][4] != 'NaN':
 R3.append((spike[PrEST] / float(data[n+1][4])) * (10**-12) * (6.022*10**23) / (1*10**6))
 Peptide_count += 1
 # Checks if lacking data
 if R1 == [] and R2 == [] and R3 == []:
 write = (PrEST,Gene,'No data')
 writer_2.writerow(write)
 continue
 # Calculate statistics
 curated_medians = []
 if R1 != []:
 curated_medians.append(numpy.median(R1))
 if R2 != []:
 curated_medians.append(numpy.median(R2))
 if R3 != []:
 curated_medians.append(numpy.median(R3))
 End_Copy_Number = int(round(numpy.median(curated_medians),0))
 if len(curated_medians) > 1:
 CV = round((numpy.std(curated_medians,ddof=1) / numpy.mean(curated_medians)) * 100,1)
 else:
 CV = -1
 # Writes data to file
 write = (PrEST,Gene,End_Copy_Number,CV)
 writer_2.writerow(write)
 # Checks if the current PrEST maps to a gene that has more than one PrEST and calculates statistics for that gene
 if Gene in collected_Genes and Gene not in Gene_list:
 CNs = []
 for n in range(len(collected_Genes)):
 if Gene == collected_Genes[n]:
 CNs.append(collected_medians[n])
 CNs.append(curated_medians)
 Gene_CN = int(round(numpy.median(CNs),0))
 Gene_CV = round((numpy.std(CNs,ddof=1) / numpy.mean(CNs)) * 100,1)
 write = ('',Gene,Gene_CN,Gene_CV)
 writer_2.writerow(write)

How can I best read data in different ways effectively, both speed-wise and "less code"-wise? I tried to find similar questions, but to no avail.

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jan 5, 2014 at 17:51
\$\endgroup\$
3
  • 2
    \$\begingroup\$ FWIW I think your code would be much simpler if rewritten using pandas, and I think its groupby facilities would come in very handy. \$\endgroup\$ Commented Jan 6, 2014 at 15:09
  • \$\begingroup\$ It would be very helpful if you could post some sample data. \$\endgroup\$ Commented Jul 22, 2014 at 13:44
  • \$\begingroup\$ A year later: pandas is awesome! Thank you so much DSM, that helped me a lot. \$\endgroup\$ Commented Jan 16, 2015 at 12:23

1 Answer 1

3
\$\begingroup\$

First off, if you want re-usability, you should probably encapsulate this into a function with it's specific arguments.

Also, the general style for naming is snake_case for functions and variables, and PascalCase for classes. You also have some other style issues. For example, (PrEST,Gene,End_Copy_Number,CV) should be changed to (PrEST, Gene, End_Copy_Number, CV). You also have various other style violations as well. To correct these, see PEP8, Python's official style guide.

You have an if/elif/else block with many continues.

for n in range(len(row)):
 if row[n] == 'PrEST ID':
 PrEST_column = n
 continue
 ...

These continues can be removed.

Finally, as mentioned by @DSM, you can use pandas to re-write this.

answered Jul 7, 2015 at 3:24
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.