6
\$\begingroup\$

I am new to Python from R. They are quite different. I wrote Python code to deal with Excel file. The code was running well, but I think it's kind of the style of R. I want somebody to show me writing the code in Python way.

The data structure looks like:

 Name Details
1 AAA first(100%-8)
2 BBB first(50%-8),second(50%-8)
3 CCC sixth(30%-8),seventh(60%-7),first(10%-7.75)
4 DDD third(100%-6)
5 EEE fifth(70%-7.5),second(30%-7.5)
6 FFF first(70%-8),ninth(30%-6.75)
... ... ..........

As you can see, Mr.first gave Mr.AAA 8 points with 100% weights. Otherwise, in the 2nd,3rd,6th row, Mr.first gave different people with different scores. So the average scores that Mr.first gave is (8+8+7.75+8)/4 = 7.94, which is the average score of his group.

What I am looking for is: for the Mr.AAA, his final score is not 8 * 100%,it's 8 * (7.5/7.94)*100%, where 7.5 is a constant and 7.94 is the average score of the group of Mr.first. Similarly, for the Mr.BBB, his final score is 8 * (7.5/7.94)*50% + 8 * (7.5/7.75)*50%. Hope you get it.

So, question is pretty simple.

My code:

#-*- coding:utf-8 -*-
import xlrd
import re
data = xlrd.open_workbook(filename) #read the data
table = data.sheets()[0] #read the sheet
nrows = table.nrows #get the number of total rows
regr = r'[\u4e00-\u9fa5a-zA-Z]+' # regular expression for CHN and ENG names
regr1 = r'[0-9]+' # for scores and percentage
score = {} # The dict: {AAA:{first:[1.0,8.0]},......}
group = {} # The dict: {first:[8,8,7.75,8],......}
for i in range(2,nrows):
 target = table.cell(i,10).value # the details data [first(100%-8)]
 person = table.cell(i,2).value # the Name data [AAA]
 c = target.split(',') # If in details data there are more than 
 # one person, then split them 
 score[person] = {} # set an empty dict
 for j in c:
 d = re.findall(regr,j) # get the name 
 d = "".join(d) # transfer the list to string
 value = re.findall(regr1,j) #get the score and percentage
 value1 = int(value[0])/100 # get the percentage 
 value2 = '.'.join([x for x in value[1:]]) # get the score
 value2 = float(value2) # change to float
 group.setdefault(d,[]).append(value2) 
 score[person].setdefault(d,[value1,value2]) 
#This part is for calculating the group average
for key in group:
 total = 0
 length = len(group[key])
 group[key] = [x for x in group[key]]
 for x in group[key]:
 total = total + x
 group[key] = total/length
output = {} #set an empty dict to store output: {AAA:7.56,......}
#this part is for calculating the final score
for key in score:
 average = 0
 for subkey in score[key]:
 average = score[key][subkey][0] * score[key][subkey][1] 
 *7.5/group[subkey] + average
 output[key] = average
 
print(output) #print the output

Finally here: for the reason of secrecy, I can not provide the raw data. But, if you need, I can create the raw data. I am here because I feel that my code is tedious. I hope someone can help me to write it more elegantly. It's for future work.

toolic
14.5k5 gold badges29 silver badges203 bronze badges
asked Nov 29, 2016 at 8:56
\$\endgroup\$
3
  • 1
    \$\begingroup\$ you should definetly check pandas and maybe check the modern pandas introduction by tom augspurger and also the apply function might be interesting. Additionaly, the split-apply-combine workflow is explained here \$\endgroup\$ Commented Nov 29, 2016 at 9:07
  • \$\begingroup\$ can you flatten your data or rearrange it into a dataframe? Then several operations become more "natural" \$\endgroup\$ Commented Nov 29, 2016 at 9:11
  • \$\begingroup\$ I know "pandas", but unfamiliar with it. I will check it. Thanks for your comment. @Quickbeam2k1 \$\endgroup\$ Commented Nov 29, 2016 at 9:27

2 Answers 2

2
\$\begingroup\$

Portability

I get a syntax error on these lines:

 average = score[key][subkey][0] * score[key][subkey][1] 
 *7.5/group[subkey] + average

Perhaps your version of Python is more forgiving. The error goes away if I merge the 2 lines:

 average = score[key][subkey][0] * score[key][subkey][1] *7.5/group[subkey] + average

Comments

Some of the comments are helpful, such as:

#This part is for calculating the group average

Although, it could be simplified as:

# Calculating the group average

Other comments can be simply deleted, such as:

print(output) #print the output

The comment merely states what is already obvious from the code. Also, because of all the empty space between the code and the comment, it is easy to miss the comment.

This comment looks helpful:

regr = r'[\u4e00-\u9fa5a-zA-Z]+' # regular expression for CHN and ENG names

However, I'm not sure what "CHN and ENG" means. Those terms can be explained using more details in the comment.

It would also be helpful to explain this part of the regular expression in the comments:

\u4e00-\u9fa5a

Naming

There are 2 similarly named variables: regr and regr1. It would be better to give them more meaningful names, such as regr_name and regr_score.

c is not a meaningful variable name in this context. Perhaps details would be better.

Similarly for j and d.

The variables named value, value1, etc., are too generic. value1 could be percentage and value2 could be score.

Simpler

This line:

total = total + x

is simpler using the special assignment operator:

total += x
answered Mar 21 at 10:53
\$\endgroup\$
2
\$\begingroup\$

As Quickbeam2k1 said nine years ago,

you should [definitely] check Pandas

Specifically, these loops:

for i in range(2,nrows):
 for j in c:
for key in group:
for key in score:

will be slow for large data, and take more effort than proper use of Pandas.

Whether you use Pandas or not, definitely don't (i,10). Use column names instead of indices.

This could look like

import io
import pandas as pd
with io.StringIO(''',Name,Details
1,AAA,"first(100%-8)"
2,BBB,"first(50%-8),second(50%-8)"
3,CCC,"sixth(30%-8),seventh(60%-7),first(10%-7.75)"
4,DDD,"third(100%-6)"
5,EEE,"fifth(70%-7.5),second(30%-7.5)"
6,FFF,"first(70%-8),ninth(30%-6.75)"
''') as f:
 data = pd.read_csv(f, index_col=0)
# Repeat Details for each name in the input row.
details = data['Details'].str.split(',').explode()
data, data['Details'] = data.align(details, axis='rows')
details_pat = r'''(?x)
 ^ # anchor to start
 (?P<group> # named capture
 \w+ # any word chars
 )
 \(
 (?P<weight> # named capture
 \d+ # any digits
 )
 %-
 (?P<score> # named capture
 [.\d]+ # any digits or point
 )
 \)
 $ # anchor to end
'''
data = pd.concat(
 (data['Name'], data['Details'].str.extract(details_pat)),
 axis='columns',
)
data['weight'] = data['weight'].astype(float)*0.01
data['score'] = data['score'].astype(float)
mean = data.groupby('group')['score'].transform('mean')
data['output'] = data['score'] * 7.5/mean * data['weight']
output = data.groupby('group')['output'].sum()
print(output)
group
fifth 5.250000
first 17.362205
ninth 2.250000
second 6.048387
seventh 4.500000
sixth 2.250000
third 7.500000
Name: output, dtype: float64

Dot product

The inner multiply-and-sum can actually be represented as a dot product. Unfortunately Pandas groupers do not support dot so this approach partially needs to break vectorisation by using an apply. Performance characteristics will probably differ and should be profiled at scale, but the output is the same.

import io
import pandas as pd
with io.StringIO(''',Name,Details
1,AAA,"first(100%-8)"
2,BBB,"first(50%-8),second(50%-8)"
3,CCC,"sixth(30%-8),seventh(60%-7),first(10%-7.75)"
4,DDD,"third(100%-6)"
5,EEE,"fifth(70%-7.5),second(30%-7.5)"
6,FFF,"first(70%-8),ninth(30%-6.75)"
''') as f:
 data = pd.read_csv(f, index_col=0)
# Separate Details for each Name in the input row.
details = data['Details'].str.split(',').explode()
fields, fields['Details'] = data.align(details, axis='rows')
details_pat = r'''(?x)
 (?P<group> # named capture
 .* # any chars
 )
 \(
 (?P<weight> # named capture
 \d+ # any digits
 )
 %-
 (?P<score> # named capture
 \d* # digits before point
 \.? # optional point
 \d* # digits after point
 )
 \)
'''
fields = pd.concat(
 (fields['Name'], fields['Details'].str.extract(details_pat)),
 axis='columns',
)
fields['weight'] = fields['weight'].astype(float)*0.01
fields['score'] = fields['score'].astype(float)
def weighted_mean(group: pd.DataFrame) -> float:
 return group['score'].dot(group['weight']) / group['score'].mean()
output = 7.5 * fields.groupby('group')[['score', 'weight']].apply(weighted_mean)
print(output)
answered Mar 22 at 15:27
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.