I am new to Python from R. They are quite different. I wrote Python code to deal with Excel file. The code was running well, but I think it's kind of the style of R. I want somebody to show me writing the code in Python way.
The data structure looks like:
Name Details
1 AAA first(100%-8)
2 BBB first(50%-8),second(50%-8)
3 CCC sixth(30%-8),seventh(60%-7),first(10%-7.75)
4 DDD third(100%-6)
5 EEE fifth(70%-7.5),second(30%-7.5)
6 FFF first(70%-8),ninth(30%-6.75)
... ... ..........
As you can see, Mr.first gave Mr.AAA 8 points with 100% weights. Otherwise, in the 2nd,3rd,6th row, Mr.first gave different people with different scores. So the average scores that Mr.first gave is (8+8+7.75+8)/4 = 7.94, which is the average score of his group.
What I am looking for is: for the Mr.AAA, his final score is not 8 * 100%,it's 8 * (7.5/7.94)*100%, where 7.5 is a constant and 7.94 is the average score of the group of Mr.first. Similarly, for the Mr.BBB, his final score is 8 * (7.5/7.94)*50% + 8 * (7.5/7.75)*50%. Hope you get it.
So, question is pretty simple.
My code:
#-*- coding:utf-8 -*-
import xlrd
import re
data = xlrd.open_workbook(filename) #read the data
table = data.sheets()[0] #read the sheet
nrows = table.nrows #get the number of total rows
regr = r'[\u4e00-\u9fa5a-zA-Z]+' # regular expression for CHN and ENG names
regr1 = r'[0-9]+' # for scores and percentage
score = {} # The dict: {AAA:{first:[1.0,8.0]},......}
group = {} # The dict: {first:[8,8,7.75,8],......}
for i in range(2,nrows):
target = table.cell(i,10).value # the details data [first(100%-8)]
person = table.cell(i,2).value # the Name data [AAA]
c = target.split(',') # If in details data there are more than
# one person, then split them
score[person] = {} # set an empty dict
for j in c:
d = re.findall(regr,j) # get the name
d = "".join(d) # transfer the list to string
value = re.findall(regr1,j) #get the score and percentage
value1 = int(value[0])/100 # get the percentage
value2 = '.'.join([x for x in value[1:]]) # get the score
value2 = float(value2) # change to float
group.setdefault(d,[]).append(value2)
score[person].setdefault(d,[value1,value2])
#This part is for calculating the group average
for key in group:
total = 0
length = len(group[key])
group[key] = [x for x in group[key]]
for x in group[key]:
total = total + x
group[key] = total/length
output = {} #set an empty dict to store output: {AAA:7.56,......}
#this part is for calculating the final score
for key in score:
average = 0
for subkey in score[key]:
average = score[key][subkey][0] * score[key][subkey][1]
*7.5/group[subkey] + average
output[key] = average
print(output) #print the output
Finally here: for the reason of secrecy, I can not provide the raw data. But, if you need, I can create the raw data. I am here because I feel that my code is tedious. I hope someone can help me to write it more elegantly. It's for future work.
-
1\$\begingroup\$ you should definetly check pandas and maybe check the modern pandas introduction by tom augspurger and also the apply function might be interesting. Additionaly, the split-apply-combine workflow is explained here \$\endgroup\$Quickbeam2k1– Quickbeam2k12016年11月29日 09:07:11 +00:00Commented Nov 29, 2016 at 9:07
-
\$\begingroup\$ can you flatten your data or rearrange it into a dataframe? Then several operations become more "natural" \$\endgroup\$Quickbeam2k1– Quickbeam2k12016年11月29日 09:11:29 +00:00Commented Nov 29, 2016 at 9:11
-
\$\begingroup\$ I know "pandas", but unfamiliar with it. I will check it. Thanks for your comment. @Quickbeam2k1 \$\endgroup\$helloswift123– helloswift1232016年11月29日 09:27:40 +00:00Commented Nov 29, 2016 at 9:27
2 Answers 2
Portability
I get a syntax error on these lines:
average = score[key][subkey][0] * score[key][subkey][1]
*7.5/group[subkey] + average
Perhaps your version of Python is more forgiving. The error goes away if I merge the 2 lines:
average = score[key][subkey][0] * score[key][subkey][1] *7.5/group[subkey] + average
Comments
Some of the comments are helpful, such as:
#This part is for calculating the group average
Although, it could be simplified as:
# Calculating the group average
Other comments can be simply deleted, such as:
print(output) #print the output
The comment merely states what is already obvious from the code. Also, because of all the empty space between the code and the comment, it is easy to miss the comment.
This comment looks helpful:
regr = r'[\u4e00-\u9fa5a-zA-Z]+' # regular expression for CHN and ENG names
However, I'm not sure what "CHN and ENG" means. Those terms can be explained using more details in the comment.
It would also be helpful to explain this part of the regular expression in the comments:
\u4e00-\u9fa5a
Naming
There are 2 similarly named variables: regr
and regr1
. It would be
better to give them more meaningful names, such as regr_name
and regr_score
.
c
is not a meaningful variable name in this context. Perhaps details
would be better.
Similarly for j
and d
.
The variables named value
, value1
, etc., are too generic.
value1
could be percentage
and value2
could be score
.
Simpler
This line:
total = total + x
is simpler using the special assignment operator:
total += x
As Quickbeam2k1 said nine years ago,
you should [definitely] check Pandas
Specifically, these loops:
for i in range(2,nrows):
for j in c:
for key in group:
for key in score:
will be slow for large data, and take more effort than proper use of Pandas.
Whether you use Pandas or not, definitely don't (i,10)
. Use column names instead of indices.
This could look like
import io
import pandas as pd
with io.StringIO(''',Name,Details
1,AAA,"first(100%-8)"
2,BBB,"first(50%-8),second(50%-8)"
3,CCC,"sixth(30%-8),seventh(60%-7),first(10%-7.75)"
4,DDD,"third(100%-6)"
5,EEE,"fifth(70%-7.5),second(30%-7.5)"
6,FFF,"first(70%-8),ninth(30%-6.75)"
''') as f:
data = pd.read_csv(f, index_col=0)
# Repeat Details for each name in the input row.
details = data['Details'].str.split(',').explode()
data, data['Details'] = data.align(details, axis='rows')
details_pat = r'''(?x)
^ # anchor to start
(?P<group> # named capture
\w+ # any word chars
)
\(
(?P<weight> # named capture
\d+ # any digits
)
%-
(?P<score> # named capture
[.\d]+ # any digits or point
)
\)
$ # anchor to end
'''
data = pd.concat(
(data['Name'], data['Details'].str.extract(details_pat)),
axis='columns',
)
data['weight'] = data['weight'].astype(float)*0.01
data['score'] = data['score'].astype(float)
mean = data.groupby('group')['score'].transform('mean')
data['output'] = data['score'] * 7.5/mean * data['weight']
output = data.groupby('group')['output'].sum()
print(output)
group
fifth 5.250000
first 17.362205
ninth 2.250000
second 6.048387
seventh 4.500000
sixth 2.250000
third 7.500000
Name: output, dtype: float64
Dot product
The inner multiply-and-sum can actually be represented as a dot
product. Unfortunately Pandas groupers do not support dot
so this approach partially needs to break vectorisation by using an apply
. Performance characteristics will probably differ and should be profiled at scale, but the output is the same.
import io
import pandas as pd
with io.StringIO(''',Name,Details
1,AAA,"first(100%-8)"
2,BBB,"first(50%-8),second(50%-8)"
3,CCC,"sixth(30%-8),seventh(60%-7),first(10%-7.75)"
4,DDD,"third(100%-6)"
5,EEE,"fifth(70%-7.5),second(30%-7.5)"
6,FFF,"first(70%-8),ninth(30%-6.75)"
''') as f:
data = pd.read_csv(f, index_col=0)
# Separate Details for each Name in the input row.
details = data['Details'].str.split(',').explode()
fields, fields['Details'] = data.align(details, axis='rows')
details_pat = r'''(?x)
(?P<group> # named capture
.* # any chars
)
\(
(?P<weight> # named capture
\d+ # any digits
)
%-
(?P<score> # named capture
\d* # digits before point
\.? # optional point
\d* # digits after point
)
\)
'''
fields = pd.concat(
(fields['Name'], fields['Details'].str.extract(details_pat)),
axis='columns',
)
fields['weight'] = fields['weight'].astype(float)*0.01
fields['score'] = fields['score'].astype(float)
def weighted_mean(group: pd.DataFrame) -> float:
return group['score'].dot(group['weight']) / group['score'].mean()
output = 7.5 * fields.groupby('group')[['score', 'weight']].apply(weighted_mean)
print(output)