\$\begingroup\$
\$\endgroup\$
I have a dataframe which contains information of programmers like: country, programming languages. etc:
COUNTRY PROGRAMMING_LANGUAGE
usa javascript
uk python;swift;kotlin
india python;ruby
usa c++;c;assembly;python
canada java;php;golang;ruby
angola python;c#
india c;java
brazil javascript;php
canada php;sql
india c#;java
brazil java;javascript
russia java;kotlin
china javascript
usa python;c;c++
india ruby
australia javascrit
india php;java
china swift;kotlin
russia php;sql
brazil firebase;kotlin
uk sql;firebase
canada python;c
portugal python;php
My program should display on a dataframe:
- All countries;
- How many people from each country use python;
COUNTRY KNOWS_PYTHON
usa 2
uk 1
india 1
angola 1
canada 1
portugal 1
russia 0
brazil 0
australia 0
china 0
Please share your opinion about my algorithm, in any possible way to improve it:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
df = pd.DataFrame({
"PROGRAMMER":np.arange(0,25),
"AGE":np.array([22,30,np.nan,25,19,27,28,26,33,18,14,np.nan,29,35,19,30,29,24,21,52,np.nan,24,np.nan,18,25],dtype=np.float16),
"COUNTRY":['uSa','Uk','india','usa','Canada','AngOla','India','braZil','canada','india','brazil','russia','china','usa','india',np.nan,'Australia','india','China','russia','brazil','uk','canada','portugal','ChiNa'],
"PROGRAMMING_LANGUAGE":['JAVASCRIPT','python;swift;kotlin','python;ruby','c++;c;assembly;python','java;php;golang;ruby','python;c#','c;java','javascript;php','php;sql','c#;java','java;javascript','java;kotlin','javascript','python;c;c++','ruby',np.nan,'javascrit','php;java','swift;kotlin','php;sql','firebase;kotlin','sql;firebase','python;C','python;php',np.nan],
"GENDER":['male','female','male','male','female','female',np.nan,'male','female','male','male','female','female',np.nan,'female','male','male','male','female','male',np.nan,'male','female','male','male'],
"LED_ZEPPELIN_FAN":['yes','YES','yes','yes','yes','yes','yes','yes','yes','yes','yes','yes','yes',np.nan,'yes','yes','yes','yes','yes','yes','yes','yes','yes','yes','yes'],
})
#Replacing NaN value as 'missing'
df = df.fillna("missing")
filt = (df['COUNTRY'] != "missing") & (df['PROGRAMMING_LANGUAGE'] != "missing")
table = df.loc[filt,['COUNTRY','PROGRAMMING_LANGUAGE']]
table = table.applymap(str.lower)
table
#This is just a list with all countries(without duplicates), and it will be used later
total_countries = list(set(table['COUNTRY']))
#Filter rows that contain python as programming language
filt = table['PROGRAMMING_LANGUAGE'].str.contains('python',na=False)
table_python = table.loc[filt,['COUNTRY','PROGRAMMING_LANGUAGE']]
#Getting all countries that have programmers that use python(without duplicates)
countries = table_python['COUNTRY'].value_counts().index.tolist()
#Getting the number of programmers from each country that use python(including duplicates from each country)
quantities = []
for i in range(0,len(countries)):
quantities.append(table_python['COUNTRY'].value_counts()[i])
#Comparing the list that contains all countries, with a list of countries that use python.
#If there is a country that doesn't have programmers that use python, these will be added to final with 0 as one of the values
for i in total_countries:
if i not in countries:
countries.append(i)
quantities.append(0)
table_python = pd.DataFrame({"COUNTRY":countries,"KNOWS_PYTHON":quantities})
table_python.set_index('COUNTRY',inplace=True)
table_python
irtexas19irtexas19
1 Answer 1
\$\begingroup\$
\$\endgroup\$
1
- The construction of your dataframe could be improved; your
PROGRAMMER
column looks like it should be the index, andnp.float16
is not a good representation for what looks to be integer data. - Not a good idea to
fillna
with a string and then compare to that string; instead operate on the NaN values directly - Should not be doing your own
list
,set
orloops
; this problem is fully vectorizable - Your
df['PROGRAMMING_LANGUAGE'] != "missing"
is counterproductive; rather than filtering away one country with a missing programming language, you'd want to count it as "0"
Suggested
import numpy as np
import pandas as pd
df = pd.DataFrame({
"PROGRAMMER": np.arange(0, 25),
"AGE": np.array(
[
22, 30, np.nan, 25, 19, 27, 28, 26, 33, 18, 14, np.nan, 29, 35, 19, 30, 29, 24, 21, 52,
np.nan, 24, np.nan, 18, 25,
],
dtype=np.float16,
),
"COUNTRY": [
'uSa', 'Uk', 'india', 'usa', 'Canada', 'AngOla', 'India', 'braZil', 'canada', 'india',
'brazil', 'russia', 'china', 'usa', 'india', np.nan, 'Australia', 'india', 'China',
'russia', 'brazil', 'uk', 'canada', 'portugal', 'ChiNa',
],
"PROGRAMMING_LANGUAGE": [
'JAVASCRIPT', 'python;swift;kotlin', 'python;ruby', 'c++;c;assembly;python',
'java;php;golang;ruby', 'python;c#', 'c;java', 'javascript;php', 'php;sql', 'c#;java',
'java;javascript', 'java;kotlin', 'javascript', 'python;c;c++', 'ruby', np.nan, 'javascrit',
'php;java', 'swift;kotlin', 'php;sql', 'firebase;kotlin', 'sql;firebase', 'python;C',
'python;php', np.nan,
],
"GENDER": [
'male', 'female', 'male', 'male', 'female', 'female', np.nan, 'male', 'female', 'male',
'male', 'female', 'female', np.nan, 'female', 'male', 'male', 'male', 'female', 'male',
np.nan, 'male', 'female', 'male', 'male',
],
"LED_ZEPPELIN_FAN": [
'yes', 'YES', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
np.nan, 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
],
})
# Get rid of all columns we don't care about for the final sum
df.drop(['PROGRAMMER', 'AGE', 'GENDER', 'LED_ZEPPELIN_FAN'], axis=1, inplace=True)
# Don't count countries that are undefined
df = df[df.COUNTRY.notna()]
# Standard case for both string columns
df['PROGRAMMING_LANGUAGE'] = df.PROGRAMMING_LANGUAGE.str.casefold()
df['COUNTRY'] = df.COUNTRY.str.title()
# Add a boolean column we'll use to apply a grouped sum
df['KNOWS_PYTHON'] = df.PROGRAMMING_LANGUAGE.str.contains('python', na=False)
sums = df.groupby(['COUNTRY']).sum()
sums.sort_values(by=['KNOWS_PYTHON'], ascending=False, inplace=True)
answered Jul 30, 2021 at 3:43
-
1\$\begingroup\$ pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html \$\endgroup\$Reinderien– Reinderien2021年07月30日 20:32:48 +00:00Commented Jul 30, 2021 at 20:32
default