Comparison with SQL#
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and NumPy as follows:
In [1]: import pandas as pd

In [2]: import numpy as np
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame called tips and assume we have a database table of the same name and structure.
In [3]: url = (
   ...:     "https://raw.githubusercontent.com/pandas-dev"
   ...:     "/pandas/main/pandas/tests/io/data/csv/tips.csv"
   ...: )
   ...:

In [4]: tips = pd.read_csv(url)

In [5]: tips
Out[5]:
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]
Copies vs. in place operations#
Most pandas operations return copies of the Series/DataFrame. To make the changes "stick", you’ll need to either assign to a new variable:
sorted_df = df.sort_values("col1")
or overwrite the original one:
df = df.sort_values("col1")
Note
You will see an inplace=True or copy=False keyword argument available for some methods:
df.replace(5, inplace=True)
There is an active discussion about deprecating and removing inplace and copy for most methods (e.g. dropna) except for a very small subset of methods (including replace). Neither keyword will be necessary in the context of Copy-on-Write. The proposal can be found here.
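As a minimal sketch of the copy behaviour described above, the snippet below sorts a small frame and checks that the original is left untouched. The three-row frame and its values are made up for illustration; only sort_values() comes from the text.

```python
import pandas as pd

# Hypothetical three-row frame used only for this illustration.
df = pd.DataFrame({"col1": [3, 1, 2]})

# sort_values returns a new DataFrame; df keeps its original row order.
sorted_df = df.sort_values("col1")

print(df["col1"].tolist())         # original order
print(sorted_df["col1"].tolist())  # sorted copy
```

Reassigning the result (df = df.sort_values("col1")) is what makes the change persist under the original name.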
SELECT#
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):
SELECT total_bill, tip, smoker, time
FROM tips;
With pandas, column selection is done by passing a list of column names to your DataFrame:
In [6]: tips[["total_bill", "tip", "smoker", "time"]]
Out[6]:
     total_bill   tip smoker    time
0         16.99  1.01     No  Dinner
1         10.34  1.66     No  Dinner
2         21.01  3.50     No  Dinner
3         23.68  3.31     No  Dinner
4         24.59  3.61     No  Dinner
..          ...   ...    ...     ...
239       29.03  5.92     No  Dinner
240       27.18  2.00    Yes  Dinner
241       22.67  2.00    Yes  Dinner
242       17.82  1.75     No  Dinner
243       18.78  3.00     No  Dinner

[244 rows x 4 columns]
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
In SQL, you can add a calculated column:
SELECT *, tip/total_bill AS tip_rate
FROM tips;
With pandas, you can use the DataFrame.assign() method of a DataFrame to append a new column:
In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
Out[7]:
     total_bill   tip     sex smoker   day    time  size  tip_rate
0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
1         10.34  1.66    Male     No   Sun  Dinner     3  0.160542
2         21.01  3.50    Male     No   Sun  Dinner     3  0.166587
3         23.68  3.31    Male     No   Sun  Dinner     2  0.139780
4         24.59  3.61  Female     No   Sun  Dinner     4  0.146808
..          ...   ...     ...    ...   ...     ...   ...       ...
239       29.03  5.92    Male     No   Sat  Dinner     3  0.203927
240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Male    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75    Male     No   Sat  Dinner     2  0.098204
243       18.78  3.00  Female     No  Thur  Dinner     2  0.159744

[244 rows x 8 columns]
WHERE#
Filtering in SQL is done via a WHERE clause.
SELECT *
FROM tips
WHERE time = 'Dinner';
DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing.
In [8]: tips[tips["total_bill"] > 10]
Out[8]:
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[227 rows x 7 columns]
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with True.
In [9]: is_dinner = tips["time"] == "Dinner"

In [10]: is_dinner
Out[10]:
0      True
1      True
2      True
3      True
4      True
       ...
239    True
240    True
241    True
242    True
243    True
Name: time, Length: 244, dtype: bool

In [11]: is_dinner.value_counts()
Out[11]:
time
True     176
False     68
Name: count, dtype: int64

In [12]: tips[is_dinner]
Out[12]:
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[176 rows x 7 columns]
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
Tips of more than $5 at Dinner meals:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
Out[13]:
     total_bill    tip     sex smoker  day    time  size
23        39.42   7.58    Male     No  Sat  Dinner     4
44        30.40   5.60    Male     No  Sun  Dinner     4
47        32.40   6.00    Male     No  Sun  Dinner     4
52        34.81   5.20  Female     No  Sun  Dinner     4
59        48.27   6.73    Male     No  Sat  Dinner     4
116       29.93   5.07    Male     No  Sun  Dinner     4
155       29.85   5.14  Female     No  Sun  Dinner     5
170       50.81  10.00    Male    Yes  Sat  Dinner     3
172        7.25   5.15    Male    Yes  Sun  Dinner     2
181       23.33   5.65    Male    Yes  Sun  Dinner     2
183       23.17   6.50    Male    Yes  Sun  Dinner     4
211       25.89   5.16    Male    Yes  Sat  Dinner     4
212       48.33   9.00    Male     No  Sat  Dinner     4
214       28.17   6.50  Female    Yes  Sat  Dinner     3
239       29.03   5.92    Male     No  Sat  Dinner     3
Tips by parties of at least 5 diners OR where the bill total was more than $45:
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[14]:
     total_bill    tip     sex smoker   day    time  size
59        48.27   6.73    Male     No   Sat  Dinner     4
125       29.80   4.20  Female     No  Thur   Lunch     6
141       34.30   6.70    Male     No  Thur   Lunch     6
142       41.19   5.00    Male     No  Thur   Lunch     5
143       27.05   5.00  Female     No  Thur   Lunch     6
155       29.85   5.14  Female     No   Sun  Dinner     5
156       48.17   5.00    Male     No   Sun  Dinner     6
170       50.81  10.00    Male    Yes   Sat  Dinner     3
182       45.35   3.50    Male    Yes   Sun  Dinner     3
185       20.69   5.00    Male     No   Sun  Dinner     5
187       30.46   2.00    Male    Yes   Sun  Dinner     5
212       48.33   9.00    Male     No   Sat  Dinner     4
216       28.15   3.00    Male    Yes   Sat  Dinner     5
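The combined filters above can be tried offline on a small stand-in table. The sketch below uses a made-up four-row frame named mini (not the real tips data) and also shows DataFrame.query() as one alternative spelling of the AND filter; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the tips table, for illustration only.
mini = pd.DataFrame(
    {
        "total_bill": [48.27, 29.80, 16.99, 29.85],
        "tip": [6.73, 4.20, 1.01, 5.14],
        "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
        "size": [4, 6, 2, 5],
    }
)

# Each comparison is parenthesized because & and | bind more tightly
# than == and > in Python.
dinner_big_tip = mini[(mini["time"] == "Dinner") & (mini["tip"] > 5.00)]

# The same AND filter written as a query string.
dinner_big_tip_q = mini.query("time == 'Dinner' and tip > 5.00")

# OR filter: large parties or expensive bills.
large_or_pricey = mini[(mini["size"] >= 5) | (mini["total_bill"] > 45)]

print(dinner_big_tip.index.tolist())
print(large_or_pricey.index.tolist())
```

Dropping the parentheses around a comparison usually raises an error or silently changes the meaning, so keeping them is the safer habit.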
NULL checking is done using the notna() and isna() methods.
In [15]: frame = pd.DataFrame(
   ....:     {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
   ....: )
   ....:

In [16]: frame
Out[16]:
  col1 col2
0    A    F
1    B  NaN
2  NaN    G
3    C    H
4    D    I
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]:
  col1 col2
1    B  NaN
Getting items where col1 IS NOT NULL can be done with notna().
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]:
  col1 col2
0    A    F
1    B  NaN
3    C    H
4    D    I
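A short sketch of why isna() and notna() are used rather than an equality test: comparing against np.nan with == matches nothing, much as col2 = NULL matches no rows in SQL. The frame is the one defined above; the mask names eq_mask and na_mask are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
)

# An equality test against NaN is False for every row, including the
# row where col2 is actually missing.
eq_mask = frame["col2"] == np.nan

# isna() flags the missing entry as expected.
na_mask = frame["col2"].isna()

print(int(eq_mask.sum()), int(na_mask.sum()))
```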
GROUP BY#
In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/
The pandas equivalent would be:
In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female     87
Male      157
dtype: int64
Notice that in the pandas code we used DataFrameGroupBy.size() and not DataFrameGroupBy.count(). This is because DataFrameGroupBy.count() applies the function to each column, returning the number of NOT NULL records within each.
In [20]: tips.groupby("sex").count()
Out[20]:
        total_bill  tip  smoker  day  time  size
sex
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157
Alternatively, we could have applied the DataFrameGroupBy.count() method to an individual column:
In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female     87
Male      157
Name: total_bill, dtype: int64
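The tips data has no missing values, so size() and count() agree there. The following sketch uses a small made-up frame with one missing tip to show where they diverge; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Female row has a missing tip.
df = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Male", "Male", "Male"],
        "tip": [1.01, np.nan, 1.66, 3.50, 3.31],
    }
)

# size() counts rows per group, comparable to COUNT(*).
group_sizes = df.groupby("sex").size()

# count() skips missing values, comparable to COUNT(tip).
tip_counts = df.groupby("sex")["tip"].count()

print(group_sizes.to_dict())
print(tip_counts.to_dict())
```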
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of the week. DataFrameGroupBy.agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/
In [22]: tips.groupby("day").agg({"tip": "mean", "day": "size"})
Out[22]:
           tip  day
day
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
Grouping by more than one column is done by passing a list of columns to the groupby() method.
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/
In [23]: tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})
Out[23]:
             tip
            size      mean
smoker day
No     Fri     4  2.812500
       Sat    45  3.102889
       Sun    57  3.167895
       Thur   45  2.673778
Yes    Fri    15  2.714000
       Sat    42  2.875476
       Sun    19  3.516842
       Thur   17  3.030000
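The dictionary form above labels results by the source column. As one possible alternative, agg() also accepts keyword arguments of the form new_name=(column, function), which resembles SQL column aliases. The sketch below runs on a made-up five-row frame; the output names avg_tip and n_rows are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the tips table.
mini = pd.DataFrame(
    {
        "day": ["Sun", "Sun", "Sat", "Sat", "Sat"],
        "tip": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# Roughly: SELECT day, AVG(tip) AS avg_tip, COUNT(*) AS n_rows
#          FROM mini GROUP BY day;
summary = mini.groupby("day").agg(
    avg_tip=("tip", "mean"),
    n_rows=("tip", "size"),
)

print(summary)
```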
JOIN#
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).
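The sections that follow use merge(); for completeness, here is a small sketch of the index-based join() mentioned above. The frames left and right and their values are made up. Note that DataFrame.join() defaults to a LEFT join, whereas merge() defaults to INNER.

```python
import pandas as pd

# Hypothetical frames keyed by their index rather than a column.
left = pd.DataFrame({"left_value": [1, 2, 3]}, index=["A", "B", "C"])
right = pd.DataFrame({"right_value": [20, 40]}, index=["B", "D"])

# Default: keep every row of `left`, filling unmatched rows with NaN.
left_joined = left.join(right)

# how="inner" keeps only index labels present on both sides.
inner_joined = left.join(right, how="inner")

print(left_joined)
print(inner_joined)
```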
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
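The warning above can be reproduced with two tiny made-up frames whose key columns each contain a missing value. Dropping null keys before merging is shown as one possible way to get SQL-like behaviour; it is a sketch, not the only approach.

```python
import numpy as np
import pandas as pd

# Hypothetical frames; each has one row with a missing key.
left = pd.DataFrame({"key": ["A", np.nan], "left_value": [1, 2]})
right = pd.DataFrame({"key": ["A", np.nan], "right_value": [10, 20]})

# pandas matches the two null keys against each other.
merged = pd.merge(left, right, on="key")

# Removing null keys first mimics SQL, where NULL never equals NULL.
sql_like = pd.merge(
    left.dropna(subset=["key"]),
    right.dropna(subset=["key"]),
    on="key",
)

print(len(merged), len(sql_like))
```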
In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})

In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
INNER JOIN#
SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.
In [27]: indexed_df2 = df2.set_index("key")

In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)
Out[28]:
  key   value_x   value_y
1   B -0.282863  1.212112
3   D -1.135632 -0.173215
3   D -1.135632  0.119209
LEFT OUTER JOIN#
Show all records from df1.
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
In [29]: pd.merge(df1, df2, on="key", how="left")
Out[29]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
RIGHT JOIN#
Show all records from df2.
SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;
In [30]: pd.merge(df1, df2, on="key", how="right")
Out[30]:
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
3   E       NaN -1.044236
FULL JOIN#
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in every RDBMS (for example, MySQL).
Show all records from both tables.
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
In [31]: pd.merge(df1, df2, on="key", how="outer")
Out[31]:
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236
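When inspecting an outer join it can help to know which side each row came from. The sketch below uses the indicator parameter of merge(), which adds a _merge column. The frames mirror the keys of df1 and df2 above, but with fixed made-up values in place of the random ones so the result is reproducible.

```python
import pandas as pd

# Same keys as df1/df2 above; the values are fixed placeholders.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5.0, 6.0, 7.0, 8.0]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)

origin_counts = outer["_merge"].value_counts()

print(outer)
```

Filtering on _merge == "left_only" afterwards is a common way to express an anti-join, although that goes beyond the joins covered here.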
UNION#
UNION ALL can be performed using concat().
In [32]: df1 = pd.DataFrame(
   ....:     {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
   ....: )
   ....:

In [33]: df2 = pd.DataFrame(
   ....:     {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
   ....: )
   ....:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/
In [34]: pd.concat([df1, df2])
Out[34]:
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
0        Chicago     1
1         Boston     4
2    Los Angeles     5
SQL’s UNION is similar to UNION ALL; however, UNION will remove duplicate rows.
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/
In pandas, you can use concat() in conjunction with drop_duplicates().
In [35]: pd.concat([df1, df2]).drop_duplicates()
Out[35]:
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
1         Boston     4
2    Los Angeles     5
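The outputs above keep the original row labels, so index values repeat. If a fresh 0..n-1 index is preferred, both concat() and drop_duplicates() accept ignore_index=True. A minimal sketch using the same two frames (the names union_all and union are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
)
df2 = pd.DataFrame(
    {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
)

# UNION ALL with a renumbered index.
union_all = pd.concat([df1, df2], ignore_index=True)

# UNION: drop the duplicate Chicago row and renumber again.
union = union_all.drop_duplicates(ignore_index=True)

print(union)
```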
LIMIT#
SELECT * FROM tips
LIMIT 10;
In [36]: tips.head(10)
Out[36]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
5       25.29  4.71    Male     No  Sun  Dinner     4
6        8.77  2.00    Male     No  Sun  Dinner     2
7       26.88  3.12    Male     No  Sun  Dinner     4
8       15.04  1.96    Male     No  Sun  Dinner     2
9       14.78  3.23    Male     No  Sun  Dinner     2
pandas equivalents for some SQL analytic and aggregate functions#
Top n rows with offset#
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [37]: tips.nlargest(10 + 5, columns="tip").tail(10)
Out[37]:
     total_bill   tip     sex smoker   day    time  size
183       23.17  6.50    Male    Yes   Sun  Dinner     4
214       28.17  6.50  Female    Yes   Sat  Dinner     3
47        32.40  6.00    Male     No   Sun  Dinner     4
239       29.03  5.92    Male     No   Sat  Dinner     3
88        24.71  5.85    Male     No  Thur   Lunch     2
181       23.33  5.65    Male    Yes   Sun  Dinner     2
44        30.40  5.60    Male     No   Sun  Dinner     4
52        34.81  5.20  Female     No   Sun  Dinner     4
85        34.83  5.17  Female     No  Thur   Lunch     4
211       25.89  5.16    Male    Yes   Sat  Dinner     4
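Another way to read LIMIT ... OFFSET is "sort, then slice by position". The sketch below compares the nlargest() form used above with sort_values() followed by iloc on a made-up column of distinct tips. The names limit and offset are illustrative, and with tied values the two approaches may order rows differently.

```python
import pandas as pd

# Hypothetical tips, all distinct so that ordering is unambiguous.
mini = pd.DataFrame({"tip": [1.0, 9.0, 3.0, 7.0, 5.0, 8.0, 2.0]})

limit, offset = 3, 2

# Form used above: take the top (limit + offset) rows, keep the last `limit`.
via_nlargest = mini.nlargest(limit + offset, columns="tip").tail(limit)

# Equivalent here: sort descending, then slice rows offset..offset+limit.
via_sort = mini.sort_values("tip", ascending=False).iloc[offset : offset + limit]

print(via_sort["tip"].tolist())
```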
Top n rows per group#
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [38]: (
   ....:     tips.assign(
   ....:         rn=tips.sort_values(["total_bill"], ascending=False)
   ....:         .groupby(["day"])
   ....:         .cumcount()
   ....:         + 1
   ....:     )
   ....:     .query("rn < 3")
   ....:     .sort_values(["day", "rn"])
   ....: )
   ....:
Out[38]:
     total_bill    tip     sex smoker   day    time  size  rn
95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
212       48.33   9.00    Male     No   Sat  Dinner     4   2
156       48.17   5.00    Male     No   Sun  Dinner     6   1
182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
142       41.19   5.00    Male     No  Thur   Lunch     5   2
The same result can be obtained using the rank(method='first') function:
In [39]: (
   ....:     tips.assign(
   ....:         rnk=tips.groupby(["day"])["total_bill"].rank(
   ....:             method="first", ascending=False
   ....:         )
   ....:     )
   ....:     .query("rnk < 3")
   ....:     .sort_values(["day", "rnk"])
   ....: )
   ....:
Out[39]:
     total_bill    tip     sex smoker   day    time  size  rnk
95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.0
90        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0
170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0
212       48.33   9.00    Male     No   Sat  Dinner     4  2.0
156       48.17   5.00    Male     No   Sun  Dinner     6  1.0
182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0
197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0
142       41.19   5.00    Male     No  Thur   Lunch     5  2.0
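When the row number itself is not needed in the output, a shorter pattern is to sort first and then take the first rows of each group with head(). This is a sketch on a made-up six-row frame, offered as an alternative to the two approaches above rather than a replacement.

```python
import pandas as pd

# Hypothetical bills for two days, with no ties.
mini = pd.DataFrame(
    {
        "day": ["Fri", "Sat", "Fri", "Sat", "Fri", "Sat"],
        "total_bill": [40.17, 50.81, 28.97, 48.33, 12.03, 20.65],
    }
)

# Sort by bill (largest first), keep the first two rows seen per day,
# then order the result by day for display.
top2 = (
    mini.sort_values("total_bill", ascending=False)
    .groupby("day")
    .head(2)
    .sort_values(["day", "total_bill"], ascending=[True, False])
)

print(top2)
```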
-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
Let’s find tips with (rank < 3) per gender group for (tips < 2). Notice that when using the rank(method='min') function, rnk_min remains the same for the same tip (as with Oracle’s RANK() function).
In [40]: (
   ....:     tips[tips["tip"] < 2]
   ....:     .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
   ....:     .query("rnk_min < 3")
   ....:     .sort_values(["sex", "rnk_min"])
   ....: )
   ....:
Out[40]:
     total_bill   tip     sex smoker  day    time  size  rnk_min
67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.0
92         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0
111        7.25  1.00  Female     No  Sat  Dinner     1      1.0
236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0
237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0
UPDATE#
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [41]: tips.loc[tips["tip"] < 2, "tip"] *= 2
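To see the effect of the update without loading the full dataset, here is a sketch on a made-up four-row frame. The boolean mask plays the role of the WHERE clause, and .loc restricts the assignment to the matching rows; the names mini and low are illustrative.

```python
import pandas as pd

# Hypothetical tips; two of them are below 2.
mini = pd.DataFrame({"tip": [1.0, 3.5, 1.5, 5.0]})

# WHERE tip < 2
low = mini["tip"] < 2

# SET tip = tip * 2, applied only to the selected rows.
mini.loc[low, "tip"] *= 2

print(mini["tip"].tolist())
```

Unlike most operations on this page, this assignment modifies the DataFrame itself, so no reassignment is needed.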
DELETE#
DELETE FROM tips
WHERE tip > 9;
In pandas we select the rows that should remain instead of deleting them:
In [42]: tips = tips.loc[tips["tip"] <= 9]
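One subtlety worth checking when a column can contain missing values: SQL's DELETE ... WHERE tip > 9 leaves NULL tips in place, while the filter tip <= 9 drops them, because a comparison with NaN is False. The tips data has no missing tips, so the line above is safe there. The sketch below, on a made-up frame, shows two alternatives that keep missing rows; all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical tips, including one missing value.
mini = pd.DataFrame({"tip": [1.0, 10.0, np.nan, 9.0]})

# Keeps only rows where the comparison is True; the NaN row is lost.
kept_le = mini.loc[mini["tip"] <= 9]

# Negating the DELETE condition keeps the NaN row, as SQL would.
kept_not = mini.loc[~(mini["tip"] > 9)]

# Dropping by index label removes exactly the rows matching the condition.
dropped = mini.drop(mini[mini["tip"] > 9].index)

print(kept_le.index.tolist(), kept_not.index.tolist())
```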