Preprocessing - Categorical Data


Categorical Data

When your data has categories represented by strings, it is difficult to use them to train machine learning models, which often accept only numeric data.

Instead of ignoring the categorical data and excluding that information from the model, you can transform the data so it can be used in your models.

Take a look at the table below. It is the same data set that we used in the multiple regression chapter.

Example

import pandas as pd

cars = pd.read_csv('data.csv')
print(cars.to_string())

Result

           Car       Model  Volume  Weight  CO2
0       Toyoty        Aygo    1000     790   99
1   Mitsubishi  Space Star    1200    1160   95
2        Skoda      Citigo    1000     929   95
3         Fiat         500     900     865   90
4         Mini      Cooper    1500    1140  105
5           VW         Up!    1000     929  105
6        Skoda       Fabia    1400    1109   90
7     Mercedes     A-Class    1500    1365   92
8         Ford      Fiesta    1500    1112   98
9         Audi          A1    1600    1150   99
10     Hyundai         I20    1100     980   99
11      Suzuki       Swift    1300     990  101
12        Ford      Fiesta    1000    1112   99
13       Honda       Civic    1600    1252   94
14      Hundai         I30    1600    1326   97
15        Opel       Astra    1600    1330   97
16         BMW           1    1600    1365   99
17       Mazda           3    2200    1280  104
18       Skoda       Rapid    1600    1119  104
19        Ford       Focus    2000    1328  105
20        Ford      Mondeo    1600    1584   94
21        Opel    Insignia    2000    1428   99
22    Mercedes     C-Class    2100    1365   99
23       Skoda     Octavia    1600    1415   99
24       Volvo         S60    2000    1415   99
25    Mercedes         CLA    1500    1465  102
26        Audi          A4    2000    1490  104
27        Audi          A6    2000    1725  114
28       Volvo         V70    1600    1523  109
29         BMW           5    2000    1705  114
30    Mercedes     E-Class    2100    1605  115
31       Volvo        XC70    2000    1746  117
32        Ford       B-Max    1600    1235  104
33         BMW         216    1600    1390  108
34        Opel      Zafira    1600    1405  109
35    Mercedes         SLK    2500    1395  120
 



In the multiple regression chapter, we tried to predict the CO2 emitted based on the volume of the engine and the weight of the car, but we excluded the information about the car brand and model.

The information about the car brand or the car model might help us make a better prediction of the CO2 emitted.


One Hot Encoding

We cannot use the Car or Model columns directly because they are not numeric. A linear relationship between a categorical variable (Car or Model) and a numeric variable (CO2) cannot be determined.

To fix this issue, we must have a numeric representation of the categorical variable. One way to do this is to have a column representing each group in the category.

For each column, the values will be 1 or 0 where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.
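To make the idea concrete, here is a minimal sketch of building the 1/0 columns by hand. The four-row data frame is a hypothetical stand-in for the real data, and the `Car_` prefix is chosen only to mirror the column names used later in this chapter.

```python
import pandas as pd

# Hypothetical mini data set; the brand values are for illustration only.
cars = pd.DataFrame({'Car': ['Ford', 'Audi', 'Ford', 'Skoda']})

# Build one 0/1 column per distinct brand:
# 1 where the row belongs to that brand, 0 otherwise.
manual = pd.DataFrame(index=cars.index)
for brand in sorted(cars['Car'].unique()):
    manual['Car_' + brand] = (cars['Car'] == brand).astype(int)

print(manual)
```

Each row ends up with exactly one 1, in the column of its own brand.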

You do not have to do this manually. The Python Pandas module has a function called get_dummies() which does one hot encoding.

Learn about the Pandas module in our Pandas Tutorial.

Example

One Hot Encode the Car column:

import pandas as pd

cars = pd.read_csv('data.csv')
ohe_cars = pd.get_dummies(cars[['Car']])

print(ohe_cars.to_string())

Result

 Car_Audi Car_BMW Car_Fiat Car_Ford Car_Honda Car_Hundai Car_Hyundai Car_Mazda Car_Mercedes Car_Mini Car_Mitsubishi Car_Opel Car_Skoda Car_Suzuki Car_Toyoty Car_VW Car_Volvo
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 6 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 7 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 8 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 9 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 10 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 11 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 12 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 13 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 14 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 15 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 16 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 17 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 18 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 19 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 20 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 21 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 22 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 23 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 25 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 26 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 27 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 29 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 30 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 32 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 33 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 34 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 35 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0



Results

A column was created for every car brand in the Car column.
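Depending on the installed pandas version, the new columns may print as True/False instead of 1/0, because pandas 2.0 changed the default dtype of get_dummies() to bool. The sketch below, which uses a small hypothetical data frame in place of data.csv, asks for integer output explicitly through the dtype argument and checks that every row has exactly one 1.

```python
import pandas as pd

# Hypothetical stand-in for the Car column of data.csv.
cars = pd.DataFrame({'Car': ['Ford', 'Audi', 'Ford', 'Skoda']})

# dtype=int requests 0/1 columns regardless of the pandas default.
ohe_cars = pd.get_dummies(cars[['Car']], dtype=int)
print(ohe_cars)

# Every row belongs to exactly one brand, so each row sums to 1.
print(ohe_cars.sum(axis=1).tolist())  # [1, 1, 1, 1]
```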



Predict CO2

We can use this additional information alongside the volume and weight to predict CO2.

To combine the information, we can use the concat() function from pandas.

First we need to import a couple of modules.

We will start by importing pandas.

import pandas

The pandas module allows us to read csv files and manipulate DataFrame objects:

cars = pandas.read_csv("data.csv")

It also allows us to create the dummy variables:

ohe_cars = pandas.get_dummies(cars[['Car']])

Then we must select the independent variables (X) and add the dummy variables columnwise.

Also store the dependent variable in y.

X = pandas.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)
y = cars['CO2']
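To see what the column-wise concat produces, the sketch below repeats the step on three rows copied from the table above (an assumption made to keep the example self-contained; the real code reads data.csv). The numeric columns come first, followed by one dummy column per brand.

```python
import pandas as pd

# Three rows copied from the table above, used instead of data.csv.
cars = pd.DataFrame({
    'Car': ['Ford', 'Audi', 'Ford'],
    'Volume': [1500, 1600, 1000],
    'Weight': [1112, 1150, 1112],
    'CO2': [98, 99, 99],
})

ohe_cars = pd.get_dummies(cars[['Car']], dtype=int)

# axis=1 places the frames side by side, matching rows by index.
X = pd.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)
y = cars['CO2']

print(X.columns.tolist())  # ['Volume', 'Weight', 'Car_Audi', 'Car_Ford']
print(X.shape)             # (3, 4)
```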

We also need to import the linear_model module from sklearn to create a linear model.

Learn about linear regression.

from sklearn import linear_model

Now we can fit the data to a linear regression:

regr = linear_model.LinearRegression()
regr.fit(X,y)

Finally we can predict the CO2 emissions based on the car's weight, volume, and manufacturer.

# Predict the CO2 emission of a VW where the volume is 2300cm3 and the weight is 1300kg (X lists Volume before Weight):
predictedCO2 = regr.predict([[2300, 1300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]])

Example

import pandas
from sklearn import linear_model

cars = pandas.read_csv("data.csv")
ohe_cars = pandas.get_dummies(cars[['Car']])

X = pandas.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)
y = cars['CO2']

regr = linear_model.LinearRegression()
regr.fit(X,y)

# Predict the CO2 emission of a VW where the volume is 2300cm3 and the weight is 1300kg (X lists Volume before Weight):
predictedCO2 = regr.predict([[2300, 1300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]])

print(predictedCO2)

Result

 [122.45153299]
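Typing the row of zeros by hand is easy to get wrong, since the position of the 1 must match the brand's column. One possible alternative is to describe the new car by column name and let pandas fill in the remaining brand columns. This is a sketch under assumptions: it uses six rows copied from the table above instead of data.csv, so its prediction will differ from the result shown, and the variable name new_car is illustrative.

```python
import pandas as pd
from sklearn import linear_model

# Six rows copied from the table above, used instead of data.csv.
cars = pd.DataFrame({
    'Car':    ['Ford', 'Audi', 'VW', 'Ford', 'Audi', 'Skoda'],
    'Volume': [1500, 1600, 1000, 2000, 2000, 1400],
    'Weight': [1112, 1150, 929, 1328, 1490, 1109],
    'CO2':    [98, 99, 105, 105, 104, 90],
})

ohe_cars = pd.get_dummies(cars[['Car']], dtype=int)
X = pd.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)
y = cars['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Name only the columns that are non-zero for the new car.
new_car = pd.DataFrame([{'Volume': 2300, 'Weight': 1300, 'Car_VW': 1}])

# Reorder to the training columns; brand columns not listed become 0.
new_car = new_car.reindex(columns=X.columns, fill_value=0)

print(new_car.to_string())
print(regr.predict(new_car))
```

Because the row is built from X.columns, it keeps working if brands are added to or removed from the data.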



We now have a coefficient for the volume, the weight, and each car brand in the data set.
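The coefficients can be inspected by pairing the fitted model's coef_ array with the column names of X. The sketch below again assumes six rows copied from the table above rather than the full data.csv, so the printed numbers are not those of the full model.

```python
import pandas as pd
from sklearn import linear_model

# Six rows copied from the table above, used instead of data.csv.
cars = pd.DataFrame({
    'Car':    ['Ford', 'Audi', 'VW', 'Ford', 'Audi', 'Skoda'],
    'Volume': [1500, 1600, 1000, 2000, 2000, 1400],
    'Weight': [1112, 1150, 929, 1328, 1490, 1109],
    'CO2':    [98, 99, 105, 105, 104, 90],
})

ohe_cars = pd.get_dummies(cars[['Car']], dtype=int)
X = pd.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)
y = cars['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# One coefficient per column of X, in the same order as X.columns.
coefs = pd.Series(regr.coef_, index=X.columns)
print(coefs)
print(regr.intercept_)
```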


Dummifying

It is not necessary to create one column for each group in your category. The information can be retained using one column fewer than the number of groups you have.

For example, suppose you have a column representing colors, and that column contains two colors, red and blue.

Example

import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red']})

print(colors)

Result

 color
 0 blue
 1 red



You can create 1 column called red where 1 represents red and 0 represents not red, which means it is blue.

To do this, we can use the same function that we used for one hot encoding, get_dummies, and then drop one of the columns. There is an argument, drop_first, which allows us to exclude the first column from the resulting table.

Example

import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red']})
dummies = pd.get_dummies(colors, drop_first=True)

print(dummies)

Result

 color_red
 0 0
 1 1



What if you have more than 2 groups? How can the multiple groups be represented by 1 less column?

Let's say we have three colors this time: red, blue and green. When we call get_dummies with drop_first=True, we get the following table.

Example

import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red', 'green']})
dummies = pd.get_dummies(colors, drop_first=True)
dummies['color'] = colors['color']

print(dummies)

Result

 color_green color_red color
 0 0 0 blue
 1 0 1 red
 2 1 0 green
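In the table above, blue is the row where both color_green and color_red are 0, so no information is lost by dropping its column. As a small sketch of that reasoning, the dropped column can be rebuilt from the remaining ones (dtype=int is passed so the columns hold 0/1 whatever the pandas version).

```python
import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red', 'green']})
dummies = pd.get_dummies(colors, drop_first=True, dtype=int)

# The dropped group is the one where every remaining dummy is 0,
# so its indicator is 1 minus the sum of the other columns.
dummies['color_blue'] = 1 - dummies[['color_green', 'color_red']].sum(axis=1)

print(dummies['color_blue'].tolist())  # [1, 0, 0]
```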




Copyright 1999-2025 by Refsnes Data. All Rights Reserved. W3Schools is Powered by W3.CSS.
