📊 Python & Pandas Assignment: DataFrame Column Selection by Data Type

🏫 Assignment Title Page

Course: Data Analytics with Python & Pandas
Topic: Column Filtering using pandas.DataFrame.select_dtypes()
Assignment ID: PY-PD-01
Level: Beginner to Intermediate
Author: Gowtham (Workspace: Assigenmt - 1)
Date: May 2026

📝 1. Introduction

When working with real-world datasets, data often arrives in mixed formats containing numbers, strings, dates, and boolean flags. Before performing any mathematical computations, machine learning training, or text processing, a data scientist must isolate columns of specific types.

The Pandas library in Python provides a highly efficient, built-in method called pandas.DataFrame.select_dtypes() to solve this problem. Instead of manually looping over columns and checking their types one-by-one, select_dtypes() allows you to subset a DataFrame based on column data types using high-level groups or specific type classes.

🎯 2. Objective

The primary objectives of this assignment are to:

Understand how Pandas represents and tracks data types (dtypes) within a DataFrame.
Master the usage of the pandas.DataFrame.select_dtypes() method.
Learn how to selectively include or exclude columns based on their data types (numeric, object/string, and boolean).
Build a professional, GitHub-ready Python project that showcases clean, well-commented code.

📖 3. Theory & Explanation

In Pandas, a DataFrame is composed of one or more Series (columns). Each column has an associated data type (dtype), which dictates what operations can be performed on it.

Common Pandas Data Types:

int64 / int32: Integer numbers (e.g., 29, 45, 100).
float64 / float32: Floating-point decimal numbers (e.g., 150.50, 89.20).
object: Generic Python objects. Pandas has traditionally used this dtype for text (strings) and for columns holding mixed types.
bool: Boolean values (True or False).
datetime64: Date and time values.
category: A fixed set of possible values (often text labels), stored compactly; efficient when values repeat.

The select_dtypes() method returns a subset of the DataFrame’s columns based on the column data types. It checks each column's dtype against the criteria specified in the arguments and yields a new DataFrame with matching columns.
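As a rough illustration of the per-column dtype check described above, the sketch below compares a hand-written loop over column dtypes with the equivalent select_dtypes() call. The three-column frame is a hypothetical cut-down version of the sample data, used only for this example.

```python
import pandas as pd

# Hypothetical three-column frame, used only for illustration.
df = pd.DataFrame({
    "Age": [29, 34],
    "PurchaseAmount": [150.50, 89.20],
    "IsPremium": [True, False],
})

# .dtypes returns a Series mapping each column name to its dtype.
print(df.dtypes)

# Manual approach: inspect each column's dtype kind
# ('i' = signed int, 'u' = unsigned int, 'f' = float).
manual_numeric = [col for col in df.columns if df[col].dtype.kind in "iuf"]

# Equivalent selection with select_dtypes().
selected_numeric = list(df.select_dtypes(include="number").columns)

print(manual_numeric)
print(selected_numeric)
```

Both lists contain the same column names here; the boolean column is left out in each case.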

⚙️ 4. Syntax Explanation

The syntax of the select_dtypes() method is simple and expressive:

DataFrame.select_dtypes(include=None, exclude=None)

Both arguments are optional, but at least one of them must be provided.
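A small sketch of what this rule means in practice, assuming a hypothetical two-column frame: calling the method with neither argument raises a ValueError, and so does naming the same dtype in both include and exclude.

```python
import pandas as pd

# Hypothetical frame, used only to trigger the errors.
df = pd.DataFrame({"Age": [29, 34], "IsPremium": [True, False]})

errors = []

# Calling with neither argument is rejected.
try:
    df.select_dtypes()
except ValueError as exc:
    errors.append(str(exc))

# Naming the same dtype in both arguments is rejected as well.
try:
    df.select_dtypes(include="number", exclude="number")
except ValueError as exc:
    errors.append(str(exc))

for message in errors:
    print("ValueError:", message)
```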

📋 5. Parameters Explanation

Parameter	Type	Description
`include`	scalar or list-like	A selection of dtypes or strings to be included in the result. At least one of `include` or `exclude` must be supplied.
`exclude`	scalar or list-like	A selection of dtypes or strings to be excluded from the result.

Valid Data Type Inputs (Strings or Objects):

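To select numeric types, use 'number' or np.number (covers all integer and floating-point dtypes; boolean columns are not included).
To select strings, use 'object'. This matches columns stored with the object dtype, which is how pandas has traditionally held text. If your pandas version stores text in a dedicated string dtype, check df.dtypes first, because such columns do not report as object.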
To select booleans, use 'bool'.
To select datetimes, use 'datetime' or 'datetime64'.
To select categories, use 'category'.

⚠️ Note: You can pass lists of types, such as include=['int64', 'float64'] or include=['number', 'bool'].
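The sketch below shows list inputs and a combination of include and exclude in one call. The frame and the SignupDate column are hypothetical additions made for illustration; they are not part of the assignment's dataset.

```python
import pandas as pd

# Hypothetical frame; SignupDate is an invented column for this sketch.
df = pd.DataFrame({
    "CustomerID": [101, 102],
    "PurchaseAmount": [150.50, 89.20],
    "IsPremium": [True, False],
    "SignupDate": pd.to_datetime(["2026-01-05", "2026-02-10"]),
})

# A list selects columns matching any of the listed dtypes.
number_or_bool = df.select_dtypes(include=["number", "bool"])

# 'datetime' picks up datetime64 columns.
dates = df.select_dtypes(include="datetime")

# include and exclude can be combined: numeric columns except float64.
ints_only = df.select_dtypes(include="number", exclude="float64")

print(list(number_or_bool.columns))
print(list(dates.columns))
print(list(ints_only.columns))
```

The result keeps the original column order, not the order in which the dtypes were listed.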

💻 6. Example Programs & Outputs

Here is how select_dtypes() works across different scenarios. The following examples represent the actual code implemented in main.py.

🗃️ Sample Dataset

We define a sample customer database:

import pandas as pd

data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Evan Wright'],
    'Age': [29, 34, 45, 28, 52],
    'PurchaseAmount': [150.50, 89.20, 420.75, 310.00, 75.60],
    'IsPremium': [True, False, True, True, False],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia'],
    'NewsletterSubscribed': [True, True, False, True, False]
}

# Build the DataFrame that the examples below refer to as df.
df = pd.DataFrame(data)

🔹 Example 1: Selecting Numeric Columns (`include='number'`)

Extracts integer and floating-point columns. Ideal before computing mathematical averages, sums, or correlations.

numeric_df = df.select_dtypes(include='number')

Output:

   CustomerID  Age  PurchaseAmount
0         101   29          150.50
1         102   34           89.20
2         103   45          420.75
3         104   28          310.00
4         105   52           75.60
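To illustrate the "before computing averages" point, here is a minimal sketch that takes the numeric selection and computes column means. Dropping CustomerID first is a choice made for this sketch, since averaging an identifier is not meaningful.

```python
import pandas as pd

# Columns copied from the sample dataset (text columns omitted for brevity).
df = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105],
    "Age": [29, 34, 45, 28, 52],
    "PurchaseAmount": [150.50, 89.20, 420.75, 310.00, 75.60],
    "IsPremium": [True, False, True, True, False],
})

numeric_df = df.select_dtypes(include="number")

# CustomerID is numeric but is only an identifier, so drop it before averaging.
measures = numeric_df.drop(columns=["CustomerID"])
means = measures.mean()

print(means)
```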

🔹 Example 2: Selecting String / Object Columns (`include='object'`)

Extracts text columns. Highly useful when cleaning strings, stripping whitespaces, or encoding categorical variables for machine learning.

object_df = df.select_dtypes(include='object')

Output:

            Name    Country
0  Alice Johnson        USA
1      Bob Smith     Canada
2  Charlie Brown         UK
3   Diana Prince        USA
4    Evan Wright  Australia
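As one possible cleaning step on the selected text columns, the sketch below strips stray whitespace. The padded values are invented for illustration, and the astype call is an assumption added so the text columns are stored as object regardless of the pandas version in use.

```python
import pandas as pd

# Hypothetical rows with deliberately padded strings.
df = pd.DataFrame({
    "Name": ["  Alice Johnson", "Bob Smith  "],
    "Country": [" USA ", "Canada"],
    "Age": [29, 34],
}).astype({"Name": object, "Country": object})  # force object dtype for text

# Find the text columns by dtype instead of listing them by hand.
text_cols = df.select_dtypes(include="object").columns

# Strip leading and trailing whitespace in every text column.
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

print(df)
```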

🔹 Example 3: Selecting Boolean Columns (`include='bool'`)

Extracts truth-value columns. Perfect for filtering customers, calculating conversion rates, or performing logical flag checks.

boolean_df = df.select_dtypes(include='bool')

Output:

   IsPremium  NewsletterSubscribed
0       True                  True
1      False                  True
2       True                 False
3       True                  True
4      False                 False
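A short sketch of the rate calculation and flag filtering mentioned above: because True counts as 1 and False as 0, the mean of a boolean column is the share of True values.

```python
import pandas as pd

# Boolean columns copied from the sample dataset, plus Age for context.
df = pd.DataFrame({
    "Age": [29, 34, 45, 28, 52],
    "IsPremium": [True, False, True, True, False],
    "NewsletterSubscribed": [True, True, False, True, False],
})

boolean_df = df.select_dtypes(include="bool")

# Share of True values per flag.
rates = boolean_df.mean()

# Flags can also be combined to filter rows.
premium_subscribers = df[df["IsPremium"] & df["NewsletterSubscribed"]]

print(rates)
print(premium_subscribers)
```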

🔹 Example 4: Excluding Columns (`exclude='number'`)

Extracts all columns except the specified ones. In this case, we exclude all numeric types to keep only descriptive/categorical properties.

non_numeric_df = df.select_dtypes(exclude='number')

Output:

            Name  IsPremium    Country  NewsletterSubscribed
0  Alice Johnson       True        USA                  True
1      Bob Smith      False     Canada                  True
2  Charlie Brown       True         UK                 False
3   Diana Prince       True        USA                  True
4    Evan Wright      False  Australia                 False

🏢 7. Real-World Use Case

Imagine you are a Junior Data Scientist at an e-commerce platform preparing a raw transactions table for a Machine Learning Model.

Model Inputs: Many Machine Learning algorithms (like Linear Regression or XGBoost) expect numeric inputs. You can use .select_dtypes(include='number') to gather columns like Age and PurchaseAmount for the model. Note that identifier columns such as CustomerID are also numeric and should be dropped before training.
Feature Engineering: You want to convert customer strings (Country) into dummy indicators (one-hot encoding). You isolate these variables using .select_dtypes(include='object') to apply preprocessing algorithms.
Auditing: You need to ensure no free-text fields or raw labels slip into a mathematical calculation step. .select_dtypes(exclude='object') guards against such errors.
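The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the use of pd.get_dummies for one-hot encoding, the decision to drop CustomerID and Name, and the astype call that forces object storage for text are all choices made for this example.

```python
import pandas as pd

# Columns copied from the sample dataset.
df = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105],
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Evan Wright"],
    "Age": [29, 34, 45, 28, 52],
    "PurchaseAmount": [150.50, 89.20, 420.75, 310.00, 75.60],
    "Country": ["USA", "Canada", "UK", "USA", "Australia"],
}).astype({"Name": object, "Country": object})  # force object dtype for text

# Model inputs: numeric columns, minus the identifier.
numeric_features = df.select_dtypes(include="number").drop(columns=["CustomerID"])

# Feature engineering: one-hot encode the text column(s) worth keeping.
categorical = df.select_dtypes(include="object").drop(columns=["Name"])
encoded = pd.get_dummies(categorical)

# Combine both parts into a single feature table.
features = pd.concat([numeric_features, encoded], axis=1)

# Auditing: confirm that no object columns remain in the numeric step.
audit = df.select_dtypes(exclude="object")

print(features.columns.tolist())
print(audit.columns.tolist())
```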

🌟 8. Advantages of `select_dtypes()`

Readability & Elegance: Replaces long loops and complex conditional statements with one clear, readable line of code.
Speed & Efficiency: The method inspects each column's dtype metadata rather than individual values, so the selection itself does not need to scan the rows of the DataFrame.
Dynamic Flexibility: If your schema updates (e.g., a new numeric column is added to the source database), select_dtypes() handles it automatically without hardcoded column index adjustments.
Error Prevention: Safeguards downstream processes (e.g., you won't accidentally try to compute the mathematical average of Name).
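To illustrate the "Dynamic Flexibility" point, the sketch below adds a new numeric column and shows that the same selection picks it up with no code changes. The helper numeric_columns and the LoyaltyPoints column are hypothetical names introduced only for this example.

```python
import pandas as pd


def numeric_columns(frame: pd.DataFrame) -> list:
    """Return the names of all numeric columns (hypothetical helper)."""
    return list(frame.select_dtypes(include="number").columns)


df = pd.DataFrame({
    "Age": [29, 34],
    "IsPremium": [True, False],
})

before = numeric_columns(df)

# Simulate a schema change: a new numeric column appears in the source data.
df["LoyaltyPoints"] = [120, 80]

after = numeric_columns(df)

print(before)
print(after)
```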

🚀 9. Step-by-Step Execution Guide

Follow these steps to run the project locally on your machine.

🔌 Prerequisites

Ensure you have Python 3.8+ installed on your system.

📥 Step 1: Install Pandas

Run the following pip command in your terminal to install the necessary library:

pip install -r requirements.txt

Alternatively, you can install pandas directly:

pip install pandas

⚡ Step 2: Run the Project

Execute the main program using Python:

python main.py

🐙 10. Git & GitHub Pushing Instructions

Upload this professional project to your GitHub account to showcase your skills:

Step 1: Initialize Git Repository

git init

Step 2: Add Files to Staging Area

git add .

Step 3: Commit Changes

git commit -m "Initial commit"

Step 4: Rename Branch to `main`

git branch -M main

Step 5: Add Remote Repository

Replace <repo-url> with your actual GitHub repository URL (e.g., https://github.com/username/pandas-select-dtypes.git):

git remote add origin <repo-url>

Step 6: Push to GitHub

git push -u origin main

🏁 11. Conclusion

The pandas.DataFrame.select_dtypes() method is a fundamental utility in every data wrangler's toolkit. It streamlines the data preprocessing lifecycle, improves code clarity, and allows dynamic operations on columns based on their functional characteristics. Mastering this method is a key milestone for any aspiring Python developer or data scientist!
