TELUGUSCRIPTER/Day-5

📊 Python & Pandas Assignment: DataFrame Column Selection by Data Type


🏫 Assignment Title Page

  • Course: Data Analytics with Python & Pandas
  • Topic: Column Filtering using pandas.DataFrame.select_dtypes()
  • Assignment ID: PY-PD-01
  • Level: Beginner to Intermediate
  • Author: Gowtham (Workspace: Assigenmt - 1)
  • Date: May 2026

📝 1. Introduction

When working with real-world datasets, data often arrives in mixed formats containing numbers, strings, dates, and boolean flags. Before performing any mathematical computations, machine learning training, or text processing, a data scientist must isolate columns of specific types.

The Pandas library in Python provides a highly efficient, built-in method called pandas.DataFrame.select_dtypes() to solve this problem. Instead of manually looping over columns and checking their types one-by-one, select_dtypes() allows you to subset a DataFrame based on column data types using high-level groups or specific type classes.
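To make the contrast concrete, here is a minimal sketch that compares a hand-written column loop with a single select_dtypes() call. The small DataFrame and its column names are hypothetical and are not part of the assignment dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical illustration data (not the assignment dataset).
df = pd.DataFrame({
    "item": ["pen", "book", "lamp"],
    "price": [1.5, 12.0, 30.25],
    "quantity": [10, 3, 1],
})

# Manual approach: loop over the columns and test each dtype.
# Note: is_numeric_dtype also treats bool as numeric, whereas
# include="number" does not; this frame has no bool column,
# so both approaches agree here.
manual_cols = [col for col in df.columns if is_numeric_dtype(df[col])]
manual_df = df[manual_cols]

# Equivalent selection in one call.
selected_df = df.select_dtypes(include="number")

print(list(selected_df.columns))
```

Both approaches return the same two columns for this frame; the one-line version simply states the intent more directly.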


🎯 2. Objective

The primary objectives of this assignment are to:

  1. Understand how Pandas represents and tracks data types (dtypes) within a DataFrame.
  2. Master the usage of the pandas.DataFrame.select_dtypes() method.
  3. Learn how to selectively include or exclude columns based on their data types (numeric, object/string, and boolean).
  4. Build a professional, GitHub-ready Python project that showcases clean, well-commented code.

📖 3. Theory & Explanation

In Pandas, a DataFrame is composed of one or more Series (columns). Each column has an associated data type (dtype), which dictates what operations can be performed on it.

Common Pandas Data Types:

  • int64 / int32: Integer numbers (e.g., 29, 45, 100).
  • float64 / float32: Floating-point decimal numbers (e.g., 150.50, 89.20).
  • object: Generally represents text / strings, or mixed Python objects.
  • bool: Boolean values (True or False).
  • datetime64: Date and time values.
  • category: A fixed, limited set of possible values, often text labels (memory-efficient when values repeat).

The select_dtypes() method returns a subset of the DataFrame's columns based on the column data types. It checks each column's dtype against the criteria specified in the arguments and yields a new DataFrame with matching columns.
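The dtypes that select_dtypes() checks can be inspected directly. The sketch below assumes a small frame with a few of the assignment's column names plus a hypothetical SignupDate column added only to show a datetime dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [29, 34],
    "PurchaseAmount": [150.50, 89.20],
    "IsPremium": [True, False],
    # Hypothetical column, added only to illustrate a datetime dtype.
    "SignupDate": pd.to_datetime(["2026-01-05", "2026-02-10"]),
})

# .dtypes is a Series that maps each column name to its dtype.
print(df.dtypes)

# dtype.kind is a one-letter code for the type family:
# 'i' = signed integer, 'f' = float, 'b' = boolean, 'M' = datetime.
kinds = {col: dtype.kind for col, dtype in df.dtypes.items()}
print(kinds)
```

The exact integer width printed by df.dtypes (for example int64) can vary by platform, which is one reason the group name 'number' is often more convenient than a specific width.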


⚙️ 4. Syntax Explanation

The syntax of the select_dtypes() method is simple and expressive:

DataFrame.select_dtypes(include=None, exclude=None)

Both arguments are optional, but at least one of them must be provided.
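A short sketch of that rule follows. The helper try_select is hypothetical and exists only to capture the error instead of stopping the script. It also shows a related restriction: the same type cannot appear in both include and exclude.

```python
import pandas as pd

df = pd.DataFrame({"Age": [29, 34], "Name": ["Alice", "Bob"]})

def try_select(**kwargs):
    """Return the selected column names, or the error text if pandas rejects the call."""
    try:
        return list(df.select_dtypes(**kwargs).columns)
    except ValueError as exc:
        return f"ValueError: {exc}"

# Neither argument supplied: pandas raises ValueError.
print(try_select())

# One argument supplied: works as expected.
print(try_select(include="number"))

# The same type in both include and exclude: pandas raises ValueError.
print(try_select(include="number", exclude="number"))
```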


📋 5. Parameters Explanation

| Parameter | Type | Description |
| --- | --- | --- |
| include | scalar or list-like | A selection of dtypes or strings to be included in the result. |
| exclude | scalar or list-like | A selection of dtypes or strings to be excluded from the result. |

At least one of include or exclude must be supplied.

Valid Data Type Inputs (Strings or Objects):

  • To select numeric types, use 'number' or np.number (includes all integers and floats).
  • To select strings stored with the object dtype, use 'object' (this returns every object-dtype column, not only text).
  • To select booleans, use 'bool'.
  • To select datetimes, use 'datetime' or 'datetime64'.
  • To select categories, use 'category'.

⚠️ Note: You can pass lists of types, such as include=['int64', 'float64'] or include=['number', 'bool'].
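The following sketch shows a list of types, the np.number object in place of the 'number' string, and include combined with exclude in one call. The frame reuses a few column names from the assignment dataset with shortened data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [101, 102],
    "PurchaseAmount": [150.50, 89.20],
    "IsPremium": [True, False],
    "Name": ["Alice Johnson", "Bob Smith"],
})

# A list selects the union of the listed type groups.
num_and_bool = df.select_dtypes(include=["number", "bool"])

# The NumPy type object works the same way as the 'number' string.
numeric = df.select_dtypes(include=np.number)

# include and exclude can be combined: numeric columns that are not float64.
non_float_numeric = df.select_dtypes(include="number", exclude="float64")

print(list(num_and_bool.columns))
print(list(numeric.columns))
print(list(non_float_numeric.columns))
```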


💻 6. Example Programs & Outputs

Here is how select_dtypes() works across different scenarios. The following examples represent the actual code implemented in main.py.

🗃️ Sample Dataset

We define a sample customer database:

import pandas as pd

data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Evan Wright'],
    'Age': [29, 34, 45, 28, 52],
    'PurchaseAmount': [150.50, 89.20, 420.75, 310.00, 75.60],
    'IsPremium': [True, False, True, True, False],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia'],
    'NewsletterSubscribed': [True, True, False, True, False]
}

# Build the DataFrame that the examples refer to as df
df = pd.DataFrame(data)

🔹 Example 1: Selecting Numeric Columns (include='number')

Extracts integer and floating-point columns. Ideal before computing mathematical averages, sums, or correlations.

numeric_df = df.select_dtypes(include='number')

Output:

   CustomerID  Age  PurchaseAmount
0         101   29          150.50
1         102   34           89.20
2         103   45          420.75
3         104   28          310.00
4         105   52           75.60

🔹 Example 2: Selecting String / Object Columns (include='object')

Extracts text columns. Useful when cleaning strings, stripping whitespace, or encoding categorical variables for machine learning.

object_df = df.select_dtypes(include='object')

Output:

            Name    Country
0  Alice Johnson        USA
1      Bob Smith     Canada
2  Charlie Brown         UK
3   Diana Prince        USA
4    Evan Wright  Australia

🔹 Example 3: Selecting Boolean Columns (include='bool')

Extracts truth-value columns. Perfect for filtering customers, calculating conversion rates, or performing logical flag checks.

boolean_df = df.select_dtypes(include='bool')

Output:

   IsPremium  NewsletterSubscribed
0       True                  True
1      False                  True
2       True                 False
3       True                  True
4      False                 False

🔹 Example 4: Excluding Columns (exclude='number')

Returns every column whose dtype is not in the excluded set. In this case, excluding all numeric types keeps only the descriptive text columns and the boolean flags.

non_numeric_df = df.select_dtypes(exclude='number')

Output:

            Name  IsPremium    Country  NewsletterSubscribed
0  Alice Johnson       True        USA                  True
1      Bob Smith      False     Canada                  True
2  Charlie Brown       True         UK                 False
3   Diana Prince       True        USA                  True
4    Evan Wright      False  Australia                 False

🏒 7. Real-World Use Case

Imagine you are a Junior Data Scientist at an e-commerce platform preparing a raw transactions table for a Machine Learning Model.

  1. Model Inputs: Many Machine Learning algorithms (like Linear Regression or XGBoost) expect numeric input. Use .select_dtypes(include='number') to collect columns like Age and PurchaseAmount for the model. Note that it also returns numeric identifiers such as CustomerID, which should be dropped before training.
  2. Feature Engineering: You want to convert customer strings (Country) into dummy indicators (one-hot encoding). You isolate these variables using .select_dtypes(include='object') and then apply the encoding step to them.
  3. Auditing: You need to ensure no private text fields or raw labels slip into a mathematical calculation step. .select_dtypes(exclude='object') filters them out before the calculation runs.
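The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the raw frame is a shortened copy of the sample data, and the text columns are isolated with exclude=['number', 'bool'] as an assumption so the sketch does not depend on how a given pandas version stores strings.

```python
import pandas as pd

raw = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "Age": [29, 34, 45],
    "PurchaseAmount": [150.50, 89.20, 420.75],
    "Country": ["USA", "Canada", "UK"],
    "IsPremium": [True, False, True],
})

# 1. Model inputs: numeric columns, minus the identifier.
numeric = raw.select_dtypes(include="number").drop(columns=["CustomerID"])

# 2. Feature engineering: isolate the text columns and one-hot encode them.
text = raw.select_dtypes(exclude=["number", "bool"])
encoded = pd.get_dummies(text)

# Combine both parts into one feature table.
features = pd.concat([numeric, encoded], axis=1)

# 3. Auditing: confirm that no text column is left in the feature table.
leftover_text = features.select_dtypes(exclude=["number", "bool"])
print(list(features.columns))
print(leftover_text.shape[1])
```

Dropping CustomerID by name is a deliberate manual step here, since dtype-based selection cannot tell an identifier from a genuine numeric feature.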

🌟 8. Advantages of select_dtypes()

  1. Readability & Elegance: Replaces long loops and complex conditional statements with one clear, readable line of code.
  2. Speed & Efficiency: The type check is performed once per column on its dtype, not once per value, so the selection involves no per-row Python loop.
  3. Dynamic Flexibility: If your schema updates (e.g., a new numeric column is added to the source database), select_dtypes() handles it automatically without hardcoded column index adjustments.
  4. Error Prevention: Safeguards downstream processes (e.g., you won't accidentally try to compute the mathematical average of Name).
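The dynamic-flexibility and error-prevention points can be illustrated with a small sketch. The numeric_means helper and the LoyaltyPoints column are hypothetical names introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [29, 34]})

def numeric_means(frame):
    """Average every numeric column, whatever the columns are called."""
    return frame.select_dtypes(include="number").mean()

before = numeric_means(df)

# The schema changes: a new numeric column appears (hypothetical name).
df["LoyaltyPoints"] = [120, 80]

# The same function picks up the new column with no code change,
# and the Name column is never passed to mean().
after = numeric_means(df)

print(list(before.index))
print(list(after.index))
```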

🚀 9. Step-by-Step Execution Guide

Follow these steps to run the project locally on your machine.

🔌 Prerequisites

Ensure you have Python 3.8+ installed on your system.

📥 Step 1: Install Pandas

Run the following pip command in your terminal to install the necessary library:

pip install -r requirements.txt

Alternatively, you can install pandas directly:

pip install pandas

⚑ Step 2: Run the Project

Execute the main program using Python:

python main.py

🐙 10. Git & GitHub Pushing Instructions

Upload this professional project to your GitHub account to showcase your skills:

Step 1: Initialize Git Repository

git init

Step 2: Add Files to Staging Area

git add .

Step 3: Commit Changes

git commit -m "Initial commit"

Step 4: Rename Branch to main

git branch -M main

Step 5: Add Remote Repository

Replace <repo-url> with your actual GitHub repository URL (e.g., https://github.com/username/pandas-select-dtypes.git):

git remote add origin <repo-url>

Step 6: Push to GitHub

git push -u origin main

🏁 11. Conclusion

The pandas.DataFrame.select_dtypes() method is a fundamental utility in every data wrangler's toolkit. It streamlines the data preprocessing lifecycle, improves code clarity, and allows dynamic operations on columns based on their functional characteristics. Mastering this method is a key milestone for any aspiring Python developer or data scientist!

About

Data Cleaning and Pandas Fundamental
