Name	Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github	.github
assets	assets
docs	docs
test	test
utils	utils
.clinerules	.clinerules
.cursorrules	.cursorrules
.gitignore	.gitignore
.goosehints	.goosehints
.windsurfrules	.windsurfrules
CLAUDE.md	CLAUDE.md
GEMINI.md	GEMINI.md
README.md	README.md
data_profiling_report.md	data_profiling_report.md
flow.py	flow.py
main.py	main.py
nodes.py	nodes.py
requirements.txt	requirements.txt

PocketFlow Data Profiling Tool

An intelligent data profiling tool powered by LLMs that provides deep, contextual analysis of your datasets beyond traditional statistical metrics.

🎯 What This Tool Does

This tool performs comprehensive data profiling through a 7-step workflow:

Duplicate Detection - Identifies and analyzes duplicate rows with recommendations
Table Summary - Generates high-level description of what your data represents
Column Descriptions - Analyzes each column with meaningful descriptions and naming suggestions
Data Type Analysis - Recommends optimal data types for each column
Missing Values Analysis - Categorizes missing values as meaningful vs problematic
Uniqueness Analysis - Identifies potential unique identifier columns
Unusual Values Detection - Detects outliers, anomalies, and data quality issues

🚀 How to Run

Prerequisites

Install dependencies:

pip install -r requirements.txt

Set up your LLM:

The tool uses OpenAI by default. Set your API key:

export OPENAI_API_KEY="your-key-here"

To use your own LLM or different providers, check out the PocketFlow LLM documentation and modify utils/call_llm.py accordingly.

Test your LLM setup:

python utils/call_llm.py

Running the Tool

python main.py

By default, it analyzes the sample patient dataset in test/patients.csv. To analyze your own data, modify main.py:

# Replace this line:
df = pd.read_csv("test/patients.csv")
# With your data:
df = pd.read_csv("path/to/your/data.csv")

Output

The tool generates:

Console summary with key statistics
Markdown report saved as data_profiling_report.md with comprehensive analysis

📊 Example Results

From the sample patient dataset (60 rows, 27 columns):

✅ Detected invalid SSN formats (test data with "999" prefix)
✅ Identified name contamination (numeric suffixes in names)
✅ Found meaningful missing patterns (83% missing death dates = living patients)
✅ Recommended data type conversions (dates to datetime64, categories for demographics)
✅ Identified unique identifiers (UUID primary key, SSN)

🏗️ Architecture

Built with PocketFlow - a minimalist LLM framework:

Workflow pattern for sequential processing pipeline
BatchNode for efficient parallel column analysis
YAML-based structured outputs with validation
Intelligent LLM analysis for contextual understanding

📁 Project Structure

├── main.py # Entry point
├── flow.py # Flow orchestrator
├── nodes.py # All profiling nodes
├── utils/
│ └── call_llm.py # LLM utility (customize for your provider)
├── test/
│ └── patients.csv # Sample dataset
└── docs/
 └── design.md # Design documentation

🔧 Customization

Using Different LLM Providers

Edit utils/call_llm.py to use your preferred LLM:

Claude (Anthropic)
Google Gemini
Azure OpenAI
Local models (Ollama)

See the PocketFlow LLM guide for examples.

Analyzing Different Data Types

The tool works with any pandas DataFrame. You can:

Load from CSV, Excel, JSON, Parquet
Connect to databases
Use API data

Just ensure your data is loaded as a pandas DataFrame before running the flow.

🎓 Tutorial

This project demonstrates Agentic Coding with PocketFlow. Want to learn more?

Check out the Agentic Coding Guidance
Watch the YouTube Tutorial

📝 License

This project is a tutorial example for PocketFlow.

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

The-Pocket/PocketFlow-Tutorial-Data-Profiler

Folders and files

Latest commit

History

Repository files navigation

PocketFlow Data Profiling Tool

🎯 What This Tool Does

🚀 How to Run

Prerequisites

Running the Tool

Output

📊 Example Results

🏗️ Architecture

📁 Project Structure

🔧 Customization

Using Different LLM Providers

Analyzing Different Data Types

🎓 Tutorial

📝 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Languages

The-Pocket/PocketFlow-Tutorial-Data-Profiler

Folders and files

Latest commit

History

Repository files navigation

PocketFlow Data Profiling Tool

🎯 What This Tool Does

🚀 How to Run

Prerequisites

Running the Tool

Output

📊 Example Results

🏗️ Architecture

📁 Project Structure

🔧 Customization

Using Different LLM Providers

Analyzing Different Data Types

🎓 Tutorial

📝 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages