The-Pocket/PocketFlow-Tutorial-Data-Profiler

PocketFlow Data Profiling Tool

An intelligent data profiling tool powered by LLMs that provides deep, contextual analysis of your datasets beyond traditional statistical metrics.

🎯 What This Tool Does

This tool performs comprehensive data profiling through a 7-step workflow:

  1. Duplicate Detection - Identifies and analyzes duplicate rows with recommendations
  2. Table Summary - Generates a high-level description of what your data represents
  3. Column Descriptions - Analyzes each column with meaningful descriptions and naming suggestions
  4. Data Type Analysis - Recommends optimal data types for each column
  5. Missing Values Analysis - Categorizes missing values as meaningful vs problematic
  6. Uniqueness Analysis - Identifies potential unique identifier columns
  7. Unusual Values Detection - Detects outliers, anomalies, and data quality issues
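Several of these steps can start from plain statistics that pandas computes before any LLM call. The sketch below is an assumption about what such inputs might look like, not the tool's actual code: `basic_signals` is a hypothetical helper name, and the sample frame is made up for illustration.

```python
import pandas as pd


def basic_signals(df: pd.DataFrame) -> dict:
    """Compute simple statistics relevant to steps 1, 5, and 6 (hypothetical helper)."""
    n_rows = len(df)
    return {
        # Step 1: rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Step 5: fraction of missing values per column
        "missing_fraction": df.isna().mean().round(3).to_dict(),
        # Step 6: columns with no missing values and one distinct value per row
        "candidate_ids": [
            col
            for col in df.columns
            if n_rows > 0 and df[col].notna().all() and df[col].nunique() == n_rows
        ],
    }


# Made-up sample data, for illustration only
sample = pd.DataFrame(
    {
        "id": [1, 2, 3, 3],
        "name": ["Ann", "Bob", None, None],
        "city": ["Oslo", "Oslo", "Rome", "Rome"],
    }
)
signals = basic_signals(sample)
print(signals)
```

Numbers like these only say that something is duplicated or missing. Deciding whether that is meaningful or problematic is the contextual judgment the tool delegates to the LLM.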

🚀 How to Run

Prerequisites

  1. Install dependencies:
pip install -r requirements.txt
  2. Set up your LLM:

The tool uses OpenAI by default. Set your API key:

export OPENAI_API_KEY="your-key-here"

To use your own LLM or different providers, check out the PocketFlow LLM documentation and modify utils/call_llm.py accordingly.

Test your LLM setup:

python utils/call_llm.py
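If the test call fails, a missing key is a common cause. As a minimal sketch, a check like the one below could run before any request is made. `require_api_key` is a hypothetical helper and is not part of the repository.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or raise with a readable hint."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it before running the profiler."
        )
    return key
```

Calling `require_api_key()` at the top of an entry point turns a confusing provider error into an immediate, explicit message.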

Running the Tool

python main.py

By default, it analyzes the sample patient dataset in test/patients.csv. To analyze your own data, modify main.py:

# Replace this line:
df = pd.read_csv("test/patients.csv")
# With your data:
df = pd.read_csv("path/to/your/data.csv")

Output

The tool generates:

  • Console summary with key statistics
  • Markdown report saved as data_profiling_report.md with comprehensive analysis

📊 Example Results

From the sample patient dataset (60 rows, 27 columns):

  • ✅ Detected invalid SSN formats (test data with "999" prefix)
  • ✅ Identified name contamination (numeric suffixes in names)
  • ✅ Found meaningful missing patterns (83% missing death dates = living patients)
  • ✅ Recommended data type conversions (dates to datetime64, categories for demographics)
  • ✅ Identified unique identifiers (UUID primary key, SSN)
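The tool reaches these findings through LLM analysis. For comparison, the first three can be approximated with fixed rules, as in the sketch below. The SSN layout `NNN-NN-NNNN`, the helper names, and all sample values are assumptions for illustration and are not taken from the dataset.

```python
import re
from typing import Iterable, Optional

# Assumed format: SSNs written as NNN-NN-NNNN
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
TRAILING_DIGITS = re.compile(r"\d+$")


def ssn_issue(ssn: str) -> Optional[str]:
    """Return a short label for a suspicious SSN, or None if it looks fine."""
    if not SSN_PATTERN.match(ssn):
        return "malformed"
    if ssn.startswith("999"):
        return "test prefix 999"
    return None


def name_has_numeric_suffix(name: str) -> bool:
    """True when a name ends in digits, e.g. a generated suffix."""
    return bool(TRAILING_DIGITS.search(name.strip()))


def missing_share(values: Iterable[object]) -> float:
    """Fraction of entries that are None or an empty string."""
    items = list(values)
    if not items:
        return 0.0
    missing = sum(1 for v in items if v is None or v == "")
    return missing / len(items)


# Made-up values: 50 empty entries out of 60 rows is roughly 83% missing
death_dates = [None] * 50 + ["2001-01-01"] * 10
print(ssn_issue("999-12-3456"), name_has_numeric_suffix("Maria123"))
print(round(missing_share(death_dates) * 100))
```

Rules like these catch known patterns only. Interpreting a mostly empty death-date column as "living patients" needs the context that the LLM step supplies.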

🏗️ Architecture

Built with PocketFlow - a minimalist LLM framework:

  • Workflow pattern for sequential processing pipeline
  • BatchNode for efficient parallel column analysis
  • YAML-based structured outputs with validation
  • Intelligent LLM analysis for contextual understanding
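The workflow, batch, and validation ideas above can be illustrated without the framework. The following is a plain-Python stand-in, not PocketFlow's API: steps are ordinary functions that read and write a shared dictionary, `fake_llm` is a stub replacing a real model call, and the column names are invented. Items are processed one after another here, so it shows the batching pattern but not parallel execution.

```python
from typing import Any, Callable, Dict, List

Shared = Dict[str, Any]


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call (assumption for this sketch)."""
    return f"description for: {prompt}"


def summarize_table(shared: Shared) -> None:
    # One call for the whole table
    shared["table_summary"] = fake_llm(
        "table with columns " + ", ".join(shared["columns"])
    )


def describe_columns(shared: Shared) -> None:
    # Batch idea: the same per-item function is applied to every column
    shared["column_descriptions"] = {
        col: fake_llm(col) for col in shared["columns"]
    }


def validate(shared: Shared) -> None:
    # Validation idea: fail fast if a step produced incomplete output
    missing = set(shared["columns"]) - set(shared["column_descriptions"])
    if missing:
        raise ValueError(f"no description for columns: {sorted(missing)}")


def run_pipeline(shared: Shared, steps: List[Callable[[Shared], None]]) -> Shared:
    """Run each step in order against the same shared dictionary."""
    for step in steps:
        step(shared)
    return shared


result = run_pipeline(
    {"columns": ["patient_id", "birth_date"]},  # invented column names
    [summarize_table, describe_columns, validate],
)
print(result["column_descriptions"]["birth_date"])
```

Keeping all intermediate results in one shared dictionary lets later steps, such as report generation, reuse earlier outputs without re-running them.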

📁 Project Structure

├── main.py              # Entry point
├── flow.py              # Flow orchestrator
├── nodes.py             # All profiling nodes
├── utils/
│   └── call_llm.py      # LLM utility (customize for your provider)
├── test/
│   └── patients.csv     # Sample dataset
└── docs/
    └── design.md        # Design documentation

🔧 Customization

Using Different LLM Providers

Edit utils/call_llm.py to use your preferred LLM:

  • Claude (Anthropic)
  • Google Gemini
  • Azure OpenAI
  • Local models (Ollama)

See the PocketFlow LLM guide for examples.

Analyzing Different Data Types

The tool works with any pandas DataFrame. You can:

  • Load from CSV, Excel, JSON, Parquet
  • Connect to databases
  • Use API data

Just ensure your data is loaded as a pandas DataFrame before running the flow.
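One way to support several file formats is to pick the pandas reader from the file extension. This is a sketch under assumptions: `load_dataframe` and `READERS` are hypothetical names, and the Excel and Parquet readers need optional packages installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Map file suffixes to the pandas reader for that format
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,       # needs an Excel engine such as openpyxl
    ".parquet": pd.read_parquet,  # needs pyarrow or fastparquet
}


def load_dataframe(path: str) -> pd.DataFrame:
    """Load a file into a DataFrame based on its extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
    return reader(path)


# Demo with a throwaway CSV file
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    csv_path.write_text("id,name\n1,Ann\n2,Bob\n")
    demo = load_dataframe(str(csv_path))
print(demo.shape)
```

The resulting DataFrame can then replace the `pd.read_csv("test/patients.csv")` line in main.py.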

🎓 Tutorial

This project demonstrates Agentic Coding with PocketFlow. Want to learn more?

📝 License

This project is a tutorial example for PocketFlow.
