Mihawk1891/TuneLab


TuneLab

Fully Autonomous ML Engineer Agent

A production-ready, fully autonomous machine learning pipeline for tabular datasets. Build high-quality models end-to-end without human intervention.

Features

  • 🔄 Fully Autonomous: Zero human intervention required
  • 💾 Strategy Memory: Learns from past runs via dataset fingerprinting
  • 📊 Complete Pipeline: From raw data to production model
  • 📝 Professional Docs: Auto-generated Markdown reports
  • 🖥️ CPU-First: Optimized for CPU, GTX 1650 friendly (no GPU required)
  • 🆓 100% Open Source: No proprietary dependencies

🚀 Quick Start

Installation

# Clone or download this repository
git clone <your-repo-url>
cd ml-agent # Use the directory name that the clone actually created
# Install dependencies
pip install -r requirements.txt

Basic Usage

# Run on your dataset
python ml_agent.py path/to/your/data.csv

That's it! The agent will:

  1. Analyze your data
  2. Engineer features
  3. Train multiple models
  4. Optimize hyperparameters
  5. Generate reports and plots
  6. Save the best model

Example with Sample Data

# Generate sample data first
python example_usage.py --generate-data
# Run the agent
python ml_agent.py sample_data/iris.csv

📁 Output Structure

After running, you'll find everything in the outputs/ directory:

outputs/
├── models/
│ └── final_model.joblib # Trained model ready for production
├── plots/
│ ├── feature_importance.png # Top features visualization
│ └── metric_comparison.png # Model performance comparison
├── reports/
│ ├── overview.md # Project summary
│ ├── data_analysis.md # Data insights
│ ├── modeling.md # Model selection details
│ └── results.md # Final results & recommendations
└── strategy/
 └── <fingerprint>.json # Reusable strategy for this dataset

🎯 How It Works

1. Dataset Fingerprinting

Generates a unique fingerprint based on:

  • Dataset shape
  • Column names and types
  • Missing value patterns

If you've run the agent on similar data before, it loads the successful strategy from memory.
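The fingerprinting idea can be illustrated with a minimal sketch. This is not the code from ml_agent.py: the function name `dataset_fingerprint`, the choice of SHA-256, and the use of per-column missing counts as the "missing value pattern" are all assumptions made for illustration.

```python
import hashlib
import json

import pandas as pd


def dataset_fingerprint(df: pd.DataFrame, length: int = 16) -> str:
    """Hash the shape, column names/dtypes, and per-column missing counts.

    Hypothetical helper: the real agent may combine these inputs differently.
    """
    summary = {
        "shape": list(df.shape),
        "columns": [[str(name), str(dtype)] for name, dtype in df.dtypes.items()],
        "missing": [int(n) for n in df.isna().sum()],
    }
    # sort_keys gives a stable serialization, so equal summaries hash identically
    payload = json.dumps(summary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


df_a = pd.DataFrame({"x": [1.0, None, 3.0], "label": ["a", "b", "a"]})
# Same structure and missing pattern, different cell values
df_b = pd.DataFrame({"x": [9.0, None, 7.0], "label": ["c", "c", "d"]})
# Same data as df_a but with a renamed column
df_c = df_a.rename(columns={"x": "z"})

print(dataset_fingerprint(df_a))
print(dataset_fingerprint(df_a) == dataset_fingerprint(df_b))
print(dataset_fingerprint(df_a) == dataset_fingerprint(df_c))
```

Because only structural properties are hashed in this sketch, two files with the same layout share a fingerprint even when their values differ, which is what lets a stored strategy be reused on "similar" data.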

2. Data Understanding

Automatically detects:

  • Target column (last column by default)
  • Problem type (classification vs regression)
  • Missing values
  • Feature types (numerical vs categorical)
  • Class imbalance
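As a rough sketch of the detection steps listed above, the following uses pandas only. The helper names `infer_problem_type` and `summarize`, and the 20-class cutoff for treating a numeric target as classification, are assumptions for illustration and may not match the agent's actual rules.

```python
from typing import Optional

import pandas as pd


def infer_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Treat non-numeric or low-cardinality targets as classification (assumed heuristic)."""
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    return "classification" if target.nunique() <= max_classes else "regression"


def summarize(df: pd.DataFrame, target_col: Optional[str] = None) -> dict:
    """Collect the facts the Data Understanding step reports."""
    target_col = target_col or df.columns[-1]  # last column by default
    features = df.drop(columns=[target_col])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in features.columns if c not in numerical]
    info = {
        "target": target_col,
        "problem_type": infer_problem_type(df[target_col]),
        "numerical": numerical,
        "categorical": categorical,
        # Only columns that actually contain missing values
        "missing": {col: int(n) for col, n in df.isna().sum().items() if n > 0},
    }
    if info["problem_type"] == "classification":
        counts = df[target_col].value_counts()
        # Ratio of the largest class to the smallest; 1.0 means perfectly balanced
        info["imbalance_ratio"] = float(counts.max() / counts.min())
    return info


df = pd.DataFrame({
    "feature1": [1.2, 2.3, None, 4.1],
    "feature3": ["cat", "dog", "cat", "dog"],
    "target": [0, 1, 0, 0],
})
info = summarize(df)
print(info)
```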

3. Feature Engineering

Applies robust preprocessing:

  • Numerical: Median imputation
  • Categorical: Most frequent imputation + label encoding
  • No leakage: All transformations fit only on training data
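One way to get this leakage-free behavior with scikit-learn is sketched below. It is an illustration, not the agent's own preprocessing code: the column names are made up, and `OrdinalEncoder` stands in for the label encoding mentioned above because it can be placed inside a pipeline and can handle categories unseen during training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Small illustrative dataset with one missing value per feature
df = pd.DataFrame({
    "feature1": [1.2, 2.3, np.nan, 4.1, 5.0, 6.2, 7.7, 8.1, 9.4, 10.0],
    "feature3": ["cat", "dog", "cat", np.nan, "dog", "cat", "dog", "cat", "dog", "cat"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns=["target"])
y = df["target"]

# Split first, so nothing about the test rows influences the fitted statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["feature1"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories not seen in training are mapped to -1 instead of raising
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), ["feature3"]),
])

X_train_t = preprocessor.fit_transform(X_train)  # medians/modes learned from training rows only
X_test_t = preprocessor.transform(X_test)        # test rows reuse the training statistics

print(X_train_t.shape, X_test_t.shape)
```

The key point is the call order: `fit_transform` runs on the training split only, and the test split only ever sees `transform`.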

4. Model Selection

Trains multiple baseline models:

Classification:

  • Logistic Regression
  • Random Forest
  • Extra Trees
  • Gradient Boosting

Regression:

  • Linear Regression
  • Ridge Regression
  • Random Forest
  • Extra Trees
  • Gradient Boosting
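A baseline comparison of the classification models listed above might look like the following sketch. It uses scikit-learn's bundled iris data and default-ish settings chosen for illustration; the agent's actual hyperparameters, split, and metric may differ.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Settings here are assumptions, not the values used in ml_agent.py
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"Training {name}... accuracy={scores[name]:.4f}")

# Pick the model with the highest held-out accuracy
best_name = max(scores, key=scores.get)
print(f"Best model: {best_name} (accuracy={scores[best_name]:.4f})")
```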

5. Hyperparameter Optimization

Uses Optuna for Bayesian optimization:

  • 30 trials
  • 3-fold cross-validation
  • Automatic parameter search

6. Artifacts & Documentation

Generates:

  • Saved models (.joblib)
  • Visualizations (.png)
  • Professional Markdown reports
  • Reusable strategy files
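How a reusable strategy file could be written and looked up by fingerprint is sketched below using only the standard library. The helper names and the fields stored in the JSON are hypothetical; the real strategy files may contain different keys.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_strategy(strategy_dir: Path, fingerprint: str, strategy: dict) -> Path:
    """Write the strategy as <fingerprint>.json inside the strategy directory."""
    strategy_dir.mkdir(parents=True, exist_ok=True)
    path = strategy_dir / f"{fingerprint}.json"
    path.write_text(json.dumps(strategy, indent=2), encoding="utf-8")
    return path


def load_strategy(strategy_dir: Path, fingerprint: str) -> Optional[dict]:
    """Return the stored strategy for this fingerprint, or None if there is none."""
    path = strategy_dir / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    strategy_dir = Path(tmp) / "outputs" / "strategy"
    # Hypothetical strategy contents, for illustration only
    saved = {"best_model": "Gradient Boosting", "params": {"n_estimators": 100}}
    save_strategy(strategy_dir, "a3f5d8c9b2e1f4a7", saved)
    reloaded = load_strategy(strategy_dir, "a3f5d8c9b2e1f4a7")
    missing = load_strategy(strategy_dir, "0000000000000000")

print(reloaded)
print(missing)
```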

💻 Advanced Usage

Specify Target Column

from ml_agent import MLAgent
agent = MLAgent(
 data_path="data.csv",
 target_col="target", # Specify target column
 problem_type="classification" # Or "regression"
)
agent.run()

Customize Parameters

from ml_agent import MLAgent
agent = MLAgent(
 data_path="data.csv",
 output_dir="my_outputs",
 max_iterations=5,
 target_metric_threshold=0.95,
 improvement_threshold=0.01
)
agent.run()

Load and Use Trained Model

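import joblib
import pandas as pd
# Load the model package (model, fitted preprocessor, and training feature names)
model_pkg = joblib.load('outputs/models/final_model.joblib')
model = model_pkg['model']
preprocessor = model_pkg['preprocessor']
feature_names = model_pkg['feature_names']
# Load new data and keep the training columns in their original order
new_data = pd.read_csv('new_data.csv')[feature_names]
# Apply the training-time preprocessing before predicting
# (assumes the saved preprocessor exposes a scikit-learn-style transform())
X_new = preprocessor.transform(new_data)
predictions = model.predict(X_new)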

🛠️ Tech Stack

Core (Required)

  • Python 3.10+
  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • joblib

Optional (Recommended)

  • seaborn (better plots)
  • optuna (hyperparameter tuning)

Constraints

  • ✅ 100% open source
  • ✅ CPU-optimized
  • ✅ No GPU required
  • ✅ GTX 1650 compatible if GPU available
  • ❌ No proprietary software
  • ❌ No paid APIs

📊 Supported Models

Classification

Model Speed Accuracy Interpretability
Logistic Regression ⚡⚡⚡ ⭐⭐ ⭐⭐⭐
Random Forest ⚡⚡ ⭐⭐⭐ ⭐⭐
Extra Trees ⚡⚡ ⭐⭐⭐ ⭐⭐
Gradient Boosting ⭐⭐⭐

Regression

Model Speed Accuracy Interpretability
Linear Regression ⚡⚡⚡ ⭐⭐ ⭐⭐⭐
Ridge Regression ⚡⚡⚡ ⭐⭐ ⭐⭐⭐
Random Forest ⚡⚡ ⭐⭐⭐ ⭐⭐
Extra Trees ⚡⚡ ⭐⭐⭐ ⭐⭐
Gradient Boosting ⭐⭐⭐

📖 Example Workflow

1. Prepare Your Data

Your CSV should have:

  • Features in columns
  • Target variable (typically last column)
  • Header row with column names

For example:

feature1,feature2,feature3,target
1.2,3.4,cat,0
2.3,4.5,dog,1
...

2. Run the Agent

python ml_agent.py my_data.csv

Output:

🤖 ML Agent Initialized
📁 Output Directory: outputs
📊 Loading dataset: my_data.csv
 Shape: (1000, 4)
 Columns: ['feature1', 'feature2', 'feature3', 'target']
🔑 Dataset Fingerprint: a3f5d8c9b2e1f4a7
🔍 Analyzing dataset...
 Auto-detected target: target
 Auto-detected problem type: classification
 ✅ No missing values
 Numerical features: 2
 Categorical features: 1
🔧 Engineering features...
 Processing 2 numerical + 1 categorical features
 ✅ Train: 800 samples
 ✅ Test: 200 samples
🎯 Training baseline models...
 Training Logistic Regression... accuracy=0.8500
 Training Random Forest... accuracy=0.9200
 Training Extra Trees... accuracy=0.9100
 Training Gradient Boosting... accuracy=0.9350
 🏆 Best model: Gradient Boosting (accuracy=0.9350)
⚙️ Optimizing hyperparameters...
 Best trial score: 0.9425
 ✅ Optimized model score: 0.9450
💾 Model saved: outputs/models/final_model.joblib
📊 Generating plots...
 ✅ Feature importance plot saved
 ✅ Model comparison plot saved
📝 Generating reports...
 ✅ All reports generated
💾 Strategy saved to memory: a3f5d8c9
============================================================
✅ PIPELINE COMPLETE
============================================================
📁 All outputs saved to: outputs/
🏆 Final model score: 0.9450
💾 Model: outputs/models/final_model.joblib
📊 Reports: outputs/reports/
📈 Plots: outputs/plots/

3. Review the Outputs

Check the generated reports:

  • outputs/reports/overview.md - Quick summary
  • outputs/reports/data_analysis.md - Data insights
  • outputs/reports/modeling.md - Model details
  • outputs/reports/results.md - Final results

View visualizations:

  • outputs/plots/feature_importance.png
  • outputs/plots/metric_comparison.png

4. Use the Model

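import joblib
import pandas as pd
# Load the saved package and the rows to score
model_pkg = joblib.load('outputs/models/final_model.joblib')
new_data = pd.read_csv('new_data.csv')
# new_data must be prepared the same way as the training features
predictions = model_pkg['model'].predict(new_data)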

🎓 Design Philosophy

1. Simplicity Over Complexity

  • Use simple, explainable models first
  • Avoid overfitting with regularization
  • Prefer interpretability when possible

2. CPU-First Architecture

  • No GPU required (though compatible)
  • Optimized for standard hardware
  • Works on laptops and servers alike

3. Full Transparency

  • Every decision is documented
  • Complete audit trail in reports
  • Reproducible with fixed random seeds

4. Production Ready

  • Save everything needed for deployment
  • Include preprocessing in model package
  • Professional documentation for handover

5. Autonomous Operation

  • Zero human intervention
  • Automatic problem detection
  • Self-documenting workflows

🔧 Customization

Add Custom Models

Edit ml_agent.py in the train_models() method:

if self.problem_type == 'classification':
 models = {
 'Logistic Regression': LogisticRegression(...),
 'Random Forest': RandomForestClassifier(...),
 # Add your model here (SVC needs: from sklearn.svm import SVC):
 'SVM': SVC(...),
 }

Change Metrics

Modify the metric calculation in train_models():

# For classification
metrics = {
 'accuracy': accuracy_score(self.y_test, y_pred),
 'f1': f1_score(self.y_test, y_pred, average='weighted'),
 # Add custom metrics
}

Adjust Hyperparameter Search

Modify optimize_hyperparameters():

# Change number of trials
study.optimize(objective, n_trials=50) # Default: 30
# Adjust cross-validation folds
scores = cross_val_score(..., cv=5) # Default: 3

📝 Report Examples

Overview Report

  • Quick project summary
  • Tech stack used
  • Best model and score
  • File structure

Data Analysis Report

  • Dataset statistics
  • Missing value analysis
  • Feature type breakdown
  • Target distribution

Modeling Report

  • All models tried
  • Performance comparison
  • Selection rationale
  • Hyperparameter tuning results

Results Report

  • Final metrics
  • Usage instructions
  • Next steps recommendations
  • Reproducibility guide

🐛 Troubleshooting

"Optuna not available"

pip install optuna

Or continue without it (hyperparameter tuning will be skipped).

"Memory Error"

For large datasets, reduce n_estimators in models:

'Random Forest': RandomForestClassifier(n_estimators=50) # Default: 100

"FileNotFoundError"

Make sure your CSV path is correct:

python ml_agent.py /full/path/to/data.csv

🤝 Contributing

This is a fully autonomous agent - improvements welcome!

Areas for enhancement:

  • Additional model types
  • Advanced feature engineering
  • Custom metric support
  • Multi-class calibration
  • Time series support

📄 License

Open source - use freely for any purpose.


🙏 Acknowledgments

Built with:

  • scikit-learn - Amazing ML library
  • Optuna - Hyperparameter optimization
  • pandas - Data manipulation
  • matplotlib - Visualization

📞 Support

For issues or questions:

  1. Check the generated reports in outputs/reports/
  2. Review the troubleshooting section
  3. Examine the code comments in ml_agent.py

If the problem persists, you can contact me at pranavbansode2604@gmail.com

Built for production. Designed for autonomy. Optimized for simplicity.

🤖 Let the agent do the work.

About

TuneLab is a fully autonomous ML engineer agent that builds production-ready models from raw CSV data. Upload → Train → Deploy. Features FastAPI backend, beautiful web UI, strategy memory, auto hyperparameter tuning, and one-click Render deployment. 100% open-source & CPU-first.
