Research framework for mathematical reasoning with multiple LLM backends (OpenAI API, Claude API, open-source HuggingFace models) and reinforcement learning-based example selection.
MathCoRL supports three LLM backends for comprehensive mathematical reasoning research:

**OpenAI**
- Models: GPT-4o, GPT-4, GPT-3.5-turbo (all variants)
- Features: Complete API integration with accurate token counting
- Status: ✅ Fully supported and tested

**Anthropic (Claude)**
- Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
- Features: Native Anthropic API integration via LangChain
- Status: ✅ Fully supported and tested

**Open-Source (HuggingFace)**
- Models:
  - DeepSeek-R1 (1.5B, 7B, 14B)
  - Qwen2.5-Math (7B, 72B)
- Features: Local GPU inference, zero API cost
- Requirements: CUDA GPU recommended (tested on an RTX 3090 with 24GB VRAM)
- Status: ✅ Fully supported through the unified interface (see the sketch below)
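All three backends sit behind one generation interface, so the prompting methods below run unchanged across providers. A minimal sketch of that pattern, assuming illustrative names (`LLMBackend`, `OpenAIBackend`, `HFBackend` are not the package's actual API):

```python
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """Common interface so prompting methods stay backend-agnostic."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        ...

class OpenAIBackend(LLMBackend):
    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI  # requires `pip install openai`
        self.client, self.model = OpenAI(), model  # reads OPENAI_API_KEY from env

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return resp.choices[0].message.content

class HFBackend(LLMBackend):
    def __init__(self, model_id: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"):
        from transformers import pipeline  # requires `pip install transformers`
        self.pipe = pipeline("text-generation", model=model_id, device_map="auto")

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        out = self.pipe(prompt, max_new_tokens=max_tokens, return_full_text=False)
        return out[0]["generated_text"]
```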
Compare different prompting techniques:
- Zero-Shot: Direct problem solving without examples
- Few-Shot: Random example selection from candidate pool
- FPP (Function Prototype Prompting): Structured reasoning with policy network example selection
- CoT, PAL, PoT: Additional baseline methods (API models only)
Compare example selection strategies (a minimal KATE-style sketch follows the list):
- Policy Network: Reinforcement learning-based selection
- KATE: K-nearest neighbors with embeddings
- CDS: Clustering-based diverse selection
- Random: Baseline random sampling
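Most of these strategies reduce to geometry over problem embeddings. For orientation, here is a minimal NumPy sketch of KATE-style selection (illustrative only, not the repo's implementation):

```python
import numpy as np

def kate_select(query_emb: np.ndarray, pool_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Pick the k candidate examples nearest to the query by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = pool @ q                 # cosine similarity to every candidate
    return np.argsort(-sims)[:k]    # indices of the top-k matches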
| Dataset | Domain | Size | Description | ICL k | Both Providers |
|---|---|---|---|---|---|
| GSM8K | Elementary Math | 8.5K | Grade-school math word problems | 2 | ✅ |
| SVAMP | Arithmetic | 1K | Simple arithmetic word problems with variations | 2 | ✅ |
| TabMWP | Tabular Math | 38K | Math problems involving tables and charts | 2 | ✅ |
| TAT-QA | Financial QA | 16K | Table-and-text QA for financial documents | 3 | ✅ |
| FinQA | Financial Analysis | 8K | Complex financial reasoning and calculations | 2 | ✅ |
Each dataset includes:
- Training set: For candidate generation and policy training
- Test set: For evaluation and comparison
- Cross-provider evaluation: Test with both OpenAI and Claude
- API cost tracking: Monitor usage across providers
- Python: 3.8+ (tested on 3.10, 3.11, 3.13)
- Memory: 4GB minimum, 8GB recommended for Policy Network training
- Storage: 2GB for datasets and embeddings
- API Keys: OpenAI or Anthropic account with API access (not required for open-source models)
```bash
# Clone repository
git clone https://github.com/your-username/MathCoRL.git
cd MathCoRL

# Install dependencies
pip install -r requirements.txt

# Configure API keys (optional for open-source models)
cp env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=your_openai_key        # For API models
# ANTHROPIC_API_KEY=your_anthropic_key  # For Claude
# LLM_PROVIDER=openai                   # Default provider (openai/claude)
```
```bash
# Test with DeepSeek-R1 7B on GSM8K
python mathcorl_os.py test --method zero_shot --model deepseek_r1_7b --dataset GSM8K --samples 10

# Compare all 3 methods (zero-shot, few-shot, fpp+policy)
python mathcorl_os.py compare --model deepseek_r1_7b --dataset GSM8K --samples 50

# Test with Qwen2.5-Math 7B
python mathcorl_os.py compare --model qwen_math_7b --dataset TAT-QA --samples 50

# Available models: deepseek_r1_7b, deepseek_r1_1.5b, qwen_math_7b, qwen_math_72b
```
```bash
# Single problem solving
python -m mint.cli solve --method fpp --question "What is 15 + 27?" --provider openai
python -m mint.cli solve --method cot --question "John has 20 apples..." --provider claude

# Dataset evaluation
python -m mint.cli test --method fpp --dataset SVAMP --limit 100 --provider openai
python -m mint.cli test --method cot --dataset GSM8K --limit 50 --provider claude

# Interactive mode
python -m mint.cli interactive --provider openai
```
```bash
# Step 1: Generate candidate examples with embeddings
python generate_candidates.py --dataset TAT-QA --n-candidates 30 --seed 42

# Step 2: Train Policy Network for example selection
python train_policy.py --dataset TAT-QA --epochs 20 --seed 42

# Step 3: Compare ICL methods (works with both API and open-source)
python run_comparison.py --dataset TAT-QA --samples 101 --seed 42

# Test with open-source models + policy network
python mathcorl_os.py test --method fpp_policy --model deepseek_r1_7b --dataset GSM8K --samples 50
```
```bash
# Real-time usage statistics
python -m mint.cli stats                    # All providers, last 24h
python -m mint.cli stats --hours 12         # Last 12 hours
python -m mint.cli stats --provider claude  # Claude only

# Export detailed usage data
python -m mint.cli export --format csv   # CSV export
python -m mint.cli export --format json  # JSON export
```
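The reported costs are straightforward arithmetic over tracked token counts. A sketch of the calculation, with placeholder per-million-token prices (check current provider pricing; this is not the tracker's actual rate table):

```python
# Placeholder USD prices per 1M tokens -- illustrative, not authoritative
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    rate = PRICES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
```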
```bash
# Pool size ablation (ICL research)
python run_pool_size_ablation.py --dataset GSM8K --samples 101

# Method comparison ablation
python run_ablation_study.py --dataset SVAMP --methods fpp,cot,pal
```
- Zero-Shot: Direct problem solving without examples
- Few-Shot: Random k examples from candidate pool
- FPP (Function Prototype Prompting): Structured reasoning with math functions + policy network selection (illustrated in the sketch after this list)
- CoT (Chain-of-Thought): Step-by-step natural language reasoning (API only)
- PAL/PoT: Program-based reasoning (API only)
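To make the FPP idea concrete, here is a sketch of what a function-prototype prompt can look like; the prototype set and wording are assumptions for illustration, not the package's actual prompt:

```python
# Hypothetical FPP-style prompt template (the real prototypes differ)
FPP_PROMPT = """You are given Python function prototypes for exact arithmetic.
Solve the problem using only these functions, then print() the final answer.

def add(a: float, b: float) -> float: ...
def sub(a: float, b: float) -> float: ...
def mul(a: float, b: float) -> float: ...
def div(a: float, b: float) -> float: ...

Problem: {question}
Solution (Python code):
"""

prompt = FPP_PROMPT.format(question="A shirt costs $15 and pants cost $27. What is the total?")
```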
- Policy Network: Reinforcement learning-based adaptive selection (1536D→768D transformer; see the sketch after this list)
- KATE: k-Nearest neighbors with embedding similarity
- CDS: Clustering-based diverse selection
- Random: Baseline random sampling
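The 1536D→768D figures refer to projecting 1536-dimensional text embeddings down to a 768-dimensional space before scoring candidates. A minimal PyTorch sketch of a scorer with that shape (an illustration of the dimensions involved, not the repo's exact architecture):

```python
import torch
import torch.nn as nn

class PolicyScorer(nn.Module):
    """Scores candidate examples against a problem embedding (illustrative)."""

    def __init__(self, in_dim: int = 1536, hid_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)  # 1536D -> 768D projection
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, problem: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # problem: (B, 1536); candidates: (B, N, 1536)
        q = self.proj(problem).unsqueeze(1)        # (B, 1, 768) query
        kv = self.proj(candidates)                 # (B, N, 768) keys/values
        ctx, _ = self.attn(q, kv, kv)              # problem attends over the pool
        logits = self.score(kv + ctx).squeeze(-1)  # (B, N) selection logits
        return logits
```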
- API Models: OpenAI/Claude via REST APIs with token tracking
- Open-Source: HuggingFace models with local GPU inference
- Unified Interface: Same prompting methods across all backends
- Cost Comparison: $0 for open-source vs. API pricing
```
mint/                          # Core package
├── cli.py                     # Unified command-line interface
├── config.py                  # Multi-provider configuration
├── tracking.py                # Universal API tracking
├── reproducibility.py         # Seed fixing for reproducibility
├── core.py                    # FPP implementation
├── cot.py, pal.py, pot.py     # Alternative prompting methods
├── zero_shot.py               # Zero-shot baseline
├── icrl/                      # In-Context RL components
│   ├── candidate_generator.py # Training example extraction
│   ├── policy_network.py      # Neural selection model
│   ├── trainer.py             # PPO training implementation
│   └── evaluator.py           # Multi-method evaluation
├── utils.py                   # Evaluation utilities
└── testing.py                 # Testing framework
```
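trainer.py above trains the selection policy with PPO. For readers unfamiliar with it, the heart of any PPO update is the clipped surrogate loss; this generic sketch shows the standard objective (Schulman et al., 2017), not the repo's exact training loop:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: discourages updates that move the policy too far."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # negate: optimizer minimizes
```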
```
CLI Interface → Provider Selection → Method Execution → Universal Tracking → Results
      ↓                ↓                    ↓                    ↓
  User Input     [OpenAI|Claude]    [FPP|CoT|PAL|PoT]   Cost/Token Tracking
```
- ✅ Dual LLM Provider Support: Full OpenAI and Claude integration
- ✅ Universal API Tracking: Accurate cost monitoring across providers
- ✅ Reproducibility: Comprehensive seed fixing for consistent results
- ✅ Complete Method Suite: 5 prompting methods + 5 ICL strategies
- ✅ Interactive CLI: Real-time problem solving and testing
- ✅ Advanced Visualization: Charts, exports, and analysis tools
- ✅ Reinforcement Learning: Policy network training for example selection
- ✅ Production Ready: Comprehensive logging, error handling, and documentation
- Method Comparison: Systematic evaluation of reasoning approaches
- Cross-Provider Analysis: Performance comparison between OpenAI and Claude
- Cost Optimization: Detailed tracking for budget-conscious research
- ICL Research: Advanced in-context learning with neural selection
- Scalability: Support for large-scale dataset evaluation
- Reproducibility: Comprehensive configuration and result tracking
Comprehensive guides are available in the docs/ directory:
- Usage Guide: Complete usage guide for both research tasks
- API Tracking: API usage tracking and cost monitoring
- Tracking Examples: Practical examples with tracking
- Claude Integration: Claude setup and configuration
- Datasets: Dataset descriptions and preprocessing
- Policy Network: Neural network architecture and training
- Charts & Visualization: Analysis and visualization tools
- Technical Notes: Implementation details and refactoring history
- Compare structured vs. free-form reasoning approaches
- Evaluate mathematical reasoning capabilities across different LLMs
- Study cost-effectiveness of different prompting strategies
- Analyze reasoning quality and interpretability
- Investigate optimal example selection strategies
- Study reinforcement learning for demonstration selection
- Compare neural vs. similarity-based selection methods
- Explore curriculum learning effects in mathematical reasoning
- Evaluate reasoning capabilities: OpenAI vs Claude
- Compare cost efficiency across providers and methods
- Study model-specific optimal prompting strategies
- Analyze scaling laws for mathematical reasoning
- Track accuracy per dollar across methods and providers (a helper is sketched after this list)
- Optimize API usage for budget-constrained environments
- Study token efficiency patterns in mathematical reasoning
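A trivial helper for the accuracy-per-dollar metric above (illustrative; not part of the package):

```python
def accuracy_per_dollar(correct: int, total: int, cost_usd: float) -> float:
    """Fraction of problems solved per USD spent; higher is better."""
    return (correct / total) / cost_usd if cost_usd > 0 else float("inf")
```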
```bash
# Provider configuration
LLM_PROVIDER=openai                    # Default: openai | claude
OPENAI_API_KEY=your_openai_key         # Required for OpenAI
ANTHROPIC_API_KEY=your_anthropic_key   # Required for Claude

# Model selection
OPENAI_MODEL=gpt-4o-mini                    # OpenAI model choice
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022  # Claude model choice

# Generation parameters
TEMPERATURE=0.1   # Response randomness
MAX_TOKENS=4000   # Maximum response length
```
```python
# Programmatic configuration
from mint.config import create_llm_client, get_config

# Create provider-specific clients
openai_client = create_llm_client(provider="openai")
claude_client = create_llm_client(provider="claude")

# Access configuration
config = get_config()
print(f"Current provider: {config.provider}")
print(f"Current model: {config.get_current_model_name()}")
```
See CONTRIBUTING.md for guidelines on:
- Code style and testing requirements
- Pull request process
- Research contribution areas
**Import Error**: `ModuleNotFoundError: No module named 'mint'`
```bash
pip install -e .  # Install package in development mode
```
**API Key Error**: `openai.error.AuthenticationError`
```bash
# Verify .env file exists and contains valid keys
cat .env | grep API_KEY
export OPENAI_API_KEY=your_key_here  # Set directly if needed
```
**CUDA/MPS Device Error**: `RuntimeError: MPS backend out of memory`
```bash
# Fall back to CPU instead of GPU
export PYTORCH_ENABLE_MPS_FALLBACK=1
# Or reduce batch size in configs/hyperparameters.yaml
```
**Slow Embedding Generation**: candidate embedding takes too long on large datasets
```bash
# Use a smaller candidate pool
python generate_candidates.py --n-candidates 50  # Default is 100
```
**Unstable Policy Network Training**: loss not decreasing
```bash
# Adjust learning rate and epochs in configs/hyperparameters.yaml
# Try a lower learning rate (e.g. lr: 0.0001) or train for more epochs
```
For additional support, see documentation or open an issue on GitHub.
MathCoRL welcomes contributions in:
- New Prompting Methods: Additional structured reasoning approaches
- LLM Provider Integration: Support for new language models
- ICL Strategies: Novel example selection algorithms
- Datasets: Additional mathematical reasoning domains
- Evaluation Metrics: Advanced correctness and efficiency measures
- Cost Optimization: More efficient API usage patterns
This project is licensed under the MIT License - see the LICENSE file for details.