openalchemy/recruitGPT

RecruitGPT

An open-source AI recruiting pipeline that combines fine-tuned embeddings, cross-encoder reranking, knowledge graph signals, and LLM reasoning to match candidates with jobs.

License: MIT | Python 3.10+


How It Works

RecruitGPT is a 5-stage retrieval-augmented matching pipeline. Each stage narrows and refines the candidate pool, ending with a human-readable explanation.

JD / Hiring Query
 │
 ▼
┌───────────────────┐
│ 1 Query Parsing │ LLM extracts structured intent: skills, seniority,
│ (Qwen3.5 0.8B) │ industry, hard constraints, nice-to-haves
└────────┬──────────┘
 │
 ▼
┌───────────────────┐
│ 2 Retrieval │ Fine-tuned BGE encodes query → FAISS ANN search
│ (BGE-large) │ over candidate embeddings → Top-K recall
└────────┬──────────┘
 │
 ▼
┌───────────────────┐
│ 3 Reranking │ Cross-encoder scores each (query, candidate) pair
│ (bge-reranker) │ with full attention → Top-N precision
└────────┬──────────┘
 │
 ▼
┌───────────────────┐
│ 4 Graph Boost │ Knowledge graph (skills, companies, industries)
│ (NetworkX) │ adds structural signals: career similarity,
│ │ skill adjacency, company-tier overlap
└────────┬──────────┘
 │
 ▼
┌───────────────────┐
│ 5 Explanation │ LLM generates per-candidate match report:
│ (Qwen3.5 0.8B) │ strengths, gaps, interview focus areas
└───────────────────┘
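The narrowing behaviour of stages 2 to 4 can be sketched in a few lines. This is a minimal illustration, not the project's `match.py`: the three scoring callables, the `graph_weight` blend, and the `Scored` record are all hypothetical stand-ins for the embedding similarity, cross-encoder score, and graph signal described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scored:
    candidate_id: str
    score: float


def run_funnel(
    candidate_ids: List[str],
    retrieve_score: Callable[[str], float],  # stand-in for Stage 2 embedding similarity
    rerank_score: Callable[[str], float],    # stand-in for Stage 3 cross-encoder score
    graph_boost: Callable[[str], float],     # stand-in for Stage 4 graph signal
    top_k: int = 20,
    top_n: int = 5,
    graph_weight: float = 0.2,               # hypothetical blend weight
) -> List[Scored]:
    # Stage 2: cheap score over the whole pool, keep the top K (recall).
    recalled = sorted(candidate_ids, key=retrieve_score, reverse=True)[:top_k]
    # Stage 3: expensive score over only K candidates, keep the top N (precision).
    reranked = sorted(recalled, key=rerank_score, reverse=True)[:top_n]
    # Stage 4: add the weighted graph signal, then re-sort the short list.
    final = [
        Scored(cid, rerank_score(cid) + graph_weight * graph_boost(cid))
        for cid in reranked
    ]
    return sorted(final, key=lambda s: s.score, reverse=True)
```

The point of the funnel is cost: the expensive cross-encoder only ever sees `top_k` candidates, and the graph signal only adjusts the order of the final `top_n`.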

Why Fine-tune BGE?

Generic embedding models treat "5 years of distributed systems at a fintech" and "entry-level web developer" as vaguely similar — they're both "software engineering." A fine-tuned BGE model learns the recruiting domain's similarity structure:

  • Seniority matters: Senior backend ≠ junior backend
  • Skill overlap is nuanced: "Kubernetes + Go" is closer to "Docker + Rust" than to "Excel + VBA"
  • Context changes meaning: "Python" in a data science JD ≠ "Python" in a DevOps JD

We fine-tune with contrastive learning on (JD, good-match resume, bad-match resume) triplets, including hard negatives mined from the model itself.
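To see why hard negatives matter, consider a margin-based triplet loss on cosine similarity. This is a sketch under assumptions: the margin value and the toy 2-D vectors are invented for illustration, and the project's `losses.py` may differ in detail.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # hypothetical margin
) -> float:
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# Toy embeddings: a JD, a good-match resume, and two bad-match resumes.
jd = [1.0, 0.0]
good = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # obviously unrelated
hard_negative = [0.8, 0.3]   # superficially similar to the JD

easy_loss = triplet_loss(jd, good, easy_negative)
hard_loss = triplet_loss(jd, good, hard_negative)
```

Here `easy_loss` is zero, so the easy negative contributes no gradient, while `hard_loss` is positive. Only the hard negative still teaches the model something.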

Quick Start

Installation

git clone https://github.com/openalchemy/recruitGPT.git
cd recruitGPT
pip install -r requirements.txt

Configure API Keys

cp .env.example .env
# Fill in at least one teacher model key (DeepSeek recommended — cheapest, no license issues)

Step 1 — Generate Training Data via Distillation

A large teacher model (DeepSeek-V3, GPT-4o, or Claude) generates high-quality training data for the smaller student model.

# Generate LLM training data (query parsing + match explanation)
python scripts/distill_data.py \
 --teacher deepseek \
 --tasks query_parsing,match_explanation \
 --num_per_task 500
# Build embedding triplets
python scripts/build_embedding_pairs.py \
 --resumes data/resumes/ \
 --jds data/jds/ \
 --output data/pairs/train_triplets.jsonl
# Mine hard negatives using current model
python scripts/mine_hard_negatives.py \
 --triplets data/pairs/train_triplets.jsonl \
 --model BAAI/bge-large-zh-v1.5 \
 --output data/pairs/hard_negatives.jsonl
# Quality filtering
python scripts/filter_data.py \
 --input data/generated/train.jsonl \
 --output data/generated/train_clean.jsonl
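Hard negative mining, in its simplest form, ranks the non-matching candidates by similarity to the query under the current model and keeps the closest ones. The sketch below assumes embeddings are already computed; the function name and arguments are hypothetical and do not reflect the actual interface of `mine_hard_negatives.py`.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_hard_negatives(
    query_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
    positive_ids: Set[str],
    num_negatives: int = 2,
) -> List[str]:
    """Return the IDs of the non-positive candidates most similar to the query."""
    scored = [
        (cosine(query_vec, vec), cid)
        for cid, vec in candidate_vecs.items()
        if cid not in positive_ids  # known good matches are never negatives
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:num_negatives]]
```

In practice the very top of this ranking can contain unlabeled good matches, so a mining step often skips the first few hits or applies a similarity ceiling. Whether this project does so is not stated here.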

Step 2 — Fine-tune BGE Embedding

python src/embedding/train_embedding.py --config configs/bge_finetune.yaml

This trains with InfoNCE loss + in-batch negatives + hard negatives. A single A6000 handles it in under an hour for a few thousand triplets.
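For reference, the loss described above can be written out in plain Python. This is a sketch of the standard InfoNCE formulation, assuming L2-normalized vectors and a hypothetical temperature of 0.05; the training script itself presumably uses a batched tensor implementation.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce(
    queries: List[Sequence[float]],
    positives: List[Sequence[float]],
    hard_negatives: List[Sequence[float]],
    temperature: float = 0.05,  # hypothetical value
) -> float:
    """Mean InfoNCE loss over a batch.

    For query i the target is positives[i]. Its negatives are every other
    positive in the batch (in-batch negatives) plus all hard negatives.
    """
    keys = list(positives) + list(hard_negatives)
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all keys.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # equals -log softmax of the target
    return total / len(queries)
```

With in-batch negatives, a batch of B triplets gives each query B − 1 negatives at no extra encoding cost; the mined hard negatives are appended to that pool.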

Step 3 — Fine-tune LLM (Query Parsing + Explanation)

python src/train.py --config configs/qlora_qwen3_5_0_8b.yaml

QLoRA on Qwen3.5-0.8B — runs on any GPU with 6–8 GB VRAM (RTX 3060, T4, etc.). Merge LoRA weights after training for faster inference.
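Merging is faster at inference because the low-rank update is folded into the base weight once, instead of being computed on every forward pass. The sketch below shows the standard LoRA arithmetic, W' = W + (alpha / r) · B · A, on tiny matrices; the helper names are illustrative, and real training libraries typically provide their own merge utility.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int) -> Matrix:
    """Fold the LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def apply(matrix: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]
```

After merging, `apply(merged, x)` gives the same result as the unmerged path `W x + (alpha / r) · B (A x)` using a single matrix-vector product.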

Step 4 — Build Index & Run Pipeline

# Index your candidate pool
python src/pipeline/index.py \
 --resumes data/resumes/ \
 --model outputs/bge-recruit/
# Interactive matching
python src/pipeline/match.py \
 --jd "Your job description here" \
 --top_k 20 \
 --interactive

Project Structure

recruitGPT/
│
├── configs/
│ ├── qlora_qwen3_5_0_8b.yaml # LLM fine-tuning (student model)
│ ├── qlora_qwen7b.yaml # LLM fine-tuning (teacher reference)
│ ├── qlora_qwen3b.yaml # LLM low-resource alternative
│ ├── bge_finetune.yaml # BGE embedding fine-tuning
│ └── reranker_finetune.yaml # Cross-encoder fine-tuning
│
├── data/
│ ├── seed/ # Hand-written seed examples
│ ├── pairs/ # Embedding training triplets
│ ├── reranker/ # Reranker training pairs
│ ├── resumes/ # Candidate resume corpus
│ ├── jds/ # Job description corpus
│ └── generated/ # Distilled training data
│
├── scripts/
│ ├── distill_data.py # Teacher → student data generation
│ ├── build_embedding_pairs.py # Build (query, pos, neg) triplets
│ ├── mine_hard_negatives.py # Hard negative mining
│ ├── build_reranker_data.py # Reranker training data
│ ├── build_graph.py # Knowledge graph construction
│ ├── filter_data.py # Data quality filtering
│ └── convert_format.py # Format conversion utility
│
├── src/
│ ├── embedding/ # Stage 2
│ │ ├── train_embedding.py # BGE contrastive fine-tuning
│ │ ├── eval_embedding.py # Recall@K, MRR evaluation
│ │ ├── encode.py # Encode & retrieve
│ │ └── losses.py # InfoNCE, triplet loss
│ │
│ ├── reranker/ # Stage 3
│ │ ├── train_reranker.py # Cross-encoder fine-tuning
│ │ ├── eval_reranker.py # NDCG, MAP evaluation
│ │ └── rerank.py # Reranking inference
│ │
│ ├── graph/ # Stage 4
│ │ ├── schema.py # Graph schema definition
│ │ ├── builder.py # Build skill/company/industry graph
│ │ └── boost.py # Graph signal scoring
│ │
│ ├── pipeline/ # End-to-end pipeline
│ │ ├── query_parser.py # Stage 1 — LLM query parsing
│ │ ├── retriever.py # Stage 2 — vector retrieval
│ │ ├── reranker_stage.py # Stage 3 — reranking
│ │ ├── graph_stage.py # Stage 4 — graph signal
│ │ ├── explainer.py # Stage 5 — LLM explanation
│ │ ├── index.py # FAISS index management
│ │ └── match.py # Main orchestrator
│ │
│ ├── teacher.py # Unified teacher model interface
│ ├── prompts.py # All prompt templates
│ ├── train.py # LLM QLoRA training (Unsloth)
│ ├── evaluate.py # LLM-as-Judge evaluation
│ └── inference.py # LLM interactive inference
│
├── docs/
│ └── cloud_infra.md # Cloud infrastructure guide (GCS, Vertex AI, serving)
│
├── eval/
│ ├── eval_set.jsonl # LLM evaluation set
│ └── retrieval_benchmark.jsonl # Embedding retrieval benchmark
│
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_embedding_analysis.ipynb
│ └── 03_pipeline_demo.ipynb
│
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md

Models Used

| Component | Base Model | Fine-tune Method | GPU Requirement |
|---|---|---|---|
| Query Parser | Qwen/Qwen3.5-0.8B-Instruct | QLoRA (4-bit) | 6–8 GB |
| Explainer (optional) | Qwen/Qwen3.5-0.8B-Instruct | QLoRA (4-bit) | 6–8 GB |
| Embedding | BAAI/bge-large-zh-v1.5 | Contrastive learning | 12–16 GB |
| Reranker | BAAI/bge-reranker-v2-m3 | Cross-encoder | 12–16 GB |
| Graph | NetworkX | No training | CPU only |

Teacher models (for distillation data generation only): DeepSeek-V3, GPT-4o, or Claude via API.

Cost Estimate

Assuming you use RunPod or AutoDL for GPU rental:

| Step | Estimated Cost |
|---|---|
| Distill 3,000 LLM training samples (DeepSeek API) | ~$2–5 |
| Mine hard negatives + build triplets | ~$1–2 (GPU) |
| Fine-tune BGE embedding | ~$1–3 (A6000, <1hr) |
| Fine-tune LLM QLoRA (Qwen3.5-0.8B) | ~$0.5–2 (T4/A10G, <1hr) |
| Total | ~$7–18 |

Evaluation

Embedding Retrieval

python src/embedding/eval_embedding.py \
 --model outputs/bge-recruit/ \
 --eval_data data/pairs/eval_triplets.jsonl
# Outputs: Recall@10, Recall@50, MRR
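For readers unfamiliar with these metrics, here is a minimal sketch of how Recall@K and MRR are conventionally computed from a ranked list of candidate IDs. The function names are illustrative and are not taken from `eval_embedding.py`.

```python
from typing import List, Set, Tuple


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant candidates that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in ranked[:k] if cid in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

Recall@K measures whether Stage 2 hands the good candidates to the reranker at all; MRR measures how close to the top the first good candidate lands.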

Reranker

python src/reranker/eval_reranker.py \
 --model outputs/reranker/ \
 --eval_data data/reranker/eval.jsonl
# Outputs: NDCG@5, NDCG@10, MAP
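The ranking metrics can be sketched as follows. This assumes the linear-gain form of DCG and an average precision normalized by the relevant items present in the ranked list; `eval_reranker.py` may use different variants.

```python
import math
from typing import List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(gains: List[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def average_precision(relevance: List[int]) -> float:
    """Mean of precision@rank taken at each relevant position (binary labels).

    Assumes every relevant item appears somewhere in the ranked list.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

`gains` and `relevance` are the labels of the candidates in the order the reranker returned them. MAP is the mean of `average_precision` over all queries.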

LLM (Judge-based)

python src/evaluate.py \
 --model_path outputs/qwen3_5_0_8b-recruit/merged \
 --eval_data eval/eval_set.jsonl \
 --judge deepseek
# Outputs: Accuracy, Format, Professionalism, Usefulness (1–5 scale)

Roadmap

  • LLM distillation pipeline (query parsing + explanation)
  • BGE embedding fine-tuning with hard negative mining
  • Cross-encoder reranker
  • Skill/company knowledge graph
  • Multi-language support (EN/ZH/JA)
  • Resume PDF parsing (OCR + layout)
  • Real-time indexing API
  • Web UI demo
  • DPO alignment for explanation quality

MLOps Roadmap (GCP)

This section describes the path to a production-grade MLOps system on Google Cloud Platform.

Maturity Levels

Level 0 (current) → Manual scripts, local GPU
Level 1 → Reproducible ML pipelines, experiment tracking
Level 2 → CI/CD for ML, automated retraining & deployment

Target Architecture

┌─────────────────────────────────────────────────────────────────┐
│ CI/CD Layer │
│ GitHub → Cloud Build → Artifact Registry → Pipeline │
└──────────────────────────────┬──────────────────────────────────┘
 │
┌──────────────────────────────▼──────────────────────────────────┐
│ Data & Experiment Layer │
│ GCS (raw/processed/artifacts) BigQuery DVC │
│ Vertex AI Experiments (metrics, hyperparams, artifacts) │
└──────────────────────────────┬──────────────────────────────────┘
 │
┌──────────────────────────────▼──────────────────────────────────┐
│ Training Pipeline (Vertex AI Pipelines) │
│ │
│ [distill_data] → [build_pairs] → [mine_negatives] │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ [train_bge] [train_reranker] [train_llm] │
│ └───────────────┼───────────────┘ │
│ [evaluate] │
│ │ │
│ [register → Model Registry] │
└──────────────────────────────┬──────────────────────────────────┘
 │
┌──────────────────────────────▼──────────────────────────────────┐
│ Serving Layer │
│ Vertex AI Endpoints (online) Batch Prediction (batch) │
│ Cloud Run (FAISS index API) │
└──────────────────────────────┬──────────────────────────────────┘
 │
┌──────────────────────────────▼──────────────────────────────────┐
│ Monitoring Layer │
│ Vertex AI Model Monitoring Cloud Monitoring Looker Studio │
└─────────────────────────────────────────────────────────────────┘
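The training pipeline box above is a DAG, and its execution order can be derived with the standard library alone. This sketch only models the dependencies as drawn in the diagram (step names copied from it); it is not a Vertex AI Pipelines definition.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the set of steps it depends on, following the diagram.
PIPELINE: Dict[str, Set[str]] = {
    "distill_data": set(),
    "build_pairs": {"distill_data"},
    "mine_negatives": {"build_pairs"},
    "train_bge": {"mine_negatives"},
    "train_reranker": {"mine_negatives"},
    "train_llm": {"mine_negatives"},
    "evaluate": {"train_bge", "train_reranker", "train_llm"},
    "register": {"evaluate"},
}


def execution_waves(dag: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; steps within one wave can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Calling `execution_waves(PIPELINE)` places the three training jobs in the same wave, which is the fan-out an orchestrator can schedule on separate GPU workers.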

GCP Services by Function

| Function | GCP Service | Purpose |
|---|---|---|
| Raw data & artifacts | Cloud Storage (GCS) | resumes, JDs, model checkpoints |
| Structured metrics | BigQuery | eval results, match history, experiment comparison |
| Data versioning | DVC + GCS backend | track changes to data/pairs/, data/generated/ |
| Experiment tracking | Vertex AI Experiments | loss curves, hyperparams, Recall@K per run |
| GPU training jobs | Vertex AI Training (Custom Jobs) | BGE, reranker, QLoRA fine-tuning |
| Training images | Artifact Registry | versioned Docker images for each training job |
| Pipeline orchestration | Vertex AI Pipelines (KFP v2) | DAG with caching, retry, conditional steps |
| Scheduled retraining | Cloud Scheduler | cron-triggered pipeline runs |
| Model versioning | Vertex AI Model Registry | promote models with eval thresholds |
| Online inference | Vertex AI Endpoints | real-time JD → candidate matching API |
| Batch inference | Vertex AI Batch Prediction | periodic full-pool rescoring |
| FAISS index API | Cloud Run | stateless index serving, loaded from GCS |
| CI/CD trigger | Cloud Build | PR merge → rebuild image → run pipeline |
| Data drift detection | Vertex AI Model Monitoring | embedding distribution shift alerts |
| Dashboards | Looker Studio + BigQuery | matching quality trends, pipeline health |
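The "promote models with eval thresholds" step amounts to a simple gate before registration. The sketch below is one way such a gate could look; the metric names and threshold values are hypothetical, not values defined by this project.

```python
from typing import Dict, List, Tuple


def should_promote(
    metrics: Dict[str, float],
    thresholds: Dict[str, float],
) -> Tuple[bool, List[str]]:
    """Return (ok, failures): ok is True only if every thresholded metric passes.

    A metric missing from `metrics` counts as a failure.
    """
    failures = [
        f"{name}: {metrics.get(name)} < {minimum}"
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]
    return (not failures, failures)


# Hypothetical thresholds for illustration only.
EXAMPLE_THRESHOLDS = {"recall@10": 0.80, "ndcg@10": 0.65}
```

Returning the list of failures alongside the boolean makes it easy to log why a candidate model was rejected.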

Phased Rollout

| Phase | Goal | Key Services |
|---|---|---|
| Phase 1 | Reproducible training | GCS + Vertex AI Training + Experiments |
| Phase 2 | Automated pipeline DAG | Vertex AI Pipelines + Model Registry |
| Phase 3 | CI/CD integration | Cloud Build + Artifact Registry |
| Phase 4 | Production serving | Vertex AI Endpoints + Cloud Run |
| Phase 5 | Monitoring & alerting | Model Monitoring + BigQuery + Looker Studio |

GPU Requirements on GCP

| Training Job | Recommended Instance | Estimated Duration |
|---|---|---|
| BGE embedding fine-tune | a2-highgpu-1g (A100 40GB) | < 1 hr |
| Cross-encoder reranker | a2-highgpu-1g (A100 40GB) | 1–3 hr |
| QLoRA Qwen3.5-0.8B | n1-standard-4 + T4 (16GB) | < 1 hr |
| Hard negative mining | n1-standard-8 (CPU) or GPU | < 30 min |

Note: GCP A100 quota is 0 by default. Request an increase via IAM & Admin → Quotas at least 3–5 business days before your training run.

  • Phase 1 — GCS data lake + Vertex AI Training + Experiments
  • Phase 2 — Vertex AI Pipelines DAG + Model Registry
  • Phase 3 — Cloud Build CI/CD + Artifact Registry
  • Phase 4 — Vertex AI Endpoints + Cloud Run serving
  • Phase 5 — Model Monitoring + BigQuery + Looker Studio dashboards

Contributing

Contributions are welcome. Please open an issue first to discuss what you'd like to change.

License

MIT
