Open Source Browserless Web Scraping API with Human-like Behavior
License: MIT Node.js Playwright GitHub Open Source
Unified Solution: Website + API on a single domain
Human-like Behavior: 40+ anti-detection techniques
Deploy Anywhere: Docker, Node.js+PM2, or Development
- Unified Architecture: Website and API on one domain
- Human-like Intelligence: Natural mouse movements, smart scrolling, behavioral randomization
- Multiple Formats: HTML, text, screenshots, PDFs
- Batch Processing: Handle multiple URLs efficiently
- Production Ready: Docker, PM2, Nginx, SSL support
- Anti-Detection: 40+ stealth techniques for reliable scraping

    # 1. Clone and configure
    git clone https://github.com/SaifyXPRO/HeadlessX.git
    cd HeadlessX

    # Quick setup (makes scripts executable + creates .env)
    chmod +x scripts/quick-setup.sh && ./scripts/quick-setup.sh

    # Then edit the environment file
    nano .env  # Update DOMAIN, SUBDOMAIN, and AUTH_TOKEN

Choose your deployment:
| Method | Command | Best For |
|---|---|---|
| Docker | `docker-compose up -d` | Production, easy deployment |
| Auto Setup | `chmod +x scripts/setup.sh && sudo ./scripts/setup.sh` | VPS/Server with full control |
| Development | `npm install && npm start` | Local development, testing |
Access your HeadlessX:
- Website: https://your-subdomain.yourdomain.com
- Health: https://your-subdomain.yourdomain.com/api/health
- Status: https://your-subdomain.yourdomain.com/api/status?token=YOUR_AUTH_TOKEN
HeadlessX v1.2.0 introduces a completely refactored modular architecture for better maintainability, scalability, and development experience.
- Separation of Concerns: Distinct modules for configuration, services, controllers, and middleware
- Better Performance: Optimized browser management and resource usage
- Developer Experience: Clear module boundaries and dependency injection
- Production Ready: Enhanced error handling and logging with correlation IDs
- Security: Improved authentication and rate limiting
- Monitoring: Structured logging and health monitoring
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Routes      │───▶│   Controllers   │───▶│    Services     │
│    (api.js)     │    │  (rendering.js) │    │  (browser.js)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Middleware    │    │      Utils      │    │     Config      │
│    (auth.js)    │    │   (logger.js)   │    │   (index.js)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Quick Migration from v1.1.0:
- The original `src/server.js` (3079 lines) has been broken down into 20+ focused modules
- Environment variable `TOKEN` is now `AUTH_TOKEN`
- PM2 config moved from `config/ecosystem.config.js` to `ecosystem.config.js`
- All functionality preserved with improved performance and maintainability
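
The `TOKEN` to `AUTH_TOKEN` rename can be applied to an existing `.env` file mechanically. The following is a minimal sketch, not part of HeadlessX: the helper name `migrate_env_text` is hypothetical, and it assumes a plain `KEY=value` file with one assignment per line.

```python
def migrate_env_text(text: str) -> str:
    """Rename the v1.1.0 TOKEN variable to AUTH_TOKEN, leaving other lines untouched."""
    migrated = []
    for line in text.splitlines():
        stripped = line.lstrip()
        # Only rewrite an assignment to exactly TOKEN, not AUTH_TOKEN, MY_TOKEN, or comments.
        if stripped.startswith("TOKEN="):
            indent = line[: len(line) - len(stripped)]
            line = indent + "AUTH_" + stripped
        migrated.append(line)
    return "\n".join(migrated)


if __name__ == "__main__":
    print(migrate_env_text("TOKEN=abc123\nPORT=3000"))
```

Running the function a second time changes nothing, because an `AUTH_TOKEN=` line no longer starts with `TOKEN=`.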
Detailed Documentation: MODULAR_ARCHITECTURE.md

    # Install Docker (if needed)
    curl -fsSL https://get.docker.com | sh
    sudo usermod -aG docker $USER

    # Deploy HeadlessX
    git clone https://github.com/SaifyXPRO/HeadlessX.git
    cd HeadlessX
    cp .env.example .env
    nano .env  # Configure DOMAIN, SUBDOMAIN, AUTH_TOKEN

    # Start services
    docker-compose up -d

    # Optional: Setup SSL
    sudo apt install certbot
    sudo certbot certonly --standalone -d your-subdomain.yourdomain.com

Docker Management:

    docker-compose ps              # Check status
    docker-compose logs headlessx  # View logs
    docker-compose restart         # Restart services
    docker-compose down            # Stop services


    # Automated setup (recommended)
    git clone https://github.com/SaifyXPRO/HeadlessX.git
    cd HeadlessX
    cp .env.example .env
    nano .env  # Configure environment
    chmod +x scripts/setup.sh
    sudo ./scripts/setup.sh  # Installs dependencies, builds website, starts PM2

Nginx Configuration (auto-handled by setup script):
The setup script configures Nginx automatically. If you need to configure it manually:

    # Copy and configure nginx site
    sudo cp nginx/headlessx.conf /etc/nginx/sites-available/headlessx

    # Replace placeholders with your actual domain
    sudo sed -i 's/SUBDOMAIN.DOMAIN.COM/your-subdomain.yourdomain.com/g' /etc/nginx/sites-available/headlessx

    # Enable the site
    sudo ln -sf /etc/nginx/sites-available/headlessx /etc/nginx/sites-enabled/
    sudo rm -f /etc/nginx/sites-enabled/default

    # Test and reload nginx
    sudo nginx -t && sudo systemctl reload nginx

Manual setup (if not using setup script):

    sudo apt update && sudo apt upgrade -y
    curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
    sudo apt install -y nodejs build-essential
    npm install && npm run build
    sudo npm install -g pm2
    npm run pm2:start

PM2 Management:

    npm run pm2:status   # Check status
    npm run pm2:logs     # View logs
    npm run pm2:restart  # Restart server
    npm run pm2:stop     # Stop server


    git clone https://github.com/SaifyXPRO/HeadlessX.git
    cd HeadlessX
    cp .env.example .env
    nano .env  # Set AUTH_TOKEN, DOMAIN=localhost, SUBDOMAIN=headlessx

    # Make scripts executable
    chmod +x scripts/*.sh

    # Install dependencies
    npm install
    cd website && npm install && npm run build && cd ..

    # Start development server
    npm start  # Access at http://localhost:3000

HeadlessX Routes:
├── /favicon.ico     → Favicon
├── /robots.txt      → SEO robots file
├── /api/health      → Health check (no auth required)
├── /api/status      → Server status (requires token)
├── /api/render      → Full page rendering
├── /api/html        → HTML extraction
├── /api/content     → Clean text extraction
├── /api/screenshot  → Screenshot generation
├── /api/pdf         → PDF generation
└── /api/batch       → Batch URL processing
Request Flow:
- Nginx receives the request on port 80/443
- Nginx proxies it to the Node.js server on port 3000
- The server routes based on path:
  - `/api/*` → API endpoints
  - `/*` → Website files (built Next.js app)
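
The path rule in the last step can be sketched in a few lines. This is only an illustration of the rule as stated above, not the server's actual routing code; the function name `route_target` is hypothetical.

```python
def route_target(path: str) -> str:
    """Classify a request path according to the request flow described above."""
    # Anything under /api/ is handled by an API endpoint.
    if path.startswith("/api/"):
        return "api"
    # Every other path is served from the built website files.
    return "website"


if __name__ == "__main__":
    for path in ("/api/health", "/", "/docs"):
        print(path, "->", route_target(path))
```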
curl https://your-subdomain.yourdomain.com/api/health

    curl -X POST "https://your-subdomain.yourdomain.com/api/html?token=YOUR_AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com", "timeout": 30000}'


    curl "https://your-subdomain.yourdomain.com/api/screenshot?token=YOUR_AUTH_TOKEN&url=https://example.com&fullPage=true" \
      -o screenshot.png

    curl -X POST "https://your-subdomain.yourdomain.com/api/content?token=YOUR_AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com", "waitForSelector": "main"}'


    curl -X POST "https://your-subdomain.yourdomain.com/api/pdf?token=YOUR_AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com", "format": "A4"}' \
      -o document.pdf

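
When the target URL is passed as a query parameter, as in the screenshot example, it should be percent-encoded. Otherwise an `&` inside the target URL would be read as the start of a new HeadlessX parameter. The sketch below builds such a URL with the standard library; the helper name `build_get_url` is hypothetical, and `fullPage` is the only option name taken from the examples above.

```python
from urllib.parse import urlencode


def build_get_url(base_url: str, endpoint: str, token: str, target_url: str, **options) -> str:
    """Build a GET URL for an endpoint such as /api/screenshot with encoded query values."""
    query = {"token": token, "url": target_url}
    for key, value in options.items():
        # Booleans are sent as lowercase strings, matching fullPage=true in the curl example.
        query[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{base_url.rstrip('/')}{endpoint}?{urlencode(query)}"


if __name__ == "__main__":
    print(
        build_get_url(
            "https://your-subdomain.yourdomain.com",
            "/api/screenshot",
            "YOUR_AUTH_TOKEN",
            "https://example.com/?a=1&b=2",
            fullPage=True,
        )
    )
```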
HTTP Request Module Configuration:
{
"url": "https://your-subdomain.yourdomain.com/api/html",
"method": "POST",
"headers": {
"Content-Type": "application/json"
},
"qs": {
"token": "YOUR_AUTH_TOKEN"
},
"body": {
"url": "{{url_to_scrape}}",
"timeout": 30000,
"waitForSelector": "{{optional_selector}}"
}
}

Webhooks by Zapier Setup:
- URL: `https://your-subdomain.yourdomain.com/api/html?token=YOUR_AUTH_TOKEN`
- Method: POST
- Headers: `Content-Type: application/json`
- Body:
{
"url": "{{url_from_trigger}}",
"timeout": 30000,
"humanBehavior": true
}

HTTP Request Node:
{
"url": "https://your-subdomain.yourdomain.com/api/html",
"method": "POST",
"authentication": "queryAuth",
"query": {
"token": "YOUR_AUTH_TOKEN"
},
"headers": {
"Content-Type": "application/json"
},
"body": {
"url": "={{$json.url}}",
"timeout": 30000,
"humanBehavior": true
}
}

Available via n8n Community Node:
- Install: `npm install n8n-nodes-headlessx`
- GitHub Repository

    import requests

    def scrape_with_headlessx(url, token):
        """Fetch rendered HTML for a URL through the HeadlessX /api/html endpoint."""
        response = requests.post(
            "https://your-subdomain.yourdomain.com/api/html",
            params={"token": token},
            json={
                "url": url,
                "timeout": 30000,
                "humanBehavior": True
            }
        )
        response.raise_for_status()  # Raise on HTTP errors instead of parsing an error body
        return response.json()

    # Usage
    result = scrape_with_headlessx("https://example.com", "YOUR_TOKEN")
    print(result['html'])


    const axios = require('axios');

    async function scrapeWithHeadlessX(url, token) {
      try {
        const response = await axios.post(
          `https://your-subdomain.yourdomain.com/api/html?token=${token}`,
          {
            url: url,
            timeout: 30000,
            humanBehavior: true
          }
        );
        return response.data;
      } catch (error) {
        console.error('Scraping failed:', error.message);
        throw error;
      }
    }

    // Usage
    scrapeWithHeadlessX('https://example.com', 'YOUR_TOKEN')
      .then(result => console.log(result.html))
      .catch(error => console.error(error));


    curl -X POST "https://your-subdomain.yourdomain.com/api/batch?token=YOUR_AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "urls": [
          "https://example1.com",
          "https://example2.com",
          "https://example3.com"
        ],
        "timeout": 30000,
        "humanBehavior": true
      }'


    curl -X POST "https://your-subdomain.yourdomain.com/api/batch?token=YOUR_AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "urls": ["https://example.com", "https://httpbin.org"],
        "format": "text",
        "options": {"timeout": 30000}
      }'

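
Before sending a batch, it can help to validate the URL list on the client side. The sketch below builds a body shaped like the first batch example (`urls`, `timeout`, `humanBehavior`). The helper name `build_batch_payload` is hypothetical, and the validation and de-duplication rules are assumptions for illustration, not documented server behavior.

```python
from urllib.parse import urlparse


def build_batch_payload(urls, timeout=30000, human_behavior=True):
    """Return a JSON-serializable body shaped like the first batch example."""
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # Reject anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        # Drop exact duplicates while keeping first-seen order.
        if url not in cleaned:
            cleaned.append(url)
    if not cleaned:
        raise ValueError("at least one URL is required")
    return {"urls": cleaned, "timeout": timeout, "humanBehavior": human_behavior}


if __name__ == "__main__":
    print(build_batch_payload(["https://example1.com", "https://example2.com"]))
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, in the same way as the Python example earlier.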
HeadlessX v1.2.0 - Modular Architecture/
├── src/                          # Modular application source
│   ├── config/                   # Configuration management
│   │   ├── index.js              # Main configuration loader
│   │   └── browser.js            # Browser-specific settings
│   ├── utils/                    # Utility functions
│   │   ├── errors.js             # Error handling & categorization
│   │   ├── logger.js             # Structured logging
│   │   └── helpers.js            # Common utilities
│   ├── services/                 # Business logic services
│   │   ├── browser.js            # Browser lifecycle management
│   │   ├── stealth.js            # Anti-detection techniques
│   │   ├── interaction.js        # Human-like behavior
│   │   └── rendering.js          # Core rendering logic
│   ├── middleware/               # Express middleware
│   │   ├── auth.js               # Authentication
│   │   └── error.js              # Error handling
│   ├── controllers/              # Request handlers
│   │   ├── system.js             # Health & status endpoints
│   │   ├── rendering.js          # Main rendering endpoints
│   │   ├── batch.js              # Batch processing
│   │   └── get.js                # GET endpoints & docs
│   ├── routes/                   # Route definitions
│   │   ├── api.js                # API route mappings
│   │   └── static.js             # Static file serving
│   ├── app.js                    # Main application setup
│   ├── server.js                 # Entry point for PM2
│   └── rate-limiter.js           # Rate limiting implementation
├── website/                      # Next.js website (unchanged)
│   ├── app/                      # Next.js 13+ app directory
│   ├── components/               # React components
│   ├── .env.example              # Website environment template
│   ├── next.config.js            # Next.js configuration
│   └── package.json              # Website dependencies
├── scripts/                      # Deployment & management scripts
│   ├── setup.sh                  # Automated installation (updated)
│   ├── update_server.sh          # Server update script (updated)
│   ├── verify-domain.sh          # Domain verification
│   └── test-routing.sh           # Integration testing
├── nginx/                        # Nginx configuration
│   └── headlessx.conf            # Nginx proxy config
├── docker/                       # Docker deployment (updated)
│   ├── Dockerfile                # Container definition
│   └── docker-compose.yml        # Docker Compose setup
├── ecosystem.config.js           # PM2 configuration (moved to root)
├── .env.example                  # Environment template (updated)
├── package.json                  # Server dependencies (updated)
├── MODULAR_ARCHITECTURE.md       # Architecture documentation
└── README.md                     # This file

    # 1. Install dependencies
    npm install

    # 2. Build website
    cd website
    npm install
    npm run build
    cd ..

    # 3. Set environment variables
    export AUTH_TOKEN="development_token_123"
    export DOMAIN="localhost"
    export SUBDOMAIN="headlessx"

    # 4. Start server
    npm start  # Uses src/app.js

    # 5. Access locally
    # Website: http://localhost:3000
    # API: http://localhost:3000/api/health

    # Test server and website integration
    bash scripts/test-routing.sh localhost

    # Test with environment variables
    bash scripts/verify-domain.sh

Create your .env file from the template:

    cp .env.example .env
    nano .env

Required configuration:

    # Security Token (Generate a secure random string)
    AUTH_TOKEN=your_secure_token_here

    # Domain Configuration
    DOMAIN=yourdomain.com
    SUBDOMAIN=headlessx

    # Optional: Browser Settings
    BROWSER_TIMEOUT=60000
    MAX_CONCURRENT_BROWSERS=5

    # Optional: Server Settings
    PORT=3000
    NODE_ENV=production

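
One way to generate the random token and to check a `.env` file for the required keys is sketched below. It is not part of HeadlessX: the helper names are hypothetical, the parser assumes simple `KEY=value` lines, and treating `your_secure_token_here` as unset is an assumption based on the template above.

```python
import secrets

REQUIRED_KEYS = ("AUTH_TOKEN", "DOMAIN", "SUBDOMAIN")
PLACEHOLDER_TOKEN = "your_secure_token_here"  # value shown in the template above


def generate_auth_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random string suitable for use as AUTH_TOKEN."""
    return secrets.token_urlsafe(num_bytes)


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_required(env: dict) -> list:
    """List required keys that are absent, empty, or still set to the placeholder."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if env.get("AUTH_TOKEN") == PLACEHOLDER_TOKEN:
        missing.append("AUTH_TOKEN")
    return missing


if __name__ == "__main__":
    print("AUTH_TOKEN=" + generate_auth_token())
```

The `secrets` module is used instead of `random` because it draws from the operating system's cryptographically secure source.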
Option 1: Automatic (Recommended)

    # The setup script automatically replaces domain placeholders
    sudo ./scripts/setup.sh

Option 2: Manual Configuration

    # Copy nginx configuration
    sudo cp nginx/headlessx.conf /etc/nginx/sites-available/headlessx

    # Replace domain placeholders (replace with your actual domain)
    sudo sed -i 's/SUBDOMAIN.DOMAIN.COM/headlessx.yourdomain.com/g' /etc/nginx/sites-available/headlessx

    # Example: if your full hostname is "api.example.com", run this instead:
    # sudo sed -i 's/SUBDOMAIN.DOMAIN.COM/api.example.com/g' /etc/nginx/sites-available/headlessx

    # Enable site and reload nginx
    sudo ln -sf /etc/nginx/sites-available/headlessx /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx

Your final URLs will be:
- Website: `https://your-subdomain.yourdomain.com`
- API Health: `https://your-subdomain.yourdomain.com/api/health`
- API Endpoints: `https://your-subdomain.yourdomain.com/api/*`
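
These URLs follow from the `SUBDOMAIN` and `DOMAIN` values in `.env`. A small sketch of that relationship, assuming the public hostname is `SUBDOMAIN.DOMAIN` served over HTTPS, as the Nginx placeholder `SUBDOMAIN.DOMAIN.COM` suggests; the helper name `service_urls` is hypothetical.

```python
def service_urls(subdomain: str, domain: str) -> dict:
    """Derive the public URLs from the SUBDOMAIN and DOMAIN settings."""
    base = f"https://{subdomain}.{domain}"
    return {
        "website": base,
        "health": f"{base}/api/health",
        "api": f"{base}/api",
    }


if __name__ == "__main__":
    for name, url in service_urls("headlessx", "yourdomain.com").items():
        print(f"{name}: {url}")
```

For local development the server is reached at `http://localhost:3000` instead, so this derivation applies only to a deployed instance.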
| Endpoint | Method | Description | Auth Required |
|---|---|---|---|
| `/api/health` | GET | Health check | No |
| `/api/status` | GET | Server status | Yes |
| `/api/render` | POST | Full page rendering (JSON) | Yes |
| `/api/html` | GET/POST | Raw HTML extraction | Yes |
| `/api/content` | GET/POST | Clean text extraction | Yes |
| `/api/screenshot` | GET | Screenshot generation | Yes |
| `/api/pdf` | GET | PDF generation | Yes |
| `/api/batch` | POST | Batch URL processing | Yes |
All endpoints (except /api/health) require a token via:
- Query parameter: `?token=YOUR_TOKEN`
- Header: `X-Token: YOUR_TOKEN`
- Header: `Authorization: Bearer YOUR_TOKEN`
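
The three options above can be expressed as request arguments. The sketch below only builds the `params` and `headers` dictionaries and sends nothing; the helper name `auth_request_parts` and its `style` labels are hypothetical.

```python
def auth_request_parts(token: str, style: str = "bearer") -> dict:
    """Return params/headers carrying the token in one of the three listed ways."""
    if style == "query":
        return {"params": {"token": token}, "headers": {}}
    if style == "x-token":
        return {"params": {}, "headers": {"X-Token": token}}
    if style == "bearer":
        return {"params": {}, "headers": {"Authorization": f"Bearer {token}"}}
    raise ValueError(f"unknown auth style: {style!r}")


if __name__ == "__main__":
    print(auth_request_parts("YOUR_TOKEN", "x-token"))
```

With the `requests` library, the result can be unpacked into a call such as `requests.get(url, **auth_request_parts(token))`. Sending the token in a header rather than the query string keeps it out of the request URL, which web servers commonly write to their access logs.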
Visit your HeadlessX website for full API documentation with examples, or check:
curl https://your-subdomain.yourdomain.com/api/health
curl "https://your-subdomain.yourdomain.com/api/status?token=YOUR_TOKEN"

    # PM2 logs
    npm run pm2:logs
    pm2 logs headlessx --lines 100

    # Docker logs
    docker-compose logs -f headlessx

    # Nginx logs
    sudo tail -f /var/log/nginx/access.log


    git pull origin main
    npm run build            # Rebuild website
    npm run pm2:restart      # PM2
    # OR
    docker-compose restart   # Docker

"npm ci" Error (missing package-lock.json):

    chmod +x scripts/generate-lockfiles.sh
    ./scripts/generate-lockfiles.sh  # Generate lock files
    # OR
    npm install --production         # Use install instead

"Cannot find module 'express'":

    npm install  # Install dependencies

System dependency errors (Ubuntu):

    sudo apt update && sudo apt install -y \
      libatk1.0-0t64 libatk-bridge2.0-0t64 libcups2t64 \
      libatspi2.0-0t64 libasound2t64 libxcomposite1

PM2 not starting:

    sudo npm install -g pm2
    chmod +x scripts/setup.sh      # Make script executable
    pm2 start ecosystem.config.js  # Config lives in the project root since v1.2.0
    pm2 logs headlessx             # Check errors

Script permission errors:

    # Make all scripts executable
    chmod +x scripts/*.sh

    # Or use the quick setup
    chmod +x scripts/quick-setup.sh && ./scripts/quick-setup.sh

Playwright browser installation errors:

    # Use dedicated Playwright setup script
    chmod +x scripts/setup-playwright.sh
    ./scripts/setup-playwright.sh

    # Or install manually:
    sudo apt update && sudo apt install -y \
      libgtk-3-0t64 libpangocairo-1.0-0 libcairo-gobject2 \
      libgdk-pixbuf-2.0-0 libdrm2 libxss1 libxrandr2 \
      libasound2t64 libatk1.0-0t64 libnss3

    # Install only Chromium (most stable)
    npx playwright install chromium

    # Alternative: Use Docker (avoids dependency issues)
    docker-compose up -d

- Token Authentication: Secure API access with custom tokens
- Rate Limiting: Nginx-level request throttling
- Security Headers: XSS, CSRF, and clickjacking protection
- Bot Protection: Common attack vector blocking
- SSL/TLS: Automatic HTTPS with Let's Encrypt
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Visit your deployed website for full API docs
- Issues: GitHub Issues
- Discussions: GitHub Discussions
HeadlessX v1.2.0 - The most advanced open-source browserless web scraping solution.
Made with ❤️ for the developer community.