A comprehensive and robust YouTube scraper built with Scrapy framework that efficiently extracts video search results and detailed channel information. This professional-grade scraper is designed for data researchers, content analysts, marketers, and developers who need reliable YouTube data extraction capabilities.
Before using this scraper, we recommend checking out our comprehensive YouTube Website Analyzer which provides detailed insights about:
- Scraping Difficulty: Current anti-bot measures and detection methods
- Legal Considerations: Terms of Service compliance and legal implications
- Technical Challenges: JavaScript rendering, rate limiting, and IP blocking
- Best Practices: Recommended approaches for ethical and efficient scraping
- Alternative Solutions: When to use official APIs vs web scraping
For a complete guide on how to scrape YouTube effectively, check out our Original YouTube Scraping Guide.
- YouTube Search Scraper: Extract video URLs, titles, channel handles, view counts, and upload dates
- YouTube Channel Scraper: Gather profile pictures, subscriber counts, video counts, and channel descriptions
- Anti-Bot Protection: Advanced middleware with user-agent rotation and request throttling
- Data Export: Multiple formats (CSV, JSON) with comprehensive data validation
- Continuation Support: Automatic pagination for large result sets
- Smart Data Extraction: Handles both JavaScript-rendered content and HTML fallbacks
- Performance Optimized: Concurrent requests with intelligent rate limiting
- Data Cleaning: Automatic normalization of view counts, subscriber numbers, and durations
```bash
# Clone the repository
git clone https://github.com/Simple-Python-Scrapy-Scrapers/youtube-scrapy-scraper.git
cd youtube-scrapy-scraper

# Install dependencies
pip install -r requirements.txt

# Install ScrapeOps Proxy (recommended)
pip install scrapeops-scrapy-proxy-sdk
```
Search YouTube Videos:
```bash
# Search for videos with default query
scrapy crawl youtube_search

# Search with custom query and result limit
scrapy crawl youtube_search -a query="python programming" -a max_results=100

# Search for specific topics
scrapy crawl youtube_search -a query="machine learning tutorial" -a max_results=50
```
Scrape YouTube Channels:
```bash
# Scrape channels by handles
scrapy crawl youtube_channel -a channel_handles="@freecodecamp,@programmingwithmosh,@TechWithTim"

# Scrape channels by URLs
scrapy crawl youtube_channel -a channel_urls="https://www.youtube.com/@freecodecamp,https://www.youtube.com/@scrapeops"

# Mix handles and URLs
scrapy crawl youtube_channel -a channel_handles="@mkbhd" -a channel_urls="https://www.youtube.com/@veritasium"
```
- Video Information: URL, ID, title, description, thumbnail URL, duration
- Channel Data: Name, handle, URL, verification status
- Metrics: View count (raw and normalized), upload date, search position
- Metadata: Search query, page number, content type detection
- Video Properties: Live status, shorts detection, premium content flags
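Put together, a single search-result item might look like the sketch below. The field names are illustrative, inferred from the lists above; they are not guaranteed to match the scraper's exact output schema.

```python
# Illustrative search-result item; field names are assumptions
# based on the data fields documented above.
search_item = {
    "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "video_id": "dQw4w9WgXcQ",
    "title": "Example Video Title",
    "duration": "12:34",
    "channel_name": "Example Channel",
    "channel_handle": "@examplechannel",
    "is_verified": False,
    "views": "1.2M views",          # raw string as shown on the page
    "views_normalized": 1_200_000,  # parsed integer
    "date_uploaded": "2 years ago",
    "search_query": "python programming",
    "page_number": 1,
    "content_type": "video",        # e.g. video, short, live
}
```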
- Basic Info: Channel ID, name, handle, custom URL, description
- Visual Assets: Profile picture (multiple resolutions), banner image
- Statistics: Subscriber count, video count, total views (raw and normalized)
- Details: Join date, country, language, verification badges
- Social Links: Website, Twitter, Instagram, Facebook URLs
- Performance: Engagement rate, channel category, content analysis
- Content Trend Analysis: Track viral videos and trending topics
- Creator Performance Studies: Analyze channel growth and engagement patterns
- Market Research: Understand audience preferences and content gaps
- Academic Research: Large-scale YouTube ecosystem studies
- Competitor Analysis: Monitor competitor channels and content strategies
- Influencer Discovery: Find relevant creators for brand partnerships
- Content Strategy: Optimize video titles and descriptions based on successful patterns
- ROI Measurement: Track campaign performance and brand mentions
- API Alternative: Cost-effective alternative to YouTube Data API quota limits - Get a free ScrapeOps API key
- Data Pipeline: Feed YouTube data into analytics platforms
- Content Curation: Automate content discovery and recommendation systems
- Monitoring Tools: Track brand mentions and content performance
```python
# Search Spider Configuration
SEARCH_SETTINGS = {
    'query': 'your search term',
    'max_results': 100,
    'sort_by': 'relevance',   # relevance, date, views, rating
    'upload_date': 'any',     # hour, today, week, month, year
    'duration': 'any',        # short, medium, long
    'quality': 'any'          # hd, hq, sd
}

# Channel Spider Configuration
CHANNEL_SETTINGS = {
    'include_about_page': True,
    'extract_social_links': True,
    'get_recent_videos': False,
    'analyze_performance': True
}
```
```python
# Data Export Configuration
FEEDS = {
    'data/youtube_search_%(time)s.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'fields': ['video_url', 'title', 'channel_name', 'views', 'date_uploaded']
    },
    'data/youtube_channels_%(time)s.json': {
        'format': 'json',
        'encoding': 'utf8',
        'indent': 2
    }
}
```
This YouTube scraper uses ScrapeOps Proxy as the proxy solution. ScrapeOps has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, but it can easily be scaled up to millions of pages per month if need be.
To use the ScrapeOps Proxy you need to first install the proxy middleware:
```bash
pip install scrapeops-scrapy-proxy-sdk
```
Then activate the ScrapeOps Proxy by adding your API key to the SCRAPEOPS_API_KEY in the settings.py file.
```python
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```
The scraper employs multiple extraction strategies:
- Primary Method: Extracts data from `ytInitialData` JavaScript objects
- Fallback Method: HTML parsing when JavaScript extraction fails
- Continuation API: Handles pagination through YouTube's internal APIs
- Error Recovery: Automatic retries with exponential backoff
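The primary method can be sketched as follows: a regular expression pulls the `ytInitialData` assignment out of the page source and parses it with `json.loads`. This is a simplified illustration, not the scraper's exact implementation; real YouTube pages vary in how the object is assigned (e.g. `window["ytInitialData"]`), so the production code likely handles more patterns.

```python
import json
import re

def extract_yt_initial_data(html: str):
    """Pull the ytInitialData JSON object out of a YouTube page.

    Returns the parsed dict, or None if the object is not found
    (at which point an HTML-parsing fallback would take over).
    """
    match = re.search(r"var ytInitialData\s*=\s*(\{.*?\});", html, re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(1))

# Minimal demonstration with a fake page snippet:
html = '<script>var ytInitialData = {"contents": {"items": []}};</script>'
data = extract_yt_initial_data(html)
```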
- URL Validation: Ensures all YouTube URLs are properly formatted
- Number Normalization: Converts "1.2M views" to 1200000
- Text Cleaning: Removes extra whitespace and special characters
- Duplicate Detection: Prevents duplicate entries in datasets
- Data Enrichment: Adds calculated fields like engagement rates
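The number-normalization step can be sketched like this. It is a minimal version; the scraper's actual implementation may cover more formats and locales.

```python
import re

# Suffix multipliers for abbreviated counts like "1.2M" or "850K".
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def normalize_count(text: str):
    """Convert strings like '1.2M views' or '45,210 subscribers' to an int.

    Returns None when no number can be found in the text.
    """
    match = re.search(r"([\d.,]+)\s*([KMB])?", text, re.IGNORECASE)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return int(round(number * _MULTIPLIERS.get(suffix, 1)))

print(normalize_count("1.2M views"))          # 1200000
print(normalize_count("45,210 subscribers"))  # 45210
```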
- Concurrent Processing: Multiple requests processed simultaneously
- Smart Caching: HTTP caching reduces redundant requests
- Request Throttling: Adaptive delays prevent rate limiting
- Memory Management: Efficient memory usage for large datasets
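These optimizations map onto standard Scrapy settings. A plausible configuration is shown below; the values are illustrative, not the project's shipped defaults.

```python
# settings.py excerpt -- illustrative values, tune for your use case.

# Concurrent processing: number of simultaneous requests.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Smart caching: avoid re-fetching pages during development.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # cache entries valid for one hour

# Request throttling: adapt delays to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
DOWNLOAD_DELAY = 1
```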
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load scraped data
search_data = pd.read_csv('data/youtube_search_results.csv')
channel_data = pd.read_csv('data/youtube_channels.csv')

# Analyze view distribution
search_data['views_normalized'].hist(bins=50)
plt.title('Distribution of Video Views')
plt.xlabel('Views')
plt.ylabel('Frequency')
plt.show()

# Top channels by subscribers
top_channels = channel_data.nlargest(10, 'subscriber_count_normalized')
print(top_channels[['channel_name', 'subscriber_count', 'video_count']])

# Engagement rate analysis
channel_data['engagement_rate'] = (
    channel_data['average_views'] / channel_data['subscriber_count_normalized'] * 100
)
high_engagement = channel_data.nlargest(10, 'engagement_rate')
```
```sql
-- Most popular videos by view count
SELECT title, channel_name, views_normalized, date_uploaded
FROM youtube_search_results
ORDER BY views_normalized DESC
LIMIT 20;

-- Channel performance metrics
SELECT channel_name,
       subscriber_count_normalized,
       video_count,
       (subscriber_count_normalized / video_count) AS subscribers_per_video
FROM youtube_channels
WHERE subscriber_count_normalized > 100000
ORDER BY subscribers_per_video DESC;

-- Content type analysis
SELECT content_type,
       COUNT(*) AS video_count,
       AVG(views_normalized) AS avg_views
FROM youtube_search_results
GROUP BY content_type
ORDER BY avg_views DESC;
```
- Respect Rate Limits: Use appropriate delays between requests
- Terms of Service: Ensure compliance with YouTube's ToS
- Data Privacy: Handle scraped data responsibly and securely
- Attribution: Provide proper attribution when using scraped data
- Monitor Usage: Track request volumes and response times
- Error Handling: Implement robust error handling and logging
- Data Storage: Use secure storage methods for sensitive data
- Regular Updates: Keep scraper updated with website changes
JavaScript Extraction Fails
```bash
# Enable debug logging
scrapy crawl youtube_search -L DEBUG

# Check for blocked requests
grep "403\|429" scrapy.log
```
Rate Limiting
```python
# Increase delays in settings.py
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
```
Data Quality Issues
```python
# Enable data validation pipeline
ITEM_PIPELINES = {
    'youtube_scraper.pipelines.YoutubeDataValidationPipeline': 200,
    'youtube_scraper.pipelines.YoutubeDataCleaningPipeline': 300,
}
```
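A minimal version of such a validation pipeline might look like the sketch below. The field names are assumptions, and in a real Scrapy project you would raise `scrapy.exceptions.DropItem` rather than `ValueError`.

```python
class YoutubeDataValidationPipeline:
    """Drop items that are missing required fields or have malformed URLs.

    Sketch only: field names are assumptions, and a real Scrapy pipeline
    would raise scrapy.exceptions.DropItem instead of ValueError.
    """

    REQUIRED_FIELDS = ("video_url", "title", "channel_name")

    def process_item(self, item, spider):
        # Reject items with any required field missing or empty.
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                raise ValueError(f"Missing required field: {field}")
        # Reject items whose URL is not a proper YouTube link.
        if not item["video_url"].startswith("https://www.youtube.com/"):
            raise ValueError(f"Invalid YouTube URL: {item['video_url']}")
        return item
```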
```bash
# Test single video extraction
scrapy shell "https://www.youtube.com/watch?v=VIDEO_ID"

# Validate channel URL
scrapy shell "https://www.youtube.com/@channelhandle"

# Check middleware functionality
scrapy crawl youtube_search -s LOG_LEVEL=DEBUG
```
- Search Speed: ~50 videos/minute with rate limiting
- Channel Speed: ~20 channels/minute including about pages
- Data Accuracy: >95% successful field extraction
- Memory Usage: <500MB for 1000+ video dataset
- Success Rate: >90% even with anti-bot measures
- Small Scale: 1-100 videos/channels - Perfect for research projects
- Medium Scale: 100-10,000 items - Suitable for market analysis
- Large Scale: 10,000+ items - Enterprise data collection
We welcome contributions! Please see our Contributing Guidelines for details.
```bash
# Fork and clone the repository
git clone https://github.com/Simple-Python-Scrapy-Scrapers/youtube-scrapy-scraper.git

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/
```
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for educational and research purposes only. Users are responsible for complying with YouTube's Terms of Service and applicable laws. The authors are not responsible for any misuse of this software.
Keywords: YouTube scraper, Scrapy, video data extraction, channel analytics, YouTube API alternative, content analysis, social media scraping, Python web scraping, data mining, YouTube research tool