Unified Event Scraper Guide
Overview
The unified event scraper (event_pipeline_run.py) combines two powerful event discovery methods into a single intelligent command:
- DataForSEO API - Google event search results via API
- Direct Link Scraping - Event calendar pages on venue websites
Key Features
- ✅ Intelligent Routing - Automatically chooses the best scraping method for each venue
- ✅ Dual Coverage - Can run both methods on the same venue for maximum discovery
- ✅ Deduplication - Automatically merges and deduplicates events from both sources
- ✅ Parallel Processing - Concurrent execution for faster results
- ✅ Unified Reporting - Single comprehensive report from both sources
- ✅ Flexible Filtering - Filter by location, category, venue type, etc.
Quick Start
# Run both methods (recommended for maximum coverage)
python manage.py event_pipeline_run
# Test with a small batch
python manage.py event_pipeline_run --num_per_batch 3 --debug
# Focus on a specific city
python manage.py event_pipeline_run --address "San Francisco, CA"
# Only use API method (faster, but may miss calendar-only events)
python manage.py event_pipeline_run --strategy api_only
# Only use direct scraping (slower, but gets calendar events)
python manage.py event_pipeline_run --strategy links_only
Command-Line Arguments
Strategy Selection
| Argument | Options | Default | Description |
|---|---|---|---|
| --strategy | both, api_only, links_only | both | Choose scraping method(s) |
Strategy Guide:
- both - Maximum coverage, tries both methods (recommended)
- api_only - Faster, uses Google event search via DataForSEO
- links_only - Slower, scrapes event pages directly
Venue Filtering
| Argument | Type | Example | Description |
|---|---|---|---|
| --venue_search_term | string | "concert hall" | Filter venues by search term |
| --editorial_category | string | "events_feed" | Filter by editorial category |
| --address | string | "Brooklyn, NY" | Filter by location/address |
| --boundary | string | "POLYGON((...))" | Filter by geographic boundary |
Performance Settings
| Argument | Type | Default | Description |
|---|---|---|---|
| --num_per_batch | integer | 10 | Number of venues to process |
| --max_workers | integer | 5 | Concurrent workers for parallel processing |
| --parallel | flag | True | Enable parallel processing |
| --no-parallel | flag | - | Disable parallel processing |
Debug and Testing
| Argument | Type | Description |
|---|---|---|
| --debug | flag | Enable verbose logging |
| --dry_run | flag | Run without saving data (testing) |
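Under the hood, a Django management command declares flags like these in add_arguments(), which receives a standard argparse parser. The standalone sketch below mirrors the tables above so the defaults are visible in one place; it is an illustration of the documented interface, not the actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare event_pipeline_run's documented flags (sketch, not the real command)."""
    parser = argparse.ArgumentParser(prog="event_pipeline_run")
    # Strategy selection
    parser.add_argument("--strategy", choices=["both", "api_only", "links_only"],
                        default="both", help="Choose scraping method(s)")
    # Venue filtering
    parser.add_argument("--venue_search_term", type=str)
    parser.add_argument("--editorial_category", type=str)
    parser.add_argument("--address", type=str)
    parser.add_argument("--boundary", type=str)
    # Performance settings
    parser.add_argument("--num_per_batch", type=int, default=10)
    parser.add_argument("--max_workers", type=int, default=5)
    parser.add_argument("--parallel", dest="parallel", action="store_true", default=True)
    parser.add_argument("--no-parallel", dest="parallel", action="store_false")
    # Debug and testing
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--dry_run", action="store_true")
    return parser
```

Note how --parallel and --no-parallel share one destination, so either flag simply flips the same boolean.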
How It Works
1. Venue Discovery & Classification
The scraper queries the database for venues based on your filters, then classifies each venue:
🔍 Venue Classification Logic:
├── Has event link + Good Google presence → Both methods
├── Has event link only → Direct scraping
├── Good Google presence only → API method
└── Neither → API method (fallback)
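The decision tree above can be sketched as a small routing function. The field names (event_link, website_url) are assumptions about the venue data, standing in for whatever signals the real classifier uses:

```python
def classify_venue(venue: dict) -> str:
    """Route a venue to a scraping method, following the decision tree above."""
    has_event_link = bool(venue.get("event_link"))
    good_google_presence = bool(venue.get("website_url"))
    if has_event_link and good_google_presence:
        return "both"
    if has_event_link:
        return "links_only"
    # Good Google presence alone, or neither: the API method is the fallback
    return "api_only"
```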
2. Intelligent Method Selection
DataForSEO API Method is used when:
- Venue has good Google presence (website URL)
- Venue shows up in Google event searches
- Fast results needed
Direct Link Scraping is used when:
- Venue has event calendar/listing pages
- Events are on the venue's own website
- ICS feeds available
Both Methods when:
- Venue qualifies for both
- Maximum coverage requested (--strategy both)
3. Event Processing Pipeline
For each venue:
├── API Method:
│ ├── Search Google events via DataForSEO
│ ├── Extract event data from API results
│ ├── Enrich with additional details
│ └── Create/update database records
│
└── Link Method:
├── Check for ICS calendar feeds
├── Extract event links from pages
├── Scrape individual event pages
└── Create/update database records
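The first step of the link method, checking for ICS calendar feeds, can be illustrated with a minimal stdlib-only parser. A production pipeline would likely use a full iCalendar library (the format also has line folding and escaping this sketch ignores); it handles only simple one-line properties such as SUMMARY and DTSTART:

```python
def parse_ics_events(ics_text: str) -> list[dict]:
    """Extract VEVENT blocks from an ICS feed as property dicts (minimal sketch)."""
    events, current = [], None
    for line in ics_text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            # "DTSTART;TZID=...:value" -> key "DTSTART", value after the first colon
            key, _, value = line.partition(":")
            current[key.split(";")[0].upper()] = value
    return events
```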
4. Deduplication & Merging
Events are deduplicated based on:
- URL matching (exact match)
- Name + Venue + Date matching
- Data source identifiers
When duplicates are found, priority is given to the event with more complete data.
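The rules above can be sketched as follows. The field names (url, name, venue, date) are assumptions about the merged event dicts, and "more complete data" is approximated here by counting populated fields:

```python
def dedupe_events(events: list[dict]) -> list[dict]:
    """Merge events from both sources: exact URL match first, then (name, venue, date)."""
    seen_urls, seen_keys, unique = set(), set(), []
    # Visit richer records first so the event with more populated fields wins.
    for ev in sorted(events, key=lambda e: -sum(1 for v in e.values() if v)):
        url = (ev.get("url") or "").strip().lower()
        key = (
            (ev.get("name") or "").strip().lower(),
            (ev.get("venue") or "").strip().lower(),
            ev.get("date"),
        )
        if (url and url in seen_urls) or key in seen_keys:
            continue  # duplicate of an already-kept event
        if url:
            seen_urls.add(url)
        seen_keys.add(key)
        unique.append(ev)
    return unique
```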
Usage Examples
Production Use Cases
Daily event sync for a city:
python manage.py event_pipeline_run \
--address "Austin, TX" \
--num_per_batch 50 \
--strategy both
Quick update for music venues:
python manage.py event_pipeline_run \
--venue_search_term "concert" \
--editorial_category "events_feed" \
--num_per_batch 20 \
--strategy api_only
Comprehensive scrape with calendar focus:
python manage.py event_pipeline_run \
--strategy links_only \
--num_per_batch 30 \
--max_workers 10
Testing & Development
Test with debug logging:
python manage.py event_pipeline_run \
--num_per_batch 2 \
--debug \
--no-parallel
Dry run (no database writes):
python manage.py event_pipeline_run \
--dry_run \
--num_per_batch 5 \
--debug
Performance Considerations
Speed Optimization
- API method is generally 2-3x faster than link scraping
- Parallel processing provides 3-5x speedup with --max_workers 10
- Batch size affects memory usage (keep under 100 for stability)
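The parallel speedup comes from fanning venues out to a worker pool. A minimal sketch of how --max_workers and --no-parallel might drive this, where scrape_venue is a hypothetical stand-in for the per-venue pipeline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(venues, scrape_venue, max_workers=5, parallel=True):
    """Scrape a batch of venues, concurrently unless --no-parallel was given."""
    if not parallel:
        # --no-parallel: plain sequential loop, easiest to debug
        return [scrape_venue(v) for v in venues]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_venue, v) for v in venues]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Threads suit this workload because scraping is I/O-bound; completion order is nondeterministic, which is why the unified report aggregates results rather than relying on ordering.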
Resource Usage
| Strategy | Speed | API Calls | Web Requests | Coverage |
|---|---|---|---|---|
| api_only | Fast | High | Low | Good |
| links_only | Slow | None | High | Excellent |
| both | Medium | High | High | Maximum |
Recommended Settings
For speed:
--strategy api_only --num_per_batch 50 --max_workers 10
For coverage:
--strategy both --num_per_batch 30 --max_workers 5
For calendar events:
--strategy links_only --num_per_batch 20 --max_workers 3
Output & Reporting
Console Output
🚀 UNIFIED EVENT SCRAPING STARTED
Strategy: both
Batch size: 10
🔍 Discovering venues for scraping...
Found 47 venues matching criteria
📊 Venue classification:
- API suitable: 12
- Link scraping suitable: 18
- Both methods: 17
🌐 Starting API-based scraping for 29 venues...
✅ Found 15 API results for Blue Note Jazz Club
...
🔗 Starting link-based scraping for 35 venues...
✅ Found 8 events from ICS feed
✅ Found 23 event links
...
🔄 Merging and deduplicating events...
✅ Deduplication complete:
- Total events from both sources: 156
- Unique events: 142
- Duplicates removed: 14
📊 SCRAPING SUMMARY
Total venues processed: 47
- Via API: 29
- Via links: 35
Total unique events found: 142
- From API: 87
- From links: 69
- Duplicates removed: 14
Email Report
An HTML email report is automatically sent with:
- Summary statistics
- List of new events created
- Source attribution (API vs. links)
- Venue processing details
- Any warnings or errors
Troubleshooting
Common Issues
No events found:
# Check if venues have event links
# Try increasing batch size
--num_per_batch 50
API errors:
# Check DataForSEO credentials
# Try links-only strategy
--strategy links_only
Slow performance:
# Increase workers
--max_workers 10
# Or use API-only for speed
--strategy api_only
Memory issues:
# Reduce batch size
--num_per_batch 10
# Disable parallel processing
--no-parallel
Comparison with Individual Commands
| Feature | event_pipeline_run | scrape_data_for_seo | event_pipeline_scrape (legacy) |
|---|---|---|---|
| API scraping | ✅ | ✅ | ❌ |
| Link scraping | ✅ | ❌ | ✅ |
| Deduplication | ✅ Both sources | Single source | Single source |
| Venue routing | ✅ Intelligent | Manual | Manual |
| Combined report | ✅ | ❌ | ❌ |
| Flexibility | ✅ High | Medium | Medium |
Migration Guide
From scrape_data_for_seo.py
Replace:
python manage.py scrape_data_for_seo --search_type events --num_per_batch 12
With:
python manage.py event_pipeline_run --strategy api_only --num_per_batch 12
From event_pipeline_scrape.py (legacy)
Replace:
python manage.py event_pipeline_scrape --search-venues --max-workers 5
With:
python manage.py event_pipeline_run --strategy links_only --max_workers 5
Using Both (Recommended)
Instead of running both commands separately:
# Old way
python manage.py scrape_data_for_seo --search_type events
python manage.py event_pipeline_scrape --search-venues
# New way - single command with deduplication
python manage.py event_pipeline_run --strategy both
Cron Job Setup
# Run daily at 3 AM for comprehensive event discovery
0 3 * * * cd /home/vibemap/Vibemap-Analysis-dev && python manage.py event_pipeline_run --strategy both --num_per_batch 100
# Run every 6 hours for quick API updates
0 */6 * * * cd /home/vibemap/Vibemap-Analysis-dev && python manage.py event_pipeline_run --strategy api_only --num_per_batch 50
Advanced Configuration
Custom Venue Selection
You can combine filters for precise targeting:
# Music venues in NYC with event calendars
python manage.py event_pipeline_run \
--venue_search_term "music" \
--address "New York, NY" \
--editorial_category "events_feed" \
--strategy both
Performance Tuning
For maximum throughput on a powerful server:
python manage.py event_pipeline_run \
--num_per_batch 200 \
--max_workers 20 \
--strategy api_only
For careful, thorough scraping:
python manage.py event_pipeline_run \
--num_per_batch 10 \
--max_workers 2 \
--strategy both \
--no-parallel \
--debug
Future Enhancements
Planned improvements:
- Machine learning for optimal method selection
- Adaptive rate limiting
- Real-time progress dashboard
- Event quality scoring
- Automatic retry logic for failed venues
- Venue success rate tracking
Support
For issues or questions:
- Check logs in /var/log/vibemap/
- Review Sentry error reports
- Run with the --debug flag for detailed output
- Check venue data quality in Django admin