Unified Event Scraper Guide

Overview

The unified event scraper (event_pipeline_run.py) combines two powerful event discovery methods into a single intelligent command:

  1. DataForSEO API - Google event search results via API
  2. Direct Link Scraping - Event calendar pages on venue websites

Key Features

  • ✅ Intelligent Routing - Automatically chooses the best scraping method for each venue
  • ✅ Dual Coverage - Can run both methods on the same venue for maximum discovery
  • ✅ Deduplication - Automatically merges and deduplicates events from both sources
  • ✅ Parallel Processing - Concurrent execution for faster results
  • ✅ Unified Reporting - Single comprehensive report from both sources
  • ✅ Flexible Filtering - Filter by location, category, venue type, etc.

Quick Start

# Run both methods (recommended for maximum coverage)
python manage.py event_pipeline_run

# Test with a small batch
python manage.py event_pipeline_run --num_per_batch 3 --debug

# Focus on a specific city
python manage.py event_pipeline_run --address "San Francisco, CA"

# Only use API method (faster, but may miss calendar-only events)
python manage.py event_pipeline_run --strategy api_only

# Only use direct scraping (slower, but gets calendar events)
python manage.py event_pipeline_run --strategy links_only

Command-Line Arguments

Strategy Selection

| Argument   | Options                    | Default | Description               |
|------------|----------------------------|---------|---------------------------|
| --strategy | both, api_only, links_only | both    | Choose scraping method(s) |

Strategy Guide:

  • both - Maximum coverage, tries both methods (recommended)
  • api_only - Faster, uses Google event search via DataForSEO
  • links_only - Slower, scrapes event pages directly

Venue Filtering

| Argument             | Type   | Example          | Description                   |
|----------------------|--------|------------------|-------------------------------|
| --venue_search_term  | string | "concert hall"   | Filter venues by search term  |
| --editorial_category | string | "events_feed"    | Filter by editorial category  |
| --address            | string | "Brooklyn, NY"   | Filter by location/address    |
| --boundary           | string | "POLYGON((...))" | Filter by geographic boundary |

Performance Settings

| Argument        | Type    | Default | Description                                |
|-----------------|---------|---------|--------------------------------------------|
| --num_per_batch | integer | 10      | Number of venues to process                |
| --max_workers   | integer | 5       | Concurrent workers for parallel processing |
| --parallel      | flag    | True    | Enable parallel processing                 |
| --no-parallel   | flag    | -       | Disable parallel processing                |
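The worker settings above suggest a thread-pool pattern. A minimal sketch of how --max_workers and --parallel/--no-parallel might drive concurrent venue processing (the function and venue names here are illustrative, not the command's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_venue(venue):
    # Placeholder for the per-venue API/link scraping step; returns events.
    return [f"{venue}-event-1", f"{venue}-event-2"]

def run_batch(venues, max_workers=5, parallel=True):
    """Process a batch of venues, concurrently unless --no-parallel is set."""
    if not parallel:
        return [event for venue in venues for event in scrape_venue(venue)]
    events = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_venue, venue): venue for venue in venues}
        for future in as_completed(futures):
            events.extend(future.result())
    return events
```

With this shape, --no-parallel simply falls through to a serial loop, which is why it is the recommended mode for debugging.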

Debug and Testing

| Argument  | Type | Description                       |
|-----------|------|-----------------------------------|
| --debug   | flag | Enable verbose logging            |
| --dry_run | flag | Run without saving data (testing) |

How It Works

1. Venue Discovery & Classification

The scraper queries the database for venues based on your filters, then classifies each venue:

🔍 Venue Classification Logic:
├── Has event link + Good Google presence → Both methods
├── Has event link only → Direct scraping
├── Good Google presence only → API method
└── Neither → API method (fallback)
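The routing tree above can be expressed as a small predicate. This is an illustrative reconstruction of the decision logic, not the command's actual code:

```python
def classify_venue(has_event_link: bool, good_google_presence: bool) -> set:
    """Return the set of scraping methods to apply to a venue,
    mirroring the classification tree above."""
    if has_event_link and good_google_presence:
        return {"api", "links"}   # Both methods
    if has_event_link:
        return {"links"}          # Direct scraping only
    # Good Google presence only, or neither: API is the fallback.
    return {"api"}
```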

2. Intelligent Method Selection

DataForSEO API Method is used when:

  • Venue has good Google presence (website URL)
  • Venue shows up in Google event searches
  • Fast results needed

Direct Link Scraping is used when:

  • Venue has event calendar/listing pages
  • Events are on the venue's own website
  • ICS feeds available

Both Methods when:

  • Venue qualifies for both
  • Maximum coverage requested (--strategy both)
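The ICS path mentioned above amounts to pulling VEVENT blocks out of a calendar feed. A deliberately minimal sketch of that step (production code would use a proper library such as icalendar; this parser ignores line folding and property parameters):

```python
def parse_ics_events(ics_text: str) -> list[dict]:
    """Extract SUMMARY and DTSTART from each VEVENT block of an ICS feed."""
    events, current = [], None
    for line in ics_text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in ("SUMMARY", "DTSTART"):
                current[key.lower()] = value
    return events
```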

3. Event Processing Pipeline

For each venue:
├── API Method:
│ ├── Search Google events via DataForSEO
│ ├── Extract event data from API results
│ ├── Enrich with additional details
│ └── Create/update database records

└── Link Method:
├── Check for ICS calendar feeds
├── Extract event links from pages
├── Scrape individual event pages
└── Create/update database records

4. Deduplication & Merging

Events are deduplicated based on:

  • URL matching (exact match)
  • Name + Venue + Date matching
  • Data source identifiers

Priority is given to events with more complete data.
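The deduplication rules above can be sketched as a keyed merge that keeps the most complete record per key. The field names are assumptions for illustration, not the pipeline's actual schema:

```python
def dedupe_events(events: list[dict]) -> list[dict]:
    """Merge events from both sources, preferring the record with
    more populated fields when keys collide."""
    best = {}
    for event in events:
        # Primary key: exact URL; fallback key: name + venue + date.
        key = event.get("url") or (event.get("name"), event.get("venue"), event.get("date"))
        completeness = sum(1 for value in event.values() if value)
        if key not in best or completeness > best[key][0]:
            best[key] = (completeness, event)
    return [event for _, event in best.values()]
```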

Usage Examples

Production Use Cases

Daily event sync for a city:

python manage.py event_pipeline_run \
--address "Austin, TX" \
--num_per_batch 50 \
--strategy both

Quick update for music venues:

python manage.py event_pipeline_run \
--venue_search_term "concert" \
--editorial_category "events_feed" \
--num_per_batch 20 \
--strategy api_only

Comprehensive scrape with calendar focus:

python manage.py event_pipeline_run \
--strategy links_only \
--num_per_batch 30 \
--max_workers 10

Testing & Development

Test with debug logging:

python manage.py event_pipeline_run \
--num_per_batch 2 \
--debug \
--no-parallel

Dry run (no database writes):

python manage.py event_pipeline_run \
--dry_run \
--num_per_batch 5 \
--debug

Performance Considerations

Speed Optimization

  • API method is generally 2-3x faster than link scraping
  • Parallel processing provides 3-5x speedup with --max_workers 10
  • Batch size affects memory usage (keep under 100 for stability)

Resource Usage

| Strategy   | Speed  | API Calls | Web Requests | Coverage  |
|------------|--------|-----------|--------------|-----------|
| api_only   | Fast   | High      | Low          | Good      |
| links_only | Slow   | None      | High         | Excellent |
| both       | Medium | High      | High         | Maximum   |

For speed:

--strategy api_only --num_per_batch 50 --max_workers 10

For coverage:

--strategy both --num_per_batch 30 --max_workers 5

For calendar events:

--strategy links_only --num_per_batch 20 --max_workers 3

Output & Reporting

Console Output

🚀 UNIFIED EVENT SCRAPING STARTED
Strategy: both
Batch size: 10

🔍 Discovering venues for scraping...
Found 47 venues matching criteria

📊 Venue classification:
- API suitable: 12
- Link scraping suitable: 18
- Both methods: 17

🌐 Starting API-based scraping for 29 venues...
✅ Found 15 API results for Blue Note Jazz Club
...

🔗 Starting link-based scraping for 35 venues...
✅ Found 8 events from ICS feed
✅ Found 23 event links
...

🔄 Merging and deduplicating events...
✅ Deduplication complete:
- Total events from both sources: 156
- Unique events: 142
- Duplicates removed: 14

📊 SCRAPING SUMMARY
Total venues processed: 47
- Via API: 29
- Via links: 35
Total unique events found: 142
- From API: 87
- From links: 69
- Duplicates removed: 14

Email Report

An HTML email report is automatically sent with:

  • Summary statistics
  • List of new events created
  • Source attribution (API vs. links)
  • Venue processing details
  • Any warnings or errors
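Assembling such a report is a matter of rendering the summary statistics into HTML and wrapping them in a MIME message. A minimal stdlib sketch (the layout and field names are illustrative; the real report template and sending mechanism may differ):

```python
from email.mime.text import MIMEText

def build_report_email(stats: dict) -> MIMEText:
    """Render summary statistics as a simple HTML table in a MIME message."""
    rows = "".join(
        f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in stats.items()
    )
    html = f"<h2>Event Scraping Summary</h2><table>{rows}</table>"
    message = MIMEText(html, "html")
    message["Subject"] = "Unified Event Scraper Report"
    return message
```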

Troubleshooting

Common Issues

No events found:

# Check if venues have event links
# Try increasing batch size
--num_per_batch 50

API errors:

# Check DataForSEO credentials
# Try links-only strategy
--strategy links_only

Slow performance:

# Increase workers
--max_workers 10

# Or use API-only for speed
--strategy api_only

Memory issues:

# Reduce batch size
--num_per_batch 10

# Disable parallel processing
--no-parallel

Comparison with Individual Commands

| Feature         | event_pipeline_run | scrape_data_for_seo | event_pipeline_scrape (legacy) |
|-----------------|--------------------|---------------------|--------------------------------|
| API scraping    | ✅                 | ✅                  | ❌                             |
| Link scraping   | ✅                 | ❌                  | ✅                             |
| Deduplication   | ✅ Both sources    | Single source       | Single source                  |
| Venue routing   | ✅ Intelligent     | Manual              | Manual                         |
| Combined report | ✅                 | ❌                  | ❌                             |
| Flexibility     | ✅ High            | Medium              | Medium                         |

Migration Guide

From scrape_data_for_seo.py

Replace:

python manage.py scrape_data_for_seo --search_type events --num_per_batch 12

With:

python manage.py event_pipeline_run --strategy api_only --num_per_batch 12

From event_pipeline_scrape.py (legacy)

Replace:

python manage.py event_pipeline_scrape --search-venues --max-workers 5

With:

python manage.py event_pipeline_run --strategy links_only --max_workers 5

Instead of running both commands separately:

# Old way
python manage.py scrape_data_for_seo --search_type events
python manage.py event_pipeline_scrape --search-venues

# New way - single command with deduplication
python manage.py event_pipeline_run --strategy both

Cron Job Setup

# Run daily at 3 AM for comprehensive event discovery
0 3 * * * cd /home/vibemap/Vibemap-Analysis-dev && python manage.py event_pipeline_run --strategy both --num_per_batch 100

# Run every 6 hours for quick API updates
0 */6 * * * cd /home/vibemap/Vibemap-Analysis-dev && python manage.py event_pipeline_run --strategy api_only --num_per_batch 50

Advanced Configuration

Custom Venue Selection

You can combine filters for precise targeting:

# Music venues in NYC with event calendars
python manage.py event_pipeline_run \
--venue_search_term "music" \
--address "New York, NY" \
--editorial_category "events_feed" \
--strategy both
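Internally, combining filters like this presumably reduces to building a set of ORM filter kwargs from the CLI flags. A sketch of that translation (the model field names are assumptions about the venue schema, not confirmed by this guide):

```python
def build_venue_filters(venue_search_term=None, editorial_category=None, address=None):
    """Translate CLI flags into ORM-style filter kwargs; only flags
    that were actually supplied contribute a filter."""
    filters = {}
    if venue_search_term:
        filters["name__icontains"] = venue_search_term
    if editorial_category:
        filters["editorial_category"] = editorial_category
    if address:
        filters["address__icontains"] = address
    return filters
```

The resulting dict would then be passed to something like `Venue.objects.filter(**filters)` in the actual command.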

Performance Tuning

For maximum throughput on a powerful server:

python manage.py event_pipeline_run \
--num_per_batch 200 \
--max_workers 20 \
--strategy api_only

For careful, thorough scraping:

python manage.py event_pipeline_run \
--num_per_batch 10 \
--max_workers 2 \
--strategy both \
--no-parallel \
--debug

Future Enhancements

Planned improvements:

  • Machine learning for optimal method selection
  • Adaptive rate limiting
  • Real-time progress dashboard
  • Event quality scoring
  • Automatic retry logic for failed venues
  • Venue success rate tracking

Support

For issues or questions:

  1. Check logs in /var/log/vibemap/
  2. Review Sentry error reports
  3. Run with --debug flag for detailed output
  4. Check venue data quality in Django admin