Unified Event Scraper Guide

Overview

The unified event scraper (event_pipeline_run.py) combines two powerful event discovery methods into a single intelligent command:

  1. DataForSEO API - Google event search results via API
  2. Direct Link Scraping - Event calendar pages on venue websites

Key Features

  • ✅ Intelligent Routing - Automatically chooses the best scraping method for each venue
  • ✅ Dual Coverage - Can run both methods on the same venue for maximum discovery
  • ✅ Deduplication - Automatically merges and deduplicates events from both sources
  • ✅ Parallel Processing - Concurrent execution for faster results
  • ✅ Unified Reporting - Single comprehensive report from both sources
  • ✅ Flexible Filtering - Filter by location, category, venue type, etc.

Quick Start

# Run both methods (recommended for maximum coverage)
python manage.py event_pipeline_run

# Test with a small batch
python manage.py event_pipeline_run --num_per_batch 3 --debug

# Focus on a specific city
python manage.py event_pipeline_run --address "San Francisco, CA"

# Only use API method (faster, but may miss calendar-only events)
python manage.py event_pipeline_run --strategy api_only

# Only use direct scraping (slower, but gets calendar events)
python manage.py event_pipeline_run --strategy links_only

Command-Line Arguments

Strategy Selection

| Argument   | Options                    | Default | Description               |
|------------|----------------------------|---------|---------------------------|
| --strategy | both, api_only, links_only | both    | Choose scraping method(s) |

Strategy Guide:

  • both - Maximum coverage, tries both methods (recommended)
  • api_only - Faster, uses Google event search via DataForSEO
  • links_only - Slower, scrapes event pages directly

Venue Filtering

| Argument             | Type   | Example          | Description                   |
|----------------------|--------|------------------|-------------------------------|
| --venue_search_term  | string | "concert hall"   | Filter venues by search term  |
| --editorial_category | string | "events_feed"    | Filter by editorial category  |
| --address            | string | "Brooklyn, NY"   | Filter by location/address    |
| --boundary           | string | "POLYGON((...))" | Filter by geographic boundary |

Performance Settings

| Argument        | Type    | Default | Description                                |
|-----------------|---------|---------|--------------------------------------------|
| --num_per_batch | integer | 10      | Number of venues to process                |
| --max_workers   | integer | 5       | Concurrent workers for parallel processing |
| --parallel      | flag    | True    | Enable parallel processing                 |
| --no-parallel   | flag    | -       | Disable parallel processing                |
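The worker settings above suggest a thread-pool pattern. A minimal sketch of how --max_workers and --parallel/--no-parallel might drive concurrent venue processing (the function and venue names here are illustrative, not the command's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_venue(venue):
    # Placeholder for the per-venue API/link scraping step; returns events.
    return [f"{venue}-event-1", f"{venue}-event-2"]

def run_batch(venues, max_workers=5, parallel=True):
    """Process a batch of venues, concurrently unless --no-parallel is set."""
    if not parallel:
        return [event for venue in venues for event in scrape_venue(venue)]
    events = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_venue, venue): venue for venue in venues}
        for future in as_completed(futures):
            events.extend(future.result())
    return events
```

With this shape, --no-parallel simply falls through to a serial loop, which is why it is the recommended mode for debugging.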

Debug and Testing

| Argument  | Type | Description                       |
|-----------|------|-----------------------------------|
| --debug   | flag | Enable verbose logging            |
| --dry_run | flag | Run without saving data (testing) |

How It Works

1. Venue Discovery & Classification

The scraper queries the database for venues based on your filters, then classifies each venue:

🔍 Venue Classification Logic:
├── Has event link + Good Google presence → Both methods
├── Has event link only → Direct scraping
├── Good Google presence only → API method
└── Neither → API method (fallback)
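The routing tree above can be expressed as a small predicate. This is an illustrative reconstruction of the decision logic, not the command's actual code:

```python
def classify_venue(has_event_link: bool, good_google_presence: bool) -> set:
    """Return the set of scraping methods to apply to a venue,
    mirroring the classification tree above."""
    if has_event_link and good_google_presence:
        return {"api", "links"}   # Both methods
    if has_event_link:
        return {"links"}          # Direct scraping only
    # Good Google presence only, or neither: API is the fallback.
    return {"api"}
```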

2. Intelligent Method Selection

DataForSEO API Method is used when:

  • Venue has good Google presence (website URL)
  • Venue shows up in Google event searches
  • Fast results needed

Direct Link Scraping is used when:

  • Venue has event calendar/listing pages
  • Events are on the venue's own website
  • ICS feeds available

Both Methods when:

  • Venue qualifies for both
  • Maximum coverage requested (--strategy both)
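The ICS path mentioned above amounts to pulling VEVENT blocks out of a calendar feed. A deliberately minimal sketch of that step (production code would use a proper library such as icalendar; this parser ignores line folding and property parameters):

```python
def parse_ics_events(ics_text: str) -> list[dict]:
    """Extract SUMMARY and DTSTART from each VEVENT block of an ICS feed."""
    events, current = [], None
    for line in ics_text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in ("SUMMARY", "DTSTART"):
                current[key.lower()] = value
    return events
```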

3. Event Processing Pipeline

For each venue:
├── API Method:
│ ├── Search Google events via DataForSEO
│ ├── Extract event data from API results
│ ├── Enrich with additional details
│ └── Create/update database records

└── Link Method:
├── Check for ICS calendar feeds
├── Extract event links from pages
├── Scrape individual event pages
└── Create/update database records

4. Deduplication & Merging

Events are deduplicated based on:

  • URL matching (exact match)
  • Name + Venue + Date matching
  • Data source identifiers

Priority is given to events with more complete data.
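The deduplication rules above can be sketched as a keyed merge that keeps the most complete record per key. The field names are assumptions for illustration, not the pipeline's actual schema:

```python
def dedupe_events(events: list[dict]) -> list[dict]:
    """Merge events from both sources, preferring the record with
    more populated fields when keys collide."""
    best = {}
    for event in events:
        # Primary key: exact URL; fallback key: name + venue + date.
        key = event.get("url") or (event.get("name"), event.get("venue"), event.get("date"))
        completeness = sum(1 for value in event.values() if value)
        if key not in best or completeness > best[key][0]:
            best[key] = (completeness, event)
    return [event for _, event in best.values()]
```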

Usage Examples

Production Use Cases

Daily event sync for a city:

python manage.py event_pipeline_run \
--address "Austin, TX" \
--num_per_batch 50 \
--strategy both

Quick update for music venues:

python manage.py event_pipeline_run \
--venue_search_term "concert" \
--editorial_category "events_feed" \
--num_per_batch 20 \
--strategy api_only

Comprehensive scrape with calendar focus:

python manage.py event_pipeline_run \
--strategy links_only \
--num_per_batch 30 \
--max_workers 10

Testing & Development

Test with debug logging:

python manage.py event_pipeline_run \
--num_per_batch 2 \
--debug \
--no-parallel

Dry run (no database writes):

python manage.py event_pipeline_run \
--dry_run \
--num_per_batch 5 \
--debug

Performance Considerations

Speed Optimization

  • API method is generally 2-3x faster than link scraping
  • Parallel processing provides 3-5x speedup with --max_workers 10
  • Batch size affects memory usage (keep under 100 for stability)

Resource Usage

| Strategy   | Speed  | API Calls | Web Requests | Coverage  |
|------------|--------|-----------|--------------|-----------|
| api_only   | Fast   | High      | Low          | Good      |
| links_only | Slow   | None      | High         | Excellent |
| both       | Medium | High      | High         | Maximum   |

For speed:

--strategy api_only --num_per_batch 50 --max_workers 10

For coverage:

--strategy both --num_per_batch 30 --max_workers 5

For calendar events:

--strategy links_only --num_per_batch 20 --max_workers 3

Output & Reporting

Console Output

🚀 UNIFIED EVENT SCRAPING STARTED
Strategy: both
Batch size: 10

🔍 Discovering venues for scraping...
Found 47 venues matching criteria

📊 Venue classification:
- API suitable: 12
- Link scraping suitable: 18
- Both methods: 17

🌐 Starting API-based scraping for 29 venues...
✅ Found 15 API results for Blue Note Jazz Club
...

🔗 Starting link-based scraping for 35 venues...
✅ Found 8 events from ICS feed
✅ Found 23 event links
...

🔄 Merging and deduplicating events...
✅ Deduplication complete:
- Total events from both sources: 156
- Unique events: 142
- Duplicates removed: 14

📊 SCRAPING SUMMARY
Total venues processed: 47
- Via API: 29
- Via links: 35
Total unique events found: 142
- From API: 87
- From links: 69
- Duplicates removed: 14

Email Report

An HTML email report is automatically sent with:

  • Summary statistics
  • List of new events created
  • Source attribution (API vs. links)
  • Venue processing details
  • Any warnings or errors
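Assembling such a report is a matter of rendering the summary statistics into HTML and wrapping them in a MIME message. A minimal stdlib sketch (the layout and field names are illustrative; the real report template and sending mechanism may differ):

```python
from email.mime.text import MIMEText

def build_report_email(stats: dict) -> MIMEText:
    """Render summary statistics as a simple HTML table in a MIME message."""
    rows = "".join(
        f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in stats.items()
    )
    html = f"<h2>Event Scraping Summary</h2><table>{rows}</table>"
    message = MIMEText(html, "html")
    message["Subject"] = "Unified Event Scraper Report"
    return message
```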

Troubleshooting

Common Issues

No events found:

# Check if venues have event links
# Try increasing batch size
--num_per_batch 50

API errors:

# Check DataForSEO credentials
# Try links-only strategy
--strategy links_only

Slow performance:

# Increase workers
--max_workers 10

# Or use API-only for speed
--strategy api_only

Memory issues:

# Reduce batch size
--num_per_batch 10

# Disable parallel processing
--no-parallel

Comparison with Individual Commands

| Feature         | event_pipeline_run | scrape_data_for_seo | event_pipeline_scrape (legacy) |
|-----------------|--------------------|---------------------|--------------------------------|
| API scraping    | ✅                 | ✅                  | ❌                             |
| Link scraping   | ✅                 | ❌                  | ✅                             |
| Deduplication   | ✅ Both sources    | Single source       | Single source                  |
| Venue routing   | ✅ Intelligent     | Manual              | Manual                         |
| Combined report | ✅                 | ❌                  | ❌                             |
| Flexibility     | ✅ High            | Medium              | Medium                         |

Migration Guide

From scrape_data_for_seo.py

Replace:

python manage.py scrape_data_for_seo --search_type events --num_per_batch 12

With:

python manage.py event_pipeline_run --strategy api_only --num_per_batch 12

From event_pipeline_scrape.py (legacy)

Replace:

python manage.py event_pipeline_scrape --search-venues --max-workers 5

With:

python manage.py event_pipeline_run --strategy links_only --max_workers 5

Instead of running both commands separately:

# Old way
python manage.py scrape_data_for_seo --search_type events
python manage.py event_pipeline_scrape --search-venues

# New way - single command with deduplication
python manage.py event_pipeline_run --strategy both

Cron Job Setup

# Run daily at 3 AM for comprehensive event discovery
0 3 * * * cd /home/vibemap/Vibemap-Analysis-dev && python manage.py event_pipeline_run --strategy both --num_per_batch 100

# Run every 6 hours for quick API updates
0 */6 * * * cd /home/vibemap/Vibemap-Analysis-dev && python manage.py event_pipeline_run --strategy api_only --num_per_batch 50

Advanced Configuration

Custom Venue Selection

You can combine filters for precise targeting:

# Music venues in NYC with event calendars
python manage.py event_pipeline_run \
--venue_search_term "music" \
--address "New York, NY" \
--editorial_category "events_feed" \
--strategy both
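Internally, combining filters like this presumably reduces to building a set of ORM filter kwargs from the CLI flags. A sketch of that translation (the model field names are assumptions about the venue schema, not confirmed by this guide):

```python
def build_venue_filters(venue_search_term=None, editorial_category=None, address=None):
    """Translate CLI flags into ORM-style filter kwargs; only flags
    that were actually supplied contribute a filter."""
    filters = {}
    if venue_search_term:
        filters["name__icontains"] = venue_search_term
    if editorial_category:
        filters["editorial_category"] = editorial_category
    if address:
        filters["address__icontains"] = address
    return filters
```

The resulting dict would then be passed to something like `Venue.objects.filter(**filters)` in the actual command.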

Performance Tuning

For maximum throughput on a powerful server:

python manage.py event_pipeline_run \
--num_per_batch 200 \
--max_workers 20 \
--strategy api_only

For careful, thorough scraping:

python manage.py event_pipeline_run \
--num_per_batch 10 \
--max_workers 2 \
--strategy both \
--no-parallel \
--debug

Future Enhancements

Planned improvements:

  • Machine learning for optimal method selection
  • Adaptive rate limiting
  • Real-time progress dashboard
  • Event quality scoring
  • Automatic retry logic for failed venues
  • Venue success rate tracking

Support

For issues or questions:

  1. Check logs in /var/log/vibemap/
  2. Review Sentry error reports
  3. Run with --debug flag for detailed output
  4. Check venue data quality in Django admin