Place Duplicate Removal

Overview

The Vibemap API automatically removes duplicate places from search results based on the combination of place name and address. This ensures that users don't see the same place multiple times in their search results.

How It Works

Duplicate Detection Logic

Duplicates are identified using a case-insensitive comparison of the place name and address:

  1. Name normalization: Convert to lowercase and trim whitespace
  2. Address normalization: Convert to lowercase and trim whitespace
  3. Unique key generation: Combine normalized name and address with a separator (name|address)
  4. Deduplication: Keep only the first occurrence of each unique name+address combination
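
The four steps above can be sketched as a single pass over the results. The function name mirrors `remove_place_duplicates()` from the mixin, but the dict-based result shape here is an assumption for illustration:

```python
def remove_place_duplicates(results):
    """Keep only the first occurrence of each name+address combination.

    Sketch of the steps above; the real method lives on
    GeoJSONResultMixin, and the dict shape here is illustrative.
    """
    seen = set()
    deduped = []
    for place in results:
        name = place.get("name")
        address = place.get("address")
        # Places missing either field skip dedup entirely (see Edge Cases).
        if not name or not address:
            deduped.append(place)
            continue
        # Steps 1-3: normalize name and address, then build the unique key.
        key = f"{name.strip().lower()}|{address.strip().lower()}"
        # Step 4: keep only the first occurrence of each key.
        if key not in seen:
            seen.add(key)
            deduped.append(place)
    return deduped
```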

Examples

Exact Duplicates

# These are considered duplicates:
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="Coffee Shop", address="123 Main St"

# Result: Only Place 1 is kept

Case-Insensitive Matching

# These are considered duplicates:
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="COFFEE SHOP", address="123 MAIN ST"

# Result: Only Place 1 is kept

Whitespace Handling

# These are considered duplicates:
Place 1: name=" Coffee Shop ", address=" 123 Main St "
Place 2: name="Coffee Shop", address="123 Main St"

# Result: Only Place 1 is kept

Different Places with Same Name (NOT duplicates)

# These are NOT duplicates (different addresses):
Place 1: name="Starbucks", address="123 Main St"
Place 2: name="Starbucks", address="456 Elm St"

# Result: Both are kept

Different Places at Same Address (NOT duplicates)

# These are NOT duplicates (different names):
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="Bookstore", address="123 Main St"

# Result: Both are kept (e.g., different businesses in same building)
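
All five examples above reduce to equality (or inequality) of the normalized key. A quick self-check, with `dedup_key()` as an illustrative helper name rather than the mixin's API:

```python
def dedup_key(name, address):
    # Normalized "name|address" key, as described in How It Works.
    return f"{name.strip().lower()}|{address.strip().lower()}"

# Case and surrounding whitespace do not matter:
assert dedup_key("COFFEE SHOP", "123 MAIN ST") == dedup_key(" Coffee Shop ", " 123 Main St ")

# Same name at different addresses, or different names at the same
# address, produce distinct keys, so both places are kept:
assert dedup_key("Starbucks", "123 Main St") != dedup_key("Starbucks", "456 Elm St")
assert dedup_key("Coffee Shop", "123 Main St") != dedup_key("Bookstore", "123 Main St")
```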

Implementation Details

Location in Codebase

The duplicate removal is implemented in:

  • File: search_indexes/viewsets/mixins.py
  • Class: GeoJSONResultMixin
  • Method: list() -> remove_place_duplicates()

When It Runs

The duplicate removal runs:

  • Endpoint: Place search endpoints (any endpoint using GeoJSONResultMixin with is_places=True)
  • Timing: After search/filtering but before GeoJSON conversion
  • Frequency: On every place search request

Logging

When duplicates are found, they are logged using loguru:

logger.info(f"Found {len(duplicates_found)} duplicate place(s) based on name+address combination")
logger.info(
    f"Duplicate removed - ID: {dup['id']}, "
    f"Name: '{dup['name']}', "
    f"Address: '{dup['address']}', "
    f"Duplicate of ID: {dup['duplicate_of_id']}"
)

Log Level: INFO
Log Location: Configured via Django/loguru settings

Edge Cases

Places with Missing Data

Places with missing name or address are preserved (not removed):

# These are NOT checked for duplicates:
Place 1: name=None, address="123 Main St"
Place 2: name="Coffee Shop", address=None
Place 3: name="", address="456 Elm St"

# Result: All are kept

Rationale: Without both name and address, we cannot reliably determine if something is a duplicate. These entries may need manual cleanup or enrichment.
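
A guard like the following expresses that rule (`is_checkable()` is an illustrative name, not the mixin's API):

```python
def is_checkable(place):
    # Only places with a non-empty name AND a non-empty address
    # participate in duplicate detection; everything else passes through.
    return bool(place.get("name")) and bool(place.get("address"))

places = [
    {"name": None, "address": "123 Main St"},
    {"name": "Coffee Shop", "address": None},
    {"name": "", "address": "456 Elm St"},
]
# None of these enter the dedup check, so all three are kept.
assert not any(is_checkable(p) for p in places)
```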

Multiple Duplicates of Same Place

# Input:
Place 1: name="Cafe", address="123 Main St"
Place 2: name="Cafe", address="123 Main St"
Place 3: name="Cafe", address="123 Main St"
Place 4: name="Cafe", address="123 Main St"

# Result: Only Place 1 is kept
# Logs show 3 duplicates removed, all pointing to Place 1

Testing

Running Tests

Standalone test file:

python test_duplicate_removal.py

Django test suite:

python manage.py test search_indexes.tests.TestRemovePlaceDuplicates

Test Coverage

The implementation includes tests for:

  • ✓ Exact duplicates
  • ✓ Case-insensitive matching
  • ✓ Whitespace handling
  • ✓ Multiple duplicates of same place
  • ✓ Different places with same name
  • ✓ Different places at same address
  • ✓ Missing name or address
  • ✓ Complex real-world scenarios
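
As a reference for the shape such a test takes, here is a hypothetical case for case-insensitive matching. The `_dedupe()` helper is a self-contained stand-in for the mixin's `remove_place_duplicates()`; the real suite lives in `search_indexes.tests.TestRemovePlaceDuplicates`:

```python
import unittest

def _dedupe(places):
    # Minimal stand-in for remove_place_duplicates(), included so this
    # sketch runs on its own.
    seen, kept = set(), []
    for p in places:
        if not p.get("name") or not p.get("address"):
            kept.append(p)
            continue
        key = f"{p['name'].strip().lower()}|{p['address'].strip().lower()}"
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

class TestRemovePlaceDuplicatesSketch(unittest.TestCase):
    def test_case_insensitive_match(self):
        places = [
            {"name": "Coffee Shop", "address": "123 Main St"},
            {"name": "COFFEE SHOP", "address": "123 MAIN ST"},
        ]
        result = _dedupe(places)
        # Only the first occurrence survives.
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0]["name"], "Coffee Shop")
```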

Performance Considerations

Time Complexity

  • Time: O(n) - a single pass, whether or not duplicates are present
  • Space: O(n) - stores unique keys in a dictionary

Impact on Response Time

  • Minimal: Single pass through results list
  • Typical overhead: < 10ms for 100 places
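
The overhead claim is easy to sanity-check locally. This micro-benchmark sketch uses a self-contained stand-in for the dedup pass on 100 synthetic places (names and counts are illustrative):

```python
import time

def dedupe(places):
    # Stand-in for the mixin's single-pass dedup.
    seen, kept = set(), []
    for p in places:
        key = f"{p['name'].strip().lower()}|{p['address'].strip().lower()}"
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

# 100 places, every other one a duplicate.
places = [{"name": f"Place {i % 50}", "address": f"{i % 50} Main St"} for i in range(100)]

start = time.perf_counter()
deduped = dedupe(places)
elapsed_ms = (time.perf_counter() - start) * 1000
```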

Monitoring

To monitor duplicate removal:

  1. Check logs for duplicate removal messages
  2. Count frequency of duplicates by analyzing logs
  3. Identify patterns in duplicate sources (specific data sources, import processes, etc.)

Example Log Query

grep "duplicate place(s) based on name+address" /path/to/logs | wc -l

Future Improvements

Potential enhancements:

  1. Fuzzy matching: Use string similarity (e.g., Levenshtein distance) for near-duplicates
  2. Address normalization: Standardize addresses before comparison (e.g., "St" vs "Street")
  3. Merge data: Combine data from duplicates instead of just removing them
  4. Duplicate prevention: Add unique constraints at database level
  5. Admin interface: Provide UI for reviewing and managing duplicates
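
The first enhancement could be prototyped with the standard library alone. This sketch uses `difflib.SequenceMatcher` rather than Levenshtein distance to avoid a dependency; the 0.8 threshold and dict shape are illustrative choices, not part of the current implementation:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.8):
    """Fuzzy-match two places on name and address similarity.

    Sketch of the fuzzy-matching enhancement above; threshold is an
    arbitrary starting point that would need tuning on real data.
    """
    def norm(s):
        # Lowercase and collapse internal whitespace before comparing.
        return " ".join(s.lower().split())

    name_sim = SequenceMatcher(None, norm(a["name"]), norm(b["name"])).ratio()
    addr_sim = SequenceMatcher(None, norm(a["address"]), norm(b["address"])).ratio()
    return name_sim >= threshold and addr_sim >= threshold
```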