# Place Duplicate Removal

## Overview

The Vibemap API automatically removes duplicate places from search results based on the combination of place name and address. This ensures that users don't see the same place multiple times in their search results.
## How It Works

### Duplicate Detection Logic

Duplicates are identified using a case-insensitive comparison of the place name and address:

- Name normalization: Convert to lowercase and trim whitespace
- Address normalization: Convert to lowercase and trim whitespace
- Unique key generation: Combine the normalized name and address with a separator (`name|address`)
- Deduplication: Keep only the first occurrence of each unique name+address combination (see the sketch below)
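A minimal standalone sketch of this logic in Python. The function name matches `remove_place_duplicates()` described under Implementation Details; the list-of-dicts input shape is an assumption for illustration:

```python
def remove_place_duplicates(places):
    """Keep the first occurrence of each normalized name+address pair."""
    seen = set()
    result = []
    for place in places:
        name = place.get("name")
        address = place.get("address")
        # Places missing a name or address are never treated as
        # duplicates (see "Edge Cases" below); they are always kept.
        if not name or not address:
            result.append(place)
            continue
        key = f"{name.strip().lower()}|{address.strip().lower()}"
        if key not in seen:
            seen.add(key)
            result.append(place)
    return result
```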
### Examples

#### Exact Duplicates

```text
# These are considered duplicates:
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="Coffee Shop", address="123 Main St"

# Result: Only Place 1 is kept
```

#### Case-Insensitive Matching

```text
# These are considered duplicates:
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="COFFEE SHOP", address="123 MAIN ST"

# Result: Only Place 1 is kept
```

#### Whitespace Handling

```text
# These are considered duplicates:
Place 1: name=" Coffee Shop ", address=" 123 Main St "
Place 2: name="Coffee Shop", address="123 Main St"

# Result: Only Place 1 is kept
```

#### Different Places with Same Name (NOT Duplicates)

```text
# These are NOT duplicates (different addresses):
Place 1: name="Starbucks", address="123 Main St"
Place 2: name="Starbucks", address="456 Elm St"

# Result: Both are kept
```

#### Different Places at Same Address (NOT Duplicates)

```text
# These are NOT duplicates (different names):
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="Bookstore", address="123 Main St"

# Result: Both are kept (e.g., different businesses in the same building)
```
## Implementation Details

### Location in Codebase

The duplicate removal is implemented in:

- File: `search_indexes/viewsets/mixins.py`
- Class: `GeoJSONResultMixin`
- Method: `list()` -> `remove_place_duplicates()`
### When It Runs

The duplicate removal runs:

- Endpoint: Place search endpoints (any endpoint using `GeoJSONResultMixin` with `is_places=True`)
- Timing: After search/filtering but before GeoJSON conversion (sketched below)
- Frequency: On every place search request
### Logging

When duplicates are found, they are logged using loguru:

```python
logger.info(f"Found {len(duplicates_found)} duplicate place(s) based on name+address combination")
logger.info(
    f"Duplicate removed - ID: {dup['id']}, "
    f"Name: '{dup['name']}', "
    f"Address: '{dup['address']}', "
    f"Duplicate of ID: {dup['duplicate_of_id']}"
)
```

- Log Level: INFO
- Log Location: Configured via Django/loguru settings
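Given the f-strings above, removing a single duplicate produces log lines like the following (IDs and values are illustrative):

```text
Found 1 duplicate place(s) based on name+address combination
Duplicate removed - ID: 2, Name: 'Coffee Shop', Address: '123 Main St', Duplicate of ID: 1
```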
## Edge Cases

### Places with Missing Data

Places with a missing name or address are preserved (not removed):

```text
# These are NOT checked for duplicates:
Place 1: name=None, address="123 Main St"
Place 2: name="Coffee Shop", address=None
Place 3: name="", address="456 Elm St"

# Result: All are kept
```

Rationale: Without both name and address, we cannot reliably determine if something is a duplicate. These entries may need manual cleanup or enrichment.
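Using the standalone sketch from the Duplicate Detection Logic section, this behavior looks like:

```python
places = [
    {"id": 1, "name": None, "address": "123 Main St"},
    {"id": 2, "name": "Coffee Shop", "address": None},
    {"id": 3, "name": "", "address": "456 Elm St"},
]
# Every entry is missing a usable name or address, so all three survive.
assert remove_place_duplicates(places) == places
```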
### Multiple Duplicates of the Same Place

```text
# Input:
Place 1: name="Cafe", address="123 Main St"
Place 2: name="Cafe", address="123 Main St"
Place 3: name="Cafe", address="123 Main St"
Place 4: name="Cafe", address="123 Main St"

# Result: Only Place 1 is kept
# Logs show 3 duplicates removed, all pointing to Place 1
```
## Testing

### Running Tests

Standalone test file:

```bash
python test_duplicate_removal.py
```

Django test suite:

```bash
python manage.py test search_indexes.tests.TestRemovePlaceDuplicates
```
### Test Coverage

The implementation includes tests for the following scenarios (an illustrative test shape follows the list):
- ✓ Exact duplicates
- ✓ Case-insensitive matching
- ✓ Whitespace handling
- ✓ Multiple duplicates of same place
- ✓ Different places with same name
- ✓ Different places at same address
- ✓ Missing name or address
- ✓ Complex real-world scenarios
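For illustration, one such test might look like the sketch below, reusing the standalone `remove_place_duplicates()` sketch from above; the real assertions live in `search_indexes/tests.py`:

```python
import unittest

class TestRemovePlaceDuplicatesSketch(unittest.TestCase):
    def test_case_insensitive_match(self):
        places = [
            {"id": 1, "name": "Coffee Shop", "address": "123 Main St"},
            {"id": 2, "name": "COFFEE SHOP", "address": "123 MAIN ST"},
        ]
        deduped = remove_place_duplicates(places)
        # Only the first occurrence survives.
        self.assertEqual([p["id"] for p in deduped], [1])
```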
## Performance Considerations

### Time Complexity

- Time: O(n) in every case - deduplication is a single pass with O(1) key lookups, whether no entries or all entries are duplicates
- Space: O(n) - stores unique keys in a dictionary

### Impact on Response Time

- Minimal: Single pass through the results list
- Typical overhead: < 10ms for 100 places (see the micro-benchmark sketch below)
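A quick way to sanity-check that figure with the standalone sketch (numbers vary by machine; a single set-based pass over 100 places typically measures in microseconds):

```python
import timeit

places = [
    {"id": i, "name": f"Cafe {i % 50}", "address": "123 Main St"}
    for i in range(100)  # 100 places, half of them duplicates
]
# Average seconds per call over 1,000 runs, reported in milliseconds.
per_call = timeit.timeit(lambda: remove_place_duplicates(places), number=1000) / 1000
print(f"{per_call * 1000:.3f} ms per call")
```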
## Monitoring

To monitor duplicate removal:

- Check logs for duplicate removal messages
- Count the frequency of duplicates by analyzing logs
- Identify patterns in duplicate sources (specific data sources, import processes, etc.)

### Example Log Query

```bash
grep "duplicate place(s) based on name+address" /path/to/logs | wc -l
```
## Related Files

- Implementation: `search_indexes/viewsets/mixins.py:405-461`
- Tests: `search_indexes/tests.py`
- Standalone test: `test_duplicate_removal.py`
## Future Improvements

Potential enhancements:

- Fuzzy matching: Use string similarity (e.g., Levenshtein distance) to catch near-duplicates (see the sketch after this list)
- Address normalization: Standardize addresses before comparison (e.g., "St" vs "Street")
- Merge data: Combine data from duplicates instead of just removing them
- Duplicate prevention: Add unique constraints at the database level
- Admin interface: Provide a UI for reviewing and managing duplicates
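None of this is implemented today; as one possible direction, a fuzzy pass could use the standard library's `difflib` rather than a dedicated Levenshtein package:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two normalized strings as near-duplicates when their
    similarity ratio meets the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# "123 main st" vs "123 main street" scores ~0.85, clearing a 0.8 threshold.
print(is_near_duplicate("123 main st", "123 main street"))
```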
## Related Documentation

- Merge Duplicates Command - database-level duplicate merging
- ETL Duplicate Detection - duplicate detection in the ETL pipeline