Place Duplicate Removal

Overview

The Vibemap API automatically removes duplicate places from search results based on the combination of place name and address. This ensures that users don't see the same place multiple times in their search results.

How It Works

Duplicate Detection Logic

Duplicates are identified using a case-insensitive comparison of the place name and address:

  1. Name normalization: Convert to lowercase and trim whitespace
  2. Address normalization: Convert to lowercase and trim whitespace
  3. Unique key generation: Combine normalized name and address with a separator (name|address)
  4. Deduplication: Keep only the first occurrence of each unique name+address combination
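
The four steps above can be sketched as a single pass over the results. The function name mirrors `remove_place_duplicates()` from the mixin, but the dict-based result shape here is an assumption for illustration:

```python
def remove_place_duplicates(results):
    """Keep only the first occurrence of each name+address combination.

    Sketch of the steps above; the real method lives on
    GeoJSONResultMixin, and the dict shape here is illustrative.
    """
    seen = set()
    deduped = []
    for place in results:
        name = place.get("name")
        address = place.get("address")
        # Places missing either field skip dedup entirely (see Edge Cases).
        if not name or not address:
            deduped.append(place)
            continue
        # Steps 1-3: normalize name and address, then build the unique key.
        key = f"{name.strip().lower()}|{address.strip().lower()}"
        # Step 4: keep only the first occurrence of each key.
        if key not in seen:
            seen.add(key)
            deduped.append(place)
    return deduped
```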

Examples

Exact Duplicates

# These are considered duplicates:
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="Coffee Shop", address="123 Main St"

# Result: Only Place 1 is kept

Case-Insensitive Matching

# These are considered duplicates:
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="COFFEE SHOP", address="123 MAIN ST"

# Result: Only Place 1 is kept

Whitespace Handling

# These are considered duplicates:
Place 1: name=" Coffee Shop ", address=" 123 Main St "
Place 2: name="Coffee Shop", address="123 Main St"

# Result: Only Place 1 is kept

Different Places with Same Name (NOT duplicates)

# These are NOT duplicates (different addresses):
Place 1: name="Starbucks", address="123 Main St"
Place 2: name="Starbucks", address="456 Elm St"

# Result: Both are kept

Different Places at Same Address (NOT duplicates)

# These are NOT duplicates (different names):
Place 1: name="Coffee Shop", address="123 Main St"
Place 2: name="Bookstore", address="123 Main St"

# Result: Both are kept (e.g., different businesses in same building)
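
All five examples above reduce to equality (or inequality) of the normalized key. A quick self-check, with `dedup_key()` as an illustrative helper name rather than the mixin's API:

```python
def dedup_key(name, address):
    # Normalized "name|address" key, as described in How It Works.
    return f"{name.strip().lower()}|{address.strip().lower()}"

# Case and surrounding whitespace do not matter:
assert dedup_key("COFFEE SHOP", "123 MAIN ST") == dedup_key(" Coffee Shop ", " 123 Main St ")

# Same name at different addresses, or different names at the same
# address, produce distinct keys, so both places are kept:
assert dedup_key("Starbucks", "123 Main St") != dedup_key("Starbucks", "456 Elm St")
assert dedup_key("Coffee Shop", "123 Main St") != dedup_key("Bookstore", "123 Main St")
```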

Implementation Details

Location in Codebase

The duplicate removal is implemented in:

  • File: search_indexes/viewsets/mixins.py
  • Class: GeoJSONResultMixin
  • Method: list() -> remove_place_duplicates()

When It Runs

The duplicate removal runs:

  • Endpoint: Place search endpoints (any endpoint using GeoJSONResultMixin with is_places=True)
  • Timing: After search/filtering but before GeoJSON conversion
  • Frequency: On every place search request

Logging

When duplicates are found, they are logged using loguru:

logger.info(f"Found {len(duplicates_found)} duplicate place(s) based on name+address combination")
logger.info(
    f"Duplicate removed - ID: {dup['id']}, "
    f"Name: '{dup['name']}', "
    f"Address: '{dup['address']}', "
    f"Duplicate of ID: {dup['duplicate_of_id']}"
)

Log Level: INFO
Log Location: Configured via Django/loguru settings

Edge Cases

Places with Missing Data

Places with missing name or address are preserved (not removed):

# These are NOT checked for duplicates:
Place 1: name=None, address="123 Main St"
Place 2: name="Coffee Shop", address=None
Place 3: name="", address="456 Elm St"

# Result: All are kept

Rationale: Without both name and address, we cannot reliably determine if something is a duplicate. These entries may need manual cleanup or enrichment.
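
A guard like the following expresses that rule (`is_checkable()` is an illustrative name, not the mixin's API):

```python
def is_checkable(place):
    # Only places with a non-empty name AND a non-empty address
    # participate in duplicate detection; everything else passes through.
    return bool(place.get("name")) and bool(place.get("address"))

places = [
    {"name": None, "address": "123 Main St"},
    {"name": "Coffee Shop", "address": None},
    {"name": "", "address": "456 Elm St"},
]
# None of these enter the dedup check, so all three are kept.
assert not any(is_checkable(p) for p in places)
```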

Multiple Duplicates of Same Place

# Input:
Place 1: name="Cafe", address="123 Main St"
Place 2: name="Cafe", address="123 Main St"
Place 3: name="Cafe", address="123 Main St"
Place 4: name="Cafe", address="123 Main St"

# Result: Only Place 1 is kept
# Logs show 3 duplicates removed, all pointing to Place 1

Testing

Running Tests

Standalone test file:

python test_duplicate_removal.py

Django test suite:

python manage.py test search_indexes.tests.TestRemovePlaceDuplicates

Test Coverage

The implementation includes tests for:

  • ✓ Exact duplicates
  • ✓ Case-insensitive matching
  • ✓ Whitespace handling
  • ✓ Multiple duplicates of same place
  • ✓ Different places with same name
  • ✓ Different places at same address
  • ✓ Missing name or address
  • ✓ Complex real-world scenarios
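
As a reference for the shape such a test takes, here is a hypothetical case for case-insensitive matching. The `_dedupe()` helper is a self-contained stand-in for the mixin's `remove_place_duplicates()`; the real suite lives in `search_indexes.tests.TestRemovePlaceDuplicates`:

```python
import unittest

def _dedupe(places):
    # Minimal stand-in for remove_place_duplicates(), included so this
    # sketch runs on its own.
    seen, kept = set(), []
    for p in places:
        if not p.get("name") or not p.get("address"):
            kept.append(p)
            continue
        key = f"{p['name'].strip().lower()}|{p['address'].strip().lower()}"
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

class TestRemovePlaceDuplicatesSketch(unittest.TestCase):
    def test_case_insensitive_match(self):
        places = [
            {"name": "Coffee Shop", "address": "123 Main St"},
            {"name": "COFFEE SHOP", "address": "123 MAIN ST"},
        ]
        result = _dedupe(places)
        # Only the first occurrence survives.
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0]["name"], "Coffee Shop")
```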

Performance Considerations

Time Complexity

  • Time: O(n) - a single pass, whether or not duplicates are present
  • Space: O(n) - stores unique keys in a dictionary

Impact on Response Time

  • Minimal: Single pass through results list
  • Typical overhead: < 10ms for 100 places
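
The overhead claim is easy to sanity-check locally. This micro-benchmark sketch uses a self-contained stand-in for the dedup pass on 100 synthetic places (names and counts are illustrative):

```python
import time

def dedupe(places):
    # Stand-in for the mixin's single-pass dedup.
    seen, kept = set(), []
    for p in places:
        key = f"{p['name'].strip().lower()}|{p['address'].strip().lower()}"
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

# 100 places, every other one a duplicate.
places = [{"name": f"Place {i % 50}", "address": f"{i % 50} Main St"} for i in range(100)]

start = time.perf_counter()
deduped = dedupe(places)
elapsed_ms = (time.perf_counter() - start) * 1000
```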

Monitoring

To monitor duplicate removal:

  1. Check logs for duplicate removal messages
  2. Count frequency of duplicates by analyzing logs
  3. Identify patterns in duplicate sources (specific data sources, import processes, etc.)

Example Log Query

grep "duplicate place(s) based on name+address" /path/to/logs | wc -l

Future Improvements

Potential enhancements:

  1. Fuzzy matching: Use string similarity (e.g., Levenshtein distance) for near-duplicates
  2. Address normalization: Standardize addresses before comparison (e.g., "St" vs "Street")
  3. Merge data: Combine data from duplicates instead of just removing them
  4. Duplicate prevention: Add unique constraints at database level
  5. Admin interface: Provide UI for reviewing and managing duplicates
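
The first enhancement could be prototyped with the standard library alone. This sketch uses `difflib.SequenceMatcher` rather than Levenshtein distance to avoid a dependency; the 0.8 threshold and dict shape are illustrative choices, not part of the current implementation:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.8):
    """Fuzzy-match two places on name and address similarity.

    Sketch of the fuzzy-matching enhancement above; threshold is an
    arbitrary starting point that would need tuning on real data.
    """
    def norm(s):
        # Lowercase and collapse internal whitespace before comparing.
        return " ".join(s.lower().split())

    name_sim = SequenceMatcher(None, norm(a["name"]), norm(b["name"])).ratio()
    addr_sim = SequenceMatcher(None, norm(a["address"]), norm(b["address"])).ratio()
    return name_sim >= threshold and addr_sim >= threshold
```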