Progressive Person Matching

Problem Statement

The Registration Portal allows principals (logged-in users) to link other persons to their account for registration purposes - such as family members, team members, or dependents. This creates a sensitive security surface: we must enable legitimate linking while preventing attackers from using the person lookup feature to enumerate personal data.

Security Risks

| Risk | Description |
|------|-------------|
| Data Enumeration | An attacker could probe the system with partial data (e.g., common surnames) to discover if specific individuals exist in the database, violating POPIA principles. |
| Brute Force Matching | Without rate limiting and minimum proof-of-knowledge requirements, attackers could systematically guess personal details to link unauthorized persons. |
| Primary Key Exposure | Exposing internal User.id values in API responses enables targeted attacks and leaks implementation details. |
| Information Disclosure | Returning full personal details for partial matches reveals protected information to unauthorized parties. |

POPIA Compliance Requirements

The Protection of Personal Information Act (POPIA) requires:

  • Purpose Limitation - Personal data only used for the stated purpose (registration linking)

  • Security Safeguards - Technical measures to prevent unauthorized access

  • Data Minimisation - Only reveal necessary information

  • Accountability - Audit trail of all access attempts

The current implementation uses a two-phase, search-then-link flow:

1. User enters search criteria (name, ID, etc.)
2. System returns list of matching persons
3. User selects from results
4. System creates LinkedPerson record

Current Security Gaps

| Gap | Risk | Impact |
|-----|------|--------|
| List-based results | Returns multiple matches with personal data | Enables enumeration of who exists in system |
| No minimum criteria | Can search with minimal information | Too easy to probe with common data |
| User.id exposure | Primary key returned in responses | Enables targeted attacks |
| No progressive disclosure | Full details shown immediately | Violates data minimisation principle |
| Rate limiting gaps | Per-endpoint only, not per-session | Determined attacker can work around |

To-Be: Progressive Matching with Weighted Scoring

Design Overview

The new design replaces "search-then-link" with "type → auto-match → confirm":

1. User types fields progressively
2. Frontend calculates local score (gate before API call)
3. Backend scores candidates, returns ONLY if unique match
4. Masked suggestion shown after delay
5. User confirms → LinkedPerson created with opaque token

Key Security Features

| Feature | Implementation |
|---------|----------------|
| Frontend Pre-Scoring Gate | 20-point minimum before calling backend. Prevents trivial probing. |
| Weighted Field Scoring | Different fields contribute different points based on uniqueness value. |
| Uniqueness Requirement | Backend only returns a match if exactly one candidate meets threshold. Multiple matches return "AMBIGUOUS" requiring more fields. |
| Delayed Reveal | 3-second delay at threshold 50; immediate reveal only at 80+. Gives user time to add more fields for better access rights. |
| Masked Data Only | Suggestions show masked name (J* S*), age range, gender - never full details until linked. |
| Opaque Match Tokens | matchToken used during search; LinkedPerson.id returned after linking. User.id never exposed. |
| Access Level by Score | Score 50-79 → READ (masked view); Score 80+ → READ_WRITE (full access) |
| Cross-Account Trust | +20 boost if same physical User has verified links in another account (prevents re-verification friction) |
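The score bands above (50 to suggest, 80 for a confident match, 3-second delay in between) can be sketched as two small lookups. This is an illustrative sketch only: the function and constant names are hypothetical, and returning None for "nothing is revealed" is an assumption, not part of the design.

```python
from typing import Optional

MINIMUM_TO_SUGGEST = 50    # below this, nothing is revealed
CONFIDENT_MATCH = 80       # at or above this, reveal immediately
REVEAL_DELAY_SECONDS = 3   # delay applied to scores in the 50-79 band


def access_level(score: int) -> Optional[str]:
    """Return the access level a score earns, or None below the threshold."""
    if score >= CONFIDENT_MATCH:
        return "READ_WRITE"
    if score >= MINIMUM_TO_SUGGEST:
        return "READ"
    return None


def reveal_delay_seconds(score: int) -> Optional[int]:
    """Return how long to wait before showing the masked suggestion."""
    if score >= CONFIDENT_MATCH:
        return 0
    if score >= MINIMUM_TO_SUGGEST:
        return REVEAL_DELAY_SECONDS
    return None  # no suggestion is shown at all
```

Keeping both lookups driven by the same two constants means the delay band and the READ band cannot drift apart.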

Frontend Pre-Scoring (API Gate)

The frontend calculates a local score before calling the backend:

| Field | Points | Rules |
|-------|--------|-------|
| First Name | 4-7 | Sliding scale: 2 chars=4, 3=5, 4=6, 5+=7. Minimum 2 chars. |
| Last Name | 4-7 | Sliding scale: 2 chars=4, 3=5, 4=6, 5+=7. Minimum 2 chars. |
| Date of Birth | 5 | Full date required. Ignored if full ID provided. |
| Gender | 4 | Single selection. Ignored if full ID provided. |
| ID Number (full) | 15 | Must be 13 chars AND pass Luhn checksum. Excludes DOB/Gender (no double-dipping). |
| ID Number (partial) | 5 | 6 digits matching valid YYMMDD (no 13th month). Same value as DOB. |
| Membership Number | 10 | Organisation-specific identifier |
| Email | 10 | Valid email format |
| Phone Number | 10 | Valid phone format |

Threshold: 20 points required to call backend or submit form.
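As a rough illustration of the gate, the point rules above can be summed as follows. This is a sketch under stated assumptions, not the frontend implementation: MatchInput and its flags are hypothetical names, field validity is assumed to be computed elsewhere, and counting a partial ID and a DOB only once is an assumption (the rules only say they carry the same value).

```python
from dataclasses import dataclass

API_GATE_THRESHOLD = 20  # points required before the backend may be called


@dataclass
class MatchInput:
    """Hypothetical form state; validity flags are computed elsewhere."""
    first_name: str = ""
    last_name: str = ""
    has_date_of_birth: bool = False      # a full date was entered
    has_gender: bool = False
    full_id_valid: bool = False          # 13 digits, valid date, Luhn passed
    partial_id_valid: bool = False       # 6 digits forming a valid YYMMDD
    has_membership_number: bool = False
    has_valid_email: bool = False
    has_valid_phone: bool = False


def name_points(name: str) -> int:
    """Sliding scale: 2 chars=4, 3=5, 4=6, 5+=7; under 2 chars scores 0."""
    length = len(name.strip())
    return 0 if length < 2 else min(7, length + 2)


def local_score(data: MatchInput) -> int:
    score = name_points(data.first_name) + name_points(data.last_name)
    if data.full_id_valid:
        score += 15  # DOB and gender are ignored: no double-dipping
    else:
        # Assumption: a partial ID and a DOB carry the same evidence,
        # so together they still count once.
        if data.partial_id_valid or data.has_date_of_birth:
            score += 5
        if data.has_gender:
            score += 4
    # Membership number, email and phone are worth 10 points each.
    score += 10 * sum([data.has_membership_number,
                       data.has_valid_email,
                       data.has_valid_phone])
    return score


def may_call_backend(data: MatchInput) -> bool:
    return local_score(data) >= API_GATE_THRESHOLD
```

For example, a five-letter first name plus a valid full ID scores 7 + 15 = 22 and passes the gate, while a two-letter first name plus a five-letter surname scores 4 + 7 = 11 and does not.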

Name Sliding Scale Formula

points = min(7, max(4, length + 2))
// Applies only when length >= 2 (minimum 2 chars); shorter input scores 0
// 2 chars → 4 pts
// 3 chars → 5 pts
// 4 chars → 6 pts
// 5+ chars → 7 pts

ID Number Validation

Full ID (13 characters):

  1. Exactly 13 digits

  2. Positions 1-6: Valid date (YYMMDD) - no invalid months/days

  3. Position 13: Luhn checksum digit must validate

Double-Dip Prevention: When a full, valid ID is provided, the DOB and Gender fields contribute 0 points, because that information is already encoded in the ID.

Partial ID (6 digits): Must match valid YYMMDD pattern to score 5 points.
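The validation rules above can be sketched as below. This is a minimal illustration, not the production validator: the function names are hypothetical, and because a two-digit year is ambiguous, the date check accepts a value that is a real calendar date in either the 1900s or the 2000s (an assumption).

```python
from datetime import date


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_valid_partial_id(value: str) -> bool:
    """Six ASCII digits that form a real YYMMDD date."""
    if len(value) != 6 or not (value.isascii() and value.isdigit()):
        return False
    yy, mm, dd = int(value[:2]), int(value[2:4]), int(value[4:])
    for century in (1900, 2000):  # two-digit year: try both centuries
        try:
            date(century + yy, mm, dd)
            return True
        except ValueError:
            continue
    return False


def is_valid_full_id(value: str) -> bool:
    """Thirteen digits, valid date in positions 1-6, Luhn check passes."""
    if len(value) != 13 or not (value.isascii() and value.isdigit()):
        return False
    return is_valid_partial_id(value[:6]) and luhn_valid(value)
```

With this sketch, `is_valid_full_id("8001015009087")` holds (a synthetic number whose Luhn sum is 40), while changing its last digit makes the check fail, and `is_valid_partial_id("851301")` is rejected for its 13th month.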

Backend Scoring Weights

person-matching:
  weights:
    # ID Number (authoritative)
    sa-id-number-exact: 60      # Full 13 chars + Luhn; excludes DOB/Gender
    sa-id-number-partial: 25    # First 6 digits (YYMMDD)

    # Date of Birth (ignored if full ID provided)
    date-of-birth-exact: 25

    # Gender (ignored if full ID provided)
    gender-exact: 8

    # Names (sliding scale)
    surname-exact-min: 8        # 2 chars
    surname-exact-max: 15       # 5+ chars
    first-name-exact-min: 8     # 2 chars
    first-name-exact-max: 15    # 5+ chars

    # Contact info
    email-exact: 20
    mobile-exact: 20
    membership-number-exact: 40

  thresholds:
    minimum-to-suggest: 50      # Show masked suggestions
    confident-match: 80         # Highlight as likely match

  cross-account-trust-boost: 20 # Same User verified in other account
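To make the interaction between these weights concrete, here is a hedged sketch of how a candidate's score might be summed. The function names are hypothetical. Two details are assumptions because the configuration does not specify them: name weights between the min (2 chars) and max (5+ chars) are interpolated linearly, and a partial ID is not counted on top of a matching DOB.

```python
from typing import Iterable

# Weights copied from the person-matching configuration.
WEIGHTS = {
    "sa-id-number-exact": 60,
    "sa-id-number-partial": 25,
    "date-of-birth-exact": 25,
    "gender-exact": 8,
    "email-exact": 20,
    "mobile-exact": 20,
    "membership-number-exact": 40,
}
NAME_MIN, NAME_MAX = 8, 15        # 2 chars and 5+ chars respectively
CROSS_ACCOUNT_TRUST_BOOST = 20


def name_weight(matched_length: int) -> int:
    """Assumed linear scale between NAME_MIN (2 chars) and NAME_MAX (5+)."""
    if matched_length < 2:
        return 0
    steps = min(matched_length, 5) - 2  # 0..3
    return round(NAME_MIN + (NAME_MAX - NAME_MIN) * steps / 3)


def backend_score(matched_fields: Iterable[str],
                  first_name_len: int = 0,
                  surname_len: int = 0,
                  cross_account_trusted: bool = False) -> int:
    fields = set(matched_fields)
    if "sa-id-number-exact" in fields:
        # Full ID already encodes DOB and gender: no double-dipping.
        fields -= {"sa-id-number-partial", "date-of-birth-exact", "gender-exact"}
    elif "date-of-birth-exact" in fields:
        # Assumption: partial ID repeats the DOB, so count it once.
        fields.discard("sa-id-number-partial")
    score = sum(WEIGHTS[name] for name in fields)
    score += name_weight(first_name_len) + name_weight(surname_len)
    if cross_account_trusted:
        score += CROSS_ACCOUNT_TRUST_BOOST
    return score
```

Under these assumptions, a full ID match plus a five-letter first name scores 60 + 15 = 75, which stays in the READ band; the same match with the cross-account boost reaches 95.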

Response States

| Status | Meaning | UI Action |
|--------|---------|-----------|
| NO_MATCH | No candidates meet threshold | Continue to new person creation |
| AMBIGUOUS | Multiple candidates qualify | Prompt for more fields (show suggested fields, never candidate data) |
| UNIQUE_MATCH | Exactly one candidate | Show masked suggestion |
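The three states follow directly from how many candidates reach the suggestion threshold. A minimal sketch, assuming candidates arrive as a mapping from an internal reference to a score (the mapping shape and function name are hypothetical):

```python
from typing import Dict

MINIMUM_TO_SUGGEST = 50


def classify(candidate_scores: Dict[str, int],
             threshold: int = MINIMUM_TO_SUGGEST) -> dict:
    """Map scored candidates to NO_MATCH, AMBIGUOUS or UNIQUE_MATCH."""
    qualifying = [ref for ref, score in candidate_scores.items()
                  if score >= threshold]
    if not qualifying:
        return {"status": "NO_MATCH"}
    if len(qualifying) > 1:
        # No candidate data is attached, only the count.
        return {"status": "AMBIGUOUS", "candidateCount": len(qualifying)}
    # The internal reference stays server-side; the client would only
    # ever receive a matchToken derived from it.
    return {"status": "UNIQUE_MATCH", "candidateRef": qualifying[0]}
```

Note that the unique candidate's internal reference in this sketch is for server-side use only and would be swapped for an opaque token before anything is returned.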

Cross-Account Trust Model

When a user logs in via different authentication methods (e.g., Facebook vs username/password), they create separate OrgUser accounts but are the same physical User.

Trust Resolution:

  • If the same User has a verified link (accessLevel=READ_WRITE) to a person in another account, +20 boost applied

  • Alternatively, if the link has been active for >30 days, trust is assumed

  • After linking 2+ people that match another account’s verified links, existing READ links are upgraded to READ_WRITE
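The first two trust rules can be sketched as a single lookup over the links held by the same User's other accounts. This is illustrative only: ExistingLink and its fields are hypothetical, and the sketch does not cover the third rule (upgrading READ links after 2+ matches).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

CROSS_ACCOUNT_TRUST_BOOST = 20
TRUST_AGE = timedelta(days=30)


@dataclass
class ExistingLink:
    """Hypothetical view of a link held by another account of the same User."""
    person_ref: str        # internal reference to the linked person
    access_level: str      # "READ" or "READ_WRITE"
    linked_at: datetime


def trust_boost(other_account_links: Iterable[ExistingLink],
                person_ref: str,
                now: datetime) -> int:
    """Return the score boost earned from links in the User's other accounts."""
    for link in other_account_links:
        if link.person_ref != person_ref:
            continue
        verified = link.access_level == "READ_WRITE"
        long_standing = now - link.linked_at > TRUST_AGE
        if verified or long_standing:
            return CROSS_ACCOUNT_TRUST_BOOST
    return 0
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the 30-day rule easy to test.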

API Endpoints

Progressive Match

POST /api/people/progressive-match
Content-Type: application/json

Request:
{
  "input": {
    "idNumber": "8501015009086",
    "firstName": "Johan",
    "surname": null,
    "dateOfBirth": null,
    "gender": "MALE"
  },
  "organisationId": 1
}

Response (UNIQUE_MATCH):
{
  "status": "UNIQUE_MATCH",
  "confidenceScore": 75,             // 60 (full ID) + 15 (first name, 5+ chars); gender excluded by full ID
  "suggestion": {
    "matchToken": "abc123...",       // Opaque, time-limited
    "maskedName": "J**** S****",
    "gender": "Male",
    "ageRange": "35-40",
    "matchedFields": ["ID_NUMBER_EXACT", "FIRST_NAME"]
  },
  "suggestedAccessLevel": "READ"
}

Response (AMBIGUOUS):
{
  "status": "AMBIGUOUS",
  "candidateCount": 3,
  "message": "Multiple potential matches. Please provide more details.",
  "suggestedFields": ["surname", "dateOfBirth"]
}

Link by Token

POST /api/linked-people/link-by-token

Request:
{
  "matchToken": "abc123...",
  "linkType": "FAMILY",
  "organisationId": 1
}

Response:
{
  "linkedPersonId": 456,           // LinkedPerson.id (safe to expose)
  "accessLevel": "READ",           // Based on score
  "validTo": "2026-01-27T00:00:00Z"
}
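The maskedName field in the UNIQUE_MATCH response can be produced with a small helper. The document shows two renderings, "J* S*" in the feature table and "J**** S****" in the response example, so this sketch supports both through a hypothetical preserve_length flag; which one ships is a design decision this sketch does not settle.

```python
def mask_name(full_name: str, preserve_length: bool = True) -> str:
    """Keep the first letter of each name part and mask the rest.

    preserve_length=True  -> "J**** S****" (reveals each part's length)
    preserve_length=False -> "J* S*"       (reveals initials only)
    """
    masked_parts = []
    for part in full_name.split():
        stars = "*" * (len(part) - 1) if preserve_length else "*"
        masked_parts.append(part[0] + stars)
    return " ".join(masked_parts)
```

The fixed-width form leaks slightly less, since name length can help an attacker confirm a guess.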

Rationale

Why Weighted Scoring?

Different fields have different uniqueness value:

  • ID Number is highly unique - strong proof of knowledge

  • Common names like "John" provide less assurance than "Bartholomew"

  • Email and phone are good identifiers but can be socially engineered

Weighted scoring reflects real-world identification confidence.

Why Uniqueness Requirement?

Returning multiple matches would:

  1. Reveal that multiple people with those characteristics exist (enumeration)

  2. Force UI to display a list (requiring personal data disclosure)

  3. Enable attackers to narrow down targets

By requiring exactly one match, we force the user to provide enough information to unambiguously identify the person - proving genuine knowledge.

Why Delayed Reveal?

The 3-second delay at threshold 50:

  1. Encourages users to add more fields (potentially reaching 80+ for better access)

  2. Slows rapid probing, complementing rate limiting rather than replacing it

  3. Gives legitimate users time to complete the form naturally

Why Opaque Tokens?

matchToken is:

  • Time-limited (expires after 5 minutes)

  • Single-use (invalidated after redemption)

  • Cryptographically secure (cannot be guessed)

  • Opaque (reveals nothing about the underlying data)

This ensures the client never learns the User.id, preventing:

  • IDOR (Insecure Direct Object Reference) attacks

  • Targeted enumeration based on sequential IDs

  • Correlation between sessions
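The token properties listed above (5-minute expiry, single use, unguessable, opaque) can be sketched with an in-memory store. This is a sketch under assumptions: MatchTokenStore is a hypothetical name, a real deployment would need shared storage and would probably also bind the token to the requesting session, and the injectable clock exists only to make expiry testable.

```python
import secrets
import time
from typing import Callable, Dict, Optional, Tuple

TOKEN_TTL_SECONDS = 5 * 60


class MatchTokenStore:
    """In-memory sketch: maps opaque tokens to an internal user id."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._tokens: Dict[str, Tuple[int, float]] = {}

    def issue(self, user_id: int) -> str:
        # Random token: carries no information about the user it maps to.
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user_id, self._clock() + TOKEN_TTL_SECONDS)
        return token

    def redeem(self, token: str) -> Optional[int]:
        # pop() removes the entry, so a token can be redeemed only once.
        entry = self._tokens.pop(token, None)
        if entry is None:
            return None
        user_id, expires_at = entry
        if self._clock() >= expires_at:
            return None
        return user_id
```

Redeeming returns the internal id to server-side code only; the link-by-token response would still expose just LinkedPerson.id.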

Why Cross-Account Trust?

Real users may:

  • Create accounts via different auth methods over time

  • Forget they already linked family members in another login

Cross-account trust:

  • Reduces friction for legitimate users

  • Only applies to verified relationships

  • Includes time constraint (30 days) to prevent abuse

Prerequisites

Deduplication must complete before Progressive Matching goes live.

Existing duplicate records in the database would cause matches for any duplicated person to return "AMBIGUOUS" no matter how much the user enters, frustrating legitimate users. A separate Deduplication Epic should:

  1. Identify duplicate candidates (same ID number, similar names + DOB)

  2. Build admin merge UI for review

  3. Implement merge logic preserving relationships and audit trail

  4. Complete before Progressive Matching deployment
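Step 1 can start with the simplest signal, records that share an ID number. The sketch below covers only that exact-match case; the record shape (dicts with hypothetical "ref" and "idNumber" keys) is an assumption, and the "similar names + DOB" case would need fuzzy matching that is out of scope here.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_groups_by_id(people: List[dict]) -> Dict[str, List[str]]:
    """Group record references that share the same ID number."""
    by_id: Dict[str, List[str]] = defaultdict(list)
    for person in people:
        id_number = person.get("idNumber")
        if id_number:  # records without an ID number are skipped
            by_id[id_number].append(person["ref"])
    # Only groups with more than one record are duplicate candidates.
    return {id_number: refs for id_number, refs in by_id.items()
            if len(refs) > 1}
```

Each returned group is a candidate for the admin merge review in step 2, not an automatic merge.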

Security Controls Summary

| Control | Implementation |
|---------|----------------|
| Enumeration Prevention | Uniqueness requirement - only returns if exactly one match |
| Proof of Knowledge | 20-point frontend gate + 50-point backend threshold |
| Rate Limiting | Per-session limits on progressive match calls |
| Data Minimisation | Masked suggestions only; full data after verified link |
| PK Protection | User.id never exposed; matchToken and LinkedPerson.id only |
| Timing Attack Prevention | Consistent response times regardless of match status |
| Audit Trail | All progressive match attempts and token redemptions logged |
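The per-session rate limit could take the form of a sliding-window counter keyed by session. This is a sketch, not the planned rate-limiting aspect: the class name is hypothetical, and the default of 10 calls per 60 seconds is a placeholder, since the document does not state the actual limits.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SessionRateLimiter:
    """Sliding-window limiter: at most max_calls per window, per session."""

    def __init__(self,
                 max_calls: int = 10,            # placeholder limit
                 window_seconds: float = 60.0,   # placeholder window
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, session_id: str) -> bool:
        now = self._clock()
        calls = self._calls[session_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self._window:
            calls.popleft()
        if len(calls) >= self._max_calls:
            return False
        calls.append(now)
        return True
```

Keying on the session rather than the endpoint addresses the "per-endpoint only" gap listed under Current Security Gaps.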

Implementation Checklist

  • PersonMatchScorer - Weighted scoring with configurable weights

  • PersonMatchConfig - YAML configuration for weights and thresholds

  • ProgressiveMatchResponse - Response DTOs for match states

  • PersonResource.progressiveMatch() - New endpoint

  • LinkedPersonResourceEx.linkByToken() - Token redemption endpoint

  • Frontend pre-scoring component

  • Delayed reveal UI with timer

  • Rate limiting aspect adaptation

  • Cross-account trust resolution

  • Progressive access level upgrade