Progressive Person Matching

Problem Statement

The Registration Portal allows principals (logged-in users) to link other persons to their account for registration purposes - such as family members, team members, or dependents. This creates a sensitive security surface: we must enable legitimate linking while preventing attackers from using the person lookup feature to enumerate personal data.

Security Risks

Risk	Description
Data Enumeration	An attacker could probe the system with partial data (e.g., common surnames) to discover if specific individuals exist in the database, violating POPIA principles.
Brute Force Matching	Without rate limiting and minimum proof-of-knowledge requirements, attackers could systematically guess personal details to link unauthorized persons.
Primary Key Exposure	Exposing internal `User.id` values in API responses enables targeted attacks and leaks implementation details.
Information Disclosure	Returning full personal details for partial matches reveals protected information to unauthorized parties.

POPIA Compliance Requirements

The Protection of Personal Information Act (POPIA) requires:

Purpose Limitation - Personal data only used for the stated purpose (registration linking)
Security Safeguards - Technical measures to prevent unauthorized access
Data Minimisation - Only reveal necessary information
Accountability - Audit trail of all access attempts

As-Is: Search-Then-Link Flow

The current implementation searches first and links second, in four steps:

1. User enters search criteria (name, ID, etc.)
2. System returns list of matching persons
3. User selects from results
4. System creates LinkedPerson record

Current Security Gaps

Gap	Risk	Impact
List-based results	Returns multiple matches with personal data	Enables enumeration of who exists in system
No minimum criteria	Can search with minimal information	Too easy to probe with common data
User.id exposure	Primary key returned in responses	Enables targeted attacks
No progressive disclosure	Full details shown immediately	Violates data minimisation principle
Rate limiting gaps	Per-endpoint only, not per-session	Determined attacker can work around

To-Be: Progressive Matching with Weighted Scoring

Design Overview

The new design replaces "search-then-link" with "type → auto-match → confirm":

1. User types fields progressively
2. Frontend calculates local score (gate before API call)
3. Backend scores candidates and returns a suggestion ONLY if exactly one candidate qualifies
4. Masked suggestion shown after delay
5. User confirms → LinkedPerson created with opaque token

Key Security Features

Feature	Implementation
Frontend Pre-Scoring Gate	20-point minimum before calling backend. Prevents trivial probing.
Weighted Field Scoring	Different fields contribute different points based on uniqueness value.
Uniqueness Requirement	Backend only returns a match if exactly one candidate meets threshold. Multiple matches return "AMBIGUOUS" requiring more fields.
Delayed Reveal	3-second delay at threshold 50; immediate reveal only at 80+. Gives user time to add more fields for better access rights.
Masked Data Only	Suggestions show deterministically masked name (e.g., `J*n****n`), age range, gender - never full details until linked. Masking reveals `floor(length * 0.25) + 1` characters using a hash-based deterministic selection. User-typed name prefixes that match are also revealed (no new information disclosed).
Opaque Match Tokens	`matchToken` used during search; `LinkedPerson.id` returned after linking. `User.id` never exposed.
Access Level by Score	Score 50-79 → `READ` (masked view); Score 80+ → `READ_WRITE` (full access)
Cross-Account Trust	+20 boost if same physical User has verified links in another account (prevents re-verification friction)
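
The masking rule in the table can be sketched as follows. This is a minimal illustration, not the production masker: `maskName`, the `salt` parameter, and the FNV-1a hash are assumptions, since the document only specifies "hash-based deterministic selection". The examples in this document all show the first letter; this sketch does not guarantee that.

```typescript
// FNV-1a 32-bit hash. Stands in for whatever hash the real implementation uses (assumption).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Masks one name part, revealing floor(length * 0.25) + 1 characters.
// Positions are ranked by a hash of (salt, name, position), so the same
// name and salt always produce the same mask.
function maskName(name: string, salt: string, typedPrefix = ""): string {
  const chars = Array.from(name);
  const revealCount = Math.min(chars.length, Math.floor(chars.length * 0.25) + 1);
  const ranked = chars
    .map((_, index) => ({ index, rank: fnv1a(`${salt}:${name}:${index}`) }))
    .sort((a, b) => a.rank - b.rank || a.index - b.index);
  const revealed = new Set(ranked.slice(0, revealCount).map((p) => p.index));

  // A prefix the user already typed discloses nothing new, so show it as well.
  if (typedPrefix.length > 0 && name.toLowerCase().startsWith(typedPrefix.toLowerCase())) {
    const prefixLength = Array.from(typedPrefix).length;
    for (let i = 0; i < prefixLength; i++) revealed.add(i);
  }
  return chars.map((c, i) => (revealed.has(i) ? c : "*")).join("");
}
```

Because the selection depends only on the salt and the name, repeated requests cannot be combined to uncover additional characters.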

Frontend Pre-Scoring (API Gate)

The frontend calculates a local score before calling the backend:

Field	Points	Rules
First Name	4-7	Sliding scale: 2 chars=4, 3=5, 4=6, 5+=7. Minimum 2 chars.
Last Name	4-7	Sliding scale: 2 chars=4, 3=5, 4=6, 5+=7. Minimum 2 chars.
Date of Birth	5	Full date required. Ignored if full ID provided.
Gender	4	Single selection. Ignored if full ID provided.
ID Number (full)	15	Must be 13 chars AND pass Luhn checksum. Excludes DOB/Gender (no double-dipping).
ID Number (partial)	5	6 digits matching valid YYMMDD (no 13th month). Same value as DOB.
Membership Number	10	Organisation-specific identifier
Email	10	Valid email format
Phone Number	10	Valid phone format

Threshold: 20 points required to call backend or submit form.
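
A minimal sketch of the pre-scoring gate, using the point values from the table. The `MatchInput` shape and the function names are hypothetical, and field validation (ID checksum, email and phone format) is assumed to happen elsewhere and arrive as boolean flags. The sketch also assumes that a partial ID and a date of birth do not stack, since both carry the same information.

```typescript
// Hypothetical input shape; validity flags are computed by separate validators.
interface MatchInput {
  firstName?: string;
  lastName?: string;
  dateOfBirth?: string;      // full date only
  gender?: string;
  fullIdValid?: boolean;     // 13 digits, valid date, checksum passed
  partialIdValid?: boolean;  // 6 digits forming a valid YYMMDD
  membershipNumber?: string;
  emailValid?: boolean;
  phoneValid?: boolean;
}

const BACKEND_CALL_THRESHOLD = 20;

// 2 chars = 4, 3 = 5, 4 = 6, 5+ = 7; fewer than 2 chars scores nothing.
function namePoints(name: string | undefined): number {
  const length = (name ?? "").trim().length;
  return length < 2 ? 0 : Math.min(7, Math.max(4, length + 2));
}

function localScore(input: MatchInput): number {
  let score = namePoints(input.firstName) + namePoints(input.lastName);
  if (input.fullIdValid) {
    score += 15; // DOB and gender are already encoded in the ID, so they add nothing
  } else {
    if (input.dateOfBirth || input.partialIdValid) score += 5; // assumption: no stacking
    if (input.gender) score += 4;
  }
  if (input.membershipNumber) score += 10;
  if (input.emailValid) score += 10;
  if (input.phoneValid) score += 10;
  return score;
}

// The frontend only calls the backend once the local score reaches the threshold.
function mayCallBackend(input: MatchInput): boolean {
  return localScore(input) >= BACKEND_CALL_THRESHOLD;
}
```

For example, two full names plus date of birth and gender give 7 + 7 + 5 + 4 = 23 points and pass the gate, while a valid full ID on its own gives 15 and does not.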

Name Sliding Scale Formula

points = length < 2 ? 0 : min(7, max(4, length + 2))
// 0-1 chars → 0 pts (below the 2-char minimum)
// 2 chars → 4 pts
// 3 chars → 5 pts
// 4 chars → 6 pts
// 5+ chars → 7 pts

ID Number Validation

Full ID (13 characters):

Exactly 13 digits
Positions 1-6: Valid date (YYMMDD) - no invalid months/days
Position 13: Luhn checksum digit must validate

Double-Dip Prevention: When full valid ID matches, DOB and Gender fields contribute 0 points (information already encoded in ID).

Partial ID (6 digits): Must match valid YYMMDD pattern to score 5 points.
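
The validation rules above can be sketched as below. The function names are illustrative. Because a two-digit year does not identify the century, the date check assumes 2000 as the base year so that 29 February is accepted whenever either century would allow it. The ID values in the usage note are synthetic test values.

```typescript
// Standard Luhn check over a string of digits (rightmost digit is the check digit).
function passesLuhn(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// True when the six digits form a real calendar date (YYMMDD).
function isValidYymmdd(six: string): boolean {
  if (!/^\d{6}$/.test(six)) return false;
  const yy = Number(six.slice(0, 2));
  const mm = Number(six.slice(2, 4));
  const dd = Number(six.slice(4, 6));
  // Date.UTC rolls invalid months/days over, so a round trip detects them.
  const date = new Date(Date.UTC(2000 + yy, mm - 1, dd));
  return date.getUTCMonth() === mm - 1 && date.getUTCDate() === dd;
}

// Full ID: exactly 13 digits, valid date in positions 1-6, Luhn check digit in position 13.
function isValidFullId(id: string): boolean {
  return /^\d{13}$/.test(id) && isValidYymmdd(id.slice(0, 6)) && passesLuhn(id);
}

// Partial ID: exactly 6 digits forming a valid YYMMDD.
function isValidPartialId(id: string): boolean {
  return isValidYymmdd(id);
}
```

With these definitions, `isValidFullId("8001015009087")` is true, while changing the last digit to `6` or the month to `13` makes it false.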

Backend Scoring Weights

person-matching:
  weights:
    # ID Number (authoritative)
    sa-id-number-exact: 60      # Full 13 chars + Luhn; excludes DOB/Gender
    sa-id-number-partial: 25    # First 6 digits (YYMMDD)

    # Date of Birth (ignored if full ID provided)
    date-of-birth-exact: 25

    # Gender (ignored if full ID provided)
    gender-exact: 8

    # Names (sliding scale)
    surname-exact-min: 8        # 2 chars
    surname-exact-max: 15       # 5+ chars
    first-name-exact-min: 8     # 2 chars
    first-name-exact-max: 15    # 5+ chars

    # Contact info
    email-exact: 20
    mobile-exact: 20
    membership-number-exact: 40

  thresholds:
    minimum-to-suggest: 50      # Show masked suggestions
    confident-match: 80         # Highlight as likely match

  cross-account-trust-boost: 20 # Same User verified in other account
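
As a rough illustration of how these weights might combine for one candidate, consider the sketch below. `FieldMatches` and `scoreCandidate` are hypothetical names. Two points are assumptions rather than documented behaviour: the name weight steps linearly (rounded) between the 2-char minimum and the 5+ char maximum, and a partial ID and a date of birth do not stack.

```typescript
// Mirrors the YAML weights above.
const matchWeights = {
  saIdNumberExact: 60,
  saIdNumberPartial: 25,
  dateOfBirthExact: 25,
  genderExact: 8,
  surnameExactMin: 8,
  surnameExactMax: 15,
  firstNameExactMin: 8,
  firstNameExactMax: 15,
  emailExact: 20,
  mobileExact: 20,
  membershipNumberExact: 40,
  crossAccountTrustBoost: 20,
};

// Which submitted fields matched a given candidate (hypothetical shape).
interface FieldMatches {
  idExact?: boolean;
  idPartial?: boolean;
  dateOfBirth?: boolean;
  gender?: boolean;
  surnameChars?: number;    // number of characters supplied for a matching surname
  firstNameChars?: number;  // number of characters supplied for a matching first name
  email?: boolean;
  mobile?: boolean;
  membershipNumber?: boolean;
  crossAccountTrusted?: boolean;
}

// Assumption: linear steps from min (2 chars) to max (5+ chars), rounded.
function slidingNameWeight(chars: number | undefined, min: number, max: number): number {
  const n = Math.min(chars ?? 0, 5);
  if (n < 2) return 0;
  return Math.round(min + ((max - min) * (n - 2)) / 3);
}

function scoreCandidate(m: FieldMatches): number {
  const w = matchWeights;
  let score =
    slidingNameWeight(m.surnameChars, w.surnameExactMin, w.surnameExactMax) +
    slidingNameWeight(m.firstNameChars, w.firstNameExactMin, w.firstNameExactMax);
  if (m.idExact) {
    score += w.saIdNumberExact; // DOB and gender are ignored: already encoded in the ID
  } else {
    // Assumption: partial ID and DOB carry the same information, so only one counts.
    score += Math.max(m.idPartial ? w.saIdNumberPartial : 0, m.dateOfBirth ? w.dateOfBirthExact : 0);
    if (m.gender) score += w.genderExact;
  }
  if (m.email) score += w.emailExact;
  if (m.mobile) score += w.mobileExact;
  if (m.membershipNumber) score += w.membershipNumberExact;
  if (m.crossAccountTrusted) score += w.crossAccountTrustBoost;
  return score;
}
```

Under these weights a full ID alone scores 60, which clears `minimum-to-suggest` but not `confident-match`; reaching 80 requires at least one further matching field or the trust boost.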

Response States

Status	Meaning	UI Action
`NO_MATCH`	No candidates meet threshold	Continue to new person creation
`AMBIGUOUS`	Multiple candidates qualify	Prompt for more fields (show suggested fields only, never candidate data)
`UNIQUE_MATCH`	Exactly one candidate	Show masked suggestion
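
The mapping from candidate scores to a response state can be sketched as a pure function. `resolveMatch` and `MatchResult` are hypothetical names; the thresholds (50 and 80) and the access levels come from this document.

```typescript
type MatchStatus = "NO_MATCH" | "AMBIGUOUS" | "UNIQUE_MATCH";
type AccessLevel = "READ" | "READ_WRITE";

const MINIMUM_TO_SUGGEST = 50;
const CONFIDENT_MATCH = 80;

interface MatchResult {
  status: MatchStatus;
  confidenceScore?: number;
  suggestedAccessLevel?: AccessLevel;
}

// Takes the score of every candidate and decides what, if anything, may be returned.
function resolveMatch(candidateScores: number[]): MatchResult {
  const qualifying = candidateScores.filter((score) => score >= MINIMUM_TO_SUGGEST);
  if (qualifying.length === 0) return { status: "NO_MATCH" };
  // More than one qualifying candidate: reveal nothing about any of them.
  if (qualifying.length > 1) return { status: "AMBIGUOUS" };
  const score = qualifying[0];
  return {
    status: "UNIQUE_MATCH",
    confidenceScore: score,
    suggestedAccessLevel: score >= CONFIDENT_MATCH ? "READ_WRITE" : "READ",
  };
}
```

Candidates below the suggestion threshold never count towards ambiguity, so `resolveMatch([65, 20])` yields a unique match at `READ` level.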

Cross-Account Trust Model

When a user logs in via different authentication methods (e.g., Facebook vs username/password), they create separate OrgUser accounts but are the same physical User.

Trust Resolution:

If the same User has a verified link (accessLevel=READ_WRITE) to a person in another account, a +20 boost is applied
Alternatively, if the link has been active for >30 days, trust is assumed
After linking 2+ people that match another account’s verified links, existing READ links are upgraded to READ_WRITE
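
One possible reading of these three rules is sketched below. The `ExistingLink` shape, the `personKey` field, and the function names are hypothetical, and the sketch assumes the 30-day rule is an alternative route to the same +20 boost.

```typescript
// A link held by another account of the same physical User (hypothetical shape).
interface ExistingLink {
  personKey: string;                    // stable server-side key for the linked person
  accessLevel: "READ" | "READ_WRITE";
  validFrom: Date;
}

const TRUST_BOOST = 20;
const TRUST_AGE_DAYS = 30;
const DAY_MS = 24 * 60 * 60 * 1000;

// A link is trusted if it is verified (READ_WRITE) or has been active for more than 30 days.
function isTrustedLink(link: ExistingLink, now: Date): boolean {
  const ageDays = (now.getTime() - link.validFrom.getTime()) / DAY_MS;
  return link.accessLevel === "READ_WRITE" || ageDays > TRUST_AGE_DAYS;
}

// Boost for one candidate, given the links held by the same User's other accounts.
function trustBoost(candidateKey: string, otherAccountLinks: ExistingLink[], now: Date): number {
  const trusted = otherAccountLinks.some(
    (link) => link.personKey === candidateKey && isTrustedLink(link, now),
  );
  return trusted ? TRUST_BOOST : 0;
}

// READ links are upgraded once 2+ people linked in this account
// also appear among the other account's verified links.
function shouldUpgradeReadLinks(currentAccountKeys: string[], otherAccountLinks: ExistingLink[]): boolean {
  const verified = new Set(
    otherAccountLinks.filter((l) => l.accessLevel === "READ_WRITE").map((l) => l.personKey),
  );
  const overlap = new Set(currentAccountKeys.filter((key) => verified.has(key)));
  return overlap.size >= 2;
}
```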

API Endpoints

Progressive Match

POST /api/people/progressive-match
Content-Type: application/json

Request:
{
  "input": {
    "idNumber": "8501015009086",
    "firstName": "Johan",
    "surname": null,
    "dateOfBirth": null,
    "gender": "MALE"
  },
  "organisationId": 1
}

Response (UNIQUE_MATCH):
{
  "status": "UNIQUE_MATCH",
  "confidenceScore": 65,
  "suggestion": {
    "matchToken": "abc123...",       // Opaque, time-limited
    "maskedName": "J*h** S***h",
    "gender": "Male",
    "ageRange": "35-40",
    "matchedFields": ["ID_NUMBER_EXACT", "GENDER"]
  },
  "suggestedAccessLevel": "READ"
}

Response (AMBIGUOUS):
{
  "status": "AMBIGUOUS",
  "candidateCount": 3,
  "message": "Multiple potential matches. Please provide more details.",
  "suggestedFields": ["surname", "dateOfBirth"]
}

Link by Token

POST /api/linked-people/link-by-token

Request:
{
  "matchToken": "abc123...",
  "linkType": "FAMILY",
  "organisationId": 1
}

Response:
{
  "linkedPersonId": 456,           // LinkedPerson.id (safe to expose)
  "accessLevel": "READ",           // Based on score
  "validTo": "2026-01-27T00:00:00Z"
}

Rationale

Why Weighted Scoring?

Different fields have different uniqueness value:

ID Number is highly unique - strong proof of knowledge
Common names like "John" provide less assurance than "Bartholomew"
Email and phone are good identifiers but can be socially engineered

Weighted scoring reflects real-world identification confidence.

Why Uniqueness Requirement?

Returning multiple matches would:

Reveal that multiple people with those characteristics exist (enumeration)
Force UI to display a list (requiring personal data disclosure)
Enable attackers to narrow down targets

By requiring exactly one match, we force the user to provide enough information to unambiguously identify the person - proving genuine knowledge.

Why Delayed Reveal?

The 3-second delay at threshold 50:

Encourages users to add more fields (potentially reaching 80+ for better access)
Slows rapid probing (complements rate limiting)
Gives legitimate users time to complete the form naturally
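
The reveal timing reduces to a small lookup on the score. The function name `revealDelayMs` is illustrative; the score bands and the 3-second delay are the values given in this document.

```typescript
// Returns how long to wait before showing the masked suggestion,
// or null when the score is too low for any suggestion.
function revealDelayMs(score: number): number | null {
  if (score >= 80) return 0;     // confident match: reveal immediately
  if (score >= 50) return 3000;  // suggestion threshold: reveal after 3 seconds
  return null;                   // below threshold: nothing to reveal
}
```

A UI timer would typically restart whenever the user edits a field, so the delay counts from the last keystroke; that behaviour is an assumption and is not shown here.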

Why Opaque Tokens?

matchToken is:

Time-limited (expires after 5 minutes)
Single-use (invalidated after redemption)
Cryptographically secure (cannot be guessed)
Opaque (reveals nothing about the underlying data)

This ensures the client never learns the User.id, preventing:

IDOR (Insecure Direct Object Reference) attacks
Targeted enumeration based on sequential IDs
Correlation between sessions
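
A minimal in-memory sketch of the token lifecycle (issue, 5-minute expiry, single use) follows. `MatchTokenStore` and `TokenEntry` are hypothetical names. The token generator is injected so the sketch stays self-contained; a real deployment would need a cryptographically secure generator and probably a shared store rather than a process-local `Map`.

```typescript
// What the server remembers for each token. None of this is sent to the client.
interface TokenEntry {
  userId: number;     // internal key, kept server-side only
  score: number;      // score at match time, used to derive the access level
  expiresAt: number;  // epoch milliseconds
}

const TOKEN_TTL_MS = 5 * 60 * 1000;

class MatchTokenStore {
  private readonly entries = new Map<string, TokenEntry>();
  private readonly newToken: () => string;
  private readonly now: () => number;

  // newToken must be backed by a CSPRNG in real use; now is injectable for testing.
  constructor(newToken: () => string, now: () => number = () => Date.now()) {
    this.newToken = newToken;
    this.now = now;
  }

  issue(userId: number, score: number): string {
    const token = this.newToken();
    this.entries.set(token, { userId, score, expiresAt: this.now() + TOKEN_TTL_MS });
    return token;
  }

  // Returns the entry at most once; expired or already-redeemed tokens yield undefined.
  redeem(token: string): TokenEntry | undefined {
    const entry = this.entries.get(token);
    this.entries.delete(token); // single-use: removed on first redemption attempt
    if (!entry || entry.expiresAt <= this.now()) return undefined;
    return entry;
  }
}
```

Only the token string crosses the API boundary; the mapping to `User.id` stays inside the store.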

Why Cross-Account Trust?

Real users may:

Create accounts via different auth methods over time
Forget they already linked family members in another login

Cross-account trust:

Reduces friction for legitimate users
Only applies to verified relationships
Includes time constraint (30 days) to prevent abuse

Prerequisites

Deduplication must complete before Progressive Matching goes live.

Duplicate records currently in the database would cause matches on the affected persons to return "AMBIGUOUS" no matter how much the user enters, frustrating legitimate users. A separate Deduplication Epic should:

Identify duplicate candidates (same ID number, similar names + DOB)
Build admin merge UI for review
Implement merge logic preserving relationships and audit trail
Complete before Progressive Matching deployment

Security Controls Summary

Control	Implementation
Enumeration Prevention	Uniqueness requirement - a suggestion is returned only if exactly one candidate matches
Proof of Knowledge	20-point frontend gate + 50-point backend threshold
Rate Limiting	Per-session limits on progressive match calls
Data Minimisation	Masked suggestions only; full data after verified link
PK Protection	User.id never exposed; matchToken and LinkedPerson.id only
Timing Attack Prevention	Consistent response times regardless of match status
Audit Trail	All progressive match attempts and token redemptions logged

Implementation Checklist

PersonMatchScorer - Weighted scoring with configurable weights
PersonMatchConfig - YAML configuration for weights and thresholds
ProgressiveMatchResponse - Response DTOs for match states
PersonResource.progressiveMatch() - New endpoint
LinkedPersonResourceEx.linkByToken() - Token redemption endpoint
Frontend pre-scoring component
Delayed reveal UI with timer
Rate limiting aspect adaptation
Cross-account trust resolution
Progressive access level upgrade