PII Detection for LLMs: A Technical Guide

February 20, 2026  ·  9 min read  ·  Yusuf Jacobs

Every LLM-powered agent is one prompt away from leaking PII. User messages contain names, emails, phone numbers, and social security numbers. Model responses can hallucinate real PII from training data. Here's how to build a detection pipeline that catches it all.

The PII Problem in LLM Pipelines

PII leakage in LLM applications happens at three points:

Input — Users submit personal information in their prompts. A support chat might include “My SSN is 123-45-6789, can you check my account?”

Context — RAG systems retrieve documents containing PII from vector databases. An employee handbook might include salary data.

Output — The model includes PII in its response, either echoed from input, retrieved from context, or hallucinated from training data.

A comprehensive PII detection system must scan all three points. Scanning only outputs misses PII that gets stored in conversation logs. Scanning only inputs misses model hallucination.

Approaches to PII Detection

There are three main approaches, each with trade-offs:

1. Regex-Based Detection

Pattern matching using regular expressions. Fast, deterministic, and zero-dependency.

// US Social Security Number
/\b\d{3}-\d{2}-\d{4}\b/

// Australian Medicare Number
/\b\d{4}[\s-]?\d{5}[\s-]?\d{1}\b/

// UK NHS Number
/\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b/

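// Credit Card, 16-digit Visa/Mastercard/Discover (format only; run a Luhn check on each match separately)
/\b(?:4\d{3}|5[1-5]\d{2}|6(?:011|5\d{2}))[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/

// Credit Card, 15-digit American Express (4-6-5 grouping)
/\b3[47]\d{2}[\s-]?\d{6}[\s-]?\d{5}\b/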

Pros: Sub-millisecond latency, no external dependencies, works offline, deterministic.
Cons: Can't detect contextual PII (names, addresses), struggles with varied formats.
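A regex can only check the shape of a card number. The Luhn checksum has to be computed in code on each match. Here is a minimal Python sketch of that second step; the function names are illustrative and not part of any SDK, and the pattern covers only the 16-digit card shapes.

```python
import re

# 16-digit card shapes only (Visa, Mastercard 51-55, Discover); illustrative, not exhaustive.
CARD_RE = re.compile(
    r"\b(?:4\d{3}|5[1-5]\d{2}|6(?:011|5\d{2}))[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"
)


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Return regex matches that also pass the Luhn check."""
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]
```

Filtering on the checksum removes most random 16-digit strings (order numbers, tracking IDs) that happen to match the pattern.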

2. NER-Based Detection

Named Entity Recognition (NER) models and the tools built on them (spaCy, GLiNER, Microsoft Presidio) identify PII through machine learning:

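# Using Microsoft Presidio
# Requires: pip install presidio-analyzer, plus a spaCy English model (e.g. en_core_web_lg)
from presidio_analyzer import AnalyzerEngine

user_input = "My name is Jane Smith"

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=user_input, language='en')
# Example result: [type: PERSON, start: 11, end: 21, score: 0.85]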

Pros: Catches names, addresses, and contextual PII. Higher recall for free-form text.
Cons: Slower (10-50ms per scan), requires a model download, probabilistic (results depend on context and model version), can miss structured formats.

3. Hybrid Detection (What Tork Uses)

Tork combines both approaches. Regex patterns run first for structured PII (SSNs, credit cards, phone numbers), then a lightweight NER pass catches contextual PII (names, addresses). Regex contributes precision on structured formats, and NER contributes recall on free-form text:

const result = await tork.pii.scan(text, {
  mode: 'hybrid', // regex + NER
  confidence: 0.8, // minimum confidence threshold
  regions: ['US', 'AU', 'EU', 'UK'],
});
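Tork's internals aren't shown here, but the general regex-then-NER idea can be sketched in a few lines of Python. Everything below is an assumption for illustration: the `Span` type, the two patterns, and the rule that a regex hit wins over any overlapping NER span. The NER step is passed in as a function so the sketch stays self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int
    score: float


# Structured patterns; regex hits are treated as score 1.0 in this sketch.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def regex_pass(text: str) -> list[Span]:
    """Run every structured pattern and return its matches as spans."""
    return [
        Span(kind, m.start(), m.end(), 1.0)
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def hybrid_scan(
    text: str,
    ner_pass: Callable[[str], list[Span]],
    confidence: float = 0.8,
) -> list[Span]:
    """Regex first, then keep NER spans above the threshold that don't overlap a regex hit."""
    spans = regex_pass(text)
    for cand in ner_pass(text):
        if cand.score < confidence:
            continue
        overlaps = any(cand.start < s.end and s.start < cand.end for s in spans)
        if not overlaps:
            spans.append(cand)
    return sorted(spans, key=lambda s: s.start)
```

Letting the deterministic pass claim a span first means a lower-confidence NER label can't override a structured match such as an SSN.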

The 50+ PII Types Tork Detects

Tork's detection engine covers PII across 13 countries. Here's the breakdown by category:

Identity: SSN, Passport, Driver's License, National ID, Aadhaar, MyNumber
Financial: Credit Card, Bank Account, IBAN, SWIFT/BIC, Tax File Number, EIN
Healthcare: Medicare (AU), NHS (UK), Health Insurance ID, Medical Record #
Contact: Email, Phone (15 formats), Address, Postcode, IP Address
Personal: Full Name, Date of Birth, Age, Gender, Ethnicity, Religion
Digital: API Key, JWT Token, AWS Key, Password Hash, SSH Key, OAuth Token

Redaction Strategies

Detecting PII is half the battle. The other half is what you do with it. Tork supports four redaction modes:

// 1. Placeholder redaction (default)
"My SSN is [SSN]"

// 2. Masked redaction
"My SSN is ***-**-6789"

// 3. Synthetic replacement
"My SSN is 555-00-1234" // Valid format, fake data

// 4. Hash-based pseudonymisation
"My SSN is PII_abc123def456" // Keyed token; re-linkable only with the key or a token map

Synthetic replacement is particularly useful for testing and development — your agents behave identically because the data format is preserved, but no real PII exists in your dev environment.
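To make the four modes concrete, here is a rough Python sketch applied to SSNs only. It is not Tork's implementation: the function names are hypothetical, the synthetic value is a fixed constant, and the pseudonym uses HMAC-SHA256 as one plausible keyed hash.

```python
import hashlib
import hmac
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def redact_placeholder(text: str) -> str:
    """Replace each SSN with a type label."""
    return SSN_RE.sub("[SSN]", text)


def redact_masked(text: str) -> str:
    """Keep the last four digits, mask the rest."""
    return SSN_RE.sub(lambda m: f"***-**-{m.group(3)}", text)


def redact_synthetic(text: str) -> str:
    """Swap in a fixed fake value; group "00" is never issued, so it cannot be a real SSN."""
    return SSN_RE.sub("555-00-1234", text)


def redact_pseudonym(text: str, key: bytes) -> str:
    """Replace each SSN with a keyed-hash token: same input and key give the same token."""
    def token(m: re.Match) -> str:
        digest = hmac.new(key, m.group(0).encode(), hashlib.sha256).hexdigest()
        return f"PII_{digest[:12]}"
    return SSN_RE.sub(token, text)
```

A keyed hash is one-way, so getting the original value back requires either a stored token-to-value map or recomputing the token for a candidate value with the same key.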

Performance Considerations

PII detection adds latency to your agent pipeline. Here's what to expect with Tork's SDK:

Mode             | Latency (1KB) | Latency (10KB) | Coverage
Regex only       | < 1ms         | 2-5ms          | Structured PII
NER only         | 8-15ms        | 25-50ms        | Contextual PII
Hybrid           | 8-16ms        | 27-55ms        | All PII types
Regex (Rust SDK) | < 0.1ms       | < 1ms          | Structured PII

For comparison, a typical LLM API call takes 500-3000ms. PII detection adds negligible overhead relative to model inference time.
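Numbers like these depend on hardware and input, so it's worth measuring on your own payloads. Below is a small Python timing harness, offered as a sketch: `median_latency_ms` is a hypothetical helper, and the single SSN pattern stands in for whatever scanner you actually run.

```python
import re
import statistics
import time
from typing import Callable

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def median_latency_ms(scan: Callable[[str], object], text: str, runs: int = 200) -> float:
    """Call `scan(text)` repeatedly and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        scan(text)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    payload = ("Contact me, my SSN is 123-45-6789. " * 32)[:1024]  # roughly 1KB
    print(f"regex scan: {median_latency_ms(SSN_RE.findall, payload):.4f} ms")
```

The median is less sensitive than the mean to one-off spikes from garbage collection or a cold cache.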

Implementation Architecture

The recommended architecture scans at both the input and output boundaries of your LLM pipeline:

User Input
   ↓
[Tork PII Scan — Input] → Redact before sending to LLM
   ↓
LLM / Agent Processing
   ↓
[Tork PII Scan — Output] → Redact before returning to user
   ↓
Clean Response
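The flow above can be sketched as a thin wrapper around the model call. This Python version is an illustration only: `redact` is a stand-in that handles SSN-shaped strings, and `llm` is any function that takes a prompt and returns text.

```python
import re
from typing import Callable

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Stand-in redactor: replaces SSN-shaped strings with a placeholder."""
    return SSN_RE.sub("[SSN]", text)


def guarded_chat(user_input: str, llm: Callable[[str], str]) -> str:
    """Scan at both boundaries so raw PII never reaches the model or the user."""
    clean_input = redact(user_input)  # input boundary
    raw_output = llm(clean_input)     # the model only sees redacted text
    return redact(raw_output)         # output boundary
```

Anything that logs or stores the conversation should read `clean_input` and the returned value, not the raw strings.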

For RAG pipelines, add a third scan point at document retrieval:

// Scan retrieved documents before injecting into context
const docs = await vectorStore.similaritySearch(query);

const cleanDocs = await Promise.all(
  docs.map(doc => tork.pii.redact(doc.pageContent))
);

// Pass cleaned documents to LLM
const response = await llm.chat(cleanDocs);

Common Pitfalls

We've seen teams make these mistakes when implementing PII detection:

Only scanning outputs — If PII enters your system in a user prompt, it's already in your conversation logs and potentially your vector database. Scan inputs too.

Ignoring international formats — A US-only regex won't catch Australian Medicare numbers or UK NHS numbers. If you have international users, detect international PII.

Scanning in the wrong layer — Don't scan in the frontend (JavaScript). Users can bypass client-side checks. Always scan server-side before logging or storage.

Not handling false positives — The number 123-45-6789 looks like an SSN but might be a product code. Use confidence thresholds and contextual validation.
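As an illustration of the last point, here is a minimal Python sketch of contextual validation for SSN-shaped strings. The score values (0.95 and 0.4), the keyword list, and the 40-character window are arbitrary assumptions chosen for the example, not tuned numbers.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
CONTEXT_RE = re.compile(r"\b(ssn|social security)\b", re.IGNORECASE)


def ssn_confidence(text: str, match: re.Match, window: int = 40) -> float:
    """Score an SSN-shaped match using structural rules and nearby keywords."""
    area, group, serial = match.groups()
    # Never-issued ranges: area 000, 666, 900-999; group 00; serial 0000.
    if area in ("000", "666") or area[0] == "9" or group == "00" or serial == "0000":
        return 0.0
    nearby = text[max(0, match.start() - window): match.end() + window]
    return 0.95 if CONTEXT_RE.search(nearby) else 0.4


def find_ssns(text: str, threshold: float = 0.8) -> list[str]:
    """Return SSN-shaped matches whose confidence meets the threshold."""
    return [
        m.group(0)
        for m in SSN_RE.finditer(text)
        if ssn_confidence(text, m) >= threshold
    ]
```

With these settings, "My SSN is 123-45-6789" is flagged while "Order ref 123-45-6789 shipped" is not. Lowering the threshold trades false negatives for false positives.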

Compliance Requirements

Different regulations have different requirements for PII handling:

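GDPR (EU) — Requires data minimisation, right to erasure, and data processing records. Tork's redaction and receipts help with all three: redaction reduces the personal data you store, which also leaves less to erase, and receipts document what was processed.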

HIPAA (US) — PHI must be de-identified using Safe Harbor or Expert Determination methods. Tork's 18 HIPAA identifier types cover Safe Harbor requirements.

Privacy Act (AU) — Australian Privacy Principles require reasonable steps to protect personal information. Tork detects TFN, Medicare, and other AU-specific PII.

SOC 2 — Requires evidence of data protection controls. Tork's cryptographic receipts provide tamper-evident audit trails.
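Tork's receipt format isn't described here. As a general illustration of what "tamper-evident" means, one common technique is a hash chain, where each record includes the hash of the previous one. The Python sketch below uses hypothetical function names and a made-up record layout.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, event: dict) -> str:
    body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def append_receipt(chain: list[dict], event: dict) -> dict:
    """Append a record whose hash covers both the event and the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    receipt = {"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    chain.append(receipt)
    return receipt


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash in order; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for receipt in chain:
        if receipt["prev"] != prev_hash or receipt["hash"] != _digest(prev_hash, receipt["event"]):
            return False
        prev_hash = receipt["hash"]
    return True
```

A plain hash chain detects edits but doesn't stop someone from rebuilding the whole chain. Production systems typically also sign records or anchor the latest hash somewhere the writer can't modify.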

Try It Yourself

Test PII detection in your browser with our interactive demo — paste any text and see detections in real time. Or get started with the SDK:

npm install tork-governance
pip install tork-governance
go get github.com/torkjacobs/tork-go-sdk
cargo add tork-governance

Free tier includes 10,000 scans/month. Sign up here — no credit card required.

Tork Network Pty Ltd — Sydney, Australia