PII Detection for LLMs: A Technical Guide
February 20, 2026 · 9 min read · Yusuf Jacobs
Every LLM-powered agent is one prompt away from leaking PII. User messages contain names, emails, phone numbers, and social security numbers. Model responses can hallucinate real PII from training data. Here's how to build a detection pipeline that catches it all.
The PII Problem in LLM Pipelines
PII leakage in LLM applications happens at three points:
Input — Users submit personal information in their prompts. A support chat might include “My SSN is 123-45-6789, can you check my account?”
Context — RAG systems retrieve documents containing PII from vector databases. An employee handbook might include salary data.
Output — The model includes PII in its response, either echoed from input, retrieved from context, or hallucinated from training data.
A comprehensive PII detection system must scan all three points. Scanning only outputs misses PII that gets stored in conversation logs. Scanning only inputs misses model hallucination.
Approaches to PII Detection
There are three main approaches, each with trade-offs:
1. Regex-Based Detection
Pattern matching using regular expressions. Fast, deterministic, and zero-dependency.
// US Social Security Number (SSN)
/\b\d{3}-\d{2}-\d{4}\b/
// Australian Medicare Number
/\b\d{4}[\s-]?\d{5}[\s-]?\d{1}\b/
// UK NHS Number
/\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b/
// Credit card candidates (major-network prefixes; apply a Luhn check separately)
/\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/
Pros: Sub-millisecond latency, no external dependencies, works offline, deterministic.
Cons: Can't detect contextual PII (names, addresses), struggles with varied formats.
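A regex alone can't validate a credit card number; it only surfaces candidates, and a separate Luhn checksum confirms them. A minimal Python sketch of that pairing (illustrative only, not Tork's implementation):

```python
import re

# Candidate pattern for 16-digit card numbers with major-network prefixes,
# mirroring the pattern above. The regex finds *candidates* only.
CARD_RE = re.compile(
    r"\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))"
    r"[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"
)

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of a candidate match."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    # Double every second digit from the right, subtracting 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return only the regex candidates that pass the checksum."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

The checksum pass eliminates most random 16-digit false positives, since only one in ten digit strings satisfies Luhn.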
2. NER-Based Detection
Named Entity Recognition models (spaCy, Presidio, GLiNER) identify PII through machine learning:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=user_input, language='en')
# Returns: [type=PERSON, start=11, end=21, score=0.85]
Pros: Catches names, addresses, and contextual PII. Higher recall for free-form text.
Cons: Slower (10-50ms per scan), requires model download, non-deterministic, can miss structured formats.
3. Hybrid Detection (What Tork Uses)
Tork combines both approaches. Regex patterns run first for structured PII (SSNs, credit cards, phone numbers), then a lightweight NER pass catches contextual PII (names, addresses). The result is high precision and high recall:
// Illustrative configuration (exact constructor may differ):
const tork = new Tork({
  mode: 'hybrid',  // regex + NER
  confidence: 0.8, // minimum confidence threshold
  regions: ['US', 'AU', 'EU', 'UK'],
});
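The hybrid pass can be sketched as: run the structured patterns first, then admit NER hits that clear the confidence threshold and don't overlap a span the regex pass already claimed. A simplified Python illustration (the `Detection` type and merge policy here are assumptions, not Tork's internals):

```python
import re
from dataclasses import dataclass

@dataclass
class Detection:
    type: str
    start: int
    end: int
    score: float

# Structured patterns run first: cheap, deterministic, high precision.
STRUCTURED = {"US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")}

def regex_pass(text: str) -> list[Detection]:
    return [Detection(t, m.start(), m.end(), 1.0)
            for t, p in STRUCTURED.items() for m in p.finditer(text)]

def merge(regex_hits, ner_hits, confidence=0.8):
    """Keep all regex hits, plus NER hits above the threshold that do not
    overlap a span the regex pass already claimed."""
    keep = list(regex_hits)
    for h in ner_hits:
        if h.score < confidence:
            continue
        if any(h.start < r.end and r.start < h.end for r in regex_hits):
            continue
        keep.append(h)
    return sorted(keep, key=lambda d: d.start)
```

Deduplicating by span matters: without it, an NER model that also tags the SSN as a generic number would double-report the same characters.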
The 50+ PII Types Tork Detects
Tork's detection engine covers PII across 13 countries, from structured identifiers (SSNs, TFNs, Medicare and NHS numbers, credit cards) to contextual PII such as names and addresses.
Redaction Strategies
Detecting PII is half the battle. The other half is what you do with it. Tork supports four redaction modes:
// 1. Full redaction
"My SSN is [SSN]"
// 2. Masked redaction
"My SSN is ***-**-6789"
// 3. Synthetic replacement
"My SSN is 555-00-1234" // Valid format, fake data
// 4. Hash-based pseudonymisation
"My SSN is PII_abc123def456" // Reversible with key
Synthetic replacement is particularly useful for testing and development — your agents behave identically because the data format is preserved, but no real PII exists in your dev environment.
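The four modes above can be sketched in a few lines of Python. This is illustrative only; the masking, synthetic-generation, and vault details are assumptions, not Tork's implementation:

```python
import hmac
import hashlib
import random

def mask(ssn: str) -> str:
    # Masked redaction: keep only the last four digits visible.
    return "***-**-" + ssn[-4:]

def synthetic(rng: random.Random) -> str:
    # Synthetic replacement: same 3-2-4 shape, but area numbers in the 900s
    # are never issued as real SSNs, so the result cannot collide.
    return f"{rng.randint(900, 999)}-{rng.randint(0, 99):02d}-{rng.randint(0, 9999):04d}"

def pseudonymise(ssn: str, key: bytes, vault: dict) -> str:
    # Hash-based pseudonymisation: a deterministic token via HMAC, with a
    # vault mapping token -> original so the holder of key + vault can reverse.
    token = "PII_" + hmac.new(key, ssn.encode(), hashlib.sha256).hexdigest()[:12]
    vault[token] = ssn
    return token
```

Determinism is the useful property of the last mode: the same SSN always maps to the same token, so joins and analytics still work on pseudonymised data.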
Performance Considerations
PII detection adds latency to your agent pipeline. Here's what to expect with Tork's SDK:
| Mode | Latency (1KB) | Latency (10KB) | Coverage |
|---|---|---|---|
| Regex only | < 1ms | 2-5ms | Structured PII |
| NER only | 8-15ms | 25-50ms | Contextual PII |
| Hybrid | 8-16ms | 27-55ms | All PII types |
| Regex (Rust SDK) | < 0.1ms | < 1ms | Structured PII |
For comparison, a typical LLM API call takes 500-3000ms. PII detection adds negligible overhead relative to model inference time.
Implementation Architecture
The recommended architecture scans at both the input and output boundaries of your LLM pipeline:
User Input
↓
[Tork PII Scan — Input] → Redact before sending to LLM
↓
LLM / Agent Processing
↓
[Tork PII Scan — Output] → Redact before returning to user
↓
Clean Response
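A minimal sketch of those two boundary scans, with a placeholder `redact` and a generic model-call parameter standing in for the SDK and your LLM client (neither is Tork's actual API):

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    # Placeholder for the SDK's scan-and-redact call.
    return SSN.sub("[SSN]", text)

def guarded_chat(user_msg: str, call_llm) -> str:
    clean_input = redact(user_msg)      # scan before the prompt leaves your server
    raw_output = call_llm(clean_input)  # the model may still echo or hallucinate PII
    return redact(raw_output)           # scan again before returning to the user
```

Scanning both sides is the point: the input scan keeps PII out of logs and prompts, and the output scan catches anything the model produces on its own.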
For RAG pipelines, add a third scan point at document retrieval:
const docs = await vectorStore.similaritySearch(query);
const cleanDocs = await Promise.all(
docs.map(doc => tork.pii.redact(doc.pageContent))
);
// Pass cleaned documents to LLM
const response = await llm.chat(cleanDocs);
Common Pitfalls
We've seen teams make these mistakes when implementing PII detection:
Only scanning outputs — If PII enters your system in a user prompt, it's already in your conversation logs and potentially your vector database. Scan inputs too.
Ignoring international formats — A US-only regex won't catch Australian Medicare numbers or UK NHS numbers. If you have international users, detect international PII.
Scanning in the wrong layer — Don't scan in the frontend (JavaScript). Users can bypass client-side checks. Always scan server-side before logging or storage.
Not handling false positives — The number 123-45-6789 looks like an SSN but might be a product code. Use confidence thresholds and contextual validation.
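Contextual validation can be as simple as scoring a match higher when supporting keywords appear nearby. An illustrative sketch (the window size, keywords, and scores are arbitrary assumptions):

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT = ("ssn", "social security", "tax")

def score_ssn_matches(text: str, window: int = 30) -> list[tuple[str, float]]:
    """Score each SSN-shaped match by whether supporting words appear nearby."""
    hits = []
    for m in SSN.finditer(text):
        nearby = text[max(0, m.start() - window):m.end() + window].lower()
        score = 0.9 if any(k in nearby for k in CONTEXT) else 0.4
        hits.append((m.group(), score))
    return hits
```

With a 0.8 confidence threshold, "My SSN is 123-45-6789" gets redacted while "Order part number 123-45-6789" passes through.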
Compliance Requirements
Different regulations have different requirements for PII handling:
GDPR (EU) — Requires data minimisation, right to erasure, and data processing records. Tork's redaction + receipts satisfy all three.
HIPAA (US) — PHI must be de-identified using Safe Harbor or Expert Determination methods. Tork's 18 HIPAA identifier types cover Safe Harbor requirements.
Privacy Act (AU) — Australian Privacy Principles require reasonable steps to protect personal information. Tork detects TFN, Medicare, and other AU-specific PII.
SOC 2 — Requires evidence of data protection controls. Tork's cryptographic receipts provide tamper-evident audit trails.
Try It Yourself
Test PII detection in your browser with our interactive demo — paste any text and see detections in real time. Or get started with the SDK:
pip install tork-governance
go get github.com/torkjacobs/tork-go-sdk
cargo add tork-governance
Free tier includes 10,000 scans/month. Sign up here — no credit card required.
Tork Network Pty Ltd — Sydney, Australia