Core Features
PII Detection v1.1
Detect and redact personally identifiable information across 50+ entity types, 13 countries, and 3 industries with a 5-layer detection pipeline averaging 6ms latency.
Overview
Tork's PII Detection engine scans text through a 5-layer pipeline before it reaches AI models or is stored in logs. v1.1 introduces three major capabilities:
- Entity Slot Detection Engine: detects names by sentence structure (labels, titles, self-identification), not dictionary lookup. Works with African, Arabic, Indian, Chinese, and compound names that traditional approaches miss.
- Composable Regional PII Profiles: activate only the country and industry patterns you need via the optional region and industry parameters. Zero overhead when not used.
- Supersession logic: wider regional matches correctly override shorter L0 matches (e.g., Emirates ID supersedes a partial phone match).
Backward compatible: Existing API calls work unchanged. The region and industry parameters are optional — when omitted, only the 29 universal L0 patterns run, exactly as before.
- 50+ Entity Types: Emails, phones, SSNs, credit cards, IBAN, IPs, DOB, URLs, addresses, passwords, names, and more.
- 13 Countries + 3 Industries: AU, US, GB, EU, AE, SA, NG, IN, JP, CN, KR, BR, plus healthcare, finance, and legal.
- ~6ms Cached: Sub-10ms with a Redis cache hit, ~700ms cold start. Regional patterns add <3ms.
Detection Pipeline
PII detection runs through 5 layers in sequence: L0 (regex), L0.5 (regional), L1 (slot detection), L1.5 (context), and L3 (dictionary fallback). Each layer only fires on text regions not already matched by an earlier layer, which prevents double-counting. All positions are computed against the original text, and a single-pass redaction is applied at the end.
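The "skip regions already matched" rule can be sketched in a few lines. This is a minimal illustration, not Tork's implementation: the two layers, their regexes, and the names `Span`, `LAYERS`, and `run_layers` are all assumptions made for the example.

```python
import re
from typing import NamedTuple

class Span(NamedTuple):
    start: int   # offsets are always relative to the original text
    end: int
    label: str

# Hypothetical two-layer setup: an early regex layer and a later name layer.
LAYERS = [
    [("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"))],
    [("NAME", re.compile(r"\b(?:john|sarah)\b", re.IGNORECASE))],
]

def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end

def run_layers(text: str, layers=LAYERS) -> list[Span]:
    accepted: list[Span] = []
    for layer in layers:                      # layers run in order
        for label, pattern in layer:
            for m in pattern.finditer(text):
                cand = Span(m.start(), m.end(), label)
                # A later layer may not claim text an earlier layer already matched.
                if not any(overlaps(cand, s) for s in accepted):
                    accepted.append(cand)
    return sorted(accepted)

spans = run_layers("Contact John at john@example.com")
# "john" inside the email address is not reported a second time as a NAME.
```

Because every span keeps offsets into the original text, the final redaction can replace all accepted spans in one pass.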
Regex Patterns
29 universal patterns that are always active: email, phone (US + international), SSN, credit card (Visa/MC/Amex), IBAN, SWIFT/BIC, IP address (v4 + v6), date of birth, URL, crypto addresses (BTC + ETH), UK NINO, US EIN, French SSN, DEA number, medical record numbers, and more. Quick-check signal gating skips entire pattern families when their signal is absent: with no '@' in the text the email regex is skipped, and with no digits the SSN, phone, and credit card patterns are skipped.
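Signal gating amounts to a cheap membership test in front of each regex family. The sketch below assumes two simplified patterns and a hypothetical `scan_gated` helper; the real pattern set and gating signals may differ.

```python
import re

# Simplified stand-ins for two L0 pattern families.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Each entry: (cheap signal check, label, regex to run if the signal is present).
GATED = [
    (lambda t: "@" in t, "EMAIL", EMAIL),
    (lambda t: any(c.isdigit() for c in t), "SSN", SSN),
]

def scan_gated(text: str):
    hits, skipped = [], []
    for has_signal, label, pattern in GATED:
        if not has_signal(text):
            skipped.append(label)      # regex never runs for this family
            continue
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits, skipped
```

For text with no digits and no '@', both families are skipped and no regex is executed at all.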
Regional Patterns
31 country-specific patterns activated by the optional region parameter. Emirates ID (784-XXXX-XXXXXXX-X), Aadhaar (12 digits with context keywords), PAN, Chinese Resident ID (18 digits with date validation), Korean RRN (with gender-digit validation), Nigerian NIN/BVN, Brazilian CPF/CNPJ, and more. Only fires when region[] is specified — zero overhead otherwise. Supersession logic ensures wider regional matches correctly replace shorter L0 matches that overlap.
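As a rough sketch of supersession, assume a naive short phone pattern at L0 and an Emirates ID pattern at the regional layer. The regexes and the `supersede` helper are illustrative assumptions, not the engine's actual rules.

```python
import re

# Illustrative patterns only.
PHONE_L0 = re.compile(r"\b\d{3}-\d{4}\b")             # naive short phone pattern
EMIRATES_ID = re.compile(r"\b784-\d{4}-\d{7}-\d\b")   # 784-XXXX-XXXXXXX-X

def find(pattern, label, text):
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

def supersede(l0_spans, regional_spans):
    """Drop L0 spans that overlap a strictly wider regional span, then merge."""
    def beaten(span):
        start, end, _label = span
        return any(
            r_start < end and start < r_end and (r_end - r_start) > (end - start)
            for r_start, r_end, _r_label in regional_spans
        )
    kept = [span for span in l0_spans if not beaten(span)]
    return sorted(kept + regional_spans)

text = "ID: 784-1990-1234567-1"
l0 = find(PHONE_L0, "PHONE", text)               # partial match on "784-1990"
regional = find(EMIRATES_ID, "EMIRATES_ID", text)
result = supersede(l0, regional)                 # only the Emirates ID span remains
```

L0 spans that do not overlap any regional match are kept unchanged.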
Slot Detection
50+ structural patterns that detect WHERE PII appears by analyzing sentence structure, not WHAT the token is. This is the key to cross-cultural name detection. Categories: identity labels ("name:", "patient:", "client:"), self-identification ("my name is", "I am"), title prefixes ("Dr.", "Sheikh", "Alhaji", "Mrs."), possessive identifiers ("my address is", "my email is"), lives-at patterns ("lives at", "resides in"), structured data (JSON fields, XML tags), conjunction chains ("John and Sarah"), and referral patterns ("referred by", "assigned to").
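A small sketch of the slot idea: match the structure around the name, then capture whatever name-like tokens fill the slot. Only three of the categories are shown, the token grammar is ASCII-only for brevity, and `SLOTS` and `find_slots` are hypothetical names.

```python
import re

# A name-like token: capitalized word, optionally preceded by a lowercase particle.
CAP = r"[A-Z][A-Za-z'-]*"
PARTICLE = r"(?:al-|bin |ibn )"
NAME = rf"{PARTICLE}?{CAP}(?: {PARTICLE}?{CAP})*"

# (?i:...) makes only the trigger phrase case-insensitive, not the captured name.
SLOTS = [
    ("IDENTITY_LABEL", re.compile(rf"\b(?i:name|patient|client):\s*({NAME})")),
    ("SELF_IDENTIFICATION", re.compile(rf"\b(?i:my name is)\s+({NAME})")),
    ("TITLE_PREFIX", re.compile(rf"\b(?:Dr\.|Mrs\.|Sheikh|Alhaji)\s+({NAME})")),
]

def find_slots(text: str):
    """Return (category, captured name) for every slot pattern that fires."""
    return [
        (category, m.group(1))
        for category, pattern in SLOTS
        for m in pattern.finditer(text)
    ]
```

No name list is consulted anywhere: "Chukwuemeka Okonkwo" is captured from "Patient: Chukwuemeka Okonkwo was admitted." purely because it follows the "Patient:" label and looks like a run of name tokens.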
Context Patterns
Password, secret, API key, access token, and address patterns triggered by keyword proximity matching. Catches freeform credential leaks ("my password is X", "api_key: sk-...") and address blocks with street/city/state/zip structure. Only fires when keywords like "password", "secret", "address", or "account" appear nearby.
Dictionary Fallback
Fallback name detection using dictionaries of 2,000+ first names and 5,000+ last names covering Western, African, Arabic, Indian, East Asian, and Latin American names. Catches standalone names that have no slot context (e.g., "John" appearing alone in text). An exclusion list of 200+ common words prevents false positives ("Will", "Grace", "Mark", etc.).
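The fallback can be pictured as a set lookup with an exclusion filter. The word lists below are tiny stand-ins for the real dictionaries, and `dictionary_names` is a hypothetical helper.

```python
import re

# Tiny stand-ins for the name dictionary and the common-word exclusion list.
FIRST_NAMES = {"john", "sarah", "ngozi", "will", "grace", "mark"}
EXCLUDED = {"will", "grace", "mark"}

TOKEN = re.compile(r"[A-Za-z]+")

def dictionary_names(text: str):
    """Return (start, end, token) for tokens found in the dictionary and not excluded."""
    hits = []
    for m in TOKEN.finditer(text):
        word = m.group().lower()
        if word in FIRST_NAMES and word not in EXCLUDED:
            hits.append((m.start(), m.end(), m.group()))
    return hits
```

With this trade-off, a person actually named Will or Mark is not caught by the dictionary layer and has to be picked up by a structural slot instead.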
Regional PII Profiles
Activate country-specific and industry-specific PII patterns by passing the optional region and industry parameters. When omitted, only the 29 universal L0 patterns run — fully backward compatible with zero overhead.
Profiles are composable: pass multiple regions and an industry to build exactly the detection surface you need. For example, region: ["ae", "in"], industry: "healthcare" activates UAE, India, and healthcare patterns together.
Available Regions
| Code | Country | PII Types Detected |
|---|---|---|
| AU | Australia | TFN, ABN, Medicare number, AU phone (+61), AU driver license, ACN |
| US | United States | SSN, ITIN, US passport, ZIP+4, EIN |
| GB | United Kingdom | NINO, NHS number, UK phone (+44/0), sort code, UK passport |
| EU | European Union | German Tax ID (Steuer-IdNr), IBAN, French SSN, EU VAT |
| AE | UAE | Emirates ID (784-XXXX-XXXXXXX-X), UAE phone (+971), PO Box |
| SA | Saudi Arabia | National ID / Iqama (context-gated), SA phone (+966) |
| NG | Nigeria | NIN (context-gated), BVN (context-gated), Nigerian phone (+234) |
| IN | India | Aadhaar (context-gated), PAN, IFSC, Indian phone (+91), pincode |
| JP | Japan | My Number (context-gated), Japanese phone (+81) |
| CN | China | Resident ID (18-digit with date validation), Chinese phone (+86) |
| KR | South Korea | RRN (gender-digit validated), Korean phone (+82) |
| BR | Brazil | CPF, CNPJ |
Available Industries
| Code | Industry | PII Types Detected |
|---|---|---|
| healthcare | Healthcare (HIPAA) | MRN, ICD-10 diagnosis codes, CPT procedure codes, NDC drug codes, health plan IDs (BCBS, UHC, Aetna, Cigna, Kaiser) |
| finance | Finance (PCI-DSS) | CUSIP, ISIN, bank account numbers (context-gated) |
| legal | Legal | Court case / docket numbers, attorney bar numbers, court file numbers |
Composability
Combine multiple regions and an industry in a single request. Each profile adds its patterns to the detection surface without duplicating universal L0 patterns.
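One way to picture composition is as a set union over pattern registries. The registry contents and pattern names below are illustrative assumptions loosely based on the tables above; `active_patterns` is a hypothetical helper, not part of the API.

```python
# Stand-in for the 29 universal L0 patterns.
L0_PATTERNS = frozenset({"EMAIL", "PHONE", "IBAN"})

# Hypothetical registries keyed by region and industry code.
REGION_PATTERNS = {
    "ae": {"EMIRATES_ID", "UAE_PHONE", "PO_BOX"},
    "in": {"AADHAAR", "PAN", "IFSC", "IN_PHONE", "PINCODE"},
    "eu": {"DE_TAX_ID", "IBAN", "FR_SSN", "EU_VAT"},
}
INDUSTRY_PATTERNS = {
    "healthcare": {"MRN", "ICD10", "CPT", "NDC", "HEALTH_PLAN"},
}

def active_patterns(region=(), industry=None) -> set[str]:
    """Union of L0 plus every requested profile; shared patterns appear once."""
    active = set(L0_PATTERNS)
    for code in region:
        active |= REGION_PATTERNS[code.lower()]
    if industry is not None:
        active |= INDUSTRY_PATTERNS[industry]
    return active
```

Calling `active_patterns()` with no arguments returns only the universal set, and a pattern such as IBAN that appears in both L0 and the EU profile is still counted once.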
Context-gated patterns: Generic number formats (11-digit NIN, 12-digit Aadhaar, 10-digit SA ID) require nearby keywords like “NIN”, “Aadhaar”, or “National ID” within 60 characters to fire, preventing false positives on random digit sequences.
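A minimal sketch of context gating for a 12-digit Aadhaar-style number, assuming a symmetric 60-character window on either side of the match. The regex, keyword list, and `context_gated` helper are assumptions for illustration.

```python
import re

AADHAAR = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")   # 12 digits, optional spaces
KEYWORDS = re.compile(r"aadhaar", re.IGNORECASE)
WINDOW = 60  # characters of context checked on each side of the match

def context_gated(text: str) -> list[str]:
    hits = []
    for m in AADHAAR.finditer(text):
        lo = max(0, m.start() - WINDOW)
        hi = min(len(text), m.end() + WINDOW)
        # Only report the number if a keyword appears inside the window.
        if KEYWORDS.search(text, lo, hi):
            hits.append(m.group())
    return hits
```

The same digit sequence in "Order ref 123456789012 shipped" is ignored because no keyword is nearby.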
Cross-Cultural Name Detection
Traditional dictionary-based name detection fails for non-Western names. No dictionary can contain every Nigerian, Arabic, Indian, or Chinese name — and names from these cultures often have compound forms, honorifics, and structures that don't match Western first/last patterns.
Tork solves this with the Entity Slot Detection Engine (L1), which detects names by analyzing sentence structure rather than relying on dictionary lookup. It identifies the “slot” where a name appears and extracts whatever occupies that slot.
Before & After
Dictionary only (L3)
"Patient: Chukwuemeka Okonkwo"
MISSED — name not in dictionary
Slot Detection (L1)
"Patient: [NAME_REDACTED]"
CAUGHT — via “Patient:” label pattern
This approach works with names from any culture without requiring the name to exist in a dictionary:
- African names: Chukwuemeka Okonkwo, Oluwaseun Adeyemi, Ngozi Okafor
- Arabic names: Abd al-Rahman al-Farsi, Sheikh Mohammed bin Rashid
- Indian names: Venkatanarasimharajuvaripeta, Subramaniam Chandrasekhar
- Chinese names: Zhang Wei, Li Xiaoming
- Compound names: Mary-Jane Watson, Jean-Pierre Dupont
Slot Pattern Categories
| Category | Pattern | Example |
|---|---|---|
| IDENTITY_LABEL | Label followed by name | "Patient: Abd al-Rahman", "Client: Kim Soo-jin" |
| SELF_IDENTIFICATION | "my name is" / "I am" | "my name is Chukwuemeka Okonkwo" |
| TITLE_PREFIX | Honorific before name | "Dr. Venkatanarasimharajuvaripeta", "Sheikh Mohammed" |
| POSSESSIVE_IDENTIFIER | "my X is" context | "my address is 50 Elizabeth St", "my email is" |
| LIVES_AT | Residence indicators | "lives at 123 Main St", "resides in Sydney" |
| REFERRAL | Assignment/referral | "assigned to Oluwaseun Adeyemi", "referred by" |
| STRUCTURED_DATA | JSON/XML fields | {"full_name": "Zhang Wei"}, <name>Li Ming</name> |
| CONJUNCTION_CHAIN | Names joined by "and" | "John and Sarah attended", "Dr. Smith and Prof. Lee" |
Slot + Dictionary: The L3 dictionary layer acts as a fallback, catching standalone common names (e.g., “John” appearing alone without structural context) that slots can't detect. Together, the two layers provide comprehensive name coverage across all cultures.
Code Examples
Integrate PII detection into your application with a single API call.
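The sketch below only builds the HTTP request so the shape of a call is visible. The endpoint URL, the `content` field name, and the bearer-token header are placeholders and assumptions; only the optional `region` and `industry` parameters come from this page. Check the API reference for the real endpoint and schema.

```python
import json
import urllib.request

# Placeholder URL, not the real endpoint.
API_URL = "https://api.example.com/v1/pii/scan"

def build_request(text, api_key, region=None, industry=None):
    payload = {"content": text}          # "content" is an assumed field name
    if region:
        payload["region"] = region       # optional, e.g. ["ae", "in"]
    if industry:
        payload["industry"] = industry   # optional, e.g. "healthcare"
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

req = build_request("Patient: Ngozi Okafor", "YOUR_API_KEY",
                    region=["ae", "in"], industry="healthcare")
# Send with urllib.request.urlopen(req) once the real URL is filled in.
```

Omitting `region` and `industry` leaves them out of the payload entirely, which matches the backward-compatible behavior described above.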
Performance
PII detection is optimized for production workloads with multiple performance layers.
- ~6ms cache hit latency: average with Redis cached validation.
- ~700ms cache miss latency: first request (cold key validation).
- 717 req/s throughput: sustained at a 20ms latency target.
| Component | Latency | Notes |
|---|---|---|
| L0 Regex | <1ms | Quick-check gating skips irrelevant patterns |
| L0.5 Regional | <3ms | Only fires when region[] is specified |
| L1 Slot Detection | ~2ms | Structural analysis of sentence context |
| L1.5 Context | <1ms | Keyword proximity matching |
| L3 Dictionary | <1ms | Hash-set lookup against name lists |
| Redis cache | TTL 60s | API key validation cached for 60 seconds |
Scanning for PII
Scan text to detect PII entities without redacting:
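As a local stand-in for a scan call, the sketch below returns structured detections and leaves the input untouched. The `Detection` fields and the two patterns are assumptions; the actual response schema may differ.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    type: str    # entity type, e.g. "EMAIL"
    start: int   # offset into the original text
    end: int
    value: str

# Two simplified patterns standing in for the full set.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scan(text: str) -> list[Detection]:
    found = [
        Detection(entity_type, m.start(), m.end(), m.group())
        for entity_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda d: d.start)

detections = scan("alice@example.com logged in from 10.0.0.7")
```

Each detection carries offsets into the original text, so a caller can log, count, or redact later without rescanning.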
Redacting PII
Automatically redact detected PII with customizable replacement patterns:
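Single-pass redaction with a configurable replacement can be sketched as follows. The default template mirrors the `[NAME_REDACTED]` style shown earlier on this page; the `redact` function and its `template` parameter are hypothetical.

```python
def redact(text, detections, template="[{type}_REDACTED]"):
    """Replace each (start, end, type) span in one left-to-right pass."""
    out, cursor = [], 0
    for start, end, entity_type in sorted(detections):
        if start < cursor:
            continue                      # skip spans overlapping an earlier one
        out.append(text[cursor:start])    # untouched text before the span
        out.append(template.format(type=entity_type))
        cursor = end
    out.append(text[cursor:])             # tail after the last span
    return "".join(out)

text = "Patient: Chukwuemeka Okonkwo"
start = text.index("Chukwuemeka")
redacted = redact(text, [(start, len(text), "NAME")])
masked = redact(text, [(start, len(text), "NAME")], template="<{type}>")
```

Because offsets refer to the original string and the output is assembled from slices, earlier replacements never shift later spans.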
Supported Entity Types
| Category | Entity Types |
|---|---|
| Personal | NAME, EMAIL, PHONE, ADDRESS, DATE_OF_BIRTH |
| Financial | CREDIT_CARD, BANK_ACCOUNT, SSN, TAX_ID, IBAN, SWIFT_BIC, CRYPTO, CUSIP, ISIN |
| Medical | MEDICAL_RECORD, HEALTH_PLAN, ICD10, CPT, NDC, NPI, DEA |
| Identity | PASSPORT, DRIVER_LICENSE, NATIONAL_ID, NINO, EIN, AADHAAR, PAN, EMIRATES_ID, NIN, CPF |
| Network | IP_ADDRESS, IPV6_ADDRESS, URL |
| Credentials | PASSWORD, SECRET, API_KEY, ACCESS_TOKEN |
Country-Specific Patterns
Enable country-specific PII patterns for localized detection via the region parameter:
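To illustrate what a region-gated pattern might do, here is a sketch of the Chinese Resident ID check with date validation, run only when "cn" is requested. The regex and helpers are assumptions; the check digit is deliberately not verified, and the sample number is synthetic.

```python
import re
from datetime import date

# 17 digits plus a final digit or X; digits 7-14 encode the birth date (YYYYMMDD).
CN_RESIDENT_ID = re.compile(r"\b\d{17}[\dXx]\b")

def has_valid_birthdate(id_number: str) -> bool:
    try:
        date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
        return True
    except ValueError:
        return False

def scan_regional(text: str, region=()) -> list[str]:
    # Regional patterns run only when their region code is requested.
    if "cn" not in {code.lower() for code in region}:
        return []
    return [
        m.group()
        for m in CN_RESIDENT_ID.finditer(text)
        if has_valid_birthdate(m.group())
    ]
```

An 18-digit sequence whose embedded date is impossible (for example month 13) is rejected, and nothing runs at all when `region` is omitted.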
Tip: Use the Admin Dashboard to view PII detection analytics and configure custom patterns.