PII Detection v1.1

Detect and redact personally identifiable information across 50+ entity types, 13 countries, and 3 industries with a 5-layer detection pipeline averaging 6ms latency.

Overview

Tork's PII Detection engine scans text through a 5-layer pipeline before it reaches AI models or is stored in logs. v1.1 introduces three major capabilities:

  • Entity Slot Detection Engine — detects names by sentence structure (labels, titles, self-identification), not dictionary lookup. Works with African, Arabic, Indian, Chinese, and compound names that traditional approaches miss.
  • Composable Regional PII Profiles — activate only the country and industry patterns you need via the optional region and industry parameters. Zero overhead when not used.
  • Supersession logic — wider regional matches correctly override shorter L0 matches (e.g., Emirates ID supersedes a partial phone match).

Backward compatible: Existing API calls work unchanged. The region and industry parameters are optional — when omitted, only the 29 universal L0 patterns run, exactly as before.

  • 50+ Entity Types: emails, phones, SSNs, credit cards, IBAN, IPs, DOB, URLs, addresses, passwords, names, and more.
  • 13 Countries + 3 Industries: AU, US, GB, EU, AE, SA, NG, IN, JP, CN, KR, BR, plus healthcare, finance, and legal.
  • ~6ms Cached: sub-10ms with a Redis cache hit, ~700ms cold start. Regional patterns add <3ms.

Detection Pipeline

PII detection runs through 5 layers in sequence. Each layer only fires on text regions not already matched by an earlier layer, preventing double-counting. All positions are computed against the original text, and a single-pass redaction is applied at the end.
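
The layering rule above can be sketched in a few lines. This is a minimal illustration, not Tork's implementation: the `Span` type, `run_layers`, `redact`, and the two toy layers are hypothetical names, and overlap is filtered after each layer runs rather than by masking regions beforehand.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int  # offset into the original text
    end: int    # exclusive end offset
    label: str


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def run_layers(text: str, layers: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Run layers in order; later layers cannot claim already-matched text."""
    accepted: List[Span] = []
    for layer in layers:
        for span in layer(text):
            if not any(overlaps(span, prior) for prior in accepted):
                accepted.append(span)
    return sorted(accepted)


def redact(text: str, spans: List[Span]) -> str:
    """Apply every replacement in one left-to-right pass over the original text."""
    pieces, cursor = [], 0
    for span in sorted(spans):
        pieces.append(text[cursor:span.start])
        pieces.append(f"[{span.label}]")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Two toy layers with hypothetical patterns, standing in for an early and a late layer.
def email_layer(text: str) -> List[Span]:
    pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
    return [Span(m.start(), m.end(), "EMAIL") for m in re.finditer(pattern, text)]


def name_layer(text: str) -> List[Span]:
    return [Span(m.start(), m.end(), "NAME") for m in re.finditer(r"\bjane\b", text, re.I)]


text = "Jane wrote to jane@corp.io"
spans = run_layers(text, [email_layer, name_layer])
print(redact(text, spans))  # [NAME] wrote to [EMAIL]
```

The second "jane" is not counted as a name because the earlier layer already claimed it as part of the email address. Because all offsets refer to the original text, the final pass never has to adjust positions after a replacement.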

Input → L0 Regex → L0.5 Regional → L1 Slots → L1.5 Context → L3 Dictionary → Output

L0: Regex Patterns

29 universal patterns that are always active: email, phone (US + international), SSN, credit card (Visa/MC/Amex), IBAN, SWIFT/BIC, IP address (v4 + v6), date of birth, URL, crypto addresses (BTC + ETH), UK NINO, US EIN, French SSN, DEA number, medical record numbers, and more. Quick-check signal gating skips entire pattern families when their signal is absent (no '@' in text = skip email regex, no digits = skip SSN/phone/credit card).
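
Quick-check gating is easy to picture as a cheap membership test that runs before the regex. The sketch below assumes two simplified pattern families; `EMAIL_RE`, `SSN_RE`, and `scan_l0` are illustrative names, not part of the Tork API.

```python
import re
from typing import List, Tuple

# Simplified stand-ins for two of the universal pattern families.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_l0(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched_text) pairs, skipping families whose signal is absent."""
    found: List[Tuple[str, str]] = []
    if "@" in text:  # no '@' means no email can be present
        found += [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    if any(ch.isdigit() for ch in text):  # no digits means no SSN can be present
        found += [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    return found
```

The signal checks are linear scans with no backtracking, so text that contains no PII at all never reaches the regex engine.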

L0.5: Regional Patterns

31 country-specific patterns activated by the optional region parameter. Emirates ID (784-XXXX-XXXXXXX-X), Aadhaar (12 digits with context keywords), PAN, Chinese Resident ID (18 digits with date validation), Korean RRN (with gender-digit validation), Nigerian NIN/BVN, Brazilian CPF/CNPJ, and more. Only fires when region[] is specified — zero overhead otherwise. Supersession logic ensures wider regional matches correctly replace shorter L0 matches that overlap.
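
One way to read the supersession rule: when a regional match overlaps an L0 match and covers more text, the L0 match is dropped. The sketch below assumes that reading; `Match` and `apply_supersession` are hypothetical names, and the offsets in the usage example are made up for illustration.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int
    end: int
    label: str
    layer: str  # "L0" or "L0.5"


def _overlaps(a: Match, b: Match) -> bool:
    return a.start < b.end and b.start < a.end


def _width(m: Match) -> int:
    return m.end - m.start


def apply_supersession(l0: List[Match], regional: List[Match]) -> List[Match]:
    """Drop L0 matches overlapped by a wider regional match; keep the rest."""
    survivors = [
        m for m in l0
        if not any(_overlaps(m, r) and _width(r) > _width(m) for r in regional)
    ]
    # A regional match that is not wider than an overlapping L0 match loses instead.
    kept_regional = [r for r in regional if not any(_overlaps(r, m) for m in survivors)]
    return sorted(survivors + kept_regional)


# A partial phone match inside an Emirates ID, plus an unrelated email match.
l0_matches = [Match(7, 19, "PHONE", "L0"), Match(30, 40, "EMAIL", "L0")]
regional_matches = [Match(3, 21, "EMIRATES_ID", "L0.5")]
final = apply_supersession(l0_matches, regional_matches)
```

Here `final` keeps the Emirates ID and the email, and the partial phone match is discarded.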

L1: Slot Detection

50+ structural patterns that detect WHERE PII appears by analyzing sentence structure, not WHAT the token is. This is the key to cross-cultural name detection. Categories: identity labels ("name:", "patient:", "client:"), self-identification ("my name is", "I am"), title prefixes ("Dr.", "Sheikh", "Alhaji", "Mrs."), possessive identifiers ("my address is", "my email is"), lives-at patterns ("lives at", "resides in"), structured data (JSON fields, XML tags), conjunction chains ("John and Sarah"), and referral patterns ("referred by", "assigned to").
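
To make the idea of a slot concrete, here is a rough sketch covering three of the categories above. The regular expressions, the `NAME_TOKEN` heuristic (capitalized tokens plus a few particles such as "al-" and "bin"), and the function names are assumptions made for illustration; the real engine uses 50+ patterns.

```python
import re
from typing import List, Tuple

# A name token: a capitalized word, an "al-" compound, or a connecting particle.
NAME_TOKEN = r"(?:[A-Z][\w'-]*|al-[A-Z][\w'-]*|(?:bin|ibn)\b)"
NAME = rf"{NAME_TOKEN}(?:\s+{NAME_TOKEN}){{0,4}}"

# The trigger is case-insensitive; the captured slot content is not.
SLOT_PATTERNS = [
    ("IDENTITY_LABEL", re.compile(rf"\b(?i:name|patient|client)\s*:\s*({NAME})")),
    ("SELF_IDENTIFICATION", re.compile(rf"\b(?i:my name is|i am)\s+({NAME})")),
    ("TITLE_PREFIX", re.compile(rf"\b(?:Dr\.|Mrs\.|Sheikh|Alhaji)\s+({NAME})")),
]


def find_name_slots(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (category, name, start, end) for every filled slot."""
    hits = []
    for category, pattern in SLOT_PATTERNS:
        for m in pattern.finditer(text):
            hits.append((category, m.group(1), m.start(1), m.end(1)))
    return hits


def redact_names(text: str) -> str:
    """Replace slot contents left to right, skipping overlapping hits."""
    pieces, cursor = [], 0
    for _, _, start, end in sorted(find_name_slots(text), key=lambda h: h[2]):
        if start < cursor:
            continue
        pieces.append(text[cursor:start])
        pieces.append("[NAME_REDACTED]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


print(redact_names("Patient: Chukwuemeka Okonkwo"))  # Patient: [NAME_REDACTED]
```

Nothing in the sketch knows what "Chukwuemeka" is; the label alone determines that the following tokens are a name.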

L1.5: Context Patterns

Password, secret, API key, access token, and address patterns triggered by keyword proximity matching. Catches freeform credential leaks ("my password is X", "api_key: sk-...") and address blocks with street/city/state/zip structure. Only fires when keywords like "password", "secret", "address", or "account" appear nearby.
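
A keyword-triggered credential pattern might look like the following. The regex, the `[SECRET]` placeholder, and `redact_credentials` are hypothetical and cover only the "keyword, then separator, then value" shape from the examples above.

```python
import re

CREDENTIAL_RE = re.compile(
    r"(?i)\b(password|secret|api[_-]?key|access[_-]?token)\b"  # trigger keyword
    r"(?:\s+is\s+|\s*[:=]\s*)"                                  # "is", ":" or "="
    r"(\S+)"                                                    # the credential value
)


def redact_credentials(text: str) -> str:
    """Keep the keyword and separator, replace only the value."""
    def replace(m: "re.Match[str]") -> str:
        prefix = m.group(0)[: m.start(2) - m.start(0)]
        return prefix + "[SECRET]"
    return CREDENTIAL_RE.sub(replace, text)
```

A keyword with no value after it, as in "the password policy changed", does not match, so ordinary prose that mentions passwords is left alone.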

L3: Dictionary Fallback

Fallback name detection using dictionaries of 2,000+ first names and 5,000+ last names covering Western, African, Arabic, Indian, East Asian, and Latin American names. Catches standalone names that have no slot context (e.g., "John" appearing alone in text). An exclusion list of 200+ common words prevents false positives ("Will", "Grace", "Mark", etc.).
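
The fallback reduces to a set lookup guarded by an exclusion set. The tiny word lists below are stand-ins for the real dictionaries, and `dictionary_names` is an illustrative name.

```python
import re
from typing import List, Tuple

# Tiny stand-ins for the 2,000+ first-name dictionary and the 200+ word exclusion list.
FIRST_NAMES = {"john", "sarah", "ngozi", "will", "grace", "mark"}
EXCLUDED = {"will", "grace", "mark"}


def dictionary_names(text: str) -> List[Tuple[str, int]]:
    """Return (token, start_offset) for tokens found in the dictionary but not excluded."""
    hits = []
    for m in re.finditer(r"\b[A-Za-z][A-Za-z'-]*\b", text):
        token = m.group().lower()
        if token in FIRST_NAMES and token not in EXCLUDED:
            hits.append((m.group(), m.start()))
    return hits


print(dictionary_names("I will ask John about the Grace period"))  # [('John', 11)]
```

Both lookups are hash-set membership tests, so cost grows with the number of tokens in the text, not with dictionary size.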

Regional PII Profiles

Activate country-specific and industry-specific PII patterns by passing the optional region and industry parameters. When omitted, only the 29 universal L0 patterns run — fully backward compatible with zero overhead.

Profiles are composable: pass multiple regions and an industry to build exactly the detection surface you need. For example, region: ["ae", "in"], industry: "healthcare" activates UAE, India, and healthcare patterns together.

Available Regions

| Code | Country | PII Types Detected |
| --- | --- | --- |
| AU | Australia | TFN, ABN, Medicare number, AU phone (+61), AU driver license, ACN |
| US | United States | SSN, ITIN, US passport, ZIP+4, EIN |
| GB | United Kingdom | NINO, NHS number, UK phone (+44/0), sort code, UK passport |
| EU | European Union | German Tax ID (Steuer-IdNr), IBAN, French SSN, EU VAT |
| AE | UAE | Emirates ID (784-XXXX-XXXXXXX-X), UAE phone (+971), PO Box |
| SA | Saudi Arabia | National ID / Iqama (context-gated), SA phone (+966) |
| NG | Nigeria | NIN (context-gated), BVN (context-gated), Nigerian phone (+234) |
| IN | India | Aadhaar (context-gated), PAN, IFSC, Indian phone (+91), pincode |
| JP | Japan | My Number (context-gated), Japanese phone (+81) |
| CN | China | Resident ID (18-digit with date validation), Chinese phone (+86) |
| KR | South Korea | RRN (gender-digit validated), Korean phone (+82) |
| BR | Brazil | CPF, CNPJ |
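
As an example of what "date validation" means for the CN profile: an 18-digit Resident ID carries the holder's birth date in digits 7 to 14 (YYYYMMDD), so a candidate whose date is impossible can be rejected. The sketch below checks only that; the function name and the 1900 lower bound are assumptions, and the check digit is not verified.

```python
import re
from datetime import date

# 6-digit region code, 8-digit birth date, 3-digit sequence, 1 check character.
CN_ID_RE = re.compile(r"(\d{6})(\d{4})(\d{2})(\d{2})(\d{3})([\dXx])")


def looks_like_cn_resident_id(candidate: str) -> bool:
    """True if the string has the 18-character shape and a plausible birth date."""
    m = CN_ID_RE.fullmatch(candidate)
    if m is None:
        return False
    year, month, day = int(m.group(2)), int(m.group(3)), int(m.group(4))
    try:
        born = date(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return False
    return date(1900, 1, 1) <= born <= date.today()  # lower bound is an assumption
```

This kind of structural check is what lets an 18-digit pattern run without a context keyword: most random 18-digit sequences do not contain a valid calendar date in that position.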

Available Industries

| Code | Industry | PII Types Detected |
| --- | --- | --- |
| healthcare | Healthcare (HIPAA) | MRN, ICD-10 diagnosis codes, CPT procedure codes, NDC drug codes, health plan IDs (BCBS, UHC, Aetna, Cigna, Kaiser) |
| finance | Finance (PCI-DSS) | CUSIP, ISIN, bank account numbers (context-gated) |
| legal | Legal | Court case / docket numbers, attorney bar numbers, court file numbers |

Composability

Combine multiple regions and an industry in a single request. Each profile adds its patterns to the detection surface without duplicating universal L0 patterns.

```python
import tork

# UAE + India + Healthcare: detects Emirates ID, Aadhaar, PAN,
# ICD-10 codes, MRN, and all 29 universal patterns
result = tork.govern(
    content="Emirates ID: 784-1234-1234567-1, PAN: ABCDE1234F, ICD-10: J45.0",
    region=["ae", "in"],
    industry="healthcare"
)

# Nigeria + Finance: detects NIN, BVN, CUSIP, ISIN,
# plus all universal patterns
result = tork.govern(
    content="NIN: 12345678901, CUSIP: 037833100",
    region=["ng"],
    industry="finance"
)
```

Context-gated patterns: Generic number formats (11-digit NIN, 12-digit Aadhaar, 10-digit SA ID) require nearby keywords like “NIN”, “Aadhaar”, or “National ID” within 60 characters to fire, preventing false positives on random digit sequences.
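
A context gate can be sketched as a window check around each candidate number. The 60-character window comes from the description above; whether the real engine looks before, after, or on both sides of the match is not stated, so this sketch assumes both sides. `context_gated` and `NIN_RE` are illustrative names.

```python
import re
from typing import List, Pattern

CONTEXT_WINDOW = 60  # characters on each side of the candidate (both sides is an assumption)


def context_gated(text: str, number_re: Pattern[str], keywords: List[str]) -> List[str]:
    """Return candidate numbers that have one of the keywords nearby."""
    hits = []
    for m in number_re.finditer(text):
        window = text[max(0, m.start() - CONTEXT_WINDOW): m.end() + CONTEXT_WINDOW]
        if any(re.search(rf"\b{re.escape(k)}\b", window, re.IGNORECASE) for k in keywords):
            hits.append(m.group())
    return hits


NIN_RE = re.compile(r"\b\d{11}\b")  # generic 11-digit shape

print(context_gated("NIN: 12345678901", NIN_RE, ["NIN"]))           # ['12345678901']
print(context_gated("Order 12345678901 shipped", NIN_RE, ["NIN"]))  # []
```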

Cross-Cultural Name Detection

Traditional dictionary-based name detection fails for non-Western names. No dictionary can contain every Nigerian, Arabic, Indian, or Chinese name — and names from these cultures often have compound forms, honorifics, and structures that don't match Western first/last patterns.

Tork solves this with the Entity Slot Detection Engine (L1), which detects names by analyzing sentence structure rather than relying on dictionary lookup. It identifies the “slot” where a name appears and extracts whatever occupies that slot.

Before & After

Dictionary only (L3)

"Patient: Chukwuemeka Okonkwo"

MISSED — name not in dictionary

Slot Detection (L1)

"Patient: [NAME_REDACTED]"

CAUGHT — via “Patient:” label pattern

This approach works with names from any culture without requiring the name to exist in a dictionary:

  • African names: Chukwuemeka Okonkwo, Oluwaseun Adeyemi, Ngozi Okafor
  • Arabic names: Abd al-Rahman al-Farsi, Sheikh Mohammed bin Rashid
  • Indian names: Venkatanarasimharajuvaripeta, Subramaniam Chandrasekhar
  • Chinese names: Zhang Wei, Li Xiaoming
  • Compound names: Mary-Jane Watson, Jean-Pierre Dupont

Slot Pattern Categories

| Category | Pattern | Example |
| --- | --- | --- |
| IDENTITY_LABEL | Label followed by name | "Patient: Abd al-Rahman", "Client: Kim Soo-jin" |
| SELF_IDENTIFICATION | "my name is" / "I am" | "my name is Chukwuemeka Okonkwo" |
| TITLE_PREFIX | Honorific before name | "Dr. Venkatanarasimharajuvaripeta", "Sheikh Mohammed" |
| POSSESSIVE_IDENTIFIER | "my X is" context | "my address is 50 Elizabeth St", "my email is" |
| LIVES_AT | Residence indicators | "lives at 123 Main St", "resides in Sydney" |
| REFERRAL | Assignment/referral | "assigned to Oluwaseun Adeyemi", "referred by" |
| STRUCTURED_DATA | JSON/XML fields | {"full_name": "Zhang Wei"}, <name>Li Ming</name> |
| CONJUNCTION_CHAIN | Names joined by "and" | "John and Sarah attended", "Dr. Smith and Prof. Lee" |

Slot + Dictionary: The L3 dictionary layer acts as a fallback, catching standalone common names (e.g., “John” appearing alone without structural context) that slots can't detect. Together, the two layers cover names that appear in a recognizable slot, whatever their origin, plus common names that appear with no structural context at all.

Code Examples

Integrate PII detection into your application with a single API call.

```python
import tork

# Basic (universal patterns only)
result = tork.govern(content="My SSN is 123-45-6789")

# Regional profiles
result = tork.govern(
    content="Emirates ID: 784-1234-1234567-1, Phone: +971 50 123 4567",
    region=["ae"]
)

# Single region + industry
result = tork.govern(
    content="Aadhaar: 1234 5678 9012, ICD-10: J45.20",
    region=["in"],
    industry="healthcare"
)

# Access results (from the last call above)
print(result["action"])        # "redact"
print(result["output"])        # redacted text
print(result["pii_detected"])  # [{"type": "aadhaar", "count": 1}, ...]
```
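
For logging or metrics, the result can be condensed into a one-line summary. The helper below is a sketch that relies only on the `action` and `pii_detected` keys shown above; `summarize` is a hypothetical name, and the sample dictionary (including the `icd10` type string) is invented for illustration.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any]) -> str:
    """Condense a result dict into 'action: type xN, ...'."""
    counts: Dict[str, int] = {}
    for item in result.get("pii_detected", []):
        counts[item["type"]] = counts.get(item["type"], 0) + item["count"]
    if not counts:
        return f"{result['action']}: no PII detected"
    detail = ", ".join(f"{name} x{n}" for name, n in sorted(counts.items()))
    return f"{result['action']}: {detail}"


# Hand-written sample in the shape shown above, not real API output.
sample = {
    "action": "redact",
    "output": "(redacted text)",
    "pii_detected": [{"type": "aadhaar", "count": 1}, {"type": "icd10", "count": 1}],
}
print(summarize(sample))  # redact: aadhaar x1, icd10 x1
```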

Performance

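PII detection is optimized for production workloads: quick-check gating skips irrelevant L0 patterns, regional patterns run only when requested, and API key validation is cached in Redis.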

  • ~6ms cache hit latency: average with Redis-cached validation.
  • ~700ms cache miss latency: first request (cold key validation).
  • 717 req/s throughput: sustained at a 20ms latency target.

| Component | Latency | Notes |
| --- | --- | --- |
| L0 Regex | <1ms | Quick-check gating skips irrelevant patterns |
| L0.5 Regional | <3ms | Only fires when region[] is specified |
| L1 Slot Detection | ~2ms | Structural analysis of sentence context |
| L1.5 Context | <1ms | Keyword proximity matching |
| L3 Dictionary | <1ms | Hash-set lookup against name lists |
| Redis cache | TTL 60s | API key validation cached for 60 seconds |
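
The gap between the ~6ms and ~700ms figures comes from whether key validation is served from cache. The sketch below shows the same time-to-live idea with an in-process dictionary instead of Redis; `TTLCache` and `get_or_compute` are hypothetical names, and the injectable clock exists only to make expiry easy to test.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache a computed value per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: str, compute: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # cache hit: skip the slow validation
        value = compute(key)             # cache miss: run the slow validation
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 60-second TTL, only the first request in each minute for a given key pays the validation cost.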

Scanning for PII

Scan text to detect PII entities without redacting:

```python
from tork_governance import TorkClient

client = TorkClient()

# Scan text for PII
result = client.pii.scan(
    text="Contact John Smith at john@example.com or 555-123-4567",
    entity_types=["EMAIL", "PHONE", "PERSON"]
)

for entity in result.entities:
    print(f"Found {entity.type}: {entity.text} (confidence: {entity.confidence})")
# Output:
# Found PERSON: John Smith (confidence: 0.95)
# Found EMAIL: john@example.com (confidence: 0.99)
# Found PHONE: 555-123-4567 (confidence: 0.98)
```

Redacting PII

Automatically redact detected PII with customizable replacement patterns:

```python
# Reuses the client created in the scanning example above

# Redact PII from text
result = client.pii.redact(
    text="My SSN is 123-45-6789 and my email is jane@company.com",
    redaction_style="placeholder"  # or "mask", "hash", "synthetic"
)

print(result.redacted_text)
# Output: "My SSN is [SSN] and my email is [EMAIL]"

# Use masking style
result = client.pii.redact(
    text="Card: 4111-1111-1111-1111",
    redaction_style="mask"
)
print(result.redacted_text)
# Output: "Card: ****-****-****-1111"
```
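
To show how the styles differ, here is a local sketch of three of them applied to a single detected value. It is not the client library's implementation: `apply_style` and `mask` are hypothetical, the hash format (a truncated SHA-256 digest) is an assumption, and the "synthetic" style is left out.

```python
import hashlib


def mask(value: str, keep_last: int = 4) -> str:
    """Star out all but the last few letters/digits, keeping separators."""
    remaining = sum(ch.isalnum() for ch in value) - keep_last
    out = []
    for ch in value:
        if ch.isalnum():
            out.append("*" if remaining > 0 else ch)
            remaining -= 1
        else:
            out.append(ch)
    return "".join(out)


def apply_style(value: str, entity_type: str, style: str) -> str:
    if style == "placeholder":
        return f"[{entity_type}]"
    if style == "mask":
        return mask(value)
    if style == "hash":
        # Assumed format: stable digest so equal values map to equal tokens.
        return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    raise ValueError(f"unsupported style: {style}")


print(apply_style("123-45-6789", "SSN", "placeholder"))       # [SSN]
print(apply_style("4111-1111-1111-1111", "CREDIT_CARD", "mask"))  # ****-****-****-1111
```

Placeholders keep the text readable, masking keeps enough of the value for a human to recognize it, and hashing keeps values comparable across records without exposing them.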

Supported Entity Types

| Category | Entity Types |
| --- | --- |
| Personal | NAME, EMAIL, PHONE, ADDRESS, DATE_OF_BIRTH |
| Financial | CREDIT_CARD, BANK_ACCOUNT, SSN, TAX_ID, IBAN, SWIFT_BIC, CRYPTO, CUSIP, ISIN |
| Medical | MEDICAL_RECORD, HEALTH_PLAN, ICD10, CPT, NDC, NPI, DEA |
| Identity | PASSPORT, DRIVER_LICENSE, NATIONAL_ID, NINO, EIN, AADHAAR, PAN, EMIRATES_ID, NIN, CPF |
| Network | IP_ADDRESS, IPV6_ADDRESS, URL |
| Credentials | PASSWORD, SECRET, API_KEY, ACCESS_TOKEN |

Country-Specific Patterns

Enable country-specific PII patterns for localized detection via the region parameter:

```python
import tork

# Enable UK-specific patterns
result = tork.govern(
    content="My NI number is AB 12 34 56 C and sort code is 12-34-56 for my bank account",
    region=["gb"]
)

# Enable multiple regions
result = tork.govern(
    content="Emirates ID: 784-1234-1234567-1, Aadhaar: 1234 5678 9012",
    region=["ae", "in"]
)

# Supported regions: AU, US, GB, EU, AE, SA, NG, IN, JP, CN, KR, BR
```

Tip: Use the Admin Dashboard to view PII detection analytics and configure custom patterns.
