De-identifying clinical text in LATAM: how LGPD changes the game

HIPAA Safe Harbor lists 18 identifiers; LGPD demands risk-based proof — here is what we built.

Take this synthetic Brazilian clinical note and run it through any de-identification pipeline built for the US market:

"Paciente Maria Silva, CPF 123.456.789-00, CNS 7000 0000 0000 0000, 64 anos, Dra. Ferreira solicitou TC abdome em 20/04/2026 no HC-USP. Histórico de FA crônica em uso de warfarina 5mg."

A HIPAA-tuned redactor catches the obvious. It masks "Maria Silva" because the name list overlaps US census data. It catches "20/04/2026" because dates are in Safe Harbor's 18-identifier list. It redacts "HC-USP" only if its hospital dictionary has been extended with LATAM facility names. Most have not.

What it almost certainly misses: the CPF, the CNS (Cartão Nacional de Saúde), and "Dra. Ferreira" (US-tuned name recognizers are optimized for "Dr." and "Mr./Ms.", not the Portuguese feminine title). On other notes it would also miss the RG, the CRM (medical license number), the IBGE municipality code, and the CNES facility code that uniquely identifies a Brazilian health unit. None of those is named in HIPAA Safe Harbor's list. All of them can identify a person under LGPD.

Why LGPD is broader than HIPAA

HIPAA Safe Harbor is mechanical. Strip the 18 identifiers, hold no actual knowledge that what remains could identify someone, and you have a compliant de-identified dataset. Binary standard, checklist audit.

LGPD gives you no checklist. LGPD Art. 5, II defines dado pessoal sensível to include health data on its own, not just when joined with a name. Article 12 sets the anonymization standard: data counts as anonymized only if the process cannot be reversed with the controller's own means or with reasonable effort, and Art. 12 §1 judges "reasonable" by objective factors such as the cost and time a reversal would take with available technology. A risk-based standard, with the burden of evidence on the controller.

In practice: a CPF stripped but a CNS left in is non-compliant. A name redacted but a CRM left in is non-compliant. "ESF da unidade Vila Nova" plus an age and a rare diagnosis is non-compliant, because in a town of 8,000 that is one patient.
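The small-town case can be made concrete by counting how many records share each combination of quasi-identifiers. The sketch below is illustrative only: the field names (`unit`, `age`, `diagnosis`), the helper `risky_records`, and the threshold `k` are assumptions, and LGPD itself does not prescribe a value for k.

```python
from collections import Counter

def risky_records(records, quasi_ids, k=5):
    """Return records whose quasi-identifier combination is shared by fewer than k records."""
    def key(record):
        return tuple(record[q] for q in quasi_ids)

    # Size of each equivalence class (records with identical quasi-identifiers).
    sizes = Counter(key(r) for r in records)
    return [r for r in records if sizes[key(r)] < k]

# Hypothetical, already name-redacted records from one health unit.
records = [
    {"unit": "ESF Vila Nova", "age": 64, "diagnosis": "diagnóstico raro"},
    {"unit": "ESF Vila Nova", "age": 64, "diagnosis": "hipertensão"},
    {"unit": "ESF Vila Nova", "age": 64, "diagnosis": "hipertensão"},
]

flagged = risky_records(records, ["unit", "age", "diagnosis"], k=2)
```

Here only the record with the rare diagnosis is flagged: no name, CPF, or CNS is present, yet the combination still points at one person.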

The Cortex layered approach

We do not trust a single redactor. We stack four layers; FAIL-CLOSED is the load-bearing one:

1. Microsoft Presidio with LATAM-tuned recognizers as the first pass. Presidio handles names, dates, locations, and emails across languages once configured. We extend its name lists with Brazilian IBGE registers and Mexican RENAPO patterns so "Maria" is treated as a first name everywhere it appears.

2. Custom recognizers for CPF, CNS, RG, and CRM, each with checksum validation where the format defines one. The CPF modulo-11 check eliminates roughly 40% of the false positives a pure-regex matcher produces; an 11-digit number that fails the check is almost always a phone or an order ID. CNS has its own modulo-11 variant. CRM uses state-prefixed formats (CRM/SP, CRM/RJ) that vary in length.

3. An LLM safety-rail pass as a tripwire, never the primary redactor. LLMs leak context: they may have been trained on text containing real PHI and will sometimes paraphrase a redacted name back into the output. We use the LLM only to flag passages that may still contain a quasi-identifier. Its output goes to a review queue, not the final text.

4. FAIL-CLOSED policy. If Presidio errors, if a custom recognizer throws, or if the LLM tripwire times out, the redaction blocks. We do not ship unredacted text. The default is refusal, not delivery. That inversion of the usual default is the only one that survives an LGPD audit.
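The CPF checksum mentioned above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the Cortex recognizer: the names `CPF_RE`, `is_valid_cpf`, and `find_cpfs` are hypothetical, and the regex only covers the common dotted and undotted layouts.

```python
import re

# Candidate CPFs: 11 digits, optionally written as 000.000.000-00.
CPF_RE = re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b")

def cpf_check_digits(base9: str) -> str:
    """Compute the two modulo-11 check digits for a 9-digit CPF base."""
    digits = [int(c) for c in base9]
    for _ in range(2):
        n = len(digits)
        # Weights run from n+1 down to 2 (10..2, then 11..2).
        total = sum(d * w for d, w in zip(digits, range(n + 1, 1, -1)))
        rem = (total * 10) % 11
        digits.append(0 if rem == 10 else rem)
    return "".join(str(d) for d in digits[-2:])

def is_valid_cpf(candidate: str) -> bool:
    digits = re.sub(r"\D", "", candidate)
    if len(digits) != 11:
        return False
    # Repeated-digit strings pass the arithmetic but are not issued as CPFs.
    if digits == digits[0] * 11:
        return False
    return cpf_check_digits(digits[:9]) == digits[9:]

def find_cpfs(text: str) -> list:
    """Return regex matches that also pass the checksum."""
    return [m.group(0) for m in CPF_RE.finditer(text) if is_valid_cpf(m.group(0))]

hits = find_cpfs("CPF 123.456.789-09, tel 11987654321")
```

In this example the phone number matches the regex but fails the checksum, so only the CPF is returned. The synthetic CPF in the sample note ends in -00 and does not pass this check (the valid pair for that base is 09), so a checksum filter alone would not flag it; a pipeline may want to redact matches next to a "CPF" label regardless.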

The multilingual angle

US-English tooling misses LATAM names in two directions. "Maria" overlaps US lists and usually gets caught — but "Cleuza," "Adenilson," and "Genivaldo" do not, and all three are common Brazilian first names. In Spanish-speaking LATAM, compound surnames ("García López") trip up models expecting a single last-name token. Our recognizers use census-tier dictionaries for BR (IBGE) and MX/CO (RENAPO, DANE) populations.
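As a rough illustration of the title and compound-surname problem, a pattern can anchor on Portuguese and Spanish titles and then consume every capitalized token that follows. This is a heuristic sketch under stated assumptions, not the dictionary-based recognizers described above: the pattern and the helper `find_titled_names` are hypothetical and cover only a handful of titles and particles.

```python
import re

TITLE = r"(?:Dra?|Sra?)\."           # Dr., Dra., Sr., Sra.
NAME = r"[A-ZÀ-Ý][a-zà-ÿ]+"          # capitalized token, accents allowed
PARTICLE = r"(?:d[aeo]s?\s+)?"       # optional da/de/do/das/dos

# Title, first name token, then any number of further surname tokens.
TITLED_NAME = re.compile(rf"\b{TITLE}\s+{NAME}(?:\s+{PARTICLE}{NAME})*")

def find_titled_names(text: str) -> list:
    return [m.group(0) for m in TITLED_NAME.finditer(text)]

pt = find_titled_names("Dra. Ferreira solicitou TC abdome")
es = find_titled_names("Atendido por Dr. García López ontem")
```

A recognizer that stops after one surname token would leave "López" in the text; consuming the full run of capitalized tokens avoids that, at the cost of occasionally over-redacting a capitalized word that follows a name.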

The audit angle

LGPD Art. 37 requires the controller and the operator to keep a record of the processing operations they carry out. Every redaction Cortex performs emits an audit-log entry: a hashed, non-reversible token mapping, the recognizer pack version, the model version (if the LLM tripwire fired), and the timestamp. If a regulator asks how a record was de-identified on April 21, we can replay it.
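An audit entry of that shape could look like the sketch below. The field names, the `audit_entry` helper, and the choice of HMAC-SHA256 are assumptions made for illustration; the source does not specify the hashing scheme or the log schema.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def audit_entry(original, entity_type, key, pack_version, model_version=None):
    """Build one audit-log record without storing the original value."""
    # Keyed hash: the token cannot be recomputed without the secret key.
    token = hmac.new(key, original.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "entity_type": entity_type,
        "token": token,
        "recognizer_pack_version": pack_version,
        "model_version": model_version,  # None when the LLM tripwire did not fire
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

entry = audit_entry("123.456.789-09", "BR_CPF", b"demo-key", "2026.04")
line = json.dumps(entry, ensure_ascii=False)
```

A keyed hash matters here because the CPF space is small (nine free digits), so a plain unkeyed SHA-256 of a CPF could be reversed by enumerating every candidate.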

The honest part

Regex plus dictionaries gets you to 80%. The last 20% — context-dependent quasi-identifiers, rare CRM formats, idiosyncratic hospital narratives, the one ESF unit name that uniquely identifies a 6,000-person town — is where every vendor's "LGPD-compliant" claim fails on the first audit.

We do not pretend to solve that 20%. We FAIL-CLOSED on it. The note goes to a human reviewer, not to a downstream system. Slower, more expensive, and the only stance that lets us face a DPO with evidence in hand.
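The fail-closed stance can be sketched as a wrapper that never returns text unless every pass succeeded and the tripwire stayed quiet. This is a minimal sketch, not the Cortex implementation: `Outcome`, `run_fail_closed`, and the toy redactors are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Outcome:
    text: Optional[str]      # redacted text, or None when delivery is blocked
    needs_review: bool       # True when the note goes to a human reviewer
    reason: str = ""

def run_fail_closed(
    note: str,
    redactors: List[Callable[[str], str]],
    tripwire: Callable[[str], bool],
) -> Outcome:
    redacted = note
    try:
        for redact in redactors:
            redacted = redact(redacted)
        flagged = tripwire(redacted)
    except Exception as exc:
        # Any error or timeout blocks delivery; the original text is never returned.
        return Outcome(None, True, f"{type(exc).__name__}: {exc}")
    if flagged:
        return Outcome(None, True, "tripwire flagged a possible quasi-identifier")
    return Outcome(redacted, False)

def mask_name(text: str) -> str:
    return text.replace("Maria Silva", "[NOME]")

def broken(text: str) -> str:
    raise RuntimeError("recognizer crashed")

def slow_tripwire(text: str) -> bool:
    raise TimeoutError("tripwire timed out")

ok = run_fail_closed("Paciente Maria Silva", [mask_name], lambda t: False)
blocked = run_fail_closed("Paciente Maria Silva", [mask_name, broken], lambda t: False)
timed_out = run_fail_closed("Paciente Maria Silva", [mask_name], slow_tripwire)
```

The design choice is that the only code path returning text is the last line of the function; every failure mode produces an `Outcome` with no text and a reason for the reviewer.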