DataRedactor
A Ruby gem with a C extension for high-performance regex-based redaction of sensitive data from strings.
What it does
DataRedactor scans text for sensitive patterns and replaces matches with [REDACTED]. It uses a C extension backed by POSIX regex.h so the heavy lifting happens outside the Ruby VM, making it fast enough for large payloads.
Usage
require "data_redactor"
text = "User CF is RSSMRA85M01H501Z and key is AKIAIOSFODNN7EXAMPLE"
DataRedactor.redact(text)
# => "User CF is [REDACTED] and key is [REDACTED]"
Filtering by tag or pattern name
only: and except: both accept a single value or an Array, mixing Symbols (tag names) and Strings (specific pattern names).
DataRedactor.tags
# => [:credentials, :financial, :tax_id, :national_id, :contact, :network, :travel, :other, :custom]
DataRedactor.pattern_names
# => ["aws_s3_presigned_url", "aws_access_key_id", "email", "phone_e164", "ipv4", ...]
# Tag-level filtering
DataRedactor.redact(text, only: [:credentials])
DataRedactor.redact(text, except: :contact)
# Single specific pattern
DataRedactor.redact(text, only: ["aws_access_key_id"])
# Mix — every credentials pattern PLUS aws_access_key_id (even if it lived in another tag)
DataRedactor.redact(text, only: [:credentials, "aws_access_key_id"])
# Combine — every contact pattern EXCEPT email
DataRedactor.redact(text, only: :contact, except: ["email"])
Precedence: a pattern is redacted iff (only: is nil OR the pattern matches only:) AND (the pattern does not match except:). except: always wins when the two overlap, so only: :contact, except: :contact produces a no-op (everything is excluded).
Errors: an unknown tag Symbol raises DataRedactor::UnknownTagError; an unknown pattern name String raises DataRedactor::UnknownPatternError.
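The precedence rule can be modelled in a few lines of plain Ruby. This is a sketch of the rule as stated above, not the gem's implementation: Pattern, filter_matches? and selected? are hypothetical names, the :contact tag for phone_e164 is an assumption, and the unknown-tag / unknown-pattern errors are left out.

```ruby
# Hypothetical stand-in for a registered pattern: a name plus the tag it belongs to.
Pattern = Struct.new(:name, :tag)

# Symbols in a filter are compared against the tag, Strings against the pattern name.
def filter_matches?(pattern, filter)
  Array(filter).any? do |entry|
    entry.is_a?(Symbol) ? entry == pattern.tag : entry == pattern.name
  end
end

# (only: is nil OR pattern matches only:) AND (pattern does not match except:)
def selected?(pattern, only: nil, except: nil)
  included = only.nil? || filter_matches?(pattern, only)
  excluded = !except.nil? && filter_matches?(pattern, except)
  included && !excluded
end

email = Pattern.new("email", :contact)
phone = Pattern.new("phone_e164", :contact)          # tag assumed for illustration
aws   = Pattern.new("aws_access_key_id", :credentials)

selected?(email, only: :contact, except: ["email"])   # => false
selected?(phone, only: :contact, except: ["email"])   # => true
selected?(phone, only: :contact, except: :contact)    # => false (except: wins)
selected?(aws, only: [:credentials, "aws_access_key_id"]) # => true
```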
Configurable placeholder
By default every match is replaced with [REDACTED]. Use the placeholder: keyword to change this:
# Plain string — any replacement text
DataRedactor.redact(text, placeholder: "***")
DataRedactor.redact(text, placeholder: "")
# Tagged — embeds the pattern's tag name so you know what was redacted
DataRedactor.redact(text, placeholder: :tagged)
# "user@example.com" → "[REDACTED:CONTACT]"
# "AKIAIOSFODNN7EXAMPLE" → "[REDACTED:CREDENTIALS]"
# "DE89370400440532013000" → "[REDACTED:FINANCIAL]"
# Hash — deterministic 4-hex suffix of the matched value
# Same value always produces the same token — useful for correlating
# redactions across log lines without leaking the original.
DataRedactor.redact(text, placeholder: :hash)
# "user@example.com" → "[CONTACT_3d7a]"
# "user@example.com" → "[CONTACT_3d7a]" (same every time)
# "other@example.com" → "[CONTACT_91fc]" (different value, different hash)
All three modes compose with only: and except::
DataRedactor.redact(text, only: :contact, placeholder: :tagged)
Scan / dry-run mode
DataRedactor.scan returns every match alongside the redacted string — useful for auditing, tuning false positives, and compliance pipelines:
text = "User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1"
result = DataRedactor.scan(text)
# => {
#      redacted: "User [REDACTED] logged in from [REDACTED]",
#      matches: [
#        { tag: :credentials, name: "aws_access_key_id", value: "AKIAIOSFODNN7EXAMPLE", start: 5, length: 20 },
#        { tag: :network, name: "ipv4", value: "192.168.1.1", start: 41, length: 11 }
#      ]
#    }
# :start and :length are byte offsets into the original string
m = result[:matches].first
text.byteslice(m[:start], m[:length]) # => "AKIAIOSFODNN7EXAMPLE"
# Accepts the same filters as redact (tags + specific pattern names)
DataRedactor.scan(text, only: :credentials)
DataRedactor.scan(text, except: :network)
DataRedactor.scan(text, only: :contact, except: ["email"])
Custom patterns
Teams often have internal IDs that the gem can't ship. Register them at boot:
# String (POSIX ERE) or Regexp — both accepted
DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
DataRedactor.add_pattern(name: "ticket_ref", regex: /TICKET-[A-Z]{2}[0-9]{4}/, boundary: true)
# Custom patterns are tagged :custom by default; pass any built-in tag to group differently
DataRedactor.add_pattern(name: "internal_key", regex: "INT-[A-Z]{3}", tag: :credentials)
DataRedactor.redact(text) # runs all patterns including custom
DataRedactor.redact(text, only: [:custom]) # only user patterns
DataRedactor.redact(text, only: [:custom, :credentials]) # mix
DataRedactor.custom_patterns # => [{name:, source:, tag:, boundary:}, ...]
DataRedactor.remove_pattern("employee_id")
DataRedactor.clear_custom_patterns! # mostly for test suites
Regex rules: patterns must be POSIX ERE (the same engine used for built-ins). Not supported: \d, \s, \w, \b, lookahead/lookbehind, non-greedy quantifiers, named groups. Violations raise DataRedactor::InvalidPatternError at registration time, never at redaction time. Use [0-9] instead of \d, [[:space:]] instead of \s, etc.
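To catch these constructs before calling add_pattern (for example in a lint step), a rough pre-check might look like the sketch below. It is a heuristic written for this README, not the gem's validator: ere_violations and UNSUPPORTED_PROBES are hypothetical names, and the probes can misfire on escaped or bracketed characters.

```ruby
# Heuristic probes for the constructs listed as unsupported above.
# Hypothetical helper; the gem's own check runs inside add_pattern and may differ.
UNSUPPORTED_PROBES = {
  perl_shorthand: /\\[dswb]/,        # \d  \s  \w  \b
  lookaround:     /\(\?<?[=!]/,      # (?=  (?!  (?<=  (?<!
  named_group:    /\(\?<[A-Za-z_]/,  # (?<name>...)
  non_greedy:     /[*+?}]\?/         # *?  +?  ??  }?
}.freeze

# Returns the kinds of unsupported constructs found in a String or Regexp source.
def ere_violations(source)
  source = source.source if source.is_a?(Regexp)
  UNSUPPORTED_PROBES.select { |_kind, probe| source.match?(probe) }.keys
end

ere_violations("EMP-[0-9]{6}")      # => []
ere_violations(/\d{3}-\w+?/)        # => [:perl_shorthand, :non_greedy]
ere_violations("foo(?=bar)")        # => [:lookaround]
ere_violations("(?<id>[0-9]+)")     # => [:named_group]
```

An empty result only means none of the listed constructs was spotted; the pattern can still be rejected at registration for other reasons.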
boundary: true — wraps the pattern with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
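The effect of the wrapper can be tried out in plain Ruby. The sketch below reuses the wrapper string quoted above; redact_bounded is a hypothetical helper, and Ruby's Regexp stands in for the POSIX engine (note that Ruby treats ^ and $ as line anchors, so this only mirrors the behaviour for single-line input).

```ruby
# Builds the wrapper exactly as quoted in the paragraph above.
def wrap_boundary(source)
  "(^|[^0-9A-Za-z])(#{source})([^0-9A-Za-z]|$)"
end

# Hypothetical helper: redact only boundary-delimited occurrences of `source`.
def redact_bounded(text, source, placeholder: "[REDACTED]")
  wrapped = Regexp.new(wrap_boundary(source))
  # Groups 1 and 3 hold the boundary characters; re-emit them around the placeholder.
  text.gsub(wrapped) { "#{$1}#{placeholder}#{$3}" }
end

redact_bounded("see TICKET-AB1234, thanks", "TICKET-[A-Z]{2}[0-9]{4}")
# => "see [REDACTED], thanks"
redact_bounded("idXTICKET-AB12345", "TICKET-[A-Z]{2}[0-9]{4}")
# => "idXTICKET-AB12345" (embedded in a longer token, left alone)
```

If PATTERN brought its own capture group, the trailing boundary would no longer be group 3, which is presumably why the option rejects such patterns.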
Detected patterns (79 total)
The tables below are a representative sample. Use DataRedactor.pattern_names for the canonical, machine-readable list; it stays in sync with the C extension automatically.
Cloud & API secrets
| # | Pattern | Example |
|---|---|---|
| 0 | AWS Access Key ID | AKIAIOSFODNN7EXAMPLE |
| 1 | AWS Secret Access Key | 40-character base64 string |
| 5 | Google API Key | AIzaSyXXXX... |
| 6 | GitHub Personal Access Token | github_pat_XXXX... |
| 7 | Slack Webhook URL | https://hooks.slack.com/services/T.../B.../... |
| 8 | Stripe Secret Key | sk_live_XXXX... |
| 9 | PEM Private Key header | -----BEGIN RSA PRIVATE KEY----- |
| 13 | Scaleway Access Key | SCW12345ABCDE6789FGHIJ |
| 14 | UUID v4 / Scaleway Secret Key | 550e8400-e29b-41d4-a716-446655440000 |
Travel documents
| # | Pattern | Example |
|---|---|---|
| 2 | Italian Codice Fiscale (basic) | RSSMRA85M01H501Z |
| 3 | Passport — letter prefix + digits | AB1234567 |
| 4 | Passport — 9 consecutive digits ¹ | 123456789 |
| 22 | Italian Codice Fiscale (omocodia) | RSSMRALPMNLH5LMZ |
Payment & network
| # | Pattern | Example |
|---|---|---|
| 11 | Credit card — Visa, Mastercard, Amex, Discover, JCB | 4111111111111111 |
| 12 | IPv4 address | 192.168.1.100 |
IBANs
| # | Country | Example |
|---|---|---|
| 10 | Italy | IT60X0542811101000000123456 |
| 15 | France | FR7630006000011234567890189 |
| 16 | Germany | DE89370400440532013000 |
| 17 | Spain | ES9121000418450200051332 |
| 18 | Netherlands | NL91ABNA0417164300 |
| 19 | Belgium | BE68539007547034 |
| 20 | Portugal | PT50000201231234567890154 |
| 21 | Ireland | IE29AIBK93115212345678 |
| 28 | Sweden | SE4550000000058398257466 |
| 29 | Denmark | DK5000400440116243 |
| 30 | Norway | NO9386011117947 |
| 31 | Finland | FI2112345600000785 |
| 37 | Poland | PL61109010140000071219812874 |
| 38 | Austria | AT611904300234573201 |
| 39 | Switzerland | CH9300762011623852957 |
| 40 | Czechia | CZ6508000000192000145399 |
| 41 | Hungary | HU42117730161111101800000000 |
| 42 | Romania | RO49AAAA1B31007593840000 |
National personal identifiers
| # | Country | Type | Example |
|---|---|---|---|
| 23 | France | NIR / Social Security ¹ | 185126203450342 |
| 24 | Spain | DNI ¹ | 12345678Z |
| 25 | Spain | NIE | X1234567L |
| 26 | Netherlands | BSN ¹ | 123456789 |
| 27 | Poland | PESEL ¹ | 85121612345 |
| 32 | Belgium | National Number ¹ | 85121612345 |
| 33 | Sweden | Personnummer ¹ | 850101-1234 |
| 34 | Denmark | CPR Number ¹ | 010185-1234 |
| 35 | Norway | Fødselsnummer ¹ | 01018512345 |
| 36 | Finland | HETU ¹ | 010185-123A |
| 43 | Poland | PESEL (alt slot) ¹ | 90010112345 |
| 44 | Austria | Abgabenkontonummer ¹ | 123456789 |
| 45 | Switzerland | AHV Number ¹ | 756.1234.5678.90 |
| 46 | Czechia | Rodné číslo ¹ | 856121/1234 |
| 47 | Hungary | Tax ID ¹ | 8012345678 |
| 48 | Romania | CNP ¹ | 1850101123456 |
¹ Word-boundary protected: these patterns are wrapped with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) at compile time so they do not fire when the digit sequence appears inside a longer alphanumeric token.
Directory structure
redactor/
├── data_redactor.gemspec
├── Gemfile
├── Rakefile
├── lib/
│   ├── data_redactor.rb          # Ruby entry point, loads the .so
│   └── data_redactor/
│       └── version.rb
├── ext/
│   └── data_redactor/
│       ├── extconf.rb            # Checks for C headers, generates Makefile (globs *.c)
│       ├── data_redactor.c       # Entry point: Init_data_redactor only
│       ├── patterns.{c,h}        # Built-in pattern table + compiled regex_t array
│       ├── placeholder.{c,h}     # write_placeholder, djb2 hash, tag_name_for_bit
│       ├── redact.{c,h}          # _redact + replace_all_matches + wrap_boundary
│       ├── scan.{c,h}            # _scan + byte-offset replacement-log macros
│       ├── custom_patterns.{c,h} # Dynamic registry: add/remove/clear/list
│       └── tags.h                # TAG_* bit constants
└── spec/
    └── data_redactor_spec.rb     # RSpec tests: at least one example per pattern, plus filter / placeholder / custom-pattern coverage
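The tree lists a djb2 hash in placeholder.{c,h}, which presumably backs the placeholder: :hash mode. The sketch below shows the classic djb2 recurrence in Ruby and one way to turn it into a 4-hex token. The 32-bit wrap-around, the choice of the low 16 bits, and the helper names are assumptions, so the tokens it produces will not necessarily match the gem's.

```ruby
# Classic djb2: start at 5381, then hash = hash * 33 + byte for every byte.
# The 32-bit mask mimics unsigned overflow in C (assumed width).
def djb2(str)
  str.each_byte.reduce(5381) { |hash, byte| ((hash << 5) + hash + byte) & 0xFFFFFFFF }
end

# Hypothetical formatter: upper-cased tag plus the low 16 bits as 4 hex digits.
def hash_placeholder(tag, value)
  format("[%s_%04x]", tag.to_s.upcase, djb2(value) & 0xFFFF)
end

a = hash_placeholder(:contact, "user@example.com")
b = hash_placeholder(:contact, "user@example.com")
a == b # => true (same input, same token)
```

With only four hex digits there are 65,536 possible suffixes, so distinct values can collide; the token is a correlation aid, not a unique identifier.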
Requirements
- Ruby >= 2.7
- A C compiler (gcc or clang)
- POSIX regex.h (standard on Linux and macOS)
Setup
bundle install
Compile the C extension
bundle exec rake compile
This runs extconf.rb via rake-compiler, which generates a Makefile and compiles the C sources under ext/data_redactor/ into a .so shared library placed under lib/data_redactor/.
Run the tests
bundle exec rake spec
Or compile and test in one step:
bundle exec rake
How it works
- At load time, Init_data_redactor compiles all 79 regex patterns once using regcomp (POSIX ERE) and stores them as static regex_t structs. Patterns marked as boundary-wrapped are expanded with wrap_boundary() before compilation.
- DataRedactor.redact(text) receives a Ruby String, converts it to a C char* via StringValueCStr, and runs each compiled pattern in sequence on a working buffer.
- For each pattern, replace_all_matches iterates using regexec, copies non-matching segments to a fresh output buffer, and inserts [REDACTED] in place of each match. For boundary-wrapped patterns, regexec is called with nmatch=4 and sub-match groups [1]/[3] identify the boundary characters so they are preserved verbatim.
- The output buffer is grown with realloc as needed. After all patterns are applied the result is returned as a Ruby String via rb_str_new_cstr. All intermediate malloc/strdup allocations are explicitly freed.
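The loop described above can be approximated in Ruby to make the data flow easier to follow. This is a simplified model, not a port of the C code: it works on character indices rather than bytes, the two entries in PATTERNS are illustrative regexes rather than the gem's, and it preserves boundary characters by replacing only group 2 and resuming the scan right after it.

```ruby
# Illustrative pattern table (assumed regexes, not the gem's built-ins).
PATTERNS = [
  { regex: /AKIA[0-9A-Z]{16}/, bounded: false },
  { regex: /(^|[^0-9A-Za-z])([0-9]{9})([^0-9A-Za-z]|$)/, bounded: true }
].freeze

# Copy non-matching segments to a fresh buffer and insert the placeholder per match.
def replace_all_matches(input, regex, placeholder: "[REDACTED]", bounded: false)
  group  = bounded ? 2 : 0 # group 2 is PATTERN inside the boundary wrapper
  output = +""
  pos    = 0
  while pos <= input.length && (m = regex.match(input, pos))
    start, stop = m.begin(group), m.end(group)
    break if stop == start # guard against empty matches looping forever
    output << input[pos...start] << placeholder
    pos = stop # boundary characters after the match are copied on the next pass
  end
  output << input[pos..]
end

# Run every pattern in sequence; each pass works on the previous pass's output.
def redact(text)
  PATTERNS.reduce(text) do |buffer, pattern|
    replace_all_matches(buffer, pattern[:regex], bounded: pattern[:bounded])
  end
end

redact("key AKIAIOSFODNN7EXAMPLE id 123456789, 123456789")
# => "key [REDACTED] id [REDACTED], [REDACTED]"
```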
Memory management
All C-side buffers are heap-allocated with malloc/strdup and freed before the function returns. The only Ruby-managed allocation is the final return value from rb_str_new_cstr. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
Thread safety
DataRedactor.redact and DataRedactor.scan are safe to call concurrently from multiple threads. Built-in patterns are compiled into a static regex_t array at load time and never mutated afterward, and each call allocates its own working buffers. POSIX regexec is documented as thread-safe.
DataRedactor.add_pattern, remove_pattern, and clear_custom_patterns! mutate a shared dynamic array and are not thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent redact/scan call.
Versioning
This project follows Semantic Versioning 2.0.0. Until 1.0.0, minor versions may introduce breaking changes; from 1.0.0 onward, breaking changes will only land in major versions. See CHANGELOG.md for the release history.
License
Released under the MIT License.
Known limitations
- Pattern ordering matters — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
- AWS Secret Key (pattern 1) — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
- Duplicate digit patterns — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
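The ordering limitation is easy to reproduce outside the gem. The sketch below runs a 9-digit pattern ahead of a card pattern, first bare and then boundary-wrapped; the regexes and redact_in_order are illustrative assumptions, not the gem's built-in definitions.

```ruby
PLACEHOLDER  = "[REDACTED]"
CARD         = /4[0-9]{15}/                                   # illustrative 16-digit card shape
NINE_BARE    = /[0-9]{9}/                                     # 9-digit pattern, no boundary
NINE_WRAPPED = /(^|[^0-9A-Za-z])([0-9]{9})([^0-9A-Za-z]|$)/   # same pattern, boundary-wrapped

# Apply the patterns one after another, each on the previous result.
def redact_in_order(text, patterns)
  patterns.reduce(text) do |buffer, regex|
    # $1 and $3 are nil for unwrapped patterns and interpolate as "".
    buffer.gsub(regex) { "#{$1}#{PLACEHOLDER}#{$3}" }
  end
end

redact_in_order("card 4111111111111111", [NINE_BARE, CARD])
# => "card [REDACTED]1111111" (the early pattern ate 9 digits; the card pattern no longer matches)
redact_in_order("card 4111111111111111", [NINE_WRAPPED, CARD])
# => "card [REDACTED]" (the wrapped pattern skips the 16-digit run, so the card pattern fires)
```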