Detect encoding from BOM and heuristics with confidence scores, convert between encodings, normalize to UTF-8, analyze byte distributions, and handle Windows codepages. Zero dependencies.

philiprehberger-encoding_kit


Character encoding detection, conversion, and normalization

Requirements

  • Ruby >= 3.1

Installation

Add to your Gemfile:

gem "philiprehberger-encoding_kit"

Or install directly:

gem install philiprehberger-encoding_kit

Usage

require "philiprehberger/encoding_kit"

result = Philiprehberger::EncodingKit.detect(raw_bytes)
result.encoding   # => Encoding::UTF_8
result.confidence # => 0.9
utf8 = Philiprehberger::EncodingKit.to_utf8(raw_bytes)

Encoding Detection with Confidence

require "philiprehberger/encoding_kit"

# Returns a DetectionResult that delegates to Encoding
result = Philiprehberger::EncodingKit.detect("\xEF\xBB\xBFhello".b)
result == Encoding::UTF_8  # => true (backward compatible)
result.confidence          # => 1.0 (BOM detected)
result.name                # => "UTF-8"
result.to_h                # => {encoding: Encoding::UTF_8, confidence: 1.0}

# Heuristic detection returns lower confidence
result = Philiprehberger::EncodingKit.detect("caf\xC3\xA9".b)
result.confidence # => between 0.85 and 0.9
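
To make the BOM-then-heuristics idea concrete, here is a minimal standard-library sketch. It is not the gem's implementation: `sketch_detect`, the `SKETCH_BOMS` table, and the confidence values for the heuristic branches are assumptions chosen for illustration, and UTF-32 BOMs are deliberately left out.

```ruby
# Hypothetical helper, not part of the gem's API.
# Known BOM byte sequences mapped to their encodings (UTF-32 omitted for brevity).
SKETCH_BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}.freeze

def sketch_detect(bytes)
  raw = bytes.b # work on a binary copy so the caller's string is untouched

  # 1. A BOM is treated as certain evidence.
  SKETCH_BOMS.each do |bom, encoding|
    return { encoding: encoding, confidence: 1.0 } if raw.start_with?(bom)
  end

  # 2. Heuristics; the confidence numbers below are illustrative only.
  return { encoding: Encoding::US_ASCII, confidence: 0.9 } if raw.ascii_only?

  if raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?
    return { encoding: Encoding::UTF_8, confidence: 0.9 }
  end

  # 3. Fall back to a single-byte encoding with low confidence.
  { encoding: Encoding::ISO_8859_1, confidence: 0.5 }
end

p sketch_detect("\xEF\xBB\xBFhello")
p sketch_detect("caf\xC3\xA9")
p sketch_detect("caf\xE9")
```

A real detector would weigh more candidates (for example Windows codepages) before falling back; the sketch only shows why a BOM match can report full confidence while a heuristic match cannot.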

Streaming Detection

require "philiprehberger/encoding_kit"

File.open("data.csv", "rb") do |file|
  result = Philiprehberger::EncodingKit.detect_stream(file, sample_size: 8192)
  result.encoding   # => Encoding::UTF_8
  result.confidence # => 0.9
end
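
Sampling a stream has one subtlety worth illustrating: a fixed-size sample can end in the middle of a multibyte character. The sketch below uses only the standard library; `sketch_sample` and `sketch_trim_partial_utf8` are hypothetical helpers, and whether the gem rewinds the IO or trims the sample this way is an assumption.

```ruby
require "stringio"

# Hypothetical helper: read at most sample_size bytes, then restore the IO position.
def sketch_sample(io, sample_size: 4096)
  start = io.pos
  sample = io.read(sample_size) || "".b # read returns nil at EOF
  io.seek(start)
  sample.b
end

# Hypothetical helper: drop up to 3 trailing bytes if that makes the sample valid UTF-8.
# If no trimmed variant is valid, the sample is returned unchanged.
def sketch_trim_partial_utf8(sample)
  0.upto(3) do |drop|
    candidate = sample.byteslice(0, sample.bytesize - drop)
    break if candidate.nil? # sample shorter than the number of bytes to drop
    return candidate if candidate.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end
  sample
end

io = StringIO.new(("caf\xC3\xA9,1\n" * 100).b)
sample  = sketch_sample(io, sample_size: 12) # ends on the lone lead byte 0xC3
trimmed = sketch_trim_partial_utf8(sample)

puts sample.bytesize  # 12
puts trimmed.bytesize # 11
puts io.pos           # 0
```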

Encoding Analysis

require "philiprehberger/encoding_kit"

analysis = Philiprehberger::EncodingKit.analyze(raw_bytes)
analysis[:encoding]        # => Encoding::UTF_8
analysis[:confidence]      # => 0.9
analysis[:printable_ratio] # => 0.95
analysis[:ascii_ratio]     # => 0.8
analysis[:high_bytes]      # => 12
analysis[:candidates]      # => [{encoding: Encoding::UTF_8, confidence: 0.9}, ...]
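
The byte-distribution fields can be approximated with plain Ruby. This is a sketch under stated assumptions: `sketch_byte_stats` is a hypothetical helper, and the definition of "printable" used here (ASCII 0x20 to 0x7E plus tab, LF, and CR) may differ from the gem's.

```ruby
# Hypothetical helper, not part of the gem's API.
def sketch_byte_stats(bytes)
  raw = bytes.b
  total = raw.bytesize
  return { ascii_ratio: 0.0, printable_ratio: 0.0, high_bytes: 0 } if total.zero?

  # Bytes below 0x80 are plain ASCII; everything else counts as a "high" byte.
  ascii = raw.each_byte.count { |b| b < 0x80 }

  # Assumed definition of printable: visible ASCII plus tab, LF, and CR.
  printable = raw.each_byte.count do |b|
    b.between?(0x20, 0x7E) || [0x09, 0x0A, 0x0D].include?(b)
  end

  {
    ascii_ratio: ascii.fdiv(total),
    printable_ratio: printable.fdiv(total),
    high_bytes: total - ascii
  }
end

p sketch_byte_stats("caf\xC3\xA9") # 5 bytes total, 3 ASCII, 2 high
```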

Transcode

require "philiprehberger/encoding_kit"

# Auto-detect source, convert to UTF-8
utf8 = Philiprehberger::EncodingKit.transcode(raw_bytes)

# Convert to a specific encoding
latin1 = Philiprehberger::EncodingKit.transcode(utf8_string, to: Encoding::ISO_8859_1)

# Custom fallback behavior
result = Philiprehberger::EncodingKit.transcode(data, to: "UTF-8", fallback: :replace, replace: "?")
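
For comparison, replacement-on-failure behavior can be reproduced with Ruby's built-in `String#encode`. The sketch below assumes that `fallback: :replace` behaves like `invalid: :replace, undef: :replace`; that mapping is a guess, and `sketch_transcode` is a hypothetical wrapper, not the gem's method.

```ruby
# Hypothetical wrapper around String#encode, not part of the gem's API.
def sketch_transcode(string, to:, replace: "?")
  # invalid: bytes that are malformed in the source encoding
  # undef:   characters that have no mapping in the target encoding
  string.encode(to, invalid: :replace, undef: :replace, replace: replace)
end

euro = "price: 5\u20AC" # the euro sign has no ISO-8859-1 mapping

converted = sketch_transcode(euro, to: Encoding::ISO_8859_1)
puts converted.encoding # ISO-8859-1
puts converted.b        # price: 5?

# Without the replacement options, the same conversion raises.
begin
  euro.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError => e
  puts "strict conversion failed: #{e.class}"
end
```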

Convert to UTF-8

require "philiprehberger/encoding_kit"

# Auto-detect source encoding
utf8 = Philiprehberger::EncodingKit.to_utf8(raw_bytes)

# Specify source encoding
utf8 = Philiprehberger::EncodingKit.to_utf8(latin1_string, from: Encoding::ISO_8859_1)

Normalize

require "philiprehberger/encoding_kit"

# Replace invalid/undefined bytes with U+FFFD
clean = Philiprehberger::EncodingKit.normalize("hello \xFF world".b)
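
The same effect can be sketched with the standard library's `String#scrub`, which replaces malformed byte sequences. `sketch_normalize` is a hypothetical helper; the gem may handle additional cases (for example detecting the source encoding first) that this sketch does not.

```ruby
# Hypothetical helper, not part of the gem's API.
def sketch_normalize(bytes)
  # Reinterpret the bytes as UTF-8, then replace each malformed sequence with U+FFFD.
  bytes.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

clean = sketch_normalize("hello \xFF world".b)
puts clean.valid_encoding?          # true
puts clean == "hello \uFFFD world"  # true
```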

Convert Between Encodings

require "philiprehberger/encoding_kit"

latin1 = Philiprehberger::EncodingKit.convert(utf8_string, from: Encoding::UTF_8, to: Encoding::ISO_8859_1)

BOM Handling

require "philiprehberger/encoding_kit"

Philiprehberger::EncodingKit.bom?("\xEF\xBB\xBFhello")       # => true
Philiprehberger::EncodingKit.strip_bom("\xEF\xBB\xBFhello")  # => "hello"

Validity Check

require "philiprehberger/encoding_kit"

Philiprehberger::EncodingKit.valid?("hello")                                # => true
Philiprehberger::EncodingKit.valid?("\xFF\xFE".force_encoding("UTF-8"))     # => false
Philiprehberger::EncodingKit.valid?("hello", encoding: Encoding::US_ASCII)  # => true
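
A rough standard-library equivalent is shown below, assuming that the `encoding:` option reinterprets the bytes rather than transcoding them; `sketch_valid?` is a hypothetical helper, not the gem's method.

```ruby
# Hypothetical helper, not part of the gem's API.
def sketch_valid?(string, encoding: nil)
  return string.valid_encoding? if encoding.nil?

  # Reinterpret a copy of the bytes in the requested encoding; no transcoding happens.
  string.dup.force_encoding(encoding).valid_encoding?
end

puts sketch_valid?("hello")                                   # true
puts sketch_valid?("\xFF\xFE".dup.force_encoding("UTF-8"))    # false
puts sketch_valid?("hello", encoding: Encoding::US_ASCII)     # true
puts sketch_valid?("caf\u00E9", encoding: Encoding::US_ASCII) # false
```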

API

| Method | Description |
|--------|-------------|
| EncodingKit.detect(string) | Detect encoding via BOM and heuristics, returns a DetectionResult with .encoding and .confidence |
| EncodingKit.detect_stream(io, sample_size: 4096) | Detect encoding from an IO stream by sampling bytes |
| EncodingKit.analyze(string) | Analyze byte distribution and return encoding candidates with stats |
| EncodingKit.transcode(string, to:, fallback:, replace:) | Auto-detect source and convert to target encoding |
| EncodingKit.to_utf8(string, from: nil) | Convert to UTF-8, auto-detect source if from is nil |
| EncodingKit.normalize(string) | Force to valid UTF-8, replacing bad bytes with U+FFFD |
| EncodingKit.valid?(string, encoding: nil) | Check if string is valid in given or current encoding |
| EncodingKit.convert(string, from:, to:) | Convert between arbitrary encodings |
| EncodingKit.strip_bom(string) | Remove byte order mark if present |
| EncodingKit.bom?(string) | Check if string starts with a BOM |

Development

bundle install
bundle exec rspec
bundle exec rubocop

Support

If you find this project useful:

Star the repo

🐛 Report issues

💡 Suggest features

❤️ Sponsor development

🌐 All Open Source Projects

💻 GitHub Profile

🔗 LinkedIn Profile

License

MIT