No release in over 3 years
Low commit activity in last 3 years
There's a lot of open issues
Character encoding detection, brought to you by ICU
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

>= 2.0.0
 Project Readme

CharlockHolmes

Character encoding detecting library for Ruby using ICU

Usage

First you'll need to require it

require 'charlock_holmes'

Encoding detection

contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}

# optionally there will be a :language key as well, but
# that's mostly only returned for legacy encodings like ISO-8859-1

NOTE: CharlockHolmes::EncodingDetector.detect will return nil if it was unable to find an encoding.

For binary content, :type will be set to :binary

Though it's more efficient to reuse once detector instance:

detector = CharlockHolmes::EncodingDetector.new

detection1 = detector.detect(File.read('test.xml'))
detection2 = detector.detect(File.read('test2.json'))

# and so on...

String monkey patch

Alternatively, you can just use the detect_encoding method on the String class

require 'charlock_holmes/string'

contents = File.read('test.xml')

detection = contents.detect_encoding

Ruby 1.9 specific

NOTE: This method only exists on Ruby 1.9+

If you want to use this library to detect and set the encoding flag on strings, you can use the detect_encoding! method on the String class

require 'charlock_holmes/string'

contents = File.read('test.xml')

# this will detect and set the encoding of `contents`, then return self
contents.detect_encoding!

Transcoding

Being able to detect the encoding of some arbitrary content is nice, but what you probably want is to be able to transcode that content into an encoding your application is using.

content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'

The first parameter is the content to transcode, the second is the source encoding (the encoding the content is assumed to be in), and the third parameter is the destination encoding.

Installing

If the traditional gem install charlock_holmes doesn't work, you may need to specify the path to your installation of ICU using the --with-icu-dir option during the gem install or by configuring Bundler to pass those arguments to Gem:

Configure Bundler to always use the correct arguments when installing:

bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c

Using Gem to install directly without Bundler:

gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c

If you get a compile time error that looks like error: delegating constructors are permitted only in C++11 or something else related to C++11, you need to set the --with-cxxflags=-std=c++11 options

Bundler:

bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c --with-cxxflags=-std=c++11

Installing directly:

gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c --with-cxxflags=-std=c++11

Homebrew

If you're installing on Mac OS X then using Homebrew is the easiest way to install ICU.

However, be warned; it is a Keg-Only (see homedir issue #167 for more info) install meaning RubyGems won't find it when installing without specifying --with-icu-dir

To install ICU with Homebrew:

brew install icu4c

Configure Bundler to always use the correct arguments when installing:

bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c

Using Gem to install directly without Bundler:

gem install charlock_holmes -- --with-icu-dir=/usr/local/opt/icu4c