0.0
No release in over a year
To use this gem is required the file`vectors.bin` where is stored the output of the Google algorithm called `word2vec`. This gem doesn't produce this file. Once produced, this can can load it and use it to calculate some arithmetic operations like distance between words or to calculate the relations between them.'
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

~> 2.1.0
~> 12.0
~> 3.0
 Project Readme

word2vec-rb

Gem using word2vec functionality from https://code.google.com/archive/p/word2vec/

This gem was developed using the .c files of the Google word2vec as base. Mostly by applying copy-and-paste.

Installation

Add this line to your application's Gemfile:

gem 'word2vec-rb'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install word2vec-rb

Usage

Distance arithmetic: to find the nearest words, try:

require 'word2vec'

model = Word2vec::Model.load("./data/minimal.bin")
words = model.distance("from")
words.each do |w| 
  puts "#{w.first} #{w.last}"
end

Analogy arithmetic: to find the analogy with three words, try:

require 'word2vec'

model = Word2vec::Model.load("./data/minimal.bin")
words = model.analogy("spain", "madrid", "france")
# In a well prepared vectors file (high quality), first word would be "Paris"
words.each do |w| 
  puts "#{w.first} #{w.last}"
end

Accuray: test accuracy of the vectors:

Define a file with the analogies to test, format: : section heading Word1 Word2 Word3 Word4

Sample:

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
require 'word2vec'

model = Word2vec::Model.load(file_name)
model.accuracy("./data/questions-words.txt")

# Outputs the results on terminal

Vocabulary: create a vocabulary file from a train file:

require 'word2vec'

Word2vec::Model.build\_vocab("./data/text7", "./data/vocab.txt")

The output file will have a list of words and its number of appearances separated by line break.

Tokenizer: create a binary file by tokenizing an input file

This method requires a vocabulary file precreated.

require 'word2vec'

Word2vec::Model.tokenize("./data/text7", "./data/vocab.txt", "./data/tokenized.bin")

The output file will contain a sequence of binary identificators of each word of the input file.

Read output file with:

long long id;
fread(&id, sizeof(id), 1, fi);

Load the word2vec output bin file (vectors.bin), into ruby array

require 'word2vec'

vector_array = Word2vec::load_vectors("./data/minimal.bin")

The vector_array variable will contain an array of pairs with the vocab and the vector the float values of each word.

Set parameter normalize: true to normalize the vectors.

require 'word2vec'

vector_array = Word2vec::Model.load_vectors("./data/minimal.bin", normalize: true)

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Build extension

$ rake build

Launch tests

$ rake spec

Build extension

$ rake compile

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/madcato/word2vec-rb