 Dependencies

Runtime

~> 0.6
 Project Readme

KMeansClusterer

k-means clustering in Ruby. Uses NArray under the hood for fast calculations.

Jump to the examples directory to see this in action.

Features

  • Runs multiple clustering attempts to find the best solution (single runs are susceptible to falling into non-optimal local minima)
  • Initializes centroids via the k-means++ algorithm, for faster convergence
  • Calculates silhouette score for evaluation
  • Option to scale data before clustering, so that output isn't biased by different feature scales
  • Works with high-dimensional data
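To make the k-means++ initialization mentioned above more concrete, here is a minimal plain-Ruby sketch of the seeding step. It is an illustration only and not the gem's NArray-based code; the helper names squared_distance and kmpp_init are hypothetical. The first centroid is picked uniformly at random, and each later one is sampled with probability proportional to its squared distance from the nearest centroid chosen so far.

```ruby
# Illustrative k-means++ seeding in plain Ruby (not the gem's implementation).
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Hypothetical helper: returns k initial centroids drawn from points.
def kmpp_init(points, k, rng: Random.new)
  # First centroid: chosen uniformly at random.
  centroids = [points[rng.rand(points.size)]]
  while centroids.size < k
    # Weight of each point: squared distance to its nearest chosen centroid.
    weights = points.map { |p| centroids.map { |c| squared_distance(p, c) }.min }
    # Sample the next centroid with probability proportional to its weight.
    target = rng.rand * weights.sum
    cumulative = 0.0
    index = weights.index { |w| (cumulative += w) > target }
    # Fallback for rounding edge cases: take the farthest point.
    index ||= weights.each_index.max_by { |i| weights[i] }
    centroids << points[index]
  end
  centroids
end

data = [[40.71, -74.01], [34.05, -118.24], [39.29, -76.61],
        [45.52, -122.68], [38.9, -77.04], [36.11, -115.17]]
p kmpp_init(data, 2, rng: Random.new(42))
```

Points far from every existing centroid are more likely to be picked, which tends to spread the starting centroids across the data and is why this seeding usually converges in fewer iterations than purely random starts.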

Install

gem install kmeans-clusterer

Usage

Simple example:

require 'kmeans-clusterer'

data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
        [45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]

labels = ['New York', 'Los Angeles', 'Baltimore', 
          'Portland', 'Washington DC', 'Las Vegas']

k = 2 # find 2 clusters in data

kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5

kmeans.clusters.each do |cluster|
  puts  cluster.id.to_s + '. ' + 
        cluster.points.map(&:label).join(", ") + "\t" +
        cluster.centroid.to_s
end

# Use existing clusters for prediction with new data:
predicted = kmeans.predict [[41.85,-87.65]] # Chicago
puts "\nClosest cluster to Chicago: #{predicted[0]}"

# Clustering quality score. Value between -1.0..1.0 (1.0 is best)
puts "\nSilhouette score: #{kmeans.silhouette.round(2)}"

Output of simple example:

0. New York, Baltimore, Washington DC [39.63, -75.89]
1. Los Angeles, Portland, Las Vegas [38.56, -118.7]

Closest cluster to Chicago: 0

Silhouette score: 0.91
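For readers unfamiliar with the silhouette score, the following plain-Ruby sketch shows the textbook definition with Euclidean distance. It is an assumption-laden illustration, not the gem's code: the helper names euclidean and silhouette_score are hypothetical, and the gem may use a different distance or scaling, so the value computed here need not match the one it reports.

```ruby
# Illustrative silhouette score in plain Ruby (not the gem's implementation).
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# clusters: array of clusters, each an array of points. Needs >= 2 clusters.
def silhouette_score(clusters)
  scores = clusters.each_with_index.flat_map do |cluster, i|
    cluster.map do |point|
      next 0.0 if cluster.size == 1 # convention: singleton clusters score 0
      # a: mean distance to the other members of the point's own cluster.
      a = cluster.sum { |other| euclidean(point, other) } / (cluster.size - 1)
      # b: smallest mean distance to the members of any other cluster.
      b = clusters.each_with_index
                  .reject { |_, j| j == i }
                  .map { |other, _| other.sum { |q| euclidean(point, q) } / other.size }
                  .min
      denominator = [a, b].max
      denominator.zero? ? 0.0 : (b - a) / denominator
    end
  end
  # Overall score: mean of the per-point silhouette values.
  scores.sum / scores.size
end

east = [[40.71, -74.01], [39.29, -76.61], [38.9, -77.04]]
west = [[34.05, -118.24], [45.52, -122.68], [36.11, -115.17]]
puts silhouette_score([east, west]).round(2)
```

Each point scores close to 1.0 when it is much nearer to its own cluster than to the next closest one, near 0.0 when it sits on a boundary, and negative when it would fit better in another cluster.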

Options

The following options can be passed in to KMeansClusterer.run:

option            default    description
:labels           nil        optional array of Ruby objects to collate with data array
:runs             10         number of times to run kmeans
:log              false      print stats after each run
:init             :kmpp      algorithm for picking initial cluster centroids. Accepts :kmpp, :random, or an array of k centroids
:scale_data       false      scales features before clustering using formula (data - mean) / std
:float_precision  :double    float precision to use. :double or :single
:max_iter         300        max iterations per run
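The (data - mean) / std formula listed for :scale_data can be sketched in plain Ruby as below. This is an illustration rather than the gem's code: scale_features is a hypothetical helper, and it assumes the population standard deviation, which may differ from the estimator the gem uses.

```ruby
# Illustrative feature standardization in plain Ruby (not the gem's code).
# Hypothetical helper: rows is an array of equal-length numeric arrays.
def scale_features(rows)
  scaled_columns = rows.transpose.map do |column|
    mean = column.sum / column.size.to_f
    # Population standard deviation (an assumption; see the note above).
    std = Math.sqrt(column.sum { |v| (v - mean)**2 } / column.size)
    # A constant column has std 0; map it to zeros instead of dividing by 0.
    column.map { |v| std.zero? ? 0.0 : (v - mean) / std }
  end
  scaled_columns.transpose
end

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scale_features(rows).each { |row| p row }
```

After scaling, every feature has mean 0 and unit spread, so a feature measured in large units no longer dominates the distance calculations.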