Project

pca

0.02
No commit activity in last 3 years
No release in over 3 years
Principal Component Analysis (PCA). Uses GSL for calculations.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Runtime

~> 2.1
 Project Readme

Principal Component Analysis (PCA)

Principal component analysis in Ruby. Uses GSL for calculations.

PCA can be used to map data to a lower dimensional space while minimizing information loss. It's useful for data visualization, where you're limited to 2-D and 3-D plots.

For example, here's a plot of the 4-D iris flower dataset mapped to 2-D via PCA:

iris

PCA is also used to compress the features of a dataset before feeding it into a machine learning algorithm, potentially speeding up training time with a minimal loss of data detail.

Install

GSL must be installed first. On OS X it can be installed via homebrew: brew install gsl

gem install pca

Example Usage

require 'pca'

pca = PCA.new components: 1

data_2d = [ 
  [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]
]

data_1d = pca.fit_transform data_2d

# Transforms 2d data into 1d:
# data_1d ~= [
#   [-0.8], [1.8], [-1.0], [-0.3], [-1.7],
#   [-0.9], [0.1], [1.1], [0.4], [1.2]
# ]

more_data_1d = pca.transform [ [3.1, 2.9] ]

# Transforms new data into previously fitted 1d space:
# more_data_1d ~= [ [-1.6] ]

reconstructed_2d = pca.inverse_transform data_1d

# Reconstructs original data (approximate, b/c data compression):
# reconstructed_2d ~= [
#   [2.4, 2.5], [0.6, 0.6], [2.5, 2.6], [2.0, 2.1], [2.9, 3.1]
#   [2.4, 2.6], [1.7, 1.8], [1.0, 1.1], [1.5, 1.6], [1.0, 1.0]
# ]

evr = pca.explained_variance_ratio

# Proportion of data variance explained by each component
# Here, the first component explains 99.85% of the data variance:
# evr ~= [0.99854]

See examples for more. Also, peruse the source code (~ 100 loc.)

Options

The following options can be passed in to PCA.new:

option default description
:components nil number of components to extract. If nil, will just rotate data onto first principal component
:scale_data false scales features before running PCA by dividing each feature by its standard deviation.

Working with Returned GSL::Matrix

PCA#transform, #fit_transform, #inverse_transform and #components return instances of GSL::Matrix.

Some useful methods to work with these are the #each_row and #each_col iterators, and the #row(i) and #col(i) accessors.

Or if you'd prefer to work with a standard Ruby Array, you can just call #to_a and get an array of row arrays.

See GSL::Matrix RDoc for more.

Plotting Results With GNUPlot

Requires GNUPlot and gnuplot gem.

require 'pca'
require 'gnuplot'

pca = PCA.new components: 2
data_2d = pca.fit_transform data

Gnuplot.open do |gp|
  Gnuplot::Plot.new(gp) do |plot|
    plot.title "Transformed Data"
    plot.terminal "png"
    plot.output "out.png"

    # Use #col accessor to get separate x and y arrays
    # #col returns a GSL::Vector, so be sure to call #to_a before passing to DataSet
    xy = [data_2d.col(0).to_a, data_2d.col(1).to_a]

    plot.data << Gnuplot::DataSet.new(xy) do |ds|
      ds.title = "Points"
    end
  end
end

Sources and Inspirations