Proper related posts plugin for Jekyll - uses document correlation matrix on TF-IDF (optionally with Latent Semantic Indexing).
I am going to try to start blogging, again. Anyway I am studying at Decision Support Systems Group and I have found document correlation problem somehow interesting.
How to install
Initially you had to install the plugin manually, however the plugin is a gem now - follow instructions to install the plugin:
- Install the gem
- if you are using bundler add
gem 'jekyll-related-posts'to your
- or install gem via
gem install jekyll-related-posts.
gems: ['jekyll-related-posts']to your
<related-posts />somewhere in your
jekyll build, don't forget to blog about the plugin!
You can customize default related posts template by creating
related.html in your layouts directory. Plugin behaviour can be
altered by options in
Basis of operation
TF-IDF matrix for the whole site is calculated (including extra provided weights), then if given accuraccy is lower than 1.0, LSI algorithm is used to compute new simplified vector space. Document correlation matrix is created using dot product of the matrix and its transpose.
For each of the post' related documents are inserted into priority queue (sorted by score from document correlation matrix), assuming the score is greater than minimal required score. Selected few bests related posts are retrieven from the queue.
Liquid template for each post is rendered and
<related-posts /> is
replaced with the outcomes of algorithm.
_config.yml file (under
related:) you can configure:
max_count: 5- maximum number of related posts,
min_score: 0.1- minimal required score to treat post as related,
accuracy: 0.75- percentage of keywords used as basis for document correlation matrix (if 1.0 then no LSI is computed, otherwise LSI is computed and dimensions are reduced to
accuracy * |keywords|)
You can configure weights of words providing dictionary with them to
weights. In example weight of
2 means for term frequency algorithm
that the word occurred twice as much in the document as in reality.
For casual blogs, performance should not be an issue.
I did not benchmark the plugin, however for the dataset given in the example (containing ~900 documents, ~7000 keywords) rendering time (including Jekyll hoodoo stuff) is more less 70 seconds (on Xeon, using 750MB RAM). Computation related to this plugin is about 20 seconds long. It should be noticed that I'm using OpenBLAS and standard LAPACK distributed with Ubuntu (performance is similar on OS X using builtin Acccelerate framework).
Unfortunately the plugin is not compatible with Jekyll 3.0 new incremental builds, even though it requires at least Jekyll 3.0 (for the plugin hooks).
- Amadeusz Juskowiak - juskowiak[at]amadeusz.me