visual HTML page structure recognizer

Pagerecognizer -- the visual web page structure recognizing A.I. tool

The idea is to forget that the DOM is a tree and view the page the way a human would. Then apply smart algorithms to recognize the main blocks that really form the UI. This is particularly useful in test automation because HTML/CSS internals change more frequently than the design.

Example of splitting into rows (see also examples/google.rb for some other details):

I'll show how to use this tool on www.google.com as an example. Its HTML may already have some convenient ids or classes, but let's pretend there are none. Currently the gem uses Ferrum, so you may already know some of the basic methods:

require "ferrum"
require "pagerecognizer"
Ferrum::Node.include PageRecognizer

browser = Ferrum::Browser.new
browser.goto "https://google.com/"

We've just added some methods to Ferrum::Node. Let's call the private method #recognize to export what the A.I. would see to an HTML file, like this:

File.write "dump.htm", browser.at_css("body").send(:recognize).dump

This is the view of node rectangles that the A.I. will use later for the recognition. Let's do a web search and see what it sees now:

browser.at_css("input[type=text]").focus.type "Ruby", :enter

Now let's try the magic method #rows and see whether it recognizes the search result sections of the page.

File.write "dump.htm", browser.at_css("body").rows([:AREA, :SIZE]).dump

:AREA and :SIZE are the recommended heuristics for the rows and cols methods; you can find others in the source code.

The Google Search page is complex today and, as you can see, with the default options it did not recognize the first result and misrecognized others. The misrecognized ones either have no blue hyperlinks or no text at all. What can we do? Each recognized node has a method #texts that gives access to its text blocks and their style. It also classifies each text's color as one of the 16 basic web colors. Let's use it and add a custom heuristic that hints the recognizer to process only nodes that contain both black and blue text:

results = browser.at_css("body").rows([:AREA, :SIZE]) do |node|
  colors = node.texts.map{ |text, style, color, | color }
  colors.any?{ |c| :black == c } &&
  colors.any?{ |c| :blue == c || :navy == c }
end
File.write "dump.htm", results.dump
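
The check for both :blue and :navy above makes sense once you see how a real link color falls between the basic colors. The following is only a sketch of one possible classification (nearest color by squared RGB distance); the name nearest_basic_color and the distance metric are assumptions for illustration, not the gem's actual implementation:

```ruby
# Hypothetical sketch: map an RGB triple to the nearest of the 16 basic
# web colors. The gem may classify colors differently.
BASIC_WEB_COLORS = {
  black:  [0, 0, 0],       silver:  [192, 192, 192],
  gray:   [128, 128, 128], white:   [255, 255, 255],
  maroon: [128, 0, 0],     red:     [255, 0, 0],
  purple: [128, 0, 128],   fuchsia: [255, 0, 255],
  green:  [0, 128, 0],     lime:    [0, 255, 0],
  olive:  [128, 128, 0],   yellow:  [255, 255, 0],
  navy:   [0, 0, 128],     blue:    [0, 0, 255],
  teal:   [0, 128, 128],   aqua:    [0, 255, 255],
}.freeze

# Returns the name (a Symbol) of the closest basic color.
def nearest_basic_color(r, g, b)
  BASIC_WEB_COLORS.min_by do |_name, (cr, cg, cb)|
    (r - cr)**2 + (g - cg)**2 + (b - cb)**2
  end.first
end

nearest_basic_color(26, 13, 171)  # => :navy   (a dark link blue, #1a0dab)
nearest_basic_color(32, 33, 36)   # => :black  (a near-black body text, #202124)
nearest_basic_color(0, 0, 255)    # => :blue
```

Under this metric a dark link blue lands on :navy rather than :blue, so a heuristic that accepts only :blue would miss such links.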

A custom heuristic not only helps the A.I. but may also make the recognition faster, because it leaves fewer nodes to process. It still picks some wrong nodes, though. So let's select only those nodes in which the biggest text is blue and occurs only once:

... do |node|
  texts = node.texts
  next if texts.none?{ |text, style, color, | :black == color }
  _, group = texts.group_by{ |text, style, | style["fontSize"].to_i }.to_a.max_by(&:first)
  next unless group
  next unless group.size == 1 && %i{ blue navy }.include?(group[0][2])
  true
end
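
The rule in the block above can be tried without a browser. The sketch below applies the same logic to plain [text, style, color] tuples, which is the shape the block destructures from node.texts; the method name search_result? and the sample data are made up for illustration:

```ruby
# Same rule as the block above, on plain [text, style, color] tuples.
# search_result? is a hypothetical helper, not part of the gem.
def search_result?(texts)
  # There must be some black text (the description).
  return false if texts.none? { |_text, _style, color| color == :black }
  # Take the group of texts that share the largest font size.
  _, group = texts.group_by { |_text, style, _color| style["fontSize"].to_i }
                  .to_a.max_by(&:first)
  return false unless group
  # The biggest text must be unique and blue-ish (the title link).
  group.size == 1 && %i[blue navy].include?(group[0][2])
end

title = ["Ruby Programming Language", { "fontSize" => "20px" }, :navy]
desc  = ["A dynamic, open source language", { "fontSize" => "14px" }, :black]

search_result?([title, desc])         # => true
search_result?([desc])                # => false (biggest text is black)
search_result?([title, title, desc])  # => false (biggest text occurs twice)
```

Note that "20px".to_i is 20 in Ruby, which is why the font sizes can be compared as integers.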

Perfect. Now we can reject the nodes with images, because we are not interested in video results, and then parse the results. Note that we call .node on each result: the result is a recognized object (a structure), while its .node is the actual Ferrum object:

results.reject{ |_| _.node.at_css "img" }.map do |result|
  [
    result.node.at_css("a").property("href")[0,40],
    result.texts.max_by{ |t, s, | s["fontSize"].to_i }[0].sub(/(.{40}) .+/, "\\1..."),
  ]
end
  https://ru.wikipedia.org/wiki/Ruby         Ruby - Википедия                                   
  https://www.ruby-lang.org/ru/              Язык программирования Ruby                         
  https://evrone.ru/why-ruby                 5 причин, почему мы выбираем Ruby - evrone.ru      
  https://habr.com/ru/hub/ruby/              Ruby — Динамический высокоуровневый язык...        
  https://ru.wikibooks.org/wiki/Ruby         Ruby - Викиучебник                                 
  https://context.reverso.net/%D0%BF%D0%B5   ruby - Перевод на русский - примеры английский...  
  https://web-creator.ru/articles/ruby       Язык программирования Ruby - Веб Креатор           
  https://ru.hexlet.io/courses/ruby          Введение в Ruby - Хекслет                          
  https://www.ozon.ru/product/yazyk-progra   Книга "Язык программирования Ruby" - OZON        
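
The titles in the right column are shortened by the sub(/(.{40}) .+/, "\\1...") call in the snippet above. As a small standalone illustration (the helper name shorten and the sample strings are made up here), the pattern cuts the string at the first space that has at least 40 characters before it, so words are never split:

```ruby
# Hypothetical helper wrapping the same substitution as in the snippet above.
# Cuts at the first space found at or after `limit` characters and appends "...".
def shorten(title, limit = 40)
  title.sub(/(.{#{limit}}) .+/, "\\1...")
end

shorten("Ruby is a dynamic, open source programming language with a focus on simplicity")
# => "Ruby is a dynamic, open source programming..."

shorten("Ruby - Wikipedia")
# => "Ruby - Wikipedia" (too short to match, left unchanged)
```

This is also why some shortened titles come out slightly longer than 40 characters: the cut waits for the end of the current word.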

We've just scraped the SERP knowing nothing about its DOM other than that there are big blue links with black descriptions!

Example of grid detection

browser.goto "https://youtube.com/"
grid = browser.at_css("#content").grid

grid.size              # => 24
grid.cols.size         # => 3
grid.cols.map(&:size)  # => [8, 8, 8]
grid.rows.size         # => 8
grid.rows.map(&:size)  # => [3, 3, 3, 3, 3, 3, 3, 3]
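
To make the numbers above easier to read, here is a toy model of a regular grid. It is only a sketch: Rect, rows_of and cols_of are hypothetical names, the tile coordinates are invented, and grouping by exact edge coordinates is a simplification, not the gem's detection algorithm:

```ruby
# Hypothetical stand-in for a recognized node's rectangle (not the gem's class).
Rect = Struct.new(:left, :top, :width, :height)

# Group rectangles that share a top edge into rows, ordered top to bottom.
def rows_of(rects)
  rects.group_by(&:top).sort_by(&:first).map(&:last)
end

# Group rectangles that share a left edge into columns, ordered left to right.
def cols_of(rects)
  rects.group_by(&:left).sort_by(&:first).map(&:last)
end

# 8 rows of 3 equally sized tiles, the same shape as the grid above.
TILES = 8.times.flat_map do |row|
  3.times.map { |col| Rect.new(col * 320, row * 280, 300, 260) }
end

TILES.size                   # => 24
cols_of(TILES).map(&:size)   # => [8, 8, 8]
rows_of(TILES).map(&:size)   # => [3, 3, 3, 3, 3, 3, 3, 3]
```

Every tile belongs to exactly one row and one column, so the row sizes and the column sizes both add up to the total of 24.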