UNMAINTAINED

Ruby bindings for Google's PDFium project

This allows Ruby to efficiently extract information from PDF files.

It currently has only very rudimentary PDF editing capabilities.

API Documentation is also available and the test directory has examples of usage.

Installing

The gem requires both the PDFium and freeimage libraries.

An Ubuntu PPA is available for PDFium.

Freeimage should be installable via system packages.

In-memory rendering and extraction

# Assuming AWS::S3 is already authorized elsewhere
bucket = AWS::S3.new.buckets['my-pdfs']

pdf = PDFium::Document.from_memory bucket.objects['secrets.pdf'].read
pdf.pages.each do | page |

  # render the complete page as a PNG with the height locked to 1000 pixels
  # The width will be calculated to maintain the proper aspect ratio
  path = "secrets/page-#{page.number}.png"
  bucket.objects[path].write page.as_image(height: 1000).data(:png)

  # extract and save each embedded image as a PNG
  page.images.each do | image |
    path = "secrets/page-#{page.number}-image-#{image.index}.png"
    bucket.objects[path].write image.data(:png)
  end

  # Extract text from page.  Will be encoded as UTF-16LE by default
  path = "secrets/page-#{page.number}-text.txt"
  bucket.objects[path].write page.text

end
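
The aspect-ratio comment in the example above comes down to simple arithmetic. The sketch below is illustrative only: `width_for_height` is a hypothetical helper, not part of the gem, and the gem's own rounding may differ.

```ruby
# Hypothetical helper (not part of the gem): given the page's native size,
# compute the width that preserves the aspect ratio for a fixed height.
def width_for_height(page_width, page_height, target_height)
  raise ArgumentError, "page_height must be positive" unless page_height.positive?
  (page_width.to_f * target_height / page_height).round
end

# A US Letter page measures 612 x 792 points.
puts width_for_height(612, 792, 1000)
```

Locking only one dimension keeps the call site simple and avoids distorting pages of mixed sizes within the same document.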

Opening and saving

pdf = PDFium::Document.new("test.pdf")
pdf.save

Document information

Page count:

pdf.page_count

PDF Metadata:

pdf.metadata

Returns a hash with the keys :title, :author, :subject, :keywords, :creator, :producer, :creation_date, :mod_date
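
As a rough sketch of how such a hash might be consumed, the snippet below prints only the fields that are present. The sample hash and the `metadata_summary` helper are hypothetical. Real values come from `pdf.metadata` and may differ in type or formatting.

```ruby
# Hypothetical sample shaped like the hash described above.
sample_metadata = {
  title: "Annual Report",
  author: "Jane Doe",
  subject: "",
  keywords: "finance, 2015",
  creator: nil,
  producer: nil,
  creation_date: nil,
  mod_date: nil
}

# Hypothetical helper: build "Label: value" lines, skipping blank fields.
def metadata_summary(metadata)
  metadata
    .reject { |_key, value| value.nil? || value.to_s.strip.empty? }
    .map { |key, value| "#{key.to_s.tr('_', ' ').capitalize}: #{value}" }
end

puts metadata_summary(sample_metadata)
```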

Bookmarks

def print_bookmarks(list, indent=0)
    list.bookmarks.each do | bm |
        print ' ' * indent
        puts bm.title
        # recurse into the children, indented one level deeper
        print_bookmarks( bm.children, indent + 2 )
    end
end
print_bookmarks( pdf.bookmarks )
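
The same recursive walk can be tried without a PDF at hand. This is a minimal sketch that uses a `Struct` as a stand-in for bookmark objects. `OutlineNode` and `outline_lines` are hypothetical names, and the gem's real classes may expose a different interface.

```ruby
# Stand-in for a bookmark: a title plus a (possibly empty) list of children.
OutlineNode = Struct.new(:title, :children)

# Return the outline as an array of lines, indenting two spaces per level.
def outline_lines(nodes, indent = 0)
  nodes.flat_map do |node|
    [(' ' * indent) + node.title] + outline_lines(node.children, indent + 2)
  end
end

tree = [
  OutlineNode.new("Chapter 1", [
    OutlineNode.new("Section 1.1", []),
    OutlineNode.new("Section 1.2", [])
  ]),
  OutlineNode.new("Chapter 2", [])
]

puts outline_lines(tree)
```

Returning lines instead of printing inside the recursion makes the traversal easy to test and to reuse, for example when writing a table of contents to a file.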

Render page as an image

pdf.each_page do | page |
    page.as_image(width: 800).save("test-#{page.number}.png")
end

Extract embedded images from page

doc = PDFium::Document.new("test.pdf")
page = doc.page_at(0)
page.images.each do |image|
    image.save("page-0-image-#{image.index}.png")
end

Text access

Text is returned as a UTF-16LE encoded string. Future versions may return position information as well.

pdf.page_at(0).text.encode!("ASCII-8BIT")
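
Because the text arrives as UTF-16LE, it usually needs transcoding before it is printed or compared with ordinary Ruby strings. The sketch below uses only Ruby's built-in `String#encode`. The `raw` string is a stand-in for the value of `page.text`, so no PDF is required.

```ruby
# Stand-in for page.text: a UTF-16LE encoded string with one non-ASCII character.
raw = "Caf\u00e9, page 1".encode("UTF-16LE")

# Transcode to UTF-8 for general use.
utf8 = raw.encode("UTF-8")

# For ASCII-only output, substitute characters that cannot be represented
# instead of raising Encoding::UndefinedConversionError.
ascii = raw.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")

puts utf8
puts ascii
```

Supplying the `undef: :replace` option matters for real documents, since extracted text often contains typographic quotes, ligatures, or accented characters that have no ASCII equivalent.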


Development