Structify
A Ruby gem for extracting structured data from content using LLMs in Rails applications
What is Structify?
Structify helps you extract structured data from unstructured content in your Rails apps:
- Define extraction schemas directly in your ActiveRecord models
- Generate JSON schemas to use with OpenAI, Anthropic, or other LLM providers
- Store and validate extracted data with ActiveRecord validations
- Access structured data through typed model attributes with full validation support
Use Cases
- Extract metadata, topics, and sentiment from articles or blog posts
- Pull structured information from user-generated content
- Organize unstructured feedback or reviews into categorized data
- Convert emails or messages into actionable, structured formats
- Extract entities and relationships from documents
# 1. Define extraction schema in your model
class Article < ApplicationRecord
  include Structify::Model

  schema_definition do
    field :title, :string
    field :summary, :text
    field :category, :string, enum: ["tech", "business", "science"]
    field :topics, :array, items: { type: "string" }
  end
end

# 2. Get schema for your LLM API
schema = Article.json_schema

# 3. Store LLM response in your model
article = Article.find(123)
article.update(llm_response)

# 4. Access extracted data
article.title   # => "AI Advances in 2023"
article.summary # => "Recent developments in artificial intelligence..."
article.topics  # => ["machine learning", "neural networks", "computer vision"]
Install
# Add to Gemfile
gem 'structify'
Then:
bundle install
Database Setup
Add a JSON column to store extracted data:
add_column :articles, :json_attributes, :jsonb # PostgreSQL (default column name)
# or
add_column :articles, :json_attributes, :json # MySQL (default column name)
# Or if you configure a custom column name:
add_column :articles, :custom_json_column, :jsonb # PostgreSQL
Configuration
Structify can be configured in an initializer:
# config/initializers/structify.rb
Structify.configure do |config|
  # Configure the default JSON container attribute (default: :json_attributes)
  config.default_container_attribute = :custom_json_column
end
Usage
Define Your Schema
class Article < ApplicationRecord
  include Structify::Model

  schema_definition do
    version 1
    name "ArticleExtraction"

    field :title, :string, required: true
    field :summary, :text
    field :category, :string, enum: ["tech", "business", "science"]
    field :topics, :array, items: { type: "string" }
    field :metadata, :object, properties: {
      "author" => { type: "string" },
      "published_at" => { type: "string" }
    }
  end
end
Get Schema for LLM API
Structify generates the JSON schema that you'll need to send to your LLM provider:
# Get JSON Schema to send to OpenAI, Anthropic, etc.
schema = Article.json_schema
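For orientation, the sketch below hand-builds a JSON Schema hash for the Article fields defined above, using only Ruby's standard library. The exact structure that Article.json_schema returns is not shown in this README, so treat the shape and the ARTICLE_SCHEMA_SKETCH name as assumptions and inspect the real output in your own app.

```ruby
require "json"

# Hand-written approximation of a JSON Schema for the Article fields above.
# NOTE: illustrative only; Structify's generated schema may be shaped differently.
ARTICLE_SCHEMA_SKETCH = {
  type: "object",
  required: ["title"],
  properties: {
    title:    { type: "string" },
    summary:  { type: "string" },
    category: { type: "string", enum: ["tech", "business", "science"] },
    topics:   { type: "array", items: { type: "string" } },
    metadata: {
      type: "object",
      properties: {
        author:       { type: "string" },
        published_at: { type: "string" }
      }
    }
  }
}.freeze

# Serialize the way you would before embedding it in an API request body.
schema_json = JSON.pretty_generate(ARTICLE_SCHEMA_SKETCH)
puts schema_json
```

Printing the schema once during development is a quick way to check that enums and required fields look the way your provider expects.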
Integration with LLM Services
You need to implement the actual LLM integration. Here's how you can integrate with popular services:
OpenAI Integration Example
require "json"
require "openai"

class OpenAiExtractor
  def initialize(api_key = ENV["OPENAI_API_KEY"])
    @client = OpenAI::Client.new(access_token: api_key)
  end

  def extract(content, model_class)
    # Get schema from Structify model
    schema = model_class.json_schema

    # Call OpenAI with structured outputs.
    # Structured outputs use type "json_schema"; "json_object" does not accept a schema.
    # Assumes model_class.json_schema returns a plain JSON Schema hash.
    response = @client.chat(
      parameters: {
        model: "gpt-4o",
        response_format: {
          type: "json_schema",
          json_schema: { name: "extraction", schema: schema }
        },
        messages: [
          { role: "system", content: "Extract structured information from the provided content." },
          { role: "user", content: content }
        ]
      }
    )

    # Parse and return the structured data
    JSON.parse(response.dig("choices", 0, "message", "content"), symbolize_names: true)
  end
end

# Usage
extractor = OpenAiExtractor.new
article = Article.find(123)
extracted_data = extractor.extract(article.content, Article)
article.update(extracted_data)
Anthropic Integration Example
require "anthropic"

class AnthropicExtractor
  def initialize(api_key = ENV["ANTHROPIC_API_KEY"])
    @client = Anthropic::Client.new(api_key: api_key)
  end

  def extract(content, model_class)
    # Get schema from Structify model
    schema = model_class.json_schema

    # Call Claude with tool use.
    # Anthropic tools take name/description/input_schema (not the OpenAI "function" wrapper).
    # Assumes model_class.json_schema returns a plain JSON Schema hash.
    response = @client.messages.create(
      model: "claude-3-opus-20240229",
      max_tokens: 1000,
      system: "Extract structured data based on the provided schema.",
      messages: [{ role: "user", content: content }],
      tools: [{
        name: "extract_data",
        description: "Extract structured data from content",
        input_schema: schema
      }],
      tool_choice: { type: "tool", name: "extract_data" }
    )

    # The tool_use content block carries the extracted arguments in `input`.
    # Exact accessor names may vary with the client gem and version you use.
    tool_use = response.content.find { |block| block.type.to_s == "tool_use" }
    tool_use.input
  end
end
Store & Access Extracted Data
# Store LLM response in your model
article.update(response)
# Access via model attributes
article.title # => "How AI is Changing Healthcare"
article.category # => "tech"
article.topics # => ["machine learning", "healthcare"]
# All data is in the JSON column (default column name: json_attributes)
article.json_attributes # => The complete JSON
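LLM output often arrives as a JSON string and may contain keys your schema does not define. The sketch below, using only the standard library, parses the payload and keeps only the expected fields before you pass it to update. Both normalize_llm_response and ALLOWED_FIELDS are hypothetical names for illustration; they are not part of Structify.

```ruby
require "json"

# Hypothetical allow-list; mirror the fields in your schema_definition.
ALLOWED_FIELDS = %i[title summary category topics metadata].freeze

# Accepts a JSON string or a Hash and returns a Hash with symbol keys
# restricted to ALLOWED_FIELDS. Returns nil if the string is not valid JSON
# or does not decode to a JSON object.
def normalize_llm_response(raw)
  data = raw.is_a?(String) ? JSON.parse(raw) : raw
  return nil unless data.is_a?(Hash)

  data.transform_keys(&:to_sym).slice(*ALLOWED_FIELDS)
rescue JSON::ParserError
  nil
end

raw = '{"title": "How AI is Changing Healthcare", "category": "tech", "confidence": 0.9}'
attrs = normalize_llm_response(raw)
p attrs
```

With a helper like this, the storing step could become article.update(attrs) if attrs, leaving unparseable responses for your retry logic.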
Field Types
Structify supports the standard JSON Schema value types (examples elsewhere in this README also use :text):
field :name, :string # String values
field :count, :integer # Integer values
field :price, :number # Numeric values (float/int)
field :active, :boolean # Boolean values
field :metadata, :object # JSON objects
field :tags, :array # Arrays
Field Options
# Required fields
field :title, :string, required: true
# Enum values
field :status, :string, enum: ["draft", "published", "archived"]
# Array constraints
field :tags, :array,
  items: { type: "string" },
  min_items: 1,
  max_items: 5,
  unique_items: true

# Nested objects
field :author, :object, properties: {
  "name" => { type: "string", required: true },
  "email" => { type: "string" }
}
Field Validations
Structify leverages attr_json's integration with ActiveRecord validations to provide comprehensive field-level validation:
schema_definition do
  # Basic validations
  field :email, :string, required: true, validations: {
    format: /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
  }

  # Length validations
  field :title, :string, validations: {
    length: { minimum: 5, maximum: 200 }
  }

  # Numeric validations
  field :age, :integer, validations: {
    numericality: { greater_than_or_equal_to: 18 }
  }

  # Custom validations
  field :url, :string, validations: {
    custom: ->(record, field_name) {
      value = record.send(field_name)
      next if value.nil?

      # URI.parse raises on malformed input, so treat that as invalid too
      host = begin
        URI.parse(value).host
      rescue URI::InvalidURIError
        nil
      end
      record.errors.add(field_name, "must be a valid URL") unless host
    }
  }
end
Array Validations
Arrays have special validation support:
field :tags, :array,
  min_items: 1,
  max_items: 10,
  unique_items: true,
  validations: {
    custom: ->(record, field_name) {
      tags = record.send(field_name) || []
      tags.each do |tag|
        unless tag.is_a?(String) && tag.length >= 2
          record.errors.add(field_name, "items must be strings with 2+ characters")
        end
      end
    }
  }
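To make the meaning of min_items, max_items, and unique_items concrete, here is a minimal plain-Ruby sketch of the checks those options imply. The array_constraint_errors helper is hypothetical and is not how Structify implements them; it only mirrors the option names used above.

```ruby
# Returns a list of human-readable problems for the given array.
# The option names mirror the field options above; the function itself
# is a hypothetical helper, not Structify's implementation.
def array_constraint_errors(items, min_items: nil, max_items: nil, unique_items: false)
  errors = []
  errors << "must have at least #{min_items} items" if min_items && items.size < min_items
  errors << "must have at most #{max_items} items" if max_items && items.size > max_items
  errors << "must contain unique items" if unique_items && items.uniq.size != items.size
  errors
end

p array_constraint_errors(%w[ruby rails], min_items: 1, max_items: 10, unique_items: true)
p array_constraint_errors(%w[ruby ruby], min_items: 1, max_items: 10, unique_items: true)
p array_constraint_errors([], min_items: 1)
```

The first call passes every check, the second trips only the uniqueness check, and the third trips only the minimum-size check.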
Nested Model Validations
When using nested models, their validations are automatically applied:
class Address
  include AttrJson::Model

  attr_json :street, :string
  attr_json :city, :string

  validates :street, :city, presence: true
end

# In your schema:
field :address, Address.to_type, required: true
See the validation guide for comprehensive documentation.
Chain of Thought Mode
Structify supports a "thinking" mode that automatically requests chain of thought reasoning from the LLM:
schema_definition do
  version 1
  thinking true # Enable chain of thought reasoning

  field :title, :string, required: true
  # other fields...
end
Chain of thought (COT) reasoning can help because it:
- Adds more context to the extraction process
- Helps the LLM think through problems more systematically
- Can improve accuracy for complex extractions
- Makes the reasoning process transparent and explainable
- Can reduce hallucinations by encouraging step-by-step thinking
This is especially useful when:
- Answers need more detailed information
- Questions require multi-step reasoning
- Extractions involve complex decision-making
- You need to understand how the LLM reached its conclusions
For best results, include instructions for COT in your base system prompt:
system_prompt = "Extract structured data from the content.
For each field, think step by step before determining the value."
You can generate effective chain of thought prompts using tools like the Claude Prompt Designer.
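As a small illustration of the system prompt advice above, the sketch below assembles a chain of thought prompt from a list of field names. The build_cot_prompt helper and its argument are hypothetical; Structify does not provide them, and you would pass the resulting string to whichever LLM client you use.

```ruby
# Builds a system prompt that asks for step-by-step reasoning per field.
# `build_cot_prompt` and its argument are hypothetical, for illustration.
def build_cot_prompt(field_names)
  field_list = field_names.map { |name| "- #{name}" }.join("\n")
  <<~PROMPT
    Extract structured data from the content.
    For each field, think step by step before determining the value.

    Fields to extract:
    #{field_list}
  PROMPT
end

puts build_cot_prompt(%i[title summary category])
```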
Schema Versioning and Field Lifecycle
Structify provides a simple field lifecycle management system using a versions parameter:
schema_definition do
  version 3

  # Fields for specific version ranges
  field :title, :string                         # Available in all versions (default behavior)
  field :legacy, :string, versions: 1...3       # Only in versions 1-2 (removed in v3)
  field :summary, :text, versions: 2            # Added in version 2 onwards
  field :content, :text, versions: 2..          # Added in version 2 onwards (endless range)
  field :temp_field, :string, versions: 2..3    # Only in versions 2-3
  field :special, :string, versions: [1, 3, 5]  # Only in versions 1, 3, and 5
end
Version Range Syntax
Structify supports several ways to specify which versions a field is available in:
Syntax | Example | Meaning |
---|---|---|
No version specified | field :title, :string | Available in all versions (default) |
Single integer | versions: 2 | Available from version 2 onwards |
Range (inclusive) | versions: 1..3 | Available in versions 1, 2, and 3 |
Range (exclusive) | versions: 1...3 | Available in versions 1 and 2 (not 3) |
Endless range | versions: 2.. | Available from version 2 onwards |
Array | versions: [1, 4, 7] | Only available in versions 1, 4, and 7 |
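The rules in the table above can be sketched as a single matching function in plain Ruby. This is a hypothetical helper (version_matches? is not a Structify method) meant only to show how each kind of versions value could be interpreted; Structify's internal logic may differ.

```ruby
# Returns true if `version` satisfies `spec`, following the table above.
# Hypothetical helper for illustration; not Structify's internal code.
def version_matches?(version, spec)
  case spec
  when nil     then true                   # no versions: option => all versions
  when Integer then version >= spec        # single integer => that version onwards
  when Range   then spec.cover?(version)   # handles .., ..., and endless ranges
  when Array   then spec.include?(version) # explicit list of versions
  else raise ArgumentError, "unsupported versions spec: #{spec.inspect}"
  end
end

# Show which of versions 1 through 5 each spec would include.
specs = [nil, 2, (1..3), (1...3), (2..), [1, 4, 7]]
specs.each do |spec|
  matching = (1..5).select { |v| version_matches?(v, spec) }
  puts "#{spec.inspect}: #{matching.inspect}"
end
```

Note that a bare integer and an endless range starting at the same number behave identically here, which matches the two "from version 2 onwards" rows in the table.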
Handling Records with Different Versions
# Create a record with version 1 schema
article_v1 = Article.create(title: "Original Article")
# Access with version 3 schema
article_v3 = Article.find(article_v1.id)
# Fields from v1 are still accessible
article_v3.title # => "Original Article"
# Fields not in v1 raise errors
article_v3.summary # => VersionRangeError: Field 'summary' is not available in version 1.
# This field is only available in versions: 2 to 999.
# Check version compatibility
article_v3.version_compatible_with?(3) # => false
article_v3.version_compatible_with?(1) # => true
# Upgrade record to version 3
article_v3.summary = "Added in v3"
article_v3.save! # Record version is automatically updated to 3
Accessing the Container Attribute
The JSON container attribute can be accessed directly:
# Using the default container attribute :json_attributes
article.json_attributes # => { "title" => "My Title", "version" => 1, ... }
# If you've configured a custom container attribute
article.custom_json_column # => { "title" => "My Title", "version" => 1, ... }
Validation & Error Handling
Structify validates all LLM responses and raises specific exceptions for retry logic:
begin
  article.update!(llm_response)
rescue Structify::LLMValidationError => e
  RetryExtractionJob.perform_later(article.id, content, e.field_name)
end
Understanding Structify's Role
Structify is designed as a bridge between your Rails models and LLM extraction services:
What Structify Does For You
- ✅ Define extraction schemas directly in your ActiveRecord models
- ✅ Generate compatible JSON schemas for OpenAI, Anthropic, and other LLM providers
- ✅ Store and validate extracted data with automatic error detection
- ✅ Provide typed access to extracted fields through your models
- ✅ Handle schema versioning and backward compatibility
- ✅ Raise specific exceptions for different validation failures to enable retry logic
What You Need To Implement
- 🔧 API integration with your chosen LLM provider (see examples above)
- 🔧 Processing logic for when and how to extract data
- 🔧 Authentication and API key management
- 🔧 Error handling and retries for API calls
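Since retries for API calls are left to your application, here is one minimal way to wrap a call in bounded retries with exponential backoff, using only plain Ruby. The with_retries helper, its keyword arguments, and the simulated IOError are all assumptions for illustration; substitute the error classes raised by the HTTP client or SDK you actually use.

```ruby
# Retries the block on the listed error classes with exponential backoff.
# `with_retries` and its defaults are hypothetical; pick error classes that
# match the HTTP client or SDK you actually use.
def with_retries(max_attempts: 3, base_delay: 0.5, on: [StandardError], sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *on
    # Give up and re-raise the original error once attempts are exhausted.
    raise if attempt >= max_attempts

    # Wait base_delay, then 2x, 4x, ... before trying again.
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulated flaky call: fails twice, then succeeds on the third attempt.
calls = 0
result = with_retries(max_attempts: 3, base_delay: 0.01, on: [IOError]) do |attempt|
  calls += 1
  raise IOError, "simulated transient failure" if attempt < 3

  "ok after #{attempt} attempts"
end
puts result
```

In a Rails app you might prefer the retry options of your background job framework over an in-process loop; the sleeper argument exists mainly so tests can avoid real delays.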
This separation of concerns allows you to:
- Use any LLM provider and model you prefer
- Implement extraction logic specific to your application
- Handle API access in a way that fits your application architecture
- Change LLM providers without changing your data model