# CV Parser
A Ruby gem for parsing and extracting structured information from CVs/resumes using LLM providers.
## Features
- Multiple file format support: PDF, DOCX, TXT, and Markdown files
- Smart file processing: Converts DOCX to PDF, processes text files directly (no upload required)
- Extract structured data from CVs using leading LLM providers
- Multiple LLM providers: OpenAI, Anthropic, and Faker (for testing)
- Customizable output schema using JSON Schema format
- Command-line interface for quick parsing and analysis
- Performance optimized: Text files bypass upload for faster processing
- Robust error handling and validation
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'cv-parser'
```

And then execute:
```bash
$ bundle install
```

Or install it yourself as:
```bash
$ gem install cv-parser
```

## Usage
### Using in Rails
You can use CV Parser directly in your Ruby or Rails application to extract structured data from CVs.
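In a Rails app, a natural home for this configuration is an initializer so it runs once at boot. A minimal sketch, assuming the conventional `config/initializers/` location (the file name below is arbitrary):

```ruby
# config/initializers/cv_parser.rb (hypothetical file name)
require 'cv_parser'

CvParser.configure do |config|
  config.provider = :openai
  config.api_key  = ENV['OPENAI_API_KEY'] # assumes the key is set in the environment
  config.model    = 'gpt-4.1-mini'
end
```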
#### Basic Configuration
You can configure the gem for different providers:
```ruby
require 'cv_parser'

# OpenAI
CvParser.configure do |config|
  config.provider = :openai
  config.api_key = ENV['OPENAI_API_KEY']
  config.model = 'gpt-4.1-mini'
  config.output_schema = schema
end

# Anthropic
CvParser.configure do |config|
  config.provider = :anthropic
  config.api_key = ENV['ANTHROPIC_API_KEY']
  config.model = 'claude-3-sonnet-20240229'
  config.output_schema = schema
end

# Faker (for testing/development)
CvParser.configure do |config|
  config.provider = :faker
  config.output_schema = schema
end
```

#### Defining an Output Schema
Define the schema for the data you want to extract using JSON Schema format:
```ruby
schema = {
  type: "json_schema",
  name: "cv_parsing",
  description: "Schema for a CV or resume document",
  properties: {
    personal_info: {
      type: "object",
      description: "Personal and contact information for the candidate",
      properties: {
        name: {
          type: "string",
          description: "Full name of the individual"
        },
        email: {
          type: "string",
          description: "Email address of the individual"
        },
        phone: {
          type: "string",
          description: "Phone number of the individual"
        },
        location: {
          type: "string",
          description: "Geographic location or city of residence"
        }
      },
      required: %w[name email]
    },
    experience: {
      type: "array",
      description: "List of professional experience entries",
      items: {
        type: "object",
        description: "A professional experience entry",
        properties: {
          company: {
            type: "string",
            description: "Name of the company or organization"
          },
          position: {
            type: "string",
            description: "Job title or position held"
          },
          start_date: {
            type: "string",
            description: "Start date of employment (e.g. '2020-01')"
          },
          end_date: {
            type: "string",
            description: "End date of employment or 'present'"
          },
          description: {
            type: "string",
            description: "Description of responsibilities and achievements"
          }
        },
        required: %w[company position start_date]
      }
    },
    education: {
      type: "array",
      description: "List of educational qualifications",
      items: {
        type: "object",
        description: "An education entry",
        properties: {
          institution: {
            type: "string",
            description: "Name of the educational institution"
          },
          degree: {
            type: "string",
            description: "Degree or certification received"
          },
          field: {
            type: "string",
            description: "Field of study"
          },
          graduation_date: {
            type: "string",
            description: "Graduation date (e.g. '2019-06')"
          }
        },
        required: %w[institution degree]
      }
    },
    skills: {
      type: "array",
      description: "List of relevant skills",
      items: {
        type: "string",
        description: "A single skill"
      }
    }
  },
  required: %w[personal_info experience education skills]
}
```

Set the output schema in the configuration block:
```ruby
CvParser.configure do |config|
  config.output_schema = schema
end
```

You can also pass the schema directly to the `extract` method, which overrides the schema from the configuration block:
```ruby
extractor = CvParser::Extractor.new
result = extractor.extract(
  file_path: "path/to/resume.pdf",
  output_schema: schema
)
```

#### Extracting Data from a CV
```ruby
extractor = CvParser::Extractor.new

# Extract from PDF (uploaded to LLM)
result = extractor.extract(
  file_path: "path/to/resume.pdf"
)

# Extract from text file (fast, no upload)
result = extractor.extract(
  file_path: "path/to/resume.txt"
)

# Extract from markdown file (fast, no upload)
result = extractor.extract(
  file_path: "path/to/resume.md"
)

puts "Name: #{result['personal_info']['name']}"
puts "Email: #{result['personal_info']['email']}"
result['skills'].each { |skill| puts "- #{skill}" }
```

#### Error Handling
```ruby
begin
  result = extractor.extract(
    file_path: "path/to/resume.txt" # Works with any supported format
  )
rescue CvParser::FileNotFoundError, CvParser::FileNotReadableError => e
  puts "File error: #{e.message}"
rescue CvParser::EmptyTextFileError => e
  puts "Text file is empty: #{e.message}"
rescue CvParser::TextFileEncodingError => e
  puts "Text file encoding error: #{e.message}"
rescue CvParser::ParseError => e
  puts "Error parsing the response: #{e.message}"
rescue CvParser::APIError => e
  puts "LLM API error: #{e.message}"
rescue CvParser::ConfigurationError => e
  puts "Configuration error: #{e.message}"
end
```

### Command-Line Interface
CV Parser also provides a CLI for quick analysis:
```bash
# Process different file formats
cv-parser path/to/resume.pdf
cv-parser path/to/resume.docx
cv-parser path/to/resume.txt
cv-parser path/to/resume.md

# Use different providers
cv-parser --provider anthropic path/to/resume.pdf
cv-parser --provider openai path/to/resume.txt

# Output options
cv-parser --format yaml --output result.yaml path/to/resume.md
cv-parser --schema custom-schema.json path/to/resume.txt
cv-parser --help
```

You can use environment variables for API keys and provider selection:
```bash
export OPENAI_API_KEY=your-openai-key
export ANTHROPIC_API_KEY=your-anthropic-key
export CV_PARSER_PROVIDER=openai
export CV_PARSER_API_KEY=your-api-key

cv-parser resume.pdf
```

## Supported File Formats
CV Parser supports multiple file formats with optimized processing:
### File Format Support
| Format   | Extension | Processing Method       | Upload Required | Performance |
|----------|-----------|-------------------------|-----------------|-------------|
| PDF      | `.pdf`    | Direct upload           | Yes             | Standard    |
| DOCX     | `.docx`   | Convert to PDF → Upload | Yes             | Standard    |
| Text     | `.txt`    | Direct text processing  | No              | Fast        |
| Markdown | `.md`     | Direct text processing  | No              | Fast        |
### Performance Benefits of Text Files
Text files (`.txt` and `.md`) offer significant performance advantages (see the timing sketch after this list):
- No file upload overhead: Content is included directly in the API request
- Faster processing: Eliminates the upload → reference workflow
- Reduced API calls: Single request instead of upload + process
- Lower bandwidth usage: Direct text inclusion vs binary file transfer
- Better for automation: Simpler integration in automated workflows
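If you want to verify the difference on your own files, here is a minimal sketch using Ruby's built-in `Benchmark` module. The file paths are placeholders, and actual timings will vary with provider, network, and file size:

```ruby
require 'benchmark'

extractor = CvParser::Extractor.new

# Placeholder paths; use the same CV in both formats for a fair comparison.
text_time = Benchmark.realtime do
  extractor.extract(file_path: "resume.txt", output_schema: schema)
end
pdf_time = Benchmark.realtime do
  extractor.extract(file_path: "resume.pdf", output_schema: schema)
end

puts format("text (direct): %.1fs, pdf (upload + process): %.1fs", text_time, pdf_time)
```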
### File Size Limits
- PDF/DOCX files: Limited by the LLM provider (typically 20MB)
- Text files: No explicit size limit in the gem (subject only to the LLM provider's request limits)
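To fail fast on oversized uploads, you can check the file size before calling `extract`. A minimal sketch; the 20MB threshold mirrors the typical provider limit mentioned above and is not enforced by the gem itself:

```ruby
# Hypothetical guard; adjust the limit to your provider's documented maximum.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024

path = "path/to/resume.pdf"
if %w[.pdf .docx].include?(File.extname(path).downcase) && File.size(path) > MAX_UPLOAD_BYTES
  warn "#{path} exceeds the typical provider upload limit"
else
  result = extractor.extract(file_path: path, output_schema: schema)
end
```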
### File Processing Examples
```ruby
# Fast text processing (no upload)
extractor.extract(file_path: "resume.txt", output_schema: schema)
extractor.extract(file_path: "resume.md", output_schema: schema)

# Standard file processing (with upload)
extractor.extract(file_path: "resume.pdf", output_schema: schema)
extractor.extract(file_path: "resume.docx", output_schema: schema)
```

## Advanced Configuration
You can further customize CV Parser by setting advanced options in the configuration block. For example:
```ruby
CvParser.configure do |config|
  # Configure OpenAI with organization ID
  config.provider = :openai
  config.api_key = ENV['OPENAI_API_KEY']
  config.model = 'gpt-4.1-mini'

  # Set timeout for file uploads (important for larger files)
  config.timeout = 120   # TODO: not yet implemented
  config.max_retries = 2 # TODO: not yet implemented

  # Provider-specific options
  config.provider_options[:organization_id] = ENV['OPENAI_ORG_ID']

  # You can also set custom prompts for the LLM
  config.prompt = "Extract the following fields from the CV..."
  config.system_prompt = "You are a CV parsing assistant."

  # Set the output schema (JSON Schema format)
  config.output_schema = schema

  # Set the max tokens and temperature
  config.max_tokens = 4000
  config.temperature = 0.1
end
```

## Testing and Development
### Using the Faker Provider
The Faker provider generates realistic-looking fake data based on your schema without making API calls. This is useful for:
- Writing tests (RSpec, Rails, etc.)
- Developing UI components
- Demonstrating functionality without API keys
- Avoiding API costs and rate limits
- Running tests faster without external API calls
- Getting consistent, predictable results
- Working in CI/CD environments without API keys
### Basic Test Setup
Here's how to use the faker provider in your RSpec tests:
```ruby
# spec/your_resume_processor_spec.rb
require 'spec_helper'

RSpec.describe YourResumeProcessor do
  # Define a JSON Schema format schema for testing
  let(:test_schema) do
    {
      type: "json_schema",
      name: "cv_parsing_test",
      description: "Test schema for CV parsing",
      properties: {
        personal_info: {
          type: "object",
          description: "Personal information",
          properties: {
            name: {
              type: "string",
              description: "Full name"
            },
            email: {
              type: "string",
              description: "Email address"
            }
          },
          required: %w[name email]
        },
        skills: {
          type: "array",
          description: "List of skills",
          items: {
            type: "string",
            description: "A skill"
          }
        }
      },
      required: %w[personal_info skills]
    }
  end

  before do
    # Configure CV Parser to use the faker provider
    CvParser.configure do |config|
      config.provider = :faker
    end
  end

  after do
    # Reset configuration after tests
    CvParser.reset
  end

  it "processes a resume and extracts relevant fields" do
    processor = YourResumeProcessor.new
    result = processor.process_resume("spec/fixtures/sample_resume.pdf", test_schema)

    # The faker provider will return consistent test data
    expect(result["personal_info"]["name"]).to eq("John Doe")
    expect(result["personal_info"]["email"]).to eq("john.doe@example.com")
    expect(result["skills"]).to be_an(Array)
    expect(result["skills"]).not_to be_empty
  end
end
```

### Simple Faker Example
```ruby
# Configure with the Faker provider
CvParser.configure do |config|
  config.provider = :faker
end

# Use the extractor as normal
extractor = CvParser::Extractor.new
result = extractor.extract(
  file_path: "path/to/resume.pdf", # Path will be ignored by faker
  output_schema: schema # Using the JSON Schema format defined above
)

# Faker will generate structured data based on your schema
puts result.inspect
```

### Data Generation Behavior
The faker provider generates realistic-looking data based on your schema. The data is deterministic for fields like name, email, and phone, but randomized for arrays and collections. You can write tests that check for structure without relying on specific content for variable fields.
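In practice that means asserting on structure for the randomized fields and on exact values only for the deterministic ones. A minimal RSpec sketch, assuming faker results use the same string-keyed hash structure as the real providers and reusing the `test_schema` from the setup above:

```ruby
it "returns data matching the schema structure" do
  extractor = CvParser::Extractor.new
  result = extractor.extract(
    file_path: "ignored.pdf", # the path is ignored by the faker provider
    output_schema: test_schema
  )

  # Check shape, not content, for randomized collections
  expect(result["personal_info"]).to include("name", "email")
  expect(result["skills"]).to be_an(Array)
  expect(result["skills"]).to all(be_a(String))
end
```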