CAST

C parser and abstract syntax tree for Ruby.

Example

require 'cast'

source = File.read('file.c')
ast = C.parse(source)
ast.entities.each do |declaration|
  declaration.declarator.each do |declarator|
    puts "#{declarator.name}: declarator.type"
  end
end

Or in irb:

irb> ast = C.parse('int main(void) { return 0; }')
 => TranslationUnit
    entities:
        - FunctionDef
            type: Function
                type: Int
                params: []
            name: "main"
            def: Block
                stmts:
                    - Return
                        expr: IntLiteral
                            val: 0

irb> puts ast
int main(void) {
    return 0;
}
 => nil

Nodes

C.parse returns a tree of Node objects. Here's the class hierarchy:

Node
- TranslationUnit
- Comment
- Declaration
- Declarator
- FunctionDef
- Parameter
- Enumerator
- MemberInit
- Member
- Statement
  - Block
  - If
  - Switch
  - While
  - For
  - Goto
  - Continue
  - Break
  - Return
  - ExpressionStatement
- Label
  - PlainLabel
  - Default
  - Case
- Type
  - IndirectType
    - Pointer
    - Array
    - Function
  - DirectType
    - Struct
    - Union
    - Enum
    - CustomType
    - PrimitiveType
      - Void
      - Int
      - Float
      - Char
      - Bool
      - Complex
      - Imaginary
- Expression
  - Comma
  - Conditional
  - Variable
  - UnaryExpression
    - PostfixExpression
      - Index
      - Call
      - Dot
      - Arrow
      - PostInc
      - PostDec
    - PrefixExpression
      - Cast
      - Address
      - Dereference
      - Sizeof
      - Plus
      - Minus
      - PreInc
      - PreDec
      - BitNot
      - Not
  - BinaryExpression
    - Add
    - Subtract
    - Multiply
    - Divide
    - Mod
    - Equal
    - NotEqual
    - Less
    - More
    - LessOrEqual
    - MoreOrEqual
    - BitAnd
    - BitOr
    - BitXor
    - ShiftLeft
    - ShiftRight
    - And
    - Or
  - AssignmentExpression
    - Assign
    - MultiplyAssign
    - DivideAssign
    - ModAssign
    - AddAssign
    - SubtractAssign
    - ShiftLeftAssign
    - ShiftRightAssign
    - BitAndAssign
    - BitXorAssign
    - BitOrAssign
  - Literal
    - StringLiteral
    - CharLiteral
    - CompoundLiteral
    - IntLiteral
    - FloatLiteral
- NodeList
  - NodeArray
  - NodeChain

The bold ones are abstract.

The last 2 (NodeLists) represent lists of Nodes. They quack like standard ruby Arrays. NodeChain is a doubly linked list; NodeArray is an array.

Node Methods

parent: return the parent in the tree (a Node or nil).
pos, pos=: the position in the source file (a Node::Pos).
to_s: return the code for the tree (a String).
inspect: return a pretty string for inspection, makes irb fun.
match?(str), =~(str): return true iff str parses as a Node equal to this one.
detach: remove this node from the tree (parent becomes nil) and return it.
detached?, attached?: return true if parent is nil or non-nil respectively.
replace_with(node): replace this node with node in the tree.
swap_with(node): exchange this node with node in their trees.
insert_prev(*nodes), insert_next(*nodes): insert nodes before this node in the parent list. Parent must be a NodeList! Useful for adding statements before a node in a block, for example.

Foo?: (where Foo is a module name) return self.is_a?(Foo). This is a convienience for a common need. Example:

\# print all global variables
ast.entities.each do |node|
  node.Declaration? or next
  node.declarators.each do |decl|
    unless decl.type.Function?
      puts "#{decl.name}: #{decl.type}"
    end
  end
end

The =~ method lets you do:

if declarator.type =~ 'const int *'
  puts "Ooh, a const int pointer!"
end

This is not the same as declarator.type.to_s == 'const int *'; that'd require you to guess how to_s formats its strings (most notably, the whitespace).

Fields and Children

The big table down below lists the fields of each Node. A field is an attribute which:

is used in equality checks (== and eql?).
are copied recursively by dup and clone.

Fields listed as children form the tree structure. They only have a Node or nil value, and are yielded/returned/affected by the traversal methods:

next, prev: return the next/prev sibling.
list_next, list_prev: like next/prev, but also requires the parent to be NodeList. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.
each, reverse_each: Yield all (non-nil) children. Node includes Enumerable, so, you know.
depth_first, reverse_depth_first: Walk the tree in that order, yielding two args (event, node) at each node. event is :down on the way down, :up on the way up. If the block throws :prune, it won't descend any further.
preorder, reverse_preorder, postorder, reverse_postorder: Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws :prune, it won't descend any further.
node_after(child), node_before(child): return the node before/after child (same as child.next).
remove_node(child): remove child from this node (same as child.detach).
replace_node(child, new_child): replace child with yeah you guessed it (same as child.replace_with(newchild)).

Note: don't modify the tree during traversal!

Other notes about the table:

Field names that end in '?' are always true-or-false.
If no default is listed:
- it is false if the field name ends in a '?'
- it is a NodeArray if it is a NodeList.
- it is nil otherwise.

table.node_desc tr.first_field table td {
  border: none;
}

table.node_desc td {
  padding: 3px;
  vertical-align: top;
{

table.node_desc table td {
  padding: 0px;
{

</style>

Class

Field

Type / values

Default

Comments

TranslationUnit

entities *

NodeList

NodeChain[]

The root of a parsed file.

Declaration

storage

:typedef, :extern, :static, :auto, :register

Also:

#typedef? -- true iff storage == :typedef
#extern? -- true iff storage == :extern
#static? -- true iff storage == :static
#auto? -- true iff storage == :auto
#register? -- true iff storage == :register

type *

DirectType

declarators *

NodeList

NodeArray[]

inline?

true, false

Declarator

indirect_type *

IndirectType

What's a "declarator?" Consider "int i, *ip;". This is a Declaration with two Declarators:

    Declaration
        type: Int
        declarators:
            - Declarator
                name: "i"
            - Declarator
                indirect_type: Pointer
                name: "ip"

The indirect_type of the ip Declarator is a Pointer to nil. To get the complete type of the variable use:

#type -- return the complete type. This is a clone; modifying it won't modify the tree.

So calling #type on the ip Declarator gives:

    Pointer
      type: Int

name

String

init *

Expression

num_bits *

Integer

FunctionDef

storage

:extern, :static

Also:

#extern? -- return true iff storage == :extern
#static? -- return true iff storage == :static
#prototype? -- same as !no_prototype?
#prototype=(val) -- same as no_prototype = !val

no_prototype? means that no prototype was given. That means parameter types weren't given in the parens, but in the "old-style" declaration list. Example:

No prototype	Prototype
int main(argc, argv) int argc; char **argv; { return 0; }	int main(int argc, char **argv) { return 0; }

Everyone tells you to use prototypes. That's because no type checking is done when calling a function declared without a prototype.

inline?

true, false

type *

Type

name

String

def *

Block

Block.new

no_prototype?

true, false

Parameter

true, false

Used in Functions.

type *

Type

name

String

Enumerator

name

String

Used in Enums.

val *

Expression

MemberInit

member *

NodeList of (Member or Expression)

Used in CompoundLiterals.

init *

Expression

Member

name

String

Used in MemberInits.

Block

labels *

NodeList of Label

NodeArray[]

stmts *

NodeList of (Statement or Declaration or Comment)

NodeArray[]

labels *

NodeList of Label

NodeArray[]

cond *

Expression

then *

Statement

else *

Statement

Switch

labels *

NodeList of Label

NodeArray[]

cond *

Expression

stmt *

Statement

While

labels *

NodeList of Label

NodeArray[]

do? means it's a do-while loop.

do?

true, false

cond *

Expression

stmt *

Statement

For

labels *

NodeList of Label

NodeArray[]

init *

Expression or Declaration

cond *

Expression

iter *

Expression

stmt *

Statement

Goto

labels *

NodeList of Label

NodeArray[]

target

String

Continue

labels *

NodeList of Label

NodeArray[]

Break

labels *

NodeList of Label

NodeArray[]

Return

labels *

NodeList of Label

NodeArray[]

expr *

Expression

ExpressionStatement

labels *

NodeList of Label

NodeArray[]

expr *

Expression

PlainLabel

name

String

Default

Case

expr *

Expression

Comma

exprs *

NodeList of Expression

Conditional

cond *

Expression

then *

Expression

else *

Expression

Variable

name

String

Index

expr *

Expression

index *

Expression

Call

expr *

Expression

args *

NodeList of (Expression or Type)

Dot

expr *

Expression

member *

String

Arrow

expr *

Expression

member *

String

PostInc

expr *

Expression

PostDec

expr *

Expression

Cast

type *

Type

expr *

Expression

Address

expr *

Expression

Dereference

expr *

Expression

Sizeof

expr *

Type or Expression

Positive

expr *

Expression

Negative

expr *

Expression

PreInc

expr *

Expression

PreDec

expr *

Expression

BitNot

expr *

Expression

Not

expr *

Expression

Add

expr1 *

Expression

expr2 *

Expression

Subtract

expr1 *

Expression

expr2 *

Expression

Multiply

expr1 *

Expression

expr2 *

Expression

Divide

expr1 *

Expression

expr2 *

Expression

Mod

expr1 *

Expression

expr2 *

Expression

Equal

expr1 *

Expression

expr2 *

Expression

NotEqual

expr1 *

Expression

expr2 *

Expression

Less

expr1 *

Expression

expr2 *

Expression

expr1 *

Expression

expr2 *

Expression

LessOrEqual

expr1 *

Expression

expr2 *

Expression

MoreOrEqual

expr1 *

Expression

expr2 *

Expression

BitAnd

expr1 *

Expression

expr2 *

Expression

BitOr

expr1 *

Expression

expr2 *

Expression

BitXor

expr1 *

Expression

expr2 *

Expression

ShiftLeft

expr1 *

Expression

expr2 *

Expression

ShiftRight

expr1 *

Expression

expr2 *

Expression

And

expr1 *

Expression

expr2 *

Expression

expr1 *

Expression

expr2 *

Expression

Assign

lval *

Expression

rval *

Expression

MultiplyAssign

lval *

Expression

rval *

Expression

DivideAssign

lval *

Expression

rval *

Expression

ModAssign

lval *

Expression

rval *

Expression

AddAssign

lval *

Expression

rval *

Expression

SubtractAssign

lval *

Expression

rval *

Expression

ShiftLeftAssign

lval *

Expression

rval *

Expression

ShiftRightAssign

lval *

Expression

rval *

Expression

BitAndAssign

lval *

Expression

rval *

Expression

BitXorAssign

lval *

Expression

rval *

Expression

BitOrAssign

lval *

Expression

rval *

Expression

StringLiteral

val

String

The String in val is the literal string entered. "\n" isn't converted to a newline, for instance.

CharLiteral

val

String

The String in val is the literal string entered. '\n' isn't converted to a newline, for instance.

CompoundLiteral

type *

Type

Here's an example:

(struct S){1, .x = 2, .y [3] .z = 4}

parses as:

CompoundLiteral
    type: Struct
        name: "S"
    member_inits:
        - MemberInit
            init: IntLiteral
                val: 1
        - MemberInit
            member:
                - Member
                    name: "x"
            init: IntLiteral
                val: 2
        - MemberInit
            member:
                - Member
                    name: "y"
                - IntLiteral
                    val: 3
                - Member
                    name: "z"
            init: IntLiteral
                val: 4

member_inits *

NodeList of MemberInit

NodeArray[]

IntLiteral

format

:dec, :hex, :oct

:dec

Also:

#dec? -- return true iff format == :dec
#hex? -- return true iff format == :hex
#oct? -- return true iff format == :oct

val

Integer

suffix

String

FloatLiteral

format

:dec, :hex

:dec

val

Float

exponent

Integer

suffix

String

Pointer

const?

true, false

restrict?

true, false

volatile?

true, false

type *

Type

Array

const?

true, false

restrict?

true, false

volatile?

true, false

type *

Type

length *

Expression

Function

const?

true, false

restrict?

true, false

volatile?

true, false

type *

Type

params *

NodeList of Parameter

NodeArray[]

var_args?

true, false

Struct

const?

true, false

restrict?

true, false

volatile?

true, false

name

String

members *

NodeList of Member

NodeArray[]

Union

const?

true, false

restrict?

true, false

volatile?

true, false

name

String

members *

NodeList of Member

NodeArray[]

Enum

const?

true, false

restrict?

true, false

volatile?

true, false

name

String

members *

NodeList of Enumerator

CustomType

const?

true, false

For typedef'd names.

restrict?

true, false

volatile?

true, false

name

String

Void

const?

true, false

const is for things like const void *.

restrict?

true, false

volatile?

true, false

Int

const?

true, false

Also:

#short? -- return true iff longness == -1
#plain? -- return true iff longness == 0
#long? -- return true iff longness == 1
#long_long? -- return true iff longness == 2
#signed? -- same as !unsigned?
#signed=(val) -- same as unsigned = !val

restrict?

true, false

volatile?

true, false

longness

-1, 0, 1, 2

unsigned?

true, false

Float

const?

true, false

Also:

#plain? -- return true iff longness == 0
#double? -- return true iff longness == 1
#long_double? -- return true iff longness == 2

restrict?

true, false

volatile?

true, false

longness

0, 1, 2

Char

const?

true, false

Also:

#signed? -- return true iff signed == true
#unsigned? -- return true iff signed == false
#plain? -- return true iff signed == nil

Yes, C99 says that char, signed char, and unsigned char are 3 distinct types (unlike with int -- go figure). Like Martian chalk and Venusian cheese: completely different, but you can fit 'em each in one byte.

restrict?

true, false

volatile?

true, false

signed

true, false, nil

Bool

const?

true, false

This is the rarely seen _Bool type.

restrict?

true, false

volatile?

true, false

Complex

const?

true, false

This is the rarely seen _Complex type.

#plain? -- return true iff longness == 0
#double? -- return true iff longness == 1
#long_double? -- return true iff longness == 2

restrict?

true, false

volatile?

true, false

longness

0, 1, 2

Imaginary

const?

true, false

This is the rarely seen _Imaginary type.

#plain? -- return true iff longness == 0
#double? -- return true iff longness == 1
#long_double? -- return true iff longness == 2

restrict?

true, false

volatile?

true, false

longness

0, 1, 2

BlockExpression

block *

Block

Block.new

Only if the block_expressions extension is enabled. See "Extensions" section below.

Parser

C.parse will use the default parser (C.default_parser), but you can also manage your own parser(s) if you need finer control over state. Parser state consists of:

type_names: a Set of Strings. As a parser eats typedefs, this grows.
pos: the Node::Pos this parser will start parsing at.

A Node::Pos has three read-write attributes: filename, line_num, col_num. Default is nil, 1, 0.

Note that the type names the parser has seen affects the parser! For example, consider:

a * b;

If only a is a type, this is a declaration.
If neither a nor b are types, this is a multiplication statement.
Otherwise, it's a syntax error.

You may append type names implicitly, by parsing typedefs, or explicitly like this:

parser.type_names << 'Thing' << 'OtherThing'

Parsing Snippets

C.parse will parse the toplevel C construct, a C::TranslationUnit, but you can also parse other snippets of C:

C::Statement.parse('while (not_looking) { paint_car(); }')
C::Type.parse('void *(*)(int *(*)[][2], ...)')

This works for both concrete and abstract Node subclasses. A Parser may be given as an optional second argument.

Extensions to C99

Types are allowed as function arguments. This is needed to parse C99 macros like va_arg().
Blocks in parentheses are allowed as expressions (a gcc extension). You need to call parser.enable_block_expressions first. They appear as BlockExpression nodes.

Parsing Full Programs

This can be tricky for a number of reasons. Here are the issues you'll likely encounter.

Preprocessing

Directives that start with # are not handled by the Parser, as they're external to the C grammar. CAST ships with a Preprocessor, which wraps the preprocessor used to build your Ruby interpreter.

cpp = C::Preprocessor.new
cpp.include_path << '/usr/include' << /usr/local/include'
cpp.macros['DEBUG'] = '1'
cpp.macros['max(a, b)'] = '((a) > (b) ? (a) : (b))'
cpp.preprocess(code)

Note however, that preprocessors tend to leave vendor-specific extensions in their output. GNU cpp, for example, leaves "linemarkers" (lines that begin with #) in the output which you'll need to filter out manually before feeding it to a Parser.

Built-in types

Mac OS 10.5's system cpp for instance assumes the compiler will recognize types such as __darwin_va_list.

Syntactic Extensions

Some code may take advantage of compiler-specific extensions to the syntax. For example, gcc supports inline assembly via directives like:

asm("movl %1, %%eax;
    "movl %%eax, %0;"
    :"=r"(y)
    :"r"(x)
    :"%eax");

Such code is fairly rare, so there is no direct support in CAST for this. You'll need to manually massage such constructs out of the Parser input. Or send me patches. Delicious patches.

Contributing

Bug reports
Source
Patches: Fork on Github, send pull request.
- Include tests where practical.
- Leave the version alone, or bump it in a separate commit.

cast

CAST

Example

Nodes

Node Methods

Fields and Children

Parser

Parsing Snippets

Extensions to C99

Parsing Full Programs

Preprocessing

Built-in types

Syntactic Extensions

Contributing

Copyright