CAST
C parser and abstract syntax tree for Ruby.
Example
require 'cast'
source = File.read('file.c')
ast = C.parse(source)
ast.entities.each do |declaration|
declaration.declarator.each do |declarator|
puts "#{declarator.name}: declarator.type"
end
end
Or in irb:
irb> ast = C.parse('int main(void) { return 0; }')
=> TranslationUnit
entities:
- FunctionDef
type: Function
type: Int
params: []
name: "main"
def: Block
stmts:
- Return
expr: IntLiteral
val: 0
irb> puts ast
int main(void) {
return 0;
}
=> nil
Nodes
C.parse returns a tree of Node objects. Here's the class hierarchy:
-
Node
- TranslationUnit
- Comment
- Declaration
- Declarator
- FunctionDef
- Parameter
- Enumerator
- MemberInit
- Member
-
Statement
- Block
- If
- Switch
- While
- For
- Goto
- Continue
- Break
- Return
- ExpressionStatement
-
Label
- PlainLabel
- Default
- Case
-
Type
-
IndirectType
- Pointer
- Array
- Function
-
DirectType
- Struct
- Union
- Enum
- CustomType
-
PrimitiveType
- Void
- Int
- Float
- Char
- Bool
- Complex
- Imaginary
-
IndirectType
-
Expression
- Comma
- Conditional
- Variable
-
UnaryExpression
-
PostfixExpression
- Index
- Call
- Dot
- Arrow
- PostInc
- PostDec
-
PrefixExpression
- Cast
- Address
- Dereference
- Sizeof
- Plus
- Minus
- PreInc
- PreDec
- BitNot
- Not
-
PostfixExpression
-
BinaryExpression
- Add
- Subtract
- Multiply
- Divide
- Mod
- Equal
- NotEqual
- Less
- More
- LessOrEqual
- MoreOrEqual
- BitAnd
- BitOr
- BitXor
- ShiftLeft
- ShiftRight
- And
- Or
-
AssignmentExpression
- Assign
- MultiplyAssign
- DivideAssign
- ModAssign
- AddAssign
- SubtractAssign
- ShiftLeftAssign
- ShiftRightAssign
- BitAndAssign
- BitXorAssign
- BitOrAssign
-
Literal
- StringLiteral
- CharLiteral
- CompoundLiteral
- IntLiteral
- FloatLiteral
-
NodeList
- NodeArray
- NodeChain
The bold ones are abstract.
The last 2 (NodeLists) represent lists of Nodes. They quack like
standard ruby Arrays. NodeChain is a doubly linked list;
NodeArray is an array.
Node Methods
-
parent: return the parent in the tree (aNodeor nil). -
pos,pos=: the position in the source file (aNode::Pos). -
to_s: return the code for the tree (aString). -
inspect: return a pretty string for inspection, makes irb fun. -
match?(str),=~(str): return true iff str parses as aNodeequal to this one. -
detach: remove this node from the tree (parent becomes nil) and return it. -
detached?,attached?: return true if parent is nil or non-nil respectively. -
replace_with(node): replace this node with node in the tree. -
swap_with(node): exchange this node with node in their trees. -
insert_prev(*nodes),insert_next(*nodes): insert nodes before this node in the parent list. Parent must be aNodeList! Useful for adding statements before a node in a block, for example. -
Foo?: (whereFoois a module name) returnself.is_a?(Foo). This is a convienience for a common need. Example:\# print all global variables ast.entities.each do |node| node.Declaration? or next node.declarators.each do |decl| unless decl.type.Function? puts "#{decl.name}: #{decl.type}" end end end
The =~ method lets you do:
if declarator.type =~ 'const int *'
puts "Ooh, a const int pointer!"
end
This is not the same as declarator.type.to_s == 'const int *';
that'd require you to guess how to_s formats its strings (most
notably, the whitespace).
Fields and Children
The big table down below lists the fields of each Node. A field is
an attribute which:
- is used in equality checks (
==andeql?). - are copied recursively by
dupandclone.
Fields listed as children form the tree structure. They only have a
Node or nil value, and are yielded/returned/affected by the
traversal methods:
-
next,prev: return the next/prev sibling. -
list_next,list_prev: likenext/prev, but also requires the parent to beNodeList. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear. -
each,reverse_each: Yield all (non-nil) children.NodeincludesEnumerable, so, you know. -
depth_first,reverse_depth_first: Walk the tree in that order, yielding two args (event, node) at each node. event is:downon the way down,:upon the way up. If the block throws:prune, it won't descend any further. -
preorder,reverse_preorder,postorder,reverse_postorder: Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws:prune, it won't descend any further. -
node_after(child),node_before(child): return the node before/after child (same aschild.next). -
remove_node(child): remove child from this node (same aschild.detach). -
replace_node(child, new_child): replace child with yeah you guessed it (same aschild.replace_with(newchild)).
Note: don't modify the tree during traversal!
Other notes about the table:
- Field names that end in '?' are always true-or-false.
- If no default is listed:
- it is false if the field name ends in a '?'
- it is a
NodeArrayif it is aNodeList. - it is
nilotherwise.
table.node_desc tr.first_field table td {
border: none;
}
table.node_desc td {
padding: 3px;
vertical-align: top;
{
table.node_desc table td {
padding: 0px;
{
| Class | Field | Type / values | Default | Comments | ||||
|---|---|---|---|---|---|---|---|---|
| TranslationUnit | entities * | NodeList | NodeChain[] | The root of a parsed file. | ||||
| Declaration | storage | :typedef, :extern, :static, :auto, :register |
Also:
|
|||||
| type * | DirectType | |||||||
| declarators * | NodeList | NodeArray[] | ||||||
| inline? | true, false | |||||||
| Declarator | indirect_type * | IndirectType |
What's a "declarator?" Consider "int i, *ip;". This is
a Declaration with two Declarators:
Declaration
type: Int
declarators:
- Declarator
name: "i"
- Declarator
indirect_type: Pointer
name: "ip"
The indirect_type of the ip
Declarator is a Pointer to nil.
To get the complete type of the variable use:
Pointer
type: Int
|
|||||
| name | String | |||||||
| init * | Expression | |||||||
| num_bits * | Integer | |||||||
| FunctionDef | storage | :extern, :static |
Also:
|
|||||
| inline? | true, false | |||||||
| type * | Type | |||||||
| name | String | |||||||
| def * | Block | Block.new | ||||||
| no_prototype? | true, false | |||||||
| Parameter | register? | true, false | Used in Functions. | |||||
| type * | Type | |||||||
| name | String | |||||||
| Enumerator | name | String | Used in Enums. | |||||
| val * | Expression | |||||||
| MemberInit | member * | NodeList of (Member or Expression) | Used in CompoundLiterals. | |||||
| init * | Expression | |||||||
| Member | name | String | Used in MemberInits. | |||||
| Block | labels * | NodeList of Label | NodeArray[] | |||||
| stmts * | NodeList of (Statement or Declaration or Comment) | NodeArray[] | ||||||
| If | labels * | NodeList of Label | NodeArray[] | |||||
| cond * | Expression | |||||||
| then * | Statement | |||||||
| else * | Statement | |||||||
| Switch | labels * | NodeList of Label | NodeArray[] | |||||
| cond * | Expression | |||||||
| stmt * | Statement | |||||||
| While | labels * | NodeList of Label | NodeArray[] | do? means it's a do-while loop. | ||||
| do? | true, false | |||||||
| cond * | Expression | |||||||
| stmt * | Statement | |||||||
| For | labels * | NodeList of Label | NodeArray[] | |||||
| init * | Expression or Declaration | |||||||
| cond * | Expression | |||||||
| iter * | Expression | |||||||
| stmt * | Statement | |||||||
| Goto | labels * | NodeList of Label | NodeArray[] | |||||
| target | String | |||||||
| Continue | labels * | NodeList of Label | NodeArray[] | |||||
| Break | labels * | NodeList of Label | NodeArray[] | |||||
| Return | labels * | NodeList of Label | NodeArray[] | |||||
| expr * | Expression | |||||||
| ExpressionStatement | labels * | NodeList of Label | NodeArray[] | |||||
| expr * | Expression | |||||||
| PlainLabel | name | String | ||||||
| Default | ||||||||
| Case | expr * | Expression | ||||||
| Comma | exprs * | NodeList of Expression | ||||||
| Conditional | cond * | Expression | ||||||
| then * | Expression | |||||||
| else * | Expression | |||||||
| Variable | name | String | ||||||
| Index | expr * | Expression | ||||||
| index * | Expression | |||||||
| Call | expr * | Expression | ||||||
| args * | NodeList of (Expression or Type) | |||||||
| Dot | expr * | Expression | ||||||
| member * | String | |||||||
| Arrow | expr * | Expression | ||||||
| member * | String | |||||||
| PostInc | expr * | Expression | ||||||
| PostDec | expr * | Expression | ||||||
| Cast | type * | Type | ||||||
| expr * | Expression | |||||||
| Address | expr * | Expression | ||||||
| Dereference | expr * | Expression | ||||||
| Sizeof | expr * | Type or Expression | ||||||
| Positive | expr * | Expression | ||||||
| Negative | expr * | Expression | ||||||
| PreInc | expr * | Expression | ||||||
| PreDec | expr * | Expression | ||||||
| BitNot | expr * | Expression | ||||||
| Not | expr * | Expression | ||||||
| Add | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Subtract | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Multiply | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Divide | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Mod | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Equal | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| NotEqual | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Less | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| More | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| LessOrEqual | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| MoreOrEqual | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| BitAnd | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| BitOr | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| BitXor | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| ShiftLeft | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| ShiftRight | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| And | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Or | expr1 * | Expression | ||||||
| expr2 * | Expression | |||||||
| Assign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| MultiplyAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| DivideAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| ModAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| AddAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| SubtractAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| ShiftLeftAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| ShiftRightAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| BitAndAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| BitXorAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| BitOrAssign | lval * | Expression | ||||||
| rval * | Expression | |||||||
| StringLiteral | val | String | The String in val is the literal string entered. "\n" isn't converted to a newline, for instance. | |||||
| CharLiteral | val | String | The String in val is the literal string entered. '\n' isn't converted to a newline, for instance. | |||||
| CompoundLiteral | type * | Type |
Here's an example: (struct S){1, .x = 2, .y [3] .z = 4}
parses as: CompoundLiteral
type: Struct
name: "S"
member_inits:
- MemberInit
init: IntLiteral
val: 1
- MemberInit
member:
- Member
name: "x"
init: IntLiteral
val: 2
- MemberInit
member:
- Member
name: "y"
- IntLiteral
val: 3
- Member
name: "z"
init: IntLiteral
val: 4
|
|||||
| member_inits * | NodeList of MemberInit | NodeArray[] | ||||||
| IntLiteral | format | :dec, :hex, :oct | :dec |
Also:
|
||||
| val | Integer | |||||||
| suffix | String | |||||||
| FloatLiteral | format | :dec, :hex | :dec | |||||
| val | Float | |||||||
| exponent | Integer | |||||||
| suffix | String | |||||||
| Pointer | const? | true, false | ||||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| type * | Type | |||||||
| Array | const? | true, false | ||||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| type * | Type | |||||||
| length * | Expression | |||||||
| Function | const? | true, false | ||||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| type * | Type | |||||||
| params * | NodeList of Parameter | NodeArray[] | ||||||
| var_args? | true, false | |||||||
| Struct | const? | true, false | ||||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| name | String | |||||||
| members * | NodeList of Member | NodeArray[] | ||||||
| Union | const? | true, false | ||||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| name | String | |||||||
| members * | NodeList of Member | NodeArray[] | ||||||
| Enum | const? | true, false | ||||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| name | String | |||||||
| members * | NodeList of Enumerator | |||||||
| CustomType | const? | true, false | For typedef'd names. | |||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| name | String | |||||||
| Void | const? | true, false | const is for things like const void *. | |||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| Int | const? | true, false |
Also:
|
|||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| longness | -1, 0, 1, 2 | 0 | ||||||
| unsigned? | true, false | |||||||
| Float | const? | true, false |
Also:
|
|||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| longness | 0, 1, 2 | 0 | ||||||
| Char | const? | true, false |
Also:
|
|||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| signed | true, false, nil | |||||||
| Bool | const? | true, false | This is the rarely seen _Bool type. | |||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| Complex | const? | true, false |
This is the rarely seen _Complex type.
|
|||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| longness | 0, 1, 2 | 0 | ||||||
| Imaginary | const? | true, false |
This is the rarely seen _Imaginary type.
|
|||||
| restrict? | true, false | |||||||
| volatile? | true, false | |||||||
| longness | 0, 1, 2 | 0 | ||||||
| BlockExpression | block * | Block | Block.new | Only if the block_expressions extension is enabled. See "Extensions" section below. |
Parser
C.parse will use the default parser (C.default_parser), but you
can also manage your own parser(s) if you need finer control over
state. Parser state consists of:
-
type_names: a Set of Strings. As a parser eatstypedefs, this grows. -
pos: theNode::Posthis parser will start parsing at.
A Node::Pos has three read-write attributes: filename, line_num,
col_num. Default is nil, 1, 0.
Note that the type names the parser has seen affects the parser! For example, consider:
a * b;
- If only
ais a type, this is a declaration. - If neither
anorbare types, this is a multiplication statement. - Otherwise, it's a syntax error.
You may append type names implicitly, by parsing typedefs, or
explicitly like this:
parser.type_names << 'Thing' << 'OtherThing'
Parsing Snippets
C.parse will parse the toplevel C construct, a C::TranslationUnit,
but you can also parse other snippets of C:
C::Statement.parse('while (not_looking) { paint_car(); }')
C::Type.parse('void *(*)(int *(*)[][2], ...)')
This works for both concrete and abstract Node subclasses. A
Parser may be given as an optional second argument.
Extensions to C99
-
Types are allowed as function arguments. This is needed to parse C99 macros likeva_arg(). -
Blocks in parentheses are allowed as expressions (a gcc extension). You need to callparser.enable_block_expressionsfirst. They appear asBlockExpressionnodes.
Parsing Full Programs
This can be tricky for a number of reasons. Here are the issues you'll likely encounter.
Preprocessing
Directives that start with # are not handled by the Parser, as
they're external to the C grammar. CAST ships with a Preprocessor,
which wraps the preprocessor used to build your Ruby interpreter.
cpp = C::Preprocessor.new
cpp.include_path << '/usr/include' << /usr/local/include'
cpp.macros['DEBUG'] = '1'
cpp.macros['max(a, b)'] = '((a) > (b) ? (a) : (b))'
cpp.preprocess(code)
Note however, that preprocessors tend to leave vendor-specific
extensions in their output. GNU cpp, for example, leaves
"linemarkers" (lines that begin with #) in the output which you'll
need to filter out manually before feeding it to a Parser.
Built-in types
Mac OS 10.5's system cpp for instance assumes the compiler will
recognize types such as __darwin_va_list.
Syntactic Extensions
Some code may take advantage of compiler-specific extensions to the
syntax. For example, gcc supports inline assembly via directives
like:
asm("movl %1, %%eax;
"movl %%eax, %0;"
:"=r"(y)
:"r"(x)
:"%eax");
Such code is fairly rare, so there is no direct support in CAST for
this. You'll need to manually massage such constructs out of the
Parser input. Or send me patches. Delicious patches.
Contributing
- Bug reports
- Source
- Patches: Fork on Github, send pull request.
- Include tests where practical.
- Leave the version alone, or bump it in a separate commit.
Copyright
Copyright (c) George Ogata. See LICENSE for details.