Written by Mitchell Foral

I will assume the reader has a decent knowledge of how Ragel works.

All parsers must do 4 things:
  * Call back when a line of code is parsed.
  * Call back when a line of comment is parsed.
  * Call back when a blank line is parsed.
  * Call back for entities parsed.
The first three are trickier than they may seem; the last is very easy.

Take a look at c.rl, and keep it open for reference while reading this
document to better understand how parsers work and how to write one.
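
To make the four duties concrete from the consumer's side, here is a minimal sketch of a callback that tallies the three line entities. It uses the callback signature that parse_c takes later in this document; the names line_counts, counts and count_lines_callback are hypothetical and not part of Ohcount.

```c
#include <string.h>

/* Hypothetical tally of the three line entities (not part of Ohcount). */
struct line_counts {
  int code, comment, blank;
};

static struct line_counts counts;

/* Same signature as the callback passed to parse_c. Entities other than
 * lcode, lcomment and lblank (e.g. "string") are ignored here. */
void count_lines_callback(const char *lang, const char *entity,
                          int start, int end) {
  (void) lang; (void) start; (void) end;
  if (strcmp(entity, "lcode") == 0)
    counts.code++;
  else if (strcmp(entity, "lcomment") == 0)
    counts.comment++;
  else if (strcmp(entity, "lblank") == 0)
    counts.blank++;
}
```

A parser that meets the first three requirements calls such a callback exactly once per line, so the three counters add up to the number of lines in the buffer.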

First create your parser in ext/ohcount_native/ragel_parsers/. Its name
should be the language you're parsing with a '.rl' extension. Every parser
must have the following at the top:

    /************************* Required for every parser *************************/

    // the name of the language
    const char *C_LANG = "c";

    // the language's entities
    const char *c_entities[] = {
      "space", "comment", "string", "number", "preproc", "keyword",
      "identifier", "operator", "escaped_newline", "newline"
    };

    // constants associated with the entities
    enum {
      C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC, C_KEYWORD,
      C_IDENTIFIER, C_OPERATOR, C_ESCAPED_NL, C_NEWLINE
    };

    // do not change the following variables

    // used for newlines inside patterns like strings and comments that can have
    // newlines in them
    #define INTERNAL_NL -1

    // required by Ragel
    int cs, act;
    char *p, *pe, *eof, *ts, *te;

    // used for calculating offsets from buffer start for start and end positions
    char *buffer_start;
    #define cint(c) ((int) (c - buffer_start))

    // state flags for line and comment counting
    int whole_line_comment;
    int line_contains_code;

    // the beginning of a line in the buffer for line and comment counting
    char *line_start;

    // state variable for the current entity being matched
    int entity;

    /*****************************************************************************/

And the following at the bottom:

    /* Parses a string buffer with C/C++ code.
     *
     * @param *buffer The string to parse.
     * @param length The length of the string to parse.
     * @param *c_callback Callback function called for each entity. Entities are
     *   the ones defined in the lexer as well as 3 additional entities used by
     *   Ohcount for counting lines: lcode, lcomment, lblank.
     */
    void parse_c(char *buffer, int length,
      void (*c_callback) (const char *lang, const char *entity, int start, int end)
      ) {
      // Ragel's buffer pointers
      p = buffer;
      pe = buffer + length;
      eof = pe;

      buffer_start = buffer;
      whole_line_comment = 0;
      line_contains_code = 0;
      line_start = 0;

      %% write init;
      %% write exec;

      // no newline at EOF; get contents of last line
      if ((whole_line_comment || line_contains_code) && c_callback) {
        if (line_contains_code)
          c_callback(LANG, "lcode", cint(line_start), cint(pe));
        else if (whole_line_comment)
          c_callback(LANG, "lcomment", cint(line_start), cint(pe));
      }
    }

(Your parser will go between these two blocks.)

The code can be found in the existing c.rl parser. You'll need to change:
  * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
    the name of your language to parse. [lang] is your language name. So if
    you're writing a C parser, it would be C_LANG.
  * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
    c_entities). The value is an array of string entities your language has.
    For example, C has comment, string, number, etc. entities. You should
    definitely have "space" and "newline" entities. If your language has
    escaped newlines (or continuations), have an "escaped_newline" entity as
    well.
  * enum - Change the value of the enum to correspond with your entities. So
    if in your parser you look up [lang]_entities[ENTITY], you'll get the
    associated entity's string name.
  * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
    is the name of your language. In the case of C, it is parse_c.
  * [lang]_callback - Set the name of the callback to be [lang]_callback
    (e.g. c_callback) and change all occurrences in the parse_[lang] function.

You may be asking why you have to rename variables and functions. When
languages are embedded inside others, any identifiers with the same
name can easily be mixed up.
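
As an illustration of the naming scheme, the sketch below declares the name, entity table and enum for two made-up languages, "foo" and "bar". Neither language exists in Ohcount; they only show that prefixed identifiers let several parsers share one program without clashing.

```c
/* Declarations for a made-up language "foo", following the naming scheme. */
const char *FOO_LANG = "foo";

const char *foo_entities[] = {
  "space", "comment", "string", "newline"
};

enum {
  FOO_SPACE = 0, FOO_COMMENT, FOO_STRING, FOO_NEWLINE
};

/* A second made-up language, "bar". Its identifiers carry their own prefix,
 * so both sets of declarations can live side by side. */
const char *BAR_LANG = "bar";

const char *bar_entities[] = {
  "space", "comment", "newline"
};

enum {
  BAR_SPACE = 0, BAR_COMMENT, BAR_NEWLINE
};
```

Because the enum order mirrors the array order, foo_entities[FOO_STRING] is "string" and bar_entities[BAR_NEWLINE] is "newline". Had both parsers used an unprefixed NEWLINE constant, the two enums would not compile together.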

Try to understand what the main variables are used for. They will make more
sense later.

Now you can define your Ragel parser. Name your machine after your language,
'write data', and include 'common.rl', a file with common Ragel definitions,
actions, etc. For example, the machine for the C parser is named 'c'.

Understanding What You're Writing:
Before you begin to write patterns for each entity in your language, you
need to understand how the parser should work.

You should write a parser as a line-by-line parser for multiple lines. This
means you match any combination of entities except a newline up until you do
reach a newline. If the line contains only spaces, or nothing at all, it is
blank. If the line contains only a comment, optionally preceded by spaces,
the line is a comment. If the line contains anything but a comment after
spaces (if there are any), it is a line of code. You will do this using a
Ragel scanner.
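
The classification rule can be sketched in plain C, outside of Ragel. The example assumes a made-up language whose only comments start with '#' and run to the end of the line; classify_line is a hypothetical helper, not something Ohcount provides.

```c
#include <ctype.h>

/* Classify one line (given without its newline) for a made-up language
 * whose only comments start with '#' and run to the end of the line. */
const char *classify_line(const char *line) {
  while (*line && isspace((unsigned char) *line))
    line++;                      /* skip leading spaces */
  if (*line == '\0')
    return "lblank";             /* only spaces, or nothing at all */
  if (*line == '#')
    return "lcomment";           /* a comment is the first thing on the line */
  return "lcode";                /* anything else makes it a line of code */
}
```

Note that a line such as "x = 1  # note" counts as code: the trailing comment does not change the status once code has been seen. A real parser cannot look at whole lines like this, because entities such as comments and strings may span several lines, which is why the rest of this document tracks the status with state variables instead.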

Scanner Parser Structure:
A scanner parser will look like this:

    entity1 ${ entity = ENTITY1; } => [lang]_callback;
    entity2 ${ entity = ENTITY2; } => [lang]_callback;
    ...
    entityn ${ entity = ENTITYN; } => [lang]_callback;

(As usual, replace [lang] with your language name.)
Each entity is the pattern for an entity to match. For each match, the
entity variable is set to a constant defined in the enum, and the main
action is called (you will need to create this action above the scanner).

As soon as you can tell whether a line is code or comment, you should call
the appropriate 'code' or 'comment' action defined in common.rl. It is not
necessary to worry about whether or not these actions are called more than
once for a given line; the first call to either sets the status of the line
permanently. Sometimes you cannot call 'code' or 'comment' for one reason or
another. Do not worry, as this is discussed later.

When you reach a newline, you will need to decide whether the current line
is a line of code, comment, or blank. This is easy. Simply check if the
line_contains_code or whole_line_comment variables are set to 1. If neither
is, the line is blank. Then call the [lang]_callback function (not the
action) with an "lcode", "lcomment", or "lblank" string, and the start and
end positions of that line (including the newline). The start position of
the line is in the line_start variable. It should be set at the beginning
of every line either through the 'code' or 'comment' actions, or manually
in the main action. Finally the line_contains_code, whole_line_comment, and
line_start state variables must be reset. All this is done in the main
action.
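
The newline step can be sketched as an ordinary C function. The state variables are named as in the required block at the top of a parser, but finish_line and remember_line are hypothetical helpers. One simplifying assumption: instead of clearing line_start, this sketch points it at the start of the next line, so blank lines also have a valid start position.

```c
typedef void (*entity_cb)(const char *lang, const char *entity,
                          int start, int end);

/* State variables named as in the required block at the top of a parser. */
static char *buffer_start;
static char *line_start;
static int whole_line_comment;
static int line_contains_code;
#define cint(c) ((int) ((c) - buffer_start))

/* Example callback (hypothetical) that remembers the last line reported. */
static const char *last_kind;
static int last_start, last_end;
void remember_line(const char *lang, const char *entity, int start, int end) {
  (void) lang;
  last_kind = entity;
  last_start = start;
  last_end = end;
}

/* Hypothetical helper: report the line that ends at line_end (one past the
 * newline), then reset the per-line state. */
void finish_line(const char *lang, char *line_end, entity_cb cb) {
  const char *kind = "lblank";          /* neither flag set: blank line */
  if (line_contains_code)
    kind = "lcode";
  else if (whole_line_comment)
    kind = "lcomment";
  if (cb)
    cb(lang, kind, cint(line_start), cint(line_end));
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = line_end;                /* assumption: next line starts here */
}
```

For the buffer "int x;\n\n" with line_contains_code set, the first call with line_end at offset 7 reports "lcode" for offsets 0 to 7, and a second call with line_end at offset 8 reports "lblank" for offsets 7 to 8.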

Main Action Structure:
The main action looks like this:

    action [lang]_callback {
      /* switch (entity) with a case per entity; see c.rl */
      if ([lang]_callback && entity != INTERNAL_NL)
        [lang]_callback(LANG, [lang]_entities[entity], cint(ts), cint(te));
    }

The last bit of code is for the entity callback. It passes the position of
the entire entity text (including internal newlines) in the buffer to the
callback function.
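
Since the callback only receives offsets, a consumer that wants the entity text has to slice it out of the original buffer. The sketch below assumes the end offset is exclusive, which matches Ragel's te pointer (one past the last character of the token); entity_text is a hypothetical helper.

```c
#include <string.h>

/* Copy the entity text for offsets [start, end) of buffer into out and
 * terminate it. Returns the text length, or -1 if the offsets are invalid
 * or out (of out_size bytes) is too small. */
int entity_text(const char *buffer, int start, int end,
                char *out, int out_size) {
  int len = end - start;
  if (start < 0 || len < 0 || len >= out_size)
    return -1;
  memcpy(out, buffer + start, (size_t) len);
  out[len] = '\0';
  return len;
}
```

For a multi-line comment the copied text contains the internal newlines, because the entity callback is made once for the whole entity rather than once per line.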

Defining Patterns for Entities:
Now it is time to write patterns for each entity in your language. That
is not very hard, except when an entity can cover multiple lines.
Comments and strings in particular can do this. To make an accurate line
counter, you will need to count the lines covered by multi-line entities.
When you detect a newline inside your multi-line entity, you should set the
entity variable to be INTERNAL_NL (-1) and call the main action. The main
action should have a case for INTERNAL_NL separate from the newline entity.
In it, you will check if the current line is code or comment and call the
callback function with the appropriate string ("lcode" or "lcomment") and
the beginning and end of the line (including the newline). Afterwards, you
will reset the line_contains_code and whole_line_comment state variables.
Then set the line_start variable to be p, the current Ragel buffer position.
Because line_contains_code and whole_line_comment have been reset, any
non-newline and non-space character in the multi-line pattern should set
line_contains_code or whole_line_comment back to 1. Otherwise you would
count the line as blank.
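
The internal-newline bookkeeping can be approximated in plain C for a C-style block comment. This is only a sketch of the idea, not the Ragel action itself: the state is kept in local variables, the comment is assumed to be the first thing on its line, and scan_block_comment and tally_line are hypothetical names. The sketch moves line_start to the character after the newline so that each reported line includes its own newline.

```c
#include <ctype.h>
#include <string.h>

typedef void (*entity_cb)(const char *lang, const char *entity,
                          int start, int end);

/* Example callback (hypothetical) that tallies the reported lines. */
static int comment_lines, blank_lines;
void tally_line(const char *lang, const char *entity, int start, int end) {
  (void) lang; (void) start; (void) end;
  if (strcmp(entity, "lcomment") == 0)
    comment_lines++;
  else if (strcmp(entity, "lblank") == 0)
    blank_lines++;
}

/* Walk a block comment whose two-character opening marker is at p and
 * report a line entity at every internal newline. Returns the position
 * just past the closing marker, or pe if the comment is unterminated.
 * Assumes p + 2 <= pe. */
const char *scan_block_comment(const char *buf, const char *p,
                               const char *pe, entity_cb cb) {
  const char *line_start = p;
  int whole_line_comment = 1;    /* the opening marker is comment text */
  p += 2;                        /* step over the opening marker */
  while (p < pe) {
    if (p[0] == '*' && p + 1 < pe && p[1] == '/')
      return p + 2;              /* comment ends; its last line is still open */
    if (*p == '\n') {
      /* internal newline: report the finished line, then reset the state */
      if (cb)
        cb("c", whole_line_comment ? "lcomment" : "lblank",
           (int) (line_start - buf), (int) (p + 1 - buf));
      whole_line_comment = 0;
      line_start = p + 1;        /* the next line starts after the newline */
    } else if (!isspace((unsigned char) *p)) {
      whole_line_comment = 1;    /* non-space text marks the line again */
    }
    p++;
  }
  return pe;
}
```

The line on which the comment ends is not reported here. In a real parser it stays open until the newline entity is matched, since code may still follow the comment on that line.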

For multi-line matches, it is important to call the 'code' or 'comment'
actions (mentioned earlier) before an internal newline is detected so the
line_contains_code and whole_line_comment variables are properly set. For
other entities, you can add code for setting line_contains_code and
whole_line_comment inside the switch statement of the main action. See the
'code' and 'comment' actions in 'common.rl' for the appropriate code.

That's all there is to it!