PARSER_DOC written by Mitchell Foral

I will assume the reader has a decent knowledge of how Ragel works and the
Ragel syntax. If not, please review the Ragel manual found at:
  http://research.cs.queensu.ca/~thurston/ragel/

All parsers must at least:
  * Call a callback function when a line of code is parsed.
  * Call a callback function when a line of comment is parsed.
  * Call a callback function when a blank line is parsed.
Additionally, a parser can call the callback function for the position of
each entity parsed.

Take a look at c.rl, and keep it open for reference while reading this
document, to better understand how parsers work and how to write one.
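
The three required callbacks can be pictured from the caller's side. The
following is a minimal sketch, assuming the callback signature taken by
parse_c later in this document and the 'lcode', 'lcomment', and 'lblank'
strings described there. The line_counts struct and the count_callback name
are hypothetical and are not part of the parsers.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tally of the three kinds of line a parser reports. */
struct line_counts {
  int code;
  int comment;
  int blank;
};

static struct line_counts counts = { 0, 0, 0 };

/* Same signature as the callback parameter of parse_c. The start and end
 * arguments are offsets into the parsed buffer and are unused here. */
void count_callback(const char *lang, const char *entity, int start, int end) {
  (void) lang; (void) start; (void) end;
  assert(entity != NULL);
  if (strcmp(entity, "lcode") == 0) counts.code++;
  else if (strcmp(entity, "lcomment") == 0) counts.comment++;
  else if (strcmp(entity, "lblank") == 0) counts.blank++;
}
```

Since the callback signature has no user data argument, the sketch keeps its
totals in a file-level variable.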

First create your parser in ext/ohcount_native/ragel_parsers/. Its name
should be the language you are parsing with a '.rl' extension. Every parser
must have the following at the top:

    /************************* Required for every parser *************************/
    #include "ragel_parser_macros.h"

    // the name of the language
    const char *C_LANG = "c";

    // the language's entities
    const char *c_entities[] = {
      "space", "comment", "string", "number", "preproc",
      "keyword", "identifier", "operator", "any"
    };

    // constants associated with the entities
    enum {
      C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
      C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
    };

    // do not change the following variables

    // used for newlines inside patterns like strings and comments that can have
    // newlines in them
    #define INTERNAL_NL -2

    // required by Ragel
    int cs, act;
    char *p, *pe, *eof, *ts, *te;

    // used for calculating offsets from buffer start for start and end positions
    char *buffer_start;
    #define cint(c) ((int) (c - buffer_start))

    // state flags for line and comment counting
    int whole_line_comment;
    int line_contains_code;

    // the beginning of a line in the buffer for line and comment counting
    char *line_start;

    // state variable for the current entity being matched
    int entity;

    /*****************************************************************************/

And the following at the bottom:

    /* Parses a string buffer with C/C++ code.
     *
     * @param *buffer The string to parse.
     * @param length The length of the string to parse.
     * @param count Integer flag specifying whether or not to count lines. If yes,
     *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
     *   machine optimized for returning entity positions.
     * @param *callback Callback function. If count is set, callback is called for
     *   every line of code, comment, or blank with 'lcode', 'lcomment', and
     *   'lblank' respectively. Otherwise callback is called for each entity found.
     */
    void parse_c(char *buffer, int length, int count,
      void (*callback) (const char *lang, const char *entity, int start, int end)
      ) {
      p = buffer;
      pe = buffer + length;
      eof = pe;

      buffer_start = buffer;
      whole_line_comment = 0;
      line_contains_code = 0;
      line_start = 0;
      entity = 0;

      %% write init;
      cs = (count) ? c_en_c_line : c_en_c_entity;
      %% write exec;

      // if no newline at EOF; callback contents of last line
      if (count) { process_last_line(C_LANG) }
    }

(Your parser will go between these two blocks.)

The code can be found in the existing c.rl parser. You will need to change:
  * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
    the name of your language to parse. [lang] is your language name. So if
    you are writing a C parser, it would be C_LANG.
  * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
    c_entities). The value is an array of string entities your language has.
    For example, C has comment, string, number, etc. entities. You should
    definitely have "space" and "any" entities. "any" entities are typically
    used for entity machines (discussed later) and match any character that
    is not recognized so the parser does not do something unpredictable.
  * enum - Change the values of the enum to correspond with your entities, so
    that looking up [lang]_entities[ENTITY] in your parser gives you the
    associated entity's string name.
  * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
    is the name of your language. In the case of C, it is parse_c.
  * [lang]_en_[lang]_line - The line counting machine.
  * [lang]_en_[lang]_entity - The entity machine.
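
The link between the entities array and the enum can be checked in plain C.
This sketch reuses the C entity names shown earlier; the entity_name helper
is hypothetical and only illustrates the lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Entity names, in the same order as the enum below. */
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

/* Each constant is the index of its name in c_entities. */
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/* Hypothetical helper: bounds-checked lookup of an entity's string name. */
const char *entity_name(int entity) {
  size_t n = sizeof(c_entities) / sizeof(c_entities[0]);
  assert(entity >= 0 && (size_t) entity < n);
  return c_entities[entity];
}
```

If an entity is added to one list but not the other, every later name is off
by one, so it is worth keeping the two definitions next to each other.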

You may be asking why you have to rename variables and functions. Parsers are
header files, and if variables in different header files have the same name,
the compiler complains. Also, when you have languages embedded inside each
other, any identifiers with the same name can easily be mixed up. It is also
important to prefix your Ragel definitions with your language to avoid
conflicts with other parsers.

Try to understand what the main variables are used for. They will make more
sense as you read on.

Now you can define your Ragel parser. Name your machine after your language,
add a 'write data' statement, and include 'common.rl', a file with common
Ragel definitions, actions, etc. For example:

Before you begin to write patterns for each entity in your language, you need
to understand how the parser should work.

Each parser has two machines: one optimized for counting lines of code,
comments, and blanks; the other for identifying entity positions in the
buffer.

Line Counting Machine:
  This machine should be written as a line-by-line parser for multiple lines.
  This means you match any combination of entities except a newline until
  you reach a newline. If the line contains only spaces, or nothing at all,
  it is blank. If the line contains a comment, with or without leading
  spaces, the line is a comment. If the line contains anything but a comment
  after any leading spaces, it is a line of code. You will do this using a
  Ragel scanner.
  The callback function will be called for each line parsed.
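
The blank, comment, and code rules above can be tried out without Ragel. The
following is only a sketch in plain C, not a scanner: it looks at a single
line and assumes "//" is the only comment syntax. The classify_line name is
hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Classifies a single line following the rules above. Assumption: "//" is
 * the only comment syntax; strings and block comments are not handled. */
const char *classify_line(const char *line) {
  assert(line != NULL);
  line += strspn(line, " \t");           /* skip leading spaces */
  if (*line == '\0' || *line == '\n')    /* nothing but spaces */
    return "lblank";
  if (line[0] == '/' && line[1] == '/')  /* a comment comes first */
    return "lcomment";
  return "lcode";                        /* anything else is code */
}
```

A real line counting machine makes the same decision, but from the entities
the scanner matches rather than by inspecting characters directly.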

  Scanner Parser Structure:
    A scanner parser will look like this:

    [lang]_line := |*
      entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
      entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
      ...
      entityn ${ entity = ENTITYN; } => [lang]_ccallback;
    *|;

    (As usual, replace [lang] with your language name.)
    Each entity is the pattern for an entity to match, the last one typically
    being the newline entity. For each match, the entity variable is set to a
    constant defined in the enum, and the main action is called (you will need
    to create this action above the scanner).

    When you detect whether a line is code or comment, you should call
    the appropriate 'code' or 'comment' action defined in common.rl as soon
    as possible. It is not necessary to worry about whether these
    actions are called more than once for a given line; the first call to
    either sets the status of the line permanently. Sometimes you cannot call
    'code' or 'comment' for one reason or another. Do not worry, as this is
    discussed later.

    When you reach a newline, you will need to decide whether the current line
    is a line of code, comment, or blank. This is easy. Simply check whether
    the line_contains_code or whole_line_comment variable is set to 1. If
    neither is, the line is blank. Then call the callback function
    (not action) with an "lcode", "lcomment", or "lblank" string, and the
    start and end positions of that line (including the newline). The start
    position of the line is in the line_start variable. It should be set at
    the beginning of every line either through the 'code' or 'comment'
    actions, or manually in the main action. Finally, the line_contains_code,
    whole_line_comment, and line_start state variables must be reset. All this
    should be done within the main action shown below.
    Note: For most parsers, the std_newline(lang) macro is sufficient and does
    everything in the main action mentioned above. The lang parameter is the
    [lang]_LANG string.

  Main Action Structure:
    The main action looks like this:

    action [lang]_ccallback {

  Defining Patterns for Entities:
    Now it is time to write patterns for each entity in your language. That
    does not seem very hard, except when your entity can cover multiple lines.
    Comments and strings in particular can do this. To make an accurate line
    counter, you will need to count the lines covered by multi-line entities.
    When you detect a newline inside your multi-line entity, you should set
    the entity variable to INTERNAL_NL (-2) and call the main action. The
    main action should have a case for INTERNAL_NL separate from the newline
    entity. In it, you will check whether the current line is code or comment
    and call the callback function with the appropriate string ("lcode" or
    "lcomment") and the beginning and end of the line (including the newline).
    Afterwards, you will reset the line_contains_code and whole_line_comment
    state variables. Then set the line_start variable to p, the current
    Ragel buffer position. Because line_contains_code and whole_line_comment
    have been reset, any non-newline and non-space character in the multi-line
    pattern should set line_contains_code or whole_line_comment back to 1.
    Otherwise you would count the line as blank.
    Note: For most parsers, the std_internal_newline(lang) macro is sufficient
    and does everything in the main action mentioned above. The lang parameter
    is the [lang]_LANG string.
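
The flag handling around newlines can be simulated in plain C. The sketch
below is not the std_newline or std_internal_newline macro. It uses integer
offsets in place of the char pointers a real parser holds, records callback
invocations in an array so they can be inspected, and uses one handler for
both the newline and INTERNAL_NL cases, which is a simplification. The
record, end_line, and count_block_comment names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS 16

/* Recorded callback invocations, so the result can be inspected. */
struct event { const char *kind; int start; int end; };
static struct event events[MAX_EVENTS];
static int n_events = 0;

static void record(const char *kind, int start, int end) {
  assert(n_events < MAX_EVENTS);
  events[n_events].kind = kind;
  events[n_events].start = start;
  events[n_events].end = end;
  n_events++;
}

/* State named after the variables described in this document. */
static int whole_line_comment = 0;
static int line_contains_code = 0;
static int line_start = 0;

/* Reports the line ending at offset p (newline included), then resets the
 * flags and moves line_start to p. */
static void end_line(int p) {
  const char *kind = "lblank";
  if (line_contains_code) kind = "lcode";
  else if (whole_line_comment) kind = "lcomment";
  record(kind, line_start, p);
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = p;
}

/* Walks a buffer that is assumed to hold nothing but one block comment. */
void count_block_comment(const char *buf) {
  int i, len = (int) strlen(buf);
  for (i = 0; i < len; i++) {
    if (buf[i] == '\n') {
      end_line(i + 1);
    } else if (buf[i] != ' ' && buf[i] != '\t') {
      /* any non-space character marks the line as comment again */
      whole_line_comment = 1;
    }
  }
}
```

Without the assignment in the else branch, every line after the first
newline would be reported as "lblank", which is the miscount the paragraph
above warns about.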

    For multi-line matches, it is important to call the 'code' or 'comment'
    actions (mentioned earlier) before an internal newline is detected so the
    line_contains_code and whole_line_comment variables are properly set. For
    other entities, you can use the 'code' macro inside the main action, which
    executes the same code as the Ragel 'code' action. Other C macros are
    'comment' and 'ls'; the latter is typically used for the SPACE entity when
    setting line_start.

    * You can be a bit sloppy with the line counting machine. For example, the
      only C entities that can contain newlines are strings and comments, so
      INTERNAL_NL would only be necessary inside them. Other than those,
      anything other than spaces is considered code, so do not waste your time
      defining specific patterns for other entities.

Entity Identifying Machine:
  This machine does not have to be written as a line-by-line parser. It only
  has to identify the positions of language entities, such as whitespace,
  comments, strings, etc. in sequence. As a result, it can be written much
  faster and more easily, with less thought, than a line counter. Using a
  scanner is most efficient.
  The callback function will be called for each entity parsed.
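
To show what a caller can do with the reported positions, the sketch below
recovers an entity's text from its start and end offsets. It assumes the
offsets are measured from the start of the buffer, as the cint macro computes
them, and that the end offset is exclusive, since Ragel's te points one past
the last character of a match. The entity_text helper is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Copies the text of an entity spanning [start, end) of buffer into out and
 * terminates it. Returns the number of characters copied. */
int entity_text(const char *buffer, int start, int end,
                char *out, int out_size) {
  int len = end - start;
  assert(buffer != NULL && out != NULL);
  assert(start >= 0 && len >= 0 && len < out_size);
  memcpy(out, buffer + start, (size_t) len);
  out[len] = '\0';
  return len;
}
```

For the buffer "int x; /* hi */", a comment entity reported with start 7 and
end 15 yields the text "/* hi */".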

  Scanner Parser Structure:
    [lang]_entity := |*
      entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
      entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
      ...
      entityn ${ entity = ENTITYN; } => [lang]_ecallback;
    *|;

  Main Action Structure:
    action [lang]_ecallback {
      callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
    }