PARSER_DOC written by Mitchell Foral

I will assume the reader has a decent knowledge of how Ragel works and of the
Ragel syntax. If not, please review the Ragel manual found at:
http://research.cs.queensu.ca/~thurston/ragel/

All parsers must at least:
* Call a callback function when a line of code is parsed.
* Call a callback function when a line of comment is parsed.
* Call a callback function when a blank line is parsed.
Additionally, a parser can call the callback function for the position of each
entity parsed.

Take a look at c.rl, and keep it open for reference while reading this
document, to better understand how parsers work and how to write one.
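
As a rough illustration of the callback contract above, here is a minimal C
sketch of a callback that tallies the three line kinds. The signature follows
the parse_c() declaration shown later in this document; the counter names and
tally_callback itself are hypothetical and not part of ohcount.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical counters for the three line kinds; not part of ohcount. */
static int n_code = 0, n_comment = 0, n_blank = 0;

/* Same shape as the callback parameter of parse_c() in this document.
 * start and end are buffer offsets of the reported line or entity. */
static void tally_callback(const char *lang, const char *entity,
                           int start, int end) {
  assert(lang != NULL && entity != NULL && start <= end);
  if (strcmp(entity, "lcode") == 0) n_code++;
  else if (strcmp(entity, "lcomment") == 0) n_comment++;
  else if (strcmp(entity, "lblank") == 0) n_blank++;
  /* Any other entity name (e.g. "string") is ignored by this tally. */
}
```

A function like this would be passed as the callback argument of a parser
function, with the count flag set so that lines rather than entities are
reported.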

First create your parser in ext/ohcount_native/ragel_parsers/. Its name
should be the language you are parsing with a '.rl' extension. You will not
have to manually compile any parsers, as the Rakefile does this automatically
for you. Every parser must have the following at the top:

/************************* Required for every parser *************************/
#ifndef RAGEL_C_PARSER
#define RAGEL_C_PARSER

#include "ragel_parser_macros.h"

// the name of the language
const char *C_LANG = "c";

// the language's entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/*****************************************************************************/

And the following at the bottom:

/************************* Required for every parser *************************/

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param count Integer flag specifying whether or not to count lines. If yes,
 *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
 *   machine optimized for returning entity positions.
 * @param *callback Callback function. If count is set, callback is called for
 *   every line of code, comment, or blank with 'lcode', 'lcomment', and
 *   'lblank' respectively. Otherwise callback is called for each entity found.
 */
void parse_c(char *buffer, int length, int count,
  void (*callback) (const char *lang, const char *entity, int start, int end)
  ) {
  cs = (count) ? c_en_c_line : c_en_c_entity;

  // if no newline at EOF; callback contents of last line
  if (count) { process_last_line(C_LANG) }
}

#endif

/*****************************************************************************/

(Your parser will go between these two blocks.)

The code can be found in the existing c.rl parser. You will need to change:
* RAGEL_[lang]_PARSER - Replace [lang] with your language name. So if you
  are writing a C parser, it would be RAGEL_C_PARSER.
* [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
  the name of your language to parse. [lang] is your language name. For C it
  is C_LANG, with the value "c".
* [lang]_entities - Set the variable name to be [lang]_entities (e.g.
  c_entities). The value is an array of the entity names your language has.
  For example, C has comment, string, number, etc. entities. You should
  definitely have "space" and "any" entities. "any" entities are typically
  used for entity machines (discussed later) and match any character that
  is not recognized, so the parser does not do something unpredictable.
* enum - Change the values of the enum to correspond with your entities, so
  that looking up [lang]_entities[ENTITY] in your parser returns the
  associated entity's string name.
* parse_[lang] - Set the function name to parse_[lang] where again, [lang]
  is the name of your language. In the case of C, it is parse_c.
* [lang]_en_[lang]_line - The line counting machine.
* [lang]_en_[lang]_entity - The entity machine.
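
To make the renaming and the enum-to-name correspondence concrete, here is a
small C sketch for a made-up language called "foo". The language, its entity
list, and the two helper functions are all hypothetical; only the naming
pattern ([lang]_LANG, [lang]_entities, the enum) comes from the list above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical language "foo", used only to illustrate the naming pattern. */
const char *FOO_LANG = "foo";

/* Entity names; "space" and "any" are the two the text says to always have. */
const char *foo_entities[] = { "space", "comment", "string", "any" };

/* Enum constants in the same order as foo_entities, so that
 * foo_entities[FOO_COMMENT] is "comment". */
enum { FOO_SPACE = 0, FOO_COMMENT, FOO_STRING, FOO_ANY };

/* Hypothetical helper: maps an enum constant to its entity name. */
const char *foo_entity_name(int entity) {
  assert(entity >= FOO_SPACE && entity <= FOO_ANY);
  return foo_entities[entity];
}

/* Hypothetical helper: reverse lookup, returns -1 for an unknown name. */
int foo_entity_index(const char *name) {
  int i;
  assert(name != NULL);
  for (i = FOO_SPACE; i <= FOO_ANY; i++) {
    if (strcmp(foo_entities[i], name) == 0) return i;
  }
  return -1;
}
```

If the array and the enum drift out of order, every lookup silently returns
the wrong name, which is why the two are kept side by side at the top of each
parser.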

You may be asking why you have to rename variables and functions. If
variables have the same name in header files (which is what parsers are),
the compiler complains. Also, when you have languages embedded inside each
other, any identifiers with the same name can easily be mixed up. It is also
important to prefix your Ragel definitions with your language name to avoid
conflicts with other parsers.

Additional variables available to parsers are in the "ragel_parser_macros.h"
file. Take a look at it and try to understand what the variables are used for.
They will make more sense later on.

Now you can define your Ragel parser. Name your machine after your language,
'write data', and include 'common.rl', a file with common Ragel definitions,
actions, etc. For example:

Before you begin to write patterns for each entity in your language, you need
to understand how the parser should work.

Each parser has two machines: one optimized for counting lines of code,
comments, and blanks; the other for identifying entity positions in the
buffer.

Line Counting Machine:
This machine should be written as a line-by-line parser for multiple lines.
This means you match any combination of entities except a newline until
you reach a newline. If the line contains only spaces, or nothing at all,
it is blank. If the line contains only a comment, optionally preceded by
spaces, it is a comment. If the line contains anything but a comment after
the spaces (if there are any), it is a line of code. You will do this using
a Ragel scanner.
The callback function will be called for each line parsed.
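
The classification rules above can be sketched in plain C, outside of Ragel.
This is only an illustration of the decision order (blank, then comment, then
code). It assumes a single-line "//" comment syntax, and classify_line is a
hypothetical name, not something found in ohcount.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Classifies one line (passed without its newline) using the rules above.
 * Assumption: the only comment syntax is "//" at the start of the content. */
const char *classify_line(const char *line) {
  assert(line != NULL);
  /* Skip leading spaces; they never decide the kind of line. */
  while (*line != '\0' && isspace((unsigned char)*line)) line++;
  if (*line == '\0') return "lblank";                 /* nothing but spaces */
  if (strncmp(line, "//", 2) == 0) return "lcomment"; /* comment comes first */
  return "lcode";                                     /* anything else */
}
```

Note that a line such as "x = 1; // note" is code under these rules, because
something other than a comment follows the leading spaces.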

Scanner Parser Structure:
A scanner parser will look like this:
[lang]_line := |*
  entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
  entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
  ...
  entityn ${ entity = ENTITYN; } => [lang]_ccallback;
*|;
(As usual, replace [lang] with your language name.)
Each entity is the pattern for an entity to match, the last one typically
being the newline entity. For each match, the entity variable is set to a
constant defined in the enum, and the main action is called (you will need
to create this action above the scanner).

When you detect whether a line is code or comment, you should call
the appropriate 'code' or 'comment' action defined in common.rl as soon
as possible. It is not necessary to worry about whether these
actions are called more than once for a given line; the first call to
either sets the status of the line permanently. Sometimes you cannot call
'code' or 'comment' for one reason or another. Do not worry, as this is
discussed later.

When you reach a newline, you will need to decide whether the current line
is a line of code, comment, or blank. This is easy. Simply check whether the
line_contains_code or whole_line_comment variable is set to 1. If
neither of them is, the line is blank. Then call the callback function
(not action) with an "lcode", "lcomment", or "lblank" string, and the
start and end positions of that line (including the newline). The start
position of the line is in the line_start variable. It should be set at
the beginning of every line either through the 'code' or 'comment'
actions, or manually in the main action. Finally, the line_contains_code,
whole_line_comment, and line_start state variables must be reset. All this
should be done within the main action shown below.
Note: For most parsers, the std_newline(lang) macro is sufficient and does
everything in the main action mentioned above. The lang parameter is the
[lang]_LANG string.
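
The newline handling described above can be sketched as an ordinary C
function. This is not the std_newline macro itself, whose exact contents are
in ragel_parser_macros.h; it is a hedged illustration of the same steps. The
struct, on_newline, and record_callback are hypothetical names, while the
three state variables are named after the ones in the text.

```c
#include <assert.h>
#include <string.h>

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Per-line state, named after the variables described in the text. */
struct line_state {
  int line_contains_code;
  int whole_line_comment;
  int line_start;
};

/* Hypothetical recorder, only here so the effect can be observed. */
static const char *last_entity = "";
static int last_start = -1, last_end = -1;

static void record_callback(const char *lang, const char *entity,
                            int start, int end) {
  (void)lang;
  last_entity = entity;
  last_start = start;
  last_end = end;
}

/* Reports the line that ends at offset `end` (just past its newline),
 * then resets the state for the next line. */
void on_newline(struct line_state *s, const char *lang, int end,
                line_callback callback) {
  assert(s != NULL && callback != NULL);
  if (s->line_contains_code)
    callback(lang, "lcode", s->line_start, end);
  else if (s->whole_line_comment)
    callback(lang, "lcomment", s->line_start, end);
  else
    callback(lang, "lblank", s->line_start, end);
  s->line_contains_code = 0;
  s->whole_line_comment = 0;
  s->line_start = end; /* the next line begins where this one ended */
}
```

Keeping the reset in the same place as the report is the design point: a
flag left over from one line would otherwise leak into the next.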

Main Action Structure:
The main action looks like this:
action [lang]_ccallback {

Defining Patterns for Entities:
Now it is time to write patterns for each entity in your language. That
is not very hard, except when an entity can cover multiple lines.
Comments and strings in particular can do this. To make an accurate line
counter, you will need to count the lines covered by multi-line entities.
When you detect a newline inside your multi-line entity, you should set
the entity variable to INTERNAL_NL (-2) and call the main action. The
main action should have a case for INTERNAL_NL separate from the newline
entity. In it, you will check whether the current line is code or comment and
call the callback function with the appropriate string ("lcode" or
"lcomment") and the beginning and end of the line (including the newline).
Afterwards, you will reset the line_contains_code and whole_line_comment
state variables. Then set the line_start variable to p, the current
Ragel buffer position. Because line_contains_code and whole_line_comment
have been reset, any non-newline and non-space character in the multi-line
pattern should set line_contains_code or whole_line_comment back to 1.
Otherwise you would count the line as blank.
Note: For most parsers, the std_internal_newline(lang) macro is sufficient
and does everything in the main action mentioned above. The lang parameter
is the [lang]_LANG string.
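
The internal-newline bookkeeping can be illustrated with a plain C loop over
a buffer that lies entirely inside one multi-line comment. This is a sketch
under simplifying assumptions, not the std_internal_newline macro: the
function and counter names are hypothetical, only the comment flag is
tracked, and a trailing line without a newline is not reported here.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Hypothetical counters, only here so the effect can be observed. */
static int n_comment = 0, n_blank = 0;

static void count_callback(const char *lang, const char *entity,
                           int start, int end) {
  (void)lang; (void)start; (void)end;
  if (strcmp(entity, "lcomment") == 0) n_comment++;
  else if (strcmp(entity, "lblank") == 0) n_blank++;
}

/* Walks text that is wholly inside a multi-line comment and reports each
 * completed line, including its newline, as the text describes. */
void scan_inside_comment(const char *buf, int length, line_callback callback) {
  int whole_line_comment = 0;
  int line_start = 0;
  int p;
  assert(buf != NULL && callback != NULL);
  for (p = 0; p < length; p++) {
    if (buf[p] == '\n') {
      /* Internal newline: report the line, then reset the state. */
      callback("c", whole_line_comment ? "lcomment" : "lblank",
               line_start, p + 1);
      whole_line_comment = 0;
      line_start = p + 1;
    } else if (!isspace((unsigned char)buf[p])) {
      /* Any visible character sets the flag again after a reset. */
      whole_line_comment = 1;
    }
  }
}
```

The empty line in the middle of a comment is the case the last sentences of
the paragraph warn about: with the flag reset and no visible character to set
it again, that line is reported as blank.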

For multi-line matches, it is important to call the 'code' or 'comment'
actions (mentioned earlier) before an internal newline is detected so the
line_contains_code and whole_line_comment variables are properly set. For
other entities, you can use the 'code' macro inside the main action, which
executes the same code as the Ragel 'code' action. Other C macros are
'comment' and 'ls'; the latter is typically used for the SPACE entity.

* You can be a bit sloppy with the line counting machine. For example, the
  only C entities that can contain newlines are strings and comments, so
  INTERNAL_NL would only be necessary inside them. Other than those,
  anything other than spaces is considered code, so do not waste your time
  defining specific patterns for other entities.

Parsers with Embedded Languages:
Notation: [lang] is the parent language, [elang] is the embedded language.

To write a parser with embedded languages (such as HTML with embedded CSS
and JavaScript), you should first #include the embedded parser(s) above your
Ragel code. The header file is "[elang]_parser.h".

Next, after the inclusion of 'common.rl', add '#EMBED([elang])' on
separate lines for each embedded language. The Rakefile looks for these
special comments to embed the language for you automatically.

In your main action, you need to add another entity, CHECK_BLANK_ENTRY. It
should call the 'check_blank_entry([lang]_LANG)' macro. A blank entry is
an entry into an embedded language where the rest of the line is blank
before the newline. For example, a CSS entry in HTML is something like:
<style type="text/css">
If there is no CSS code after the entry (a blank entry), the line should
be counted as HTML code, and the 'check_blank_entry' macro handles this.
But you may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?".
This will be discussed in just a bit.
Also use the emb_newline and emb_internal_newline macros instead of the
std_newline and std_internal_newline macros.

For each embedded language you will have to define an entry and an outry. An
entry is the pattern that transitions from the parent language into the
child language. An outry is the pattern from child to parent. You will
need to put your entries in your [lang]_line machine. You will also need
to re-create each embedded language's line machine (define as
[lang]_[elang]_line; e.g. html_css_line) and put outry patterns in those.
Entries typically would be defined as [lang]_[elang]_entry, and outries
as [lang]_[elang]_outry.
Note: An outry should have a 'check_blank_outry' action so the line is not
mistakenly counted as a line of embedded language code if it is actually a
line of parent language code.

Entry pattern actions should be:
[lang]_[elang]_entry @{ entity = CHECK_BLANK_ENTRY; } @[lang]_ccallback
  @{ saw([lang]_LANG)} => { fcall [lang]_[elang]_line; };
This checks for a blank entry and, if it is one, counts the line
as a line of parent language code. If it is not, the macro does not do
anything. The machine then transitions into the child language.

Outry pattern actions should be:
This sets the current Ragel parser position to the beginning
of the outry so the line is counted as a line of parent language code if
no child code is on the same line. The machine then transitions into the
parent language.

Entity Identifying Machine:
This machine does not have to be written as a line-by-line parser. It only
has to identify the positions of language entities, such as whitespace,
comments, strings, etc. in sequence. As a result, it can be written much
faster and more easily, with less thought, than a line counter. Using a
scanner is most efficient.
The callback function will be called for each entity parsed.

[lang]_entity := |*
  entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
  entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
  ...
  entityn ${ entity = ENTITYN; } => [lang]_ecallback;
*|;

Main Action Structure:
action [lang]_ecallback {
  callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
}
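
To show what the entity machine reports, here is a plain C sketch that walks
a buffer and calls the callback once per entity with its start and end
offsets, much as ts and te are used above. It only distinguishes "space" from
"any", uses a made-up language name "foo", and scan_entities and
record_entity are hypothetical names. A real parser would do this with a
Ragel scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef void (*entity_callback)(const char *lang, const char *entity,
                                int start, int end);

/* Hypothetical recorder, only here so the reported entities can be seen. */
#define MAX_SEEN 8
struct seen_entity { const char *entity; int start; int end; };
static struct seen_entity seen[MAX_SEEN];
static int n_seen = 0;

static void record_entity(const char *lang, const char *entity,
                          int start, int end) {
  (void)lang;
  assert(n_seen < MAX_SEEN);
  seen[n_seen].entity = entity;
  seen[n_seen].start = start;
  seen[n_seen].end = end;
  n_seen++;
}

/* Reports maximal runs of whitespace as "space" and everything else as
 * "any". ts is the start of the current run, te is one past its end. */
void scan_entities(const char *buf, int length, entity_callback callback) {
  int ts = 0;
  assert(buf != NULL && callback != NULL);
  while (ts < length) {
    int is_space = isspace((unsigned char)buf[ts]) != 0;
    int te = ts + 1;
    while (te < length &&
           (isspace((unsigned char)buf[te]) != 0) == is_space) {
      te++;
    }
    callback("foo", is_space ? "space" : "any", ts, te);
    ts = te;
  }
}
```

Unlike the line counter, nothing here cares about newlines: each entity is
reported in sequence with its own offsets, which is why this machine is the
simpler of the two to write.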

Parsers for Embedded Languages: