// parser.h written by Mitchell Foral. mitchell<att>caladbolg.net.
// See COPYING for license information.

#ifndef OHCOUNT_PARSER_H
#define OHCOUNT_PARSER_H

#include "sourcefile.h"
 * @page parser_doc Parser Documentation
 * @author Mitchell Foral
 *
 * @section overview Overview
 *
 * I will assume the reader has a decent knowledge of how Ragel works and the
 * Ragel syntax. If not, please review the Ragel manual found at:
 * http://research.cs.queensu.ca/~thurston/ragel/
 *
 * All parsers must at least:
 *
 * @li Call a callback function when a line of code is parsed.
 * @li Call a callback function when a line of comment is parsed.
 * @li Call a callback function when a blank line is parsed.
 *
 * Additionally, a parser can call the callback function for the position of
 * each entity in the source file.
 *
 * Take a look at 'c.rl' and even keep it open for reference when reading this
 * document to better understand how parsers work and how to write one.
 *
 * @section writing Writing a Parser
 *
 * First create your parser in 'src/parsers/'. Its name should be the language
 * you are parsing with a '.rl' extension. You will not have to compile any
 * parsers manually, as this is done automatically for you. However, you do
 * need to add your parser to 'hash/parsers.gperf'.
 *
 * Every parser must have the following at the top:
 *
 * @include parser_doc_1
 *
 * And the following at the bottom:
 *
 * @include parser_doc_2
 *
 * (Your parser will go between these two blocks.)
 *
 * The code can be found in the existing 'c.rl' parser. You will need to change:
 * @li OHCOUNT_[lang]_PARSER_H - Replace [lang] with your language name. So if
 *   you are writing a C parser, it would be OHCOUNT_C_PARSER_H.
 * @li [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
 *   the name of your language to parse as defined in languages.h. [lang] is
 *   your language name. For C it would be C_LANG.
 * @li [lang]_entities - Set the variable name to be [lang]_entities (e.g.
 *   c_entities). The value is an array of string entities your language has.
 *   For example, C has comment, string, number, etc. entities. You should
 *   definitely have "space" and "any" entities. "any" entities are typically
 *   used for entity machines (discussed later) and match any character that is
 *   not recognized, so the parser does not do something unpredictable.
 * @li enum - Change the value of the enum to correspond with your entities. So
 *   if in your parser you look up [lang]_entities[ENTITY], you will get the
 *   associated entity's string name.
 * @li parse_[lang] - Set the function name to parse_[lang] where again, [lang]
 *   is the name of your language. In the case of C, it is parse_c.
 * @li [lang]_en_[lang]_line - The line counting machine.
 * @li [lang]_en_[lang]_entity - The entity machine.
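 *
 * As a rough illustration of the renaming described in the list above, here is
 * a minimal sketch for a hypothetical language called "foo". The names
 * FOO_LANG, foo_entities, the FOO_* constants, and the foo_entity_name()
 * helper are assumptions made up for this example; the authoritative
 * boilerplate is the one in 'c.rl', and a real parser would take the language
 * name from languages.h.
 *
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical language "foo"; a real parser uses the name from languages.h.
const char *FOO_LANG = "foo";

// Entity names. The order must match the enum below.
const char *foo_entities[] = {
  "space", "comment", "string", "any"
};

// Indexes into foo_entities.
enum {
  FOO_SPACE = 0, FOO_COMMENT, FOO_STRING, FOO_ANY
};

// Hypothetical helper (not part of ohcount): look up an entity's string name.
static const char *foo_entity_name(int entity) {
  size_t count = sizeof(foo_entities) / sizeof(foo_entities[0]);
  assert(entity >= 0 && (size_t)entity < count);
  return foo_entities[entity];
}
```
 *
 * Keeping the array and the enum in the same order is what makes the
 * [lang]_entities[ENTITY] lookup return the right name.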
 *
 * You may be asking why you have to rename variables and functions. Well, if
 * variables have the same name in header files (which is what parsers are), the
 * compiler complains. Also, when you have languages embedded inside each other,
 * any identifiers with the same name can easily be mixed up. It is also
 * important to prefix your Ragel definitions with your language to avoid
 * conflicts with other parsers.
 *
 * Additional variables available to parsers are in the parser_macros.h file.
 * Take a look at it and try to understand what the variables are used for. They
 * will make more sense later on.
 *
 * Now you can define your Ragel parser. Name your machine after your language,
 * add a "write data" statement, and include 'common.rl', a file with common
 * Ragel definitions, actions, etc. For example:
 *
 * @include parser_doc_3
 *
 * Before you begin to write patterns for each entity in your language, you need
 * to understand how the parser should work.
 *
 * Each parser has two machines: one optimized for counting lines of code,
 * comments, and blanks; the other for identifying entity positions in the
 * source file.
 *
 * @section line Line Counting Machine
 *
 * This machine should be written as a line-by-line parser for multiple lines.
 * This means you match any combination of entities except a newline up until
 * you do reach a newline. If the line contains only spaces, or nothing at all,
 * it is blank. If the line contains only a comment, optionally preceded by
 * spaces, it is a comment. If the line contains anything but a comment after
 * spaces (if there are any), it is a line of code. You will do this using a
 * Ragel scanner. The callback function will be called for each line.
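 *
 * The classification rule above can be sketched in plain C. This is only an
 * illustration for a hypothetical language whose comments all start with "//";
 * it is not how ohcount is implemented (a real parser does this with a Ragel
 * scanner), and the names line_kind and classify_line() are made up here.
 *
```c
#include <assert.h>
#include <stddef.h>

typedef enum { LINE_BLANK, LINE_COMMENT, LINE_CODE } line_kind;

// Classify one line (without its newline) of a hypothetical language whose
// only comments start with "//". Strings and multi-line entities are ignored.
static line_kind classify_line(const char *line, size_t len) {
  size_t i = 0;
  assert(line != NULL);
  while (i < len && (line[i] == ' ' || line[i] == '\t'))
    i++;                                   // skip leading spaces
  if (i == len)
    return LINE_BLANK;                     // only spaces, or nothing at all
  if (i + 1 < len && line[i] == '/' && line[i + 1] == '/')
    return LINE_COMMENT;                   // a comment after optional spaces
  return LINE_CODE;                        // anything else is code
}
```
 *
 * Note that a line with code followed by a trailing comment is still a line of
 * code, because something other than a comment comes first.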
 *
 * @subsection line_scanner Scanner Parser Structure
 *
 * A scanner parser will look like this:
 *
 * @include parser_doc_4
 *
 * (As usual, replace [lang] with your language name.)
 *
 * Each entity is the pattern for an entity to match, the last one typically
 * being the newline entity. For each match, the entity variable is set to a
 * constant defined in the enum, and the main action is called (you will need
 * to create this action above the scanner).
 *
 * As soon as you can tell whether a line is code or comment, you should call
 * the appropriate \@code or \@comment action defined in 'common.rl'. It is not
 * necessary to worry about whether or not these actions are called more than
 * once for a given line; the first call to either sets the status of the line
 * permanently. Sometimes you cannot call \@code or \@comment for one reason or
 * another. Do not worry, as this is discussed later.
 *
 * When you reach a newline, you will need to decide whether the current line is
 * a line of code, comment, or blank. This is easy. Simply check if the
 * #line_contains_code or #whole_line_comment variables are set to 1. If neither
 * is, the line is blank. Then call the callback function (not action) with an
 * "lcode", "lcomment", or "lblank" string, and the start and end positions of
 * that line (including the newline). The start position of the line is in the
 * #line_start variable. It should be set at the beginning of every line either
 * through the \@code or \@comment actions, or manually in the main action.
 * Finally, the #line_contains_code, #whole_line_comment, and #line_start state
 * variables must be reset. All this should be done within the main action
 * shown below. Note: For most parsers, the std_newline(lang) macro is
 * sufficient and does everything in the main action mentioned above. The lang
 * parameter is the [lang]_LANG string.
 *
 * @subsection line_action Main Action Structure
 *
 * The main action looks like this:
 *
 * @include parser_doc_5
 *
 * @subsection line_entity_patterns Defining Patterns for Entities
 *
 * Now it is time to write patterns for each entity in your language. That does
 * not seem very hard, except when your entity can cover multiple lines.
 * Comments and strings in particular can do this. To make an accurate line
 * counter, you will need to count the lines covered by multi-line entities.
 * When you detect a newline inside your multi-line entity, you should set the
 * entity variable to be #INTERNAL_NL and call the main action. The main action
 * should have a case for #INTERNAL_NL separate from the newline entity. In it,
 * you will check if the current line is code or comment and call the callback
 * function with the appropriate string ("lcode" or "lcomment") and beginning
 * and end of the line (including the newline). Afterwards, you will reset the
 * #line_contains_code and #whole_line_comment state variables. Then set the
 * #line_start variable to be #p, the current Ragel buffer position. Because
 * #line_contains_code and #whole_line_comment have been reset, any non-newline
 * and non-space character in the multi-line pattern should set
 * #line_contains_code or #whole_line_comment back to 1. Otherwise you would
 * count the line as blank.
 *
 * Note: For most parsers, the std_internal_newline(lang) macro is sufficient
 * and does everything in the main action mentioned above. The lang parameter
 * is the [lang]_LANG string.
 *
 * For multi-line matches, it is important to call the \@code or \@comment
 * actions (mentioned earlier) before an internal newline is detected so the
 * #line_contains_code and #whole_line_comment variables are properly set. For
 * other entities, you can use the #code macro inside the main action, which
 * executes the same code as the Ragel \@code action. Other C macros are
 * #comment and #ls; the latter is typically used for the SPACE entity when
 * defining #line_start.
 *
 * Also for multi-line matches, it may be necessary to use the \@enqueue and
 * \@commit actions. If it is possible that a multi-line entity will not have an
 * ending delimiter (for example a string), use the \@enqueue action as soon as
 * the start delimiter has been detected, and the \@commit action as soon as
 * the end delimiter has been detected. This will eliminate the potential for
 * any counting errors.
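 *
 * One way to picture why this helps is a queue of provisional line reports
 * that are only delivered once the end delimiter is seen. The sketch below
 * shows that general idea only. It is an assumption about the mechanism, not
 * the code behind the \@enqueue and \@commit actions in 'common.rl', and every
 * name in it (report_queue, queue_begin, queue_report, queue_commit,
 * queue_rollback) is hypothetical.
 *
```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 32   // arbitrary capacity for this sketch

typedef struct { const char *kind; int start; int end; } line_report;

// Hypothetical queue: while 'queueing' is set, line reports are held back
// instead of being delivered.
typedef struct {
  line_report pending[MAX_PENDING];
  int npending;    // reports currently held back
  int queueing;    // 1 between the start delimiter and commit or rollback
  int delivered;   // reports that have been delivered so far
} report_queue;

// Start delimiter seen: begin holding reports back.
static void queue_begin(report_queue *q) {
  q->queueing = 1;
  q->npending = 0;
}

// Record one counted line, either immediately or provisionally.
static void queue_report(report_queue *q, const char *kind, int start, int end) {
  if (!q->queueing) {
    q->delivered++;
    return;
  }
  assert(q->npending < MAX_PENDING);
  q->pending[q->npending].kind = kind;
  q->pending[q->npending].start = start;
  q->pending[q->npending].end = end;
  q->npending++;
}

// End delimiter seen: the provisional reports were right, so deliver them.
static void queue_commit(report_queue *q) {
  q->delivered += q->npending;
  q->npending = 0;
  q->queueing = 0;
}

// End delimiter never found: drop the provisional reports.
static void queue_rollback(report_queue *q) {
  q->npending = 0;
  q->queueing = 0;
}
```
 *
 * With this arrangement, an unterminated string cannot cause the rest of the
 * file to be counted as part of that string, because nothing is delivered
 * until the match is confirmed.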
 *
 * @subsection line_notes Notes
 *
 * You can be a bit sloppy with the line counting machine. For example, the only
 * C entities that can contain newlines are strings and comments, so
 * #INTERNAL_NL would only be necessary inside them. Other than those, anything
 * other than spaces is considered code, so do not waste your time defining
 * specific patterns for other entities.
 *
 * @subsection line_embedded Parsers with Embedded Languages
 *
 * Notation: [lang] is the parent language, [elang] is the embedded language.
 *
 * To write a parser with embedded languages (such as HTML with embedded CSS and
 * JavaScript), you should first \#include the parser(s) above your Ragel code.
 * The header file is "[elang]_parser.h".
 *
 * Next, after the inclusion of 'common.rl', add "#EMBED([elang])" on separate
 * lines for each embedded language. The build process looks for these special
 * comments to embed the language for you automatically.
 *
 * In your main action, you need to add another entity, #CHECK_BLANK_ENTRY. It
 * should call the #check_blank_entry([lang]_LANG) macro. A blank entry is an
 * entry into an embedded language where the rest of the line, up to the
 * newline, is blank. For example, a CSS entry in HTML is something like:
 *
 * @code
 *   <style type="text/css">
 * @endcode
 *
 * If there is no CSS code after the entry (a blank entry), the line should be
 * counted as HTML code, and the #check_blank_entry macro handles this. But you
 * may be asking, "How do I get to the CHECK_BLANK_ENTRY entity?" This will be
 * discussed in just a bit.
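 *
 * The test for a blank entry can be sketched as a small C function. This only
 * illustrates the definition given above and is not the #check_blank_entry
 * macro itself; is_blank_entry() and its parameters are hypothetical.
 *
```c
#include <assert.h>
#include <stddef.h>

// Returns 1 if only spaces, tabs, or carriage returns lie between offset
// 'entry_end' (just past the entry pattern) and the next newline or the end
// of the buffer, and 0 otherwise.
static int is_blank_entry(const char *buf, size_t len, size_t entry_end) {
  size_t i = entry_end;
  assert(buf != NULL && entry_end <= len);
  while (i < len && (buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\r'))
    i++;
  return i == len || buf[i] == '\n';
}
```
 *
 * For an entry such as "<style>" followed only by spaces and a newline, the
 * function returns 1, so the line would be counted as parent language code.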
 *
 * The #emb_newline and #emb_internal_newline macros should be used instead of
 * the #std_newline and #std_internal_newline macros.
 *
 * For each embedded language you will have to define an entry and outry. An
 * entry is the pattern that transitions from the parent language into the child
 * language. An outry is the pattern from child to parent. You will need to put
 * your entries in your [lang]_line machine. You will also need to re-create
 * each embedded language's line machine (define as [lang]_[elang]_line; e.g.
 * html_css_line) and put outry patterns in those. Entries typically would be
 * defined as [lang]_[elang]_entry, and outries as [lang]_[elang]_outry.
 *
 * Note: An outry should have a \@check_blank_outry action so the line is not
 * mistakenly counted as a line of embedded language code if it is actually a
 * line of parent code.
 *
 * @subsection line_entry_action Entry Pattern Actions
 *
 * @include parser_doc_6
 *
 * This checks for a blank entry and, if it finds one, counts the line as a
 * line of parent language code. If the entry is not blank, the macro does
 * nothing. The machine then transitions into the child language.
 *
 * @subsection line_outry_action Outry Pattern Actions
 *
 * @include parser_doc_7
 *
 * This sets the current Ragel parser position to the beginning of the outry so
 * the line is counted as a line of parent language code if no child code is on
 * the same line. The machine then transitions into the parent language.
 *
 * @section entity Entity Identifying Machine
 *
 * This machine does not have to be written as a line-by-line parser. It only
 * has to identify the positions of language entities, such as whitespace,
 * comments, strings, etc. in sequence. As a result, it can be written much
 * faster and more easily, with less thought, than a line counter. Using a
 * scanner is most efficient. The callback function will be called for each
 * entity parsed.
 *
 * The \@ls, \@code, \@comment, \@enqueue, and \@commit actions are completely
 * unnecessary.
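 *
 * As an illustration of reporting entity positions in sequence, here is a
 * hand-written C stand-in for an entity machine. It is not Ragel and not
 * ohcount code: it only knows "space" and "any" entities, and the
 * scan_entities() function and the callback signature (language, entity,
 * start, end, userdata) are assumptions for this sketch.
 *
```c
#include <assert.h>
#include <stddef.h>

// Assumed callback shape: language, entity name, start, end, userdata.
typedef void (*entity_callback)(const char *lang, const char *entity,
                                int start, int end, void *userdata);

static int is_space_char(char c) {
  return c == ' ' || c == '\t' || c == '\n';
}

// Splits 'buf' into alternating runs of "space" and "any" and reports the
// [start, end) offsets of each run in order. Returns the number of runs.
static int scan_entities(const char *buf, int len, const char *lang,
                         entity_callback callback, void *userdata) {
  int count = 0;
  int ts = 0;                              // start of the current entity
  assert(buf != NULL && len >= 0);
  while (ts < len) {
    int space = is_space_char(buf[ts]);
    int te = ts + 1;                       // end of the current entity
    while (te < len && is_space_char(buf[te]) == space)
      te++;
    if (callback)
      callback(lang, space ? "space" : "any", ts, te, userdata);
    count++;
    ts = te;
  }
  return count;
}
```
 *
 * No line state is tracked anywhere in this loop, which is why an entity
 * machine is simpler to write than a line counter.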
 *
 * @subsection entity_scanner Scanner Structure
 *
 * @include parser_doc_8
 *
 * @subsection entity_action Main Action Structure
 *
 * @include parser_doc_9
 *
 * @subsection entity_embedded Parsers for Embedded Languages
 *
 * @section tests Including Written Tests for Parsers
 *
 * You should have two kinds of tests for parsers. One is a header file that
 * goes in the 'test/unit/parsers/' directory. The other is a pair of files: an
 * input source file that goes in the 'test/src_dir/' directory and an expected
 * output file that goes in the 'test/expected_dir/' directory.
 *
 * The header file will need to be "#include"ed in 'test/unit/test_parsers.h'.
 * Then add the "all_[lang]_tests()" function to the "all_parser_tests()"
 * function.
 *
 * Recompile the tests for the changes to take effect.
 *
 * The other files added to the 'test/{src,expected}_dir/' directories will be
 * automatically detected and run with the test suite.
/**
 * Tries to use an existing Ragel parser for the given language.
 * @param sourcefile A SourceFile created by ohcount_sourcefile_new().
 * @param count An integer flag indicating whether to count lines or parse
 *   entities.
 * @param callback A callback to use for every line or entity in the source
 *   file discovered (depends on count).
 * @param userdata Pointer to userdata used by callback (if any).
 * @return 1 if a Ragel parser is found, 0 otherwise.
 */
int ohcount_parse(SourceFile *sourcefile, int count,
                  void (*callback) (const char *, const char *, int, int,