PARSER_DOC written by Mitchell Foral

Overview:
  I will assume the reader has a decent knowledge of how Ragel works and the
  Ragel syntax. If not, please review the Ragel manual found at:
    http://research.cs.queensu.ca/~thurston/ragel/

  All parsers must at least:
    * Call a callback function when a line of code is parsed.
    * Call a callback function when a line of comment is parsed.
    * Call a callback function when a blank line is parsed.
  Additionally, a parser can call the callback function with the position of
  each entity parsed.
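
  As an illustration of the callback side of this contract, here is a minimal
  sketch of a callback that tallies line kinds. It uses the callback signature
  and the 'lcode', 'lcomment', and 'lblank' strings described for parse_c
  later in this document. The struct line_tally type, the tally variable, and
  the count_lines name are hypothetical and not part of ohcount.

```c
#include <string.h>

/* Hypothetical tally of line kinds; not part of ohcount itself. */
struct line_tally {
  int code;
  int comment;
  int blank;
};

/* A single global tally, because the callback signature used by the
 * parsers carries no user-data argument. */
static struct line_tally tally;

/* Has the same signature as the callback parameter of parse_c(). */
static void count_lines(const char *lang, const char *entity,
                        int start, int end) {
  (void)lang; (void)start; (void)end;  /* unused in this sketch */
  if (strcmp(entity, "lcode") == 0) tally.code++;
  else if (strcmp(entity, "lcomment") == 0) tally.comment++;
  else if (strcmp(entity, "lblank") == 0) tally.blank++;
}
```

  Passing count_lines as the callback with the count flag set would leave the
  totals in tally once parsing finishes.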

  Take a look at c.rl and keep it open for reference while reading this
  document to better understand how parsers work and how to write one.

Writing a Parser:
  First create your parser in ext/ohcount_native/ragel_parsers/. Its name
  should be the language you are parsing with a '.rl' extension. You will not
  have to manually compile any parsers, as the Rakefile does this automatically
  for you. Every parser must have the following at the top:

/************************* Required for every parser *************************/
#ifndef RAGEL_C_PARSER
#define RAGEL_C_PARSER

#include "ragel_parser_macros.h"

// the name of the language
const char *C_LANG = "c";

// the language's entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/*****************************************************************************/

  And the following at the bottom:

/************************* Required for every parser *************************/

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param count Integer flag specifying whether or not to count lines. If yes,
 *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
 *   machine optimized for returning entity positions.
 * @param *callback Callback function. If count is set, callback is called for
 *   every line of code, comment, or blank with 'lcode', 'lcomment', and
 *   'lblank' respectively. Otherwise callback is called for each entity found.
 */
void parse_c(char *buffer, int length, int count,
  void (*callback) (const char *lang, const char *entity, int start, int end)
  ) {
  init

  %% write init;
  cs = (count) ? c_en_c_line : c_en_c_entity;
  %% write exec;

  // if no newline at EOF; callback contents of last line
  if (count) { process_last_line(C_LANG) }
}

#endif

/*****************************************************************************/

  (Your parser will go between these two blocks.)

  The code can be found in the existing c.rl parser. You will need to change:
    * RAGEL_[lang]_PARSER - Replace [lang] with your language name. So if you
      are writing a C parser, it would be RAGEL_C_PARSER.
    * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
      the name of your language to parse. [lang] is your language name. For C it
      would be C_LANG.
    * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
      c_entities). The value is an array of the entity names your language has.
      For example, C has comment, string, number, etc. entities. You should
      definitely have "space" and "any" entities. "any" entities are typically
      used for entity machines (discussed later) and match any character that
      is not recognized so the parser does not do something unpredictable.
    * enum - Change the values of the enum to correspond with your entities, in
      the same order, so that looking up [lang]_entities[ENTITY] in your parser
      gives the associated entity's string name.
    * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
      is the name of your language. In the case of C, it is parse_c.
    * [lang]_en_[lang]_line - The line counting machine.
    * [lang]_en_[lang]_entity - The entity machine.

    You may be asking why you have to rename variables and functions. Parsers
    are header files, and if variables in different header files have the same
    name, the compiler complains. Also, when you have languages embedded inside
    each other, any identifiers with the same name can easily be mixed up. It is
    also important to prefix your Ragel definitions with your language to avoid
    conflicts with other parsers.
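
    To illustrate how the enum and the entities array in the list above work
    together, here is a small sketch using the C parser's definitions. The
    c_entity_name helper and its bounds check are hypothetical additions for
    illustration; parsers index the array directly.

```c
#include <stddef.h>

/* Entity names and their matching constants, listed in the same order. */
static const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/* Hypothetical helper: returns the entity's name, or NULL when the
 * constant does not index into the array. */
static const char *c_entity_name(int entity) {
  size_t n = sizeof(c_entities) / sizeof(c_entities[0]);
  if (entity < 0 || (size_t)entity >= n) return NULL;
  return c_entities[entity];
}
```

    If the two lists fall out of order, every lookup after the mismatch
    silently returns the wrong name, which is why they must be edited together.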

  Additional variables available to parsers are in the "ragel_parser_macros.h"
  file. Take a look at it and try to understand what the variables are used for.
  They will make more sense later on.

  Now you can define your Ragel parser. Name your machine after your language,
  add a 'write data' statement, and include 'common.rl', a file with common
  Ragel definitions, actions, etc. For example:
    %%{
      machine c;
      write data;
      include "common.rl";

      ...
    }%%

  Before you begin to write patterns for each entity in your language, you need
  to understand how the parser should work.

  Each parser has two machines: one optimized for counting lines of code,
  comments, and blanks; the other for identifying entity positions in the
  buffer.

  Line Counting Machine:
    This machine should be written as a line-by-line parser for multiple lines.
    This means you match any combination of entities except a newline up until
    you do reach a newline. If the line contains only spaces, or nothing at all,
    it is blank. If the line contains only a comment, optionally preceded by
    spaces, it is a comment line. If the line contains anything other than a
    comment after the leading spaces (if there are any), it is a line of code.
    You will do this using a Ragel scanner.
    The callback function will be called for each line parsed.
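
    The classification rules above can be sketched in plain C, outside of
    Ragel. This is only an illustration: the enum line_kind type and the
    classify_line function are hypothetical, and the sketch assumes "//" is
    the only comment syntax, which no real parser would.

```c
#include <ctype.h>

enum line_kind { LINE_BLANK, LINE_COMMENT, LINE_CODE };

/* Classifies one line given without its trailing newline. Assumes the
 * only comment syntax is "//", a simplification for illustration. */
static enum line_kind classify_line(const char *line) {
  /* skip leading spaces */
  while (*line && isspace((unsigned char)*line)) line++;
  if (*line == '\0') return LINE_BLANK;        /* only spaces, or nothing */
  if (line[0] == '/' && line[1] == '/') return LINE_COMMENT;
  return LINE_CODE;                            /* anything else is code */
}
```

    Note that a line with code followed by a trailing comment is classified
    as code, because the first non-space entity decides.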

    Scanner Parser Structure:
      A scanner parser will look like this:
        [lang]_line := |*
          entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
          entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
          ...
          entityn ${ entity = ENTITYN; } => [lang]_ccallback;
        *|;
      (As usual, replace [lang] with your language name.)
      Each entity is the pattern for an entity to match, the last one typically
      being the newline entity. For each match, the entity variable is set to a
      constant defined in the enum, and the main action is called (you will need
      to create this action above the scanner).

      As soon as you can tell whether a line is code or comment, you should call
      the appropriate 'code' or 'comment' action defined in common.rl. It is not
      necessary to worry about whether or not these actions are called more than
      once for a given line; the first call to either sets the status of the
      line permanently. Sometimes you cannot call 'code' or 'comment' for one
      reason or another. Do not worry, as this is discussed later.

      When you reach a newline, you will need to decide whether the current line
      is a line of code, comment, or blank. This is easy. Simply check if the
      line_contains_code or whole_line_comment variables are set to 1. If
      neither is, the line is blank. Then call the callback function
      (not action) with an "lcode", "lcomment", or "lblank" string, and the
      start and end positions of that line (including the newline). The start
      position of the line is in the line_start variable. It should be set at
      the beginning of every line either through the 'code' or 'comment'
      actions, or manually in the main action. Finally the line_contains_code,
      whole_line_comment, and line_start state variables must be reset. All this
      should be done within the main action shown below.
      Note: For most parsers, the std_newline(lang) macro is sufficient and does
      everything in the main action mentioned above. The lang parameter is the
      [lang]_LANG string.
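
      The newline steps above can be sketched as a plain C function. This is
      not the std_newline macro itself, only an illustration of the steps it
      is described as performing. The variables below are stand-ins for the
      parser state of the same names; using -1 to mean "line_start not set",
      using integer offsets, and the on_newline and record_line names are all
      assumptions made for this sketch.

```c
#include <string.h>

/* Stand-ins for the state variables described above. */
static int line_contains_code = 0;
static int whole_line_comment = 0;
static int line_start = -1;  /* -1 means "not set yet" in this sketch */

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Reports the line that ends with the newline token [ts, te), then
 * resets the per-line state for the next line. */
static void on_newline(const char *lang, int ts, int te,
                       line_callback callback) {
  /* a line with no entities before the newline starts at the newline */
  int start = (line_start >= 0) ? line_start : ts;
  const char *kind = "lblank";
  if (line_contains_code) kind = "lcode";
  else if (whole_line_comment) kind = "lcomment";
  callback(lang, kind, start, te);  /* te includes the newline */
  line_contains_code = 0;
  whole_line_comment = 0;
  line_start = -1;
}

/* Records the most recent report; used only to demonstrate the sketch. */
static char last_kind[16];
static int last_start, last_end;
static void record_line(const char *lang, const char *entity,
                        int start, int end) {
  (void)lang;
  strncpy(last_kind, entity, sizeof(last_kind) - 1);
  last_kind[sizeof(last_kind) - 1] = '\0';
  last_start = start;
  last_end = end;
}
```

      Checking line_contains_code first is a choice made for the sketch. Since
      the first 'code' or 'comment' call fixes the line's status, the two
      flags are not expected to be set together.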

    Main Action Structure:
      The main action looks like this:
        action [lang]_ccallback {
          switch(entity) {
          case ENTITY1:
            ...
            break;
          case ENTITY2:
            ...
            break;
          ...
          case ENTITYN:
            ...
            break;
          }
        }

    Defining Patterns for Entities:
      Now it is time to write patterns for each entity in your language. That
      does not seem very hard, except when your entity can cover multiple lines.
      Comments and strings in particular can do this. To make an accurate line
      counter, you will need to count the lines covered by multi-line entities.
      When you detect a newline inside your multi-line entity, you should set
      the entity variable to be INTERNAL_NL (-2) and call the main action. The
      main action should have a case for INTERNAL_NL separate from the newline
      entity. In it, you will check if the current line is code or comment and
      call the callback function with the appropriate string ("lcode" or
      "lcomment") and beginning and end of the line (including the newline).
      Afterwards, you will reset the line_contains_code and whole_line_comment
      state variables. Then set the line_start variable to be p, the current
      Ragel buffer position. Because line_contains_code and whole_line_comment
      have been reset, any non-newline and non-space character in the multi-line
      pattern should set line_contains_code or whole_line_comment back to 1.
      Otherwise you would count the line as blank.
      Note: For most parsers, the std_internal_newline(lang) macro is sufficient
      and does everything in the main action mentioned above. The lang parameter
      is the [lang]_LANG string.
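
      To make the internal newline bookkeeping concrete, here is a sketch in
      plain C that walks a C-style block comment and tallies the lines it
      covers. In a real parser this happens inside the Ragel pattern through
      INTERNAL_NL and the main action; the scan_block_comment function and its
      output parameters are hypothetical, and counting a flagless line inside
      the comment as blank follows the last sentence of the paragraph above.

```c
#include <ctype.h>

/* Walks a block comment that starts at buf[0] and tallies the lines that
 * end inside it. The line holding the end delimiter is left open, since
 * the regular newline handling reports it. */
static void scan_block_comment(const char *buf, int len,
                               int *comment_lines, int *blank_lines) {
  int whole_line_comment = 0;  /* reset after every internal newline */
  int i;
  *comment_lines = 0;
  *blank_lines = 0;
  for (i = 0; i < len; i++) {
    if (buf[i] == '\n') {
      /* internal newline: report the finished line, then reset */
      if (whole_line_comment) (*comment_lines)++;
      else (*blank_lines)++;
      whole_line_comment = 0;
    } else if (!isspace((unsigned char)buf[i])) {
      /* any non-space character marks the line as comment again */
      whole_line_comment = 1;
    }
    /* stop at the end delimiter; i >= 3 rules out the opening slash-star */
    if (i >= 3 && buf[i - 1] == '*' && buf[i] == '/') break;
  }
}
```

      For a comment spanning four lines whose second line is empty, this
      reports two comment lines and one blank line, and leaves the closing
      line for the newline that follows it.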

      For multi-line matches, it is important to call the 'code' or 'comment'
      actions (mentioned earlier) before an internal newline is detected so the
      line_contains_code and whole_line_comment variables are properly set. For
      other entities, you can use the 'code' macro inside the main action, which
      executes the same code as the Ragel 'code' action. Other C macros are
      'comment' and 'ls'; the latter is typically used for the SPACE entity when
      defining line_start.

      Also for multi-line matches, it may be necessary to use the 'enqueue' and
      'commit' actions. If it is possible that a multi-line entity will not have
      an ending delimiter (for example a string), use the 'enqueue' action as
      soon as the start delimiter has been detected, and the 'commit' action as
      soon as the end delimiter has been detected. This will eliminate the
      potential for any counting errors.
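
      This document does not describe how 'enqueue' and 'commit' work
      internally. One plausible reading, offered purely as an assumption, is
      that line reports made between the two are held back until the end
      delimiter confirms the entity. The sketch below shows that idea in plain
      C. Every name in it (sketch_enqueue, sketch_report, sketch_commit,
      sketch_discard, the pending array, MAX_PENDING) is hypothetical and does
      not reflect the actual common.rl actions.

```c
#include <stddef.h>

#define MAX_PENDING 64

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* One deferred callback invocation. */
struct pending_line {
  const char *entity;
  int start;
  int end;
};

static struct pending_line pending[MAX_PENDING];
static int pending_count = 0;
static int inqueue = 0;  /* 1 while inside a possibly unterminated entity */

/* Start delimiter seen: begin deferring reports. */
static void sketch_enqueue(void) {
  inqueue = 1;
  pending_count = 0;
}

/* Reports a line now, or defers it while queueing.
 * Returns 0 on success, -1 if the queue is full. */
static int sketch_report(const char *lang, const char *entity,
                         int start, int end, line_callback callback) {
  if (!inqueue) {
    callback(lang, entity, start, end);
    return 0;
  }
  if (pending_count == MAX_PENDING) return -1;
  pending[pending_count].entity = entity;
  pending[pending_count].start = start;
  pending[pending_count].end = end;
  pending_count++;
  return 0;
}

/* End delimiter seen: the deferred reports were valid, so deliver them. */
static void sketch_commit(const char *lang, line_callback callback) {
  int i;
  for (i = 0; i < pending_count; i++)
    callback(lang, pending[i].entity, pending[i].start, pending[i].end);
  pending_count = 0;
  inqueue = 0;
}

/* No end delimiter found: drop the deferred reports. */
static void sketch_discard(void) {
  pending_count = 0;
  inqueue = 0;
}

/* Counts delivered reports; used only to demonstrate the sketch. */
static int delivered = 0;
static void count_delivered(const char *lang, const char *entity,
                            int start, int end) {
  (void)lang; (void)entity; (void)start; (void)end;
  delivered++;
}
```

      The point of deferring is that an unterminated string would otherwise
      have already reported every following line as part of itself by the
      time the parser learns that no end delimiter exists.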

    Notes:
      * You can be a bit sloppy with the line counting machine. For example the
        only C entities that can contain newlines are strings and comments, so
        INTERNAL_NL would only be necessary inside them. Other than those,
        anything other than spaces is considered code, so do not waste your time
        defining specific patterns for other entities.

    Parsers with Embedded Languages:
      Notation: [lang] is the parent language, [elang] is the embedded language.

      To write a parser with embedded languages (such as HTML with embedded CSS
      and JavaScript), you should first #include the parser(s) above your Ragel
      code. The header file is "[elang]_parser.h".

      Next, after the inclusion of 'common.rl', add '#EMBED([elang])' on
      separate lines for each embedded language. The Rakefile looks for these
      special comments to embed the language for you automatically.

      In your main action, you need to add another entity, CHECK_BLANK_ENTRY. It
      should call the 'check_blank_entry([lang]_LANG)' macro. A blank entry is
      an entry into an embedded language where the rest of the line, up to the
      newline, is blank. For example, a CSS entry in HTML is something like:
        <style type="text/css">
      If there is no CSS code after the entry (a blank entry), the line should
      be counted as HTML code, and the 'check_blank_entry' macro handles this.
      But you may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?".
      This will be discussed in just a bit.
      Also use the emb_newline and emb_internal_newline macros instead of the
      std_newline and std_internal_newline macros.
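
      The test behind a blank entry can be sketched in plain C. This is not
      the check_blank_entry macro, only the condition it is described as
      checking; the rest_of_line_is_blank function is hypothetical.

```c
#include <ctype.h>

/* Returns 1 if only whitespace remains between p and the end of the
 * current line (or the end of the string), 0 otherwise. */
static int rest_of_line_is_blank(const char *p) {
  while (*p != '\0' && *p != '\n') {
    if (!isspace((unsigned char)*p)) return 0;
    p++;
  }
  return 1;
}
```

      Called with p pointing just past the <style type="text/css"> entry, a
      result of 1 means the line belongs to HTML, while 0 means CSS code
      follows on the same line.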

      For each embedded language you will have to define an entry and outry. An
      entry is the pattern that transitions from the parent language into the
      child language. An outry is the pattern from child to parent. You will
      need to put your entries in your [lang]_line machine. You will also need
      to re-create each embedded language's line machine (define as
      [lang]_[elang]_line; e.g. html_css_line) and put outry patterns in those.
      Entries typically would be defined as [lang]_[elang]_entry, and outries
      as [lang]_[elang]_outry.
      Note: An outry should have a 'check_blank_outry' action so the line is not
      mistakenly counted as a line of embedded language code if it is actually a
      line of parent code.

      Entry pattern actions should be:
        [lang]_[elang]_entry @{ entity = CHECK_BLANK_ENTRY; } @[lang]_ccallback
          @{ saw([lang]_LANG)} => { fcall [lang]_[elang]_line; };
      This checks for a blank entry, and if it is one, counts the line as a
      line of parent language code. If it is not, the macro will not do
      anything. The machine then transitions into the child language.

      Outry pattern actions should be:
        @{ p = ts; fret; };
      This sets the current Ragel parser position to the beginning of the outry
      so the line is counted as a line of parent language code if no child code
      is on the same line. The machine then transitions into the parent
      language.

  Entity Identifying Machine:
    This machine does not have to be written as a line-by-line parser. It only
    has to identify the positions of language entities, such as whitespace,
    comments, strings, etc. in sequence. As a result, it can be written faster
    and more easily than a line counter. Using a scanner is most efficient.
    The callback function will be called for each entity parsed.

    Scanner Structure:
      [lang]_entity := |*
        entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
        entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
        ...
        entityn ${ entity = ENTITYN; } => [lang]_ecallback;
      *|;

    Main Action Structure:
      action [lang]_ecallback {
        callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
      }
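
    On the receiving side, a callback can use the two positions to recover
    the entity's text. The sketch below assumes that cint() yields integer
    offsets from the start of the parsed buffer and that the end position is
    exclusive, as Ragel's te points just past the match. The entity_text
    helper is hypothetical.

```c
#include <string.h>

/* Copies buffer[start, end) into out, NUL-terminated and truncated to
 * fit out_size. Returns the number of characters copied. */
static int entity_text(const char *buffer, int start, int end,
                       char *out, int out_size) {
  int n = end - start;
  if (out_size <= 0) return 0;
  if (n < 0) n = 0;
  if (n > out_size - 1) n = out_size - 1;
  memcpy(out, buffer + start, (size_t)n);
  out[n] = '\0';
  return n;
}
```

    An entity callback could call this with the buffer it handed to the
    parser to print or store each keyword, string, or comment as it is
    reported.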

    Note: the 'ls', 'code', 'comment', 'enqueue', and 'commit' actions are
    unnecessary in this machine.

    Parsers for Embedded Languages:
      TODO:
 310     Parsers for Embedded Languages:
 311       TODO: