PARSER_DOC written by Mitchell Foral

Overview:
  I will assume the reader has a decent knowledge of how Ragel works and the
  Ragel syntax. If not, please review the Ragel manual found at:
    http://research.cs.queensu.ca/~thurston/ragel/

  All parsers must at least:
    * Call a callback function when a line of code is parsed.
    * Call a callback function when a line of comment is parsed.
    * Call a callback function when a blank line is parsed.
  Additionally, a parser can call the callback function with the position of
  each entity parsed.

  Take a look at c.rl and keep it open for reference when reading this
  document to better understand how parsers work and how to write one.

Writing a Parser:
  First create your parser in ext/ohcount_native/ragel_parsers/. Its name
  should be the language you are parsing with a '.rl' extension. Every parser
  must have the following at the top:

/************************* Required for every parser *************************/
#include "ragel_parser_macros.h"

// the name of the language
const char *C_LANG = "c";

// the language's entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

// do not change the following variables

// used for newlines
#define NEWLINE -1

// used for newlines inside patterns like strings and comments that can have
// newlines in them
#define INTERNAL_NL -2

// required by Ragel
int cs, act;
char *p, *pe, *eof, *ts, *te;

// used for calculating offsets from buffer start for start and end positions
char *buffer_start;
#define cint(c) ((int) (c - buffer_start))

// state flags for line and comment counting
int whole_line_comment;
int line_contains_code;

// the beginning of a line in the buffer for line and comment counting
char *line_start;

// state variable for the current entity being matched
int entity;

/*****************************************************************************/

  And the following at the bottom:

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param count Integer flag specifying whether or not to count lines. If yes,
 *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
 *   machine optimized for returning entity positions.
 * @param *callback Callback function. If count is set, callback is called for
 *   every line of code, comment, or blank with 'lcode', 'lcomment', and
 *   'lblank' respectively. Otherwise callback is called for each entity found.
 */
void parse_c(char *buffer, int length, int count,
  void (*callback) (const char *lang, const char *entity, int start, int end)
  ) {
  p = buffer;
  pe = buffer + length;
  eof = pe;

  buffer_start = buffer;
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = 0;
  entity = 0;

  %% write init;
  cs = (count) ? c_en_c_line : c_en_c_entity;
  %% write exec;

  // if no newline at EOF; callback contents of last line
  if (count) { process_last_line(C_LANG) }
}

  (Your parser will go between these two blocks.)
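
  To make the calling side concrete, here is a minimal sketch of a callback
  with the signature parse_c expects, tallying the "lcode", "lcomment", and
  "lblank" reports made in counting mode. The names line_totals, totals, and
  count_lines_callback are hypothetical and are not part of ohcount.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tally of the three line kinds reported in counting mode. */
struct line_totals {
  int code;
  int comment;
  int blank;
};

static struct line_totals totals;

/* Matches the callback signature taken by parse_c. In counting mode the
 * 'entity' argument is "lcode", "lcomment", or "lblank", and start/end are
 * offsets of the line from the start of the buffer. */
static void count_lines_callback(const char *lang, const char *entity,
                                 int start, int end) {
  assert(lang != NULL && entity != NULL);
  assert(start <= end);

  if (strcmp(entity, "lcode") == 0) totals.code++;
  else if (strcmp(entity, "lcomment") == 0) totals.comment++;
  else if (strcmp(entity, "lblank") == 0) totals.blank++;
  /* Any other string is ignored in this sketch. */
}
```

  With a generated parser linked in, this callback would presumably be passed
  as parse_c(buffer, length, 1, count_lines_callback).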

  The code can be found in the existing c.rl parser. You will need to change:
    * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
      the name of your language to parse. [lang] is your language name. So if
      you are writing a C parser, it would be C_LANG.
    * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
      c_entities). The value is an array of string entities your language has.
      For example, C has comment, string, number, etc. entities. You should
      definitely have "space" and "any" entities. "any" entities are typically
      used for entity machines (discussed later) and match any character that
      is not recognized so the parser does not do something unpredictable.
    * enum - Change the value of the enum to correspond with your entities. So
      if in your parser you look up [lang]_entities[ENTITY], you will get the
      associated entity's string name.
    * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
      is the name of your language. In the case of C, it is parse_c.
    * [lang]_en_[lang]_line - The line counting machine.
    * [lang]_en_[lang]_entity - The entity machine.

    You may be asking why you have to rename variables and functions. Well, if
    variables have the same name in header files (which is what parsers are),
    the compiler complains. Also, when you have languages embedded inside each
    other, any identifiers with the same name can easily be mixed up. It is also
    important to prefix your Ragel definitions with your language to avoid
    conflicts with other parsers.

  Try to understand what the main variables are used for. They will make more
  sense later on.

  Now you can define your Ragel parser. Name your machine after your language,
  add 'write data', and include 'common.rl', a file with common Ragel
  definitions, actions, etc. For example:
    %%{
      machine c;
      write data;
      include "common.rl";

      ...
    }%%

  Before you begin to write patterns for each entity in your language, you need
  to understand how the parser should work.

  Each parser has two machines: one optimized for counting lines of code,
  comments, and blanks; the other for identifying entity positions in the
  buffer.

  Line Counting Machine:
    This machine should be written as a line-by-line parser for multiple lines.
    This means you match any combination of entities except a newline up until
    you do reach a newline. If the line contains only spaces, or nothing at all,
    it is blank. If the line contains spaces at first, but then a comment, or
    just simply a comment, the line is a comment. If the line contains anything
    but a comment after spaces (if there are any), it is a line of code. You
    will do this using a Ragel scanner.
    The callback function will be called for each line parsed.

    Scanner Parser Structure:
      A scanner parser will look like this:
        [lang]_line := |*
          entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
          entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
          ...
          entityn ${ entity = ENTITYN; } => [lang]_ccallback;
        *|;
      (As usual, replace [lang] with your language name.)
      Each entity is the pattern for an entity to match, the last one typically
      being the newline entity. For each match, the entity variable is set to a
      constant defined in the enum, and the main action is called (you will need
      to create this action above the scanner).

      Once you can tell whether a line is code or comment, call the appropriate
      'code' or 'comment' action defined in common.rl as soon as possible. It is
      not necessary to worry about whether or not these actions are called more
      than once for a given line; the first call to either sets the status of
      the line permanently. Sometimes you cannot call 'code' or 'comment' for
      one reason or another. Do not worry, as this is discussed later.

      When you reach a newline, you will need to decide whether the current line
      is a line of code, comment, or blank. This is easy. Simply check if the
      line_contains_code or whole_line_comment variables are set to 1. If
      neither is, the line is blank. Then call the callback function (not
      action) with an "lcode", "lcomment", or "lblank" string, and the start and
      end positions of that line (including the newline). The start position of
      the line is in the line_start variable. It should be set at the beginning
      of every line either through the 'code' or 'comment' actions, or manually
      in the main action. Finally, the line_contains_code, whole_line_comment,
      and line_start state variables must be reset. All this should be done
      within the main action shown below.
      Note: For most parsers, the std_newline(lang) macro is sufficient and does
      everything in the main action mentioned above. The lang parameter is the
      [lang]_LANG string.
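
      The newline handling described above can be sketched in plain C as
      follows. This is not the std_newline macro itself, only an illustration
      of the same steps under stated assumptions: the struct line_state, the
      enum line_kind, and report_line are hypothetical names, code is assumed
      to take precedence if both flags are set, and a blank line is assumed to
      fall back to the start of the newline match when line_start was never set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bundle of the line-counting state variables described above. */
struct line_state {
  int whole_line_comment;
  int line_contains_code;
  const char *line_start;
};

enum line_kind { LINE_BLANK = 0, LINE_COMMENT, LINE_CODE };

/* Strings passed to the callback, indexed by enum line_kind. */
static const char *line_kind_names[] = { "lblank", "lcomment", "lcode" };

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Classifies the line that ends with the newline matched in [ts, te),
 * reports it through callback (if one is given), and resets the state. */
static enum line_kind report_line(struct line_state *st, const char *lang,
                                  const char *buffer_start,
                                  const char *ts, const char *te,
                                  line_callback callback) {
  enum line_kind kind = LINE_BLANK;
  const char *start;

  assert(st != NULL && buffer_start != NULL);
  assert(buffer_start <= ts && ts <= te);

  /* Assumption: code wins if both flags happen to be set. */
  if (st->line_contains_code) kind = LINE_CODE;
  else if (st->whole_line_comment) kind = LINE_COMMENT;

  /* Assumption: a blank line may not have set line_start, so fall back to
   * the start of the newline match. */
  start = st->line_start ? st->line_start : ts;

  if (callback)
    callback(lang, line_kind_names[kind],
             (int) (start - buffer_start), (int) (te - buffer_start));

  /* Reset the state for the next line. */
  st->whole_line_comment = 0;
  st->line_contains_code = 0;
  st->line_start = NULL;
  return kind;
}
```

      The end offset passed to the callback includes the newline, as the text
      above requires, because te points just past the newline match.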

    Main Action Structure:
      The main action looks like this:
        action [lang]_ccallback {
          switch(entity) {
          case ENTITY1:
            ...
            break;
          case ENTITY2:
            ...
            break;
          ...
          case ENTITYN:
            ...
            break;
          }
        }

    Defining Patterns for Entities:
      Now it is time to write patterns for each entity in your language. That
      does not seem very hard, except when your entity can cover multiple lines.
      Comments and strings in particular can do this. To make an accurate line
      counter, you will need to count the lines covered by multi-line entities.
      When you detect a newline inside your multi-line entity, you should set
      the entity variable to be INTERNAL_NL (-2) and call the main action. The
      main action should have a case for INTERNAL_NL separate from the newline
      entity. In it, you will check if the current line is code or comment and
      call the callback function with the appropriate string ("lcode" or
      "lcomment") and beginning and end of the line (including the newline).
      Afterwards, you will reset the line_contains_code and whole_line_comment
      state variables. Then set the line_start variable to be p, the current
      Ragel buffer position. Because line_contains_code and whole_line_comment
      have been reset, any non-newline and non-space character in the multi-line
      pattern should set line_contains_code or whole_line_comment back to 1.
      Otherwise you would count the line as blank.
      Note: For most parsers, the std_internal_newline(lang) macro is sufficient
      and does everything in the main action mentioned above. The lang parameter
      is the [lang]_LANG string.
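
      The reset-and-set-again behaviour around internal newlines can be
      illustrated with a small hand-written loop. This is only a sketch and not
      what std_internal_newline does internally: it covers a multi-line comment
      only, omits line_start and the callback for brevity, and the names
      comment_scan and scan_comment are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct comment_scan {
  int comment_lines;   /* lines ended by an internal newline, with comment text */
  int blank_lines;     /* lines ended by an internal newline, spaces only */
  int pending_comment; /* flag left set for the line that closes the comment */
};

/* Walks the text of one multi-line comment. At every internal newline the
 * current line is counted and whole_line_comment is reset; any later
 * non-space character sets it back to 1 so the next line is not counted as
 * blank. */
static struct comment_scan scan_comment(const char *text, int length) {
  struct comment_scan result = { 0, 0, 0 };
  int whole_line_comment = 0;
  const char *p;
  const char *pe;

  assert(text != NULL && length >= 0);
  pe = text + length;

  for (p = text; p < pe; p++) {
    if (*p == '\n') {
      /* Internal newline: count the line that just ended, then reset. */
      if (whole_line_comment) result.comment_lines++;
      else result.blank_lines++;
      whole_line_comment = 0;
    } else if (!isspace((unsigned char) *p)) {
      /* Non-space, non-newline character inside the comment. */
      whole_line_comment = 1;
    }
  }

  /* The last line has no internal newline; in a real parser the regular
   * newline case would report it later. */
  result.pending_comment = whole_line_comment;
  return result;
}
```

      For the comment "/* first\n\n   second\n*/" this counts two comment lines
      and one blank line, and leaves the flag set for the closing line.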

      For multi-line matches, it is important to call the 'code' or 'comment'
      actions (mentioned earlier) before an internal newline is detected so the
      line_contains_code and whole_line_comment variables are properly set. For
      other entities, you can use the 'code' macro inside the main action, which
      executes the same code as the Ragel 'code' action. Other C macros are
      'comment' and 'ls'; the latter is typically used for the SPACE entity when
      defining line_start.

    Notes:
      * You can be a bit sloppy with the line counting machine. For example, the
        only C entities that can contain newlines are strings and comments, so
        INTERNAL_NL would only be necessary inside them. Other than those,
        anything other than spaces is considered code, so do not waste your time
        defining specific patterns for other entities.

  Entity Identifying Machine:
    This machine does not have to be written as a line-by-line parser. It only
    has to identify the positions of language entities, such as whitespace,
    comments, strings, etc. in sequence. As a result, it can be written much
    more quickly and easily than a line counter. Using a scanner is most
    efficient.
    The callback function will be called for each entity parsed.

    Scanner Structure:
      [lang]_entity := |*
        entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
        entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
        ...
        entityn ${ entity = ENTITYN; } => [lang]_ecallback;
      *|;

    Main Action Structure:
      action [lang]_ecallback {
        callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
      }
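
    As a rough illustration of what the entity callback receives, the sketch
    below turns an entity constant and the Ragel ts/te pointers into a name and
    a pair of buffer offsets, in the same way the cint macro computes offsets
    from buffer_start. The "foo" language, foo_entities, entity_span, and
    make_span are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entity table and constants for an imaginary language "foo". */
static const char *foo_entities[] = { "space", "comment", "any" };
enum { FOO_SPACE = 0, FOO_COMMENT, FOO_ANY };

/* What an entity callback would be handed: a name plus two offsets. */
struct entity_span {
  const char *name;
  int start; /* offset of the first character of the match */
  int end;   /* offset just past the last character of the match */
};

/* Looks up the entity's string name and converts the token pointers into
 * offsets from the start of the buffer. */
static struct entity_span make_span(const char *buffer_start, int entity,
                                    const char *ts, const char *te) {
  struct entity_span span;

  assert(buffer_start != NULL);
  assert(entity >= FOO_SPACE && entity <= FOO_ANY);
  assert(buffer_start <= ts && ts <= te);

  span.name = foo_entities[entity];
  span.start = (int) (ts - buffer_start);
  span.end = (int) (te - buffer_start);
  return span;
}
```

    Because Ragel's te points one character past the end of the match, the end
    offset is exclusive: a two-character run of spaces at the start of the
    buffer is reported as 0 to 2.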