Written by Mitchell Foral

Overview:
  I will assume the reader has a decent knowledge of how Ragel works and the
  Ragel syntax.
  All parsers must do 4 things:
    * Call back when a line of code is parsed.
    * Call back when a line of comment is parsed.
    * Call back when a blank line is parsed.
    * Call back for entities parsed.
  The first three are trickier than they may seem; the last is very easy.

  Take a look at c.rl and even keep it open for reference when reading this
  document to better understand how parsers work and how to write one.

Writing a Parser:
  First create your parser in ext/ohcount_native/ragel_parsers/. Its name
  should be the language you're parsing with a '.rl' extension. Every parser
  must have the following at the top:

/************************* Required for every parser *************************/

// the name of the language
const char *C_LANG = "c";

// the language's entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc", "keyword",
  "identifier", "operator", "escaped_newline", "newline"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC, C_KEYWORD,
  C_IDENTIFIER, C_OPERATOR, C_ESCAPED_NL, C_NEWLINE
};

// do not change the following variables

// used for newlines inside patterns like strings and comments that can have
// newlines in them
#define INTERNAL_NL -1

// required by Ragel
int cs, act;
char *p, *pe, *eof, *ts, *te;

// used for calculating offsets from buffer start for start and end positions
char *buffer_start;
#define cint(c) ((int) (c - buffer_start))

// state flags for line and comment counting
int whole_line_comment;
int line_contains_code;

// the beginning of a line in the buffer for line and comment counting
char *line_start;

// state variable for the current entity being matched
int entity;

/*****************************************************************************/

  And the following at the bottom:

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param *c_callback Callback function called for each entity. Entities are
 *   the ones defined in the lexer as well as 3 additional entities used by
 *   Ohcount for counting lines: lcode, lcomment, lblank.
 */
void parse_c(char *buffer, int length,
  void (*c_callback) (const char *lang, const char *entity, int start, int end)
  ) {
  p = buffer;
  pe = buffer + length;
  eof = pe;

  buffer_start = buffer;
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = 0;
  entity = 0;

  %% write init;
  %% write exec;

  // no newline at EOF; get contents of last line
  if ((whole_line_comment || line_contains_code) && c_callback) {
    if (line_contains_code)
      c_callback(C_LANG, "lcode", cint(line_start), cint(pe));
    else if (whole_line_comment)
      c_callback(C_LANG, "lcomment", cint(line_start), cint(pe));
  }
}

  (Your parser will go between these two blocks.)

  The code can be found in the existing c.rl parser. You'll need to change:
    * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
      the name of your language to parse. [lang] is your language name. So if
      you're writing a C parser, it would be C_LANG.
    * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
      c_entities). The value is an array of string entities your language has.
      For example, C has comment, string, number, etc. entities. You should
      definitely have "space" and "newline" entities. If your language has
      escaped newlines (or continuations), have an "escaped_newline" entity as
      well.
    * enum - Change the values of the enum to correspond with your entities, in
      the same order. That way, if in your parser you look up
      [lang]_entities[ENTITY], you'll get the associated entity's string name.
    * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
      is the name of your language. In the case of C, it is parse_c.
    * [lang]_callback - Set the name of the callback to be [lang]_callback
      (e.g. c_callback) and change all occurrences in the parse_[lang] function
      appropriately.

    You may be asking why you have to rename variables and functions. Well, when
    you have languages embedded inside others, any identifiers with the same
    name can easily be mixed up.

  Try to understand what the main variables are used for. They will make more
  sense later on.

  Now you can define your Ragel parser. Name your machine after your language,
  'write data', and include 'common.rl', a file with common Ragel definitions,
  actions, etc. For example:
    %%{
      machine c;
      write data;
      include "common.rl";

      ...
    }%%

  Understanding What You're Writing:
    Before you begin to write patterns for each entity in your language, you
    need to understand how the parser should work.

    You should write a parser as a line-by-line parser for multiple lines. This
    means you match any combination of entities except a newline until you
    reach a newline. If the line contains only spaces, or nothing at all, it is
    blank. If the line contains only a comment, optionally preceded by spaces,
    the line is a comment. If the line contains anything but a comment after
    spaces (if there are any), it is a line of code. You will do this using a
    Ragel scanner.
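
    As a rough illustration of the rule above (not how the Ragel scanner is
    implemented), the sketch below classifies a single line of a hypothetical
    language whose only comment syntax is '#' to the end of the line. The
    names line_kind and classify_line are made up for this example.

```c
/* Illustration only: hypothetical language with '#' line comments. */
enum line_kind { LINE_BLANK, LINE_COMMENT, LINE_CODE };

enum line_kind classify_line(const char *line) {
  /* skip leading spaces and tabs */
  while (*line == ' ' || *line == '\t')
    line++;
  if (*line == '\0' || *line == '\n')
    return LINE_BLANK;    /* only spaces, or nothing at all */
  if (*line == '#')
    return LINE_COMMENT;  /* first non-space text starts a comment */
  return LINE_CODE;       /* anything else makes it a line of code */
}
```

    A real parser cannot look at isolated lines like this, because comments
    and strings can span several lines. That is why the scanner keeps state
    variables between entities.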

  Scanner Parser Structure:
    A scanner parser will look like this:
      [lang]_line := |*
        entity1 ${ entity = ENTITY1; } => [lang]_callback;
        entity2 ${ entity = ENTITY2; } => [lang]_callback;
        ...
        entityn ${ entity = ENTITYN; } => [lang]_callback;
      *|;
    (As usual, replace [lang] with your language name.)
    Each entity is the pattern for an entity to match. For each match, the
    entity variable is set to a constant defined in the enum, and the main
    action is called (you will need to create this action above the scanner).

    As soon as you can tell whether a line is code or comment, you should call
    the appropriate 'code' or 'comment' action defined in common.rl. It is not
    necessary to worry about whether or not these actions are called more than
    once for a given line; the first call to either sets the status of the
    line permanently. Sometimes you cannot call 'code' or 'comment' for one
    reason or another. Do not worry, as this is discussed later.

    When you reach a newline, you will need to decide whether the current line
    is a line of code, comment, or blank. This is easy. Simply check if the
    line_contains_code or whole_line_comment variables are set to 1. If neither
    is, the line is blank. Then call the [lang]_callback function (not the
    action) with an "lcode", "lcomment", or "lblank" string, and the start and
    end positions of that line (including the newline). The start position of
    the line is in the line_start variable. It should be set at the beginning
    of every line either through the 'code' or 'comment' actions, or manually
    in the main action. Finally, the line_contains_code, whole_line_comment,
    and line_start state variables must be reset. All this is done in the main
    action shown below.
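
    The newline handling just described can be sketched in plain C as follows.
    This is only an approximation of what the newline case of the main action
    might do, not the code from c.rl. The finish_line helper, the record
    callback, and the last_* variables are hypothetical; the state variables
    use the names from the required header, and ts and te stand for the start
    and end of the matched newline.

```c
#include <stddef.h>

/* State variables named as in the required parser header. */
const char *buffer_start;
const char *line_start;
int whole_line_comment;
int line_contains_code;
#define cint(c) ((int) ((c) - buffer_start))

/* Hypothetical callback that remembers the last line reported. */
const char *last_entity;
int last_start, last_end;

void record(const char *lang, const char *entity, int start, int end) {
  (void) lang;
  last_entity = entity;
  last_start = start;
  last_end = end;
}

/* Hypothetical helper for the newline case. te is one past the newline, so
 * the reported range includes the newline. */
void finish_line(const char *ts, const char *te,
  void (*callback) (const char *lang, const char *entity, int start, int end)
  ) {
  /* if nothing set line_start yet, fall back to the start of the newline */
  if (!line_start)
    line_start = ts;
  if (callback) {
    if (line_contains_code)
      callback("c", "lcode", cint(line_start), cint(te));
    else if (whole_line_comment)
      callback("c", "lcomment", cint(line_start), cint(te));
    else
      callback("c", "lblank", cint(line_start), cint(te));
  }
  /* reset the state for the next line */
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = NULL;
}
```

    Inside a real parser this logic would live in the main action rather than
    in a separate function, and it would use [lang]_LANG instead of the
    literal "c".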

  Main Action Structure:
    The main action looks like this:
      action [lang]_callback {
        switch(entity) {
        case ENTITY1:
          ...
          break;
        case ENTITY2:
          ...
          break;
        ...
        case ENTITYN:
          ...
          break;
        }
        if([lang]_callback && entity != INTERNAL_NL)
          [lang]_callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
      }
    The last bit of code is for the entity callback. It passes the name of the
    entity and the start and end positions of the entire entity text
    (including internal newlines) in the buffer to the callback function.
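
    For reference, a callback with the signature expected by parse_[lang]
    could look like the sketch below. It only tallies the three line entities
    and ignores all others. The line_counts struct and the count_lines name
    are hypothetical and not part of Ohcount.

```c
#include <string.h>

/* Hypothetical tally of the three line entities. */
struct line_counts { int code, comment, blank; };
struct line_counts counts;

/* Has the same signature as the callback parameter of parse_c(). */
void count_lines(const char *lang, const char *entity, int start, int end) {
  (void) lang; (void) start; (void) end;  /* unused in this sketch */
  if (strcmp(entity, "lcode") == 0)
    counts.code++;
  else if (strcmp(entity, "lcomment") == 0)
    counts.comment++;
  else if (strcmp(entity, "lblank") == 0)
    counts.blank++;
  /* language entities such as "keyword" or "string" are ignored here */
}
```

    It would be passed as the last argument of the parse function, e.g.
    parse_c(buffer, length, count_lines).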

  Defining Patterns for Entities:
    Now it is time to write patterns for each entity in your language. That
    doesn't seem very hard, except when your entity can cover multiple lines.
    Comments and strings in particular can do this. To make an accurate line
    counter, you will need to count the lines covered by multi-line entities.
    When you detect a newline inside your multi-line entity, you should set the
    entity variable to be INTERNAL_NL (-1) and call the main action. The main
    action should have a case for INTERNAL_NL separate from the newline entity.
    In it, you will check if the current line is code or comment and call the
    callback function with the appropriate string ("lcode" or "lcomment") and
    beginning and end of the line (including the newline). Afterwards, you will
    reset the line_contains_code and whole_line_comment state variables. Then
    set the line_start variable to be p, the current Ragel buffer position.
    Because line_contains_code and whole_line_comment have been reset, any
    non-newline and non-space character in the multi-line pattern should set
    line_contains_code or whole_line_comment back to 1. Otherwise you would
    count the line as blank.
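
    The sketch below walks the text of one multi-line comment by hand to show
    the bookkeeping described above. It is a plain C approximation, not the
    Ragel pattern itself. The walk_comment and record_line names and the
    line_* arrays are hypothetical. Two things are assumptions of this sketch:
    the reported end position is one past the newline, and a line with neither
    flag set is reported as "lblank".

```c
#include <assert.h>
#include <ctype.h>

#define MAX_LINES 16

/* State variables named as in the required parser header. */
const char *buffer_start;
const char *line_start;
int whole_line_comment;
int line_contains_code;
#define cint(c) ((int) ((c) - buffer_start))

/* Hypothetical recorder standing in for [lang]_callback. */
const char *line_kinds[MAX_LINES];
int line_starts[MAX_LINES], line_ends[MAX_LINES];
int n_lines;

void record_line(const char *lang, const char *entity, int start, int end) {
  (void) lang;
  assert(n_lines < MAX_LINES);
  line_kinds[n_lines] = entity;
  line_starts[n_lines] = start;
  line_ends[n_lines] = end;
  n_lines++;
}

/* Walks a comment spanning [ts, te), doing what the embedded actions and
 * the INTERNAL_NL case would do between them. */
void walk_comment(const char *ts, const char *te) {
  const char *p;
  for (p = ts; p < te; p++) {
    if (*p == '\n') {
      /* internal newline: report the finished line, then reset the state */
      const char *next = p + 1;
      if (line_contains_code)
        record_line("c", "lcode", cint(line_start), cint(next));
      else if (whole_line_comment)
        record_line("c", "lcomment", cint(line_start), cint(next));
      else
        record_line("c", "lblank", cint(line_start), cint(next));
      whole_line_comment = 0;
      line_contains_code = 0;
      line_start = next;
    } else if (!isspace((unsigned char) *p)) {
      /* non-space comment text: flag the line again unless it has code */
      if (!line_contains_code)
        whole_line_comment = 1;
    }
  }
}
```

    For the buffer "x = 1; /* a\n   b\n*/", with line_contains_code already
    set by the code before the comment, the first line is reported as "lcode"
    and the second as "lcomment". The third line is left flagged as a comment
    for the end-of-buffer check in parse_[lang].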

    For multi-line matches, it is important to call the 'code' or 'comment'
    actions (mentioned earlier) before an internal newline is detected so the
    line_contains_code and whole_line_comment variables are properly set. For
    other entities, you can add code for setting line_contains_code and
    whole_line_comment inside the switch statement of the main action. See the
    'code' and 'comment' actions in 'common.rl' for the appropriate code.

  That's all there is to it!