Written by Mitchell Foral

Overview:
  I will assume the reader has a decent knowledge of how Ragel works and the
  Ragel syntax.
  All parsers must do 4 things:
    * Call back when a line of code is parsed.
    * Call back when a line of comment is parsed.
    * Call back when a blank line is parsed.
    * Call back for entities parsed.
  The first three are tricker than they may seem; the last is very easy.

  Take a look at c.rl and even keep it open for reference when reading this
  document to better understand how parsers work and how to write one.

Writing a Parser:
  First create your parser in ext/ohcount_native/ragel_parsers/. It's name
  should be the language you're parsing with a '.rl' extension. Every parser
  must have the following at the top:

/************************* Required for every parser *************************/

// the name of the language
const char *LANG = "c";

// the languages entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc", "keyword",
  "identifier", "operator", "escaped_newline", "newline", "any"
};

// constants associated with the entities
enum {
  SPACE = 0, COMMENT, STRING, NUMBER, PREPROC, KEYWORD,
  IDENTIFIER, OPERATOR, ESCAPED_NL, NEWLINE, ANY
};

// do not change the following variables

// used for newlines inside patterns like strings and comments that can have
// newlines in them
#define INTERNAL_NL -1

// required by Ragel
int cs, act;
char *p, *pe, *eof, *ts, *te;

// used for calculating offsets from buffer start for start and end positions
char *buffer_start;
#define cint(c) ((int) (c - buffer_start))

// state flags for line and comment counting
int whole_line_comment;
int line_contains_code;

// the beginning of a line in the buffer for line and comment counting
char *line_start;

// state variable for the current entity being matched
int entity;

/*****************************************************************************/

  And the following at the bottom:

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param *c_callback Callback function called for each entity. Entities are
 *   the ones defined in the lexer as well as 3 additional entities used by
 *   Ohcount for counting lines: lcode, lcomment, lblank.
 */
void parse_c(char *buffer, int length,
  void (*c_callback) (const char *lang, const char *entity, int start, int end)
  ) {
  p = buffer;
  pe = buffer + length;
  eof = pe;

  buffer_start = buffer;
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = 0;
  entity = 0;

  %% write init;
  %% write exec;

  // no newline at EOF; get contents of last line
  if ((whole_line_comment || line_contains_code) && c_callback) {
    if (line_contains_code)
      c_callback(LANG, "lcode", cint(line_start), cint(pe));
    else if (whole_line_comment)
      c_callback(LANG, "lcomment", cint(line_start), cint(pe));
  }
}

  (Your parser will go between these two blocks.)

  The code can be found in the existing c.rl parser. You'll need to change:
    * LANG - Set the value of LANG to be the name of your language to parse.
    * [lang]_entities - Set the variable name to be [lang]_entities where [lang]
      is your language name. So if you're writing a C parser, it would be
      c_entities. The value is an array of string entities your language has.
      For example C has comment, string, number, etc. entities. You should
      definately have "space", "newline", and "any" entities. If your language
      has escaped newlines (or continuations), have an "escaped_newline" entity
      as well.
    * enum - Change the value of the enum to correspond with your entities. So
      if in your parser you look up [lang]_entities[ENTITY], you'll get the
      associated entity's string name.
    * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
      is the name of your language. In the case of C, it is parse_c.
    * [lang]_callback - Set the name of the callback to be [lang]_callback
      (e.g. c_callback) and change all occurances in the parse_[lang] function
      appropriately.

    You may be asking why you have to rename variables and functions. Well when
    you have languages embedded inside others, any identifiers with the same
    name can easily be mixed up.

  Try to understand what the main variables are used for. They will make more
  sense later on.

  Now you can define your Ragel parser. Name your machine after your language,
  'write data', and include 'common.rl', a file with common Ragel definitions,
  actions, etc. For example:
    %%{
      machine c;
      write data;
      include "common.rl";

      ...
    }%%

  Understanding What you're Writing:
    Before you begin to write patterns for each entity in your language, you
    need to understand how the parser should work.

    You should write a parser as a line-by-line parser for multiple lines. This
    means you match any combination of entities except a newline up until you do
    reach a newline. If the line contains only spaces, or nothing at all, it is
    blank. If the line contains spaces at first, but then a comment, or just
    simply a comment, the line is a comment. If the line contains anything but a
    comment after spaces (if there are any), it is a line of code. You will do
    this using a Ragel scanner.

  Scanner Parser Structure:
    A scanner parser will look like this:
      [lang]_line := |*
        entity1 ${ entity = ENTITY1; } => [lang]_callback;
        entity1 ${ entity = ENTITY2; } => [lang]_callback;
        ...
        entityn ${ entity = ENTITYN; } => [lang]_callback;
      *|;
    (As usual, replace [lang] with your language name.)
    Each entity is the pattern for an entity to match. For each match, the
    variable is set to a constant defined in the enum, and the main action is
    called (you will need to create this action above the scanner).

    When you detect whether or not a line is code or comment, you should call
    the appropriate 'code' or 'comment' action defined in common.rl as soon as
    possible. It is not necessary to worry about whether or not these actions
    are called more than once for a given line; the first call to either sets
    the status of the line permanently. Sometimes you cannot call 'code' or
    'comment' for one reason or another. Do not worry, as this is discussed
    later.

    When you reach a newline, you will need to decide whether the current line
    is a line of code, comment, or blank. This is easy. Simply check if the
    line_contains_code or whole_line_comment variables are set to 1. If neither
    of them are, the line is blank. Then call the [lang]_callback function (not
    action) with an "lcode", "lcomment", or "lblank" string, and the start and
    end positions of that line (including the newline). The start position of
    the line is in the line_start variable. It should be set at the beginning
    of every line either through the 'code' or 'comment' actions, or manually
    in the main action. Finally the line_contains_code, whole_line_comment, and
    line_start state variables must be reset. All this is done in the main
    action shown below.

  Main Action Structure:
    The main action looks like this:
      action [lang]_callback {
        switch(entity) {
        when ENTITY1:
          ...
          break;
        when ENTITY2:
          ...
          break;
        ...
        when ENTITYN:
          ...
          break;
        }
        if([lang]_callback && entity != INTERNAL_NL)
          [lang]_callback(LANG, [lang]_entities[entity], cint(ts), cint(te));
      }
    The last bit of code is for the entity callback. It passes the entire entity
    text (including internal newlines) and position of the entity in the buffer
    to the callback function.

  Defining Patterns for Entities:
    Now it is time to write patterns for each entity in your language. That
    doesn't seem very hard, except when your entity can cover multiple lines.
    Comments and strings in particular can do this. To make an accurate line
    counter, you will need to count the lines covered by multi-line entities.
    When you detect a newline inside your multi-line entity, you should set the
    entity variable to be INTERNAL_NL (-1) and call the main action. The main
    action should have a case for INTERNAL_NL separate from the newline entity.
    In it, you will check if the current line is code or comment and call the
    callback function with the appropriate string ("lcode" or "lcomment") and
    beginning and end of the line (including the newline). Afterwards, you will
    reset the line_contains_code and whole_line_comment state variables. Then
    set the line_start variable to be p, the current Ragel buffer position.
    Because line_contains_code and whole_line_comment have been reset, any non-
    newline and non-space character in the multi-line pattern should set
    line_contains_code or whole_line_comment back to 1. Otherwise you would
    count the line as blank.

    For multi-line matches, it is important to call the 'code' or 'comment'
    actions (mentioned earlier) before an internal newline is detected so the
    line_contains_code and whole_line_comment variables are properly set. For
    other entities, you can add code for setting line_contains_code and
    whole_line_comment inside the switch statement of the main action. See the
    'code' and 'comment' actions in 'common.rl' for the appropriate code.

  That's all there is to it!