Written by Mitchell Foral

Overview:
  I will assume the reader has a decent knowledge of how Ragel works and the
  Ragel syntax.
  All parsers must at least:
    * Call a callback function when a line of code is parsed.
    * Call a callback function when a line of comment is parsed.
    * Call a callback function when a blank line is parsed.
  Additionally, a parser can call the callback function with the position of
  each entity parsed.

  Take a look at c.rl and even keep it open for reference when reading this
  document to better understand how parsers work and how to write one.
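
  As a rough illustration of the callback contract above, the following C
  sketch tallies lines by type. The callback signature is the one parse_c takes
  later in this document; the line_counts struct, the counts variable, and the
  count_callback name are hypothetical and not part of ohcount.

```c
#include <string.h>

/* Hypothetical tallies; not part of ohcount. */
struct line_counts {
  int code;
  int comment;
  int blank;
};

static struct line_counts counts;

/* Matches the callback signature taken by parse_c(). In line counting mode
 * the entity argument is "lcode", "lcomment", or "lblank". */
void count_callback(const char *lang, const char *entity, int start, int end) {
  (void) lang; (void) start; (void) end;  /* unused in this sketch */
  if (strcmp(entity, "lcode") == 0)
    counts.code++;
  else if (strcmp(entity, "lcomment") == 0)
    counts.comment++;
  else if (strcmp(entity, "lblank") == 0)
    counts.blank++;
  /* any other entity name is ignored here */
}
```

  A parser would receive this function as its last argument, for example
  parse_c(buffer, length, 1, count_callback).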

Writing a Parser:
  First create your parser in ext/ohcount_native/ragel_parsers/. Its name
  should be the language you're parsing with a '.rl' extension. Every parser
  must have the following at the top:

/************************* Required for every parser *************************/
#include "ragel_parser_macros.h"

// the name of the language
const char *C_LANG = "c";

// the language's entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "identifier", "operator", "escaped_newline", "newline"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_IDENTIFIER, C_OPERATOR, C_ESCAPED_NL, C_NEWLINE
};

// do not change the following variables

// used for newlines inside patterns like strings and comments that can have
// newlines in them
#define INTERNAL_NL -1

// required by Ragel
int cs, act;
char *p, *pe, *eof, *ts, *te;

// used for calculating offsets from buffer start for start and end positions
char *buffer_start;
#define cint(c) ((int) (c - buffer_start))

// state flags for line and comment counting
int whole_line_comment;
int line_contains_code;

// the beginning of a line in the buffer for line and comment counting
char *line_start;

// state variable for the current entity being matched
int entity;

/*****************************************************************************/

  And the following at the bottom:

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param count Integer flag specifying whether or not to count lines. If yes,
 *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
 *   machine optimized for returning entity positions.
 * @param *callback Callback function. If count is set, callback is called for
 *   every line of code, comment, or blank with 'lcode', 'lcomment', and
 *   'lblank' respectively. Otherwise callback is called for each entity found.
 */
void parse_c(char *buffer, int length, int count,
  void (*callback) (const char *lang, const char *entity, int start, int end)
  ) {
  p = buffer;
  pe = buffer + length;
  eof = pe;

  buffer_start = buffer;
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = 0;
  entity = 0;

  %% write init;
  if (count)
    %% write exec c_line;

  // if no newline at EOF; get contents of last line
  process_last_line(C_LANG)
}

  (Your parser will go between these two blocks.)

  The code can be found in the existing c.rl parser. You'll need to change:
    * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
      the name of your language to parse. [lang] is your language name. So if
      you're writing a C parser, it would be C_LANG.
    * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
      c_entities). The value is an array of string entities your language has.
      For example C has comment, string, number, etc. entities. You should
      definitely have "space" and "newline" entities. If your language has
      escaped newlines (or continuations), have an "escaped_newline" entity as
      well.
    * enum - Change the value of the enum to correspond with your entities. So
      if in your parser you look up [lang]_entities[ENTITY], you'll get the
      associated entity's string name.
    * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
      is the name of your language. In the case of C, it is parse_c.
    * [lang]_line - (Inside parse_[lang], after 'write exec'). This is the line
      counting machine.
    TODO: [lang]_lang - The entity machine.

    You may be asking why you have to rename variables and functions. If
    variables have the same name in header files (which is what parsers are),
    the compiler complains. Also, when you have languages embedded inside each
    other, any identifiers with the same name can easily be mixed up. It's also
    important to prefix your Ragel definitions with your language name to avoid
    conflicts with other parsers.
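
    To make the renaming concrete, here is a minimal sketch for a made-up
    language called "foo". The language, its entity list, and the
    foo_entity_name helper are assumptions for illustration only; the point is
    that the enum order mirrors the array order, so each constant indexes its
    own name.

```c
/* Hypothetical language "foo"; names follow the [lang]_ convention. */
const char *FOO_LANG = "foo";

/* the language's entities */
const char *foo_entities[] = {
  "space", "comment", "string", "any", "newline"
};

/* Enum order must match foo_entities so each constant indexes its name. */
enum {
  FOO_SPACE = 0, FOO_COMMENT, FOO_STRING, FOO_ANY, FOO_NEWLINE
};

/* Hypothetical helper: returns the string name of an entity constant, or 0
 * when the value has no name (for example INTERNAL_NL, which is -1). */
const char *foo_entity_name(int entity) {
  int n = (int) (sizeof(foo_entities) / sizeof(foo_entities[0]));
  return (entity >= 0 && entity < n) ? foo_entities[entity] : 0;
}
```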

  Try to understand what the main variables are used for. They will make more
  sense later on.

  Now you can define your Ragel parser. Name your machine after your language,
  'write data', and include 'common.rl', a file with common Ragel definitions,
  actions, etc. For example:
    %%{
      machine c;
      write data;
      include "common.rl";

      ...
    }%%

  Before you begin to write patterns for each entity in your language, you need
  to understand how the parser should work.

  Each parser has two machines: one optimized for counting lines of code,
  comments, and blanks; the other for identifying entity positions in the
  buffer.

  Line Counting Machine:
    This machine should be written as a line-by-line parser for multiple lines.
    This means you match any combination of entities except a newline up until
    you do reach a newline. If the line contains only spaces, or nothing at all,
    it is blank. If the line contains a comment, optionally preceded by spaces,
    it is a comment line. If anything other than a comment follows the leading
    spaces (if there are any), it is a line of code. You will do this using a
    Ragel scanner.
    The callback function will be called for each line parsed.
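
    The classification rule above can be sketched in plain C, without Ragel,
    for a single line. This is only an illustration of the rule, not how
    ohcount implements it: it assumes a hypothetical language whose only
    comment form is "//" to end of line, and the classify_line name is made up.

```c
#include <string.h>

/* Classifies one line (without its newline) of a hypothetical language whose
 * only comment form is "//" to end of line. Returns "lblank", "lcomment",
 * or "lcode". */
const char *classify_line(const char *line) {
  while (*line == ' ' || *line == '\t')
    line++;                      /* skip leading spaces */
  if (*line == '\0')
    return "lblank";             /* only spaces, or nothing at all */
  if (strncmp(line, "//", 2) == 0)
    return "lcomment";           /* a comment follows the leading spaces */
  return "lcode";                /* anything else makes it a line of code */
}
```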

    Scanner Parser Structure:
      A scanner parser will look like this:
        [lang]_line := |*
          entity1 ${ entity = ENTITY1; } => [lang]_callback;
          entity2 ${ entity = ENTITY2; } => [lang]_callback;
          ...
          entityn ${ entity = ENTITYN; } => [lang]_callback;
        *|;
      (As usual, replace [lang] with your language name.)
      Each entity is the pattern for an entity to match, the last one typically
      being the newline entity. For each match, the entity variable is set to a
      constant defined in the enum, and the main action is called (you will need
      to create this action above the scanner).

      When you detect whether or not a line is code or comment, you should call
      the appropriate 'code' or 'comment' action defined in common.rl as soon
      as possible. It is not necessary to worry about whether or not these
      actions are called more than once for a given line; the first call to
      either sets the status of the line permanently. Sometimes you cannot call
      'code' or 'comment' for one reason or another. Do not worry, as this is
      discussed later.
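
      The "first call wins" behavior described above could look roughly like
      the sketch below. These are hypothetical function forms written only from
      the description in this document; the real 'code' and 'comment' actions
      live in common.rl and may differ in detail. The ts argument stands for
      the start of the current match.

```c
/* Stand-ins for the parser state variables declared at the top of a parser. */
static char *line_start;
static int whole_line_comment;
static int line_contains_code;

/* Remembers where the line began, unless it is already known. */
void mark_line_start(char *ts) {
  if (!line_start)
    line_start = ts;
}

/* Marks the line as code unless it was already marked as a comment. */
void mark_code(char *ts) {
  mark_line_start(ts);
  if (!whole_line_comment)
    line_contains_code = 1;
}

/* Marks the line as a comment unless it was already marked as code. */
void mark_comment(char *ts) {
  mark_line_start(ts);
  if (!line_contains_code)
    whole_line_comment = 1;
}
```

      Because each function checks the other flag first, calling either one
      again later on the same line leaves the line's status unchanged.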

      When you reach a newline, you will need to decide whether the current line
      is a line of code, comment, or blank. This is easy. Simply check if the
      line_contains_code or whole_line_comment variables are set to 1. If
      neither is set, the line is blank. Then call the callback function
      (not action) with an "lcode", "lcomment", or "lblank" string, and the
      start and end positions of that line (including the newline). The start
      position of the line is in the line_start variable. It should be set at
      the beginning of every line either through the 'code' or 'comment'
      actions, or manually in the main action. Finally the line_contains_code,
      whole_line_comment, and line_start state variables must be reset. All this
      should be done within the main action shown below.
      Note: For most parsers, the std_newline(lang) macro is sufficient and does
      everything in the main action mentioned above. The lang parameter is the
      [lang]_LANG string.
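
      As a sketch of the newline handling just described, the helper below
      picks the line type from the state flags, calls the callback, and resets
      the state. It is written from the prose above and is not the actual
      std_newline macro; finish_line, record, and the last_* variables are
      hypothetical, and ts and te stand for the start and end of the matched
      newline.

```c
typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Stand-ins for the parser state variables declared at the top of a parser. */
static char *buffer_start;
static char *line_start;
static int whole_line_comment;
static int line_contains_code;

/* Hypothetical recording callback so the result can be inspected. */
static const char *last_kind;
static int last_start, last_end;
void record(const char *lang, const char *entity, int start, int end) {
  (void) lang;
  last_kind = entity;
  last_start = start;
  last_end = end;
}

/* Reports the line that ends with the newline matched at [ts, te), then
 * resets the per-line state for the next line. */
void finish_line(const char *lang, char *ts, char *te, line_callback callback) {
  const char *kind;
  if (!line_start)
    line_start = ts;             /* the line held nothing but the newline */
  if (line_contains_code)
    kind = "lcode";
  else if (whole_line_comment)
    kind = "lcomment";
  else
    kind = "lblank";
  /* offsets are relative to the start of the buffer, newline included */
  callback(lang, kind, (int) (line_start - buffer_start),
           (int) (te - buffer_start));
  whole_line_comment = 0;
  line_contains_code = 0;
  line_start = 0;
}
```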

    Main Action Structure:
      The main action looks like this:
        action [lang]_callback {
          switch(entity) {
          case ENTITY1:
            ...
            break;
          case ENTITY2:
            ...
            break;
          ...
          case ENTITYN:
            ...
            break;
          }
        }

    Defining Patterns for Entities:
      Now it is time to write patterns for each entity in your language. That
      doesn't seem very hard, except when your entity can cover multiple lines.
      Comments and strings in particular can do this. To make an accurate line
      counter, you will need to count the lines covered by multi-line entities.
      When you detect a newline inside your multi-line entity, you should set
      the entity variable to be INTERNAL_NL (-1) and call the main action. The
      main action should have a case for INTERNAL_NL separate from the newline
      entity. In it, you will check if the current line is code or comment and
      call the callback function with the appropriate string ("lcode" or
      "lcomment") and beginning and end of the line (including the newline).
      Afterwards, you will reset the line_contains_code and whole_line_comment
      state variables. Then set the line_start variable to be p, the current
      Ragel buffer position. Because line_contains_code and whole_line_comment
      have been reset, any non-newline and non-space character in the multi-line
      pattern should set line_contains_code or whole_line_comment back to 1.
      Otherwise you would count the line as blank.
      Note: For most parsers, the std_internal_newline(lang) macro is sufficient
      and does everything in the main action mentioned above. The lang parameter
      is the [lang]_LANG string.
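
      The internal newline handling can be illustrated in plain C by walking
      the body of a multi-line comment by hand. This is a sketch under stated
      assumptions, not the std_internal_newline macro: scan_block_comment,
      record, and the recording arrays are hypothetical, and the caller is
      assumed to have set line_start and the flags for the line on which the
      comment begins.

```c
typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Stand-ins for the parser state variables declared at the top of a parser. */
static char *buffer_start;
static char *line_start;
static int whole_line_comment;
static int line_contains_code;

/* Hypothetical recording callback so the reported lines can be inspected. */
#define MAX_LINES 8
static const char *kinds[MAX_LINES];
static int starts[MAX_LINES], ends[MAX_LINES], n_lines;
void record(const char *lang, const char *entity, int start, int end) {
  (void) lang;
  if (n_lines < MAX_LINES) {
    kinds[n_lines] = entity;
    starts[n_lines] = start;
    ends[n_lines] = end;
    n_lines++;
  }
}

/* Walks a comment occupying [begin, end) and reports every line that ends
 * inside it. The comment's last line is left for the regular newline case. */
void scan_block_comment(const char *lang, char *begin, char *end,
                        line_callback callback) {
  char *p;
  if (!line_start)
    line_start = begin;
  for (p = begin; p < end; p++) {
    if (*p == '\n') {
      /* internal newline: report the line, including the newline */
      const char *kind = line_contains_code ? "lcode"
                       : whole_line_comment ? "lcomment" : "lblank";
      callback(lang, kind, (int) (line_start - buffer_start),
               (int) (p + 1 - buffer_start));
      whole_line_comment = 0;
      line_contains_code = 0;
      line_start = p + 1;        /* the next line starts after the newline */
    } else if (*p != ' ' && *p != '\t') {
      /* comment text on this line: flag it so it is not counted as blank */
      if (!line_contains_code)
        whole_line_comment = 1;
    }
  }
}
```

      Note how the empty line inside a comment is reported as blank, since
      nothing on that line set either flag back to 1.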

      For multi-line matches, it is important to call the 'code' or 'comment'
      actions (mentioned earlier) before an internal newline is detected so the
      line_contains_code and whole_line_comment variables are properly set. For
      other entities, you can use the 'code' macro inside the main action which
      executes the same code as the Ragel 'code' action. Other C macros are
      'comment' and 'ls'; the latter is typically used for the SPACE entity when
      defining line_start.

  Notes:
    * You can be a bit sloppy with the line counting machine. For example, C
      preprocessor statements have specific words such as include, ifdef, etc.
      You don't have to check for those specific words to verify that a line is
      a valid preprocessor statement; you are only interested in whether or not
      the line contains code, which in this case it does.
      The entity identifying machine, however, would have to check those words.

  Entity Identifying Machine:
    This machine doesn't have to be written as a line-by-line parser. It only
    has to identify the positions of language entities, such as whitespace,
    comments, strings, etc. in sequence.
    The callback function will be called for each entity parsed.
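
    A consumer of entity positions will typically want the text behind them.
    The sketch below copies one entity out of the buffer. It assumes that start
    and end are offsets from the start of the buffer (as computed by the cint
    macro) and that end points just past the entity's last character; the
    entity_text helper is hypothetical.

```c
#include <string.h>

/* Copies the text of an entity reported as offsets [start, end) from the
 * start of buffer into out, truncating to fit out_size (terminator included).
 * Returns the number of characters copied. */
size_t entity_text(const char *buffer, int start, int end,
                   char *out, size_t out_size) {
  size_t len;
  if (out_size == 0)
    return 0;
  if (start < 0 || end < start) {
    out[0] = '\0';               /* invalid range: return an empty string */
    return 0;
  }
  len = (size_t) (end - start);
  if (len > out_size - 1)
    len = out_size - 1;          /* leave room for the terminator */
  memcpy(out, buffer + start, len);
  out[len] = '\0';
  return len;
}
```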

  TODO: more doc for this machine.