// parser.h written by Mitchell Foral. mitchell<att>caladbolg.net.
// See COPYING for license information.

#ifndef OHCOUNT_PARSER_H
#define OHCOUNT_PARSER_H

#include "sourcefile.h"

/**
 * @page parser_doc Parser Documentation
 * @author Mitchell Foral
 *
 * @section overview Overview
 *
 * I will assume the reader has a decent knowledge of how Ragel works and the
 * Ragel syntax. If not, please review the Ragel manual found at:
 *   http://research.cs.queensu.ca/~thurston/ragel/
 *
 * All parsers must at least:
 *
 * @li Call a callback function when a line of code is parsed.
 * @li Call a callback function when a line of comment is parsed.
 * @li Call a callback function when a blank line is parsed.
 *
 * Additionally, a parser can call the callback function with the position of
 * each entity parsed.
 *
 * Take a look at 'c.rl' and even keep it open for reference when reading this
 * document to better understand how parsers work and how to write one.
 *
 * @section writing Writing a Parser
 *
 * First create your parser in 'src/parsers/'. Its name should be the language
 * you are parsing with a '.rl' extension. You will not have to manually compile
 * any parsers, as this is done automatically for you. However, you do need to
 * add your parser to 'hash/parsers.gperf'.
 *
 * Every parser must have the following at the top:
 *
 * @include parser_doc_1
 *
 * And the following at the bottom:
 *
 * @include parser_doc_2
 *
 * (Your parser will go between these two blocks.)
 *
 * The code can be found in the existing 'c.rl' parser. You will need to change:
 * @li OHCOUNT_[lang]_PARSER_H - Replace [lang] with your language name. So if
 *   you are writing a C parser, it would be OHCOUNT_C_PARSER_H.
 * @li [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
 *   the name of your language to parse as defined in languages.h. [lang] is
 *   your language name. For C it would be C_LANG.
 * @li [lang]_entities - Set the variable name to be [lang]_entities (e.g.
 *   c_entities). The value is an array of string entities your language has.
 *   For example, C has comment, string, number, etc. entities. You should
 *   definitely have "space" and "any" entities. "any" entities are typically
 *   used for entity machines (discussed later) and match any character that is
 *   not recognized so the parser does not do something unpredictable.
 * @li enum - Change the values of the enum to correspond with your entities. So
 *   if in your parser you look up [lang]_entities[ENTITY], you will get the
 *   associated entity's string name.
 * @li parse_[lang] - Set the function name to parse_[lang] where again, [lang]
 *   is the name of your language. In the case of C, it is parse_c.
 * @li [lang]_en_[lang]_line - The line counting machine.
 * @li [lang]_en_[lang]_entity - The entity machine.
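 *
 * The link between the entities array and the enum can be sketched in plain C.
 * This is only an illustration for an imaginary language "foo"; the names
 * foo_entities, FOO_* and foo_entity_name are hypothetical and are not part of
 * ohcount.
 *
```c
// Hypothetical entity names for an imaginary language "foo".
const char *foo_entities[] = { "space", "comment", "string", "any" };

// Enum constants listed in the same order as the array above, so that each
// constant is the index of its own string name.
enum { FOO_SPACE = 0, FOO_COMMENT, FOO_STRING, FOO_ANY };

// Look up the string name that belongs to an entity constant.
const char *foo_entity_name(int entity) {
  return foo_entities[entity];
}
```
 *
 * With these definitions, foo_entities[FOO_COMMENT] is "comment".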
 *
 * You may be asking why you have to rename variables and functions. Well if
 * variables have the same name in header files (which is what parsers are), the
 * compiler complains. Also, when you have languages embedded inside each other,
 * any identifiers with the same name can easily be mixed up. It is also
 * important to prefix your Ragel definitions with your language to avoid
 * conflicts with other parsers.
 *
 * Additional variables available to parsers are in the parser_macros.h file.
 * Take a look at it and try to understand what the variables are used for. They
 * will make more sense later on.
 *
 * Now you can define your Ragel parser. Name your machine after your language,
 * "write data", and include 'common.rl', a file with common Ragel definitions,
 * actions, etc. For example:
 *
 * @include parser_doc_3
 *
 * Before you begin to write patterns for each entity in your language, you need
 * to understand how the parser should work.
 *
 * Each parser has two machines: one optimized for counting lines of code,
 * comments, and blanks; the other for identifying entity positions in the
 * buffer.
 *
 * @section line Line Counting Machine
 *
 * This machine should be written as a line-by-line parser for multiple lines.
 * This means you match any combination of entities except a newline up until
 * you do reach a newline. If the line contains only spaces, or nothing at all,
 * it is blank. If the line contains only a comment, optionally preceded by
 * spaces, it is a comment. If the line contains anything but a comment after
 * spaces (if there are any), it is a line of code. You will do this using a
 * Ragel scanner. The callback function will be called for each line parsed.
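 *
 * The classification rule can be restated outside Ragel as a small C function.
 * This is a simplified sketch, not how ohcount parsers are implemented: it
 * assumes a hypothetical language whose only comments start with "//", and it
 * looks at one line at a time with the newline already removed. The name
 * classify_line is made up for the example.
 *
```c
#include <ctype.h>
#include <string.h>

// Classify one line (without its newline) for a hypothetical language whose
// only comment syntax is "//" to the end of the line.
const char *classify_line(const char *line) {
  // Skip any leading whitespace.
  while (*line != '\0' && isspace((unsigned char)*line)) line++;
  // Nothing left: the line held only spaces, or nothing at all.
  if (*line == '\0') return "lblank";
  // The first non-space text starts a comment.
  if (strncmp(line, "//", 2) == 0) return "lcomment";
  // Anything else after the spaces makes it a line of code.
  return "lcode";
}
```
 *
 * For instance, classify_line("  // note") yields "lcomment", while
 * classify_line("x = 1; // note") yields "lcode".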
 *
 * @subsection line_scanner Scanner Parser Structure
 *
 * A scanner parser will look like this:
 *
 * @include parser_doc_4
 *
 * (As usual, replace [lang] with your language name.)
 *
 * Each entity is the pattern for an entity to match, the last one typically
 * being the newline entity. For each match, the entity variable is set to a
 * constant defined in the enum, and the main action is called (you will need
 * to create this action above the scanner).
 *
 * When you detect whether a line is code or comment, you should call the
 * appropriate \@code or \@comment action defined in 'common.rl' as soon as
 * possible. It is not necessary to worry about whether or not these actions are
 * called more than once for a given line; the first call to either sets the
 * status of the line permanently. Sometimes you cannot call \@code or \@comment
 * for one reason or another. Do not worry, as this is discussed later.
 *
 * When you reach a newline, you will need to decide whether the current line is
 * a line of code, comment, or blank. This is easy. Simply check if the
 * #line_contains_code or #whole_line_comment variables are set to 1. If neither
 * of them is, the line is blank. Then call the callback function (not action)
 * with an "lcode", "lcomment", or "lblank" string, and the start and end
 * positions of that line (including the newline). The start position of the
 * line is in the #line_start variable. It should be set at the beginning of
 * every line either through the \@code or \@comment actions, or manually in the
 * main action. Finally the #line_contains_code, #whole_line_comment, and
 * #line_start state variables must be reset. All this should be done within the
 * main action shown below. Note: For most parsers, the std_newline(lang) macro
 * is sufficient and does everything in the main action mentioned above. The
 * lang parameter is the [lang]_LANG string.
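 *
 * The newline handling described above can be sketched as a standalone C
 * function. This is a hedged illustration, not the std_newline macro itself:
 * LineState, LineCallback and finish_line are hypothetical names, offsets are
 * plain ints, and reporting code ahead of comment when both flags are set is
 * an assumption of the sketch.
 *
```c
#include <stddef.h>

// Hypothetical stand-ins for the state variables named in the text. In a real
// parser they come from parser_macros.h, not from a struct like this one.
typedef struct {
  int line_contains_code;
  int whole_line_comment;
  int line_start;  // buffer offset where the current line begins
} LineState;

// Same parameter types as the callback taken by ohcount_parse(); the
// parameter names are assumptions.
typedef void (*LineCallback)(const char *lang, const char *kind,
                             int start, int end, void *userdata);

// Decide what kind of line ends at offset 'end' (just past the newline),
// report it through the callback if one is given, then reset the state.
const char *finish_line(LineState *st, const char *lang, int end,
                        LineCallback callback, void *userdata) {
  const char *kind = "lblank";  // neither flag set: the line is blank
  if (st->line_contains_code) kind = "lcode";
  else if (st->whole_line_comment) kind = "lcomment";
  if (callback != NULL) callback(lang, kind, st->line_start, end, userdata);
  // Reset the per-line state; this sketch starts the next line at 'end'.
  st->line_contains_code = 0;
  st->whole_line_comment = 0;
  st->line_start = end;
  return kind;
}
```
 *
 * Calling finish_line() once per newline yields one callback per line, with
 * line_start already pointing at the start of the next line.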
 *
 * @subsection line_action Main Action Structure
 *
 * The main action looks like this:
 *
 * @include parser_doc_5
 *
 * @subsection line_entity_patterns Defining Patterns for Entities
 *
 * Now it is time to write patterns for each entity in your language. That does
 * not seem very hard, except when your entity can cover multiple lines.
 * Comments and strings in particular can do this. To make an accurate line
 * counter, you will need to count the lines covered by multi-line entities.
 * When you detect a newline inside your multi-line entity, you should set the
 * entity variable to be #INTERNAL_NL and call the main action. The main action
 * should have a case for #INTERNAL_NL separate from the newline entity. In it,
 * you will check if the current line is code or comment and call the callback
 * function with the appropriate string ("lcode" or "lcomment") and beginning
 * and end of the line (including the newline). Afterwards, you will reset the
 * #line_contains_code and #whole_line_comment state variables. Then set the
 * #line_start variable to be #p, the current Ragel buffer position. Because
 * #line_contains_code and #whole_line_comment have been reset, any non-newline
 * and non-space character in the multi-line pattern should set
 * #line_contains_code or #whole_line_comment back to 1. Otherwise you would
 * count the line as blank.
 *
 * Note: For most parsers, the std_internal_newline(lang) macro is sufficient
 * and does everything in the main action mentioned above. The lang parameter
 * is the [lang]_LANG string.
 *
 * For multi-line matches, it is important to call the \@code or \@comment
 * actions (mentioned earlier) before an internal newline is detected so the
 * #line_contains_code and #whole_line_comment variables are properly set. For
 * other entities, you can use the #code macro inside the main action which
 * executes the same code as the Ragel \@code action. Other C macros are
 * #comment and #ls; the latter is typically used for the SPACE entity when
 * defining #line_start.
 *
 * Also for multi-line matches, it may be necessary to use the \@enqueue and
 * \@commit actions. If it is possible that a multi-line entity will not have an
 * ending delimiter (for example a string), use the \@enqueue action as soon as
 * the start delimiter has been detected, and the \@commit action as soon as
 * the end delimiter has been detected. This will eliminate the potential for
 * any counting errors.
 *
 * @subsection line_notes Notes
 *
 * You can be a bit sloppy with the line counting machine. For example the only
 * C entities that can contain newlines are strings and comments, so
 * #INTERNAL_NL would only be necessary inside them. Other than those, anything
 * other than spaces is considered code, so do not waste your time defining
 * specific patterns for other entities.
 *
 * @subsection line_embedded Parsers with Embedded Languages
 *
 * Notation: [lang] is the parent language, [elang] is the embedded language.
 *
 * To write a parser with embedded languages (such as HTML with embedded CSS and
 * JavaScript), you should first \#include the parser(s) above your Ragel code.
 * The header file is "[elang]_parser.h".
 *
 * Next, after the inclusion of 'common.rl', add "#EMBED([elang])" on separate
 * lines for each embedded language. The build process looks for these special
 * comments to embed the language for you automatically.
 *
 * In your main action, you need to add another entity #CHECK_BLANK_ENTRY. It
 * should call the #check_blank_entry([lang]_LANG) macro. A blank entry is an
 * entry into an embedded language where the rest of the line, up to the
 * newline, is blank. For example, a CSS entry in HTML is something like:
 *
 * @code
 *   <style type="text/css">
 * @endcode
 *
 * If there is no CSS code after the entry (a blank entry), the line should be
 * counted as HTML code, and the #check_blank_entry macro handles this. But you
 * may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?". This will be
 * discussed in just a bit.
 *
 * The #emb_newline and #emb_internal_newline macros should be used instead of
 * the #std_newline and #std_internal_newline macros.
 *
 * For each embedded language you will have to define an entry and outry. An
 * entry is the pattern that transitions from the parent language into the child
 * language. An outry is the pattern from child to parent. You will need to put
 * your entries in your [lang]_line machine. You will also need to re-create
 * each embedded language's line machine (define as [lang]_[elang]_line; e.g.
 * html_css_line) and put outry patterns in those. Entries typically would be
 * defined as [lang]_[elang]_entry, and outries as [lang]_[elang]_outry.
 *
 * Note: An outry should have a \@check_blank_outry action so the line is not
 * mistakenly counted as a line of embedded language code if it is actually a
 * line of parent code.
 *
 * @subsection line_entry_action Entry Pattern Actions
 *
 * @include parser_doc_6
 *
 * This checks for a blank entry and, if there is one, counts the line as a
 * line of parent language code. If there is not, the macro does nothing. The
 * machine then transitions into the child language.
 *
 * @subsection line_outry_action Outry Pattern Actions
 *
 * @include parser_doc_7
 *
 * This sets the current Ragel parser position to the beginning of the outry so
 * the line is counted as a line of parent language code if no child code is on
 * the same line. The machine then transitions into the parent language.
 *
 * @section entity Entity Identifying Machine
 *
 * This machine does not have to be written as a line-by-line parser. It only
 * has to identify the positions of language entities, such as whitespace,
 * comments, strings, etc. in sequence. As a result, it can be written much
 * faster and more easily with less thought than a line counter. Using a scanner
 * is most efficient. The callback function will be called for each entity
 * parsed.
 *
 * The \@ls, \@code, \@comment, \@enqueue, and \@commit actions are completely
 * unnecessary.
 *
 * @subsection entity_scanner Scanner Structure
 *
 * @include parser_doc_8
 *
 * @subsection entity_action Main Action Structure
 *
 * @include parser_doc_9
 *
 * @subsection entity_embedded Parsers for Embedded Languages
 *
 * TODO:
 *
 * @section tests Including Written Tests for Parsers
 *
 * You should have two kinds of tests for parsers. One will be a header file
 * that goes in the 'test/unit/parsers/' directory. The other will be an input
 * source file that goes in the 'test/src_dir/' directory, together with an
 * expected output file that goes in the 'test/expected_dir/' directory.
 *
 * The header file will need to be "#include"ed in 'test/unit/test_parsers.h'.
 * Then add the "all_[lang]_tests()" function to the "all_parser_tests()"
 * function.
 *
 * Recompile the tests for the changes to take effect.
 *
 * The other files added to the 'test/{src,expected}_dir/' directories will be
 * automatically detected and run with the test suite.
 */

/**
 * Tries to use an existing Ragel parser for the given language.
 * @param sourcefile A SourceFile created by ohcount_sourcefile_new().
 * @param count An integer flag indicating whether to count lines or parse
 *   entities.
 * @param callback A callback to use for every line or entity in the source
 *   file discovered (depends on count).
 * @param userdata Pointer to userdata used by callback (if any).
 * @return 1 if a Ragel parser is found, 0 otherwise.
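 *
 * A callback for line counting might look like the sketch below. It is an
 * illustration only: LineTally and tally_line are hypothetical names, and the
 * code assumes the first two string arguments are the language name and the
 * "lcode", "lcomment" or "lblank" string, in that order.
 *
```c
#include <string.h>

// Hypothetical tally handed to the parser through the 'userdata' pointer.
typedef struct {
  int code;
  int comments;
  int blanks;
} LineTally;

// Has the parameter types of the callback taken by ohcount_parse(). The
// parameter names are assumptions; the declaration only gives their types.
void tally_line(const char *language, const char *kind, int start, int end,
                void *userdata) {
  LineTally *tally = (LineTally *)userdata;
  // The language name and the offsets are not needed for a plain count.
  (void)language;
  (void)start;
  (void)end;
  if (strcmp(kind, "lcode") == 0) tally->code++;
  else if (strcmp(kind, "lcomment") == 0) tally->comments++;
  else if (strcmp(kind, "lblank") == 0) tally->blanks++;
}
```
 *
 * It could then be passed as ohcount_parse(sourcefile, 1, tally_line, &tally),
 * assuming a non-zero count selects line counting.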
 */
int ohcount_parse(SourceFile *sourcefile, int count,
                  void (*callback) (const char *, const char *, int, int,
                                    void *),
                  void *userdata);

#endif