All parsers must at least:
Take a look at 'c.rl' and even keep it open for reference when reading this document to better understand how parsers work and how to write one.
Every parser must have the following at the top:
/************************* Required for every parser *************************/
#ifndef OHCOUNT_C_PARSER_H
#define OHCOUNT_C_PARSER_H

#include "../parser_macros.h"

// the name of the language
const char *C_LANG = LANG_C;

// the languages entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/*****************************************************************************/
And the following at the bottom:
/************************* Required for every parser *************************/

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param count Integer flag specifying whether or not to count lines. If yes,
 *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
 *   machine optimized for returning entity positions.
 * @param *callback Callback function. If count is set, callback is called for
 *   every line of code, comment, or blank with 'lcode', 'lcomment', and
 *   'lblank' respectively. Otherwise callback is called for each entity found.
 */
void parse_c(char *buffer, int length, int count,
             void (*callback) (const char *lang, const char *entity, int s,
                               int e, void *udata),
             void *userdata
  ) {
  init

  %% write init;
  cs = (count) ? c_en_c_line : c_en_c_entity;
  %% write exec;

  // if no newline at EOF; callback contents of last line
  if (count) { process_last_line(C_LANG) }
}

#endif

/*****************************************************************************/
(Your parser will go between these two blocks.)
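As an illustration of the count-mode callback described in the comment above, here is a minimal sketch of a callback that tallies lines. The line_counts struct, count_callback, and fake_parse are hypothetical names used only for this example, and the "c" language string is a placeholder for whatever LANG_C expands to. fake_parse stands in for the generated parser so the sketch does not depend on Ragel output.

```c
#include <string.h>

/* Hypothetical tally structure passed through the userdata pointer. */
struct line_counts {
  int code;
  int comment;
  int blank;
};

/* Callback with the signature parse_c() expects. In count mode the entity
 * argument is "lcode", "lcomment", or "lblank". */
void count_callback(const char *lang, const char *entity, int s, int e,
                    void *udata) {
  struct line_counts *counts = (struct line_counts *)udata;
  (void)lang; (void)s; (void)e; /* unused in this sketch */
  if (strcmp(entity, "lcode") == 0) counts->code++;
  else if (strcmp(entity, "lcomment") == 0) counts->comment++;
  else if (strcmp(entity, "lblank") == 0) counts->blank++;
}

/* Stand-in for the generated parser (an assumption for this example): it
 * reports three lines the way a counting machine might. */
void fake_parse(void (*callback)(const char *lang, const char *entity, int s,
                                 int e, void *udata),
                void *userdata) {
  callback("c", "lcomment", 0, 13, userdata); /* "// a comment\n" */
  callback("c", "lblank", 13, 14, userdata);  /* "\n" */
  callback("c", "lcode", 14, 25, userdata);   /* "int x = 1;\n" */
}
```

With a real parser, the call would presumably look like parse_c(buffer, length, 1, count_callback, &counts) in place of fake_parse.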
The code can be found in the existing 'c.rl' parser. You will need to change:
Additional variables available to parsers are in the parser_macros.h file. Take a look at it and try to understand what the variables are used for. They will make more sense later on.
Now you can define your Ragel parser. Name your machine after your language, add a "write data" statement, and include 'common.rl', a file with common Ragel definitions, actions, etc. For example:
%%{
machine c;
write data;
include "common.rl";
...
}%%
Before you begin to write patterns for each entity in your language, you need to understand how the parser should work.
Each parser has two machines: one optimized for counting lines of code, comments, and blanks; the other for identifying entity positions in the buffer.
[lang]_line := |*
  entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
  entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
  ...
  entityn ${ entity = ENTITYN; } => [lang]_ccallback;
*|;
(As usual, replace [lang] with your language name.)
Each entity is the pattern for an entity to match, the last one typically being the newline entity. For each match, the entity variable is set to a constant defined in the enum, and the main action, [lang]_ccallback, is called (you will need to create this action above the scanner).
As soon as you can tell whether a line is code or comment, call the appropriate @code or @comment action defined in 'common.rl'. It is not necessary to worry about whether or not these actions are called more than once for a given line; the first call to either sets the status of the line permanently. Sometimes you cannot call @code or @comment for one reason or another. Do not worry, as this is discussed later.
When you reach a newline, you need to decide whether the current line is a line of code, comment, or blank. Simply check whether the line_contains_code or whole_line_comment variable is set to 1. If neither is, the line is blank. Then call the callback function (not action) with an "lcode", "lcomment", or "lblank" string, and the start and end positions of that line (including the newline). The start position of the line is in the line_start variable. It should be set at the beginning of every line, either through the @code or @comment actions or manually in the main action. Finally, the line_contains_code, whole_line_comment, and line_start state variables must be reset. All of this should be done within the main action shown below.
Note: For most parsers, the std_newline(lang) macro is sufficient and does everything in the main action mentioned above. The lang parameter is the [lang]_LANG string.
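The newline decision described above can be sketched in plain C. This is only an illustration, not the std_newline macro itself: the line_state struct and handle_newline function are hypothetical, positions are plain integer offsets, and resetting line_start to the end of the finished line is an assumption made for this example.

```c
#include <stddef.h>

typedef void (*line_callback)(const char *lang, const char *entity, int s,
                              int e, void *udata);

/* Simplified stand-ins for the per-line state variables named in the text. */
struct line_state {
  int line_contains_code;
  int whole_line_comment;
  int line_start;
};

/* Hypothetical helper: classify the line ending at `end` (one past the
 * newline), report it through the callback, then reset the per-line state. */
void handle_newline(struct line_state *st, const char *lang, int end,
                    line_callback callback, void *userdata) {
  const char *kind;
  if (st->line_contains_code) kind = "lcode";
  else if (st->whole_line_comment) kind = "lcomment";
  else kind = "lblank";
  callback(lang, kind, st->line_start, end, userdata);
  st->line_contains_code = 0;
  st->whole_line_comment = 0;
  st->line_start = end; /* assumption: the next line starts here */
}

/* Example callback that remembers the most recently reported line. */
struct last_line {
  const char *kind;
  int s;
  int e;
};

void record_line(const char *lang, const char *entity, int s, int e,
                 void *udata) {
  struct last_line *last = (struct last_line *)udata;
  (void)lang; /* unused in this sketch */
  last->kind = entity;
  last->s = s;
  last->e = e;
}
```

Code is checked before comment here so that a line flagged as both is reported as code; whether the real macros behave the same way is not stated in this document.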
action [lang]_ccallback {
  switch(entity) {
  case ENTITY1:
    ...
    break;
  case ENTITY2:
    ...
    break;
  ...
  case ENTITYN:
    ...
    break;
  }
}
Note: For most parsers, the std_internal_newline(lang) macro is sufficient and does everything in the main action mentioned above. The lang parameter is the [lang]_LANG string.
For multi-line matches, it is important to call the @code or @comment actions (mentioned earlier) before an internal newline is detected so the line_contains_code and whole_line_comment variables are properly set. For other entities, you can use the code macro inside the main action, which executes the same code as the Ragel @code action. Other C macros are comment and ls; the latter is typically used for the SPACE entity when defining line_start.
Also for multi-line matches, it may be necessary to use the @enqueue and @commit actions. If it is possible that a multi-line entity will not have an ending delimiter (for example a string), use the @enqueue action as soon as the start delimiter has been detected, and the @commit action as soon as the end delimiter has been detected. This will eliminate the potential for any counting errors.
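This document does not spell out how @enqueue and @commit work internally. One plausible reading is that line reports are held back between the two actions and only delivered once the end delimiter confirms the match. The sketch below illustrates that idea in plain C; every name in it (line_queue, queue_enqueue, queue_report, queue_commit, queue_discard, count_lines) is hypothetical and not taken from parser_macros.h.

```c
#include <stddef.h>

#define MAX_PENDING 16

typedef void (*line_callback)(const char *lang, const char *entity, int s,
                              int e, void *udata);

struct pending_line {
  const char *lang;
  const char *entity;
  int s;
  int e;
};

struct line_queue {
  int active;   /* nonzero between enqueue and commit/discard */
  size_t count; /* number of held-back reports */
  struct pending_line items[MAX_PENDING];
};

/* Start holding back line reports (start delimiter seen). */
void queue_enqueue(struct line_queue *q) {
  q->active = 1;
  q->count = 0;
}

/* Report a line: held back while the queue is active, delivered otherwise.
 * Reports beyond MAX_PENDING are dropped in this sketch. */
void queue_report(struct line_queue *q, const char *lang, const char *entity,
                  int s, int e, line_callback cb, void *udata) {
  if (!q->active) {
    cb(lang, entity, s, e, udata);
  } else if (q->count < MAX_PENDING) {
    struct pending_line item = { lang, entity, s, e };
    q->items[q->count++] = item;
  }
}

/* Deliver everything that was held back (end delimiter seen). */
void queue_commit(struct line_queue *q, line_callback cb, void *udata) {
  size_t i;
  for (i = 0; i < q->count; i++) {
    cb(q->items[i].lang, q->items[i].entity, q->items[i].s, q->items[i].e,
       udata);
  }
  q->active = 0;
  q->count = 0;
}

/* Throw away held-back reports (the entity never terminated). */
void queue_discard(struct line_queue *q) {
  q->active = 0;
  q->count = 0;
}

/* Example callback: counts delivered lines in an int passed via udata. */
void count_lines(const char *lang, const char *entity, int s, int e,
                 void *udata) {
  (void)lang; (void)entity; (void)s; (void)e;
  ++*(int *)udata;
}
```

Under this reading, an unterminated string would never reach the commit step, so the lines tentatively counted inside it could be dropped and re-scanned rather than counted twice.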
To write a parser with embedded languages (such as HTML with embedded CSS and JavaScript), you should first #include the embedded language's parser(s) above your Ragel code. The header file is "[elang]_parser.h".
Next, after the inclusion of 'common.rl', add "#EMBED([elang])" on separate lines for each embedded language. The build process looks for these special comments to embed the language for you automatically.
In your main action, you need to add another entity, CHECK_BLANK_ENTRY. It should call the check_blank_entry([lang]_LANG) macro. A blank entry is an entry into an embedded language where the rest of the line is blank up to the newline. For example, a CSS entry in HTML is something like:
<style type="text/css">
If there is no CSS code after the entry (a blank entry), the line should be counted as HTML code, and the check_blank_entry macro handles this. But you may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?". This will be discussed in just a bit.
The emb_newline and emb_internal_newline macros should be used instead of the std_newline and std_internal_newline macros.
For each embedded language you will have to define an entry and outry. An entry is the pattern that transitions from the parent language into the child language. An outry is the pattern from child to parent. You will need to put your entries in your [lang]_line machine. You will also need to re-create each embedded language's line machine (define as [lang]_[elang]_line; e.g. html_css_line) and put outry patterns in those. Entries typically would be defined as [lang]_[elang]_entry, and outries as [lang]_[elang]_outry.
Note: An outry should have a @check_blank_outry action so the line is not mistakenly counted as a line of embedded language code if it is actually a line of parent code.
[lang]_[elang]_entry @{ entity = CHECK_BLANK_ENTRY; } @[lang]_ccallback
  @{ saw([elang]_LANG)} => { fcall [lang]_[elang]_line; };
This checks for a blank entry and, if there is one, counts the line as a line of parent language code. If there is not, the macro does nothing. The machine then transitions into the child language.
For an outry, the current Ragel parser position is set to the beginning of the outry so the line is counted as a line of parent language code if no child code is on the same line. The machine then transitions into the parent language.
In the entity machine, the @ls, @code, @comment, @enqueue, and @commit actions are completely unnecessary.
[lang]_entity := |*
  entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
  entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
  ...
  entityn ${ entity = ENTITYN; } => [lang]_ecallback;
*|;
action [lang]_ecallback {
  callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te),
           userdata);
}
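On the receiving side, a caller's entity callback might look like the following sketch. It assumes that s and e are offsets into the parsed buffer, with e one past the last character of the entity, as Ragel's ts and te suggest. The entity_dump struct and entity_callback are hypothetical names for this example only.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical context handed to the callback through userdata. */
struct entity_dump {
  const char *buffer; /* the text that was parsed */
  char out[256];      /* formatted report, one entity per line */
  size_t used;        /* bytes of out already filled */
};

/* Appends "lang entity [s,e) 'text'" to the report for each entity.
 * Output that does not fit in the report buffer is truncated. */
void entity_callback(const char *lang, const char *entity, int s, int e,
                     void *udata) {
  struct entity_dump *dump = (struct entity_dump *)udata;
  size_t room = sizeof dump->out - dump->used;
  int n = snprintf(dump->out + dump->used, room, "%s %s [%d,%d) '%.*s'\n",
                   lang, entity, s, e, e - s, dump->buffer + s);
  if (n > 0) dump->used += ((size_t)n < room) ? (size_t)n : room - 1;
}
```

With a real parser, this callback would presumably be passed with count set to 0, for example parse_c(buffer, length, 0, entity_callback, &dump).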
The header file will need to be "#include"ed in 'test/unit/test_parsers.h'. Then add the "all_[lang]_tests()" function to the "all_parser_tests()" function.
Recompile the tests for the changes to take effect.
The other files added to the 'test/{src,expected}_dir/' directories will be automatically detected and run with the test suite.