SciTE Script Lexer
A lexer may be written as a script in the Lua language instead of in C++. This is a little simpler and allows lexers to be developed without using a C++ compiler.
A script lexer is attached by setting the file lexer to be a name that starts with "script_". Styles and other properties can then be assigned using this name. For example,
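A minimal sketch of such settings, assuming a hypothetical language called zog stored in files ending in .zog. The style numbers and colours are illustrative assumptions, not values prescribed by SciTE:

```
# Attach the script lexer to *.zog files (hypothetical extension)
lexer.*.zog=script_zog
# Styles are assigned through the same script_ name
# 1: identifiers, 2: keywords, 3: strings (illustrative numbering)
style.script_zog.1=fore:#000000
style.script_zog.2=fore:#00007F,bold
style.script_zog.3=fore:#7F007F
```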
Then the lexer is implemented in Lua similar to this:
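A minimal sketch of such a lexer, assuming a hypothetical language with identifiers, the keywords if and end, and double-quoted strings. The style numbers, the keyword list and the helper IsIdentifierChar are illustrative choices; only the styler calls are taken from the API table in this document.

```lua
-- Style numbers; assumed to match the style.script_<name>.N properties.
local S_DEFAULT = 0
local S_IDENTIFIER = 1
local S_KEYWORD = 2
local S_STRING = 3

local identifierChars =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_"
local keywords = { ["if"] = true, ["end"] = true }  -- hypothetical keywords

local function IsIdentifierChar(ch)
    -- Plain (non-pattern) search; an empty string never counts.
    return ch ~= "" and identifierChars:find(ch, 1, true) ~= nil
end

function OnStyle(styler)
    styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
    while styler:More() do
        -- Block 1: leave the current state if it has ended.
        if styler:State() == S_IDENTIFIER then
            if not IsIdentifierChar(styler:Current()) then
                if keywords[styler:Token()] then
                    styler:ChangeState(S_KEYWORD)
                end
                styler:SetState(S_DEFAULT)
            end
        elseif styler:State() == S_STRING then
            if styler:Match('"') then
                -- Include the closing quote in the string token.
                styler:ForwardSetState(S_DEFAULT)
            end
        end

        -- Block 2: enter a new state from the default state.
        if styler:State() == S_DEFAULT then
            if styler:Match('"') then
                styler:SetState(S_STRING)
            elseif IsIdentifierChar(styler:Current()) then
                styler:SetState(S_IDENTIFIER)
            end
        end

        styler:Forward()
    end
    styler:EndStyling()
end
```

An identifier is only reclassified as a keyword when the character after it is seen, so ChangeState is applied just before the token is closed with SetState.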
The result looks like: [screenshot of the styled text in SciTE]
The lexer loops through the indicated part of the document, assigning a style to each character.
There are many different ways to structure the code that examines the text and sets lexical states. A structure that has proven useful in C++ lexers is to write two blocks of code as shown in the example. The first block checks if the current state should end and if so sets the state to the default 0. The second block is responsible for detecting whether a new state should be entered from the default state. This structure means everything is dealt with as switching from or to the default state and avoids having to consider many combinations of states.
The styler iterates over whole characters rather than bytes. Thus if the document is encoded in UTF-8, styler:Current() may be a multibyte string. If the script is also encoded in UTF-8, then it is easy to check against Unicode characters with code like
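A minimal sketch, assuming the script file is saved as UTF-8 and that a hypothetical language encloses comments in « and ». The style number S_COMMENT and the helper name CheckGuillemetComment are illustrative:

```lua
local S_DEFAULT = 0
local S_COMMENT = 3   -- hypothetical style number

-- Enter or leave a «...» comment; returns true when the state changed.
local function CheckGuillemetComment(styler)
    if styler:State() == S_DEFAULT and styler:Current() == "«" then
        -- Current() is the whole multibyte character, so a plain
        -- string comparison against the literal is enough.
        styler:SetState(S_COMMENT)
        return true
    elseif styler:State() == S_COMMENT and styler:Current() == "»" then
        -- Step past the closing mark so it is styled as comment too.
        styler:ForwardSetState(S_DEFAULT)
        return true
    end
    return false
end
```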
If the document uses an encoding like Latin-1 and the script is saved in the same encoding, then literals can be used as above.
If the language can be encoded in different ways then more complex code may be needed along with encoding-specific code.
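One way to handle more than one encoding is to pick the byte sequences to compare against from the document's code page. This is only a sketch: the helper CommentDelimiters is hypothetical, it treats every code page other than UTF-8 as Latin-1, and it assumes the caller obtains the code page from SciTE (for instance from editor.CodePage).

```lua
local CP_UTF8 = 65001   -- Scintilla's code page value for UTF-8

-- Return the byte strings for « and » in the document's encoding.
-- The code page is a parameter so the function can be tried outside SciTE.
local function CommentDelimiters(codePage)
    if codePage == CP_UTF8 then
        return "\194\171", "\194\187"   -- UTF-8 bytes C2 AB and C2 BB
    else
        -- Assumption: any other code page is treated as Latin-1.
        return "\171", "\187"           -- single bytes AB and BB
    end
end
```

The returned strings can then be passed to styler:Match in place of literals, which keeps the script file itself pure ASCII.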
Sometimes a lexer needs to see information from earlier in the file: perhaps a declaration changes the syntax, or the particular form of quote at the start of a string must be matched at its end. Since the standard loop only goes forward from the starting position, different calls such as CharAt and StyleAt must be used. These use byte positions and do not treat multi-byte characters as single entities.
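As a sketch of looking backwards, the hypothetical helper below finds the byte that opened a string when lexing resumes in the middle of it. It assumes the text before pos has already been styled, that stringStyle is the style number this lexer uses for strings, and that two strings are never directly adjacent.

```lua
-- Walk back from byte position pos to the first byte of the run of
-- stringStyle that ends just before pos, and return that byte's value.
local function OpeningQuoteByte(styler, pos, stringStyle)
    local p = pos
    while p > 0 and styler:StyleAt(p - 1) == stringStyle do
        p = p - 1
    end
    -- CharAt returns an unsigned byte value, not a string.
    return styler:CharAt(p)
end
```

The result can be compared with string.byte("'") or string.byte('"') to decide which quote must end the string.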
The lexer above can lex approximately 90K per second on a 2.4 GHz Athlon 64. For most situations, this will feel completely fluid.
More complex lexers will be slower. If a lexer is so slow that the application becomes unresponsive then the lexer can choose to split up each request. It can do so by deciding upon a range of whole lines and using this range as the arguments to StartStyling. This allows the user's keystrokes and mouse moves to be processed. The lexer will automatically be called again to lex more of the document.
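The splitting described above might be sketched as follows. The helper ChunkLength and the limit MAX_LINES are hypothetical, the request is assumed to start at the beginning of a line, and editor:PositionFromLine is used to turn a line number back into a byte position.

```lua
local MAX_LINES = 500   -- hypothetical number of lines to lex per request

-- Return a length, no longer than lengthDoc, that ends on a line boundary
-- at most maxLines lines after the line containing startPos.
local function ChunkLength(styler, positionFromLine, maxLines)
    local firstLine = styler:Line(styler.startPos)
    local lastLine = styler:Line(styler.startPos + styler.lengthDoc)
    if lastLine - firstLine <= maxLines then
        return styler.lengthDoc
    end
    return positionFromLine(firstLine + maxLines) - styler.startPos
end

function OnStyle(styler)
    local length = ChunkLength(styler, function(line)
        return editor:PositionFromLine(line)
    end, MAX_LINES)
    styler:StartStyling(styler.startPos, length, styler.initStyle)
    while styler:More() do
        -- Examine styler:Current() and set states here.
        styler:Forward()
    end
    styler:EndStyling()
end
```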
The API of the styler object passed to OnStyle:
Name | Explanation |
---|---|
StartStyling(startPos, length, initStyle) | Start setting styles from startPos for length with initial style initStyle |
EndStyling() | Styling has been completed so tidy up |
More() → boolean | Are there any more characters to process? |
Forward() | Move forward one character |
Position() → integer | What is the position in the document of the current character? |
AtLineStart() → boolean | Is the current character the first on a line? |
AtLineEnd() → boolean | Is the current character the last on a line? |
State() → integer | The current lexical state value |
SetState(state) | Set the style of the current token to the current state and then change the state to the argument |
ForwardSetState(state) | Combination of moving forward and setting the state. Useful when the current character is a token terminator like " for a string. |
ChangeState(state) | Change the current state so that the state of the current token will be set to the argument |
Current() → string | The current character |
Next() → string | The next character |
Previous() → string | The previous character |
Token() → string | The current token |
Match(string) → boolean | Is the text from the current position the same as the argument? |
Line(position) → integer | Convert a byte position into a line number |
CharAt(position) → integer | Unsigned byte value at argument |
StyleAt(position) → integer | Style value at argument |
LevelAt(line) → integer | Fold level for a line |
SetLevelAt(line, level) | Set the fold level for a line |
LineState(line) → integer | State value for a line |
SetLineState(line, state) | Set state value for a line. This can be used to store extra information from lexing, such as a current language mode, so that there is no need to look back in the document. |
startPos : integer | Start of the range to be lexed |
lengthDoc : integer | Length of the range to be lexed |
initStyle : integer | Starting style |
language : string | Name of the language. Allows implementation of multiple languages with one OnStyle function. |
This example is for a line-oriented language as is sometimes used for configuration files. It uses low level direct calls instead of the StartStyling/More/Forward/EndStyling calls.
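A minimal sketch of this approach, assuming a hypothetical configuration format in which lines starting with # are comments and lines starting with white space are continuations. The style numbers and the helper StyleForLine are illustrative; the editor calls are direct Scintilla calls, and the second argument to editor:StartStyling is a legacy mask value that newer versions are assumed to ignore.

```lua
-- Style numbers with hypothetical meanings.
local S_DEFAULT = 0
local S_COMMENT = 1
local S_CONTINUATION = 2

-- Choose a style for a whole line from its first character.
local function StyleForLine(lineText)
    local first = lineText:sub(1, 1)
    if first == "#" then
        return S_COMMENT
    elseif first == " " or first == "\t" then
        return S_CONTINUATION
    end
    return S_DEFAULT
end

function OnStyle(styler)
    local lineStart = editor:LineFromPosition(styler.startPos)
    local lineEnd = editor:LineFromPosition(styler.startPos + styler.lengthDoc)
    -- Begin at the start of the first line so whole lines are styled.
    editor:StartStyling(editor:PositionFromLine(lineStart), 31)
    for line = lineStart, lineEnd do
        local lengthLine = editor:PositionFromLine(line + 1) -
                           editor:PositionFromLine(line)
        local lineText = editor:GetLine(line) or ""
        -- One call styles the entire line, including its line end.
        editor:SetStyling(lengthLine, StyleForLine(lineText))
    end
end
```

Because each line is styled with a single call, there is no per-character loop in Lua, which suits formats where the first character decides the style of the whole line.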