SciTE Script Lexer

Writing lexers in Lua

A lexer may be written as a script in the Lua language instead of in C++. This is a little simpler and allows lexers to be developed without using a C++ compiler.

A script lexer is attached by setting the lexer property for a file pattern to a name that starts with "script_". Styles and other properties can then be assigned using this name. For example:

lexer.*.zog=script_zog
style.script_zog.0=fore:#7f007f,bold
style.script_zog.1=fore:#000000
style.script_zog.2=fore:#000080,bold
style.script_zog.3=fore:#008000,font:Georgia,italics,size:9

Then the lexer is implemented in Lua similar to this:

-- -*- coding: utf-8 -*-

function OnStyle(styler)
        -- Declared local so they do not leak into the global environment
        local S_DEFAULT = 0
        local S_IDENTIFIER = 1
        local S_KEYWORD = 2
        local S_UNICODECOMMENT = 3
        local identifierCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

        styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
        while styler:More() do

                -- Exit state if needed
                if styler:State() == S_IDENTIFIER then
                        if not identifierCharacters:find(styler:Current(), 1, true) then
                                local identifier = styler:Token()
                                if identifier == "if" or identifier == "end" then
                                        styler:ChangeState(S_KEYWORD)
                                end
                                styler:SetState(S_DEFAULT)
                        end
                elseif styler:State() == S_UNICODECOMMENT then
                        if styler:Match("»") then
                                styler:ForwardSetState(S_DEFAULT)
                        end
                end

                -- Enter state if needed
                if styler:State() == S_DEFAULT then
                        if styler:Match("«") then
                                styler:SetState(S_UNICODECOMMENT)
                        elseif identifierCharacters:find(styler:Current(), 1, true) then
                                styler:SetState(S_IDENTIFIER)
                        end
                end

                styler:Forward()
        end
        styler:EndStyling()
end

The result looks like

proc clip(int a)
« Clip into the positive zone »
if (a > 0) a
0
end

Code Structure

Document Loop

The lexer loops through the indicated part of the document, assigning a style to each character.

styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
while styler:More() do
        -- Code that examines the text and sets lexical states
        styler:Forward()
end
styler:EndStyling()

There are many different ways to structure the code that examines the text and sets lexical states. A structure that has proven useful in C++ lexers is to write two blocks of code as shown in the example. The first block checks if the current state should end and if so sets the state to the default 0. The second block is responsible for detecting whether a new state should be entered from the default state. This structure means everything is dealt with as switching from or to the default state and avoids having to consider many combinations of states.
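The two-block structure can be tried outside SciTE on an ordinary string. The following is a minimal sketch under that assumption: styleText and the S_* values are illustrative names that are not part of the styler API, and the function records one style per byte instead of calling SetState.

```lua
-- Sketch only: lexes a plain Lua string rather than a SciTE document.
-- styleText and the S_* names are illustrative, not part of the styler API.
local S_DEFAULT, S_NUMBER, S_WORD = 0, 1, 2

local function styleText(text)
        local styles = {}          -- styles[i] is the style of byte i
        local state = S_DEFAULT
        for i = 1, #text do
                local ch = text:sub(i, i)

                -- Block 1: leave the current state if it has ended
                if state == S_NUMBER and not ch:find("%d") then
                        state = S_DEFAULT
                elseif state == S_WORD and not ch:find("%a") then
                        state = S_DEFAULT
                end

                -- Block 2: enter a new state, but only from the default state
                if state == S_DEFAULT then
                        if ch:find("%d") then
                                state = S_NUMBER
                        elseif ch:find("%a") then
                                state = S_WORD
                        end
                end

                styles[i] = state
        end
        return styles
end

print(table.concat(styleText("ab 12"), ""))   -- prints 22011
```

A word followed directly by a number, as in "a1", needs no special case: the first block drops back to the default state and the second block then enters the number state for the same character.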

Encodings

The styler iterates over whole characters rather than bytes. Thus if the document is encoded in UTF-8, styler:Current() may be a multibyte string. If the script is also encoded in UTF-8, then it is easy to check against Unicode characters with code like

if styler:Current() == "«" then

If the document uses an encoding like Latin-1 and the script uses the same encoding, then literals can be used as above.

If the language can be encoded in different ways then more complex code may be needed along with encoding-specific code.
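The difference between bytes and whole characters can be seen in plain Lua, independent of SciTE. This is a rough sketch: utf8Chars is a hypothetical helper, the input is assumed to be valid UTF-8, and the comment character is written as byte escapes so that the example does not depend on the encoding of the script file.

```lua
-- Sketch only: plain Lua, independent of SciTE.
-- utf8Chars is a hypothetical helper name.
local function utf8Chars(s)
        local chars = {}
        -- One lead byte (anything that is not a continuation byte)
        -- followed by any continuation bytes (0x80-0xBF)
        for ch in s:gmatch("[^\128-\191][\128-\191]*") do
                chars[#chars + 1] = ch
        end
        return chars
end

local OPEN = "\194\171"         -- the UTF-8 bytes of «
local text = OPEN .. "hi"
local chars = utf8Chars(text)
print(#text, #chars)            -- 4 bytes but 3 characters
print(chars[1] == OPEN)         -- true
```

Each entry of chars corresponds to what styler:Current() would be expected to return for a UTF-8 document, which is why a comparison against a multibyte literal works.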

Checking Before

Sometimes a lexer needs to see information from earlier in the file: perhaps a declaration changes the syntax, or the particular form of quote at the start of a string must be matched at its end. Since the standard loop only goes forward from the starting position, different calls such as CharAt and StyleAt must be used. These use byte positions and do not treat multi-byte characters as single entities.
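As an illustration, a lexer could walk backwards with StyleAt to the start of the current string and read the opening quote with CharAt. The sketch below makes several assumptions: S_STRING is a hypothetical style number, and mockStyler is a stand-in that imitates only CharAt and StyleAt so the function can be run outside SciTE.

```lua
-- Sketch only. mockStyler stands in for the styler object so the
-- function can be tried outside SciTE; it is not part of the API.
local S_STRING = 4   -- hypothetical style number for strings

-- Walk back from pos to the first byte of the string token that
-- contains it and return that byte, which is the opening quote.
local function openingQuote(styler, pos)
        while pos > 0 and styler:StyleAt(pos - 1) == S_STRING do
                pos = pos - 1
        end
        return styler:CharAt(pos)
end

local function mockStyler(text, styles)
        return {
                -- Positions are 0-based byte offsets
                CharAt = function(self, p) return text:byte(p + 1) end,
                StyleAt = function(self, p) return styles[p + 1] end,
        }
end

local styler = mockStyler("x='ab", {0, 0, 4, 4, 4})
print(string.char(openingQuote(styler, 4)))   -- prints '
```

Two string tokens with nothing between them would be treated as a single run by this loop, so a real lexer may need an additional check.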

Performance

The lexer above can lex approximately 90K per second on a 2.4 GHz Athlon 64. For most situations, this will feel completely fluid.

More complex lexers will be slower. If a lexer is so slow that the application becomes unresponsive then the lexer can choose to split up each request. It can do so by deciding upon a range of whole lines and using this range as the arguments to StartStyling. This allows the user's keystrokes and mouse moves to be processed. The lexer will automatically be called again to lex more of the document.
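One way to choose such a range is to cap the number of whole lines handled per call. The following sketch is an assumption about how that might be done, not SciTE's own method: limitedLength and MAX_LINES are hypothetical names, and the two lookup functions are passed in so that the code runs outside SciTE. Inside SciTE they might wrap styler:Line and editor:PositionFromLine.

```lua
-- Sketch only. lineFromPos and posFromLine are passed in so this runs
-- outside SciTE. MAX_LINES is an arbitrary illustrative value.
local MAX_LINES = 2

-- Return a length covering at most MAX_LINES whole lines from startPos,
-- never more than the requested lengthDoc.
local function limitedLength(startPos, lengthDoc, lineFromPos, posFromLine)
        local firstLine = lineFromPos(startPos)
        local limitPos = posFromLine(firstLine + MAX_LINES)
        local endPos = startPos + lengthDoc
        if limitPos and limitPos >= 0 and limitPos < endPos then
                endPos = limitPos
        end
        return endPos - startPos
end

-- Stand-in document: three 4-byte lines starting at 0, 4 and 8
local starts = {0, 4, 8, 12}
local function posFromLine(line) return starts[line + 1] end
local function lineFromPos(pos)
        local line = 0
        for i, s in ipairs(starts) do
                if s <= pos then line = i - 1 end
        end
        return line
end

print(limitedLength(0, 12, lineFromPos, posFromLine))   -- prints 8
print(limitedLength(8, 4, lineFromPos, posFromLine))    -- prints 4
```

The returned value would replace styler.lengthDoc as the second argument to StartStyling.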


API

The API of the styler object passed to OnStyle:

Name                                       Explanation
StartStyling(startPos, length, initStyle)  Start setting styles from startPos for length with initial style initStyle
EndStyling()                               Styling has been completed so tidy up
More() → boolean                           Are there any more characters to process
Forward()                                  Move forward one character
Position() → integer                       What is the position in the document of the current character
AtLineStart() → boolean                    Is the current character the first on a line
AtLineEnd() → boolean                      Is the current character the last on a line
State() → integer                          The current lexical state value
SetState(state)                            Set the style of the current token to the current state and then change the state to the argument
ForwardSetState(state)                     Combination of moving forward and setting the state. Useful when the current character is a token terminator like " for a string.
ChangeState(state)                         Change the current state so that the state of the current token will be set to the argument
Current() → string                         The current character
Next() → string                            The next character
Previous() → string                        The previous character
Token() → string                           The current token
Match(string) → boolean                    Is the text from the current position the same as the argument?
Line(position) → integer                   Convert a byte position into a line number
CharAt(position) → integer                 Unsigned byte value at argument
StyleAt(position) → integer                Style value at argument
LevelAt(line) → integer                    Fold level for a line
SetLevelAt(line, level)                    Set the fold level for a line
LineState(line) → integer                  State value for a line
SetLineState(line, state)                  Set state value for a line. This can be used to store extra information from lexing, such as a current language mode, so that there is no need to look back in the document.
startPos : integer                         Start of the range to be lexed
lengthDoc : integer                        Length of the range to be lexed
initStyle : integer                        Starting style
language : string                          Name of the language. Allows implementation of multiple languages with one OnStyle function.
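
The line state calls can be illustrated with a small stand-in object. This is only a sketch: mockStyler imitates LineState and SetLineState with a Lua table so the idea can be run outside SciTE, and the MODE_* values and modeAtLineStart are hypothetical names.

```lua
-- Sketch only. The table below imitates LineState/SetLineState so the
-- idea can be tried outside SciTE; the MODE_* values are hypothetical.
local MODE_TEXT, MODE_CODE = 0, 1

local function mockStyler()
        local states = {}
        return {
                LineState = function(self, line) return states[line] or 0 end,
                SetLineState = function(self, line, state) states[line] = state end,
        }
end

-- Mode in force at the start of a line: whatever was recorded at the
-- end of the previous line, so no backwards scan of the text is needed.
local function modeAtLineStart(styler, line)
        if line == 0 then return MODE_TEXT end
        return styler:LineState(line - 1)
end

local styler = mockStyler()
styler:SetLineState(0, MODE_CODE)        -- line 0 ended inside code
print(modeAtLineStart(styler, 1))        -- prints 1
print(modeAtLineStart(styler, 0))        -- prints 0
```

A lexer using this pattern would call SetLineState whenever it reaches the end of a line and consult the previous line's value when OnStyle is called for a later range.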

Line-Oriented Example

This example is for a line-oriented language, as is sometimes used for configuration files. It uses low-level direct calls on the editor object instead of the StartStyling/More/Forward/EndStyling calls.

-- A line-oriented lexer - style each line according to its first character
function OnStyle(styler)
        local lineStart = editor:LineFromPosition(styler.startPos)
        local lineEnd = editor:LineFromPosition(styler.startPos + styler.lengthDoc)
        editor:StartStyling(styler.startPos, 31)
        for line=lineStart,lineEnd,1 do
                local lengthLine = editor:PositionFromLine(line+1) - editor:PositionFromLine(line)
                -- Fall back to "" in case no text is returned for the line
                local lineText = editor:GetLine(line) or ""
                local first = string.sub(lineText,1,1)
                local style = 0
                if first == "+" then
                        style = 1
                elseif first == " " or first == "\t" then
                        style = 2
                end
                editor:SetStyling(lengthLine, style)
        end
end
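
The per-line decision can be separated from the editor calls so that it can be checked outside SciTE. In this sketch, styleForLine is a hypothetical helper name; the style numbers are the ones used in the example above.

```lua
-- Sketch only: styleForLine is a hypothetical helper, not part of SciTE.
local function styleForLine(lineText)
        local first = string.sub(lineText or "", 1, 1)
        if first == "+" then
                return 1
        elseif first == " " or first == "\t" then
                return 2
        end
        return 0
end

print(styleForLine("+ added"))       -- prints 1
print(styleForLine("\tindented"))    -- prints 2
print(styleForLine("name=value"))    -- prints 0
print(styleForLine(""))              -- prints 0
```

Inside OnStyle the result would be passed to editor:SetStyling together with the length of the line.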