SciTE Script Lexer

Writing lexers in Lua

A lexer may be written as a script in the Lua language instead of in C++. This is a little simpler and allows lexers to be developed without using a C++ compiler.

A script lexer is attached by setting the file's lexer to a name that starts with "script_". Styles and other properties can then be assigned using this name. For example:

lexer.*.zog=script_zog
style.script_zog.0=fore:#7f007f,bold
style.script_zog.1=fore:#000000
style.script_zog.2=fore:#000080,bold
style.script_zog.3=fore:#008000,font:Georgia,italics,size:9

Then the lexer is implemented in Lua similar to this:

-- -*- coding: utf-8 -*-

function OnStyle(styler)
    local S_DEFAULT = 0
    local S_IDENTIFIER = 1
    local S_KEYWORD = 2
    local S_UNICODECOMMENT = 3
    local identifierCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
    while styler:More() do

        -- Exit state if needed
        if styler:State() == S_IDENTIFIER then
            if not identifierCharacters:find(styler:Current(), 1, true) then
                local identifier = styler:Token()
                if identifier == "if" or identifier == "end" then
                    styler:ChangeState(S_KEYWORD)
                end
                styler:SetState(S_DEFAULT)
            end
        elseif styler:State() == S_UNICODECOMMENT then
            if styler:Match("»") then
                styler:ForwardSetState(S_DEFAULT)
            end
        end

        -- Enter state if needed
        if styler:State() == S_DEFAULT then
            if styler:Match("«") then
                styler:SetState(S_UNICODECOMMENT)
            elseif identifierCharacters:find(styler:Current(), 1, true) then
                styler:SetState(S_IDENTIFIER)
            end
        end

        styler:Forward()
    end
    styler:EndStyling()
end

The result looks like:

proc clip(int a)
« Clip into the positive zone »
if (a > 0) a
0
end

Code Structure

Document Loop

The lexer loops through the indicated part of the document, assigning a style to each character.

styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
while styler:More() do
    -- Code that examines the text and sets lexical states
    styler:Forward()
end
styler:EndStyling()

There are many different ways to structure the code that examines the text and sets lexical states. A structure that has proven useful in C++ lexers is to write two blocks of code as shown in the example. The first block checks if the current state should end and if so sets the state to the default 0. The second block is responsible for detecting whether a new state should be entered from the default state. This structure means everything is dealt with as switching from or to the default state and avoids having to consider many combinations of states.

Encodings

The styler iterates over whole characters rather than bytes. Thus if the document is encoded in UTF-8, styler:Current() may be a multibyte string. If the script is also encoded in UTF-8, then it is easy to check against Unicode characters with code like

if styler:Current() == "«" then

If the document uses an encoding such as Latin-1 and the script is saved in the same encoding, literals can be used in the same way.

If the language can be encoded in different ways, then more complex, encoding-specific code may be needed.
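One way to cope with documents that may arrive in more than one encoding is to choose the delimiter byte sequences from the document's code page. The following is a minimal sketch under stated assumptions: the helper name commentDelimiters is hypothetical, the code page is assumed to be readable from the SciTE property code.page with the value 65001 meaning UTF-8, and every other value is treated as a single-byte encoding that places « and » where Latin-1 does.

```lua
-- Hypothetical helper: choose the byte sequences for « and » from a code page.
-- Decimal escapes are used so the encoding of this script file does not matter.
local function commentDelimiters(codePage)
    if tostring(codePage) == "65001" then
        -- UTF-8 encodes U+00AB and U+00BB as two bytes each
        return "\194\171", "\194\187"
    end
    -- Assumption: any other code page is Latin-1 compatible for these characters
    return "\171", "\187"
end

-- Possible use inside OnStyle (props is assumed to be SciTE's property table):
--   local open, close = commentDelimiters(props["code.page"])
--   if styler:Match(open) then styler:SetState(S_UNICODECOMMENT) end
```

Because the delimiters are returned as plain byte strings, the same Match calls can be used whichever encoding the document is in.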

Checking Before

Sometimes a lexer needs information from earlier in the file: perhaps a declaration changes the syntax, or the particular form of quote at the start of a string must be matched at its end. Since the standard loop only moves forward from the starting position, other calls such as CharAt and StyleAt must be used. These take byte positions and do not treat multi-byte characters as single entities.
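As an illustration of looking backward, the sketch below walks back from a byte position to find the quote character that opened a string. The helper name openingQuote and the idea of a dedicated string style are assumptions made for the example. It also assumes zero-based byte positions and that every byte of the string, including its opening quote, already carries that style.

```lua
-- Hypothetical helper: find the byte that opened the string containing pos.
-- Assumes positions are zero-based byte offsets and that the whole string,
-- opening quote included, has already been given stringStyle.
local function openingQuote(styler, pos, stringStyle)
    local p = pos
    -- Step back while the preceding byte is still part of the string
    while p > 0 and styler:StyleAt(p - 1) == stringStyle do
        p = p - 1
    end
    -- CharAt returns an unsigned byte value, so convert it to a one-byte string
    return string.char(styler:CharAt(p))
end
```

This is safest for positions before styler.startPos, where styles from an earlier call are already stored. Two strings that touch each other would be seen as one run, so a real lexer may need an extra check.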

Performance

The lexer above can lex approximately 90K per second on a 2.4 GHz Athlon 64. For most situations, this will feel completely fluid.

More complex lexers will be slower. If a lexer is so slow that the application becomes unresponsive then the lexer can choose to split up each request. It can do so by deciding upon a range of whole lines and using this range as the arguments to StartStyling. This allows the user's keystrokes and mouse moves to be processed. The lexer will automatically be called again to lex more of the document.
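The range calculation for splitting a request might look like the sketch below. The helper name limitedLength and the limit of 500 lines in the usage comment are hypothetical, and the request is assumed to start at the beginning of a line. The two mapping functions are passed in so the arithmetic can be checked without an editor.

```lua
-- Hypothetical helper: shorten a lexing request to at most maxLines whole lines.
-- lineOf maps a byte position to a line number; positionFromLine maps a line
-- number to the byte position of its first character.
local function limitedLength(startPos, lengthDoc, maxLines, lineOf, positionFromLine)
    local firstLine = lineOf(startPos)
    local lastLine = lineOf(startPos + lengthDoc)
    if lastLine - firstLine < maxLines then
        return lengthDoc  -- small request: lex all of it
    end
    -- Stop at the start of the line that follows the last line to be lexed
    return positionFromLine(firstLine + maxLines) - startPos
end

-- Possible use inside OnStyle; the limit of 500 lines is an arbitrary assumption:
--   local length = limitedLength(styler.startPos, styler.lengthDoc, 500,
--       function(pos) return styler:Line(pos) end,
--       function(line) return editor:PositionFromLine(line) end)
--   styler:StartStyling(styler.startPos, length, styler.initStyle)
```

A suitable limit depends on how expensive the lexer is per line, so it is best found by experiment.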

API

The API of the styler object passed to OnStyle:

Name	Explanation
StartStyling(startPos, length, initStyle)	Start setting styles from startPos for length with initial style initStyle
EndStyling()	Styling has been completed so tidy up
More() → boolean	Are there any more characters to process?
Forward()	Move forward one character
Position() → integer	The position in the document of the current character
AtLineStart() → boolean	Is the current character the first on a line?
AtLineEnd() → boolean	Is the current character the last on a line?
State() → integer	The current lexical state value
SetState(state)	Set the style of the current token to the current state and then change the state to the argument
ForwardSetState(state)	Combination of moving forward and setting the state. Useful when the current character is a token terminator like " for a string.
ChangeState(state)	Change the current state so that the state of the current token will be set to the argument
Current() → string	The current character
Next() → string	The next character
Previous() → string	The previous character
Token() → string	The current token
Match(string) → boolean	Is the text from the current position the same as the argument?
Line(position) → integer	Convert a byte position into a line number
CharAt(position) → integer	Unsigned byte value at argument
StyleAt(position) → integer	Style value at argument
LevelAt(line) → integer	Fold level for a line
SetLevelAt(line, level)	Set the fold level for a line
LineState(line) → integer	State value for a line
SetLineState(line, state)	Set state value for a line. This can be used to store extra information from lexing, such as a current language mode, so that there is no need to look back in the document.
startPos : integer	Start of the range to be lexed
lengthDoc : integer	Length of the range to be lexed
initStyle : integer	Starting style
language : string	Name of the language. Allows implementation of multiple languages with one OnStyle function.
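Since a line state is a single integer, storing more than one piece of information per line needs some packing. The sketch below is one possible scheme, not part of the styler API: the names packLineState and unpackLineState are hypothetical, and the split into a mode below 16 and a nesting depth is chosen only for illustration.

```lua
-- Hypothetical packing scheme: the low 4 bits hold a language mode (0..15)
-- and the remaining bits hold a nesting depth.
local MODE_LIMIT = 16

local function packLineState(mode, depth)
    return mode + depth * MODE_LIMIT
end

local function unpackLineState(state)
    -- Returns the mode first, then the depth
    return state % MODE_LIMIT, math.floor(state / MODE_LIMIT)
end

-- Possible use inside OnStyle, where line is the line just finished:
--   styler:SetLineState(line, packLineState(mode, depth))
--   local mode, depth = unpackLineState(styler:LineState(line - 1))
```

Plain arithmetic is used instead of bit operators so that the sketch does not depend on the Lua version embedded in the editor.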

A Line-Oriented Example

This example is for a line-oriented language of the kind sometimes used for configuration files. It uses low-level direct calls instead of the StartStyling/More/Forward/EndStyling calls.

-- A line-oriented lexer: style each line according to its first character
function OnStyle(styler)
    local lineStart = editor:LineFromPosition(styler.startPos)
    local lineEnd = editor:LineFromPosition(styler.startPos + styler.lengthDoc)
    editor:StartStyling(styler.startPos, 31)
    for line = lineStart, lineEnd do
        local lengthLine = editor:PositionFromLine(line + 1) - editor:PositionFromLine(line)
        local lineText = editor:GetLine(line)
        local first = string.sub(lineText, 1, 1)
        local style = 0
        if first == "+" then
            style = 1
        elseif first == " " or first == "\t" then
            style = 2
        end
        editor:SetStyling(lengthLine, style)
    end
end
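If more line prefixes need their own styles, the if/elseif chain can be swapped for a lookup table. The sketch below is one way to do that; the names STYLE_BY_FIRST and styleForLine are hypothetical, and the style numbers simply repeat those used in the example above.

```lua
-- Hypothetical lookup table: first character of a line -> style number
local STYLE_BY_FIRST = {
    ["+"] = 1,
    [" "] = 2,
    ["\t"] = 2,
}

-- Return the style for a whole line, falling back to style 0.
-- A missing line (nil) is treated as an empty line.
local function styleForLine(lineText)
    local first = string.sub(lineText or "", 1, 1)
    return STYLE_BY_FIRST[first] or 0
end

-- Possible use inside the loop of OnStyle:
--   editor:SetStyling(lengthLine, styleForLine(editor:GetLine(line)))
```

Keeping the decision in a small function that takes only the line text also makes it easy to check outside the editor.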