Using the Lexer

Calling the Lexer

You can call the Lexer in a simple Python script like so:

from parsing.lexer import Lexer

lexer = Lexer('BEGIN <HAPPY> "Hi mom!" END')

where the string you pass in is written in the formal language you want to tokenize.

The Lexer provides a current_token, a line and a column.

For instance, you can use

lexer.current_token.id == Id.IDENT

to check whether the current Token is an identifier.
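
The line and column attributes come in handy for error reporting. A minimal sketch (assuming Id can be imported from parsing/token.py, where it is defined):

from parsing.token import Id

if lexer.current_token.id != Id.IDENT:
    print(f"Expected an identifier at line {lexer.line}, column {lexer.column}")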

Reading Tokens

The Lexer produces a Token stream. It does not read the entire input string in one go; instead, it only fetches the next Token it can read.

You can fetch a Token like this:

lexer.next_token()
if lexer.current_token.id == Id.IDENT:
    print("I am an identifier!")

Warning

Always fetch the first token before expecting any tokens; the Lexer won't do it automatically.
Until the first token is fetched, current_token is None.
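
To illustrate (assuming, as in the test further below, that the BEGIN keyword lexes to its own Id):

lexer = Lexer('BEGIN <HAPPY> "Hi mom!" END')
print(lexer.current_token)     # None: nothing has been fetched yet
lexer.next_token()
print(lexer.current_token.id)  # the first token, BEGIN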

If you want to collect the entire Token Stream from the start, you can implement it like this:

ls = []
lexer.next_token()
while lexer.current_token.id != Id.EOF:
    ls.append(lexer.current_token)
    lexer.next_token() 

However, this is generally not recommended: collecting all tokens up front and then looping over them again is less efficient than handling each token as it is fetched.
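
If you prefer to iterate over tokens lazily instead, you can wrap the same loop in a small generator (a sketch; token_stream is not part of the Lexer API):

def token_stream(lexer):
    # Yield each token as it is fetched, stopping at EOF
    lexer.next_token()
    while lexer.current_token.id != Id.EOF:
        yield lexer.current_token
        lexer.next_token()

for token in token_stream(Lexer('BEGIN <HAPPY> "Hi mom!" END')):
    print(token.id)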

Info

Tokens have a separate section in the documentation. If you are looking for their definition, you can find it here.

Expanding the Lexer

If you want to add more symbols to the formal language, you can expand the Lexer.

Suppose we want to add the square bracket symbols [ and ] so the Lexer can read them.

First, open parsing/token.py and add a new Id definition for each symbol.

from enum import Enum

class Id(Enum):
    # ...
    EOF = 8
    LSQBRACKET = 9   # [
    RSQBRACKET = 10  # ]

Next, we move on to the Lexer (parsing/lexer.py).

There you will find a match statement with a number of case blocks. We add two new case blocks for our square brackets.

match symbol:
    # ...
    case '[':
        self.current_index += 1
        self.column += 1
        self.current_token = Token(Id.LSQBRACKET)
    case ']':
        self.current_index += 1
        self.column += 1
        self.current_token = Token(Id.RSQBRACKET)
    # ...

Warning

Make sure to put the new cases above the last one. The last case is a default case that collects identifiers.
If you put the square bracket cases below the default case, they would never be reached, and the brackets would be collected into an identifier instead.
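
Schematically, the intended ordering looks like this (case bodies elided; the actual default case in parsing/lexer.py may be written differently):

match symbol:
    case '[':
        ...  # specific symbol cases go first
    case ']':
        ...
    case _:
        ...  # default case: collects identifiers, must stay last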

You can test your result by calling the Lexer with a suitable string.

lexer = Lexer('BEGIN ["Square Brackets are cool"] END')
lexer.next_token()
print(lexer.current_token.id) # BEGIN
lexer.next_token()
print(lexer.current_token.id) # LSQBRACKET
lexer.next_token()
print(lexer.current_token.id) # LITERAL
lexer.next_token()
print(lexer.current_token.id) # RSQBRACKET
lexer.next_token()
print(lexer.current_token.id) # END

The Lexer class

The Lexer takes an input string and provides a Token stream.

Attributes:

Name           Type   Description
input_string   str    The input string following the rules of the formal language.
current_index  int    The Lexer's internal position: the character index the Lexer is currently reading.
line           int    The Lexer's external line position: the line of the input string the current token is located at.
column         int    The Lexer's external column position: the column of the current line the current token is located at.
current_token  Token  The last token fetched by Lexer.next_token().
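
For example, after fetching a token you can inspect these attributes directly (a sketch; the exact numbering of line and column depends on the implementation):

lexer = Lexer('BEGIN <HAPPY> "Hi mom!" END')
lexer.next_token()
print(lexer.current_token.id)   # id of the first token
print(lexer.current_index)      # how far into input_string the Lexer has read
print(lexer.line, lexer.column) # where the current token is located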

cut_off()

Cuts off the input string by searching for the first BEGIN and the last END, discarding everything before the first BEGIN and after the last END.

Example

>>> lexer = Lexer('This is garbage text BEGIN <HAPPY> "My poem about robots" END and some more garbage text.')
>>> print(lexer.input_string)
This is garbage text BEGIN <HAPPY> "My poem about robots" END and some more garbage text.
>>> lexer.cut_off()
>>> print(lexer.input_string)
BEGIN <HAPPY> "My poem about robots" END

next_token()

Fetches the next token from the input string, setting Lexer.current_token to the Token corresponding to what was read.

Warning

The Lexer does not check for any syntax errors. As a fail-safe, if the Lexer moves past the end of the input string, it sets the current token to an EOF Token by default.
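
A minimal sketch of that fail-safe (illustrative only; the actual implementation in parsing/lexer.py may differ):

def next_token(self):
    # EOF fail-safe: once past the end of the input,
    # every further fetch yields an EOF token
    if self.current_index >= len(self.input_string):
        self.current_token = Token(Id.EOF)
        return
    # ... otherwise, match on the current symbol as shown above ...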