Using the Lexer

Calling the Lexer

You can call the Lexer in a simple Python script like so:

from parsing.lexer import Lexer

lexer = Lexer('BEGIN <HAPPY> "Hi mom!" END')

where the string you pass in is written in the formal language you want to tokenize.

The Lexer provides a current_token, a line and a column.

For instance, you can use

lexer.current_token.id == Id.IDENT

to check whether the current Token is an identifier.
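
The line and column attributes come in handy for error reporting. A minimal sketch (assuming Id can be imported from parsing/token.py, where it is defined):

from parsing.token import Id

if lexer.current_token.id != Id.IDENT:
    print(f"Expected an identifier at line {lexer.line}, column {lexer.column}")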

Reading Tokens

The Lexer produces a Token stream. It does not read the entire input string in one go; instead, it only fetches the next Token it can read.

You can fetch a Token like this:

lexer.next_token()
if lexer.current_token.id == Id.IDENT:
    print("I am an identifier!")

Warning

Always fetch the first token before expecting any tokens; the Lexer won't do it automatically.
Until the first token is fetched, current_token is None.
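
To illustrate (assuming, as in the test further below, that the BEGIN keyword lexes to its own Id):

lexer = Lexer('BEGIN <HAPPY> "Hi mom!" END')
print(lexer.current_token)     # None: nothing has been fetched yet
lexer.next_token()
print(lexer.current_token.id)  # the first token, BEGIN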

If you want to collect the entire Token Stream from the start, you can implement it like this:

ls = []
lexer.next_token()
while lexer.current_token.id != Id.EOF:
    ls.append(lexer.current_token)
    lexer.next_token() 

However, this is generally not recommended: collecting all tokens up front and then looping over them again is less efficient than handling each token as it is fetched.
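
If you prefer to iterate over tokens lazily instead, you can wrap the same loop in a small generator (a sketch; token_stream is not part of the Lexer API):

def token_stream(lexer):
    # Yield each token as it is fetched, stopping at EOF
    lexer.next_token()
    while lexer.current_token.id != Id.EOF:
        yield lexer.current_token
        lexer.next_token()

for token in token_stream(Lexer('BEGIN <HAPPY> "Hi mom!" END')):
    print(token.id)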

Info

Tokens have a separate section in the documentation. If you are looking for their definition, you can find it here.

Expanding the Lexer

If you want to add more symbols to the formal language, you can expand the Lexer.

Suppose we want to add the square bracket symbols [ and ] so the Lexer can read them.

First, open parsing/token.py and add a new Id definition for each symbol.

from enum import Enum

class Id(Enum):
    # ...
    EOF = 8
    LSQBRACKET = 9   # [
    RSQBRACKET = 10  # ]

Next, we move on to the Lexer (parsing/lexer.py).

There you will find a match statement with a number of case blocks. We add two new case blocks for our square brackets.

match symbol:
    # ...
    case '[':
        self.current_index += 1
        self.column += 1
        self.current_token = Token(Id.LSQBRACKET)
    case ']':
        self.current_index += 1
        self.column += 1
        self.current_token = Token(Id.RSQBRACKET)
    # ...

Warning

Make sure to put the new cases above the last one. The last case is a default case that collects identifiers.
If you put the square bracket cases below the default case, they would never be reached, and the brackets would be collected into an identifier instead.
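
Schematically, the intended ordering looks like this (case bodies elided; the actual default case in parsing/lexer.py may be written differently):

match symbol:
    case '[':
        ...  # specific symbol cases go first
    case ']':
        ...
    case _:
        ...  # default case: collects identifiers, must stay last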

You can test your result by calling the Lexer with a suitable string.

lexer = Lexer('BEGIN ["Square Brackets are cool"] END')
lexer.next_token()
print(lexer.current_token.id) # BEGIN
lexer.next_token()
print(lexer.current_token.id) # LSQBRACKET
lexer.next_token()
print(lexer.current_token.id) # LITERAL
lexer.next_token()
print(lexer.current_token.id) # RSQBRACKET
lexer.next_token()
print(lexer.current_token.id) # END

The Lexer class

The Lexer takes an input string and provides a Token stream.

Attributes:

Name           Type   Description
input_string   str    The input string following the rules of the formal language.
current_index  int    The Lexer's internal position: the character index the Lexer is currently reading.
line           int    The Lexer's external line position: the line of the input string the current token is located at.
column         int    The Lexer's external column position: the column of the current line the current token is located at.
current_token  Token  The last token fetched by Lexer.next_token().
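
For example, after fetching a token you can inspect these attributes directly (a sketch; the exact numbering of line and column depends on the implementation):

lexer = Lexer('BEGIN <HAPPY> "Hi mom!" END')
lexer.next_token()
print(lexer.current_token.id)   # id of the first token
print(lexer.current_index)      # how far into input_string the Lexer has read
print(lexer.line, lexer.column) # where the current token is located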

cut_off()

Cuts off the input string by searching for the first BEGIN and the last END, discarding everything before the first BEGIN and after the last END.

Example

>>> lexer = Lexer('This is garbage text BEGIN <HAPPY> "My poem about robots" END and some more garbage text.')
>>> print(lexer.input_string)
This is garbage text BEGIN <HAPPY> "My poem about robots" END and some more garbage text.
>>> lexer.cut_off()
>>> print(lexer.input_string)
BEGIN <HAPPY> "My poem about robots" END

next_token()

Fetches the next token from the input string, setting Lexer.current_token to the Token corresponding to what was read.

Warning

The Lexer does not check for any syntax errors. As a fail-safe, if the Lexer moves past the end of the input string, it sets the current token to an EOF Token by default.
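
A minimal sketch of that fail-safe (illustrative only; the actual implementation in parsing/lexer.py may differ):

def next_token(self):
    # EOF fail-safe: once past the end of the input,
    # every further fetch yields an EOF token
    if self.current_index >= len(self.input_string):
        self.current_token = Token(Id.EOF)
        return
    # ... otherwise, match on the current symbol as shown above ...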