# Using the Lexer
## Calling the Lexer
You can call the Lexer in a simple Python script like so:
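A minimal sketch, assuming `Lexer` is importable from `parsing/lexer.py` (the module referenced later on this page):

```python
from parsing.lexer import Lexer

# The string you pass in is the source text to tokenize.
lexer = Lexer('BEGIN <HAPPY> "My poem about robots" END')
```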
with the string you pass in being text written in the formal language you want to tokenize.
The Lexer provides a `current_token`, a `line`, and a `column`.
For instance, you can inspect `current_token` to check whether or not the current Token is an identifier.
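A hedged sketch of such a check, assuming `Id` is defined in `parsing/token.py` (see "Expanding the Lexer" below) and has an `IDENT` member, as suggested by the example output further down:

```python
from parsing.token import Id

if lexer.current_token.id == Id.IDENT:
    print("The current token is an identifier.")
```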
## Reading Tokens
The Lexer runs on a Token stream. It doesn't read the entire input string in one go, but rather fetches just the next Token it can read.
You can fetch a Token like this:
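Calling `next_token()` advances the stream and stores the result in `current_token` (both are documented below):

```python
lexer.next_token()           # fetch the next Token from the input
token = lexer.current_token  # the freshly fetched Token
```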
**Warning**
Always fetch the first token before expecting any tokens; the Lexer won't do it automatically.
The default token of any unfetched Lexer is a `None` value.
If you want to collect the entire Token Stream from the start, you can implement it like this:
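A hedged sketch, relying on the `EOF` fail-safe described under `next_token()` below and assuming a corresponding `Id.EOF` member in `parsing/token.py`:

```python
from parsing.token import Id

tokens = []
lexer.next_token()                       # fetch the first token manually
while lexer.current_token.id != Id.EOF:  # stop at the EOF fail-safe token
    tokens.append(lexer.current_token)
    lexer.next_token()
```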
However, this is generally not recommended: collecting every Token first and then looping through all of them again is much less efficient than consuming the stream directly.
**Info**
Tokens have a separate section in the documentation. If you are looking for their definition, you can look here.
## Expanding the Lexer
If you want to add more symbols to the formal language, you can expand the Lexer.
Suppose we want to add square bracket symbols `[` and `]` for the Lexer to read.
We first go into `parsing/token.py` to create a new `Id` definition.
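A hedged sketch of what that could look like; `Id` is assumed to be an enum here, the exact enum style is an assumption, and the pre-existing members are taken from the example output below:

```python
# parsing/token.py (sketch)
from enum import Enum, auto

class Id(Enum):
    BEGIN = auto()
    END = auto()
    IDENT = auto()
    LITERAL = auto()
    EOF = auto()
    LSQBRACKET = auto()  # new: left square bracket '['
    RSQBRACKET = auto()  # new: right square bracket ']'
```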
Next, we go straight to the Lexer (`parsing/lexer.py`).
There, you may see a bunch of `match`-`case` statements. We can add our square brackets in two new `case` statements.
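A hedged sketch of the two new cases; the shape of the surrounding `match` statement and the `Token` constructor are assumptions, not verbatim code from `parsing/lexer.py`:

```python
# Inside the Lexer's match statement in parsing/lexer.py (sketch):
match char:
    # ... existing cases ...
    case '[':
        self.current_token = Token(Id.LSQBRACKET)
    case ']':
        self.current_token = Token(Id.RSQBRACKET)
    case _:
        # default case: collects identifiers; keep it last (see the warning below)
        ...
```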
**Warning**
Make sure to put the new cases above the last one. The last case is a default case that collects identifiers.
If you put the square-bracket cases below the default case, they would never be reached, and the brackets would be collected into an identifier instead.
You can test your result by just calling the Lexer with a fitting string.
```python
lexer = Lexer('BEGIN ["Square Brackets are cool"] END')
lexer.next_token()
print(lexer.current_token.id)  # BEGIN
lexer.next_token()
print(lexer.current_token.id)  # LSQBRACKET
lexer.next_token()
print(lexer.current_token.id)  # LITERAL
lexer.next_token()
print(lexer.current_token.id)  # IDENT
lexer.next_token()
print(lexer.current_token.id)  # LITERAL
lexer.next_token()
print(lexer.current_token.id)  # RSQBRACKET
lexer.next_token()
print(lexer.current_token.id)  # END
```
## The Lexer class
The Lexer takes an input string and provides a Token stream.
Attributes:
| Name | Type | Description |
|---|---|---|
| `input_string` | `str` | The input string following the rules of a formal language. |
| `current_index` | `int` | The Lexer's internal position: which character index the Lexer is currently reading. |
| `line` | `int` | The Lexer's external current line position: which line of the input string the current token is located at. |
| `column` | `int` | The Lexer's external current column position: which column of the current line the current token is located at. |
| `current_token` | `Token` | The last token that was fetched by `next_token()`. |
### cut_off()
Cuts off the input string by searching for the first `BEGIN` and the last `END`, removing everything before the first `BEGIN` and after the last `END`.
**Example**
```python
>>> lexer = Lexer('This is garbage text BEGIN <HAPPY> "My poem about robots" END and some more garbage text.')
>>> print(lexer.input_string)
This is garbage text BEGIN <HAPPY> "My poem about robots" END and some more garbage text.
>>> lexer.cut_off()
>>> print(lexer.input_string)
BEGIN <HAPPY> "My poem about robots" END
```
### next_token()
Fetches the next token from the input string by setting `Lexer.current_token` to the Token type corresponding to what was read.
**Warning**
The Lexer does not check for any syntax errors. As a result, an `EOF` fail-safe is implemented: if the Lexer goes beyond the input string length, the current token is set to an `EOF` Token by default.
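A short sketch of the fail-safe in action; the token sequence for this input is an assumption, and the printed value follows the pattern of the earlier examples:

```python
lexer = Lexer('BEGIN END')
lexer.next_token()             # BEGIN
lexer.next_token()             # END
lexer.next_token()             # past the end of the input
print(lexer.current_token.id)  # EOF
```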