railt/lexer

Fast implementation of the stateful and stateless lexers

Tuesday, July 10, 2018
by Serafim

Lexer

Note: Please send all questions and issues to https://github.com/railt/railt/issues

To quickly see how it works, write about four lines of code:

$lexer = Railt\Lexer\Factory::create(['T_WHITESPACE' => '\s+', 'T_DIGIT' => '\d+'], ['T_WHITESPACE']);

foreach ($lexer->lex(Railt\Io\File::fromSources('23 42')) as $token) {
    echo $token . "\n";
}

This example reads the source text and returns the tokens it is composed of:
1) T_DIGIT with value "23"
2) T_DIGIT with value "42"

The second argument of Factory::create() is the list of token names that are excluded from the result of the lex method. That is why we got only two significant T_DIGIT tokens. Strictly speaking, this is not the whole picture: the result also contains a T_EOI (End Of Input) token, which can be removed from the output as well by adding it to the array passed as the second argument.
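The effect of the skip list can be pictured with a small standalone sketch. It does not use railt/lexer: the `withoutSkipped` helper, the `[name, value]` token pairs, and the empty value given to T_EOI are all assumptions made for illustration.

```php
<?php
/**
 * Drops every token whose name appears in the skip list.
 * Tokens are modelled as plain [name, value] pairs (an assumption).
 */
function withoutSkipped(iterable $tokens, array $skip): \Generator
{
    foreach ($tokens as [$name, $value]) {
        if (!\in_array($name, $skip, true)) {
            yield [$name, $value];
        }
    }
}

// What a lexer might emit for "23 42" before any filtering (hypothetical data).
$raw = [
    ['T_DIGIT', '23'],
    ['T_WHITESPACE', ' '],
    ['T_DIGIT', '42'],
    ['T_EOI', ''],
];

// Skipping only whitespace still leaves the trailing T_EOI token.
$whitespaceSkipped = \iterator_to_array(withoutSkipped($raw, ['T_WHITESPACE']), false);

// Adding T_EOI to the skip list leaves just the two T_DIGIT tokens.
$eoiSkippedToo = \iterator_to_array(withoutSkipped($raw, ['T_WHITESPACE', 'T_EOI']), false);

foreach ($eoiSkippedToo as [$name, $value]) {
    echo $name . ' -> ' . $value . "\n";
}
```

With the real library, the equivalent step would presumably be passing `['T_WHITESPACE', 'T_EOI']` as the second argument, as the driver examples further down do.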

...and now let's try to understand more!

The lexer provides two types of runtime:
1) Basic: a set of algorithms with one state.
2) Multistate: a set of algorithms that can switch state between tokens.

Because there was almost no difference in speed between the stateful and stateless implementations of the same algorithm, it was decided to abandon the immutable stateful lexers.

use Railt\Lexer\Factory;

/**
 * List of available tokens in format "name => pcre"
 */
$tokens = ['T_DIGIT' => '\d+', 'T_WHITESPACE' => '\s+'];

/**
 * List of skipped tokens
 */
$skip   = ['T_WHITESPACE'];

/**
 * Options:
 *   0 - Nothing.
 *   2 - With PCRE lookahead support.
 *   4 - With multistate support.
 */
$flags = Factory::LOOKAHEAD | Factory::MULTISTATE;

/**
 * Create the lexer.
 */
$lexer = Factory::create($tokens, $skip, $flags);
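The options are combined as a bit mask. The sketch below shows how such flags can be combined and tested. The `SKETCH_*` constants and the `hasOption` helper are hypothetical stand-ins; only the values 2 and 4 come from the comment in the listing above, and the real constants live on the Factory class.

```php
<?php
// Hypothetical stand-ins for Factory::LOOKAHEAD and Factory::MULTISTATE,
// using the values documented in the listing above.
const SKETCH_LOOKAHEAD = 2;
const SKETCH_MULTISTATE = 4;

/** Returns true when every bit of $option is set in $flags. */
function hasOption(int $flags, int $option): bool
{
    return ($flags & $option) === $option;
}

$flags = SKETCH_LOOKAHEAD | SKETCH_MULTISTATE; // 2 | 4 = 6

var_dump(hasOption($flags, SKETCH_LOOKAHEAD));             // bool(true)
var_dump(hasOption(SKETCH_LOOKAHEAD, SKETCH_MULTISTATE));  // bool(false)
var_dump(hasOption(0, SKETCH_LOOKAHEAD));                  // bool(false)
```

Because each option occupies its own bit, any combination can be passed as a single integer.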

To tokenize the source text, use the ->lex(...) method, which returns an iterator of TokenInterface objects.

use Railt\Io\File;

foreach ($lexer->lex(File::fromSources('23 42')) as $token) {
    echo $token . "\n";
}

TokenInterface provides a convenient API for obtaining information about a token:

interface TokenInterface
{
    public function getName(): string;
    public function getOffset(): int;
    public function getValue(int $group = 0): ?string;
    public function getGroups(): iterable;
    public function getBytes(): int;
    public function getLength(): int;
}
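To make the accessors concrete, here is a minimal sketch of a class that satisfies this interface. `SketchToken` is hypothetical and not part of railt/lexer, and the interface is copied locally so the block runs on its own. Reading getBytes() as the byte size and getLength() as the character count is an assumption based on the method names and the ext-mbstring requirement.

```php
<?php
// Local copy of the interface listed above, so the sketch is self-contained.
interface TokenInterface
{
    public function getName(): string;
    public function getOffset(): int;
    public function getValue(int $group = 0): ?string;
    public function getGroups(): iterable;
    public function getBytes(): int;
    public function getLength(): int;
}

/** Hypothetical token class, not the library's implementation. */
final class SketchToken implements TokenInterface
{
    private $name;
    private $groups;
    private $offset;

    /** @param string[] $groups Matched text; group 0 is the whole match. */
    public function __construct(string $name, array $groups, int $offset)
    {
        $this->name = $name;
        $this->groups = $groups;
        $this->offset = $offset;
    }

    public function getName(): string { return $this->name; }
    public function getOffset(): int { return $this->offset; }
    public function getGroups(): iterable { return $this->groups; }

    public function getValue(int $group = 0): ?string
    {
        // Unknown groups yield null instead of raising a notice.
        return $this->groups[$group] ?? null;
    }

    // Assumption: "bytes" is the raw size, "length" the number of characters.
    public function getBytes(): int { return \strlen((string) $this->getValue()); }
    public function getLength(): int { return \mb_strlen((string) $this->getValue(), 'UTF-8'); }
}

// "n\u{e9}" is "n" followed by a two-byte UTF-8 character.
$token = new SketchToken('T_WORD', ["n\u{e9}", 'n'], 0);
echo $token->getName() . ': ' . $token->getBytes() . ' bytes, ' . $token->getLength() . " chars\n";
```

The two measures differ only for multibyte input, which is why the example value contains a non-ASCII character.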

Drivers

The factory returns one of the available implementations; however, you can also instantiate a driver yourself.

Basic

NativeRegex

The NativeRegex implementation is based on the built-in PHP PCRE functions.

use Railt\Lexer\Driver\NativeRegex;
use Railt\Io\File;

$lexer = new NativeRegex(['T_WHITESPACE' => '\s+', 'T_DIGIT' => '\d+'], ['T_WHITESPACE', 'T_EOI']);

foreach ($lexer->lex(File::fromSources('23 42')) as $token) {
    echo $token->getName() . ' -> ' . $token->getValue() . ' at ' . $token->getOffset() . "\n";
}

// Outputs:
// T_DIGIT -> 23 at 0
// T_DIGIT -> 42 at 3
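For intuition about how a lexer can be built on the PCRE functions, the following standalone sketch tries each pattern at the current offset with preg_match(). It is not the NativeRegex source: `sketchLex` and its `[name, value, offset]` triples are hypothetical, it emits no T_EOI token, and it assumes the patterns contain no unescaped "/" delimiter.

```php
<?php
/**
 * Minimal PCRE-based lexer sketch (illustration only).
 *
 * @param array<string,string> $tokens Map of "name => pcre"
 * @param string[] $skip Token names to leave out of the result
 * @return \Generator Yields [name, value, offset] triples
 */
function sketchLex(string $source, array $tokens, array $skip = []): \Generator
{
    $offset = 0;
    $length = \strlen($source);

    while ($offset < $length) {
        foreach ($tokens as $name => $pcre) {
            // The "A" modifier anchors the match at $offset; empty matches
            // are rejected so the loop always makes progress.
            $matched = \preg_match('/' . $pcre . '/Au', $source, $match, 0, $offset);

            if ($matched === 1 && $match[0] !== '') {
                if (!\in_array($name, $skip, true)) {
                    yield [$name, $match[0], $offset];
                }

                $offset += \strlen($match[0]);
                continue 2; // Restart pattern matching at the new offset.
            }
        }

        throw new \RuntimeException('Unrecognized input at offset ' . $offset);
    }
}

$definitions = ['T_WHITESPACE' => '\s+', 'T_DIGIT' => '\d+'];

foreach (sketchLex('23 42', $definitions, ['T_WHITESPACE']) as [$name, $value, $offset]) {
    echo $name . ' -> ' . $value . ' at ' . $offset . "\n";
}
```

Trying patterns one by one keeps the sketch short; a faster variant would combine all of them into a single alternation, at the cost of more bookkeeping.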

Lexertl

Experimental lexer based on the C++ lexertl library. To use it, you need the Parle extension.

use Railt\Lexer\Driver\ParleLexer;
use Railt\Io\File;

$lexer = new ParleLexer(['T_WHITESPACE' => '\s+', 'T_DIGIT' => '\d+'], ['T_WHITESPACE', 'T_EOI']);

foreach ($lexer->lex(File::fromSources('23 42')) as $token) {
    echo $token->getName() . ' -> ' . $token->getValue() . ' at ' . $token->getOffset() . "\n";
}

// Outputs:
// T_DIGIT -> 23 at 0
// T_DIGIT -> 42 at 3

Be careful: the library is not fully compatible with the PCRE regex syntax. See the official documentation.
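Since this driver depends on an optional extension, code that instantiates drivers directly may want to check for it first. The sketch below is one possible guard. `pickDriverClass` and the fallback rule are hypothetical, and the extension name "parle" is assumed to be the one PHP reports; only the two class names come from the examples above.

```php
<?php
/**
 * Picks a driver class name depending on whether the Parle extension
 * is loaded. The selection rule is an assumption, not library behavior.
 */
function pickDriverClass(): string
{
    return \extension_loaded('parle')
        ? 'Railt\Lexer\Driver\ParleLexer'
        : 'Railt\Lexer\Driver\NativeRegex';
}

echo pickDriverClass() . "\n";
```

Keep in mind that the two drivers do not accept exactly the same regex syntax, so token definitions should be checked against both before switching drivers at runtime.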

Multistate

This functionality is not yet implemented.

10/07 2018

dev-master

9999999-dev http://railt.org

MIT

The Requires

php >=7.1.3
ext-spl *
ext-pcre *
ext-mbstring *
railt/io 1.2.*

The Development Requires

phpunit/phpunit ^6.5

by Kirill Nesmeyanov

language php lexer

10/07 2018

1.2.0

1.2.0.0 http://railt.org

MIT

The Requires

php >=7.1.3
ext-spl *
ext-pcre *
ext-mbstring *
railt/io 1.2.*

The Development Requires

phpunit/phpunit ^6.5

by Kirill Nesmeyanov

language php lexer

10/07 2018

1.2.1

1.2.1.0 http://railt.org

MIT

The Requires

php >=7.1.3
ext-spl *
ext-pcre *
ext-mbstring *
railt/io 1.2.*

The Development Requires

phpunit/phpunit ^6.5

by Kirill Nesmeyanov

language php lexer
