Introduction

Elyze is a general-purpose parser framework written in Rust.

It is a collection of building blocks for creating parsers over any kind of data.

This book will explain how to use the parser framework to create your own parsers.

If you want a deep dive into the basic concepts, you can read the concepts chapter.

Basic concepts

This part of the book will explain the basic concepts of the parser framework.

Scanner : A cursor over the data to parse
Matching : The process of recognizing a pattern in the data
Error handling : The error types returned when parsing fails
Recognizing : Recognizing data based on matching patterns
Visitor : The visitor pattern and the concept of accepting scanner data
Peeking : Looking ahead in the data to find a pattern

Scanner

A parser is like an eye that sweeps over the data to find relevant elements.

To represent this eye, Elyze uses a structure called a Scanner.

It's a thin wrapper around the std::io::Cursor. It allows advancing or rewinding the cursor position as the parsing progresses.

use std::io::Cursor;

/// Wrapper around a `Cursor`.
#[derive(Debug, PartialEq)]
pub struct Scanner<'a, T> {
    /// The internal cursor.
    cursor: Cursor<&'a [T]>,
}

The scanner is generic over the type of the data it scans, so it can scan data of any type. It's intended to be used over bytes, but you can also use it over a slice of structs if you want to.

extern crate elyze;
// import the scanner
use elyze::scanner::Scanner;

struct Foo;

fn main() {
    // create a scanner over a slice of arbitrary data
    let mut scanner = Scanner::new(&[Foo, Foo, Foo]);
}

Because the scanner only holds a reference to a slice of data, neither a Clone nor a Copy bound is required on the element type.

Manipulate the cursor

As parsing progresses, you will want to move the scanner as well.

Move forward

You can call the bump_by method to move forward the cursor position in the slice of data embedded in the scanner by the number of elements you have consumed.

impl<'a, T> Scanner<'a, T> {
    /// Move the internal cursor forward by `n` positions.
    ///
    /// # Arguments
    ///
    /// * `n` - The number of positions to move the cursor forward.
    ///
    /// # Panics
    ///
    /// Panics if the internal cursor is moved past the end of the data.
    pub fn bump_by(&mut self, n: usize);
}

Let's take an example: you have a slice of bytes, and you want to move the scanner forward by 3 bytes.

let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
scanner.bump_by(3);

The scanner will be at position 3. So it's now pointing to the byte 0x04.
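
As a rough illustration of how such a forward move can sit on top of std::io::Cursor, here is a minimal sketch. `MiniScanner` and its `current` helper are hypothetical stand-ins written for this example; they are not the Elyze implementation, which may differ in its details.

```rust
use std::io::Cursor;

/// Hypothetical stand-in for the book's `Scanner`, reduced to what `bump_by` needs.
pub struct MiniScanner<'a, T> {
    cursor: Cursor<&'a [T]>,
}

impl<'a, T> MiniScanner<'a, T> {
    pub fn new(data: &'a [T]) -> Self {
        MiniScanner { cursor: Cursor::new(data) }
    }

    /// Current offset, in elements, from the start of the slice.
    pub fn current_position(&self) -> usize {
        self.cursor.position() as usize
    }

    /// Move forward by `n` elements; panics if that would pass the end of the data.
    pub fn bump_by(&mut self, n: usize) {
        let target = self.current_position() + n;
        assert!(
            target <= self.cursor.get_ref().len(),
            "bumped past the end of the data"
        );
        self.cursor.set_position(target as u64);
    }

    /// The element under the cursor, if any (helper assumed for this sketch only).
    pub fn current(&self) -> Option<&'a T> {
        // copy the inner `&'a [T]` out of the cursor so the result keeps lifetime `'a`
        let data: &'a [T] = *self.cursor.get_ref();
        data.get(self.current_position())
    }
}
```

With the five-byte slice from the example, `bump_by(3)` leaves the position at 3 and `current()` returns the byte 0x04.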

Move backward

The rewind method moves the cursor backward. It's useful when your parser has taken a wrong decision path and you want to restore the scanner to its previous state.

impl<'a, T> Scanner<'a, T> {
    /// Move the internal cursor backward by `n` positions.
    ///
    /// # Arguments
    ///
    /// * `n` - The number of positions to move the cursor backward.
    ///
    /// # Panics
    ///
    /// Panics if the internal cursor is moved to a position before the start of the data.
    pub fn rewind(&mut self, n: usize);
}

In this example, you have a slice of bytes. You first move the scanner forward by 3 bytes, then find that parsing cannot continue along this path, although something else may still be parseable. So you have to go back to the previous state.

let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
scanner.bump_by(3);
// do something afterward but we want to go back to the previous state
scanner.rewind(3);

The scanner will be back at position 0. So it's pointing to the byte 0x01 again.

Get the current position

Your parsing logic may need to know the current position of the scanner.

To get it, you can call the current_position method.

impl<'a, T> Scanner<'a, T> {
    /// Return the current position of the internal cursor.
    pub fn current_position(&self) -> usize;
}

Example

let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
assert_eq!(scanner.current_position(), 0);
scanner.bump_by(3);
assert_eq!(scanner.current_position(), 3);

Move to a specific position

If you know exactly where you want to go, you can call the jump_to method.

Unlike the bump_by and rewind methods, which work on relative positions, the jump_to method takes an absolute position to move to.

impl<'a, T> Scanner<'a, T> {
    /// Move the internal cursor to the specified position.
    ///
    /// # Arguments
    ///
    /// * `n` - The position to move the cursor to.
    ///
    /// # Panics
    ///
    /// Panics if the internal cursor is moved past the end of the data.
    pub fn jump_to(&mut self, n: usize);
}

In this example, a previous operation has already moved the scanner forward by 1 byte. We record that position, do some work that moves the cursor around, and then jump back to the recorded absolute position 1.

let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
// a previous operation bumped the scanner
scanner.bump_by(1);

// record the initial position
let initial_position = scanner.current_position();
assert_eq!(initial_position, 1);

// do something
scanner.bump_by(3);
scanner.rewind(2);
scanner.bump_by(1);

// rewind the state
scanner.jump_to(initial_position);
assert_eq!(scanner.current_position(), 1);

// the remaining data is back to the initial state
assert_eq!(scanner.remaining(), &[0x02, 0x03, 0x04, 0x05]);
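
The record-then-jump-back pattern above can be wrapped in a small helper. The following is only a sketch under assumptions: `MiniScanner` is a hypothetical index-based stand-in for the Scanner, and `attempt` is a helper name invented for this example, not an Elyze function.

```rust
/// Hypothetical stand-in for the book's `Scanner`, backed by a plain index.
pub struct MiniScanner<'a, T> {
    data: &'a [T],
    position: usize,
}

impl<'a, T> MiniScanner<'a, T> {
    pub fn new(data: &'a [T]) -> Self {
        MiniScanner { data, position: 0 }
    }

    pub fn current_position(&self) -> usize {
        self.position
    }

    pub fn remaining(&self) -> &'a [T] {
        &self.data[self.position..]
    }

    pub fn bump_by(&mut self, n: usize) {
        assert!(self.position + n <= self.data.len(), "bumped past the end");
        self.position += n;
    }

    pub fn jump_to(&mut self, n: usize) {
        assert!(n <= self.data.len(), "jumped past the end");
        self.position = n;
    }
}

/// Run `parse`; if it returns `None`, restore the position recorded before the attempt.
pub fn attempt<'a, T, R>(
    scanner: &mut MiniScanner<'a, T>,
    parse: impl FnOnce(&mut MiniScanner<'a, T>) -> Option<R>,
) -> Option<R> {
    // record the absolute position before trying anything
    let checkpoint = scanner.current_position();
    let result = parse(scanner);
    if result.is_none() {
        // the attempt failed: undo every relative move in one absolute jump
        scanner.jump_to(checkpoint);
    }
    result
}
```

Because the checkpoint is an absolute position, the closure is free to bump and rewind as often as it likes; the helper does not need to know how far it moved.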

Manipulate the data

The cursor alone is not enough. You also need to be able to access the data within the scanner.

Get the remaining data

This is the most useful method: it returns the remaining data of the scanner. Remaining means all data from the current cursor position onward.

The Scanner exposes the remaining method to do that.

impl<'a, T> Scanner<'a, T> {
    /// Return the remaining data of the internal cursor.
    pub fn remaining(&self) -> &'a [T];
}

Because the returned slice carries the lifetime 'a, the data is guaranteed to live as long as the slice the scanner was built over. The scanner can be dropped, but the data returned by remaining stays valid.

fn process(mut scanner: Scanner<'_, u8>) -> &[u8] {
    // do something with the data
    // then bump the scanner
    scanner.bump_by(3);
    // return the remaining
    scanner.remaining()
}

fn main() {
    let data = b"hello world";
    let remaining = process(Scanner::new(data));
    assert_eq!(remaining, b"lo world");
}
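
To make the lifetime point concrete, here is a hedged sketch in which a function builds a scanner locally, takes two slices out of it, and returns them after the scanner is gone. `MiniScanner` is a hypothetical stand-in for the Scanner and `split_header` is an invented example function; the 2-byte header is an arbitrary choice.

```rust
/// Hypothetical stand-in for the book's `Scanner`.
pub struct MiniScanner<'a, T> {
    data: &'a [T],
    position: usize,
}

impl<'a, T> MiniScanner<'a, T> {
    pub fn new(data: &'a [T]) -> Self {
        MiniScanner { data, position: 0 }
    }

    pub fn bump_by(&mut self, n: usize) {
        assert!(self.position + n <= self.data.len(), "bumped past the end");
        self.position += n;
    }

    /// Note the `'a` on the returned slice: it borrows from the data, not from `self`.
    pub fn remaining(&self) -> &'a [T] {
        &self.data[self.position..]
    }
}

/// Split `input` into a 2-byte header and a body, or `None` if it is too short.
pub fn split_header<'a>(input: &'a [u8]) -> Option<(&'a [u8], &'a [u8])> {
    if input.len() < 2 {
        return None;
    }
    let mut scanner = MiniScanner::new(input);
    let header = &scanner.remaining()[..2];
    // `header` does not keep `scanner` borrowed, so the scanner can still be mutated
    scanner.bump_by(2);
    let body = scanner.remaining();
    // `scanner` is dropped at the end of this function; both slices still borrow from `input`
    Some((header, body))
}
```

If remaining returned a slice tied to `&self` instead of `'a`, the call to `bump_by` after taking `header` would be rejected by the borrow checker.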

Get all the data

If you want a reference to the whole data, you can call the data method.

It can be useful to get data from earlier in the parsing process.

impl<'a, T> Scanner<'a, T> {
    /// Return the whole data of the internal cursor.
    pub fn data(&self) -> &'a [T];
}

As with remaining, the returned slice stays valid after the scanner is dropped.

fn process(mut scanner: Scanner<'_, u8>) -> &[u8] {
    // do something with the data
    // then bump the scanner
    scanner.bump_by(3);
    // return the whole data
    scanner.data()
}

fn main() {
    let data = b"hello world";
    // the whole input is returned, even though the scanner was bumped
    let all_data = process(Scanner::new(data));
    assert_eq!(all_data, b"hello world");
}

Match data

Parsing is a process that recognizes a pattern in the data.

Returning to the eye analogy: as you sweep over the characters in the text, you discover, letter by letter, which word you are looking at, and so on.

Take, for example, the list of characters ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'], in which you want to find the word hello.

You will see successively the letter h, then e, then l, then l, then o.

If all the characters match, you have found the word. Otherwise, it's something else.

The `Match` trait

To materialize this idea, Elyze defines a Match trait.

pub trait Match<T> {
    /// Returns true if the data matches the pattern.
    ///
    /// # Arguments
    /// data - the data to match
    ///
    /// # Returns
    /// (found, number of matched characters)
    ///
    /// (true, index) if the data matches the pattern,
    /// (false, index) otherwise
    fn is_matching(&self, data: &[T]) -> (bool, usize);

    /// Returns the size of the data to match.
    fn size(&self) -> usize;
}

The is_matching method returns a tuple (found, index) where found is a boolean and index is the number of characters matched.

Like the Scanner struct, the Match trait is generic over the type of the data it matches.

To come back to our example, we can create a struct that implements the Match trait for the substring hello.

extern crate elyze;
use elyze::matcher::Match;

// define a structure to implement the `Match` trait
struct Hello;

// implement the `Match` trait
impl Match<u8> for Hello {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // define the pattern to match
        let pattern = b"hello";
        // check if the subslice of data matches the pattern
        // `starts_with` also handles data shorter than the pattern without panicking
        (data.starts_with(pattern), pattern.len())
    }

    fn size(&self) -> usize {
        5
    }
}

fn main() {
    let hello = Hello;
    assert_eq!(hello.is_matching(b"hello world"), (true, hello.size()));
    assert_eq!(hello.is_matching(b"world is beautiful"), (false, 5));
}
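
The Hello matcher hard-codes its pattern. As a possible generalization, the sketch below matches any byte literal. The `Match` trait is copied locally so the example compiles on its own, and `Literal` is a hypothetical type introduced for this illustration, not something Elyze is known to provide.

```rust
/// Local copy of the `Match` trait shown above, so the sketch compiles on its own.
pub trait Match<T> {
    fn is_matching(&self, data: &[T]) -> (bool, usize);
    fn size(&self) -> usize;
}

/// Hypothetical matcher for an arbitrary byte literal.
pub struct Literal(pub &'static [u8]);

impl Match<u8> for Literal {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `starts_with` returns false instead of panicking when `data` is too short
        (data.starts_with(self.0), self.0.len())
    }

    fn size(&self) -> usize {
        self.0.len()
    }
}
```

With this in place, `Literal(b"hello")` behaves like the Hello matcher, and the same type can be reused for `b"world"` or a single space.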

Errors

Not every parse can succeed. Sometimes you will find that the data you are parsing is not what you expect.

Elyze provides its own error type called ParseError, which is built on top of the thiserror crate.

To help readability, Elyze provides a type alias called ParseResult<T> that is an alias for Result<T, ParseError>.

/// The result of a parse operation
pub type ParseResult<T> = Result<T, ParseError>;

#[derive(Debug, thiserror::Error)]
pub enum ParseError {
    /// The parser reached the end of the input
    #[error("Unexpected end of input")]
    UnexpectedEndOfInput,
    /// The parser encountered an unexpected token
    #[error("Unexpected token have been encountered")]
    UnexpectedToken,
    /// Unable to decode a string as UTF-8
    #[error("UTF-8 error: {0}")]
    Utf8Error(#[from] std::str::Utf8Error),
    /// Unable to parse an integer from a string
    #[error("ParseIntError: {0}")]
    ParseIntError(#[from] std::num::ParseIntError),
}
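
The two `#[from]` variants are what allow the `?` operator to convert standard-library errors into a ParseError. The sketch below shows the mechanism with the standard library only: the `From` implementations are written by hand, roughly as the thiserror derive would generate them, the enum is cut down to two variants, and `parse_u32` is a hypothetical helper invented for this example.

```rust
use std::fmt;

/// Stand-in for two `ParseError` variants, without the `thiserror` derive.
#[derive(Debug)]
pub enum ParseError {
    Utf8Error(std::str::Utf8Error),
    ParseIntError(std::num::ParseIntError),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::Utf8Error(e) => write!(f, "UTF-8 error: {e}"),
            ParseError::ParseIntError(e) => write!(f, "ParseIntError: {e}"),
        }
    }
}

impl std::error::Error for ParseError {}

// Hand-written equivalents of the conversions that `#[from]` provides.
impl From<std::str::Utf8Error> for ParseError {
    fn from(e: std::str::Utf8Error) -> Self {
        ParseError::Utf8Error(e)
    }
}

impl From<std::num::ParseIntError> for ParseError {
    fn from(e: std::num::ParseIntError) -> Self {
        ParseError::ParseIntError(e)
    }
}

pub type ParseResult<T> = Result<T, ParseError>;

/// Hypothetical helper: decode ASCII digits into a number.
pub fn parse_u32(bytes: &[u8]) -> ParseResult<u32> {
    let text = std::str::from_utf8(bytes)?; // Utf8Error is converted by `?`
    let value = text.parse::<u32>()?; // ParseIntError is converted by `?`
    Ok(value)
}
```

Each `?` picks the matching `From` implementation, so the function body never mentions ParseError explicitly.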

Recognizing data

Matching data is important, but there are a lot of checks to do.

You first need to check if the start of the data matches the pattern.

Then, if it does, you get the subslice of the data that matches the pattern.

And most importantly, you need to bump the scanner to the end of the matched data. If you don't, your parser will never see the next data.

Let's take the previous example, where we want to find the word hello.

extern crate elyze;
use elyze::matcher::Match;
use elyze::scanner::Scanner;

// define a structure to implement the `Match` trait
struct Hello;

// implement the `Match` trait
impl Match<u8> for Hello {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // define the pattern to match
        let pattern = b"hello";
        // check if the subslice of data matches the pattern
        // `starts_with` also handles data shorter than the pattern without panicking
        (data.starts_with(pattern), pattern.len())
    }

    fn size(&self) -> usize {
        5
    }
}

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    let (found, size) = Hello.is_matching(scanner.remaining());
    if !found {
        println!("not found");
        return;
    }
    let data = &scanner.remaining()[..size];
    scanner.bump_by(size);
    println!("found: {:?}", String::from_utf8_lossy(data)); // found: "hello"
    print!("remaining: {:?}", String::from_utf8_lossy(scanner.remaining())); // remaining: " world"
}

Recognizable trait

Because it is a common operation to recognize an object, Elyze provides the Recognizable trait.

Its role is to provide a unified way to recognize objects.

pub trait Recognizable<'a, T, V>: Match<T> {
    /// Try to recognize the object for the given scanner.
    ///
    /// # Type Parameters
    /// V - The type of the object to recognize
    ///
    /// # Arguments
    /// * `scanner` - The scanner to recognize the object for.
    ///
    /// # Returns
    /// * `Ok(Some(V))` if the object was recognized,
    /// * `Ok(None)` if the object was not recognized,
    /// * `Err(ParseError)` if an error occurred
    ///
    fn recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<V>>;

    /// Try to recognize the object for the given scanner.
    ///
    /// # Arguments
    /// * `scanner` - The scanner to recognize the object for.
    ///
    /// # Returns
    /// * `Ok(Some(&[T]))` if the object was recognized,
    /// * `Ok(None)` if the object was not recognized,
    /// * `Err(ParseError)` if an error occurred
    fn recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>>;
}

It defines two methods:

recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<V>>
recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>>

Both take a mutable reference to the scanner as input, but their return types differ.

The recognize method returns the object that was recognized, whereas the recognize_slice method returns the slice of data that was recognized.

This distinction exists because sometimes you don't want the structure itself but rather the data that encodes it.

For example, we want to get all bytes until we get the first space character.

extern crate elyze;
use elyze::errors::{ParseError, ParseResult};
use elyze::matcher::Match;
use elyze::recognizer::Recognizable;
use elyze::scanner::Scanner;

// define a structure to implement the `Match` trait
struct UntilFirstSpace;

// implement the `Match` trait
impl Match<u8> for UntilFirstSpace {
    /// Check if the given data matches the pattern.
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        let mut pos = 0;
        while pos < data.len() && data[pos] != b' ' {
            pos += 1;
        }
        (pos > 0, pos)
    }

    // The size of the object is unknown
    fn size(&self) -> usize {
        0
    }
}

// implement the `Recognizable` trait
impl<'a> Recognizable<'a, u8, UntilFirstSpace> for UntilFirstSpace {
    fn recognize(self, scanner: &mut Scanner<'a, u8>) -> ParseResult<Option<UntilFirstSpace>> {
        // check if the scanner has enough data
        if self.size() > scanner.remaining().len() {
            return Err(ParseError::UnexpectedEndOfInput);
        }

        let data = scanner.remaining();

        let (result, size) = self.is_matching(data);
        if !result {
            return Ok(None);
        }
        if !scanner.is_empty() {
            scanner.bump_by(size);
        }
        Ok(Some(self))
    }

    /// Try to recognize the object for the given scanner.
    /// Return the slice of elements that were recognized.
    fn recognize_slice(self, scanner: &mut Scanner<'a, u8>) -> ParseResult<Option<&'a [u8]>> {
        // Check if the scanner is empty
        if scanner.is_empty() {
            return Err(ParseError::UnexpectedEndOfInput);
        }

        let data = scanner.remaining();

        let (result, size) = self.is_matching(data);
        if !result {
            return Ok(None);
        }
        if !scanner.is_empty() {
            scanner.bump_by(size);
        }
        Ok(Some(&data[..size]))
    }
}

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("hello")

    let mut scanner = Scanner::new(b"loooooooooong string");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("loooooooooong")
}

But this code won't compile: rustc will complain about conflicting implementations.

error[E0119]: conflicting implementations of trait `Recognizable<'_, u8, UntilFirstSpace>` for type `UntilFirstSpace`
   |
25 | impl<'a> Recognizable<'a, u8, UntilFirstSpace> for UntilFirstSpace {
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: conflicting implementation in crate `elyze`:
           - impl<'a, T, M> Recognizable<'a, T, M> for M
             where M: elyze::matcher::Match<T>;

Indeed, Elyze has already done the work for you using a marvelous feature of the Rust language called a blanket implementation.

It says that every type implementing the Match trait also implements the Recognizable trait.

Here is the blanket implementation of the Recognizable trait for any M that implements the Match trait:

/// Recognize an object for the given scanner.
/// Return the recognized object.
impl<'a, T, M: Match<T>> Recognizable<'a, T, M> for M {
    fn recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<M>> {
        // check if the scanner has enough data
        if self.size() > scanner.remaining().len() {
            return Err(ParseError::UnexpectedEndOfInput);
        }

        let data = scanner.remaining();

        // check if the data matches the pattern
        let (result, size) = self.is_matching(data);
        if !result {
            return Ok(None);
        }
        
        // bump the scanner if it's not empty
        if !scanner.is_empty() {
            scanner.bump_by(size);
        }
        
        // return the object
        Ok(Some(self))
    }

    /// Try to recognize the object for the given scanner.
    /// Return the slice of elements that were recognized.
    fn recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>> {
        // Check if the scanner is empty
        if scanner.is_empty() {
            return Err(ParseError::UnexpectedEndOfInput);
        }

        let data = scanner.remaining();

        // check if the data matches the pattern
        let (result, size) = self.is_matching(data);
        if !result {
            return Ok(None);
        }
        
        // bump the scanner if it's not empty
        if !scanner.is_empty() {
            scanner.bump_by(size);
        }
        
        // return the slice of data that was recognized
        Ok(Some(&data[..size]))
    }
}

You can therefore simplify your code to:

extern crate elyze;
use elyze::matcher::Match;
use elyze::recognizer::Recognizable;
use elyze::scanner::Scanner;

// define a structure to implement the `Match` trait
struct UntilFirstSpace;

// implement the `Match` trait
impl Match<u8> for UntilFirstSpace {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        let mut pos = 0;
        while pos < data.len() && data[pos] != b' ' {
            pos += 1;
        }
        (pos > 0, pos)
    }

    // The size of the object is unknown
    fn size(&self) -> usize {
        0
    }
}

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("hello")

    let mut scanner = Scanner::new(b"loooooooooong string");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("loooooooooong")
}

Signature breakdown

Because function signatures are polymorphic, they become a bit more complicated.

impl<'a, T, M: Match<T>> Recognizable<'a, T, M> for M {
    fn recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<M>>
}

'a type parameter is the lifetime of the data to parse.
T type parameter is the type of the data to parse.
M type parameter is the type of the object that we want to recognize.

We implement the Recognizable for:

'a the lifetime of the data to parse.
T the type of the data to parse.
M the type of the object that we want to recognize.

And the return type is ParseResult<Option<M>>.

The recognition may fail with an error, and even when no error occurs, the object may simply not be recognized; hence the Option inside the ParseResult.

The same goes for the recognize_slice function, which differs only in its return type: &'a [T] instead of M.

impl<'a, T, M: Match<T>> Recognizable<'a, T, M> for M {
    fn recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>>
}

Utility functions

With the current toolbox, you're already able to write these lines:

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    let data = Hello.recognize(&mut scanner).expect("failed to parse");

    if let Some(hello) = data {
        println!("found: {hello:?}"); // found: Hello (assuming `Hello` derives `Debug`)
        print!(
            "remaining: {:?}",
            String::from_utf8_lossy(scanner.remaining())
        ); // remaining: " world"
    } else {
        println!("not found");
    }
}

That's great, but it's a bit verbose.

To improve readability, Elyze provides a few utility functions.

recognize
recognize_slice

Both are thin wrappers around the Recognizable trait.

They basically call the recognize or recognize_slice method of the Recognizable trait and turn the None variant into an Err(ParseError::UnexpectedToken).
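
The conversion from a missing match to an error can be sketched with `Option::ok_or`. Everything below is a self-contained illustration under assumptions: the error enum is reduced to two variants, and `require` and `try_prefix` are hypothetical names, not the Elyze utility functions themselves.

```rust
/// Reduced stand-in for the book's `ParseError`.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    UnexpectedEndOfInput,
    UnexpectedToken,
}

pub type ParseResult<T> = Result<T, ParseError>;

/// Turn "no error, but nothing recognized" into a hard error.
pub fn require<V>(outcome: ParseResult<Option<V>>) -> ParseResult<V> {
    // `?` propagates an existing error; `ok_or` maps `None` to `UnexpectedToken`
    outcome?.ok_or(ParseError::UnexpectedToken)
}

/// Hypothetical recognizer: returns the leading `prefix` of `data` when present.
pub fn try_prefix<'a>(data: &'a [u8], prefix: &[u8]) -> ParseResult<Option<&'a [u8]>> {
    if data.is_empty() {
        return Err(ParseError::UnexpectedEndOfInput);
    }
    if data.starts_with(prefix) {
        Ok(Some(&data[..prefix.len()]))
    } else {
        Ok(None)
    }
}
```

Collapsing the three outcomes into two is what makes the result usable with the `?` operator in the calling code.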

recognize_slice

fn main() -> ParseResult<()> {
    let mut scanner = Scanner::new(b"hello world");
    let hello_string : &[u8] = recognize_slice(Hello, &mut scanner)?;

    println!("found: {}", String::from_utf8_lossy(hello_string)); // found: hello
    print!(
        "remaining: {:?}",
        String::from_utf8_lossy(scanner.remaining())
    ); // remaining: " world"

    Ok(())
}

The main benefit is the ability to use the ? operator to handle errors. So you can chain recognize functions.

For example, to recognize three successive "hello"s:

extern crate elyze;
fn main() -> ParseResult<()> {
    let mut scanner = Scanner::new(b"hellohellohello world");
    // recognize the first "hello"
    recognize_slice(Hello, &mut scanner)?;
    // recognize the second "hello"
    recognize_slice(Hello, &mut scanner)?;
    // recognize the third "hello"
    recognize_slice(Hello, &mut scanner)?;

    Ok(())
}

Because recognize_slice is a thin wrapper around the Recognizable trait, it has the same type parameters as the Recognizable::recognize_slice trait method, plus one for the recognizer itself.

// the signature of the `recognize_slice` function
pub fn recognize_slice<'a, T, V, R>(
    recognizable: R,
    scanner: &mut Scanner<'a, T>,
) -> ParseResult<&'a [T]>
where
    R: Recognizable<'a, T, V>,

'a type parameter is the lifetime of the data to parse.
T type parameter is the type of the data to parse.
V type parameter is the type of the object that we want to recognize.
R type parameter is the type of the recognizer itself, which must implement Recognizable<'a, T, V>.

recognize

The recognize function works like recognize_slice, but returns the object that was recognized rather than a slice.

extern crate elyze;

fn main() -> ParseResult<()> {
    let mut scanner = Scanner::new(b"hello world");
    let hello : Hello = recognize(Hello, &mut scanner)?;

    println!("found: {hello:?}"); // found: Hello (assuming `Hello` derives `Debug`)
    print!(
        "remaining: {:?}",
        String::from_utf8_lossy(scanner.remaining())
    ); // remaining: " world"

    Ok(())
}

This time it is the Hello structure that is returned, not the &[u8] slice.

Its signature is almost the same as that of the recognize_slice function; only the return type differs.

```rust,ignore
// the signature of the `recognize` function
pub fn recognize<'a, T, V, R>(
    recognizable: R,
    scanner: &mut Scanner<'a, T>,
) -> ParseResult<V>
where
    R: Recognizable<'a, T, V>,
```

V is the type of the object that was recognized.

Visitor

The keystone of Elyze is the visitor pattern.

It is materialized by a Visitor trait.

extern crate elyze;
use elyze::errors::ParseResult;
use elyze::scanner::Scanner;

/// A visitor pattern.
///
/// # Type parameters
///
/// * `'a` - The lifetime of the data to visit.
/// * `T` - The type of the data to visit.
pub trait Visitor<'a, T>: Sized {
    /// Try to accept the `Scanner` and return the result of the visit.
    ///
    /// # Arguments
    ///
    /// * `scanner` - The scanner to accept.
    ///
    /// # Returns
    ///
    /// The result of the visit.
    fn accept(scanner: &mut Scanner<'a, T>) -> ParseResult<Self>;
}

It defines a single method, accept, that takes a Scanner as argument and returns a ParseResult of the visitor.

When the scanner data is accepted, the visitor returns an instance of itself.

The Visitor is generic over the type of the data to visit, and its lifetime corresponds to the lifetime of the data visited.

Accepting data

To accept scanner data, the object must implement the Visitor trait.

Let's say that you have three Recognizable objects:

Hello : recognize the word "hello"
Space : recognize the space character
World : recognize the word "world"

You want your HelloWorld structure to recognize the sentence "hello world" and return a HelloWorld object.

You have to implement the Visitor trait for the HelloWorld structure.

extern crate elyze;
use elyze::errors::ParseResult;
use elyze::matcher::Match;
use elyze::recognizer::Recognizable;
use elyze::scanner::Scanner;
use elyze::visitor::Visitor;

// define a structure to implement the `Visitor` trait
#[derive(Debug)]
struct HelloWorld;

// implement the `Visitor` trait
impl<'a> Visitor<'a, u8> for HelloWorld {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        recognize(Hello, scanner)?; // recognize the word "hello"
        recognize(Space, scanner)?; // recognize the space character
        recognize(World, scanner)?; // recognize the word "world"
        // return the `HelloWorld` object
        Ok(HelloWorld)
    }
}

fn main() {
    let data = b"hello world";
    let mut scanner = Scanner::new(data);
    // Use the accept method on HelloWorld
    let result = HelloWorld::accept(&mut scanner);
    println!("{:?}", result); // Ok(HelloWorld)
}

Accept from Recognizable

If your Recognizable data implements the Default trait, you automatically get a Visitor implementation.

This is done by another blanket implementation.

extern crate elyze;
/// Allow a `Recognizable` to be used as a `Visitor`.
///
/// # Type Parameters
///
/// * `T` - The type of the data to scan.
/// * `'a` - The lifetime of the data to scan.
/// * `R` - The `Recognizable` type, which must also implement `Default`.
///
impl<'a, T, R: Recognizable<'a, T, R> + Default> Visitor<'a, T> for R {
    fn accept(scanner: &mut Scanner<'a, T>) -> ParseResult<Self> {
        recognize(R::default(), scanner)?;
        Ok(R::default())
    }
}

This makes moving from the Recognizable world to the Visitor world trivial.

extern crate elyze;

#[derive(Default)]
struct Hello; // implement `Match` and `Default`

#[derive(Default)]
struct Space; // implement `Match` and `Default`

#[derive(Default)]
struct World; // implement `Match` and `Default`

impl<'a> Visitor<'a, u8> for HelloWorld {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        Hello::accept(scanner)?; // accept the word "hello"
        Space::accept(scanner)?; // accept the space character
        World::accept(scanner)?; // accept the word "world"
        // return the `HelloWorld` object
        Ok(HelloWorld)
    }
}

fn main() {
    let data = b"hello world";
    let mut scanner = elyze::scanner::Scanner::new(data);
    let result = HelloWorld::accept(&mut scanner);
    println!("{:?}", result); // Ok(HelloWorld)
}

One of the major differences between the Visitor and the Recognizable is the fact that the Visitor can call other Visitor objects.

This allows creating more complex parsing trees and thus more complex parsers.
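
To illustrate how visitors can nest, the sketch below builds a `Pair` visitor out of two `Digit` visitors and a `Comma` visitor, and keeps the parsed values in the result. All names here (`MiniScanner`, `Digit`, `Comma`, `Pair`, `next_byte`) are hypothetical, and the `Visitor` trait is a simplified, byte-only copy of the one shown earlier so that the example is self-contained.

```rust
#[derive(Debug, PartialEq)]
pub enum ParseError {
    UnexpectedEndOfInput,
    UnexpectedToken,
}

pub type ParseResult<T> = Result<T, ParseError>;

/// Hypothetical byte-only stand-in for the book's `Scanner`.
pub struct MiniScanner<'a> {
    data: &'a [u8],
    position: usize,
}

impl<'a> MiniScanner<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        MiniScanner { data, position: 0 }
    }

    /// Consume and return the next byte, or fail at the end of the input.
    pub fn next_byte(&mut self) -> ParseResult<u8> {
        let byte = *self
            .data
            .get(self.position)
            .ok_or(ParseError::UnexpectedEndOfInput)?;
        self.position += 1;
        Ok(byte)
    }
}

/// Simplified copy of the `Visitor` trait, fixed to bytes.
pub trait Visitor<'a>: Sized {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self>;
}

#[derive(Debug, PartialEq)]
pub struct Digit(pub u8);

impl<'a> Visitor<'a> for Digit {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self> {
        match scanner.next_byte()? {
            b @ b'0'..=b'9' => Ok(Digit(b - b'0')),
            _ => Err(ParseError::UnexpectedToken),
        }
    }
}

pub struct Comma;

impl<'a> Visitor<'a> for Comma {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self> {
        match scanner.next_byte()? {
            b',' => Ok(Comma),
            _ => Err(ParseError::UnexpectedToken),
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct Pair(pub Digit, pub Digit);

impl<'a> Visitor<'a> for Pair {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self> {
        // a visitor built purely from other visitors
        let left = Digit::accept(scanner)?;
        Comma::accept(scanner)?;
        let right = Digit::accept(scanner)?;
        Ok(Pair(left, right))
    }
}
```

Note that this sketch does not rewind the scanner when a nested visitor fails; a caller that wants to try an alternative would have to restore the position itself.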

Peeking data

Sometimes you want to look at the next data without consuming it.

For example, you have to match the start of a parenthesis-delimited expression, and you want to check whether one of the next characters is a ).

If so, you want the contents of the parenthesis to be consumed.

5 * ( 1 + 2 )
    ^
    you're here

To do that, you have to advance the scanner and check at each step whether the scanner matches the close parenthesis.

Then you get the slice of the data between the open parenthesis and the close parenthesis.

 1 + 2

Peekable trait

Elyze uses the Peekable trait to define peekable data. Peekable data stands for data that you can look at without consuming it.

extern crate elyze;

pub trait Peekable<'a, T> {
    /// Attempt to match the `Peekable` against the current position of the
    /// `Scanner`.
    ///
    /// This method will temporarily advance the position of the `Scanner` to
    /// find a match. If a match is found, the `Scanner` is rewound to the
    /// original position and a `PeekResult` is returned. If no match is found,
    /// the `Scanner` is rewound to the original position and an `Err` is
    /// returned.
    ///
    /// # Arguments
    ///
    /// * `data` - The `Scanner` to use when matching.
    ///
    /// # Returns
    ///
    /// A `PeekResult` if the `Peekable` matches the current position of the
    /// `Scanner`, or an `Err` otherwise.
    fn peek(&self, data: &Scanner<'a, T>) -> ParseResult<PeekResult>;
}

This trait defines a single peek method.

It leaves the scanner unchanged and returns a PeekResult.

The PeekResult is wrapped in a ParseResult because peeking can fail while recognizing or accepting the data. The error is propagated and left for the caller to handle.

PeekResult

The PeekResult itself is an enumeration.

extern crate elyze;
pub enum PeekResult {
    /// The match was successful.
    Found {
        // The last index of the end slice
        end_slice: usize,
        // The size of the start element
        start_element_size: usize,
        // The size of the end element
        end_element_size: usize,
    },
    /// The match was unsuccessful.
    NotFound,
}

In its Found variant it embeds the last index of the end slice, the size of the start element and the size of the end element.

Example

Let's implement a Match for the closing parenthesis.

extern crate elyze;
use elyze::matcher::Match;

struct CloseParentheses;

impl Match<u8> for CloseParentheses {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `first()` returns `None` on empty input instead of panicking like `data[0]`
        if data.first() == Some(&b')') {
            (true, 1)
        } else {
            (false, 0)
        }
    }

    fn size(&self) -> usize {
        1
    }
}

Then define a type that will implement the Peekable trait.

struct ParenthesesGroup;

Then implement the Peekable trait for it:

impl<'a> Peekable<'a, u8> for ParenthesesGroup {
    fn peek(&self, scanner: &Scanner<'a, u8>) -> ParseResult<PeekResult> {
        // create an internal scanner so we can peek without altering the original scanner
        let mut inner_scanner = Scanner::new(scanner.remaining());

        // loop on each byte until we find a close parenthesis
        loop {
            if inner_scanner.is_empty() {
                // we have reached the end without finding a close parenthesis
                break;
            }
            if CloseParentheses.recognize(&mut inner_scanner)?.is_some() {
                // we have found a close parenthesis
                return Ok(PeekResult::Found {
                    // the position just after the close parenthesis,
                    // because `recognize` has already consumed it
                    end_slice: inner_scanner.current_position(),
                    // our peeking doesn't include a start element
                    start_element_size: 0,
                    // the size of the end element is a close parenthesis of 1 byte
                    end_element_size: 1,
                });
            }

            // consume the current byte
            inner_scanner.bump_by(1);
        }

        // At this point, we have reached the end of available data without finding a close parenthesis
        Ok(PeekResult::NotFound)
    }
}

This implementation is not perfect: it stops at the first close parenthesis, so it does not handle nested parentheses, where several close parentheses follow one another.

But it is enough to demonstrate the concept.

extern crate elyze;

fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let mut scanner = Scanner::new(data);
    scanner.bump_by(5); // consumes : 7 * (
    let result = ParenthesesGroup.peek(&scanner)?;
    if let PeekResult::Found {
        end_slice,
        end_element_size,
        ..
    } = result
    {
        println!(
            "{:?}",
            // to find the real size of the enclosed data, we need to subtract the size of the end element
            String::from_utf8_lossy(&scanner.remaining()[..end_slice - end_element_size]) // 1 + 2
        );
    } else {
        println!("not found");
    }
    println!(
        "scanner: {:?}",
        // the scanner itself remains unchanged
        String::from_utf8_lossy(scanner.remaining()) // scanner: " 1 + 2 )"
    );
    Ok(())
}

Peeking

To hold the result of a successful peek, Elyze defines a structure called Peeking.

#![allow(unused)]
fn main() {
pub struct Peeking<'a, T> {
    /// The start of the match.
    pub start_element_size: usize,
    /// The end of the match.
    pub end_element_size: usize,
    /// The length of peeked slice.
    pub end_slice: usize,
    /// The data that was peeked.
    pub data: &'a [T],
}
}

As you can see, the Peeking struct carries the fields of PeekResult::Found together with the data slice.
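To make the relation between the three sizes and the data slice concrete, here is a minimal, self-contained sketch. `PeekedSketch` and its `peeked_slice` method are hypothetical stand-ins written for this page, and the arithmetic is an assumption derived from the field descriptions, not Elyze's actual implementation.

```rust
/// Hypothetical stand-in for `Peeking`, specialized to bytes.
struct PeekedSketch<'a> {
    start_element_size: usize,
    end_element_size: usize,
    end_slice: usize,
    data: &'a [u8],
}

impl<'a> PeekedSketch<'a> {
    /// Assumed arithmetic: skip the start element at the front and
    /// drop the end element at the back of the peeked range.
    fn peeked_slice(&self) -> &'a [u8] {
        &self.data[self.start_element_size..self.end_slice - self.end_element_size]
    }
}

fn main() {
    // what remains after consuming "7 * (" from "7 * ( 1 + 2 )"
    let remaining: &[u8] = b" 1 + 2 )";
    let peeked = PeekedSketch {
        start_element_size: 0, // no start element in this peek
        end_element_size: 1,   // the close parenthesis
        end_slice: 8,          // position just after the close parenthesis
        data: remaining,
    };
    assert_eq!(peeked.peeked_slice(), &b" 1 + 2 "[..]);
}
```

Keeping the sizes next to the data means the caller never has to repeat the subtraction by hand.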

peek method

This Peeking is returned by the peek function.

#![allow(unused)]
fn main() {
extern crate elyze;

/// Attempt to match a `Peekable` against the current position of a `Scanner`.
///
/// This function will temporarily advance the position of the `Scanner` to find
/// a match. If a match is found, the `Scanner` is rewound to the original
/// position and a `Peeking` is returned. If no match is found, the `Scanner` is
/// rewound to the original position and an `Err` is returned.
///
/// # Arguments
///
/// * `peekable` - The `Peekable` to attempt to match.
/// * `scanner` - The `Scanner` to use when matching.
///
/// # Returns
///
/// A `Peeking` if the `Peekable` matches the current position of the `Scanner`,
/// or an `Err` otherwise.
pub fn peek<'a, T, P: Peekable<'a, T>>(
    peekable: P,
    scanner: &Scanner<'a, T>,
) -> ParseResult<Option<Peeking<'a, T>>>;
}

It is a shorthand for calling the Peekable::peek method directly.

It takes care of the slice arithmetic for you.

extern crate elyze;
fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let mut scanner = Scanner::new(data);
    scanner.bump_by(5); // consumes : 7 * (
    
    // use peek method instead of ParenthesesGroup.peek
    let result = peek(ParenthesesGroup, &scanner)?;
    if let Some(peeking) = result {
        println!(
            "{:?}",
            // the peeked_slice method returns the recognized data without the end element
            String::from_utf8_lossy(peeking.peeked_slice()) // 1 + 2
        );
    } else {
        println!("not found");
    }
    println!(
        "scanner: {:?}",
        // the scanner itself remains unchanged
        String::from_utf8_lossy(scanner.remaining()) // scanner: " 1 + 2 )"
    );
    Ok(())
}

Components

Components are built on top of the basic concepts and allow more complex parsing behaviors.

Tokens : Basic element to recognize in your parse.
Recognizer : Recognizes one of several alternative Recognizable elements.
Acceptor : Accepts one of several alternative Visitors.
Peeker : Peeks one of several alternative Peekables.
Peek from Visitor : Explains how to turn a Visitor into a Peekable.
Last : Gets the last occurrence of a Peekable in the data.
Separated List : A separated list is a process that accepts a list of elements separated by a separator.

Tokens

Based on the Recognizable trait, a token is the atomic element you want to recognize in your parse.

The idea behind the token recognition is to create a union type that will allow you to recognize any token in the same way.

In Rust this union is materialized with an enumeration.

All you need to do is to implement the Match trait on your enumeration. And then you can use the recognize function to recognize the token variant.

extern crate elyze;
use elyze::errors::ParseResult;
use elyze::matcher::Match;
use elyze::recognizer::recognize;

// define a matching function
fn match_char(c: char, data: &[u8]) -> (bool, usize) {
    match data.get(0) {
        Some(&d) => (d == c as u8, 1),
        None => (false, 0),
    }
}

// create an enumeration of tokens to recognize
enum Token {
    Plus,
    Minus,
    Star,
    Slash,
    LParen,
    RParen,
}

// implement the `Match` trait for the `Token` enum
impl Match<u8> for Token {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        match self {
            Token::Plus => match_char('+', data),
            Token::Minus => match_char('-', data),
            Token::Star => match_char('*', data),
            Token::Slash => match_char('/', data),
            Token::LParen => match_char('(', data),
            Token::RParen => match_char(')', data),
        }
    }

    fn size(&self) -> usize {
        match self {
            Token::Plus => 1,
            Token::Minus => 1,
            Token::Star => 1,
            Token::Slash => 1,
            Token::LParen => 1,
            Token::RParen => 1,
        }
    }
}

// Profit !
fn main() -> ParseResult<()> {
    let data = b"((+-)*/)end";
    let mut scanner = elyze::scanner::Scanner::new(data);
    recognize(Token::LParen, &mut scanner)?;
    recognize(Token::LParen, &mut scanner)?;
    recognize(Token::Plus, &mut scanner)?;
    recognize(Token::Minus, &mut scanner)?;
    recognize(Token::RParen, &mut scanner)?;
    recognize(Token::Star, &mut scanner)?;
    recognize(Token::Slash, &mut scanner)?;
    recognize(Token::RParen, &mut scanner)?;

    print!("{:?}", String::from_utf8_lossy(scanner.remaining())); // "end"

    Ok(())
}

Recognizer

The Recognizer addresses the case where the pattern to match can vary.

That's the case with numeric operators like + or - for example.

The Recognizer works by trying each Recognizable one by one until a match is found. If no pattern matches, it returns a None value.

The Recognizer takes a mutable reference to the Scanner as a parameter.

Then a chain of try_or methods is used to add Recognizable objects to the Recognizer.

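extern crate elyze;
use elyze::errors::ParseResult;
use elyze::recognizer::Recognizer;
use elyze::scanner::Scanner;

// `Operator` is the token enumeration defined in step 2 below
fn main() -> ParseResult<()> {
    let data = b"+";
    let mut scanner = Scanner::new(data);
    // `_result` is an Option<Operator>
    let _result = Recognizer::new(&mut scanner)
        .try_or(Operator::Add)?
        .try_or(Operator::Sub)?
        .finish();
    Ok(())
}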

Step by step

The Recognizer works in 3 steps:

Initializing with a scanner
Add a Recognizable to the Recognizer
Call the finish method to get the result

Step 1: Initializing with a scanner

extern crate elyze;
use elyze::recognizer::Recognizer;
use elyze::scanner::Scanner;

fn main() {
    let data = b"+";
    let mut scanner = Scanner::new(data);
    let result = Recognizer::<u8, Operator>::new(&mut scanner);
}

The Recognizer is defined as follows:

#![allow(unused)]
fn main() {
extern crate elyze;
/// A `Recognizer` is a type that wraps a `Scanner` and holds a successfully
/// recognized value.
///
/// When a value is successfully recognized, the `Recognizer` stores the value in
/// its `data` field and returns itself. If a value is not recognized, the
/// `Recognizer` rewinds the scanner to the previous position and returns itself.
///
/// # Type Parameters
///
/// * `T` - The type of the data to scan.
/// * `R` - The type of the value to recognize.
/// * `'a` - The lifetime of the data to scan.
/// * `'b` - The lifetime of the `Scanner`.
pub struct Recognizer<'a, 'b, T, R> {
    data: Option<R>,
    scanner: &'b mut Scanner<'a, T>,
}
}

That's why the types T and R must be known when calling new: you can spell them out as above, or let the compiler infer them from the try_or calls that follow.

Step 2: Add a `Recognizable` to the `Recognizer`

Once the Recognizer is initialized, you can add one or more Recognizable to it.

To do so, the Recognizer provides the try_or method.

#![allow(unused)]
fn main() {
extern crate elyze;
impl<'a, 'b, T, R: Recognizable<'a, T, R>> Recognizer<'a, 'b, T, R> {
    pub fn try_or(mut self, element: R) -> ParseResult<Self>;
}
}

It takes an R value, where R implements the Recognizable trait, and returns a ParseResult containing the Recognizer.

You can add as many Recognizable as you want. But because of Rust's type system, R must be the same type at each call. That's why using an enumeration of tokens is a good idea.

Here are my tokens:

#![allow(unused)]
fn main() {
#[derive(Debug)]
enum Operator {
    Add,
    Sub,
}
}

It implements Match<u8>
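For reference, such an implementation could look like the following minimal sketch. So that it compiles on its own, it declares a local `Match` trait that mirrors the shape of the trait used throughout this chapter; that local trait is a stand-in assumed for illustration, and in a real project you would import the trait from Elyze instead.

```rust
/// Local stand-in mirroring the shape of the `Match` trait used in this chapter.
trait Match<T> {
    fn is_matching(&self, data: &[T]) -> (bool, usize);
    fn size(&self) -> usize;
}

#[derive(Debug)]
enum Operator {
    Add,
    Sub,
}

impl Match<u8> for Operator {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // pick the byte expected for this variant
        let expected = match self {
            Operator::Add => b'+',
            Operator::Sub => b'-',
        };
        // `first()` returns `None` on empty input instead of panicking
        match data.first() {
            Some(&byte) if byte == expected => (true, 1),
            _ => (false, 0),
        }
    }

    fn size(&self) -> usize {
        1
    }
}

fn main() {
    assert_eq!(Operator::Add.is_matching(b"+ 1"), (true, 1));
    assert_eq!(Operator::Sub.is_matching(b"+ 1"), (false, 0));
    assert_eq!(Operator::Add.is_matching(b""), (false, 0));
    assert_eq!(Operator::Sub.size(), 1);
}
```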

extern crate elyze;

fn main() -> ParseResult<()> {
    let data = b"+";
    let mut scanner = Scanner::new(data);
    // Initialize the recognizer (type can be inferred)
    let recognizer = Recognizer::new(&mut scanner);
    // Try to apply the recognizer on the operator add, if it fails, return an error
    let recognizer_add = recognizer.try_or(Operator::Add)?;
    // Try to apply the recognizer on the operator sub, if it fails, return an error
    let recognizer_add_and_sub = recognizer_add.try_or(Operator::Sub)?;
    
    Ok(())
}

The try_or method works as follows:

It checks the internal state. If the internal state is None:
- It calls the recognize method of the Recognizable object.
- If the recognize method returns Ok(Some(value)), it sets the internal state to Some(value).
- If the recognize method returns Ok(None), it rewinds the scanner to its previous position and leaves the internal state as None.
- If an error occurs, it returns the error.
If the internal state is Some, it returns the Recognizer as is.

Execution is immediate, not lazy: you are not registering an operation but applying it.
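The control flow described above can be sketched without Elyze at all. In the following miniature, `Cursor`, `MiniRecognizer` and `recognize_operator` are hypothetical names invented for this illustration, and a single expected byte plays the role of a Recognizable. It is a sketch of the idea under those assumptions, not the library's code.

```rust
/// Minimal cursor standing in for `Scanner`.
struct Cursor<'a> {
    data: &'a [u8],
    position: usize,
}

/// Hypothetical miniature of `Recognizer`: each alternative is one expected byte.
struct MiniRecognizer<'a, 'b> {
    found: Option<u8>,
    cursor: &'b mut Cursor<'a>,
}

impl<'a, 'b> MiniRecognizer<'a, 'b> {
    fn new(cursor: &'b mut Cursor<'a>) -> Self {
        MiniRecognizer { found: None, cursor }
    }

    /// Applied immediately: nothing is queued for later.
    /// This sketch never fails; the `Result` only mirrors the chained `?` style.
    fn try_or(mut self, expected: u8) -> Result<Self, String> {
        // a previous alternative already matched: return as is
        if self.found.is_some() {
            return Ok(self);
        }
        if self.cursor.data.get(self.cursor.position) == Some(&expected) {
            // success: store the value and consume the byte
            self.found = Some(expected);
            self.cursor.position += 1;
        }
        // on a miss the cursor was never moved, so there is nothing to rewind
        Ok(self)
    }

    fn finish(self) -> Option<u8> {
        self.found
    }
}

/// Try '+' then '-', returning the recognized byte and the final cursor position.
fn recognize_operator(input: &[u8]) -> Result<(Option<u8>, usize), String> {
    let mut cursor = Cursor { data: input, position: 0 };
    let found = MiniRecognizer::new(&mut cursor)
        .try_or(b'+')?
        .try_or(b'-')?
        .finish();
    Ok((found, cursor.position))
}

fn main() -> Result<(), String> {
    // '-' is recognized by the second alternative and consumed
    assert_eq!(recognize_operator(b"-1")?, (Some(b'-'), 1));
    // nothing matches: the state stays `None` and the cursor does not move
    assert_eq!(recognize_operator(b"x")?, (None, 0));
    Ok(())
}
```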

Step 3: Call the `finish` method to get the result

Once all the Recognizable have been added to the Recognizer, you can call the finish method to get the result.

#![allow(unused)]
fn main() {
extern crate elyze;
impl<'a, 'b, T, R: Recognizable<'a, T, R>> Recognizer<'a, 'b, T, R> {
    pub fn finish(self) -> Option<R>;
}
}

This step can't fail. It returns the internal state of the Recognizer.

extern crate elyze;
fn main() -> ParseResult<()> {
    let data = b"+";
    let mut scanner = Scanner::new(data);
    // Initialize the recognizer
    let recognizer = Recognizer::new(&mut scanner);
    // Try to apply the recognizer on the operator add, if it fails, return an error
    let recognizer_add = recognizer.try_or(Operator::Add)?;
    // Try to apply the recognizer on the operator sub, if it fails, return an error
    let recognizer_add_and_sub = recognizer_add.try_or(Operator::Sub)?;
    // Finish the recognizer
    let result = recognizer_add_and_sub.finish();
    dbg!(result); // Some(Operator::Add)
    Ok(())
}

Recognizer to Visitor

If you want to reuse a Recognizer chain, you can wrap it in a Visitor.

#![allow(unused)]
fn main() {
extern crate elyze;
#[derive(Debug)]
// Define a structure to implement the `Visitor` trait
struct OperatorData(Operator);

impl<'a> Visitor<'a, u8> for OperatorData {
  fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
    // Build and apply the recognizer
    let operator = Recognizer::new(scanner)
            .try_or(Operator::Add)?
            .try_or(Operator::Sub)?
            .finish()
            // If the recognizer fails, return an error
            .ok_or(ParseError::UnexpectedToken)?;

    Ok(OperatorData(operator))
  }
}
}

The behavior is now reusable.

extern crate elyze;
fn main() -> ParseResult<()> {
    let data = b"+";
    let mut scanner = Scanner::new(data);
    // Accept an operator through the visitor
    let result = OperatorData::accept(&mut scanner)?.0;
    dbg!(result); // Operator::Add

    let data = b"-";
    let mut scanner = Scanner::new(data);
    // Accept an operator through the visitor
    let result = OperatorData::accept(&mut scanner)?.0;
    dbg!(result); // Operator::Sub

    let data = b"x";
    let mut scanner = Scanner::new(data);
    // No operator matches, so the visitor returns an error
    let result = OperatorData::accept(&mut scanner);
    dbg!(result); // Err(UnexpectedToken)

    Ok(())
}

Recommendations

Because the first match is returned, you should add the most specific patterns first.

Example:

If you match "hell" and "hello", you should add "hello" first. Otherwise, the recognizer will always return "hell".
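The effect of the registration order can be reproduced with plain string prefixes. The helper `first_match` below is a hypothetical function written for this illustration; it only imitates the "first match wins" rule and is not part of Elyze.

```rust
/// Return the first pattern, in registration order, that is a prefix of `data`.
fn first_match<'p>(patterns: &[&'p str], data: &str) -> Option<&'p str> {
    patterns
        .iter()
        .copied()
        .find(|pattern| data.starts_with(*pattern))
}

fn main() {
    let data = "hello world";
    // "hell" is registered first and shadows "hello"
    assert_eq!(first_match(&["hell", "hello"], data), Some("hell"));
    // most specific pattern first: "hello" wins
    assert_eq!(first_match(&["hello", "hell"], data), Some("hello"));
}
```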

Acceptor

The Acceptor is much the same as the Recognizer, but instead of taking Recognizable objects, it takes Visitor objects.

Let's define two visitors.

#![allow(unused)]
fn main() {
extern crate elyze;
#[derive(Default, Debug)]
struct OperatorPlus;
#[derive(Default, Debug)]
struct OperatorMinus;

impl Match<u8> for OperatorPlus {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        match_pattern(b"+", data)
    }
    fn size(&self) -> usize {
        1
    }
}
impl Match<u8> for OperatorMinus {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        match_pattern(b"-", data)
    }
    fn size(&self) -> usize {
        1
    }
}
}

And another, more complex one.

#![allow(unused)]
fn main() {
#[derive(Default)]
struct Hello;
#[derive(Default)]
struct Space;
#[derive(Default)]
struct World;

impl Match<u8> for Hello {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `starts_with` does not panic when fewer than 5 bytes remain
        (data.starts_with(b"hello"), 5)
    }

    fn size(&self) -> usize {
        5
    }
}

impl Match<u8> for Space {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `first()` does not panic on empty data
        (data.first() == Some(&b' '), 1)
    }

    fn size(&self) -> usize {
        1
    }
}

impl Match<u8> for World {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `starts_with` does not panic when fewer than 5 bytes remain
        (data.starts_with(b"world"), 5)
    }

    fn size(&self) -> usize {
        5
    }
}

// define a structure to implement the `Visitor` trait
#[derive(Debug)]
struct HelloWorld;

impl<'a> Visitor<'a, u8> for HelloWorld {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        Hello::accept(scanner)?; // accept the word "hello"
        Space::accept(scanner)?; // accept the space character
        World::accept(scanner)?; // accept the word "world"
        // return the `HelloWorld` object
        Ok(HelloWorld)
    }
}
}

We now have three visitors: OperatorPlus, OperatorMinus and HelloWorld.

Because every Acceptor result must have the same type, we wrap them in an enumeration.

#![allow(unused)]
fn main() {
extern crate elyze;
#[derive(Debug)]
enum Operator {
    Plus(OperatorPlus),
    Minus(OperatorMinus),
    HelloWorld(HelloWorld),
}
}

Then we can use it.

extern crate elyze;
use elyze::acceptor::Acceptor;
fn main() -> ParseResult<()> {
    let data = b"+ 2";
    let mut scanner = Scanner::new(data);
    let accepted = Acceptor::new(&mut scanner)
        .try_or(Operator::Plus)?
        .try_or(Operator::HelloWorld)?
        .try_or(Operator::Minus)?
        .finish()
        .ok_or(ParseError::UnexpectedToken)?;

    println!("{:?}", accepted); // +

    let data = b"- 2";
    let mut scanner = Scanner::new(data);
    let accepted = Acceptor::new(&mut scanner)
        .try_or(Operator::Plus)?
        .try_or(Operator::HelloWorld)?
        .try_or(Operator::Minus)?
        .finish()
        .ok_or(ParseError::UnexpectedToken)?;

    println!("{:?}", accepted); // -

    let data = b"hello world 2";
    let mut scanner = Scanner::new(data);
    let accepted = Acceptor::new(&mut scanner)
        .try_or(Operator::Plus)?
        .try_or(Operator::HelloWorld)?
        .try_or(Operator::Minus)?
        .finish()
        .ok_or(ParseError::UnexpectedToken)?;

    println!("{:?}", accepted); // HelloWorld

    Ok(())
}

Reusable Acceptor

Like the Recognizer, an Acceptor can be embedded in a Visitor.

#![allow(unused)]
fn main() {
extern crate elyze;

impl<'a> Visitor<'a, u8> for Operator {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        Acceptor::new(scanner)
            .try_or(Operator::Plus)?
            .try_or(Operator::HelloWorld)?
            .try_or(Operator::Minus)?
            .finish()
            .ok_or(ParseError::UnexpectedToken)
    }
}
}

Which simplifies the code.

extern crate elyze;
fn main() -> ParseResult<()> {
    let data = b"+ 2";
    let mut scanner = Scanner::new(data);
    let accepted = Operator::accept(&mut scanner)?;

    println!("{:?}", accepted); // +

    let data = b"- 2";
    let mut scanner = Scanner::new(data);
    let accepted = Operator::accept(&mut scanner)?;

    println!("{:?}", accepted); // -

    let data = b"hello world 2";
    let mut scanner = Scanner::new(data);
    let accepted = Operator::accept(&mut scanner)?;

    println!("{:?}", accepted); // HelloWorld

    Ok(())
}

Peeker

The Peeker is like the Acceptor but it doesn't consume the data.

The other difference is that the Peeker returns a Peeking so the constraint to have a homogeneous data type isn't required.

This implies that any kind of Visitor can be used as Peekable.

Peeking a variation of elements is more complex than accepting it. You want the shortest possible data slice.

7 * ( 1 + 2 )

Suppose you are peeking for "*" or "+".

If the first registered Peekable that matches won, the result would depend on the order of registration:

* -> + : 7
+ -> * : 7 * ( 1

We want the peek to always return the shortest possible data slice, here 7, independently of the order of registration.

An order that happens to work for one input may give the wrong result for another, such as this slice:

( 1 + 2 ) * 7

To achieve this, we register all the Peekables first, then run the peek for each of them. If the peeked length is shorter than the one held in the internal state, the state is replaced. Otherwise, the internal state is left unchanged.
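The "keep the shortest" rule can be sketched on raw bytes. The functions `peek_until` and `shortest_peek` below are hypothetical helpers made up for this illustration, with a single byte standing in for each Peekable; the real Peeker works on Peekable values and returns a Peeking.

```rust
/// Position of the first occurrence of `needle` in `data`, if any.
fn peek_until(data: &[u8], needle: u8) -> Option<usize> {
    data.iter().position(|&byte| byte == needle)
}

/// Run every registered peek and keep the shortest slice found.
fn shortest_peek<'a>(data: &'a [u8], needles: &[u8]) -> Option<&'a [u8]> {
    let mut best: Option<usize> = None;
    for &needle in needles {
        if let Some(length) = peek_until(data, needle) {
            // replace the state only when the new slice is shorter
            if best.map_or(true, |current| length < current) {
                best = Some(length);
            }
        }
    }
    best.map(|length| &data[..length])
}

fn main() {
    let data = b"7 * ( 1 + 2 )";
    // the result does not depend on the order of registration
    assert_eq!(shortest_peek(data, b"*+"), Some(&b"7 "[..]));
    assert_eq!(shortest_peek(data, b"+*"), Some(&b"7 "[..]));
    // here "+" comes first in the data, so it defines the shortest slice
    assert_eq!(shortest_peek(b"( 1 + 2 ) * 7", b"*+"), Some(&b"( 1 "[..]));
    // no registered element is present
    assert_eq!(shortest_peek(data, b"/"), None);
}
```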

Example

Let's take this selection of tokens:

#![allow(unused)]
fn main() {
enum OperatorTokens {
    Plus,
    Times,
}
}

OperatorTokens implements Visitor, which covers both OperatorTokens::Plus and OperatorTokens::Times.

You can then wrap it in Until to create a Peekable variant.

extern crate elyze;

fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let scanner = Scanner::new(data);
    // create a peeker with a scanner
    let slice = Peeker::new(&scanner)
        // register the peekable until `OperatorTokens::Times`
        .add_peekable(Until::new(OperatorTokens::Times))
        // register the peekable until `OperatorTokens::Plus`
        .add_peekable(Until::new(OperatorTokens::Plus))
        // peek the scanner for the first `OperatorTokens`
        .peek()?;
    // if found
    if let Some(slice) = slice {
        // the slice is the shortest possible
        println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "7 "
    }
    Ok(())
}

If we want something more reusable, we can create a struct that implements Peekable and put it inside the Peeker.

#![allow(unused)]
fn main() {
extern crate elyze;

// define a struct that implements `Peekable`
struct FirstOperator;

// implement `Peekable` for `FirstOperator`
impl<'a> Peekable<'a, u8> for FirstOperator {
    fn peek(&self, scanner: &Scanner<'a, u8>) -> ParseResult<PeekResult> {
        Peeker::new(scanner)
            .add_peekable(Until::new(OperatorTokens::Plus))
            .add_peekable(Until::new(OperatorTokens::Times))
            .peek()
            // convert the `Peeking` into a `PeekResult`    
            .map(Into::into)
    }
}
}

Which simplifies the code.

extern crate elyze;
fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let scanner = Scanner::new(data);
    let slice = peek(FirstOperator, &scanner)?;
    if let Some(slice) = slice {
        println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "7 "
    }

    let data = b"7 * ( 1 + 2 )";
    let scanner = Scanner::new(data);
    let slice = peek(FirstOperator, &scanner)?;
    if let Some(slice) = slice {
        println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "7 "
    }

    let data = b"1 + 2 * 7";
    let scanner = Scanner::new(data);
    let slice = peek(FirstOperator, &scanner)?;
    if let Some(slice) = slice {
        println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "1 "
    }

    Ok(())
}

Peek from Visitor

It can be useful to peek using a Visitor.

You can implement the Peekable trait yourself, or you can let Elyze do it for you.

We just need to tell Elyze that we want it to implement the Peekable trait using the Visitor pattern.

#![allow(unused)]
fn main() {
extern crate elyze;
use elyze::peek::{DefaultPeekableImplementation, PeekableImplementation};

impl PeekableImplementation for CloseParentheses {
    type Type = DefaultPeekableImplementation;
}
}

This will automatically enable the Peekable trait for CloseParentheses.

extern crate elyze;
use elyze::errors::ParseResult;
use elyze::matcher::Match;
use elyze::peek::{peek, Until};
use elyze::scanner::Scanner;

// Deriving Default makes the `CloseParentheses` structure
// implement the `Visitor` trait
#[derive(Default)]
struct CloseParentheses;

// implementing the `Match` trait needed by Until
impl Match<u8> for CloseParentheses {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `first()` avoids a panic when `data` is empty
        if data.first() == Some(&b')') {
            (true, 1)
        } else {
            (false, 0)
        }
    }

    fn size(&self) -> usize {
        1
    }
}

/// Activate the default implementation of Peekable for CloseParentheses
impl PeekableImplementation for CloseParentheses {
    type Type = DefaultPeekableImplementation;
}

fn main() -> ParseResult<()> {
    let data = b"( 7 * ( 1 + 2 ) )";
    let mut scanner = Scanner::new(data);
    scanner.bump_by(7); // consumes : ( 7 * (
    
    // peek the first ")"
    let result = peek(CloseParentheses, &scanner)?;
    if let Some(peeking) = result {
        println!(
            "{:?}",
            // the peeked_slice method returns the recognized data without the end element
            String::from_utf8_lossy(peeking.peeked_slice()) // 1 + 2
        );
    } else {
        println!("not found");
    }
    println!(
        "scanner: {:?}",
        // the scanner itself remains unchanged
        String::from_utf8_lossy(scanner.remaining()) // scanner: " 1 + 2 ) )"
    );
    Ok(())
}

The peek returns the slice up to the first matching element.

Last

If you want the last occurrence of a Peekable in the data, you can use the Last modifier.

The Last modifier takes a Peekable as argument, so it may be a Visitor.

The code remains the same. But the behavior is different.

Instead of returning at the first peeked element, it will advance an internal scanner and re-apply the peek operation until reaching the end of the data.

The last element peeked is returned.
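The difference between stopping at the first occurrence and re-applying the peek until the end can be sketched on raw bytes. `peek_first` and `peek_last` below are hypothetical helpers written for this illustration, with a single byte standing in for the Peekable; they are not Elyze functions.

```rust
/// Slice up to (excluding) the first occurrence of `needle`.
fn peek_first(data: &[u8], needle: u8) -> Option<&[u8]> {
    data.iter()
        .position(|&byte| byte == needle)
        .map(|index| &data[..index])
}

/// Slice up to (excluding) the last occurrence of `needle`,
/// found by re-applying `peek_first` after each hit.
fn peek_last(data: &[u8], needle: u8) -> Option<&[u8]> {
    let mut offset = 0; // start of the region still to explore
    let mut last = None; // index of the most recent hit
    while let Some(slice) = peek_first(&data[offset..], needle) {
        let hit = offset + slice.len();
        last = Some(hit);
        offset = hit + 1; // step over the hit and peek again
    }
    last.map(|index| &data[..index])
}

fn main() {
    let data = b" 7 * ( 1 + 2 ) )";
    // the plain peek stops at the inner close parenthesis
    assert_eq!(peek_first(data, b')'), Some(&b" 7 * ( 1 + 2 "[..]));
    // the "last" variant reaches the outer close parenthesis
    assert_eq!(peek_last(data, b')'), Some(&b" 7 * ( 1 + 2 ) "[..]));
    assert_eq!(peek_last(b"no parenthesis", b')'), None);
}
```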

extern crate elyze;
#[derive(Default)] // Enable the Visitor implementation for CloseParentheses
struct CloseParentheses;

/// Enable the PeekSize and Recognizable implementation for CloseParentheses
impl Match<u8> for CloseParentheses {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `first()` avoids a panic when `data` is empty
        if data.first() == Some(&b')') {
            (true, 1)
        } else {
            (false, 0)
        }
    }

    fn size(&self) -> usize {
        1
    }
}

/// Activate the default implementation of Peekable for CloseParentheses
impl PeekableImplementation for CloseParentheses {
    type Type = DefaultPeekableImplementation;
}

fn main() -> ParseResult<()> {
    let data = b"8 / ( 7 * ( 1 + 2 ) )";
    let mut scanner = Scanner::new(data);
    // consumes "8 / (" to reach the start of the enclosed data
    scanner.bump_by(b"8 / (".len());
    // because CloseParentheses implements the Peekable trait, we can peek it with the modifier Last
    let result = peek(Last::new(CloseParentheses), &scanner)?;
    if let Some(peeking) = result {
        println!(
            "{:?}",
            // peeked_slice returns all the enclosed data, not just the part before the first ")"
            String::from_utf8_lossy(peeking.peeked_slice()) // " 7 * ( 1 + 2 ) "
        );
    }
    Ok(())
}

Separated list

The SeparatedList component is a list of objects separated by a separator.

It is built from two Visitors: one for the elements of the list and one for the separator.

The SeparatedList tries to accept an element and, on success, pushes it into the list. It then tries to accept the separator, and repeats the process until a failure.

If one of the Visitors (element or separator) fails, it returns the list of elements accepted so far.
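The accept loop can be sketched on raw bytes without any Elyze type. Everything below (`accept_number`, `accept_separator`, `separated_list`, the `~~~` separator and the `String` error) is hypothetical and chosen for this illustration. In particular, treating a missing first element as an empty list and a dangling separator as an error are assumptions of this sketch, the second one following the trailing separator caution given in this chapter.

```rust
/// Accept a run of ASCII digits at `*pos`; advance only on success.
fn accept_number(data: &[u8], pos: &mut usize) -> Option<usize> {
    let digits = data[*pos..]
        .iter()
        .take_while(|byte| byte.is_ascii_digit())
        .count();
    let text = std::str::from_utf8(&data[*pos..*pos + digits]).ok()?;
    let number = text.parse::<usize>().ok()?;
    *pos += digits;
    Some(number)
}

/// Accept the three-tilde separator at `*pos`; advance only on success.
fn accept_separator(data: &[u8], pos: &mut usize) -> bool {
    if data[*pos..].starts_with(b"~~~") {
        *pos += 3;
        true
    } else {
        false
    }
}

/// Element, then (separator, element) pairs until the separator fails.
fn separated_list(data: &[u8]) -> Result<Vec<usize>, String> {
    let mut pos = 0;
    let mut list = Vec::new();
    match accept_number(data, &mut pos) {
        Some(first) => list.push(first),
        // assumption of this sketch: no first element means an empty list
        None => return Ok(list),
    }
    while accept_separator(data, &mut pos) {
        match accept_number(data, &mut pos) {
            Some(number) => list.push(number),
            // a separator must be followed by an element
            None => return Err("expected an element after the separator".to_string()),
        }
    }
    Ok(list)
}

fn main() {
    assert_eq!(separated_list(b"1~~~2~~~3~~~4"), Ok(vec![1, 2, 3, 4]));
    // a dangling separator is rejected
    assert!(separated_list(b"1~~~2~~~").is_err());
}
```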

extern crate elyze;

// Define how to match a number
struct TokenNumber;

impl Match<u8> for TokenNumber {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        let mut pos = 0;
        while pos < data.len() && data[pos].is_ascii_digit() {
            pos += 1;
        }
        (pos > 0, pos)
    }
    fn size(&self) -> usize {
        0
    }
}

#[derive(Debug)]
struct NumberData(usize);

impl<'a> Visitor<'a, u8> for NumberData {
    fn accept(scanner: &mut Scanner<u8>) -> ParseResult<Self> {
        let slice = recognize_slice(TokenNumber, scanner)?;
        let number = std::str::from_utf8(slice)?.parse::<usize>()?;
        Ok(NumberData(number))
    }
}

// Define how to match a separator
#[derive(Debug)]
struct Separator;

impl<'a> Visitor<'a, u8> for Separator {
    fn accept(scanner: &mut Scanner<u8>) -> ParseResult<Self> {
        recognize(Token::Tilde, scanner)?;
        recognize(Token::Tilde, scanner)?;
        recognize(Token::Tilde, scanner)?;
        Ok(Separator)
    }
}

// Apply to a separated list
fn main() -> ParseResult<()> {
    let data = b"1~~~2~~~3~~~4";
    let mut scanner = Scanner::new(data);
    // define the separated list using the type parameters NumberData and Separator
    let result = SeparatedList::<_, NumberData, Separator>::accept(&mut scanner)?
        // the inner list can be extracted with the `data` field
        .data;
    println!("{:?}", result); // [NumberData(1), NumberData(2), NumberData(3), NumberData(4)]
    Ok(())
}

Trailing separator

Caution: The separated list expects an element after successfully accepting the separator.

1~~~2~~~3~~~4~~~

will make the parse fail.

The data must be cleaned up before it is parsed:

1~~~2~~~3~~~4

To do this, Elyze exposes a function called get_scanner_without_trailing_separator.

#![allow(unused)]
fn main() {
extern crate elyze;
/// Return a scanner without the trailing separator.
///
/// # Arguments
///
/// * `element` - The peekable element.
/// * `separator` - The peekable separator.
/// * `scanner` - The scanner.
///
/// # Returns
///
/// A `ParseResult` containing a `Scanner` without the trailing separator.
pub fn get_scanner_without_trailing_separator<'a, T, P1, P2>(
    element: P1,
    separator: P2,
    scanner: &Scanner<'a, T>,
) -> ParseResult<Scanner<'a, T>>
where
    P1: Peekable<'a, T> + PeekableImplementation<Type = DefaultPeekableImplementation>,
    P2: Peekable<'a, T> + PeekableImplementation<Type = DefaultPeekableImplementation>;
}

It takes three arguments: two Peekables, the first for the element and the second for the separator, followed by a reference to a Scanner.

This function returns a Scanner truncated to not include the last separator.
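On plain bytes, the effect of this truncation is comparable to stripping a suffix. The helper `without_trailing_separator` below is a hypothetical, heavily simplified analogue written for this illustration: it takes the separator as a literal byte string, whereas the Elyze function works with Peekable values and returns a Scanner.

```rust
/// Return `data` without a trailing `separator`, if there is one.
fn without_trailing_separator<'a>(data: &'a [u8], separator: &[u8]) -> &'a [u8] {
    // `strip_suffix` returns `None` when `data` does not end with `separator`
    data.strip_suffix(separator).unwrap_or(data)
}

fn main() {
    // the trailing separator is removed
    assert_eq!(
        without_trailing_separator(b"1~~~2~~~3~~~4~~~", b"~~~"),
        &b"1~~~2~~~3~~~4"[..]
    );
    // data without a trailing separator is returned unchanged
    assert_eq!(
        without_trailing_separator(b"1~~~2~~~3~~~4", b"~~~"),
        &b"1~~~2~~~3~~~4"[..]
    );
}
```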

Example

We can create a visitor to use it.

#![allow(unused)]
fn main() {
extern crate elyze;
use elyze::separated_list::get_scanner_without_trailing_separator;

#[derive(Debug)]
struct NumberList {
    data: Vec<usize>,
}

impl<'a> Visitor<'a, u8> for NumberList {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        // get the scanner without the trailing separator
        let mut data_scanner =
            get_scanner_without_trailing_separator(TokenNumber, Separator, &scanner)?;

        // accept the separated list and extract the data
        let data = SeparatedList::<u8, Number<usize>, Separator>::accept(&mut data_scanner)?
            .data
            .into_iter()
            .map(|x| x.0)
            .collect::<Vec<usize>>();
        
        // clean up the scanner because all data has been extracted
        scanner.bump_by(scanner.data().len());

        Ok(NumberList { data })
    }
}       
}

This cleanup allows us to handle all problematic cases.

extern crate elyze;
fn main() -> ParseResult<()> {
    // list of elements separated by a separator
    let data = b"1~~~2~~~3~~~4";
    let mut scanner = Scanner::new(data);

    let result = NumberList::accept(&mut scanner)?;
    println!("{:?}", result); // NumberList { data: [1, 2, 3, 4] }

    // list of elements separated by a separator with trailing separator
    let data = b"1~~~2~~~3~~~4~~~";
    let mut scanner = Scanner::new(data);

    let result = NumberList::accept(&mut scanner)?;
    println!("{:?}", result); // NumberList { data: [1, 2, 3, 4] }

    // list of 1 element with trailing separator
    let data = b"1~~~";
    let mut scanner = Scanner::new(data);

    let result = NumberList::accept(&mut scanner)?;
    println!("{:?}", result); // NumberList { data: [1] }

    // list of 1 element without trailing separator
    let data = b"1";
    let mut scanner = Scanner::new(data);

    let result = NumberList::accept(&mut scanner)?;
    println!("{:?}", result); // NumberList { data: [1] }

    // list of 0 elements
    let data = b"";
    let mut scanner = Scanner::new(data);

    let result = NumberList::accept(&mut scanner)?;
    println!("{:?}", result); // NumberList { data: [] }

    // bad data
    let data = b"bad~~~";
    let mut scanner = Scanner::new(data);

    let result = NumberList::accept(&mut scanner);
    println!("{:?}", result); // Err(UnexpectedToken)

    Ok(())
}

Bytes

Although Elyze is meant to be used with any kind of data slice, you'll probably want to use it with bytes.

There are some built-in components available out of the box to help you parse string or byte data.

Tokens : A collection of well-known patterns, already acceptable and peekable.
Delimited Groups : Delimited groups match a range of bytes between delimiters.

Tokens

There are some patterns that we can recognize on bytes. All common symbols are grouped in a Token enumeration.

#![allow(unused)]
fn main() {
pub enum Token {
    /// The `(` character
    OpenParen,
    /// The `)` character
    CloseParen,
    /// The `,` character
    Comma,
    /// The `;` character
    Semicolon,
    /// The `:` character
    Colon,
    /// The whitespace character
    Whitespace,
    /// The `>` character
    GreaterThan,
    /// The `<` character
    LessThan,
    /// The `!` character
    Exclamation,
    /// The `'` character
    Quote,
    /// The `"` character
    DoubleQuote,
    /// The `=` character
    Equal,
    /// The `+` character
    Plus,
    /// The `-` character
    Dash,
    /// The `/` character
    Slash,
    /// The `*` character
    Star,
    /// The `%` character
    Percent,
    /// The `&` character
    Ampersand,
    /// The `|` character
    Pipe,
    /// The `^` character
    Caret,
    /// The `~` character
    Tilde,
    /// The `.` character
    Dot,
    /// The `?` character
    Question,
    /// The `@` character
    At,
    /// The `#` character
    Hash,
    /// The `$` character
    Dollar,
    /// The `\\` character
    Backslash,
    /// The `_` character
    Underscore,
    /// The `#` character
    Sharp,
    /// The `\n` character
    Ln,
    /// The `\r` character
    Cr,
    /// The `\t` character
    Tab,
    /// The `\r\n` character
    CrLn,
}
}

This enumeration already implements the Match, Recognizable, Visitor and Peekable traits.

extern crate elyze;
use elyze::bytes::token::Token;
use elyze::errors::{ParseError, ParseResult};
use elyze::peek::{peek, Last};
use elyze::recognizer::{recognize, Recognizer};
use elyze::scanner::Scanner;
use elyze::visitor::Visitor;

fn main() -> ParseResult<()> {
    let data = b"+-*";

    // use recognize
    let mut scanner = Scanner::new(data);
    let recognized = recognize(Token::Plus, &mut scanner)?;
    assert_eq!(recognized, Token::Plus);

    // use the recognizer
    let mut scanner = Scanner::new(data);
    let recognized = Recognizer::new(&mut scanner)
        .try_or(Token::Dash)?
        .try_or(Token::Plus)?
        .try_or(Token::Star)?
        .finish()
        .ok_or(ParseError::UnexpectedToken)?;
    assert_eq!(recognized, Token::Plus);

    // use the visitor
    let mut scanner = Scanner::new(data);
    let accepted = Token::accept(&mut scanner)?;
    assert_eq!(accepted, Token::Plus);

    // use peek: the peeked slice is the data located before the first `-`
    let mut scanner = Scanner::new(data);
    let peeked = peek(Token::Dash, &mut scanner)?;
    if let Some(peeked) = peeked {
        assert_eq!(peeked.peeked_slice(), b"+");
    }

    // use Last: the peeked slice is the data located before the last `)`
    let data = b" 8 + ( 7 * ( 1 + 2 ) )";
    let mut scanner = Scanner::new(data);
    let peeked = peek(Last::new(Token::CloseParen), &mut scanner)?;
    if let Some(peeked) = peeked {
        assert_eq!(peeked.peeked_slice(), b" 8 + ( 7 * ( 1 + 2 ) ");
    }

    Ok(())
}
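To make the two peek assertions above easier to picture, here is a minimal sketch in plain Rust that does not use Elyze at all. The helpers `slice_before_first` and `slice_before_last` are hypothetical names introduced for illustration; they only mimic the observable result of peeking a single-byte token, and are not how Elyze implements peek or Last.

```rust
/// Hypothetical helper: slice located before the first occurrence of `byte`,
/// or `None` when the byte is absent.
fn slice_before_first(data: &[u8], byte: u8) -> Option<&[u8]> {
    data.iter().position(|&b| b == byte).map(|idx| &data[..idx])
}

/// Hypothetical helper: slice located before the last occurrence of `byte`,
/// or `None` when the byte is absent.
fn slice_before_last(data: &[u8], byte: u8) -> Option<&[u8]> {
    data.iter().rposition(|&b| b == byte).map(|idx| &data[..idx])
}
```

On the expression used above, searching for the first closing parenthesis stops inside the nested group, while searching for the last one keeps the whole expression up to the final parenthesis.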

Separated List

By combining all these implementations, we can build a separated list of tokens that excludes the comma token.

#![allow(unused)]
fn main() {
extern crate elyze;

// define a structure to implement Peekable
// using the Visitor pattern excluding the comma token
struct AnyTokenExceptComma;

// Enable the Peekable trait using the Visitor pattern
impl PeekableImplementation for AnyTokenExceptComma {
    type Type = DefaultPeekableImplementation;
}

// Define the PeekSize trait
impl PeekSize<u8> for AnyTokenExceptComma {
    fn peek_size(&self) -> usize {
        // The size is not important here, it can default to 0
        0
    }
}

// Define the Visitor trait for the AnyTokenExceptComma structure
// excluding the comma token
impl<'a> Visitor<'a, u8> for AnyTokenExceptComma {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        let token = Token::accept(scanner)?;
        match token {
            Token::Comma => Err(ParseError::UnexpectedToken),
            _ => Ok(AnyTokenExceptComma),
        }
    }
}

// Define a structure to implement Visitor
#[derive(Debug, PartialEq)]
struct TokenData(Token);

impl<'a> Visitor<'a, u8> for TokenData {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        let token = Token::accept(scanner)?;
        match token {
            Token::Comma => Err(ParseError::UnexpectedToken),
            _ => Ok(TokenData(token)),
        }
    }
}

// Define a structure to implement Visitor for the separator
struct SeparatorComma;

impl<'a> Visitor<'a, u8> for SeparatorComma {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        recognize(Token::Comma, scanner)?;
        Ok(SeparatorComma)
    }
}
}

Then we can build the separated list:

extern crate elyze;
fn main() -> ParseResult<()> {
    let data = b"*,-,+,/,";
    let scanner = Scanner::new(data);
    // clean up the data of its trailing comma
    let mut data_scanner =
        get_scanner_without_trailing_separator(AnyTokenExceptComma, Token::Comma, &scanner)?;
    assert_eq!(data_scanner.data(), b"*,-,+,/"); // data without a trailing comma
    // accept the separated list
    let list = SeparatedList::<u8, TokenData, SeparatorComma>::accept(&mut data_scanner)?;
    assert_eq!(
        list.data,
        vec![
            TokenData(Token::Star),
            TokenData(Token::Dash),
            TokenData(Token::Plus),
            TokenData(Token::Slash),
        ]
    );
    Ok(())
}
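The two steps above (drop the trailing separator, then collect the elements) can be sketched with the standard library alone. This is only an illustration of the idea under simplified assumptions: the functions `without_trailing_separator` and `separated_list` are hypothetical, work on single-byte separators, and return raw byte slices instead of visiting typed elements as Elyze does.

```rust
/// Hypothetical helper: drops one trailing `separator` byte, if present.
fn without_trailing_separator(data: &[u8], separator: u8) -> &[u8] {
    data.strip_suffix(&[separator]).unwrap_or(data)
}

/// Hypothetical helper: splits `data` on `separator` and collects the elements.
fn separated_list(data: &[u8], separator: u8) -> Vec<&[u8]> {
    data.split(|&b| b == separator).collect()
}
```

In this sketch, skipping the first step leaves an empty element at the end of the list, which is one way to see why the trailing comma is cleaned up before the list is accepted.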

Delimited Groups

There is a special case of peeking when you want to get data embedded in a delimited group.

For example, you want to get the contents of a parentheses-delimited expression.

( 1 + 2 )
^
you're here

You want the inner data:

 1 + 2

You have to recognize the opening parenthesis, then search for the closing parenthesis. Peeking is the best tool for this.

But you also have to deal with balanced expressions: sometimes you will have nested parentheses-delimited expressions.

( ( 1 + 2 ) + 3 )
^
you're here

If you stop at the first closing parenthesis, you will get ( ( 1 + 2 ). So the inner expression will be:

 ( 1 + 2

That's because your group is unbalanced. There are more opening parentheses than closing parentheses.

To make it work, you have to keep track of the number of opening and closing parentheses.

A counter is the perfect solution. Initially at 0, you increase it when you find an opening parenthesis, and decrease it when you find a closing parenthesis.

The algorithm stops when the counter returns to 0. Because the opening parenthesis is always matched as the first byte, the balancing is 1 right after that first byte.

( ( 1 + 2 ) + 3 )
^
b: 1

The next opening parenthesis increments the balancing by 1. Because the balancing equals 2, the algorithm continues.

( ( 1 + 2 ) + 3 )
  ^
  b: 2

The next recognized delimiter is a closing parenthesis. The balancing is decreased by 1. Because the balancing equals 1, the algorithm continues.

( ( 1 + 2 ) + 3 )
          ^
          b: 1

The next recognized delimiter is also a closing parenthesis. The balancing is decreased by 1. The algorithm stops because the balancing is now 0.

( ( 1 + 2 ) + 3 )
                ^
                b: 0

The resulting inner slice of data is:

 ( 1 + 2 ) + 3
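The balancing walkthrough above can be sketched in a few lines of plain Rust. This is a minimal illustration that assumes single-byte delimiters and ignores escaping; the function name `inner_group` is hypothetical and is not part of Elyze.

```rust
/// Hypothetical helper: returns the bytes between the opening delimiter found
/// at the start of `data` and its matching closing delimiter.
/// Returns `None` when `data` does not start with `open` or is unbalanced.
fn inner_group(data: &[u8], open: u8, close: u8) -> Option<&[u8]> {
    // The group must start with the opening delimiter.
    if data.first() != Some(&open) {
        return None;
    }
    // The opening delimiter is already matched, so the balancing starts at 1.
    let mut balance = 1usize;
    for (idx, &byte) in data.iter().enumerate().skip(1) {
        if byte == open {
            balance += 1;
        } else if byte == close {
            balance -= 1;
            if balance == 0 {
                // Exclude both the opening and the closing delimiter.
                return Some(&data[1..idx]);
            }
        }
    }
    // The end of the data was reached before the balancing returned to 0.
    None
}
```

Called on `( ( 1 + 2 ) + 3 )` with `(` and `)` as delimiters, the sketch yields the inner slice shown above. On the unbalanced input `( ( 1 + 2 )` it returns `None` instead of a truncated group.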

GroupKind

Elyze defines a GroupKind enumeration that implements the Peekable trait.

It allows peeking into a delimited group.

Parentheses Group

One of the GroupKind variants is Parenthesis. It behaves as explained in the introduction of this chapter.

extern crate elyze;
use elyze::bytes::components::groups::GroupKind;
use elyze::errors::ParseResult;
use elyze::peek::peek;
use elyze::scanner::Scanner;

fn main() -> ParseResult<()> {
    let data = b"( 5 + 3 - ( 10 * 8 ) ) + 54";
    let mut tokenizer = Scanner::new(data);
    let result = peek(GroupKind::Parenthesis, &mut tokenizer)?;

    if let Some(peeked) = result {
        assert_eq!(peeked.peeked_slice(), b" 5 + 3 - ( 10 * 8 ) ");
    }
    Ok(())
}

Escaped parentheses are supported.

If your data looks like this:

( ( 1 + 2 ) \) + 3 )

The escaped closing parenthesis \) is ignored when balancing, so the group is parsed correctly. The peeked slice keeps the escape characters:

 ( 1 + 2 ) \) + 3

Escaping is done with the \ character.

extern crate elyze;
use elyze::bytes::components::groups::GroupKind;
use elyze::errors::ParseResult;
use elyze::peek::peek;
use elyze::scanner::Scanner;

fn main() -> ParseResult<()> {
    let data = b"( 5 + 3 - \\( ( 10 * 8 \\)) \\)) + 54";
    let mut tokenizer = Scanner::new(data);
    let result = peek(GroupKind::Parenthesis, &mut tokenizer)?;

    if let Some(peeked) = result {
        assert_eq!(peeked.peeked_slice(), b" 5 + 3 - \\( ( 10 * 8 \\)) \\)");
    }
    Ok(())
}
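Escaping fits naturally into the balancing counter described in this chapter: a backslash makes the next byte invisible to the counter. The sketch below is a standalone illustration in plain Rust, assuming single-byte delimiters and `\` as the escape byte; `inner_group_escaped` is a hypothetical name and the code is not taken from Elyze.

```rust
/// Hypothetical helper: like a balanced group search, except that any byte
/// preceded by a backslash is skipped and never changes the balancing.
fn inner_group_escaped(data: &[u8], open: u8, close: u8) -> Option<&[u8]> {
    if data.first() != Some(&open) {
        return None;
    }
    // The opening delimiter is already matched, so the balancing starts at 1.
    let mut balance = 1usize;
    let mut idx = 1;
    while idx < data.len() {
        match data[idx] {
            // A backslash escapes the next byte: step over it.
            b'\\' => idx += 1,
            byte if byte == open => balance += 1,
            byte if byte == close => {
                balance -= 1;
                if balance == 0 {
                    // The escape characters stay in the returned slice.
                    return Some(&data[1..idx]);
                }
            }
            _ => {}
        }
        idx += 1;
    }
    None
}
```

With the data of the example above, the sketch returns the same inner slice as the assertion, escape characters included.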

Quoted Groups

Elyze also supports quoted groups.

extern crate elyze;
use elyze::bytes::components::groups::GroupKind;
use elyze::errors::ParseResult;
use elyze::peek::peek;
use elyze::scanner::Scanner;

fn main() -> ParseResult<()> {
    let data = b"'hello world' data";
    let mut tokenizer = Scanner::new(data);
    let result = peek(GroupKind::Quotes, &mut tokenizer)?;

    if let Some(peeked) = result {
        assert_eq!(peeked.peeked_slice(), b"hello world");
    }
    Ok(())
}

Because quoted groups use the same symbol to open and close the group, nested groups cannot be detected. You can still escape a quote inside the group if you need to.

extern crate elyze;
use elyze::bytes::components::groups::GroupKind;
use elyze::errors::ParseResult;
use elyze::peek::peek;
use elyze::scanner::Scanner;

fn main() -> ParseResult<()> {
    let data = "'I\\'m a quoted data' - 'yes me too'";
    let mut tokenizer = Scanner::new(data.as_bytes());
    let result = peek(GroupKind::Quotes, &mut tokenizer)?;

    if let Some(peeked) = result {
        assert_eq!(peeked.peeked_slice(), b"I\\'m a quoted data");
    }
    Ok(())
}

The same can be done with double quotes.

extern crate elyze;
use elyze::bytes::components::groups::GroupKind;
use elyze::errors::ParseResult;
use elyze::peek::peek;
use elyze::scanner::Scanner;

fn main() -> ParseResult<()> {
    let data = "\"I'm a quoted data\" - \"yes me too\"";
    let mut tokenizer = Scanner::new(data.as_bytes());
    let result = peek(GroupKind::DoubleQuotes, &mut tokenizer)?;

    if let Some(peeked) = result {
        assert_eq!(peeked.peeked_slice(), b"I'm a quoted data");
    }
    Ok(())
}
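Since a quoted group opens and closes with the same byte, no balancing counter is needed: the group ends at the first quote that is not escaped. Here is a minimal sketch of that idea in plain Rust, assuming `\` as the escape byte; the function `inner_quoted` is hypothetical and does not come from Elyze.

```rust
/// Hypothetical helper: returns the bytes between the quote found at the
/// start of `data` and the next unescaped occurrence of the same quote.
fn inner_quoted(data: &[u8], quote: u8) -> Option<&[u8]> {
    // The group must start with the quote byte.
    if data.first() != Some(&quote) {
        return None;
    }
    let mut idx = 1;
    while idx < data.len() {
        match data[idx] {
            // A backslash escapes the next byte: step over it.
            b'\\' => idx += 1,
            // First unescaped quote: the group is complete.
            byte if byte == quote => return Some(&data[1..idx]),
            _ => {}
        }
        idx += 1;
    }
    // No closing quote was found.
    None
}
```

The same function covers single and double quotes by passing `b'\''` or `b'"'` as the quote byte. A single quote inside a double-quoted group is an ordinary byte and needs no escaping.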
