Introduction
Elyze is a general-purpose parser framework written in Rust.
It is a collection of building blocks for creating various parsers over any kind of data.
This book will explain how to use the parser framework to create your own parsers.
If you want a deep dive into the basic concepts, you can read the concepts chapter.
Basic concepts
This part of the book will explain the basic concepts of the parser framework.
- Scanner : A scanner is a cursor over the data to parse.
- Matching : Matching is the process of recognizing a pattern in the data.
- Error handling : The error and result types used when parsing fails
- Recognizing : Recognizing data on matching patterns
- Visitor : The visitor pattern and the concept of accepting scanner data
- Peeking : Looking ahead in the data to find a pattern
Scanner
A parser is like an eye that sweeps over the data to find relevant elements.
To represent this eye, Elyze uses a structure called a Scanner.
It's a thin wrapper around the std::io::Cursor. It allows advancing or rewinding the cursor position as the parsing progresses.
use std::io::Cursor;

/// Wrapper around a `Cursor`.
#[derive(Debug, PartialEq)]
pub struct Scanner<'a, T> {
    /// The internal cursor.
    cursor: Cursor<&'a [T]>,
}
The scanner is generic over the type of the data it scans, so it can be used to scan data of any type. It's intended to be used over bytes, but you can use it over a slice of structures if you want to.
extern crate elyze;

// import the scanner
use elyze::scanner::Scanner;

struct Foo;

fn main() {
    // create a scanner over a slice of arbitrary data
    let mut scanner = Scanner::new(&[Foo, Foo, Foo]);
}
The scanner only holds a reference to the slice of data it scans, so no Clone or Copy constraint is required on the element type.
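To illustrate why borrowing removes the need for Clone or Copy, here is a minimal sketch. `MiniScanner` and `Packet` are hypothetical names introduced for this example; `MiniScanner` is a simplified stand-in that only mimics the borrowing behaviour described above, not Elyze's actual `Scanner`.

```rust
/// Hypothetical stand-in for a scanner: borrows a slice and tracks a position.
struct MiniScanner<'a, T> {
    data: &'a [T],
    position: usize,
}

impl<'a, T> MiniScanner<'a, T> {
    fn new(data: &'a [T]) -> Self {
        MiniScanner { data, position: 0 }
    }

    /// Everything from the current position to the end of the data.
    fn remaining(&self) -> &'a [T] {
        &self.data[self.position..]
    }
}

/// A type that deliberately implements neither `Clone` nor `Copy`.
struct Packet {
    id: u32,
}

/// Look at the first remaining element without copying any `Packet`.
fn first_remaining_id(packets: &[Packet]) -> Option<u32> {
    let scanner = MiniScanner::new(packets);
    scanner.remaining().first().map(|packet| packet.id)
}

fn main() {
    let packets = vec![Packet { id: 7 }, Packet { id: 8 }];
    assert_eq!(first_remaining_id(&packets), Some(7));
    // `packets` is still owned here: the scanner only borrowed it.
    assert_eq!(packets.len(), 2);
}
```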
Manipulate the cursor
As parsing progresses, the scanner has to move too.
Move forward
You can call the bump_by method to move the cursor forward
in the slice of data embedded in the scanner by the number of elements you have consumed.
impl<'a, T> Scanner<'a, T> {
/// Move the internal cursor forward by `n` positions.
///
/// # Arguments
///
/// * `n` - The number of positions to move the cursor forward.
///
/// # Panics
///
/// Panics if the internal cursor is moved past the end of the data.
pub fn bump_by(&mut self, n: usize);
}
Let's take an example: you have a slice of bytes, and you want to move the scanner forward by 3 bytes.
let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
scanner.bump_by(3);
The scanner will be at position 3. So it's now pointing to the byte 0x04.
Move backward
The rewind method moves the cursor backward. It's useful when your parser takes
a wrong decision path and you want to return the scanner to a previous state.
impl<'a, T> Scanner<'a, T> {
/// Move the internal cursor backward by `n` positions.
///
/// # Arguments
///
/// * `n` - The number of positions to move the cursor backward.
///
/// # Panics
///
/// Panics if the internal cursor is moved to a position before the start of the data.
pub fn rewind(&mut self, n: usize);
}
In this example, you have a slice of bytes and first move the scanner forward by 3 bytes. The parser then finds that it cannot continue on this path, but it may still be possible to parse something else, so you have to go back to the previous state.
let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
scanner.bump_by(3);
// do something afterward but we want to go back to the previous state
scanner.rewind(3);
The scanner will be back at position 0. So it's pointing to the byte 0x01 again.
Get the current position
Your parsing logic may need to know the current position of the scanner.
To get it, you can call the current_position method.
impl<'a, T> Scanner<'a, T> {
/// Return the current position of the internal cursor.
pub fn current_position(&self) -> usize
}
Example
let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
assert_eq!(scanner.current_position(), 0);
scanner.bump_by(3);
assert_eq!(scanner.current_position(), 3);
Move to a specific position
If you know exactly where you want to go, you can call the jump_to method.
Unlike the bump_by and rewind methods, which work on relative positions, jump_to takes an
absolute position to move to.
impl<'a, T> Scanner<'a, T> {
/// Move the internal cursor to the specified position.
///
/// # Arguments
///
/// * `n` - The position to move the cursor to.
///
/// # Panics
///
/// Panics if the internal cursor is moved past the end of the data.
pub fn jump_to(&mut self, n: usize);
}
In this example, a previous operation has already moved the scanner forward by 1 byte. We record that position, let further processing move the scanner around, and then jump back to the recorded absolute position 1.
let mut scanner = Scanner::new(&[0x01, 0x02, 0x03, 0x04, 0x05]);
// a previous operation bumped the scanner
scanner.bump_by(1);
// record the initial position
let initial_position = scanner.current_position();
assert_eq!(initial_position, 1);
// do something
scanner.bump_by(3);
scanner.rewind(2);
scanner.bump_by(1);
// rewind the state
scanner.jump_to(initial_position);
assert_eq!(scanner.current_position(), 1);
// the remaining data is back to the initial state
assert_eq!(scanner.remaining(), &[0x02, 0x03, 0x04, 0x05]);
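The record-then-jump pattern above can be wrapped in a small helper that restores the position whenever a parsing attempt fails. This is only a sketch under stated assumptions: `MiniScanner`, `attempt` and `parse_ab` are hypothetical names, and `MiniScanner` is a simplified stand-in mirroring the methods described in this chapter, not Elyze's own type.

```rust
/// Hypothetical stand-in mirroring the scanner methods described above.
struct MiniScanner<'a, T> {
    data: &'a [T],
    position: usize,
}

impl<'a, T> MiniScanner<'a, T> {
    fn new(data: &'a [T]) -> Self {
        MiniScanner { data, position: 0 }
    }
    fn bump_by(&mut self, n: usize) {
        assert!(self.position + n <= self.data.len(), "moved past the end");
        self.position += n;
    }
    fn jump_to(&mut self, n: usize) {
        assert!(n <= self.data.len(), "moved past the end");
        self.position = n;
    }
    fn current_position(&self) -> usize {
        self.position
    }
    fn remaining(&self) -> &'a [T] {
        &self.data[self.position..]
    }
}

/// Run `parse`; if it returns `None`, restore the position recorded before the call.
fn attempt<'a, T, R, F>(scanner: &mut MiniScanner<'a, T>, parse: F) -> Option<R>
where
    F: FnOnce(&mut MiniScanner<'a, T>) -> Option<R>,
{
    let checkpoint = scanner.current_position();
    let result = parse(scanner);
    if result.is_none() {
        scanner.jump_to(checkpoint);
    }
    result
}

/// Consume "ab". On "a" followed by anything else, it fails with the cursor already moved.
fn parse_ab(scanner: &mut MiniScanner<'_, u8>) -> Option<()> {
    if scanner.remaining().first() != Some(&b'a') {
        return None;
    }
    scanner.bump_by(1);
    if scanner.remaining().first() != Some(&b'b') {
        return None;
    }
    scanner.bump_by(1);
    Some(())
}

/// Position of the scanner after one guarded attempt on `input`.
fn position_after_attempt(input: &[u8]) -> usize {
    let mut scanner = MiniScanner::new(input);
    let _ = attempt(&mut scanner, parse_ab);
    scanner.current_position()
}

fn main() {
    assert_eq!(position_after_attempt(b"ab"), 2); // success: cursor stays advanced
    assert_eq!(position_after_attempt(b"ax"), 0); // failure: cursor restored
}
```

Recording an absolute position is convenient here because the helper does not need to know how far the failed attempt moved the cursor.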
Manipulate the data
The cursor alone is not enough. You also need to be able to access the data within the scanner.
Get the remaining data
This is the most useful method: it returns the remaining data of the scanner, that is, all the data after the current cursor position.
The Scanner exposes the remaining method to do that.
impl<'a, T> Scanner<'a, T> {
/// Return the remaining data of the internal cursor.
pub fn remaining(&self) -> &'a [T]
}
Because of the returned lifetime, the data is guaranteed to live as long as the slice the scanner was built on. The scanner
can be dropped, but the data returned by remaining stays valid.
fn process(mut scanner: Scanner<'_, u8>) -> &[u8] {
// do something with the data
// then bump the scanner
scanner.bump_by(3);
// return the remaining
scanner.remaining()
}
fn main() {
let data = b"hello world";
let remaining = process(Scanner::new(data));
assert_eq!(remaining, b"lo world");
}
Get all the data
If you want a reference to the whole data, you can call the data method.
It can be useful to get data from earlier in the parsing process.
impl<'a, T> Scanner<'a, T> {
/// Return the whole data of the internal cursor.
pub fn data(&self) -> &'a [T]
}
Same as remaining, it's safe to drop the scanner.
fn process(mut scanner: Scanner<'_, u8>) -> &[u8] {
// do something with the data
// then bump the scanner
scanner.bump_by(3);
// return the whole data
scanner.data()
}
fn main() {
let data = b"hello world";
let remaining = process(Scanner::new(data));
assert_eq!(remaining, b"hello world");
}
Match data
Parsing is a process that recognizes a pattern in the data.
To come back to the eye analogy: when you sweep over the characters in the text, you discover, letter by letter, which word you are looking at, and so on.
Take, for example, the list of characters ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd'], in which you
want to find the word hello.
You will see successively the letter h, then e, then l, then l, then o.
If all the characters match, you have found the word. Otherwise, it's something else.
The Match trait
To materialize this idea, Elyze defines a Match trait.
pub trait Match<T> {
    /// Returns true if the data matches the pattern.
    ///
    /// # Arguments
    /// data - the data to match
    ///
    /// # Returns
    /// (found, number of matched characters)
    ///
    /// (true, index) if the data matches the pattern,
    /// (false, index) otherwise
    fn is_matching(&self, data: &[T]) -> (bool, usize);

    /// Returns the size of the data to match.
    fn size(&self) -> usize;
}
The is_matching method returns a tuple (found, index) where found is a boolean and index is the number of
characters matched.
Like the Scanner, the Match trait is generic over the type of the data it matches.
To come back to our example, we can create a struct that implements the Match trait for the substring hello.
extern crate elyze;

use elyze::matcher::Match;

// define a structure to implement the `Match` trait
struct Hello;

// implement the `Match` trait
impl Match<u8> for Hello {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // define the pattern to match
        let pattern = b"hello";
        // check if the subslice of data matches the pattern
        (&data[..pattern.len()] == pattern, pattern.len())
    }

    fn size(&self) -> usize {
        5
    }
}

fn main() {
    let hello = Hello;
    assert_eq!(hello.is_matching(b"hello world"), (true, hello.size()));
    assert_eq!(hello.is_matching(b"world is beautiful"), (false, 5));
}
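Note that slicing with `data[..pattern.len()]` panics when the data is shorter than the pattern. The sketch below shows one possible length-safe variation using the standard `starts_with` slice method. The `Match` trait is redeclared locally, with the shape shown above, so that the example compiles on its own, and `Literal` is a hypothetical type introduced for illustration.

```rust
/// Local copy of the trait shape shown above, so the sketch is self-contained.
trait Match<T> {
    fn is_matching(&self, data: &[T]) -> (bool, usize);
    fn size(&self) -> usize;
}

/// Hypothetical matcher for an arbitrary byte pattern.
struct Literal<'p>(&'p [u8]);

impl Match<u8> for Literal<'_> {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `starts_with` returns false on short input instead of panicking
        (data.starts_with(self.0), self.0.len())
    }

    fn size(&self) -> usize {
        self.0.len()
    }
}

fn main() {
    let hello = Literal(b"hello");
    assert_eq!(hello.is_matching(b"hello world"), (true, 5));
    assert_eq!(hello.is_matching(b"world is beautiful"), (false, 5));
    // shorter than the pattern: no panic, simply not a match
    assert_eq!(hello.is_matching(b"hel"), (false, 5));
}
```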
Errors
Not every parse can succeed. Sometimes, the data you are parsing is not what you expect.
Elyze provides its own error type called ParseError, which is built on top of the thiserror crate.
To help readability, Elyze provides a type alias called ParseResult<T> that is an alias for Result<T, ParseError>.
/// The result of a parse operation
pub type ParseResult<T> = Result<T, ParseError>;

#[derive(Debug, thiserror::Error)]
pub enum ParseError {
    /// The parser reached the end of the input
    #[error("Unexpected end of input")]
    UnexpectedEndOfInput,
    /// The parser encountered an unexpected token
    #[error("Unexpected token have been encountered")]
    UnexpectedToken,
    /// Unable to decode a string as UTF-8
    #[error("UTF-8 error: {0}")]
    Utf8Error(#[from] std::str::Utf8Error),
    /// Unable to parse an integer from a string
    #[error("ParseIntError: {0}")]
    ParseIntError(#[from] std::num::ParseIntError),
}
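The two `#[from]` variants exist so that standard library errors can be converted with the `?` operator. The sketch below illustrates that idea with a reduced, hand-written stand-in for the error type: the `From` implementations are written manually here (assumed to be equivalent to what `#[from]` generates), and `parse_number` is a hypothetical helper.

```rust
use std::num::ParseIntError;
use std::str::Utf8Error;

/// Reduced stand-in for the error enum above, without the `thiserror` derive.
#[derive(Debug)]
enum ParseError {
    Utf8Error(Utf8Error),
    ParseIntError(ParseIntError),
}

// Hand-written conversions, standing in for the `#[from]` attributes.
impl From<Utf8Error> for ParseError {
    fn from(error: Utf8Error) -> Self {
        ParseError::Utf8Error(error)
    }
}

impl From<ParseIntError> for ParseError {
    fn from(error: ParseIntError) -> Self {
        ParseError::ParseIntError(error)
    }
}

type ParseResult<T> = Result<T, ParseError>;

/// Decode bytes as UTF-8, then parse them as an integer.
/// Each `?` converts the underlying error into a `ParseError`.
fn parse_number(bytes: &[u8]) -> ParseResult<u32> {
    let text = std::str::from_utf8(bytes)?;
    let value = text.parse::<u32>()?;
    Ok(value)
}

fn main() {
    assert_eq!(parse_number(b"42").unwrap(), 42);
    assert!(matches!(parse_number(b"4x"), Err(ParseError::ParseIntError(_))));
    assert!(matches!(parse_number(&[0xff]), Err(ParseError::Utf8Error(_))));
}
```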
Recognizing data
Matching data is important, but there are a lot of checks to do.
You first need to check if the start of the data matches the pattern.
Then, if it does, you get the subslice of the data that matches the pattern.
And most importantly, you need to bump the scanner to the end of the matched data. If you don't, your parser will never see the next data.
Let's take the previous example, where we want to find the word hello.
extern crate elyze;

use elyze::matcher::Match;
use elyze::scanner::Scanner;

// define a structure to implement the `Match` trait
struct Hello;

// implement the `Match` trait
impl Match<u8> for Hello {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // define the pattern to match
        let pattern = b"hello";
        // check if the subslice of data matches the pattern
        (&data[..pattern.len()] == pattern, pattern.len())
    }

    fn size(&self) -> usize {
        5
    }
}

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    // 1. check if the start of the data matches the pattern
    let (found, size) = Hello.is_matching(scanner.remaining());
    if !found {
        println!("not found");
        return;
    }
    // 2. get the subslice that matches the pattern
    let data = &scanner.remaining()[..size];
    // 3. bump the scanner to the end of the matched data
    scanner.bump_by(size);
    println!("found: {:?}", String::from_utf8_lossy(data)); // found: "hello"
    print!("remaining: {:?}", String::from_utf8_lossy(scanner.remaining())); // remaining: " world"
}
Recognizable trait
Because it is a common operation to recognize an object, Elyze provides the Recognizable trait.
Its role is to provide a unified way to recognize objects.
pub trait Recognizable<'a, T, V>: Match<T> {
    /// Try to recognize the object for the given scanner.
    ///
    /// # Type Parameters
    /// V - The type of the object to recognize
    ///
    /// # Arguments
    /// * `scanner` - The scanner to recognize the object for.
    ///
    /// # Returns
    /// * `Ok(Some(V))` if the object was recognized,
    /// * `Ok(None)` if the object was not recognized,
    /// * `Err(ParseError)` if an error occurred
    ///
    fn recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<V>>;

    /// Try to recognize the object for the given scanner.
    ///
    /// # Arguments
    /// * `scanner` - The scanner to recognize the object for.
    ///
    /// # Returns
    /// * `Ok(Some(&[T]))` if the object was recognized,
    /// * `Ok(None)` if the object was not recognized,
    /// * `Err(ParseError)` if an error occurred
    fn recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>>;
}
It defines two methods:
- recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<V>>
- recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>>
Both take a mutable reference to the scanner as input, but their return types differ.
The recognize method returns the value of the object that was recognized. Whereas the recognize_slice method returns
a slice of the data that was recognized.
This distinction is made because sometimes you don't want the structure itself but rather the data that encodes it.
For example, we want to get all bytes until we get the first space character.
extern crate elyze;

use elyze::errors::{ParseError, ParseResult};
use elyze::matcher::Match;
use elyze::recognizer::Recognizable;
use elyze::scanner::Scanner;

// define a structure to implement the `Match` trait
struct UntilFirstSpace;

// implement the `Match` trait
impl Match<u8> for UntilFirstSpace {
    /// Check if the given data matches the pattern.
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        let mut pos = 0;
        while pos < data.len() && data[pos] != b' ' {
            pos += 1;
        }
        (pos > 0, pos)
    }

    // The size of the object is unknown
    fn size(&self) -> usize {
        0
    }
}

// implement the `Recognizable` trait
impl<'a> Recognizable<'a, u8, UntilFirstSpace> for UntilFirstSpace {
    fn recognize(self, scanner: &mut Scanner<'a, u8>) -> ParseResult<Option<UntilFirstSpace>> {
        // check if the scanner has enough data
        if self.size() > scanner.remaining().len() {
            return Err(ParseError::UnexpectedEndOfInput);
        }
        let data = scanner.remaining();
        let (result, size) = self.is_matching(data);
        if !result {
            return Ok(None);
        }
        if !scanner.is_empty() {
            scanner.bump_by(size);
        }
        Ok(Some(self))
    }

    /// Try to recognize the object for the given scanner.
    /// Return the slice of elements that were recognized.
    fn recognize_slice(self, scanner: &mut Scanner<'a, u8>) -> ParseResult<Option<&'a [u8]>> {
        // Check if the scanner is empty
        if scanner.is_empty() {
            return Err(ParseError::UnexpectedEndOfInput);
        }
        let data = scanner.remaining();
        let (result, size) = self.is_matching(data);
        if !result {
            return Ok(None);
        }
        if !scanner.is_empty() {
            scanner.bump_by(size);
        }
        Ok(Some(&data[..size]))
    }
}

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("hello")

    let mut scanner = Scanner::new(b"loooooooooong string");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("loooooooooong")
}
But this code won't compile: rustc will complain about conflicting implementations.
error[E0119]: conflicting implementations of trait `Recognizable<'_, u8, UntilFirstSpace>` for type `UntilFirstSpace`
|
25 | impl<'a> Recognizable<'a, u8, UntilFirstSpace> for UntilFirstSpace {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: conflicting implementation in crate `elyze`:
- impl<'a, T, M> Recognizable<'a, T, M> for M
where M: elyze::matcher::Match<T>;
Indeed, Elyze has already done the work for you using a handy feature of the Rust language called a blanket implementation.
It says that every type implementing the Match trait also implements the Recognizable trait.
Here is the blanket implementation of the Recognizable trait for any M that implements the Match trait:
/// Recognize an object for the given scanner.
/// Return the recognized object.
impl<'a, T, M: Match<T>> Recognizable<'a, T, M> for M {
fn recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<M>> {
// check if the scanner has enough data
if self.size() > scanner.remaining().len() {
return Err(ParseError::UnexpectedEndOfInput);
}
let data = scanner.remaining();
// check if the data matches the pattern
let (result, size) = self.is_matching(data);
if !result {
return Ok(None);
}
// bump the scanner if it's not empty
if !scanner.is_empty() {
scanner.bump_by(size);
}
// return the object
Ok(Some(self))
}
/// Try to recognize the object for the given scanner.
/// Return the slice of elements that were recognized.
fn recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>> {
// Check if the scanner is empty
if scanner.is_empty() {
return Err(ParseError::UnexpectedEndOfInput);
}
let data = scanner.remaining();
// check if the data matches the pattern
let (result, size) = self.is_matching(data);
if !result {
return Ok(None);
}
// bump the scanner if it's not empty
if !scanner.is_empty() {
scanner.bump_by(size);
}
// return the slice of data that was recognized
Ok(Some(&data[..size]))
}
}
You can therefore simplify your code to:
extern crate elyze;

use elyze::matcher::Match;
use elyze::recognizer::Recognizable;
use elyze::scanner::Scanner;

// define a structure to implement the `Match` trait
struct UntilFirstSpace;

// implement the `Match` trait
impl Match<u8> for UntilFirstSpace {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        let mut pos = 0;
        while pos < data.len() && data[pos] != b' ' {
            pos += 1;
        }
        (pos > 0, pos)
    }

    // The size of the object is unknown
    fn size(&self) -> usize {
        0
    }
}

fn main() {
    let mut scanner = Scanner::new(b"hello world");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("hello")

    let mut scanner = Scanner::new(b"loooooooooong string");
    let result = UntilFirstSpace
        .recognize_slice(&mut scanner)
        .expect("failed to parse");
    println!("{:?}", result.map(|s| String::from_utf8_lossy(s))); // Some("loooooooooong")
}
Signature breakdown
Because function signatures are polymorphic, they become a bit more complicated.
impl<'a, T, M: Match<T>> Recognizable<'a, T, M> for M {
fn recognize(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<M>>
}
- 'a : the lifetime of the data to parse.
- T : the type of the data to parse.
- M : the type of the object that we want to recognize.
We implement Recognizable<'a, T, M> for M with these same three parameters.
And the return type is ParseResult<Option<M>>.
The recognition may fail with an error, and even when no error occurs, the object may not have been recognized, hence the Option inside the ParseResult.
The same goes for the recognize_slice function but differs in the return type which is &'a [T] instead of M.
impl<'a, T, M: Match<T>> Recognizable<'a, T, M> for M {
fn recognize_slice(self, scanner: &mut Scanner<'a, T>) -> ParseResult<Option<&'a [T]>>
}
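Since the return type nests an Option inside a ParseResult, a caller has three distinct outcomes to handle. The sketch below spells them out with a `match`. It uses a reduced local stand-in for the error type, and `describe` is a hypothetical helper written only for this illustration.

```rust
/// Reduced stand-in for the error type, enough for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnexpectedEndOfInput,
}

type ParseResult<T> = Result<T, ParseError>;

/// Turn each of the three possible outcomes into a human-readable message.
fn describe(outcome: ParseResult<Option<&[u8]>>) -> String {
    match outcome {
        // the pattern matched and the matched bytes are available
        Ok(Some(bytes)) => format!("recognized {} byte(s)", bytes.len()),
        // nothing went wrong, but the pattern is not at this position
        Ok(None) => String::from("not recognized, another pattern may apply"),
        // recognition could not be carried out at all
        Err(error) => format!("hard failure: {error:?}"),
    }
}

fn main() {
    assert_eq!(describe(Ok(Some(&b"hello"[..]))), "recognized 5 byte(s)");
    assert_eq!(describe(Ok(None)), "not recognized, another pattern may apply");
    assert_eq!(
        describe(Err(ParseError::UnexpectedEndOfInput)),
        "hard failure: UnexpectedEndOfInput"
    );
}
```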
Utility functions
With the current toolbox, you're already able to write these lines:
fn main() {
let mut scanner = Scanner::new(b"hello world");
let data = Hello.recognize(&mut scanner).expect("failed to parse");
if let Some(hello) = data {
println!("found: {hello:?}"); // found: Hello
print!(
"remaining: {:?}",
String::from_utf8_lossy(scanner.remaining())
); // remaining: " world"
} else {
println!("not found");
}
}
That's great, but it's a bit verbose.
To improve readability, Elyze provides a few utility functions:
- recognize
- recognize_slice
Both are thin wrappers around the Recognizable trait.
They basically call the recognize or recognize_slice function on the Recognizable trait, and transform the
None variant to an Err(ParseError::UnexpectedToken).
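The None-to-error transformation can be pictured as follows. This is a sketch of the idea only, not the library's source: `require` is a hypothetical helper, and the error enum is a reduced local stand-in.

```rust
/// Reduced stand-in for the error type, enough for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnexpectedEndOfInput,
    UnexpectedToken,
}

type ParseResult<T> = Result<T, ParseError>;

/// Collapse `Ok(None)` into `Err(UnexpectedToken)`, keeping the other two outcomes.
fn require<V>(outcome: ParseResult<Option<V>>) -> ParseResult<V> {
    // `?` forwards an existing error, `ok_or` turns `None` into a new one
    outcome?.ok_or(ParseError::UnexpectedToken)
}

fn main() {
    assert_eq!(require(Ok(Some(1))), Ok(1));
    assert_eq!(require::<u8>(Ok(None)), Err(ParseError::UnexpectedToken));
    assert_eq!(
        require::<u8>(Err(ParseError::UnexpectedEndOfInput)),
        Err(ParseError::UnexpectedEndOfInput)
    );
}
```

Flattening the result this way is what makes the `?` operator usable at the call site, at the cost of no longer distinguishing "not recognized" from other failures without inspecting the error.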
recognize_slice
fn main() -> ParseResult<()> {
    let mut scanner = Scanner::new(b"hello world");
    let hello_string: &[u8] = recognize_slice(Hello, &mut scanner)?;
    println!("found: {}", String::from_utf8_lossy(hello_string)); // found: hello
    print!(
        "remaining: {:?}",
        String::from_utf8_lossy(scanner.remaining())
    ); // remaining: " world"
    Ok(())
}
The main benefit is the ability to use the ? operator to handle errors. So you can chain recognize functions.
For example, to recognize three successive "hello"s:
extern crate elyze;

fn main() -> ParseResult<()> {
    let mut scanner = Scanner::new(b"hellohellohello world");
    // recognize the first "hello"
    recognize_slice(Hello, &mut scanner)?;
    // recognize the second "hello"
    recognize_slice(Hello, &mut scanner)?;
    // recognize the third "hello"
    recognize_slice(Hello, &mut scanner)?;
    Ok(())
}
Because recognize_slice is a thin wrapper around the Recognizable trait, it takes the same type parameters as the
trait, plus the type of the recognizable itself.
// the signature of the `recognize_slice` function
pub fn recognize_slice<'a, T, V, R>(
recognizable: R,
scanner: &mut Scanner<'a, T>,
) -> ParseResult<&'a [T]>
where
R: Recognizable<'a, T, V>,
- 'a : the lifetime of the data to parse.
- T : the type of the data to parse.
- V : the type of the object that we want to recognize.
- R : the type of the recognizable passed to the function; it must implement Recognizable<'a, T, V>.
recognize
The recognize function works like recognize_slice, but returns the object that was recognized.
extern crate elyze;

fn main() -> ParseResult<()> {
    let mut scanner = Scanner::new(b"hello world");
    let hello: Hello = recognize(Hello, &mut scanner)?;
    println!("found: {hello:?}"); // found: Hello
    print!(
        "remaining: {:?}",
        String::from_utf8_lossy(scanner.remaining())
    ); // remaining: " world"
    Ok(())
}
This time, the Hello structure is returned rather than the &[u8] slice.
Its signature is almost the same as that of the recognize_slice function; only the return type differs.
// the signature of the `recognize` function
pub fn recognize<'a, T, V, R>(
recognizable: R,
scanner: &mut Scanner<'a, T>,
) -> ParseResult<V>
where
R: Recognizable<'a, T, V>,
V is the type of the object that was recognized.
Visitor
Elyze's keystone is the visitor pattern.
It is materialized by a Visitor trait.
extern crate elyze;

use elyze::errors::ParseResult;
use elyze::scanner::Scanner;

/// A visitor pattern.
///
/// # Type parameters
///
/// * `'a` - The lifetime of the data to visit.
/// * `T` - The type of the data to visit.
pub trait Visitor<'a, T>: Sized {
    /// Try to accept the `Scanner` and return the result of the visit.
    ///
    /// # Arguments
    ///
    /// * `scanner` - The scanner to accept.
    ///
    /// # Returns
    ///
    /// The result of the visit.
    fn accept(scanner: &mut Scanner<'a, T>) -> ParseResult<Self>;
}
It defines a single method, accept, that takes a Scanner as argument and returns a ParseResult of the visitor.
The visitor returns itself when it accepts the scanner data.
The Visitor is generic over the type of the data to visit, and its lifetime corresponds
to the lifetime of the data visited.
Accepting data
To accept scanner data, the object must implement the Visitor trait.
Let's say that you have three Recognizable objects:
- Hello: recognizes the word "hello"
- Space: recognizes the space character
- World: recognizes the word "world"
You want your HelloWorld structure to recognize the sentence "hello world" and return a HelloWorld object.
You have to implement the Visitor trait for the HelloWorld structure.
extern crate elyze;

use elyze::errors::ParseResult;
use elyze::matcher::Match;
use elyze::recognizer::{recognize, Recognizable};
use elyze::scanner::Scanner;
use elyze::visitor::Visitor;

// `Hello`, `Space` and `World` are the three `Recognizable` objects described above

// define a structure to implement the `Visitor` trait
#[derive(Debug)]
struct HelloWorld;

// implement the `Visitor` trait
impl<'a> Visitor<'a, u8> for HelloWorld {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        recognize(Hello, scanner)?; // recognize the word "hello"
        recognize(Space, scanner)?; // recognize the space character
        recognize(World, scanner)?; // recognize the word "world"
        // return the `HelloWorld` object
        Ok(HelloWorld)
    }
}

fn main() {
    let data = b"hello world";
    let mut scanner = Scanner::new(data);
    // Use the accept method on HelloWorld
    let result = HelloWorld::accept(&mut scanner);
    println!("{:?}", result); // Ok(HelloWorld)
}
Accept from Recognizable
If your Recognizable data implements the Default trait, you automatically get a Visitor implementation.
This is done by another blanket implementation.
/// Allow a `Recognizable` to be used as a `Visitor`.
///
/// # Type Parameters
///
/// * `T` - The type of the data to scan.
/// * `'a` - The lifetime of the data to scan.
///
impl<'a, T, R: Recognizable<'a, T, R> + Default> Visitor<'a, T> for R {
    fn accept(scanner: &mut Scanner<'a, T>) -> ParseResult<Self> {
        recognize(R::default(), scanner)?;
        Ok(R::default())
    }
}
This makes moving from the Recognizable world to the Visitor world trivial.
extern crate elyze;

use elyze::errors::ParseResult;
use elyze::scanner::Scanner;
use elyze::visitor::Visitor;

#[derive(Default)]
struct Hello; // implement `Match` and `Default`
#[derive(Default)]
struct Space; // implement `Match` and `Default`
#[derive(Default)]
struct World; // implement `Match` and `Default`

#[derive(Debug)]
struct HelloWorld;

impl<'a> Visitor<'a, u8> for HelloWorld {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        Hello::accept(scanner)?; // accept the word "hello"
        Space::accept(scanner)?; // accept the space character
        World::accept(scanner)?; // accept the word "world"
        // return the `HelloWorld` object
        Ok(HelloWorld)
    }
}

fn main() {
    let data = b"hello world";
    let mut scanner = Scanner::new(data);
    let result = HelloWorld::accept(&mut scanner);
    println!("{:?}", result); // Ok(HelloWorld)
}
One of the major differences between the Visitor and the Recognizable is the fact that the Visitor can call other
Visitor objects.
This allows creating more complex parsing trees and thus more complex parsers.
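To make the idea of visitors calling other visitors concrete, here is a self-contained sketch of a small parsing tree. Everything in it is a simplified local stand-in written for this illustration (`MiniScanner`, the byte-only `Visitor` trait, `Digit`, `Comma`, `Pair`, `parse_pair`), not Elyze's actual types.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    UnexpectedEndOfInput,
    UnexpectedToken,
}

type ParseResult<T> = Result<T, ParseError>;

/// Hypothetical byte-only scanner stand-in.
struct MiniScanner<'a> {
    data: &'a [u8],
    position: usize,
}

impl<'a> MiniScanner<'a> {
    fn new(data: &'a [u8]) -> Self {
        MiniScanner { data, position: 0 }
    }

    /// Consume and return the next byte.
    fn next_byte(&mut self) -> ParseResult<u8> {
        let byte = *self
            .data
            .get(self.position)
            .ok_or(ParseError::UnexpectedEndOfInput)?;
        self.position += 1;
        Ok(byte)
    }
}

/// Same shape as the `Visitor` trait above, specialised to bytes.
trait Visitor<'a>: Sized {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self>;
}

/// Leaf visitor: a single ASCII digit.
#[derive(Debug, PartialEq)]
struct Digit(u8);

impl<'a> Visitor<'a> for Digit {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self> {
        match scanner.next_byte()? {
            byte @ b'0'..=b'9' => Ok(Digit(byte - b'0')),
            _ => Err(ParseError::UnexpectedToken),
        }
    }
}

/// Leaf visitor: a comma.
struct Comma;

impl<'a> Visitor<'a> for Comma {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self> {
        match scanner.next_byte()? {
            b',' => Ok(Comma),
            _ => Err(ParseError::UnexpectedToken),
        }
    }
}

/// Composite visitor: built only from calls to other visitors.
#[derive(Debug, PartialEq)]
struct Pair(Digit, Digit);

impl<'a> Visitor<'a> for Pair {
    fn accept(scanner: &mut MiniScanner<'a>) -> ParseResult<Self> {
        let left = Digit::accept(scanner)?;
        Comma::accept(scanner)?;
        let right = Digit::accept(scanner)?;
        Ok(Pair(left, right))
    }
}

fn parse_pair(input: &[u8]) -> ParseResult<Pair> {
    Pair::accept(&mut MiniScanner::new(input))
}

fn main() {
    assert_eq!(parse_pair(b"3,4"), Ok(Pair(Digit(3), Digit(4))));
    assert_eq!(parse_pair(b"3;4"), Err(ParseError::UnexpectedToken));
}
```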
Peeking data
Sometimes you want to look at the next data without consuming it.
For example, you have to match the start of a parenthesis-delimited expression, and you want to check whether one of the next
characters is a ).
If so, you want the contents of the parentheses to be consumed.
5 * ( 1 + 2 )
    ^
    you're here
To do that, you have to advance the scanner and check for each step if the scanner matches the close parenthesis.
Then you get the slice of the data between the open parenthesis and the close parenthesis.
1 + 2
Peekable trait
Elyze uses the Peekable trait to define peekable data. Peekable data stands for data that you can look at without
consuming it.
pub trait Peekable<'a, T> {
    /// Attempt to match the `Peekable` against the current position of the
    /// `Scanner`.
    ///
    /// This method will temporarily advance the position of the `Scanner` to
    /// find a match. If a match is found, the `Scanner` is rewound to the
    /// original position and a `PeekResult` is returned. If no match is found,
    /// the `Scanner` is rewound to the original position and an `Err` is
    /// returned.
    ///
    /// # Arguments
    ///
    /// * `data` - The `Scanner` to use when matching.
    ///
    /// # Returns
    ///
    /// A `PeekResult` if the `Peekable` matches the current position of the
    /// `Scanner`, or an `Err` otherwise.
    fn peek(&self, data: &Scanner<'a, T>) -> ParseResult<PeekResult>;
}
This trait defines a single peek method.
It leaves the scanner unchanged and returns a PeekResult.
The PeekResult is wrapped in a ParseResult because peeking can fail while recognizing or accepting the data.
The error is propagated and left for the caller to handle.
PeekResult
The PeekResult itself is an enumeration.
pub enum PeekResult {
    /// The match was successful.
    Found {
        // The last index of the end slice
        end_slice: usize,
        // The size of the start element
        start_element_size: usize,
        // The size of the end element
        end_element_size: usize,
    },
    /// The match was unsuccessful.
    NotFound,
}
In its Found variant it embeds the last index of the end slice, the size of the start element and the size of the end element.
Example
Let's implement a Match for the closing parenthesis.
struct CloseParentheses;

impl Match<u8> for CloseParentheses {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        if data[0] == b')' {
            (true, 1)
        } else {
            (false, 0)
        }
    }

    fn size(&self) -> usize {
        1
    }
}
Then define something that will bear the Peekable trait.
struct ParenthesesGroup;
Then implement the Peekable trait for it.
impl<'a> Peekable<'a, u8> for ParenthesesGroup {
    fn peek(&self, scanner: &Scanner<'a, u8>) -> ParseResult<PeekResult> {
        // create an internal scanner allowing to peek data without altering the original scanner
        let mut inner_scanner = Scanner::new(scanner.remaining());
        // loop on each byte until we find a close parenthesis
        loop {
            if inner_scanner.is_empty() {
                // we have reached the end without finding a close parenthesis
                break;
            }
            if CloseParentheses.recognize(&mut inner_scanner)?.is_some() {
                // we have found a close parenthesis
                return Ok(PeekResult::Found {
                    // we return the position just after the close parenthesis
                    end_slice: inner_scanner.current_position(),
                    // our peeking doesn't include a start element
                    start_element_size: 0,
                    // the end element is a close parenthesis of 1 byte
                    end_element_size: 1,
                });
            }
            // consume the current byte
            inner_scanner.bump_by(1);
        }
        // At this point, we have reached the end of available data without finding a close parenthesis
        Ok(PeekResult::NotFound)
    }
}
This implementation is not perfect: it stops at the first close parenthesis and does not handle nested parentheses, where several close parentheses appear.
But it is enough to demonstrate the concept.
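For reference, one common way to handle nesting is to count the depth of open parentheses while scanning. The sketch below shows the technique on a plain byte slice, independently of Elyze; `matching_close` is a hypothetical helper. Under the conventions used above, the corresponding end_slice would be the returned index plus one.

```rust
/// Return the index of the `)` that closes a group whose `(` has already
/// been consumed, skipping over any nested groups.
fn matching_close(data: &[u8]) -> Option<usize> {
    let mut depth = 0usize;
    for (index, &byte) in data.iter().enumerate() {
        match byte {
            // a nested group opens
            b'(' => depth += 1,
            // no nested group is open: this is the parenthesis we look for
            b')' if depth == 0 => return Some(index),
            // a nested group closes
            b')' => depth -= 1,
            _ => {}
        }
    }
    // the group is never closed
    None
}

fn main() {
    assert_eq!(matching_close(b" 1 + 2 )"), Some(7));
    assert_eq!(matching_close(b" 1 + (2 * 3) )"), Some(13));
    assert_eq!(matching_close(b"(1"), None);
}
```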
extern crate elyze;

fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let mut scanner = Scanner::new(data);
    scanner.bump_by(5); // consumes : 7 * (
    let result = ParenthesesGroup.peek(&scanner)?;
    if let PeekResult::Found {
        end_slice,
        end_element_size,
        ..
    } = result
    {
        println!(
            "{:?}",
            // to find the real size of the enclosed data, we need to subtract the size of the end element
            String::from_utf8_lossy(&scanner.remaining()[..end_slice - end_element_size]) // 1 + 2
        );
    } else {
        println!("not found");
    }
    println!(
        "scanner: {:?}",
        // the scanner itself remains unchanged
        String::from_utf8_lossy(scanner.remaining()) // scanner: " 1 + 2 )"
    );
    Ok(())
}
Peeking
To store the result of a successful peek, Elyze defines a structure called Peeking.
pub struct Peeking<'a, T> {
    /// The start of the match.
    pub start_element_size: usize,
    /// The end of the match.
    pub end_element_size: usize,
    /// The length of peeked slice.
    pub end_slice: usize,
    /// The data that was peeked.
    pub data: &'a [T],
}
As you can see, the Peeking struct embeds the fields of PeekResult::Found along with the data slice.
peek method
The Peeking structure is returned by the peek function.
/// Attempt to match a `Peekable` against the current position of a `Scanner`.
///
/// This function will temporarily advance the position of the `Scanner` to find
/// a match. If a match is found, the `Scanner` is rewound to the original
/// position and a `Peeking` is returned. If no match is found, the `Scanner` is
/// rewound to the original position and an `Err` is returned.
///
/// # Arguments
///
/// * `peekable` - The `Peekable` to attempt to match.
/// * `scanner` - The `Scanner` to use when matching.
///
/// # Returns
///
/// A `Peeking` if the `Peekable` matches the current position of the `Scanner`,
/// or an `Err` otherwise.
pub fn peek<'a, T, P: Peekable<'a, T>>(
    peekable: P,
    scanner: &Scanner<'a, T>,
) -> ParseResult<Option<Peeking<'a, T>>>;
It is a shorthand for calling the Peekable::peek method directly.
It takes care of the slice arithmetic for you.
extern crate elyze;

fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let mut scanner = Scanner::new(data);
    scanner.bump_by(5); // consumes : 7 * (
    // use the peek function instead of ParenthesesGroup.peek
    let result = peek(ParenthesesGroup, &scanner)?;
    if let Some(peeking) = result {
        println!(
            "{:?}",
            // the peeked_slice method returns the recognized slice without the end element
            String::from_utf8_lossy(peeking.peeked_slice()) // 1 + 2
        );
    } else {
        println!("not found");
    }
    println!(
        "scanner: {:?}",
        // the scanner itself remains unchanged
        String::from_utf8_lossy(scanner.remaining()) // scanner: " 1 + 2 )"
    );
    Ok(())
}
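The slice arithmetic that peeked_slice spares you can be sketched as follows. This is an assumption about how such a method could be computed from the fields shown above, not the library's actual implementation; `PeekingSketch` is a hypothetical stand-in with the same fields as Peeking.

```rust
/// Hypothetical stand-in with the same fields as `Peeking`.
struct PeekingSketch<'a, T> {
    start_element_size: usize,
    end_element_size: usize,
    end_slice: usize,
    data: &'a [T],
}

impl<'a, T> PeekingSketch<'a, T> {
    /// Assumed computation: drop the start element at the front and the end
    /// element at the back of the peeked range.
    fn peeked_slice(&self) -> &'a [T] {
        &self.data[self.start_element_size..self.end_slice - self.end_element_size]
    }
}

fn main() {
    // values matching the parenthesis example: no start element,
    // a 1-byte end element, and 8 bytes peeked in total
    let peeking = PeekingSketch {
        start_element_size: 0,
        end_element_size: 1,
        end_slice: 8,
        data: &b" 1 + 2 )"[..],
    };
    assert_eq!(peeking.peeked_slice(), b" 1 + 2 ");
}
```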
Components
Components are built on top of the basic concepts and allow more complex parsing behaviors.
- Tokens : Basic element to recognize in your parse.
- Recognizer : Recognizes one of several alternative Recognizable.
- Acceptor : Accepts one of several alternative Visitor.
- Peeker : Peeks one of several alternative Peekable.
- Peek from Visitor : Explains how to transform a Visitor into a Peekable.
- Last : Gets the last occurrence of a Peekable in the data.
- Separated List : A separated list is a process that accepts a list of elements separated by a separator.
Tokens
Based on the Recognizable trait, a token is the atomic element you want to recognize in your parse.
The idea behind the token recognition is to create a union type that will allow you to recognize any token in the same way.
In Rust this union is materialized with an enumeration.
All you need to do is to implement the Match trait on your enumeration. And then you can use the
recognize function to recognize the token variant.
extern crate elyze; use elyze::errors::ParseResult; use elyze::matcher::Match; use elyze::recognizer::recognize; // define a matching function fn match_char(c: char, data: &[u8]) -> (bool, usize) { match data.get(0) { Some(&d) => (d == c as u8, 1), None => (false, 0), } } // create an enumeration of tokens to recognize enum Token { Plus, Minus, Star, Slash, LParen, RParen, } // implement the `Match` trait for the `Token` enum impl Match<u8> for Token { fn is_matching(&self, data: &[u8]) -> (bool, usize) { match self { Token::Plus => match_char('+', data), Token::Minus => match_char('-', data), Token::Star => match_char('*', data), Token::Slash => match_char('/', data), Token::LParen => match_char('(', data), Token::RParen => match_char(')', data), } } fn size(&self) -> usize { match self { Token::Plus => 1, Token::Minus => 1, Token::Star => 1, Token::Slash => 1, Token::LParen => 1, Token::RParen => 1, } } } // Profit ! fn main() -> ParseResult<()> { let data = b"((+-)*/)end"; let mut scanner = elyze::scanner::Scanner::new(data); recognize(Token::LParen, &mut scanner)?; recognize(Token::LParen, &mut scanner)?; recognize(Token::Plus, &mut scanner)?; recognize(Token::Minus, &mut scanner)?; recognize(Token::RParen, &mut scanner)?; recognize(Token::Star, &mut scanner)?; recognize(Token::Slash, &mut scanner)?; recognize(Token::RParen, &mut scanner)?; print!("{:?}", String::from_utf8_lossy(scanner.remaining())); // "end" Ok(()) }
Recognizer
The Recognizer addresses the case where the pattern to match can vary.
That's the case with numeric operators like + or -, for example.
The Recognizer works by trying each Recognizable one by one until a match is found. If no pattern matches, it returns
a None value.
The Recognizer takes a mutable reference to the Scanner as a parameter.
Then a chain of try_or calls is used to add Recognizable objects to the Recognizer.
extern crate elyze; use elyze::recognizer::Recognizer; fn main() -> ParseResult<()> { Recognizer::new(scanner) .try_or(Operator::Add)? .try_or(Operator::Sub)? .finish(); }
Step by step
The Recognizer works in 3 steps:
- Initialize it with a scanner
- Add a Recognizable to the Recognizer
- Call the finish method to get the result
Step 1: Initializing with a scanner
extern crate elyze; use elyze::recognizer::Recognizer; use elyze::scanner::Scanner; fn main() { let data = b"+"; let mut scanner = Scanner::new(data); let result = Recognizer::<u8, Operator>::new(&mut scanner); }
The Recognizer is defined by:
/// A `Recognizer` is a type that wraps a `Scanner` and holds a successfully
/// recognized value.
///
/// When a value is successfully recognized, the `Recognizer` stores the value in
/// its `data` field and returns itself. If a value is not recognized, the
/// `Recognizer` rewinds the scanner to the previous position and returns itself.
///
/// # Type Parameters
///
/// * `T` - The type of the data to scan.
/// * `R` - The type of the value to recognize.
/// * `'a` - The lifetime of the data to scan.
/// * `'b` - The lifetime of the `Scanner`.
pub struct Recognizer<'a, 'b, T, R> {
    data: Option<R>,
    scanner: &'b mut Scanner<'a, T>,
}
Because the struct is generic over T and R, you need to specify them in the new call whenever they cannot be inferred.
Step 2: Add a Recognizable to the Recognizer
Once the Recognizer is initialized, you can add one or more Recognizable to it.
To do so, the Recognizer provides the try_or method.
#![allow(unused)] fn main() { extern crate elyze; impl<'a, 'b, T, R: Recognizable<'a, T, R>> Recognizer<'a, 'b, T, R> { pub fn try_or(mut self, element: R) -> ParseResult<Self>; } }
This method takes an R object, where R implements the Recognizable trait, and returns a ParseResult containing the
Recognizer object.
You can add as many Recognizable as you want. But because of a Rust limitation, R must be the same type at each call. That's why
using an enumeration of tokens is a good idea.
Here are my tokens:
#![allow(unused)] fn main() { #[derive(Debug)] enum Operator { Add, Sub, } }
It implements Match<u8>
extern crate elyze; fn main() -> ParseResult<()> { let data = b"+"; let mut scanner = Scanner::new(data); // Initialize the recognizer (type can be inferred) let recognizer = Recognizer::new(&mut scanner); // Try to apply the recognizer on the operator add, if it fails, return an error let recognizer_add = recognizer.try_or(Operator::Add)?; // Try to apply the recognizer on the operator sub, if it fails, return an error let recognizer_add_and_sub = recognizer_add.try_or(Operator::Sub)?; Ok(()) }
The try_or method works as follows:
- If the internal state is None:
  - It calls the recognize method of the Recognizable object.
  - If the recognize method returns Ok(Some(value)), it sets the internal state to Some(value).
  - If the recognize method returns Ok(None), the internal state stays None and the scanner is rewound to its previous position.
  - If an error occurs, it returns the error.
- If the internal state is Some, it returns the Recognizer as is.
The execution is immediate, not lazy: you are not registering an operation but applying it.
Step 3: Call the finish method to get the result
Once all the Recognizable have been added to the Recognizer, you can call the finish method to get the result.
#![allow(unused)] fn main() { extern crate elyze; impl<'a, 'b, T, R: Recognizable<'a, T, R>> Recognizer<'a, 'b, T, R> { pub fn finish(self) -> Option<R>; } }
This step can't fail. It returns the internal state of the Recognizer.
extern crate elyze;

use elyze::errors::ParseResult;
use elyze::recognizer::Recognizer;
use elyze::scanner::Scanner;

fn main() -> ParseResult<()> {
    let data = b"+";
    let mut scanner = Scanner::new(data);
    // Initialize the recognizer
    let recognizer = Recognizer::new(&mut scanner);
    // Try to apply the recognizer on the operator add, if it fails, return an error
    let recognizer_add = recognizer.try_or(Operator::Add)?;
    // Try to apply the recognizer on the operator sub, if it fails, return an error
    let recognizer_add_and_sub = recognizer_add.try_or(Operator::Sub)?;
    // Finish the recognizer
    let result = recognizer_add_and_sub.finish();
    dbg!(result); // Some(Operator::Add)
    Ok(())
}
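To see why the first match wins, the three steps can be modeled without Elyze at all. The sketch below is only an illustration: MiniRecognizer, matches and recognize_operator are hypothetical names, and the error handling and scanner movement of the real Recognizer are left out.

```rust
/// Hypothetical two-variant token set, standing in for `Operator`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Operator {
    Add,
    Sub,
}

impl Operator {
    /// Returns true when `data` starts with this operator.
    fn matches(self, data: &[u8]) -> bool {
        let expected = match self {
            Operator::Add => b'+',
            Operator::Sub => b'-',
        };
        data.first() == Some(&expected)
    }
}

/// Simplified recognizer: the data to inspect plus the first match found.
struct MiniRecognizer<'a> {
    data: &'a [u8],
    found: Option<Operator>,
}

impl<'a> MiniRecognizer<'a> {
    /// Step 1: initialize with the data.
    fn new(data: &'a [u8]) -> Self {
        MiniRecognizer { data, found: None }
    }

    /// Step 2: apply `candidate` immediately; a no-op once a match is stored.
    fn try_or(mut self, candidate: Operator) -> Self {
        if self.found.is_none() && candidate.matches(self.data) {
            self.found = Some(candidate);
        }
        self
    }

    /// Step 3: hand back the internal state.
    fn finish(self) -> Option<Operator> {
        self.found
    }
}

/// Chains the three steps for the two operators.
fn recognize_operator(data: &[u8]) -> Option<Operator> {
    MiniRecognizer::new(data)
        .try_or(Operator::Add)
        .try_or(Operator::Sub)
        .finish()
}

fn main() {
    assert_eq!(recognize_operator(b"+ 1"), Some(Operator::Add));
    assert_eq!(recognize_operator(b"- 1"), Some(Operator::Sub));
    assert_eq!(recognize_operator(b"x"), None);
}
```

Each try_or call consumes and returns the recognizer, which is what makes the chained syntax possible.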
Recognizer to Visitor
If you want to reuse the Recognizer object, you can convert it to a Visitor object.
#![allow(unused)] fn main() { extern crate elyze; #[derive(Debug)] // Define a structure to implement the `Visitor` trait struct OperatorData(Operator); impl<'a> Visitor<'a, u8> for OperatorData { fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> { // Build and apply the recognizer let operator = Recognizer::new(scanner) .try_or(Operator::Add)? .try_or(Operator::Sub)? .finish() // If the recognizer fails, return an error .ok_or(ParseError::UnexpectedToken)?; Ok(OperatorData(operator)) } } }
The behavior is now reusable.
extern crate elyze; fn main() -> ParseResult<()> { let data = b"+"; let mut scanner = Scanner::new(data); // Initialize the recognizer let result = OperatorData::accept(&mut scanner)?.0; dbg!(result); // Operator::Add let data = b"-"; let mut scanner = Scanner::new(data); // Initialize the recognizer let result = OperatorData::accept(&mut scanner)?.0; dbg!(result); // Operator::Sub let data = b"x"; let mut scanner = Scanner::new(data); // Initialize the recognizer let result = OperatorData::accept(&mut scanner); dbg!(result); // Err(UnexpectedToken) Ok(()) }
Recommendations
Because the first match is returned, you should add the most specific patterns first.
Example:
If you match "hell" and "hello", you should add "hello" first. Otherwise, the recognizer will always return "hell".
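The effect of the registration order can be reproduced with plain string prefixes. This is a standalone sketch, not Elyze code; first_match is a hypothetical helper that returns the first pattern, in order, that prefixes the data.

```rust
/// Returns the first pattern, in registration order, that prefixes `data`.
fn first_match<'p>(patterns: &[&'p str], data: &str) -> Option<&'p str> {
    patterns.iter().copied().find(|p| data.starts_with(*p))
}

fn main() {
    let data = "hello world";
    // "hell" is registered first and shadows "hello".
    assert_eq!(first_match(&["hell", "hello"], data), Some("hell"));
    // With the most specific pattern first, "hello" is found.
    assert_eq!(first_match(&["hello", "hell"], data), Some("hello"));
}
```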
Acceptor
The Acceptor is much the same as the Recognizer, but instead of taking Recognizable objects,
it takes Visitor objects.
Let's define two visitors.
#![allow(unused)] fn main() { extern crate elyze; #[derive(Default, Debug)] struct OperatorPlus; #[derive(Default, Debug)] struct OperatorMinus; impl Match<u8> for OperatorPlus { fn is_matching(&self, data: &[u8]) -> (bool, usize) { match_pattern(b"+", data) } fn size(&self) -> usize { 1 } } impl Match<u8> for OperatorMinus { fn is_matching(&self, data: &[u8]) -> (bool, usize) { match_pattern(b"-", data) } fn size(&self) -> usize { 1 } } }
And another, more complex one.
#[derive(Default)]
struct Hello;

#[derive(Default)]
struct Space;

#[derive(Default)]
struct World;

impl Match<u8> for Hello {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `starts_with` avoids a panic when fewer than 5 bytes remain
        (data.starts_with(b"hello"), 5)
    }

    fn size(&self) -> usize {
        5
    }
}

impl Match<u8> for Space {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `first` avoids a panic on empty data
        (data.first() == Some(&b' '), 1)
    }

    fn size(&self) -> usize {
        1
    }
}

impl Match<u8> for World {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        (data.starts_with(b"world"), 5)
    }

    fn size(&self) -> usize {
        5
    }
}

// define a structure to implement the `Visitor` trait
#[derive(Debug)]
struct HelloWorld;

impl<'a> Visitor<'a, u8> for HelloWorld {
    fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> {
        Hello::accept(scanner)?; // accept the word "hello"
        Space::accept(scanner)?; // accept the space character
        World::accept(scanner)?; // accept the word "world"
        // return the `HelloWorld` object
        Ok(HelloWorld)
    }
}
We now have 3 visitors: OperatorPlus, OperatorMinus and HelloWorld.
Because all Acceptor results must be homogeneous, we use an enumeration.
#![allow(unused)] fn main() { extern crate elyze; #[derive(Debug)] enum Operator { Plus(OperatorPlus), Minus(OperatorMinus), HelloWorld(HelloWorld), } }
Then we can use it.
extern crate elyze; use elyze::acceptor::Acceptor; fn main() -> ParseResult<()> { let data = b"+ 2"; let mut scanner = Scanner::new(data); let accepted = Acceptor::new(&mut scanner) .try_or(Operator::Plus)? .try_or(Operator::HelloWorld)? .try_or(Operator::Minus)? .finish() .ok_or(ParseError::UnexpectedToken)?; println!("{:?}", accepted); // + let data = b"- 2"; let mut scanner = Scanner::new(data); let accepted = Acceptor::new(&mut scanner) .try_or(Operator::Plus)? .try_or(Operator::HelloWorld)? .try_or(Operator::Minus)? .finish() .ok_or(ParseError::UnexpectedToken)?; println!("{:?}", accepted); // - let data = b"hello world 2"; let mut scanner = Scanner::new(data); let accepted = Acceptor::new(&mut scanner) .try_or(Operator::Plus)? .try_or(Operator::HelloWorld)? .try_or(Operator::Minus)? .finish() .ok_or(ParseError::UnexpectedToken)?; println!("{:?}", accepted); // HelloWorld Ok(()) }
Reusable Acceptor
Like the Recognizer, an Acceptor can be embedded in a Visitor.
#![allow(unused)] fn main() { extern crate elyze; impl<'a> Visitor<'a, u8> for Operator { fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> { Acceptor::new(scanner) .try_or(Operator::Plus)? .try_or(Operator::HelloWorld)? .try_or(Operator::Minus)? .finish() .ok_or(ParseError::UnexpectedToken) } } }
Which simplifies the code.
extern crate elyze; fn main() -> ParseResult<()> { let data = b"+ 2"; let mut scanner = Scanner::new(data); let accepted = Operator::accept(&mut scanner)?; println!("{:?}", accepted); // + let data = b"- 2"; let mut scanner = Scanner::new(data); let accepted = Operator::accept(&mut scanner)?; println!("{:?}", accepted); // - let data = b"hello world 2"; let mut scanner = Scanner::new(data); let accepted = Operator::accept(&mut scanner)?; println!("{:?}", accepted); // HelloWorld Ok(()) }
Peeker
The Peeker is like the Acceptor, but it doesn't consume the data.
The other difference is that the Peeker returns a Peeking, so the constraint of having
a homogeneous data type no longer applies.
This implies that any kind of Visitor can be used as a Peekable.
Peeking among several alternatives is more complex than accepting them, because you want the shortest possible data slice. Consider this data:
7 * ( 1 + 2 )
Suppose you are peeking for "*" or "+".
If the first registered match won, the result would depend on the order of registration:
- * -> + : 7
- + -> * : 7 * ( 1
We want the peeking to always return the shortest possible data slice, here 7, independently of the order of registration.
Otherwise, inverting the order of registration would give the wrong result with this slice:
( 1 + 2 ) * 7
To allow this behavior, we need to register all the Peekables, then execute the peeking for each one. If the peeked length is shorter than the one held in the internal state, the state is replaced. Otherwise, the internal state is left unchanged.
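The "keep the shortest" rule can be sketched on raw bytes. This is a simplified model and not the Peeker implementation: until and shortest_peek are hypothetical helpers, and each token is a single byte.

```rust
/// Position of the first occurrence of `token` in `data`, if any.
fn until(token: u8, data: &[u8]) -> Option<usize> {
    data.iter().position(|&b| b == token)
}

/// Tries every token and keeps the shortest prefix, whatever the order.
fn shortest_peek<'a>(tokens: &[u8], data: &'a [u8]) -> Option<&'a [u8]> {
    let mut best: Option<usize> = None;
    for &token in tokens {
        if let Some(len) = until(token, data) {
            // Replace the state only when this peek is strictly shorter.
            if best.map_or(true, |current| len < current) {
                best = Some(len);
            }
        }
    }
    best.map(|len| &data[..len])
}

fn main() {
    let data = b"7 * ( 1 + 2 )";
    // Same result whatever the registration order.
    assert_eq!(shortest_peek(b"*+", data), Some(&b"7 "[..]));
    assert_eq!(shortest_peek(b"+*", data), Some(&b"7 "[..]));
    // Here "+" comes first in the data, so it wins.
    assert_eq!(shortest_peek(b"*+", b"( 1 + 2 ) * 7"), Some(&b"( 1 "[..]));
}
```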
Example
Let's take this selection of tokens:
#![allow(unused)] fn main() { enum OperatorTokens { Plus, Times, } }
OperatorTokens implements Visitor, so this is also the case for its variants OperatorTokens::Plus and OperatorTokens::Times.
You can then wrap each variant in Until to create a Peekable.
extern crate elyze;

fn main() -> ParseResult<()> {
    let data = b"7 * ( 1 + 2 )";
    let scanner = Scanner::new(data);
    // create a peeker with a scanner
    let slice = Peeker::new(&scanner)
        // register the peekable until `OperatorTokens::Times`
        .add_peekable(Until::new(OperatorTokens::Times))
        // register the peekable until `OperatorTokens::Plus`
        .add_peekable(Until::new(OperatorTokens::Plus))
        // peek the scanner for the first `OperatorTokens`
        .peek()?;
    // if found
    if let Some(slice) = slice {
        // the slice is the shortest possible
        println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "7 "
    }
    Ok(())
}
If we want something more reusable, we can create a struct that implements Peekable and put it inside the Peeker.
#![allow(unused)] fn main() { extern crate elyze; // define a struct that implements `Peekable` struct FirstOperator; // implement `Peekable` for `FirstOperator` impl<'a> Peekable<'a, u8> for FirstOperator { fn peek(&self, scanner: &Scanner<'a, u8>) -> ParseResult<PeekResult> { Peeker::new(scanner) .add_peekable(Until::new(OperatorTokens::Plus)) .add_peekable(Until::new(OperatorTokens::Times)) .peek() // convert the `Peeking` into a `PeekResult` .map(Into::into) } } }
Which simplifies the code.
extern crate elyze; fn main() -> ParseResult<()> { let data = b"7 * ( 1 + 2 )"; let scanner = Scanner::new(data); let slice = peek(FirstOperator, &scanner)?; if let Some(slice) = slice { println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "7 " } let data = b"7 * ( 1 + 2 )"; let scanner = Scanner::new(data); let slice = peek(FirstOperator, &scanner)?; if let Some(slice) = slice { println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "7 " } let data = b"1 + 2 * 7"; let scanner = Scanner::new(data); let slice = peek(FirstOperator, &scanner)?; if let Some(slice) = slice { println!("{:?}", String::from_utf8_lossy(slice.peeked_slice())); // "1 " } Ok(()) }
Peek from Visitor
It can be useful to peek from a Visitor.
You can implement the Peekable trait yourself, or you can let Elyze do it for you.
We just need to tell Elyze that we want it to implement the Peekable trait using the Visitor pattern.
#![allow(unused)] fn main() { extern crate elyze; use elyze::peek::{DefaultPeekableImplementation, PeekableImplementation}; impl PeekableImplementation for CloseParentheses { type Type = DefaultPeekableImplementation; } }
This will automatically enable the Peekable trait for CloseParentheses.
extern crate elyze; use elyze::errors::ParseResult; use elyze::matcher::Match; use elyze::peek::{peek, Until}; use elyze::scanner::Scanner; // The Default makes the `CloseParentheses` structure // implements the `Visitor` trait #[derive(Default)] struct CloseParentheses; // implementing the `Match` trait needed by Until impl Match<u8> for CloseParentheses { fn is_matching(&self, data: &[u8]) -> (bool, usize) { if data[0] == b')' { (true, 1) } else { (false, 0) } } fn size(&self) -> usize { 1 } } /// Active the Default implementation of Peekable for CloseParentheses impl PeekableImplementation for CloseParentheses { type Type = DefaultPeekableImplementation; } fn main() -> ParseResult<()> { let data = b"( 7 * ( 1 + 2 ) )"; let mut scanner = Scanner::new(data); scanner.bump_by(7); // consumes : ( 7 * ( // peek the first ")" let result = peek(CloseParentheses, &scanner)?; if let Some(peeking) = result { println!( "{:?}", // the peek_slice method returns the slice of recognized data without the end element String::from_utf8_lossy(peeking.peeked_slice()) // 1 + 2 ); } else { println!("not found"); } println!( "scanner: {:?}", // the scanner itself remains unchanged String::from_utf8_lossy(scanner.remaining()) // scanner: " 1 + 2 ) )" ); Ok(()) }
The result of the peeking is the slice until the first element peeked.
Last
If you want the last peekable element in the data, you can use the Last modifier.
The Last modifier takes a Peekable as argument, so it may be a Visitor.
The code remains the same, but the behavior is different.
Instead of returning at the first peeked element, it advances an internal scanner and re-applies the peek operation until it reaches the end of the data.
The last element peeked is returned.
extern crate elyze;

use elyze::errors::ParseResult;
use elyze::matcher::Match;
use elyze::peek::{peek, DefaultPeekableImplementation, Last, PeekableImplementation};
use elyze::scanner::Scanner;

// Deriving Default enables the Visitor implementation for CloseParentheses
#[derive(Default)]
struct CloseParentheses;

/// Enable the PeekSize and Recognizable implementation for CloseParentheses
impl Match<u8> for CloseParentheses {
    fn is_matching(&self, data: &[u8]) -> (bool, usize) {
        // `first` avoids a panic on empty data
        if data.first() == Some(&b')') {
            (true, 1)
        } else {
            (false, 0)
        }
    }

    fn size(&self) -> usize {
        1
    }
}

/// Activate the default implementation of Peekable for CloseParentheses
impl PeekableImplementation for CloseParentheses {
    type Type = DefaultPeekableImplementation;
}

fn main() -> ParseResult<()> {
    let data = b"8 / ( 7 * ( 1 + 2 ) )";
    let mut scanner = Scanner::new(data);
    // consumes "8 / (" to reach the start of the enclosed data
    scanner.bump_by(b"8 / (".len());
    // because CloseParentheses implements the Peekable trait, we can peek it with the modifier Last
    let result = peek(Last::new(CloseParentheses), &scanner)?;
    if let Some(peeking) = result {
        println!(
            "{:?}",
            // peeked_slice returns all the enclosed data up to the last ")",
            // not just up to its first occurrence
            String::from_utf8_lossy(peeking.peeked_slice()) // " 7 * ( 1 + 2 ) "
        );
    }
    Ok(())
}
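The "peek again until the end" loop can be sketched on raw bytes. This is only a model of the idea and not the Last implementation: peek_from and peek_last are hypothetical helpers, and a plain index plays the role of the internal scanner.

```rust
/// Offset of the first `token` at or after `from`, if any.
fn peek_from(token: u8, data: &[u8], from: usize) -> Option<usize> {
    data[from..].iter().position(|&b| b == token).map(|i| from + i)
}

/// Re-applies the peek after each hit and keeps the final one.
fn peek_last(token: u8, data: &[u8]) -> Option<&[u8]> {
    let mut cursor = 0;
    let mut last = None;
    while let Some(pos) = peek_from(token, data, cursor) {
        last = Some(pos);
        cursor = pos + 1; // step past the hit and try again
    }
    // Everything before the last hit is the peeked slice.
    last.map(|pos| &data[..pos])
}

fn main() {
    let data = b"7 * ( 1 + 2 ) )";
    assert_eq!(peek_last(b')', data), Some(&b"7 * ( 1 + 2 ) "[..]));
    assert_eq!(peek_last(b')', b"no group"), None);
}
```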
Separated list
The SeparatedList component is a list of objects separated by a separator.
It is built from two Visitors. One is the element of the list to accept, and the other is the separator.
The SeparatedList tries to accept the element and, on success, pushes it into the list. Then it tries to accept the separator.
It repeats the process until a failure.
If one of the Visitors (element or separator) fails, it returns the list of elements accepted so far.
extern crate elyze; // Define how to match a number struct TokenNumber; impl Match<u8> for TokenNumber { fn is_matching(&self, data: &[u8]) -> (bool, usize) { let mut pos = 0; while pos < data.len() && data[pos].is_ascii_digit() { pos += 1; } (pos > 0, pos) } fn size(&self) -> usize { 0 } } #[derive(Debug)] struct NumberData(usize); impl<'a> Visitor<'a, u8> for NumberData { fn accept(scanner: &mut Scanner<u8>) -> ParseResult<Self> { let slice = recognize_slice(TokenNumber, scanner)?; let number = std::str::from_utf8(slice)?.parse::<usize>()?; Ok(NumberData(number)) } } // Define how to match a separator #[derive(Debug)] struct Separator; impl<'a> Visitor<'a, u8> for Separator { fn accept(scanner: &mut Scanner<u8>) -> ParseResult<Self> { recognize(Token::Tilde, scanner)?; recognize(Token::Tilde, scanner)?; recognize(Token::Tilde, scanner)?; Ok(Separator) } } // Apply to a separated list fn main() -> ParseResult<()> { let data = b"1~~~2~~~3~~~4"; let mut scanner = Scanner::new(data); // define the separated list using types parameters NumberData and Separator let result = SeparatedList::<_, NumberData, Separator>::accept(&mut scanner)? // the inner list can be extracted with the `data` attribute .data; println!("{:?}", result); // Ok([NumberData(1), NumberData(2), NumberData(3), NumberData(4)]) Ok(()) }
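The accept loop can be sketched without Elyze. The code below is a simplified model, not the SeparatedList implementation: accept_number, accept_separator and separated_list are hypothetical helpers working on an index instead of a Scanner. It assumes that an input that does not start with an element gives an empty list, and that a separator must always be followed by an element.

```rust
/// Accepts a run of ASCII digits at `*pos` and advances past it.
fn accept_number(data: &[u8], pos: &mut usize) -> Option<usize> {
    let len = data[*pos..].iter().take_while(|b| b.is_ascii_digit()).count();
    if len == 0 {
        return None;
    }
    let text = std::str::from_utf8(&data[*pos..*pos + len]).ok()?;
    let number = text.parse::<usize>().ok()?;
    *pos += len;
    Some(number)
}

/// Accepts the `~~~` separator at `*pos` and advances past it.
fn accept_separator(data: &[u8], pos: &mut usize) -> bool {
    if data[*pos..].starts_with(b"~~~") {
        *pos += 3;
        true
    } else {
        false
    }
}

/// Element, separator, element, ... until the separator is missing.
fn separated_list(data: &[u8]) -> Result<Vec<usize>, String> {
    let mut pos = 0;
    let mut items = Vec::new();
    // Assumption: no leading element means an empty list.
    match accept_number(data, &mut pos) {
        Some(n) => items.push(n),
        None => return Ok(items),
    }
    while accept_separator(data, &mut pos) {
        // After a separator, an element is mandatory.
        match accept_number(data, &mut pos) {
            Some(n) => items.push(n),
            None => return Err(format!("expected an element at offset {}", pos)),
        }
    }
    Ok(items)
}

fn main() {
    assert_eq!(separated_list(b"1~~~2~~~3~~~4"), Ok(vec![1, 2, 3, 4]));
    assert_eq!(separated_list(b"1"), Ok(vec![1]));
    // A separator with nothing after it is an error in this model.
    assert!(separated_list(b"1~~~2~~~").is_err());
}
```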
Trailing separator
Caution: The separated list expects an element after successfully accepting the separator.
1~~~2~~~3~~~4~~~
This input makes the parse fail.
The data must be cleaned up before using it:
1~~~2~~~3~~~4
To handle this, Elyze exposes a function called get_scanner_without_trailing_separator.
#![allow(unused)] fn main() { extern crate elyze; /// Return a scanner without the trailing separator. /// /// # Arguments /// /// * `element` - The peekable element. /// * `separator` - The peekable separator. /// * `scanner` - The scanner. /// /// # Returns /// /// A `ParseResult` containing a `Scanner` without the trailing separator. pub fn get_scanner_without_trailing_separator<'a, T, P1, P2>( element: P1, separator: P2, scanner: &Scanner<'a, T>, ) -> ParseResult<Scanner<'a, T>> where P1: Peekable<'a, T> + PeekableImplementation<Type = DefaultPeekableImplementation>, P2: Peekable<'a, T> + PeekableImplementation<Type = DefaultPeekableImplementation>; }
It takes two Peekables as arguments: the first is the element and the second is the separator. The third argument is a reference to a Scanner.
This function returns a Scanner truncated so that it does not include the trailing separator.
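On plain byte slices, the truncation amounts to removing one separator at the end of the data. The sketch below only illustrates that idea: without_trailing_separator is a hypothetical helper, while the real function works with Peekables and returns a Scanner.

```rust
/// Returns `data` without one trailing `separator`, if there is one.
fn without_trailing_separator<'a>(data: &'a [u8], separator: &[u8]) -> &'a [u8] {
    data.strip_suffix(separator).unwrap_or(data)
}

fn main() {
    // The trailing separator is dropped.
    assert_eq!(without_trailing_separator(b"1~~~2~~~", b"~~~"), b"1~~~2");
    // Data without a trailing separator is returned untouched.
    assert_eq!(without_trailing_separator(b"1~~~2", b"~~~"), b"1~~~2");
    assert_eq!(without_trailing_separator(b"", b"~~~"), b"");
}
```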
Example
We can create a visitor to use it.
#![allow(unused)] fn main() { extern crate elyze; use elyze::separated_list::get_scanner_without_trailing_separator; #[derive(Debug)] struct NumberList { data: Vec<usize>, } impl<'a> Visitor<'a, u8> for NumberList { fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> { // get the scanner without the trailing separator let mut data_scanner = get_scanner_without_trailing_separator(TokenNumber, Separator, &scanner)?; // accept the separated list and extract the data let data = SeparatedList::<u8, Number<usize>, Separator>::accept(&mut data_scanner)? .data .into_iter() .map(|x| x.0) .collect::<Vec<usize>>(); // clean up the scanner because all data has been extracted scanner.bump_by(scanner.data().len()); Ok(NumberList { data }) } } }
This cleanup allows us to handle all problematic cases.
extern crate elyze; fn main() -> ParseResult<()> { // list of elements separated by a separator let data = b"1~~~2~~~3~~~4"; let mut scanner = Scanner::new(data); let result = NumberList::accept(&mut scanner)?; println!("{:?}", result); // NumberList { data: [1, 2, 3, 4] } // list of elements separated by a separator with trailing separator let data = b"1~~~2~~~3~~~4~~~"; let mut scanner = Scanner::new(data); let result = NumberList::accept(&mut scanner)?; println!("{:?}", result); // NumberList { data: [1, 2, 3, 4] } // list of 1 element with trailing separator let data = b"1~~~"; let mut scanner = Scanner::new(data); let result = NumberList::accept(&mut scanner)?; println!("{:?}", result); // NumberList { data: [1] } // list of 1 element without trailing separator let data = b"1"; let mut scanner = Scanner::new(data); let result = NumberList::accept(&mut scanner)?; println!("{:?}", result); // NumberList { data: [1] } // list of 0 elements let data = b""; let mut scanner = Scanner::new(data); let result = NumberList::accept(&mut scanner)?; println!("{:?}", result); // NumberList { data: [] } // bad data let data = b"bad~~~"; let mut scanner = Scanner::new(data); let result = NumberList::accept(&mut scanner); println!("{:?}", result); // Err(UnexpectedToken) Ok(()) }
Bytes
Although Elyze is meant to be used with any kind of data slice, you'll probably want to use it with bytes.
There are some built-in components available out of the box to help you parse strings or bytes.
- Tokens : A collection of well-known patterns, already acceptable and peekable.
- Delimited Groups : Delimited groups allow matching a range of bytes between delimiters.
Tokens
There are some patterns that we can recognize on bytes. All common symbols are grouped in a Token enumeration.
#![allow(unused)] fn main() { pub enum Token { /// The "(" character OpenParen, /// The `)` character CloseParen, /// The `,` character Comma, /// The `;` character Semicolon, /// The `:` character Colon, /// The whitespace character Whitespace, /// The `>` character GreaterThan, /// The `<` character LessThan, /// The `!` character Exclamation, /// The `'` character Quote, /// The `"` character DoubleQuote, /// The `=` character Equal, /// The `+` character Plus, /// The `-` character Dash, /// The `/` character Slash, /// The `*` character Star, /// The `%` character Percent, /// The `&` character Ampersand, /// The `|` character Pipe, /// The `^` character Caret, /// The `~` character Tilde, /// The `.` character Dot, /// The `?` character Question, /// The `@` character At, /// The `#` character Hash, /// The `$` character Dollar, /// The `\\` character Backslash, /// The `_` character Underscore, /// The `#` character Sharp, /// The `\n` character Ln, /// The `\r` character Cr, /// The `\t` character Tab, /// The `\r\n` character CrLn, } }
This enumeration already implements the Match, Recognizable, Visitor and Peekable traits.
extern crate elyze; use elyze::bytes::token::Token; use elyze::errors::{ParseError, ParseResult}; use elyze::peek::{peek, Last}; use elyze::recognizer::{recognize, Recognizer}; use elyze::scanner::Scanner; use elyze::visitor::Visitor; fn main() -> ParseResult<()> { let data = b"+-*"; // use recognize let mut scanner = Scanner::new(data); let recognized = recognize(Token::Plus, &mut scanner)?; assert_eq!(recognized, Token::Plus); // use the recognizer let mut scanner = Scanner::new(data); let recognized = Recognizer::new(&mut scanner) .try_or(Token::Dash)? .try_or(Token::Plus)? .try_or(Token::Star)? .finish() .ok_or(ParseError::UnexpectedToken)?; assert_eq!(recognized, Token::Plus); // use the visitor let mut scanner = Scanner::new(data); let accepted = Token::accept(&mut scanner)?; assert_eq!(accepted, Token::Plus); // use peek let mut scanner = Scanner::new(data); let peeked = peek(Token::Dash, &mut scanner)?; if let Some(peeked) = peeked { assert_eq!(peeked.peeked_slice(), b"+"); } // last token let data = b" 8 + ( 7 * ( 1 + 2 ) )"; let mut scanner = Scanner::new(data); let peeked = peek(Last::new(Token::CloseParen), &mut scanner)?; if let Some(peeked) = peeked { assert_eq!(peeked.peeked_slice(), b" 8 + ( 7 * ( 1 + 2 ) "); } Ok(()) }
Separated List
By combining all these implementations, we can build a separated list of tokens that excludes the comma token.
#![allow(unused)] fn main() { extern crate elyze; // define a structure to implement Peekable // using the Visitor pattern excluding the comma token struct AnyTokenExceptComma; // Enable the Peekable trait using the Visitor pattern impl PeekableImplementation for AnyTokenExceptComma { type Type = DefaultPeekableImplementation; } // Define the PeekSize trait impl PeekSize<u8> for AnyTokenExceptComma { fn peek_size(&self) -> usize { // The size is not important can be default to 0 0 } } // Define the Visitor trait for the AnyTokenExceptComma structure // excluding the comma token impl<'a> Visitor<'a, u8> for AnyTokenExceptComma { fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> { let token = Token::accept(scanner)?; match token { Token::Comma => Err(ParseError::UnexpectedToken), _ => Ok(AnyTokenExceptComma), } } } // Define a structure to implement Visitor #[derive(Debug, PartialEq)] struct TokenData(Token); impl<'a> Visitor<'a, u8> for TokenData { fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> { let token = Token::accept(scanner)?; match token { Token::Comma => Err(ParseError::UnexpectedToken), _ => Ok(TokenData(token)), } } } // Define a structure to implement Visitor for the separator struct SeparatorComma; impl<'a> Visitor<'a, u8> for SeparatorComma { fn accept(scanner: &mut Scanner<'a, u8>) -> ParseResult<Self> { recognize(Token::Comma, scanner)?; Ok(SeparatorComma) } } }
Then we can build the separated list
extern crate elyze; fn main() -> ParseResult<()> { let data = b"*,-,+,/,"; let scanner = Scanner::new(data); // clean up the data of its trailing comma let mut data_scanner = get_scanner_without_trailing_separator(AnyTokenExceptComma, Token::Comma, &scanner)?; assert_eq!(data_scanner.data(), b"*,-,+,/"); // data without a trailing comma // accept the separated list let list = SeparatedList::<u8, TokenData, SeparatorComma>::accept(&mut data_scanner)?; assert_eq!( list.data, vec![ TokenData(Token::Star), TokenData(Token::Dash), TokenData(Token::Plus), TokenData(Token::Slash), ] ); Ok(()) }
Delimited Groups
There is a special case of peeking when you want to get data embedded in a delimited group.
For example, you want to get the contents of a parentheses-delimited expression.
( 1 + 2 )
^
you're here
You want the inner data
1 + 2
You have to recognize the opening parenthesis, then search for the closing parenthesis. Peeking is the best tool for this.
But you also have to deal with balanced expressions: sometimes you will have nested parentheses-delimited expressions.
( ( 1 + 2 ) + 3 )
^
you're here
If you stop at the first closing parenthesis, you will get ( ( 1 + 2 ). So the inner expression will be:
( 1 + 2 .
That's because your group is unbalanced. There are more opening parentheses than closing parentheses.
To make it work, you have to keep track of the number of opening and closing parentheses.
A counter is the perfect solution. Initially at 0, it is increased when you find an opening parenthesis and decreased when you find a closing parenthesis.
The algorithm stops when the counter is back to 0. Because we always match the opening parenthesis as the first byte, the balancing starts at 1.
( ( 1 + 2 ) + 3 )
^
b: 1
The next opening parenthesis increments the balancing by 1. Because the balancing equals 2, the algorithm continues.
( ( 1 + 2 ) + 3 )
  ^
  b: 2
The next recognized element is a closing parenthesis. The balancing is decreased by 1. Because the balancing equals 1, the algorithm continues.
( ( 1 + 2 ) + 3 )
          ^
          b: 1
The next recognized element is also a closing parenthesis. The balancing is decreased by 1. The algorithm stops because the balancing is now 0.
( ( 1 + 2 ) + 3 )
                ^
                b: 0
The real slice of data is:
( 1 + 2 ) + 3
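The balancing walk above can be written as a short function. This is a sketch of the algorithm as described here, not the Elyze implementation; peek_parentheses is a hypothetical helper, and escaping is not handled.

```rust
/// Returns the content between the opening parenthesis at `data[0]`
/// and its matching closing parenthesis.
fn peek_parentheses(data: &[u8]) -> Option<&[u8]> {
    if data.first() != Some(&b'(') {
        return None;
    }
    // The opening parenthesis is already matched, so balancing starts at 1.
    let mut balance = 1usize;
    for (i, &byte) in data.iter().enumerate().skip(1) {
        match byte {
            b'(' => balance += 1,
            b')' => balance -= 1,
            _ => {}
        }
        if balance == 0 {
            // Exclude both delimiters from the result.
            return Some(&data[1..i]);
        }
    }
    None // unbalanced: the group never closes
}

fn main() {
    assert_eq!(
        peek_parentheses(b"( ( 1 + 2 ) + 3 )"),
        Some(&b" ( 1 + 2 ) + 3 "[..])
    );
    assert_eq!(peek_parentheses(b"( 1 + 2"), None);
}
```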
GroupKind
Elyze defines a GroupKind enumeration that implements the Peekable trait.
This one allows peeking into a delimited group.
Parentheses Group
One of the GroupKind variants is the parentheses group, GroupKind::Parenthesis, which behaves as explained in the introduction.
extern crate elyze; use elyze::bytes::components::groups::GroupKind; use elyze::peek::peek; use elyze::scanner::Scanner; fn main() -> ParseResult<()> { let data = b"( 5 + 3 - ( 10 * 8 ) ) + 54"; let mut tokenizer = Scanner::new(data); let result = peek(GroupKind::Parenthesis, &mut tokenizer)?; if let Some(peeked) = result { assert_eq!(peeked.peeked_slice(), b" 5 + 3 - ( 10 * 8 ) "); } Ok(()) }
It supports escaping of parentheses.
If your data looks like this:
( ( 1 + 2 ) \) + 3 )
The escaped closing parenthesis \) is ignored by the balancing, so the data is parsed correctly. The peeked slice still includes
the escaped characters:
( 1 + 2 ) \) + 3
The escaping is done with the \ character.
extern crate elyze; use elyze::bytes::components::groups::GroupKind; use elyze::peek::peek; use elyze::scanner::Scanner; fn main() -> ParseResult<()> { let data = b"( 5 + 3 - \\( ( 10 * 8 \\)) \\)) + 54"; let mut tokenizer = Scanner::new(data); let result = peek(GroupKind::Parenthesis, &mut tokenizer)?; if let Some(peeked) = result { assert_eq!(peeked.peeked_slice(), b" 5 + 3 - \\( ( 10 * 8 \\)) \\)"); } Ok(()) }
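Escaping fits into the balancing walk by skipping the byte that follows a backslash. The function below is a standalone sketch under that assumption; peek_parentheses_escaped is a hypothetical helper, not the Elyze implementation.

```rust
/// Balanced-parentheses peek where `\` hides the next byte from the balancing.
fn peek_parentheses_escaped(data: &[u8]) -> Option<&[u8]> {
    if data.first() != Some(&b'(') {
        return None;
    }
    let mut balance = 1usize;
    let mut i = 1;
    while i < data.len() {
        match data[i] {
            // Skip the escaped byte: it never changes the balancing.
            b'\\' => i += 1,
            b'(' => balance += 1,
            b')' => balance -= 1,
            _ => {}
        }
        if balance == 0 {
            // The escaped bytes stay in the returned slice.
            return Some(&data[1..i]);
        }
        i += 1;
    }
    None
}

fn main() {
    assert_eq!(
        peek_parentheses_escaped(b"( ( 1 + 2 ) \\) + 3 )"),
        Some(&b" ( 1 + 2 ) \\) + 3 "[..])
    );
}
```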
Quoted Groups
In addition, Elyze also supports quoted groups.
extern crate elyze; use elyze::bytes::components::groups::GroupKind; use elyze::peek::peek; use elyze::scanner::Scanner; fn main() -> ParseResult<()> { let data = b"'hello world' data"; let mut tokenizer = Scanner::new(data); let result = peek(GroupKind::Quotes, &mut tokenizer)?; if let Some(peeked) = result { assert_eq!(peeked.peeked_slice(), b"hello world"); } Ok(()) }
Because quoted groups use the same symbol for opening and closing the group, nested groups can't be detected. But you can escape the quote character if you want.
extern crate elyze; use elyze::bytes::components::groups::GroupKind; use elyze::peek::peek; use elyze::scanner::Scanner; fn main() -> ParseResult<()> { let data = "'I\\'m a quoted data' - 'yes me too'"; let mut tokenizer = Scanner::new(data.as_bytes()); let result = peek(GroupKind::Quotes, &mut tokenizer).expect("failed to parse"); if let Some(peeked) = result { assert_eq!(peeked.peeked_slice(), b"I\\'m a quoted data"); } Ok(()) }
The same can be done with double quotes.
extern crate elyze; use elyze::bytes::components::groups::GroupKind; use elyze::peek::peek; use elyze::scanner::Scanner; fn main() -> ParseResult<()> { let data = "\"I'm a quoted data\" - \"yes me too\""; let mut tokenizer = Scanner::new(data.as_bytes()); let result = peek(GroupKind::DoubleQuotes, &mut tokenizer).expect("failed to parse"); if let Some(peeked) = result { assert_eq!(peeked.peeked_slice(), b"I'm a quoted data"); } Ok(()) }
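Since a quoted group cannot nest, no counter is needed: the search stops at the first unescaped quote. The sketch below models this for single and double quotes; peek_quoted is a hypothetical helper, not the Elyze implementation.

```rust
/// Returns the content between the quote at `data[0]` and the next
/// unescaped occurrence of the same quote byte.
fn peek_quoted(data: &[u8], quote: u8) -> Option<&[u8]> {
    if data.first() != Some(&quote) {
        return None;
    }
    let mut i = 1;
    while i < data.len() {
        match data[i] {
            b'\\' => i += 1, // skip the escaped byte
            b if b == quote => return Some(&data[1..i]),
            _ => {}
        }
        i += 1;
    }
    None // the closing quote is missing
}

fn main() {
    assert_eq!(
        peek_quoted(b"'I\\'m a quoted data' - 'yes me too'", b'\''),
        Some(&b"I\\'m a quoted data"[..])
    );
    assert_eq!(
        peek_quoted(b"\"I'm a quoted data\" - \"yes me too\"", b'"'),
        Some(&b"I'm a quoted data"[..])
    );
}
```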