Enum regex_automata::DenseDFA [−][src]
A dense table-based deterministic finite automaton (DFA).
A dense DFA represents the core matching primitive in this crate. That is, logically, all DFAs have a single start state, one or more match states and a transition table that maps the current state and the current byte of input to the next state. A DFA can use this information to implement fast searching. In particular, the use of a dense DFA generally makes the trade off that match speed is the most valuable characteristic, even if building the regex may take significant time and space. As such, the processing of every byte of input is done with a small constant number of operations that does not vary with the pattern, its size or the size of the alphabet. If your needs don’t line up with this trade off, then a dense DFA may not be an adequate solution to your problem.
In contrast, a sparse DFA makes the opposite trade off: it uses less space but will execute a variable number of instructions per byte at match time, which makes it slower for matching.
A DFA can be built using the default configuration via the
DenseDFA::new
constructor. Otherwise,
one can configure various aspects via the
dense::Builder
.
A single DFA fundamentally supports the following operations:
- Detection of a match.
- Location of the end of the first possible match.
- Location of the end of the leftmost-first match.
A notable absence from the above list of capabilities is the location of
the start of a match. In order to provide both the start and end of a
match, two DFAs are required. This functionality is provided by a
Regex
, which can be built with its basic
constructor, Regex::new
, or with
a RegexBuilder
.
State size
A DenseDFA
has two type parameters, T
and S
. T
corresponds to
the type of the DFA’s transition table while S
corresponds to the
representation used for the DFA’s state identifiers as described by the
StateID
trait. This type parameter is typically
usize
, but other valid choices provided by this crate include u8
,
u16
, u32
and u64
. The primary reason for choosing a different state
identifier representation than the default is to reduce the amount of
memory used by a DFA. Note though, that if the chosen representation cannot
accommodate the size of your DFA, then building the DFA will fail and
return an error.
While the reduction in heap memory used by a DFA is one reason for choosing
a smaller state identifier representation, another possible reason is for
decreasing the serialization size of a DFA, as returned by
to_bytes_little_endian
,
to_bytes_big_endian
or
to_bytes_native_endian
.
The type of the transition table is typically either Vec<S>
or &[S]
,
depending on where the transition table is stored.
Variants
This DFA is defined as a non-exhaustive enumeration of different types of
dense DFAs. All of these dense DFAs use the same internal representation
for the transition table, but they vary in how the transition table is
read. A DFA’s specific variant depends on the configuration options set via
dense::Builder
. The default variant is
PremultipliedByteClass
.
The DFA
trait
This type implements the DFA
trait, which means it
can be used for searching. For example:
use regex_automata::{DFA, DenseDFA}; let dfa = DenseDFA::new("foo[0-9]+")?; assert_eq!(Some(8), dfa.find(b"foo12345"));
The DFA
trait also provides an assortment of other lower level methods
for DFAs, such as start_state
and next_state
. While these are correctly
implemented, it is an anti-pattern to use them in performance sensitive
code on the DenseDFA
type directly. Namely, each implementation requires
a branch to determine which type of dense DFA is being used. Instead,
this branch should be pushed up a layer in the code since walking the
transitions of a DFA is usually a hot path. If you do need to use these
lower level methods in performance critical code, then you should match on
the variants of this DFA and use each variant’s implementation of the DFA
trait directly.
Variants
Standard(Standard<T, S>)
A standard DFA that does not use premultiplication or byte classes.
ByteClass(ByteClass<T, S>)
A DFA that shrinks its alphabet to a set of equivalence classes instead of using all possible byte values. Any two bytes belong to the same equivalence class if and only if they can be used interchangeably anywhere in the DFA while never discriminating between a match and a non-match.
This type of DFA can result in significant space reduction with a very small match time performance penalty.
Premultiplied(Premultiplied<T, S>)
A DFA that premultiplies all of its state identifiers in its transition table. This saves an instruction per byte at match time which improves search performance.
The only downside of premultiplication is that it may prevent one from using a smaller state identifier representation than you otherwise could.
PremultipliedByteClass(PremultipliedByteClass<T, S>)
The default configuration of a DFA, which uses byte classes and premultiplies its state identifiers.
Implementations
impl<T: AsRef<[S]>, S: StateID> DenseDFA<T, S>
[src]
pub fn as_ref<'a>(&'a self) -> DenseDFA<&'a [S], S>
[src]
Cheaply return a borrowed version of this dense DFA. Specifically, the
DFA returned always uses &[S]
for its transition table while keeping
the same state identifier representation.
pub fn memory_usage(&self) -> usize
[src]
Returns the memory usage, in bytes, of this DFA.
The memory usage is computed based on the number of bytes used to represent this DFA’s transition table. This corresponds to heap memory usage.
This does not include the stack size used up by this DFA. To
compute that, used std::mem::size_of::<DenseDFA>()
.
impl<'a, S: StateID> DenseDFA<&'a [S], S>
[src]
pub unsafe fn from_bytes(buf: &'a [u8]) -> DenseDFA<&'a [S], S>
[src]
Deserialize a DFA with a specific state identifier representation.
Deserializing a DFA using this routine will never allocate heap memory. This is also guaranteed to be a constant time operation that does not vary with the size of the DFA.
The bytes given should be generated by the serialization of a DFA with
either the
to_bytes_little_endian
method or the
to_bytes_big_endian
endian, depending on the endianness of the machine you are
deserializing this DFA from.
If the state identifier representation is usize
, then deserialization
is dependent on the pointer size. For this reason, it is best to
serialize DFAs using a fixed size representation for your state
identifiers, such as u8
, u16
, u32
or u64
.
Panics
The bytes given should be trusted. In particular, if the bytes are not a valid serialization of a DFA, or if the given bytes are not aligned to an 8 byte boundary, or if the endianness of the serialized bytes is different than the endianness of the machine that is deserializing the DFA, then this routine will panic. Moreover, it is possible for this deserialization routine to succeed even if the given bytes do not represent a valid serialized dense DFA.
Safety
This routine is unsafe because it permits callers to provide an arbitrary transition table with possibly incorrect transitions. While the various serialization routines will never return an incorrect transition table, there is no guarantee that the bytes provided here are correct. While deserialization does many checks (as documented above in the panic conditions), this routine does not check that the transition table is correct. Given an incorrect transition table, it is possible for the search routines to access out-of-bounds memory because of explicit bounds check elision.
Example
This example shows how to serialize a DFA to raw bytes, deserialize it
and then use it for searching. Note that we first convert the DFA to
using u16
for its state identifier representation before serializing
it. While this isn’t strictly necessary, it’s good practice in order to
decrease the size of the DFA and to avoid platform specific pitfalls
such as differing pointer sizes.
use regex_automata::{DFA, DenseDFA}; let initial = DenseDFA::new("foo[0-9]+")?; let bytes = initial.to_u16()?.to_bytes_native_endian()?; let dfa: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&bytes) }; assert_eq!(Some(8), dfa.find(b"foo12345"));
Trait Implementations
impl<T: Clone + AsRef<[S]>, S: Clone + StateID> Clone for DenseDFA<T, S>
[src]
fn clone(&self) -> DenseDFA<T, S>
[src]
pub fn clone_from(&mut self, source: &Self)
1.0.0[src]
impl<T: AsRef<[S]>, S: StateID> DFA for DenseDFA<T, S>
[src]
type ID = S
The representation used for state identifiers in this DFA. Read more
fn start_state(&self) -> S
[src]
fn is_match_state(&self, id: S) -> bool
[src]
fn is_dead_state(&self, id: S) -> bool
[src]
fn is_match_or_dead_state(&self, id: S) -> bool
[src]
fn is_anchored(&self) -> bool
[src]
fn next_state(&self, current: S, input: u8) -> S
[src]
unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S
[src]
fn is_match_at(&self, bytes: &[u8], start: usize) -> bool
[src]
fn shortest_match_at(&self, bytes: &[u8], start: usize) -> Option<usize>
[src]
fn find_at(&self, bytes: &[u8], start: usize) -> Option<usize>
[src]
fn rfind_at(&self, bytes: &[u8], start: usize) -> Option<usize>
[src]
fn is_match(&self, bytes: &[u8]) -> bool
[src]
fn shortest_match(&self, bytes: &[u8]) -> Option<usize>
[src]
fn find(&self, bytes: &[u8]) -> Option<usize>
[src]
fn rfind(&self, bytes: &[u8]) -> Option<usize>
[src]
impl<T: Debug + AsRef<[S]>, S: Debug + StateID> Debug for DenseDFA<T, S>
[src]
Auto Trait Implementations
impl<T, S> Send for DenseDFA<T, S> where
S: Send,
T: Send,
S: Send,
T: Send,
impl<T, S> Sync for DenseDFA<T, S> where
S: Sync,
T: Sync,
S: Sync,
T: Sync,
impl<T, S> Unpin for DenseDFA<T, S> where
S: Unpin,
T: Unpin,
S: Unpin,
T: Unpin,
Blanket Implementations
impl<T> Any for T where
T: 'static + ?Sized,
[src]
T: 'static + ?Sized,
impl<T> Borrow<T> for T where
T: ?Sized,
[src]
T: ?Sized,
impl<T> BorrowMut<T> for T where
T: ?Sized,
[src]
T: ?Sized,
pub fn borrow_mut(&mut self) -> &mut T
[src]
impl<T> From<T> for T
[src]
impl<T, U> Into<U> for T where
U: From<T>,
[src]
U: From<T>,
impl<T, U> TryFrom<U> for T where
U: Into<T>,
[src]
U: Into<T>,
type Error = Infallible
The type returned in the event of a conversion error.
pub fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>
[src]
impl<T, U> TryInto<U> for T where
U: TryFrom<T>,
[src]
U: TryFrom<T>,