libsemigroups  v3.0.0
C++ library for semigroups and monoids
Loading...
Searching...
No Matches
AhoCorasick

Defined in aho-corasick.hpp.

This class implements a trie based data structure with suffix links to be used with the Aho-Corasick dictionary searching algorithm. An introduction to this algorithm can be found at:

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

The implementation of AhoCorasick uses two different types of node; active and inactive. An active node is a node that is currently a node in the trie. An inactive node is a node that used to be part of the trie, but has since been removed. It may later become active again after being reinitialised, and exists as a way of minimising how frequently memory needs to be allocated and deallocated for nodes.

Several helper functions are provided in the aho_corasick namespace.

Public Types

using const_iterator = typename word_type::const_iterator
 Used for word_type iterators.
 
using index_type = size_t
 Alias for the index of a node in the trie.
 

Public Member Functions

 AhoCorasick ()
 Construct an empty AhoCorasick.
 
 AhoCorasick (AhoCorasick &&)=default
 Default move constructor.
 
 AhoCorasick (AhoCorasick const &)=default
 Default copy constructor.
 
rx::iterator_range< std::set< index_type >::const_iteratoractive_nodes () const
 Return the active nodes.
 
template<typename Iterator>
index_type add_word (Iterator first, Iterator last)
 Check and add a word to the trie.
 
template<typename Iterator>
index_type add_word_no_checks (Iterator first, Iterator last)
 Add a word to the trie.
 
std::set< index_type >::iterator begin_nodes () const noexcept
 Return an iterator pointing to the first active node.
 
std::set< index_type >::const_iterator cbegin_nodes () const noexcept
 Return a const iterator pointing to the first active node.
 
std::set< index_type >::const_iterator cend_nodes () const noexcept
 Return a const iterator pointing one beyond the last active node.
 
index_type child (index_type parent, letter_type letter) const
 Return the child of parent with edge-label letter.
 
index_type child_no_checks (index_type parent, letter_type letter) const
 Return the child of parent with edge-label letter.
 
std::set< index_type >::iterator end_nodes () const noexcept
 Return an iterator pointing one beyond the last active node.
 
size_t height (index_type i) const
 Calculate the height of a node.
 
size_t height_no_checks (index_type i) const
 Calculate the height of a node.
 
AhoCorasickinit ()
 Reinitialise an existing AhoCorasick object.
 
Node const & node (index_type i) const
 Return the node given an index.
 
Node const & node_no_checks (index_type i) const
 Return the node given an index.
 
size_t number_of_nodes () const noexcept
 Returns the number of nodes in the trie.
 
AhoCorasickoperator= (AhoCorasick &&)=default
 Default move assignment.
 
AhoCorasickoperator= (AhoCorasick const &)=default
 Default copy assignment.
 
template<typename Iterator>
index_type rm_word (Iterator first, Iterator last)
 Check and add a word to the trie.
 
template<typename Iterator>
index_type rm_word_no_checks (Iterator first, Iterator last)
 Remove a word from the trie.
 
word_type signature (index_type i) const
 Find the signature of a node (out-of-place).
 
void signature (word_type &w, index_type i) const
 Find the signature of a node (in-place).
 
word_type signature_no_checks (index_type i) const
 Find the signature of a node (out-of-place).
 
void signature_no_checks (word_type &w, index_type i) const
 Find the signature of a node (in-place).
 
index_type suffix_link (index_type current) const
 Calculate the index of the suffix link of a node.
 
index_type suffix_link_no_checks (index_type current) const
 Calculate the index of the suffix link of a node.
 
void throw_if_node_index_not_active (index_type i) const
 Check if an index corresponds to a node currently in the trie.
 
void throw_if_node_index_out_of_range (index_type i) const
 Check if an index corresponds to a node.
 
index_type traverse (index_type current, letter_type a) const
 Traverse the trie using suffix links where necessary.
 
index_type traverse_no_checks (index_type current, letter_type a) const
 Traverse the trie using suffix links where necessary.
 

Constructor & Destructor Documentation

◆ AhoCorasick() [1/3]

Construct an AhoCorasick containing only the root that corresponds to the empty word \(\varepsilon\).

◆ AhoCorasick() [2/3]

AhoCorasick ( AhoCorasick const & )
default

Default copy constructor.

◆ AhoCorasick() [3/3]

AhoCorasick ( AhoCorasick && )
default

Default move constructor.

Member Function Documentation

◆ active_nodes()

rx::iterator_range< std::set< index_type >::const_iterator > active_nodes ( ) const
inlinenodiscard

This function returns the active nodes of the trie.

Returns
A value of type rx::iterator_range<std::set<index_type>::const_iterator>.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Constant.

◆ add_word()

template<typename Iterator>
index_type add_word ( Iterator first,
Iterator last )

This function does the same as add_word_no_checks(Iterator, Iterator) after first checking that the word corresponding to first and last does not correspond to an existing terminal node in the trie.

Exceptions
LibsemigroupsExceptionif the word corresponding to first and last corresponds to an existing terminal node in the trie.
See also
add_word_no_checks.

◆ add_word_no_checks()

template<typename Iterator>
index_type add_word_no_checks ( Iterator first,
Iterator last )

Calling this function immediately adds the given word to the trie, and makes the final node on the path labelled by this word terminal (if it wasn't already). After adding a word, existing suffix links become invalid. If an identical word has already been added to the trie, then this function does nothing. If first == last, then this function does nothing.

Template Parameters
Iteratorthe type of the 1st and 2nd parameters.
Parameters
firstiterator pointing to the first letter of the word to add.
lastone beyond the last letter of the word to add.
Returns
An index_type corresponding to the final node added to the trie. This node will have a signature equal to that of the given word.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the distance between first and last.
See also
signature

◆ begin_nodes()

std::set< index_type >::iterator begin_nodes ( ) const
inlinenodiscardnoexcept

This function returns an iterator pointing to the first active node in the trie.

Returns
A value of type std::set<index_type>::iterator.
Exceptions
This function is noexcept and is guaranteed never to throw.
Complexity
Constant.

◆ cbegin_nodes()

std::set< index_type >::const_iterator cbegin_nodes ( ) const
inlinenodiscardnoexcept

This function returns a const iterator pointing to the first active node in the trie.

Returns
A value of type std::set<index_type>::const_iterator.
Exceptions
This function is noexcept and is guaranteed never to throw.
Complexity
Constant.

◆ cend_nodes()

std::set< index_type >::const_iterator cend_nodes ( ) const
inlinenodiscardnoexcept

This function returns a const iterator pointing one beyond the last active node in the trie.

Returns
A value of type std::set<index_type>::const_iterator.
Exceptions
This function is noexcept and is guaranteed never to throw.
Complexity
Constant.

◆ child()

index_type child ( index_type parent,
letter_type letter ) const
inlinenodiscard

After validating parent, this function performs the same as child_no_checks(parent, letter).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_not_active(parent) throws.
See also
child_no_checks, throw_if_node_index_not_active.

◆ child_no_checks()

index_type child_no_checks ( index_type parent,
letter_type letter ) const
inlinenodiscard

This function returns the index of the child of the node with index parent along the edge labelled by letter. If no such child exists, UNDEFINED is returned.

Parameters
parentthe index of the node whose child is sought.
letterthe edge-label connecting the parent to the desired child.
Returns
A value of type index_type.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Constant.
Warning
This function does no checks on its arguments whatsoever. In particular, if the index parent is greater than the number of nodes that have ever been created, then bad things will happen.

◆ end_nodes()

std::set< index_type >::iterator end_nodes ( ) const
inlinenodiscardnoexcept

This function returns an iterator pointing one beyond the last active node in the trie.

Returns
A value of type std::set<index_type>::iterator.
Exceptions
This function is noexcept and is guaranteed never to throw.
Complexity
Constant.

◆ height()

size_t height ( index_type i) const
inlinenodiscard

After validating i, this function performs the same as height_no_checks(i).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_not_active(i) throws.
See also
height_no_checks, throw_if_node_index_not_active.

◆ height_no_checks()

size_t height_no_checks ( index_type i) const
nodiscard

Calculate the height of a node.

Parameters
ithe index of the node whose height is sought.
Returns
A value of type size_t.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the return value which is, at worst, the maximum length of a word in the trie.
Warning
This function does no checks on its arguments whatsoever. In particular, if the index i is greater than the number of nodes that have ever been created, then bad things will happen.

◆ init()

AhoCorasick & init ( )

This function puts an AhoCorasick object back into the same state as if it had been newly default constructed.

Returns
A reference to this.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the number of nodes in the trie.

◆ node()

Node const & node ( index_type i) const
inlinenodiscard

After validating i, this function performs the same as node_no_checks(i).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_out_of_range(i) throws.
See also
node_no_checks, throw_if_node_index_out_of_range.

◆ node_no_checks()

Node const & node_no_checks ( index_type i) const
inlinenodiscard

This function returns the node stored in the trie given by the index i.

Parameters
ithe index of the node to return.
Returns
A value of type Node.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Constant.
Note
The node returned by this function may not represent a node presently stored in the trie. See throw_if_node_index_not_active.
Warning
This function does no checks on its arguments whatsoever. In particular, if the index i is greater than the number of nodes that have ever been created, then bad things will happen.

◆ number_of_nodes()

size_t number_of_nodes ( ) const
inlinenodiscardnoexcept

This function returns the number of nodes in the trie.

Returns
A value of type size_t.
Exceptions
This function is noexcept and is guaranteed never to throw.
Complexity
Constant.

◆ operator=() [1/2]

AhoCorasick & operator= ( AhoCorasick && )
default

Default move assignment.

◆ operator=() [2/2]

AhoCorasick & operator= ( AhoCorasick const & )
default

Default copy assignment.

◆ rm_word()

template<typename Iterator>
index_type rm_word ( Iterator first,
Iterator last )

This function does the same as rm_word_no_checks(Iterator, Iterator) after first checking that the word corresponding to first and last is terminal node in the trie.

Exceptions
LibsemigroupsExceptionif the word corresponding to first and last does not correspond to an existing terminal node in the trie.
See also
rm_word_no_checks.

◆ rm_word_no_checks()

template<typename Iterator>
index_type rm_word_no_checks ( Iterator first,
Iterator last )

From the trie, remove each node of the given word that is not part of the prefix of a different word.

If the given word \(W\) corresponds to a terminal node with no children, then calling this function removes the nodes \(n_i\) from the trie that correspond to the largest suffix \(W\), such that each \(n_i\) has either zero children or one. After this, existing suffix links become invalid.

If \(W\) corresponds to a terminal node \(n\) with children, then calling this function makes \(n\) not terminal.

If \(W\) does not correspond to a terminal node, then calling this function does nothing.

Template Parameters
Iteratorthe type of the 1st and 2nd parameters.
Parameters
firstiterator pointing to the first letter of the word to add.
lastone beyond the last letter of the word to add.
Returns
An index_type corresponding to the node with signature equal to the given word.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the distance between first and last.
See also
signature.

◆ signature() [1/2]

word_type signature ( index_type i) const
inline

After validating i, this function performs the same as signature_no_checks(i).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_not_active(i) throws.
See also
signature_no_checks, throw_if_node_index_not_active.

◆ signature() [2/2]

void signature ( word_type & w,
index_type i ) const
inline

After validating i, this function performs the same as signature_no_checks(w, i).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_not_active(i) throws.
See also
signature_no_checks, throw_if_node_index_not_active.

◆ signature_no_checks() [1/2]

word_type signature_no_checks ( index_type i) const
inline

Return the the signature of the node with index i. Recall that the signature of a node \(n\) is the word consisting of the edge labels of the unique path from the root to \(n\).

Parameters
ithe index of the node whose signature is sought.
Returns
A value of type word_type.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the height of the node.
Warning
This function does no checks on its arguments whatsoever. In particular, if the index i is greater than the number of nodes that have ever been created, then bad things will happen.

◆ signature_no_checks() [2/2]

void signature_no_checks ( word_type & w,
index_type i ) const

Changes w in-place to contain the signature of the node with index i . Recall that the signature of a node \(n\) is the word consisting of the edge labels of the unique path from the root to \(n\).

Parameters
wthe word to clear and change in-place.
ithe index of the node whose signature is sought.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the height of the node.
Warning
This function does no checks on its arguments whatsoever. In particular, if the index i is greater than the number of nodes that have ever been created, then bad things will happen.

◆ suffix_link()

index_type suffix_link ( index_type current) const
inlinenodiscard

After validating current, this function performs the same as suffix_link_no_checks(current).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_not_active(current) throws.
See also
suffix_link_no_checks, throw_if_node_index_not_active.

◆ suffix_link_no_checks()

index_type suffix_link_no_checks ( index_type current) const
nodiscard

Calculate the index of a suffix link of a node. Recall that the suffix link of a node with signature \(W\) is the node with the signature equal to that of the longest proper suffix of \(W\) contained in the trie.

Parameters
currentthe index of the node whose suffix link is sought.
Returns
A value of type index_type.
Exceptions
This function guarantees not to throw a LibsemigroupsException.
Complexity
Linear in the height of the node.
Warning
This function does no checks on its arguments whatsoever. In particular, if the index current is greater than the number of nodes that have ever been created, then bad things will happen.

◆ throw_if_node_index_not_active()

void throw_if_node_index_not_active ( index_type i) const

This function checks whether the given index i corresponds to an active node.

Parameters
ithe index to check.
Exceptions
LibsemigroupsExceptionif throw_if_node_index_out_of_range(i) throws, or if i is not an active node.
Complexity
Constant.
See also
throw_if_node_index_out_of_range.

◆ throw_if_node_index_out_of_range()

void throw_if_node_index_out_of_range ( index_type i) const

This function checks if the given index i corresponds to the index of a node; either active or inactive.

Parameters
ithe index to check.
Exceptions
LibsemigroupsExceptionif i does not correspond to the index of a node; that is, if i is larger than the size of the container storing the indices of nodes.
Complexity
Constant.

◆ traverse()

index_type traverse ( index_type current,
letter_type a ) const
inlinenodiscard

After validating current, this function performs the same as traverse_no_checks(current, a).

Exceptions
LibsemigroupsExceptionif throw_if_node_index_not_active(current) throws.
See also
traverse_no_checks, throw_if_node_index_not_active.

◆ traverse_no_checks()

index_type traverse_no_checks ( index_type current,
letter_type a ) const
nodiscard

This function traverses the trie using suffix links where necessary, behaving like a combination of the goto function and the fail function in [3].

If current is the index of a node with signature \(W\), and a is the letter \(a\), then traverse_no_checks(current, a) returns the index of the node with signature equal to the longest suffix of \(Wa\) contained in the trie.

Parameters
currentthe index of the node to traverse from.
athe letter to traverse.
Returns
An value of type index_type.
Exceptions
This function guarantees not to throw a LibsemigroupsException.

The documentation for this class was generated from the following file: