![]() |
libsemigroups
v3.0.0
C++ library for semigroups and monoids
|
Defined in aho-corasick.hpp
.
This class implements a trie based data structure with suffix links to be used with the Aho-Corasick dictionary searching algorithm. An introduction to this algorithm can be found at:
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
The implementation of AhoCorasick uses two different types of node; active and inactive. An active node is a node that is currently a node in the trie. An inactive node is a node that used to be part of the trie, but has since been removed. It may later become active again after being reinitialised, and exists as a way of minimising how frequently memory needs to be allocated and deallocated for nodes.
Several helper functions are provided in the aho_corasick
namespace.
Public Types | |
using | const_iterator = typename word_type::const_iterator |
Used for word_type iterators. | |
using | index_type = size_t |
Alias for the index of a node in the trie. | |
Public Member Functions | |
AhoCorasick () | |
Construct an empty AhoCorasick. | |
AhoCorasick (AhoCorasick &&)=default | |
Default move constructor. | |
AhoCorasick (AhoCorasick const &)=default | |
Default copy constructor. | |
rx::iterator_range< std::set< index_type >::const_iterator > | active_nodes () const |
Return the active nodes. | |
template<typename Iterator> | |
index_type | add_word (Iterator first, Iterator last) |
Check and add a word to the trie. | |
template<typename Iterator> | |
index_type | add_word_no_checks (Iterator first, Iterator last) |
Add a word to the trie. | |
std::set< index_type >::iterator | begin_nodes () const noexcept |
Return an iterator pointing to the first active node. | |
std::set< index_type >::const_iterator | cbegin_nodes () const noexcept |
Return a const iterator pointing to the first active node. | |
std::set< index_type >::const_iterator | cend_nodes () const noexcept |
Return a const iterator pointing one beyond the last active node. | |
index_type | child (index_type parent, letter_type letter) const |
Return the child of parent with edge-label letter . | |
index_type | child_no_checks (index_type parent, letter_type letter) const |
Return the child of parent with edge-label letter . | |
std::set< index_type >::iterator | end_nodes () const noexcept |
Return an iterator pointing one beyond the last active node. | |
size_t | height (index_type i) const |
Calculate the height of a node. | |
size_t | height_no_checks (index_type i) const |
Calculate the height of a node. | |
AhoCorasick & | init () |
Reinitialise an existing AhoCorasick object. | |
Node const & | node (index_type i) const |
Return the node given an index. | |
Node const & | node_no_checks (index_type i) const |
Return the node given an index. | |
size_t | number_of_nodes () const noexcept |
Returns the number of nodes in the trie. | |
AhoCorasick & | operator= (AhoCorasick &&)=default |
Default move assignment. | |
AhoCorasick & | operator= (AhoCorasick const &)=default |
Default copy assignment. | |
template<typename Iterator> | |
index_type | rm_word (Iterator first, Iterator last) |
Check and add a word to the trie. | |
template<typename Iterator> | |
index_type | rm_word_no_checks (Iterator first, Iterator last) |
Remove a word from the trie. | |
word_type | signature (index_type i) const |
Find the signature of a node (out-of-place). | |
void | signature (word_type &w, index_type i) const |
Find the signature of a node (in-place). | |
word_type | signature_no_checks (index_type i) const |
Find the signature of a node (out-of-place). | |
void | signature_no_checks (word_type &w, index_type i) const |
Find the signature of a node (in-place). | |
index_type | suffix_link (index_type current) const |
Calculate the index of the suffix link of a node. | |
index_type | suffix_link_no_checks (index_type current) const |
Calculate the index of the suffix link of a node. | |
void | throw_if_node_index_not_active (index_type i) const |
Check if an index corresponds to a node currently in the trie. | |
void | throw_if_node_index_out_of_range (index_type i) const |
Check if an index corresponds to a node. | |
index_type | traverse (index_type current, letter_type a) const |
Traverse the trie using suffix links where necessary. | |
index_type | traverse_no_checks (index_type current, letter_type a) const |
Traverse the trie using suffix links where necessary. | |
AhoCorasick | ( | ) |
Construct an AhoCorasick containing only the root that corresponds to the empty word \(\varepsilon\).
|
default |
Default copy constructor.
|
default |
Default move constructor.
|
inlinenodiscard |
This function returns the active nodes of the trie.
rx::iterator_range<std::set<index_type>::const_iterator>
.index_type add_word | ( | Iterator | first, |
Iterator | last ) |
This function does the same as add_word_no_checks(Iterator, Iterator)
after first checking that the word corresponding to first
and last
does not correspond to an existing terminal node in the trie.
LibsemigroupsException | if the word corresponding to first and last corresponds to an existing terminal node in the trie. |
index_type add_word_no_checks | ( | Iterator | first, |
Iterator | last ) |
Calling this function immediately adds the given word to the trie, and makes the final node on the path labelled by this word terminal (if it wasn't already). After adding a word, existing suffix links become invalid. If an identical word has already been added to the trie, then this function does nothing. If first == last
, then this function does nothing.
Iterator | the type of the 1st and 2nd parameters. |
first | iterator pointing to the first letter of the word to add. |
last | one beyond the last letter of the word to add. |
first
and last
.
|
inlinenodiscardnoexcept |
This function returns an iterator pointing to the first active node in the trie.
std::set<index_type>::iterator
.noexcept
and is guaranteed never to throw.
|
inlinenodiscardnoexcept |
This function returns a const iterator pointing to the first active node in the trie.
std::set<index_type>::const_iterator
.noexcept
and is guaranteed never to throw.
|
inlinenodiscardnoexcept |
This function returns a const iterator pointing one beyond the last active node in the trie.
std::set<index_type>::const_iterator
.noexcept
and is guaranteed never to throw.
|
inlinenodiscard |
After validating parent
, this function performs the same as child_no_checks(parent, letter)
.
LibsemigroupsException | if throw_if_node_index_not_active(parent) throws. |
|
inlinenodiscard |
This function returns the index of the child of the node with index parent
along the edge labelled by letter
. If no such child exists, UNDEFINED is returned.
parent | the index of the node whose child is sought. |
letter | the edge-label connecting the parent to the desired child. |
index_type
.parent
is greater than the number of nodes that have ever been created, then bad things will happen.
|
inlinenodiscardnoexcept |
This function returns an iterator pointing one beyond the last active node in the trie.
std::set<index_type>::iterator
.noexcept
and is guaranteed never to throw.
|
inlinenodiscard |
After validating i
, this function performs the same as height_no_checks(i)
.
LibsemigroupsException | if throw_if_node_index_not_active(i) throws. |
|
nodiscard |
Calculate the height of a node.
i | the index of the node whose height is sought. |
size_t
.i
is greater than the number of nodes that have ever been created, then bad things will happen. AhoCorasick & init | ( | ) |
This function puts an AhoCorasick object back into the same state as if it had been newly default constructed.
this
.
|
inlinenodiscard |
After validating i
, this function performs the same as node_no_checks(i)
.
LibsemigroupsException | if throw_if_node_index_out_of_range(i) throws. |
|
inlinenodiscard |
This function returns the node stored in the trie given by the index i
.
i | the index of the node to return. |
Node
.i
is greater than the number of nodes that have ever been created, then bad things will happen.
|
inlinenodiscardnoexcept |
This function returns the number of nodes in the trie.
size_t
.noexcept
and is guaranteed never to throw.
|
default |
Default move assignment.
|
default |
Default copy assignment.
index_type rm_word | ( | Iterator | first, |
Iterator | last ) |
This function does the same as rm_word_no_checks(Iterator, Iterator)
after first checking that the word corresponding to first
and last
is terminal node in the trie.
LibsemigroupsException | if the word corresponding to first and last does not correspond to an existing terminal node in the trie. |
index_type rm_word_no_checks | ( | Iterator | first, |
Iterator | last ) |
From the trie, remove each node of the given word that is not part of the prefix of a different word.
If the given word \(W\) corresponds to a terminal node with no children, then calling this function removes the nodes \(n_i\) from the trie that correspond to the largest suffix \(W\), such that each \(n_i\) has either zero children or one. After this, existing suffix links become invalid.
If \(W\) corresponds to a terminal node \(n\) with children, then calling this function makes \(n\) not terminal.
If \(W\) does not correspond to a terminal node, then calling this function does nothing.
Iterator | the type of the 1st and 2nd parameters. |
first | iterator pointing to the first letter of the word to add. |
last | one beyond the last letter of the word to add. |
first
and last
.
|
inline |
After validating i
, this function performs the same as signature_no_checks(i)
.
LibsemigroupsException | if throw_if_node_index_not_active(i) throws. |
|
inline |
After validating i
, this function performs the same as signature_no_checks(w, i)
.
LibsemigroupsException | if throw_if_node_index_not_active(i) throws. |
|
inline |
Return the the signature of the node with index i
. Recall that the signature of a node \(n\) is the word consisting of the edge labels of the unique path from the root to \(n\).
i | the index of the node whose signature is sought. |
word_type
.i
is greater than the number of nodes that have ever been created, then bad things will happen. void signature_no_checks | ( | word_type & | w, |
index_type | i ) const |
Changes w
in-place to contain the signature of the node with index i
. Recall that the signature of a node \(n\) is the word consisting of the edge labels of the unique path from the root to \(n\).
w | the word to clear and change in-place. |
i | the index of the node whose signature is sought. |
i
is greater than the number of nodes that have ever been created, then bad things will happen.
|
inlinenodiscard |
After validating current
, this function performs the same as suffix_link_no_checks(current)
.
LibsemigroupsException | if throw_if_node_index_not_active(current) throws. |
|
nodiscard |
Calculate the index of a suffix link of a node. Recall that the suffix link of a node with signature \(W\) is the node with the signature equal to that of the longest proper suffix of \(W\) contained in the trie.
current | the index of the node whose suffix link is sought. |
index_type
.current
is greater than the number of nodes that have ever been created, then bad things will happen. void throw_if_node_index_not_active | ( | index_type | i | ) | const |
This function checks whether the given index i
corresponds to an active node.
i | the index to check. |
LibsemigroupsException | if throw_if_node_index_out_of_range(i) throws, or if i is not an active node. |
void throw_if_node_index_out_of_range | ( | index_type | i | ) | const |
This function checks if the given index i
corresponds to the index of a node; either active or inactive.
i | the index to check. |
LibsemigroupsException | if i does not correspond to the index of a node; that is, if i is larger than the size of the container storing the indices of nodes. |
|
inlinenodiscard |
After validating current
, this function performs the same as traverse_no_checks(current, a)
.
LibsemigroupsException | if throw_if_node_index_not_active(current) throws. |
|
nodiscard |
This function traverses the trie using suffix links where necessary, behaving like a combination of the goto function and the fail function in [3].
If current
is the index of a node with signature \(W\), and a
is the letter \(a\), then traverse_no_checks(current, a)
returns the index of the node with signature equal to the longest suffix of \(Wa\) contained in the trie.
current | the index of the node to traverse from. |
a | the letter to traverse. |
index_type
.