The AhoCorasick class
For an implementation of the Aho-Corasick algorithm.
This class implements a trie based data structure with suffix links to be used with the Aho-Corasick dictionary searching algorithm. An introduction to this algorithm can be found here.
The implementation of AhoCorasick
uses two different types of node;
active and inactive. An active node is a node that is currently a node
in the trie. An inactive node is a node that used to be part of the trie, but
has since been removed. It may later become active again after being
reinitialised, and exists as a way of minimising how frequently memory needs
to be allocated and deallocated for nodes. This function validates whether the
given index i corresponds to an active node.
Several helper functions are provided in the aho_corasick
module, documented
here.
Contents
For an implementation of the Aho-Corasick algorithm. |
|
Return the child of parent with edge-label letter |
|
Copy a |
|
Calculate the height of a node. |
|
Reinitialise an existing AhoCorasick object. |
|
Check if a node is terminal (by index). |
|
Returns the number of nodes in the trie. |
|
Find the signature of a node |
|
Calculate the index of the suffix link of a node. |
|
Check if an index corresponds to a node currently in the trie. |
|
Check if an index corresponds to a node. |
|
Traverse the trie using suffix links where necessary. |
Full API
- class AhoCorasick
- __init__(self: AhoCorasick) None
Construct an empty AhoCorasick.
Construct an
AhoCorasick
containing only the root that corresponds to the empty word \(\varepsilon\).
- child(self: AhoCorasick, parent: int, letter: int) int | Undefined
Return the child of parent with edge-label letter
This function returns the index of the child of the node with index parent along the edge labelled by letter. If no such child exists,
UNDEFINED
is returned.- Parameters:
- Returns:
the index of the child.
- Return type:
- Raises:
LibsemigroupsError – if
throw_if_node_index_not_active(parent)
throws.- Complexity:
Constant.
See also
- copy(self: AhoCorasick) AhoCorasick
Copy a
AhoCorasick
object.- Returns:
A copy.
- Return type:
- height(self: AhoCorasick, i: int) int
Calculate the height of a node.
- Parameters:
i (int) – the index of the node whose height is sought
- Returns:
the height.
- Return type:
- Raises:
LibsemigroupsError – if
throw_if_node_index_not_active(i)
throws.- Complexity:
Linear in the return value which is, at worst, the maximum length of a word in the trie
See also
- init(self: AhoCorasick) AhoCorasick
Reinitialise an existing AhoCorasick object.
This function puts an
AhoCorasick
object back into the same state as if it had been newly default constructed.- Returns:
self.
- Return type:
- Complexity:
Linear in the number of nodes in the trie
- is_terminal(self: AhoCorasick, i: int) bool
Check if a node is terminal (by index).
This function checks if the node with index i is terminal or not.
- Parameters:
i (int) – the index.
- Returns:
Whether or not the node is terminal
- Return type:
- Raises:
LibsemigroupsError – if i does not correspond to the index of a node; that is, if i is larger than the size of the container storing the indices of nodes.
LibsemigroupsError – if i does not correspond to an active node.
- Complexity:
Constant
- number_of_nodes(self: AhoCorasick) int
Returns the number of nodes in the trie.
This function Returns the number of nodes in the trie.
- Returns:
The number of nodes>
- Return type:
- Complexity:
Constant
- signature(self: AhoCorasick, i: int) list[int]
Find the signature of a node
Return the the signature of the node with index i. Recall that the signature of a node \(n\) is the word consisting of the edge labels of the unique path from the root to \(n\).
- suffix_link(self: AhoCorasick, current: int) int
Calculate the index of the suffix link of a node.
Calculate the index of a suffix link of a node. Recall that the suffix link of a node with signature \(W\) is the node with the signature equal to that of the longest proper suffix of \(W\) contained in the trie.
- Parameters:
current (int) – the index of the node whose suffix link is sought
- Returns:
The index of the suffix link.
- Return type:
- Raises:
LibsemigroupsError – if
throw_if_node_index_not_active(current)
throws.- Complexity:
Linear in the height of the node.
See also
- throw_if_node_index_not_active(self: AhoCorasick, i: int) None
Check if an index corresponds to a node currently in the trie.
- Parameters:
i (int) – the index to validate
- Raises:
LibsemigroupsError – if
throw_if_node_index_out_of_range(i)
throws, or if i is not an active node.- Complexity:
Constant
See also
- throw_if_node_index_out_of_range(self: AhoCorasick, i: int) None
Check if an index corresponds to a node.
This function checks if the given index i corresponds to the index of a node; either active or inactive.
- Parameters:
i (int) – the index to validate.
- Raises:
LibsemigroupsError – if i does not correspond to the index of a node; that is, if i is larger than the size of the container storing the indices of nodes.
- Complexity:
Constant
- traverse(self: AhoCorasick, current: int, a: int) int
Traverse the trie using suffix links where necessary.
This function traverses the trie using suffix links where necessary, behaving like a combination of the goto function and the fail function in [AC75].
If current is the index of a node with signature \(W\), and a is the letter \(a\), then traverse_no_checks(current, a) returns the index of the node with signature equal to the longest suffix of \(Wa\) contained in the trie.
- Parameters:
- Returns:
The index of the node traversed to
- Return type:
- Raises:
LibsemigroupsError – if
throw_if_node_index_not_active(current)
throws.
See also