The AhoCorasick class

For an implementation of the Aho-Corasick algorithm.

This class implements a trie based data structure with suffix links to be used with the Aho-Corasick dictionary searching algorithm. An introduction to this algorithm can be found here.

The implementation of AhoCorasick uses two different types of node; active and inactive. An active node is a node that is currently a node in the trie. An inactive node is a node that used to be part of the trie, but has since been removed. It may later become active again after being reinitialised, and exists as a way of minimising how frequently memory needs to be allocated and deallocated for nodes. This function validates whether the given index i corresponds to an active node.

Several helper functions are provided in the aho_corasick module, documented here.

Contents

AhoCorasick

For an implementation of the Aho-Corasick algorithm.

AhoCorasick.child(…)

Return the child of parent with edge-label letter

AhoCorasick.copy(…)

Copy a AhoCorasick object.

AhoCorasick.height(…)

Calculate the height of a node.

AhoCorasick.init(…)

Reinitialise an existing AhoCorasick object.

AhoCorasick.is_terminal(…)

Check if a node is terminal (by index).

AhoCorasick.number_of_nodes(…)

Returns the number of nodes in the trie.

AhoCorasick.signature(…)

Find the signature of a node

AhoCorasick.suffix_link(…)

Calculate the index of the suffix link of a node.

AhoCorasick.throw_if_node_index_not_active(…)

Check if an index corresponds to a node currently in the trie.

AhoCorasick.throw_if_node_index_out_of_range(…)

Check if an index corresponds to a node.

AhoCorasick.traverse(…)

Traverse the trie using suffix links where necessary.

Full API

class AhoCorasick
__init__(self: AhoCorasick) None

Construct an empty AhoCorasick.

Construct an AhoCorasick containing only the root that corresponds to the empty word \(\varepsilon\).

child(self: AhoCorasick, parent: int, letter: int) int | Undefined

Return the child of parent with edge-label letter

This function returns the index of the child of the node with index parent along the edge labelled by letter. If no such child exists, UNDEFINED is returned.

Parameters:
  • parent (int) – the index of the node whose child is sought.

  • letter (int) – the edge-label connecting the parent to the desired child.

Returns:

the index of the child.

Return type:

int | Undefined

Raises:

LibsemigroupsError – if throw_if_node_index_not_active(parent) throws.

Complexity:

Constant.

copy(self: AhoCorasick) AhoCorasick

Copy a AhoCorasick object.

Returns:

A copy.

Return type:

AhoCorasick

height(self: AhoCorasick, i: int) int

Calculate the height of a node.

Parameters:

i (int) – the index of the node whose height is sought

Returns:

the height.

Return type:

int

Raises:

LibsemigroupsError – if throw_if_node_index_not_active(i) throws.

Complexity:

Linear in the return value which is, at worst, the maximum length of a word in the trie

init(self: AhoCorasick) AhoCorasick

Reinitialise an existing AhoCorasick object.

This function puts an AhoCorasick object back into the same state as if it had been newly default constructed.

Returns:

self.

Return type:

AhoCorasick

Complexity:

Linear in the number of nodes in the trie

is_terminal(self: AhoCorasick, i: int) bool

Check if a node is terminal (by index).

This function checks if the node with index i is terminal or not.

Parameters:

i (int) – the index.

Returns:

Whether or not the node is terminal

Return type:

bool

Raises:
  • LibsemigroupsError – if i does not correspond to the index of a node; that is, if i is larger than the size of the container storing the indices of nodes.

  • LibsemigroupsError – if i does not correspond to an active node.

Complexity:

Constant

number_of_nodes(self: AhoCorasick) int

Returns the number of nodes in the trie.

This function Returns the number of nodes in the trie.

Returns:

The number of nodes>

Return type:

int

Complexity:

Constant

signature(self: AhoCorasick, i: int) list[int]

Find the signature of a node

Return the the signature of the node with index i. Recall that the signature of a node \(n\) is the word consisting of the edge labels of the unique path from the root to \(n\).

Parameters:

i (int) – the index of the node whose signature is sought.

Returns:

The signature.

Return type:

list[int]

Complexity:

Linear in the height of the node.

Calculate the index of the suffix link of a node.

Calculate the index of a suffix link of a node. Recall that the suffix link of a node with signature \(W\) is the node with the signature equal to that of the longest proper suffix of \(W\) contained in the trie.

Parameters:

current (int) – the index of the node whose suffix link is sought

Returns:

The index of the suffix link.

Return type:

int

Raises:

LibsemigroupsError – if throw_if_node_index_not_active(current) throws.

Complexity:

Linear in the height of the node.

throw_if_node_index_not_active(self: AhoCorasick, i: int) None

Check if an index corresponds to a node currently in the trie.

Parameters:

i (int) – the index to validate

Raises:

LibsemigroupsError – if throw_if_node_index_out_of_range(i) throws, or if i is not an active node.

Complexity:

Constant

throw_if_node_index_out_of_range(self: AhoCorasick, i: int) None

Check if an index corresponds to a node.

This function checks if the given index i corresponds to the index of a node; either active or inactive.

Parameters:

i (int) – the index to validate.

Raises:

LibsemigroupsError – if i does not correspond to the index of a node; that is, if i is larger than the size of the container storing the indices of nodes.

Complexity:

Constant

traverse(self: AhoCorasick, current: int, a: int) int

Traverse the trie using suffix links where necessary.

This function traverses the trie using suffix links where necessary, behaving like a combination of the goto function and the fail function in [AC75].

If current is the index of a node with signature \(W\), and a is the letter \(a\), then traverse_no_checks(current, a) returns the index of the node with signature equal to the longest suffix of \(Wa\) contained in the trie.

Parameters:
  • current (int) – the index of the node to traverse from

  • a (int) – the letter to traverse

Returns:

The index of the node traversed to

Return type:

int

Raises:

LibsemigroupsError – if throw_if_node_index_not_active(current) throws.