stringcompare package
- class stringcompare.CharacterDifference
Bases:
StringComparator
Character difference between two strings.
This is the number of characters differing between two strings. The distance may be normalized or returned as a similarity score instead.
- Parameters:
normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
- compare(self: stringcompare.distance._distance.CharacterDifference, arg0: str, arg1: str) -> float
- class stringcompare.Comparator
Bases:
pybind11_object
Abstract base class for pybind11 comparator objects.
Provides the compare() function for comparison of two elements, the elementwise() function for elementwise comparison between two lists, and the pairwise() function for pairwise comparison between elements of two lists.
Parameters for the comparison functions (e.g. to return a distance or similarity, whether or not to normalize, weights, etc.) should be provided to the constructor.
The current class structure, implemented in C++, is as follows:
Comparator
├─► compare()
├─► elementwise()
├─► pairwise()
│
└── StringComparator
    ├── Levenshtein
    ├── DamerauLevenshtein
    ├── LCSDistance
    ├── Jaro
    ├── JaroWinkler
    ├── CharacterDifference
    └── Hamming
See also
StringComparator
NumericComparator
- compare(self: stringcompare.distance._distance.Comparator, arg0: object, arg1: object) -> float
Comparison between two elements.
- Parameters:
arg0 – Object to compare from.
arg1 – Object to compare to.
- Returns:
Numeric value of the comparison.
- elementwise(self: stringcompare.distance._distance.Comparator, arg0: collections.abc.Sequence[object], arg1: collections.abc.Sequence[object]) -> numpy.typing.NDArray[numpy.float64]
Elementwise comparison between two lists.
- Parameters:
arg0 – List of objects to compare from.
arg1 – List of objects to compare to.
- Returns:
Numpy array containing comparison values.
Note
The two lists arg0 and arg1 should be of the same length.
- pairwise(self: stringcompare.distance._distance.Comparator, arg0: collections.abc.Sequence[object], arg1: collections.abc.Sequence[object]) -> numpy.typing.NDArray[numpy.float64]
Pairwise comparison between two lists.
- Parameters:
arg0 – List of objects to compare from.
arg1 – List of objects to compare to.
- Returns:
Two-dimensional numpy array containing comparison values, where each row corresponds to an element of arg0 and each column corresponds to an element of arg1.
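The difference between the two list-level functions can be sketched in plain Python. This is only an illustration of the semantics described above, not the library's C++ implementation: the `compare` function here is a hypothetical stand-in (0.0 for equal strings, 1.0 otherwise), and nested lists are used in place of numpy arrays.

```python
def compare(a: str, b: str) -> float:
    """Hypothetical stand-in comparator: 0.0 if equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def elementwise(xs, ys):
    """Compare xs[i] with ys[i]; both lists must have the same length."""
    if len(xs) != len(ys):
        raise ValueError("elementwise comparison needs lists of equal length")
    return [compare(x, y) for x, y in zip(xs, ys)]


def pairwise(xs, ys):
    """Compare every x with every y: row i is xs[i], column j is ys[j]."""
    return [[compare(x, y) for y in ys] for x in xs]


names_a = ["ann", "bob"]
names_b = ["ann", "rob", "bob"]

print(elementwise(names_a, names_b[:2]))  # one value per position
print(pairwise(names_a, names_b))         # 2 rows (names_a) by 3 columns (names_b)
```

Under this reading, the pairwise result has len(arg0) rows and len(arg1) columns, so the two lists do not need to have the same length.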
- class stringcompare.DamerauLevenshtein
Bases:
StringComparator
Damerau-Levenshtein distance
This is the minimum number of insertions, deletions, substitutions or transpositions required to change one word into the other. The distance may be normalized or returned as a similarity score instead.
- Parameters:
normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
dmat_size – Initial length of the internal string buffer.
- compare(self: stringcompare.distance._distance.DamerauLevenshtein, arg0: str, arg1: str) -> float
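To make the role of transpositions concrete, here is a minimal pure-Python sketch of the optimal string alignment variant of the distance, in which each substring may be edited at most once. Whether the library uses this restricted variant or the unrestricted Damerau-Levenshtein distance is not stated here, so treat the function below (and its name, `osa_distance`) as an illustrative assumption.

```python
def osa_distance(s: str, t: str) -> int:
    """Optimal string alignment distance: counts insertions, deletions,
    substitutions, and transpositions of two adjacent characters."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: s ends in "xy" where t ends in "yx".
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Olivier", "Oilvier"))  # 1: a single "li" -> "il" transposition
```

The same pair of strings has a plain Levenshtein distance of 2, since without transpositions the swap costs two substitutions.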
- class stringcompare.DeepparseAddressTagger(deepparse_handle)
Bases:
Tagger
- LABELS = ['StreetNumber', 'StreetName', 'Unit', 'Municipality', 'Province', 'PostalCode', 'Orientation', 'GeneralDelivery']
- class stringcompare.Hamming
Bases:
StringComparator
Hamming distance between two strings.
This is the number of differences between corresponding characters in the strings.
- Parameters:
normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
- compare(self: stringcompare.distance._distance.Hamming, arg0: str, arg1: str) -> float
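A rough pure-Python sketch of the idea follows. Two choices in it are assumptions rather than documented behaviour: for strings of unequal length, the unmatched trailing characters are counted as differences, and normalization divides by the length of the longer string.

```python
def hamming(s: str, t: str, normalize: bool = True) -> float:
    """Count positions at which s and t differ.

    Assumption: extra characters in the longer string count as differences.
    Assumption: normalization divides by the longer string's length.
    """
    diff = sum(a != b for a, b in zip(s, t)) + abs(len(s) - len(t))
    if not normalize:
        return float(diff)
    longest = max(len(s), len(t))
    return diff / longest if longest else 0.0


print(hamming("karolin", "kathrin", normalize=False))  # 3.0
```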
- class stringcompare.Jaro
Bases:
StringComparator
Jaro distance
- Parameters:
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
- compare(self: stringcompare.distance._distance.Jaro, arg0: str, arg1: str) -> float
- class stringcompare.JaroWinkler
Bases:
StringComparator
Jaro-Winkler distance
- Parameters:
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
- compare(self: stringcompare.distance._distance.JaroWinkler, arg0: str, arg1: str) -> float
- class stringcompare.LCSDistance
Bases:
StringComparator
Longest common subsequence (LCS) distance
This is the minimum number of insertions or deletions required to change one word into the other. The distance may be normalized or returned as a similarity score instead.
- Parameters:
normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
dmat_size – Initial length of the internal string buffer.
- compare(self: stringcompare.distance._distance.LCSDistance, arg0: str, arg1: str) -> float
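With only insertions and deletions allowed, the unnormalized distance equals len(s) + len(t) minus twice the length of the longest common subsequence. The sketch below computes it that way in pure Python; it illustrates the definition only and makes no claim about how the library computes or normalizes the value.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of s or t
        prev = curr
    return prev[-1]


def lcs_distance(s: str, t: str) -> int:
    """Minimum number of insertions and deletions turning s into t."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


print(lcs_distance("Olivier", "Oilvier"))  # 2: delete one letter, insert one letter
```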
- class stringcompare.Levenshtein
Bases:
StringComparator
Levenshtein distance
This is defined as the “minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other” (see Wikipedia page). The distance may be normalized or returned as a similarity score instead.
- Parameters:
normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
dmat_size – Initial length of the internal string buffer. Defaults to 100.
- Examples:
>>> from stringcompare import Levenshtein
>>> lev = Levenshtein()
>>> lev("Olivier", "Oilvier")  # Same as lev.compare("Olivier", "Oilvier")
0.25

>>> lev = Levenshtein(normalize=False)
>>> lev("Olivier", "Oilvier")
2.0

>>> lev = Levenshtein(normalize=False, similarity=True)
>>> lev("Olivier", "Oilvier")
6.0

>>> lev.elementwise(["a", "ab"], ["b", "ba"])
array([0.5, 1.])

>>> lev.pairwise(["a", "ab"], ["b", "ba"])
array([[0.5, 1. ],
       [1. , 1. ]])
- compare(self: stringcompare.distance._distance.Levenshtein, arg0: str, arg1: str) -> float
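The example values for this class are consistent with a normalized distance of 2d / (len(s) + len(t) + d) and an unnormalized similarity of (len(s) + len(t) - d) / 2, where d is the raw edit distance. The pure-Python sketch below uses those formulas. They are inferred from the example outputs rather than taken from the library's source, and the normalized similarity (1 minus the normalized distance) and the helper names `levenshtein` and `score` are further assumptions.

```python
def levenshtein(s: str, t: str) -> int:
    """Raw edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (a != b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def score(s: str, t: str, normalize: bool = True, similarity: bool = False) -> float:
    """Assumed scoring rules, chosen to reproduce the documented examples."""
    d = levenshtein(s, t)
    total = len(s) + len(t)
    if normalize:
        dist = 2 * d / (total + d) if total else 0.0
        return 1.0 - dist if similarity else dist  # assumption: sim = 1 - dist
    return (total - d) / 2 if similarity else float(d)


print(score("Olivier", "Oilvier"))                                    # 0.25
print(score("Olivier", "Oilvier", normalize=False))                   # 2.0
print(score("Olivier", "Oilvier", normalize=False, similarity=True))  # 6.0
```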
- class stringcompare.StringComparator
Bases:
pybind11_object
- compare(self: stringcompare.distance._distance.StringComparator, arg0: str, arg1: str) -> float
- elementwise(self: stringcompare.distance._distance.StringComparator, arg0: collections.abc.Sequence[str], arg1: collections.abc.Sequence[str]) -> numpy.typing.NDArray[numpy.float64]
- pairwise(self: stringcompare.distance._distance.StringComparator, arg0: collections.abc.Sequence[str], arg1: collections.abc.Sequence[str]) -> numpy.typing.NDArray[numpy.float64]
- class stringcompare.WhitespaceTokenizer
Bases:
DelimTokenizer