stringcompare.distance package

class stringcompare.distance.CharacterDifference

Bases: StringComparator

Character difference between two strings.

This is the number of characters differing between two strings. The distance may be normalized or returned as a similarity score instead.

Parameters:
  • normalize – Whether to normalize the result to lie between 0 and 1. Defaults to True.

  • similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.

compare(self: stringcompare.distance._distance.CharacterDifference, arg0: str, arg1: str) → float

class stringcompare.distance.Comparator

Bases: pybind11_object

Abstract base class for pybind11 comparator objects.

Provides the compare() function for comparison of two elements, the elementwise() function for elementwise comparison between two lists, and the pairwise() function for pairwise comparison between elements of two lists.

Parameters for the comparison functions (e.g. whether to return a distance or a similarity, whether to normalize, weights, etc.) should be provided to the constructor.

The current class structure, implemented in C++, is as follows:

Comparator
─┬────────
 ├─► compare()
 │
 ├─► elementwise()
 │
 ├─► pairwise()
 │
 │StringComparator
 └─┬──────────────
   │
   │ Levenshtein
   ├────────────
   │
   │ DamerauLevenshtein
   ├───────────────────
   │
   │ LCSDistance
   ├────────────
   │
   │ Jaro
   ├─────
   │
   │ JaroWinkler
   ├────────────
   │
   │ CharacterDifference
   ├────────────────────
   │
   │ Hamming
   └────────

See also

StringComparator, NumericComparator

compare(self: stringcompare.distance._distance.Comparator, arg0: object, arg1: object) → float

Comparison between two elements.

Parameters:
  • arg0 – Object to compare from.

  • arg1 – Object to compare to.

Returns:

Numeric value of the comparison.

elementwise(self: stringcompare.distance._distance.Comparator, arg0: collections.abc.Sequence[object], arg1: collections.abc.Sequence[object]) → numpy.typing.NDArray[numpy.float64]

Elementwise comparison between two lists.

Parameters:
  • arg0 – List of objects to compare from.

  • arg1 – List of objects to compare to.

Returns:

Numpy array containing comparison values.

Note

The two lists arg0 and arg1 should be of the same length.

pairwise(self: stringcompare.distance._distance.Comparator, arg0: collections.abc.Sequence[object], arg1: collections.abc.Sequence[object]) → numpy.typing.NDArray[numpy.float64]

Pairwise comparison between two lists.

Parameters:
  • arg0 – List of objects to compare from.

  • arg1 – List of objects to compare to.

Returns:

Numpy array of shape (len(arg0), len(arg1)) containing comparison values, where each row corresponds to an element of arg0 and each column corresponds to an element of arg1.
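
The difference between the two list operations is easiest to see by shape. The following is a minimal pure-Python sketch, not the library's C++ implementation: the helper names and the stand-in comparator exact_mismatch are hypothetical, plain lists are used in place of numpy arrays, and raising ValueError on unequal lengths is an assumption (the note above only says the lengths should match).

```python
from typing import Callable, List, Sequence


def elementwise(compare: Callable[[str, str], float],
                left: Sequence[str], right: Sequence[str]) -> List[float]:
    """Compare left[i] with right[i]; one value per position."""
    if len(left) != len(right):
        # Assumption: the documentation only says the lengths "should" match.
        raise ValueError("elementwise comparison needs equal-length inputs")
    return [compare(a, b) for a, b in zip(left, right)]


def pairwise(compare: Callable[[str, str], float],
             left: Sequence[str], right: Sequence[str]) -> List[List[float]]:
    """Compare every element of left (rows) with every element of right (columns)."""
    return [[compare(a, b) for b in right] for a in left]


def exact_mismatch(a: str, b: str) -> float:
    """Hypothetical stand-in comparator: 0.0 for equal strings, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


names = ["ann", "bob", "cat"]
others = ["ann", "bib"]

matrix = pairwise(exact_mismatch, names, others)
# One row per element of `names`, one column per element of `others`: 3 x 2.
shape = (len(matrix), len(matrix[0]))

paired = elementwise(exact_mismatch, ["ann", "bob"], ["ann", "bib"])
```

With three elements on the left and two on the right, the pairwise result has three rows and two columns, whereas the elementwise result has one value per position.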

class stringcompare.distance.DamerauLevenshtein

Bases: StringComparator

Damerau-Levenshtein distance

This is the minimum number of insertions, deletions, substitutions or transpositions required to change one word into the other. The distance may be normalized or returned as a similarity score instead.

Parameters:
  • normalize – Whether to normalize the result to lie between 0 and 1. Defaults to True.

  • similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.

  • dmat_size – Initial length of the internal string buffer. Defaults to 100.
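
To make the role of transpositions concrete, here is a minimal pure-Python sketch of the unnormalized distance. It uses the optimal string alignment (restricted) variant, in which each substring is edited at most once; the documentation does not say which Damerau-Levenshtein variant the library implements, so treat that choice and the function name osa_distance as assumptions.

```python
def osa_distance(s: str, t: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] holds the distance between s[:i] and t[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(s)][len(t)]


swapped = osa_distance("Olivier", "Oilvier")  # one transposition of "li"
```

Swapping the adjacent letters "l" and "i" counts as a single edit here, whereas plain Levenshtein distance needs two substitutions for the same pair.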

compare(self: stringcompare.distance._distance.DamerauLevenshtein, arg0: str, arg1: str) → float

class stringcompare.distance.Hamming

Bases: StringComparator

Hamming distance between two strings.

This is the number of differences between corresponding characters in the strings.

Parameters:
  • normalize – Whether to normalize the result to lie between 0 and 1. Defaults to True.

  • similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.
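
As an illustration of the unnormalized count, a minimal pure-Python sketch follows. The function name hamming is hypothetical, and counting the surplus characters of the longer string as differences is an assumption: the documentation does not say how strings of unequal length are handled.

```python
def hamming(s: str, t: str) -> int:
    """Count positions at which the two strings differ."""
    mismatches = sum(a != b for a, b in zip(s, t))
    # Assumption: characters beyond the shorter string each count as a difference.
    return mismatches + abs(len(s) - len(t))


differences = hamming("karolin", "kathrin")  # r/t, o/h and l/r differ
```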

compare(self: stringcompare.distance._distance.Hamming, arg0: str, arg1: str) → float

class stringcompare.distance.Jaro

Bases: StringComparator

Jaro distance

Parameters:

similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.
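
The entry above does not define the measure, so here is a minimal pure-Python sketch of the textbook Jaro similarity. It is not taken from the library: the name jaro_similarity is hypothetical, and returning 1.0 for two empty strings is a convention chosen for this sketch.

```python
def jaro_similarity(s: str, t: str) -> float:
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if not s and not t:
        return 1.0  # convention chosen for this sketch
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters appearing in a different order, counted in pairs.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s)
            + matches / len(t)
            + (matches - transpositions) / matches) / 3


sim = jaro_similarity("MARTHA", "MARHTA")  # 6 matches, 1 transposition
```

A distance score would then presumably be obtained as one minus this similarity, although the documentation does not state the conversion.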

compare(self: stringcompare.distance._distance.Jaro, arg0: str, arg1: str) → float

class stringcompare.distance.JaroWinkler

Bases: StringComparator

Jaro-Winkler distance

Parameters:

similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.
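
Jaro-Winkler raises a Jaro similarity for strings that share a common prefix. The sketch below shows only that adjustment, applied to an already computed Jaro similarity. The function name winkler_boost is hypothetical; the scaling factor p=0.1 matches the default shown for the jarowinkler function in the submodule listing, while the prefix cap of 4 characters is the textbook value and an assumption about the library.

```python
def winkler_boost(jaro_sim: float, s: str, t: str,
                  p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix adjustment to a Jaro similarity."""
    prefix = 0
    # Length of the common prefix, capped at max_prefix characters.
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return jaro_sim + prefix * p * (1.0 - jaro_sim)


# The textbook Jaro similarity of "MARTHA" and "MARHTA" is 17/18.
boosted = winkler_boost(17 / 18, "MARTHA", "MARHTA")  # common prefix "MAR"
```

The adjustment never lowers the score, and it leaves the score unchanged when the strings differ at the first character.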

compare(self: stringcompare.distance._distance.JaroWinkler, arg0: str, arg1: str) → float

class stringcompare.distance.LCSDistance

Bases: StringComparator

Longest common subsequence (LCS) distance

This is the minimum number of insertions or deletions required to change one word into the other. The distance may be normalized or returned as a similarity score instead.

Parameters:
  • normalize – Whether to normalize the result to lie between 0 and 1. Defaults to True.

  • similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.

  • dmat_size – Initial length of the internal string buffer. Defaults to 100.
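
The insertion/deletion count described above equals len(s) + len(t) minus twice the length of the longest common subsequence, since every character outside the common subsequence must be deleted from one string or inserted from the other. A minimal pure-Python sketch of the unnormalized distance, with hypothetical function names, could look like this:

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(t)]


def lcs_distance(s: str, t: str) -> int:
    """Minimum number of insertions and deletions turning s into t."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


dist = lcs_distance("Olivier", "Oilvier")  # common subsequence of length 6
```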

compare(self: stringcompare.distance._distance.LCSDistance, arg0: str, arg1: str) → float

class stringcompare.distance.Levenshtein

Bases: StringComparator

Levenshtein distance

This is defined as the “minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other” (see Wikipedia page). The distance may be normalized or returned as a similarity score instead.

Parameters:
  • normalize – Whether to normalize the result to lie between 0 and 1. Defaults to True.

  • similarity – Whether to return a similarity score (higher for more similar strings) instead of a distance score (closer to zero for more similar strings). Defaults to False.

  • dmat_size – Initial length of the internal string buffer. Defaults to 100.

Examples:

>>> from stringcompare import Levenshtein
>>> lev = Levenshtein()
>>> lev("Olivier", "Oilvier") # Same as lev.compare("Olivier", "Oilvier")
0.25
>>> lev = Levenshtein(normalize=False)
>>> lev("Olivier", "Oilvier")
2.0
>>> lev = Levenshtein(normalize=False, similarity=True)
>>> lev("Olivier", "Oilvier")
6.0
>>> lev.elementwise(["a", "ab"], ["b", "ba"])
array([0.5, 1.])
>>> lev.pairwise(["a", "ab"], ["b", "ba"])
array([[0.5, 1. ],
       [1. , 1. ]])
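
The outputs above are consistent with a normalized distance of 2d / (|s| + |t| + d) and an unnormalized similarity of (|s| + |t| - d) / 2, where d is the edit distance. The pure-Python sketch below reproduces them under that assumption. The formulas are inferred from the documented outputs rather than taken from the library source, the function names are hypothetical, and the normalized similarity (one minus the normalized distance) and the handling of two empty strings are further assumptions.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[len(t)]


def levenshtein_score(s: str, t: str, normalize: bool = True,
                      similarity: bool = False) -> float:
    """Score consistent with the documented outputs (assumed formulas)."""
    dist = levenshtein(s, t)
    total = len(s) + len(t)
    if normalize:
        # Assumption: two empty strings have distance 0.
        norm_dist = 2 * dist / (total + dist) if total else 0.0
        return 1.0 - norm_dist if similarity else norm_dist
    return (total - dist) / 2 if similarity else float(dist)


default_score = levenshtein_score("Olivier", "Oilvier")
raw_distance = levenshtein_score("Olivier", "Oilvier", normalize=False)
raw_similarity = levenshtein_score("Olivier", "Oilvier",
                                   normalize=False, similarity=True)
```

Note that the elementwise and pairwise calls in the example above reuse the comparator built with normalize=False and similarity=True, which is why their entries are unnormalized similarities.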

compare(self: stringcompare.distance._distance.Levenshtein, arg0: str, arg1: str) → float

class stringcompare.distance.StringComparator

Bases: pybind11_object

compare(self: stringcompare.distance._distance.StringComparator, arg0: str, arg1: str) → float

elementwise(self: stringcompare.distance._distance.StringComparator, arg0: collections.abc.Sequence[str], arg1: collections.abc.Sequence[str]) → numpy.typing.NDArray[numpy.float64]

pairwise(self: stringcompare.distance._distance.StringComparator, arg0: collections.abc.Sequence[str], arg1: collections.abc.Sequence[str]) → numpy.typing.NDArray[numpy.float64]

Submodules

stringcompare.distance.characterdifference module

class stringcompare.distance.characterdifference.CharacterDifference(normalize=True, similarity=False)[source]

Bases: StringComparator

compare(s, t)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

stringcompare.distance.comparator module

class stringcompare.distance.comparator.Comparator[source]

Bases: ABC, Generic[T]

abstractmethod compare(s: T, t: T) → float[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

elementwise(l1: List[T], l2: List[T]) → ndarray[source]

pairwise(l1: List[T], l2: List[T]) → ndarray[source]

Pairwise comparisons between two lists.

Parameters

l1: list

List of elements to compare from.

l2: list

List of elements to compare to.

Returns

Matrix of dimension len(l1)xlen(l2), where each row corresponds to an element of l1 and each column corresponds to an element of l2.

class stringcompare.distance.comparator.NumericComparator[source]

Bases: Comparator[float]

class stringcompare.distance.comparator.StringComparator[source]

Bases: Comparator[str]

stringcompare.distance.dameraulevenshtein module

class stringcompare.distance.dameraulevenshtein.DamerauLevenshtein(normalize=True, similarity=False, dmat_size=100)[source]

Bases: StringComparator

compare(s, t)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

stringcompare.distance.dameraulevenshtein.dameraulevenshtein(s, t, dmat)[source]

stringcompare.distance.jaro module

class stringcompare.distance.jaro.Jaro(similarity=False)[source]

Bases: StringComparator

compare(s, t)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

stringcompare.distance.jaro.jaro(s, t)[source]

stringcompare.distance.jarowinkler module

class stringcompare.distance.jarowinkler.JaroWinkler(similarity=False)[source]

Bases: StringComparator

compare(s, t)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

stringcompare.distance.jarowinkler.jarowinkler(s, t, p=0.1)[source]

stringcompare.distance.lcs module

class stringcompare.distance.lcs.LCSDistance(normalize=True, similarity=False, dmat_size=100)[source]

Bases: StringComparator

compare(s, t)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

stringcompare.distance.lcs.lcs(s, t, dmat)[source]

stringcompare.distance.levenshtein module

class stringcompare.distance.levenshtein.Levenshtein(normalize=True, similarity=False, dmat_size=100)[source]

Bases: StringComparator

compare(s, t)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

stringcompare.distance.levenshtein.levenshtein(s, t)[source]

stringcompare.distance.monge_elkan module

class stringcompare.distance.monge_elkan.MongeElkan(comparator: ~stringcompare.distance.comparator.StringComparator = <stringcompare.distance.levenshtein.Levenshtein object>, tokenizer: ~stringcompare.preprocessing.tokenizer.Tokenizer = <stringcompare.preprocessing.tokenizer.WhitespaceTokenizer object>, symmetrize=False)[source]

Bases: StringComparator

compare(s: str, t: str)[source]

Comparison between two elements.

Parameters

s:

element to compare from.

t:

element to compare to.

Returns

Number indicating similarity level between the two elements. This is not necessarily normalized or symmetric.

monge_elkan(s: str, t: str)[source]
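
The listing above gives no description of the measure, so the following is a minimal sketch of the textbook Monge-Elkan scheme rather than the library's code: each whitespace-separated token of s is matched with its most similar token in t, and the best scores are averaged. The inner similarity based on difflib.SequenceMatcher, the return value of 0.0 for empty input, and the symmetrization by averaging both directions are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def token_similarity(a: str, b: str) -> float:
    """Hypothetical inner similarity in [0, 1], from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(s: str, t: str,
                inner: Callable[[str, str], float] = token_similarity,
                symmetrize: bool = False) -> float:
    """Average, over tokens of s, of the best inner similarity to a token of t."""
    s_tokens, t_tokens = s.split(), t.split()
    if not s_tokens or not t_tokens:
        return 0.0  # assumption: no tokens means no similarity

    def directed(xs: List[str], ys: List[str]) -> float:
        return sum(max(inner(x, y) for y in ys) for x in xs) / len(xs)

    forward = directed(s_tokens, t_tokens)
    if not symmetrize:
        return forward
    # Assumption: symmetrize by averaging the two directed scores.
    return (forward + directed(t_tokens, s_tokens)) / 2


reordered = monge_elkan("paul johnson", "johnson paul")
one_way = monge_elkan("paul johnson", "paul")
other_way = monge_elkan("paul", "paul johnson")
balanced = monge_elkan("paul johnson", "paul", symmetrize=True)
```

Token order does not affect the score, but the directed score is asymmetric: "paul johnson" against "paul" is penalized for the unmatched token, while "paul" against "paul johnson" is not.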