.. role:: raw-html-m2r(raw)
   :format: html


.. image:: https://github.com/OlivierBinette/StringCompare/actions/workflows/python-package-conda.yml/badge.svg
   :target: https://github.com/OlivierBinette/StringCompare/actions/workflows/python-package-conda.yml
   :alt: Python package

.. image:: https://codecov.io/gh/OlivierBinette/StringCompare/branch/main/graph/badge.svg?token=F8ASD5R051
   :target: https://codecov.io/gh/OlivierBinette/StringCompare
   :alt: codecov

.. image:: https://www.codefactor.io/repository/github/olivierbinette/stringcompare/badge
   :target: https://www.codefactor.io/repository/github/olivierbinette/stringcompare
   :alt: CodeFactor

.. image:: https://img.shields.io/badge/lifecycle-maturing-blue.svg
   :target: https://lifecycle.r-lib.org/articles/stages.html
   :alt: Lifecycle Maturing

.. image:: https://img.shields.io/github/v/release/olivierbinette/stringcompare
   :target: https://github.com/OlivierBinette/StringCompare/releases
   :alt: Release version

.. image:: https://img.shields.io/github/sponsors/OlivierBinette
   :target: https://github.com/sponsors/OlivierBinette
   :alt: Sponsors


⚡ **StringCompare**\ : Efficient String Comparison Functions
===============================================================

**StringCompare** is a Python package for efficient string similarity computation and approximate string matching. It is inspired by the excellent *comparator* and *stringdist* R packages, and by the equally excellent *py_stringmatching*, *jellyfish*, and *textdistance* Python packages.

The key feature of **StringCompare** is a focus on speed, extensibility and maintainability through its *pybind11* C++ implementation. **StringCompare** is faster than most other Python libraries (see benchmark below) and much more memory efficient when dealing with long strings.

The complete API documentation is available on the project website, olivierbinette.github.io/StringCompare.

Installation
------------

Install the released version from GitHub with the following command:

.. code-block:: bash

   pip install git+https://github.com/OlivierBinette/StringCompare.git@release

Project Roadmap
---------------

**StringCompare** currently implements edit distances and similarity functions, such as the Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler distances.

This is *stage 1* of the following development roadmap:

.. list-table::
   :header-rows: 1

   * - Stage
     - Goals
     - Status
   * - 1
     - pybind11 framework and edit-based distances (Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler)
     - ✔️
   * - 2
     - Token-based and hybrid distances (tf-idf similarity, LSH, Monge-Elkan, ...)
     - ...
   * - 3
     - Vocabulary optimizations and metric trees
     - ...
   * - 4
     - Embeddings and string distance learning
     - ...

Examples
--------

Comparison algorithms are instantiated as ``Comparator`` objects, which provide the ``compare()`` method (equivalent to ``__call__()``\ ) for string comparison.

.. code-block:: python

   from stringcompare import Levenshtein, Jaro, JaroWinkler, DamerauLevenshtein, LCSDistance

   lev = Levenshtein(normalize=True, similarity=False)
   lev("Olivier", "Oliver")  # same as lev.compare("Olivier", "Oliver")

.. code-block::

   0.14285714285714285

Comparator objects also provide the ``elementwise()`` method for elementwise comparison between lists

.. code-block:: python

   lev.elementwise(["Olivier", "Olivier"], ["Oliver", "Olivia"])

.. code-block::

   array([0.14285714, 0.26666667])

and the ``pairwise()`` method for pairwise comparisons.

.. code-block:: python

   lev.pairwise(["Olivier", "Oliver"], ["Olivier", "Olivia"])

.. code-block::

   array([[0.        , 0.26666667],
          [0.14285714, 0.28571429]])

Benchmark
---------

Speed of the Damerau-Levenshtein implementation in different Python packages, when comparing the strings "Olivier Binette" and "Oilvier Benet":

.. code-block:: python

   from timeit import timeit
   from tabulate import tabulate

   # Comparison functions
   from stringcompare import DamerauLevenshtein
   cmp = DamerauLevenshtein()
   from jellyfish import damerau_levenshtein_distance
   from textdistance import damerau_levenshtein

   functions = {
       "StringCompare": cmp.compare,
       "jellyfish": damerau_levenshtein_distance,
       "textdistance": damerau_levenshtein,
   }

   # timeit returns the total time in seconds for `number` calls. With
   # number = 1,000,000, multiplying by 1000 gives the mean time per call in ns.
   table = [
       [name, timeit(lambda: fun("Olivier Binette", "Oilvier Benet"), number=1000000) * 1000]
       for name, fun in functions.items()
   ]
   print(tabulate(table, headers=["Package", "avg runtime (ns)"]))

.. code-block::

   Package          avg runtime (ns)
   -------------  ------------------
   StringCompare             697.834
   jellyfish                 974.363
   textdistance             3982.73

Performance notes
^^^^^^^^^^^^^^^^^

The use of pybind11 comes with a small performance overhead. We could be faster if we interfaced directly with CPython. However, pybind11 makes the library easy to extend and maintain. The C++ implementation barely needs to know about Python, except for the use of a pybind11 NumPy wrapper in some places. Pybind11 takes care of the details of exposing the C++ API to Python.

Known Bugs
----------

*pybind11* has compatibility issues with gcc 11 (e.g. on Ubuntu 21.10). If you are running Linux and ``gcc --version`` reports version 11, use the following commands to configure your environment before (re)installing:

.. code-block:: bash

   sudo apt install g++-9 gcc-9
   export CC=gcc-9 CXX=g++-9

If this is unsuccessful, you might want to use **StringCompare** within a Docker container. I recommend using the ``python:3.7.9`` base image. For example, after installing Docker, you can launch an interactive bash session and install **StringCompare** as follows:

.. code-block:: bash

   sudo docker run -it python:3.7.9 bash
   pip install git+https://github.com/OlivierBinette/StringCompare.git
   python
   >>> import stringcompare

Please report installation issues.

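
When troubleshooting an installation, it can help to have an independent reference for the values a working build should return. The following is a minimal pure-Python sketch that needs no compiled extension. The normalization ``2d / (len(s) + len(t) + d)`` is an assumption inferred from the example outputs in this README, not taken from the **StringCompare** source, and ``levenshtein`` and ``normalized_levenshtein`` are hypothetical helper names, not part of the package API.

.. code-block:: python

   def levenshtein(s, t):
       """Edit distance by two-row dynamic programming."""
       prev = list(range(len(t) + 1))
       for i, cs in enumerate(s, start=1):
           curr = [i]
           for j, ct in enumerate(t, start=1):
               cost = 0 if cs == ct else 1
               curr.append(min(prev[j] + 1,          # deletion
                               curr[j - 1] + 1,      # insertion
                               prev[j - 1] + cost))  # substitution
           prev = curr
       return prev[-1]


   def normalized_levenshtein(s, t):
       # Assumed normalization (inferred from the README outputs, not from
       # the library source): 2d / (len(s) + len(t) + d).
       d = levenshtein(s, t)
       total = len(s) + len(t) + d
       return 2 * d / total if total else 0.0


   print(round(normalized_levenshtein("Olivier", "Oliver"), 8))  # 0.14285714

On a working installation, ``Levenshtein(normalize=True, similarity=False)`` should agree with these reference values for the strings used in the Examples section.
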
Contribute
----------

**StringCompare** is currently at an early stage of development and contributions are welcome! See the contributing page for more information.

Acknowledgements
----------------

This project is made possible by the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the support of a G-Research grant.

I would also like to thank my individual GitHub sponsors for their support.