<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Olivier Binette</title>
<link>https://olivierbinette.ca/pages/blog.html</link>
<atom:link href="https://olivierbinette.ca/pages/blog.xml" rel="self" type="application/rss+xml"/>
<description>A collection of reflections, projects, and thoughts, big and small.</description>
<generator>quarto-1.5.57</generator>
<lastBuildDate>Wed, 11 Feb 2026 05:00:00 GMT</lastBuildDate>
<item>
  <title>Do you have any advice on how to build practical skills efficiently?</title>
  <link>https://olivierbinette.ca/pages/posts/2026-02-11-question-from-statistics-phd-student/2026-02-11-question-from-statistics-phd-student.html</link>
  <description><![CDATA[ 





<p>A second-year PhD student in statistics has already passed his prelim and started research. Coming from a non-statistics and non-computing background, he’s wondering how to build practical skills efficiently.</p>
<p>Here is my answer that focuses on a few useful resources:</p>
<blockquote class="blockquote">
<p>I’d highly recommend using https://www.datacamp.com/ as a tool to quickly gain more confidence using R and various other technologies. It’s fast-paced, interactive, starts at the basics, but also gets into more intermediate and advanced topics. I highly recommend subscribing for a month and doing a day or two of intensive learning on that platform.</p>
<p>Never be afraid to go back to the basics and take your time in working through bugs or inefficiencies. For example, if you find you have difficulty working with your command line interface, take the most basic DataCamp course on the topic to make sure you’re not missing anything essential.</p>
<p>The book “<a href="https://adv-r.hadley.nz/">Advanced R</a>” is an essential read for understanding R at a deeper level. While some of it may seem unnecessarily detailed, getting a rough understanding of the contents of this book will make day-to-day work with R a breeze.</p>
<p>Once you’re comfortable with your tools, you’ll likely start working on bigger projects. At that stage, being organized and having efficient workflows is very important. There are two resources I’d recommend to help you organize practical research done with R:</p>
<ul>
<li>This blog post talks about organizing a reproducible data analysis: https://hrdag.org/2016/06/14/the-task-is-a-quantum-of-workflow/</li>
<li>For larger-scale scientific projects, there are packages and frameworks like rrtools (https://github.com/benmarwick/rrtools) to help keep things organized.</li>
</ul>
</blockquote>



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>questions</category>
  <guid>https://olivierbinette.ca/pages/posts/2026-02-11-question-from-statistics-phd-student/2026-02-11-question-from-statistics-phd-student.html</guid>
  <pubDate>Wed, 11 Feb 2026 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Typo-Tolerant Search in 76 Lines of Code</title>
  <link>https://olivierbinette.ca/pages/posts/2026-02-11-typo-tolerant-search/2026-02-11-typo-tolerant-search.html</link>
  <description><![CDATA[ 





<p>It’s surprisingly easy to write an optimal implementation of <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein</a>-based typo-tolerant search. You basically need to:</p>
<ol type="1">
<li>Tokenize documents.</li>
<li>Create an inverted index to quickly look up documents from terms.</li>
<li>Combine a <a href="https://en.wikipedia.org/wiki/Trie">trie</a> with an adaptation of the usual Levenshtein distance algorithm for typo-tolerant search with optimal time complexity O(<code>#terms</code> x <code>max_levenshtein_distance</code>).</li>
</ol>
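<p>Steps 1 and 2 can be sketched in a few lines. This is only a minimal illustration, assuming whitespace tokenization and integer document positions as identifiers; the helper names <code>tokenize</code> and <code>build_inverted_index</code> are hypothetical and are not part of the implementation in the Code section.</p>

```python
import string
from collections import defaultdict


def tokenize(doc):
    # Lowercase, strip ASCII punctuation, and split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return doc.lower().translate(table).split()


def build_inverted_index(documents):
    # Map each term to the set of document positions that contain it.
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for term in tokenize(doc):
            index[term].add(doc_id)
    return index


docs = ["The quick brown fox.", "A quick note."]
index = build_inverted_index(docs)
print(sorted(index["quick"]))  # [0, 1]
print(sorted(index["fox"]))  # [0]
```

<p>Storing positions rather than full document strings is a design choice made for this sketch; it keeps the index small when documents are long.</p>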
<p>At the core of this implementation is the algorithm for fuzzy searching the trie, which is a variation of the algorithm used to compute the Levenshtein distance. It’s a depth first search of the trie:</p>
<ul>
<li>With each node, we associate a vector <code>dists</code> such that <code>dists[i]</code> represents the Levenshtein distance between <code>query[:i]</code> and the word represented by the current node.</li>
<li>By capping the maximum Levenshtein distance at <code>n</code>, each <code>dists</code> vector can be computed from the parent node’s <code>dists</code> vector by updating only about <code>2*n+1</code> entries, namely those within <code>n</code> positions of the current word length.</li>
<li>Furthermore, the <code>dists</code> vector allows for both (1) determining the Levenshtein distance between the query and the current word, and (2) checking whether descendants of the current node could still match. So we’re both efficiently computing the Levenshtein distance and only exploring relevant branches of the trie.</li>
</ul>
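<p>To make the per-node update concrete, here is a simplified sketch of how a child node’s vector can be derived from its parent’s. It is an unbanded version, so it updates every entry rather than only those near the diagonal, and the helper name <code>next_row</code> is hypothetical.</p>

```python
def next_row(prev_row, char, query):
    # prev_row[i] is the distance between query[:i] and the parent's prefix.
    # The returned row holds the same distances for the prefix extended by char.
    row = [prev_row[0] + 1]
    for i in range(len(query)):
        substitution = prev_row[i] + (query[i] != char)
        deletion = prev_row[i + 1] + 1
        insertion = row[i] + 1
        row.append(min(substitution, deletion, insertion))
    return row


query = "cat"
row = list(range(len(query) + 1))  # root row: distances to the empty prefix
for char in "cart":
    row = next_row(row, char, query)
    print(char, row)

# The last entry is the distance between "cat" and "cart".
print(row[-1])  # 1
```

<p>With a cap of <code>n</code>, a branch can be abandoned as soon as <code>min(row)</code> exceeds <code>n</code>, because entries in descendant rows can never fall below the minimum of the current row.</p>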
<p>Here’s what the trie and the attached dists vectors might look like:</p>
<p><a href="trie-search-example.excalidraw.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2026-02-11-typo-tolerant-search/trie-search-example.excalidraw.png" class="img-fluid"></a></p>
<section id="code" class="level2">
<h2 class="anchored" data-anchor-id="code">Code</h2>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> string</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> defaultdict</span>
<span id="cb1-3"></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Node:</span>
<span id="cb1-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, word<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, parent<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb1-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> word</span>
<span id="cb1-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.parent <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> parent</span>
<span id="cb1-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> defaultdict()</span>
<span id="cb1-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.dists <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb1-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.is_word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb1-12"></span>
<span id="cb1-13"></span>
<span id="cb1-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Trie:</span>
<span id="cb1-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-16">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Node()</span>
<span id="cb1-17"></span>
<span id="cb1-18">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> preprocess(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, doc):</span>
<span id="cb1-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (</span>
<span id="cb1-20">            doc.lower()</span>
<span id="cb1-21">            .translate(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>.maketrans(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, string.punctuation))</span>
<span id="cb1-22">            .encode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ascii"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"replace"</span>)</span>
<span id="cb1-23">            .decode()</span>
<span id="cb1-24">        )</span>
<span id="cb1-25"></span>
<span id="cb1-26">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> insert(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, word):</span>
<span id="cb1-27">        word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.preprocess(word)</span>
<span id="cb1-28">        node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root</span>
<span id="cb1-29">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(word):</span>
<span id="cb1-30">            node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> node.children.setdefault(char, Node(word[: i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], node))</span>
<span id="cb1-31">        node.is_word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb1-32"></span>
<span id="cb1-33">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fuzzy_search(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, query, n):</span>
<span id="cb1-34">        query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.preprocess(query)</span>
<span id="cb1-35">        matching_set, visited, stack <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(), [<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root]</span>
<span id="cb1-36">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> stack:</span>
<span id="cb1-37">            node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stack.pop()</span>
<span id="cb1-38">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> visited:</span>
<span id="cb1-39">                node.dists <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.get_levenshtein_dists(node, query, n)</span>
<span id="cb1-40">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node.is_word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> node.dists[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> n:</span>
<span id="cb1-41">                    matching_set.add(node.word)</span>
<span id="cb1-42">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(node.dists) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> n:</span>
<span id="cb1-43">                    stack.extend(node.children.values())</span>
<span id="cb1-44">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> matching_set</span>
<span id="cb1-45"></span>
<span id="cb1-46">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_levenshtein_dists(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, node, query, n):</span>
<span id="cb1-47">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node.parent <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb1-48">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(query) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-49">        dists <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(query) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-50">        dists[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(node.word)</span>
<span id="cb1-51">        prev_dists <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> node.parent.dists</span>
<span id="cb1-52">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, dists[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(query), dists[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)):</span>
<span id="cb1-53">            dists[i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb1-54">                prev_dists[i]</span>
<span id="cb1-55">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> query[i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> node.word[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb1-56">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(prev_dists[i], prev_dists[i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], dists[i])</span>
<span id="cb1-57">            )</span>
<span id="cb1-58"></span>
<span id="cb1-59">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> dists</span>
<span id="cb1-60"></span>
<span id="cb1-61"></span>
<span id="cb1-62"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Index:</span>
<span id="cb1-63">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, documents):</span>
<span id="cb1-64">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.trie <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Trie()</span>
<span id="cb1-65">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.inverted_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> defaultdict(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>)</span>
<span id="cb1-66">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> documents:</span>
<span id="cb1-67">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.trie.preprocess(doc).split():</span>
<span id="cb1-68">                <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.trie.insert(word)</span>
<span id="cb1-69">                <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.inverted_index[word].add(doc)</span>
<span id="cb1-70"></span>
<span id="cb1-71">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fuzzy_search(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, query, n):</span>
<span id="cb1-72">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> {</span>
<span id="cb1-73">            doc</span>
<span id="cb1-74">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.trie.fuzzy_search(query, n)</span>
<span id="cb1-75">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.inverted_index[word]</span>
<span id="cb1-76">        }</span></code></pre></div>
</section>
<section id="example" class="level2">
<h2 class="anchored" data-anchor-id="example">Example</h2>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> simplesearch <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Index</span>
<span id="cb2-2"></span>
<span id="cb2-3">docs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb2-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other projects."</span>,</span>
<span id="cb2-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The Hrabri class consisted of two submarines built for the Kingdom of Serbs, Croats and Slovenes. The first submarines to serve in the Royal Yugoslav Navy (KM), they arrived in Yugoslavia on 5 April 1928, and participated in cruises to Mediterranean ports prior to World War II."</span>,</span>
<span id="cb2-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Did you know that Jean-Emmanuel Depraz (pictured) won a Magic: The Gathering world championship using three cards depicting the player who beat him in 2021?"</span></span>
<span id="cb2-7">]</span>
<span id="cb2-8"></span>
<span id="cb2-9">index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Index(docs)</span>
<span id="cb2-10">index.fuzzy_search(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Willipedia"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## {'Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other projects.'}</span></span></code></pre></div>
</section>
<section id="notes" class="level2">
<h2 class="anchored" data-anchor-id="notes">Notes</h2>
<ul>
<li>This demo is only optimized for time complexity, not memory.</li>
<li>Obviously this is not a refined search engine. It’s just efficient single-term fuzzy search and corresponding document lookup.</li>
</ul>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata</li>
<li>https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html</li>
<li>https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html</li>
</ul>


</section>

 ]]></description>
  <category>technical</category>
  <category>information retrieval</category>
  <category>python</category>
  <guid>https://olivierbinette.ca/pages/posts/2026-02-11-typo-tolerant-search/2026-02-11-typo-tolerant-search.html</guid>
  <pubDate>Wed, 11 Feb 2026 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Buy or Build?</title>
  <link>https://olivierbinette.ca/pages/posts/2024-11-15-build-or-buy/2024-11-15-build-or-buy.html</link>
  <description><![CDATA[ 





<div class="columns">
<div class="column" style="width:70%;">
<p>Choosing where to get what you need - whether that’s software, hardware, or people - is called <strong>strategic sourcing.</strong> It’s about minimizing the <strong>total cost of ownership</strong>, including the costs associated with using, not using, or maintaining the thing you need.</p>
<p>The problem is particularly tricky for software developers. They’re paid to build, after all, and they know or want to know how to build. So why should they buy a solution when they can make it themselves?</p>
</div><div class="column" style="width:30%;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="bernie.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2024-11-15-build-or-buy/bernie.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:90.0%"></a></p>
</figure>
</div>
</div>
</div>
<blockquote class="blockquote">
<p>Before you build, make sure you understand the real costs to succeed over the long term, and only embark on those code-writing efforts you’re sure your business is capable of. - <a href="https://hbr.org/2021/12/when-should-your-company-develop-its-own-software">Robert Sher, HBR</a></p>
</blockquote>
<p>Answering this question requires having clear requirements, understanding the extent to which suppliers can meet these requirements, and understanding the total costs associated with each alternative.</p>
<p>But there’s a <strong>rule of thumb</strong> that covers many situations. If:</p>
<ul>
<li>you can buy what you need,</li>
<li>from a reasonably mature competitive market,</li>
<li>that benefits from economies of scale,</li>
</ul>
<p>then you should buy and not build.</p>
<p>Why? You’re unlikely to beat a competitive market with economies of scale, so buy if you can.</p>
<p>There are exceptions, such as when the thing you need is a core capability in which you’re trying to compete. But it’s a good rule of thumb for the rest.</p>
<hr>
<p>Other tips from <a href="https://pipdecks.com">PipDecks</a>’ Strategy Tactics:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pip-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://olivierbinette.ca/pages/posts/2024-11-15-build-or-buy/pip-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:50.0%"></a></p>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pip-2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="https://pipdecks.com/products/strategy-tactics"><img src="https://olivierbinette.ca/pages/posts/2024-11-15-build-or-buy/pip-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:50.0%" alt="https://pipdecks.com/products/strategy-tactics"></a></p>
</figure>
</div>
<figcaption>https://pipdecks.com/products/strategy-tactics</figcaption>
</figure>
</div>



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>management</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-11-15-build-or-buy/2024-11-15-build-or-buy.html</guid>
  <pubDate>Fri, 15 Nov 2024 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Product Development Is Hard</title>
  <link>https://olivierbinette.ca/pages/posts/2024-09-04-product-development-is-hard/2024-09-04-product-development-is-hard.html</link>
  <description><![CDATA[ 





<p>I am mostly a “technical” person. This means I tend to work on technology problems that have technology solutions. I’m interested in non-technological things as well, but it’s not my expertise.</p>
<p>In my field, <strong>learning about a new technology can feel like gaining a superpower.</strong> Think about being able to build a custom ChatGPT - it’s exciting!</p>
<p>With this comes the thought: “Wouldn’t it be nice if I solved problem Y using technology X?”</p>
<p>Unfortunately, the answer to this question is typically a resounding “no.”</p>
<p>It’s not that problem Y is not important. Or that technology X can’t help with problem Y. The problem is that product development is hard.</p>
<p>If I went about building a solution fueled only by my technological enthusiasm, then I would likely fail. It has happened to me before.</p>
<p>Most people don’t care about technology. They care about a job to be done. <strong>They want to gain a superpower of their own.</strong></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="job-to-be-done.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="https://jtbd.info/2-what-is-jobs-to-be-done-jtbd-796b82081cca"><img src="https://olivierbinette.ca/pages/posts/2024-09-04-product-development-is-hard/job-to-be-done.webp" class="preview-image img-fluid figure-img" alt="https://jtbd.info/2-what-is-jobs-to-be-done-jtbd-796b82081cca"></a></p>
<figcaption>https://jtbd.info/2-what-is-jobs-to-be-done-jtbd-796b82081cca</figcaption>
</figure>
</div>
<p>Building a good product requires understanding what your customer/client wants to get done, and where, when, and why they might want to use your product.</p>
<p>This is a science of its own. It’s not a technological problem, it’s a human problem. And it’s not my expertise.</p>
<p>As technologists, we need to embrace our backline role. We need to call on non-technologists to guide the creation of great products that empower others, or learn the skills we need to get this done through training from experts or experience working with experts.</p>



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>product-development</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-09-04-product-development-is-hard/2024-09-04-product-development-is-hard.html</guid>
  <pubDate>Wed, 04 Sep 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Strategic Project Management Made Simple</title>
  <link>https://olivierbinette.ca/pages/posts/2024-09-04-strategic-project-management-made-simple/2024-09-04-strategic-project-management-made-simple.html</link>
  <description><![CDATA[ 





<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Everything that follows is a quote from Terry’s book, with minimal adaptations for flow in some places. It’s an excellent book. Get it <a href="https://www.amazon.com/Strategic-Project-Management-Made-Simple/dp/1119718171">here</a>.</p>
</div>
</div>
</div>
<p>The most potent opportunities seldom show up labeled as “projects,” but arrive disguised as problems, issues, or murky messes. Tackling so-called Big, Hairy, Audacious Goals, as Jim Collins describes them in <em>Built to Last</em>, involves juggling a full spectrum of slippery Objectives that can be difficult to define, let alone manage.</p>
<p>In the pages ahead, I’ll walk you through a flexible thinking process, and show you how to sort through the fog of fuzzy ideas and develop sound strategies and executable plans. You’ll see how these tools scale up and down to handle issues of any size and flex to fit multiple situations you may face. But first, let’s review why most project plans are inadequate. See how many of these resonate with your personal experience:</p>
<div class="callout callout-style-simple callout-tip callout-titled" title="Beware these six dangerous planning mistakes">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Beware these six dangerous planning mistakes
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<tbody>
<tr class="odd">
<td><strong>Planning Mistake</strong></td>
<td><strong>Solution Elements</strong></td>
</tr>
<tr class="even">
<td><p><strong>Tolerating Vague Objectives</strong></p>
<p><em>In the rush to implement, not enough serious, upfront thinking goes into clarifying Objectives, Measures, and their interconnections.</em></p></td>
<td><ul>
<li><p>Make Objectives clear and measurable</p></li>
<li><p>Identify logical levels and If-Then links</p></li>
<li><p>Define your strategic hypotheses</p></li>
<li><p>Define why before what and how</p></li>
</ul></td>
</tr>
<tr class="odd">
<td><p><strong>Ignoring Environmental Context</strong></p>
<p><em>Projects unfold in unpredictable ways, but people sometimes think myopically and ignore how risk factors outside their project boundaries might affect them.</em></p></td>
<td><ul>
<li><p>Scan the environment for circumstances</p></li>
<li><p>Understand internal and external context</p></li>
<li><p>Identify risk elements</p></li>
<li><p>Make, test, manage, and monitor Assumptions</p></li>
</ul></td>
</tr>
<tr class="even">
<td><p><strong>Poor Planning Tools and Processes</strong></p>
<p><em>When the only tool is a hammer, the whole world looks like a nail. Before firing up your PC, fire up your brain and flesh out your project strategy.</em></p></td>
<td><ul>
<li><p>Choose common planning model and language</p></li>
<li><p>Plan top-down, test bottom-up</p></li>
<li><p>Plan for the plan</p></li>
<li><p>Use the Logical Framework as a central planning tool</p></li>
</ul></td>
</tr>
<tr class="odd">
<td><p><strong>Neglecting Stakeholder Interests</strong></p>
<p><em>Projects are real-life dramas played out by multiple actors who bring their own agenda and varying degrees of interest and support.</em></p></td>
<td><ul>
<li><p>Remember - people support what they help create</p></li>
<li><p>Involve people who matter</p></li>
<li><p>Understand the perspectives of others</p></li>
<li><p>Build consensus and commitment</p></li>
</ul></td>
</tr>
<tr class="even">
<td><p><strong>One-shot Planning</strong></p>
<p><em>Like home-baked bread that grows moldy with time, project plans have a limited shelf-life. They must be updated to reflect new learning and progress.</em></p></td>
<td><ul>
<li><p>Treat project documents as living plans, organic in nature</p></li>
<li><p>Be “cycle logical” - think, plan, act, and assess</p></li>
<li><p>Iterate and update in predetermined learning cycles</p></li>
<li><p>Constantly refine the strategic hypothesis</p></li>
</ul></td>
</tr>
<tr class="odd">
<td><p><strong>Mismanaging People Dynamics</strong></p>
<p><em>Project success requires the committed, coordinated action of many people.</em></p></td>
<td><ul>
<li><p>Build in payoffs (fun, learning, rewards)</p></li>
<li><p>Grow the team while growing the plan</p></li>
<li><p>Sharpen the who-when-what-how</p></li>
<li><p>Manage with emotional intelligence</p></li>
</ul></td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<section id="the-four-critical-questions" class="level2">
<h2 class="anchored" data-anchor-id="the-four-critical-questions">The Four Critical Questions</h2>
<p>All great solutions begin by asking the right questions. They seem like simple questions - that’s exactly the point. They are indeed simple, but not simplistic. The following four carefully crafted questions work wonders in virtually any situation. The first three are usually glossed over in the rush to answer the fourth.</p>
<ol type="1">
<li><p><strong>What are we trying to accomplish and why?</strong></p>
<p>The question of what the project should accomplish - and more importantly - why it needs to be done, deserves fine-tuned attention because those answers drive everything else. In the rush to decide on the how, who, and when of a project, people often gloss over the why.</p></li>
<li><p><strong>How will we measure success?</strong></p>
<p>This question is significant because Measures flesh out and anchor what the Objectives really mean. Until you define how success will be measured, even the most sincere visions are no more than highfalutin’ fluff.</p></li>
<li><p><strong>What other conditions must exist?</strong></p>
<p>This third question puts your project, issue, or initiative into a larger strategic context. Asking this expands the analysis to include some of the outside factors which may disrupt your carefully crafted plans.</p></li>
<li><p><strong>How do we get there?</strong></p>
<p>The majority of project teams I have worked with tend to delve deep into the details much too soon, or get sidelined by premature technical arguments. They gloss over the first three questions in a rush to get moving. The value of the fourth question comes from consciously placing it in its only, truly functional place in the planning sequence: Last.</p></li>
</ol>
</section>
<section id="logframes" class="level2">
<h2 class="anchored" data-anchor-id="logframes">LogFrames</h2>
<p>While the LogFrame matrix may initially seem intimidating, the ideas it captures are basic. The four strategic questions offer a user-friendly way to learn and apply this tool. These questions are inherently embedded in the matrix, and answering them helps you design your project in a way that connects all the dots.</p>
<p><a href="logframe1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2024-09-04-strategic-project-management-made-simple/logframe1.png" class="img-fluid" style="width:66.0%"></a></p>
<div class="callout callout-style-default callout-tip callout-titled" title="Alternative LogFrame Diagram">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Alternative LogFrame Diagram
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p><a href="logframe2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://olivierbinette.ca/pages/posts/2024-09-04-strategic-project-management-made-simple/logframe2.png" class="img-fluid" style="width:66.0%"></a></p>
</div>
</div>
</div>
<ol type="1">
<li><p><strong><em>What Are We Trying To Accomplish And Why?</em></strong> <strong>(Objectives)</strong><br>
The <strong>first column</strong> describes Objectives and the If-Then logic linking them together. The LogFrame makes important distinctions among various “levels” of Objectives: Strategic intention (<strong>Goal</strong>), project impact (<strong>Purpose</strong>), project deliverables (<strong>Outcomes</strong>), and the key action steps (<strong>Inputs</strong>).</p></li>
<li><p><strong><em>How Will We Measure Success?</em></strong> <strong>(Measures and Verifications)</strong></p>
<ul>
<li><p>The <strong>second column</strong> identifies the Measures of success for Objectives at each level. Here we select appropriate Measures and choose quantity, quality, and time indicators to clarify what each Objective means.</p></li>
<li><p>The <strong>third column</strong> summarizes how we will verify the status of the Measures at each level. Think of the Verification column as the project’s management information and feedback system.</p></li>
</ul></li>
<li><p><strong><em>What Other Conditions Must Exist?</em></strong> <strong>(Assumptions)</strong><br>
The <strong>fourth column</strong> captures Assumptions; those ever-present, but often neglected risk factors outside of the project, on which project success depends. Defining and testing Assumptions lets you spot potential problems and deal with them in advance.</p></li>
<li><p><strong><em>How Do We Get There?</em></strong> <strong>(Inputs)</strong><br>
The <strong>bottom row</strong> captures the project action plan: Who does what, when, and with what resources. Conventional project management tools like Work Breakdown Structures (WBS) and Gantt chart schedules fit here.</p></li>
</ol>
<div class="callout callout-style-simple callout-tip callout-titled" title="LogFrame Tips">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
LogFrame Tips
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<section id="logframe-tips" class="level3 callout-body-container callout-body">
<h3 class="anchored" data-anchor-id="logframe-tips">LogFrame Tips</h3>
<ul>
<li><p><em><strong>Treat the matrix as a summary</strong>.</em> Keep it clear and concise; supplement with other documents.</p></li>
<li><p><strong><em>Make sure everyone on the team has a working understanding</em></strong> of the LogFrame (at a minimum, knowing the four critical questions).</p></li>
<li><p><strong><em>Make sure the right people are involved</em></strong>. Invite key stakeholders to participate in project planning.</p></li>
<li><p><strong><em>Stress the importance of the process of planning</em></strong> as much as the plan that comes out of the planning process. Supplement liberally with other supporting tools.</p></li>
<li><p><strong><em>Iterate to make it great</em></strong>. Consider the first LogFrame to be a rough draft that will require revision and reworking, perhaps through many cycles.</p></li>
<li><p><strong><em>Build in specific milestones on the calendar</em></strong> at which you refine and revise the matrix in the light of new information.</p></li>
<li><p><strong><em>Monitor and manage changing Assumptions</em></strong> over time.</p></li>
</ul>
</section>
</div>
</div>
<div class="callout callout-style-simple callout-tip callout-titled" title="Turning a Problem Into a Set of Objectives">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-4-contents" aria-controls="callout-4" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Turning a Problem Into a Set of Objectives
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-4" class="callout-4-contents callout-collapse collapse">
<section id="turning-a-problem-into-a-set-of-objectives" class="level3 callout-body-container callout-body">
<h3 class="anchored" data-anchor-id="turning-a-problem-into-a-set-of-objectives">Turning a Problem Into a Set of Objectives</h3>
<p>A problem is simply a project in disguise. Projects masquerading as problems must first be converted into Objectives before advancing to solutions. Spend some time carefully diagnosing the problem because the way you define it shapes the range of solution options. Don’t get sucked in by an over-simplified definition, catch phrase, or symptom. Get at the root causes. Find the right problem to solve.</p>
<p>Stakeholder collaboration during problem analysis builds shared understanding, generates better solution approaches, and greases the skids for smoother execution.</p>
<section id="ask-your-stakeholders" class="level4">
<h4 class="anchored" data-anchor-id="ask-your-stakeholders">Ask Your Stakeholders</h4>
<ul>
<li><p>What do you see as the problem?</p></li>
<li><p>Why is this a problem and for whom?</p></li>
<li><p>What causes the problem?</p></li>
<li><p>What are the consequences if we ignore the problem?</p></li>
<li><p>How will you know when the problem is gone?</p></li>
<li><p>What benefits will a solution bring?</p></li>
<li><p>What might an ideal solution look like?</p></li>
</ul>
</section>
</section>
</div>
</div>
<div class="callout callout-style-simple callout-tip callout-titled" title="Exploring Distinctions Among LogFrame Levels">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-5-contents" aria-controls="callout-5" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Exploring Distinctions Among LogFrame Levels
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-5" class="callout-5-contents callout-collapse collapse">
<section id="exploring-distinctions-among-logframe-levels" class="level2 callout-body-container callout-body">
<h2 class="anchored" data-anchor-id="exploring-distinctions-among-logframe-levels">Exploring Distinctions Among LogFrame Levels</h2>
<section id="goal-the-big-picture-impact" class="level3">
<h3 class="anchored" data-anchor-id="goal-the-big-picture-impact">Goal: The Big Picture Impact</h3>
<p>The Goal is the big picture context — the overarching corporate or strategic Objective to which your project, and usually other projects, contribute.</p>
<p>Some typical Goal examples:</p>
<ul>
<li><p>Delight our customers</p></li>
<li><p>Become the top provider in the market</p></li>
<li><p>Increase corporate profits</p></li>
<li><p>Ensure reliability of the nuclear stockpile</p></li>
<li><p>Foster a climate of innovation</p></li>
<li><p>Be the global leader in safety education</p></li>
</ul>
<p>These secondary trigger questions can help you get to the primary Goal of a project:</p>
<ul>
<li><p>What is the higher corporate or strategic Objective to which this project contributes?</p></li>
<li><p>Why is the project’s impact important?</p></li>
<li><p>What should happen after we achieve the Purpose?</p></li>
<li><p>What is the big picture reason for doing this project?</p></li>
</ul>
</section>
<section id="purpose-the-project-sweet-spot" class="level3">
<h3 class="anchored" data-anchor-id="purpose-the-project-sweet-spot">Purpose: The Project Sweet Spot</h3>
<p>Purpose is the vital, often missing focus that expresses the desired result or the impact we expect the project deliverables to produce. It describes expected change in system behavior, whether the system of interest is a core process, a new organization unit, or target customers. Purpose floats a level above that which we can directly control — the Outcomes. It’s a subtle concept, often hard to grasp because we are so conditioned to thinking of activities and Outcomes.</p>
<p>Consider these examples:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th><strong>Outcomes Statement</strong></th>
<th><strong>Corresponding Purposes</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>System built or delivered</td>
<td>Customers use our system</td>
</tr>
<tr class="even">
<td>Process improved</td>
<td>Improved process used</td>
</tr>
<tr class="odd">
<td>System developed</td>
<td>System successfully implemented</td>
</tr>
<tr class="even">
<td>Staff trained in safe procedures</td>
<td>Staff operates machinery safely</td>
</tr>
</tbody>
</table>
<p>Here are some trigger questions you can ask to articulate the Purpose:</p>
<ul>
<li><p>Why are we really doing this project?</p></li>
<li><p>What would the clients or users like to see happen because of this project?</p></li>
<li><p>If this project were a success, how would we know?</p></li>
<li><p>What impact are we trying to achieve?</p></li>
</ul>
</section>
<section id="outcomes-what-the-project-will-deliver" class="level3">
<h3 class="anchored" data-anchor-id="outcomes-what-the-project-will-deliver">Outcomes: What the Project Will Deliver</h3>
<p>Project Outcomes describe what the team can, must, and commits to make happen to achieve Purpose. They can be functioning systems or processes (i.e., recruiting process operating) as well as completed end products (i.e., prototype built) and delivered services (i.e., people trained). They describe the specific end-results (or deliverables) expected from implementing a series of activities or tasks.</p>
<p>Use these questions to help solidify required Outcomes:</p>
<ul>
<li><p>What are our main project deliverables?</p></li>
<li><p>What do we need to make happen in order to achieve the project Purpose?</p></li>
<li><p>What are the end results for which the project team can be held accountable?</p></li>
<li><p>What processes do we need to put in place to achieve Purpose?</p></li>
</ul>
<table class="caption-top table">
<thead>
<tr class="header">
<th><strong>Inputs (Activities)</strong></th>
<th><strong>Outcomes</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Train users</td>
<td>Users trained</td>
</tr>
<tr class="even">
<td>Improve skills</td>
<td>Skills improved</td>
</tr>
<tr class="odd">
<td>Determine best methods</td>
<td>Best methods determined</td>
</tr>
<tr class="even">
<td>Build new office</td>
<td>New office built</td>
</tr>
</tbody>
</table>
</section>
</section>
</div>
</div>
<div class="callout callout-style-simple callout-tip callout-titled" title="Four Tips for Meaningful Measures">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-6-contents" aria-controls="callout-6" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Four Tips for Meaningful Measures
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-6" class="callout-6-contents callout-collapse collapse">
<section id="four-tips-for-meaningful-measures" class="level2 callout-body-container callout-body">
<h2 class="anchored" data-anchor-id="four-tips-for-meaningful-measures">Four Tips for Meaningful Measures</h2>
<p>Don’t fall into the trap of measuring only that which is easy to measure. Measuring Inputs and Outcomes is most straightforward, but progress towards Purpose and Goal is what really counts. The best Measures meet these criteria:</p>
<ol type="1">
<li><p><strong><em>Valid</em></strong> — They accurately measure the Objective. Changes in the status of Measures accurately reflect changes in the status of the Objective.</p></li>
<li><p><strong><em>Verifiable</em></strong> — Clear, non-subjective evidence exists or can be obtained. The third LogFrame column identifies processes and mechanisms for determining the status of Measures in column two.</p></li>
<li><p><strong><em>Targeted</em></strong> — Quality, quantity, and time targets are pinned down. Choose targets that are sufficient to achieve impact at the next higher level. Sometimes, rather than locking in a single number, it’s appropriate to state a rough range.</p></li>
<li><p><strong><em>Independent</em></strong> — Each level in the hierarchy has separate Measures.</p>
<ol type="1">
<li><p><em>Goal Measures</em> tend to be broad macro-Measures that include the long-term impact of one project or multiple projects aimed at the same Goal.</p></li>
<li><p><em>Purpose Measures</em> describe those conditions we expect will exist when we are willing to call the project a success.</p></li>
<li><p><em>Outcome Measures</em> describe specific tangible results that the project team can make happen and commits to doing so. Describe them as completed results (using the past tense verb form, such as “System developed” or “Training completed”).</p></li>
<li><p><em>Input Measures</em> deal with activity, budget, and schedule.</p></li>
</ol></li>
</ol>
<p>Purpose Measures are the most important in the hierarchy. Why? Because that’s your primary aiming point, the what-should-occur result you expect after you deliver what you can.</p>
</section>
</div>
</div>
<div class="callout callout-style-simple callout-tip callout-titled" title="Three Steps for Managing Assumptions">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-7-contents" aria-controls="callout-7" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Three Steps for Managing Assumptions
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-7" class="callout-7-contents callout-collapse collapse">
<section id="three-steps-for-managing-assumptions" class="level2 callout-body-container callout-body">
<h2 class="anchored" data-anchor-id="three-steps-for-managing-assumptions">Three Steps for Managing Assumptions</h2>
<section id="step-1.-identify-key-assumptions" class="level3">
<h3 class="anchored" data-anchor-id="step-1.-identify-key-assumptions">Step 1. Identify Key Assumptions</h3>
<p>Brainstorm all the conditions you believe are necessary to go from one LogFrame level to the next.</p>
<p><a href="logframe-assumptions.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://olivierbinette.ca/pages/posts/2024-09-04-strategic-project-management-made-simple/logframe-assumptions.png" class="img-fluid" style="width:66.0%"></a></p>
</section>
<section id="step-2.-analyze-and-test-them" class="level3">
<h3 class="anchored" data-anchor-id="step-2.-analyze-and-test-them">Step 2. Analyze and Test Them</h3>
<p>Try to assess the degree of risk you can expect from these critical Assumptions by using a simple rating system or probability percentages. Decide which Assumptions to highlight in the LogFrame matrix.</p>
<ul>
<li><p>How important is this Assumption to project success or failure?</p></li>
<li><p>How valid or probable is this Assumption? What are the odds that it is valid (or not)? Can we express it as a percentage? How do we know?</p></li>
<li><p>If the Assumptions fail, what is the effect on the project? Does a failed Assumption diminish accomplishment? Delay it? Destroy it?</p></li>
<li><p>What could cause this Assumption not to be valid? (Note: This one raises specific risk factors.)</p></li>
</ul>
</section>
<section id="step-3.-act-on-them" class="level3">
<h3 class="anchored" data-anchor-id="step-3.-act-on-them">Step 3. Act on Them</h3>
<p>Put each key Assumption under your mental microscope and consider the following:</p>
<ul>
<li><p>Is this a reasonable risk to take?</p></li>
<li><p>To what extent is it amenable to control? Can we manage it? Influence and nudge it? Or only monitor it?</p></li>
<li><p>What are some ways we can influence the Assumption?</p></li>
<li><p>What contingency plans might we put in place just in case the Assumption proves wrong?</p></li>
<li><p>How can we design the project to minimize the impact of, or work around, the Assumption?</p></li>
<li><p>Is this Assumption under someone else’s control?</p></li>
<li><p>How could we design the project to make this Assumption moot or irrelevant?</p></li>
</ul>
</section>
</section>
</div>
</div>
</section>
<section id="aligning-projects-with-strategic-intent" class="level2">
<h2 class="anchored" data-anchor-id="aligning-projects-with-strategic-intent">Aligning Projects With Strategic Intent</h2>
<p>The LogFrame can be the cornerstone of any unit-level management system. However, this presumes that there is a sound, overarching strategy to begin with.</p>
<p>Strategy is the particular means chosen to get from where you are to where you want to go, selected from multiple possibilities and reflecting your vision, mission, and values. An overall Strategy (big “S”) usually consists of multiple strategic initiatives (small “s”), which are executed through programs, projects, and tasks.</p>
<p>Strategic planning steps:</p>
<ol type="1">
<li><p><strong><em>Clarify the Planning Context and Issues</em></strong> - Be clear about your expected planning Outcomes and identify issues to include.</p></li>
<li><p><strong><em>Involve Key Players</em></strong> - Decide who to involve in your process to build buy-in and stay-in.</p></li>
<li><p><strong><em>Scan Your Environment</em></strong> - Identify what’s changing in your environment, and analyze division and department plans to extract Goals your group shares or owns.</p></li>
<li><p><strong><em>Revisit Your Vision/Mission/Values</em></strong> - Turn these “fluff” statements into high-performance tools that energize staff and build shared commitment.</p></li>
<li><p><strong><em>Sharpen Your Goals and Measures</em></strong> - Develop a meaningful performance scorecard that identifies how you deliver customer value.</p></li>
<li><p><strong><em>Develop Core Strategies</em></strong> - Turn Goals into strategies, and test those strategies for impact against Measures to ensure smart choices.</p></li>
<li><p><strong><em>Turn Strategies into Executable Plans</em></strong> - Using the <strong>Logical Framework</strong>, let the responsible players flesh out implementation plans.</p></li>
<li><p><strong><em>Follow Up and Continue the Process</em></strong> - Build momentum by reviewing and updating the plans while strengthening the planning process itself.</p></li>
</ol>
</section>
<section id="the-strategic-action-cycle" class="level2">
<h2 class="anchored" data-anchor-id="the-strategic-action-cycle">The Strategic Action Cycle</h2>
<p><a href="strategic-cycle.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://olivierbinette.ca/pages/posts/2024-09-04-strategic-project-management-made-simple/strategic-cycle.png" class="img-fluid" style="width:66.0%"></a></p>
<ol type="1">
<li><p>The cycle begins with “<strong><em>Think</em></strong>,” the big picture strategic/program focus which follows the process from Chapter 4, or an equivalent strategic planning process.</p></li>
<li><p>Results of strategic thinking identify projects to be managed with the Plan-Act-Assess cycle.</p></li>
<li><p><strong><em>Project plans</em></strong> created with LogFrames provide a solid foundation for action (execution/implementation) and Assessment.</p></li>
<li><p>The <strong><em>Assess</em></strong> block can complete the loop in three ways. If assessment shows that success has been achieved - as defined by project Purpose - the project can be considered complete.</p>
<ol type="1">
<li><p><strong>Project Monitoring</strong> is an ongoing process of tracking budget and schedule against deliverables and making tactical adjustments. It presumes the Logical Framework is the best design and focuses team attention on translating Inputs into Outcomes.</p></li>
<li><p><strong>Project Review</strong> is an occasional process that asks managers to step back from the day-to-day work and reassess their approach. It challenges the project design and invites changes in the LogFrame, with emphasis on the Outcome to Purpose link.</p></li>
<li><p><strong>Project Evaluation</strong> examines impact and cost effectiveness. Project evaluations are often timed as the end of one phase nears and another is about to begin, or after the project is over. Evaluation examines Purpose to Goal linkages.</p></li>
</ol></li>
</ol>
</section>
<section id="other" class="level2">
<h2 class="anchored" data-anchor-id="other">Other</h2>
<div class="callout callout-style-simple callout-tip callout-titled" title="Tips">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-8-contents" aria-controls="callout-8" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tips
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-8" class="callout-8-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<ul>
<li><p>The process of planning is more crucial than the planning documents that emerge at the other end. The collaborative use of the LogFrame helps you simultaneously build and shape a strong team while they work together to create an actionable plan.</p></li>
<li><p>Make sure that <strong>everyone speaks the same language</strong> by agreeing on what your key terms mean and using them in a consistent way.</p></li>
<li><p>The LogFrame matrix usually shows four levels, but <strong>Objectives above the Goal can be included</strong> to illustrate a higher level of impact. The higher up the hierarchy we climb, the more long-term, general, and “vision-sounding” these Objectives become.</p></li>
<li><p><strong>Don’t ask “How’s it going on this task?”</strong> Instead, ask:</p>
<ul>
<li><p>Are you having difficulties that would keep you from meeting targets?</p></li>
<li><p>Are you getting the support you need from others?</p></li>
<li><p>Is there anything else I should know about this?</p></li>
<li><p>What do you need from me?</p></li>
</ul></li>
<li><p><strong>Project monitoring</strong> asks “Are we <em>on</em> track?”; <strong>project reviews</strong> ask “Are we on the <em>right</em> track?” Use the LogFrame to challenge your strategy by posing questions such as:</p>
<ul>
<li><p>Is our Purpose still valid? What’s our progress toward Purpose?</p></li>
<li><p>Is our Purpose likely to be achieved with this plan? Will this Purpose get us to the Goal?</p></li>
<li><p>What is the status of Assumptions?</p></li>
<li><p>Are these the right Outcomes? Are we producing them effectively?</p></li>
<li><p>Should new Outcomes or Assumptions be added? Existing ones dropped?</p></li>
<li><p>How should we revise our key strategic hypotheses (Outcome to Purpose to Goal) to produce better results?</p></li>
</ul></li>
<li><p>Because the LogFrame’s systems thinking underpinnings are generic and flexible, so is the grid format itself. Be innovative and <strong>customize the LogFrame to your needs</strong> and add your own categories.</p></li>
<li><p>At times you’ll need to <strong>zoom in on a project component</strong> for more visibility. Some tasks are large enough to justify their own LogFrame.</p></li>
<li><p>Make responsibilities clear to all</p></li>
<li><p>Clarify Resource Requirements</p></li>
<li><p>Analyze stakeholder interests</p></li>
<li><p>Manage with emotional intelligence</p></li>
</ul>
</div>
</div>
</div>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>management</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-09-04-strategic-project-management-made-simple/2024-09-04-strategic-project-management-made-simple.html</guid>
  <pubDate>Wed, 04 Sep 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>The Pareto Principle and Project Failures</title>
  <link>https://olivierbinette.ca/pages/posts/2024-09-01-pareto-principle-and-project-failures/2024-09-01-pareto-principle-and-project-failures.html</link>
  <description><![CDATA[ 





<p>The Pareto principle, or the 80/20 rule, states that 80% of consequences come from 20% of the causes.</p>
<p>Surprisingly enough, this principle has general statistical underpinnings and does actually occur in a broad range of situations. The numbers 80/20 could be something else, but there is often an imbalance of this sort. It’s related to selection bias and size bias. Let me explain in the context of software development.</p>
<p>Say you’re building a piece of software for some use case. There’s a lot that goes into building and deploying the software: the UI, the logic, the backend, the deployment infrastructure, the iterative changes, etc. Each part contributes more or less to the functionality a user can see.</p>
<p><a href="plot.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2024-09-01-pareto-principle-and-project-failures/plot.png" class="img-fluid"></a></p>
<p>In this plot, UI+logic+backend is <strong>80% of the functionality</strong>* the user can see, but only <strong>40% of the required effort</strong> to complete the project.</p>
<p>If functionality and effort are uncorrelated or negatively correlated, then building the most functionality first will lead to decreasing return of efforts on functionality over the project’s life. The smallest set of components that contribute to 80% functionality is a <strong>biased selection</strong> that isn’t representative of the overall effort distribution.</p>
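<p>To make the arithmetic concrete, here is a minimal sketch with made-up numbers. The per-component percentages are assumptions chosen only to reproduce the 80%/40% split described above; they are not read off the plot. The hypothetical helper <code>functionality_first</code> builds components in order of visible functionality and tracks the cumulative totals.</p>

```python
# Hypothetical shares (percent of the project total), chosen only to
# match the 80% functionality / 40% effort split described in the text.
components = {
    "UI": {"functionality": 30, "effort": 10},
    "logic": {"functionality": 30, "effort": 15},
    "backend": {"functionality": 20, "effort": 15},
    "iterative changes": {"functionality": 10, "effort": 25},
    "deployment infrastructure": {"functionality": 10, "effort": 35},
}


def functionality_first(components):
    """Return (name, cumulative functionality, cumulative effort) rows,
    building the most visible functionality first."""
    ordered = sorted(
        components.items(),
        key=lambda item: item[1]["functionality"],
        reverse=True,
    )
    rows = []
    total_functionality = 0
    total_effort = 0
    for name, share in ordered:
        total_functionality += share["functionality"]
        total_effort += share["effort"]
        rows.append((name, total_functionality, total_effort))
    return rows


progress = functionality_first(components)
for name, functionality, effort in progress:
    print(f"after {name}: {functionality}% functionality, {effort}% effort")
```

<p>With these assumed numbers, the first three components reach 80% of the functionality for 40% of the effort, and the last 20% of functionality costs the remaining 60%.</p>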
<p>This doesn’t mean that the 80% seen functionality is more important than the other 20%. In fact, your software is going to be useless if you can’t build the infrastructure it needs for deployment. All components are equally important in this example. This mismatch between true value and apparent functionality can be dangerously misleading.</p>
<section id="why-software-projects-fail" class="level2">
<h2 class="anchored" data-anchor-id="why-software-projects-fail">Why Software Projects Fail</h2>
<p>The Pareto principle plays into the common failure (or cost overrun, scope creep, technical debt) of software projects.</p>
<p>Often, development teams prioritize building a minimum viable product (MVP), or delivering the most apparent functionality for a given effort level. Reaching 80% functionality quickly can set poor expectations of what’s needed to reach a product that has actual value, i.e.&nbsp;something maintainable and deployable. Clients, project managers, and developers can misunderstand the scope of a project if they rank tasks in functionality-first order, without considering the full value chain.</p>
<section id="a-better-approach---managing-risks-and-the-full-value-chain" class="level3">
<h3 class="anchored" data-anchor-id="a-better-approach---managing-risks-and-the-full-value-chain">A Better Approach - Managing Risks And the Full Value Chain</h3>
<p>As part of good project management, you want to:</p>
<ol type="1">
<li><strong>Map risks and uncertainties</strong>, and address the most important ones first.</li>
<li><strong>Deliver self-contained value</strong> to the client throughout the project, if possible.</li>
</ol>
<p>E.g. for (1), if you don’t know what a client wants, that’s a big risk. Getting an MVP in front of them might help reduce uncertainties and mitigate that risk. A cost overrun is also a big risk. If you don’t know how long it will take to build the infrastructure to deploy your system, then you might want to address that first.</p>
<p>For (2), note that value is not always the same as functionality. Undeployed functionality has no value to a client. An MVP, unless it is truly viable on its own, typically has little value to a client. A product that doesn’t meet quality requirements does not have any value. If clients hire you for software development, value is something they can use without any further software development.</p>
</section>
</section>
<section id="in-short" class="level2">
<h2 class="anchored" data-anchor-id="in-short">In Short</h2>
<p>The Pareto principle is both about the <strong>big impact you can have from a few actions</strong> (e.g., achieve 80% in 20% of the time), and <strong>how easily misled you can be about scope and impact</strong> (e.g., forgetting about a necessary 20% that takes 80% of the time).</p>
<hr>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="sheraz.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Infographic from Sheraz Ishak"><img src="https://olivierbinette.ca/pages/posts/2024-09-01-pareto-principle-and-project-failures/sheraz.png" class="img-fluid figure-img" alt="Infographic from Sheraz Ishak"></a></p>
<figcaption>Infographic from Sheraz Ishak</figcaption>
</figure>
</div>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>management</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-09-01-pareto-principle-and-project-failures/2024-09-01-pareto-principle-and-project-failures.html</guid>
  <pubDate>Sun, 01 Sep 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>The NABCs of Innovation</title>
  <link>https://olivierbinette.ca/pages/posts/2024-08-29-NABCs-of-innovation/2024-08-29-NABCs-of-innovation.html</link>
  <description><![CDATA[ 





<p><strong>Innovation</strong> is creating and delivering new value to customers.</p>
<p>It happens at different levels. <strong>R&amp;D projects</strong> are often expected to fail, but have potential for breakthroughs. Bringing existing technology to <strong>new markets</strong> is also a form of innovation, possibly with a higher success rate. <strong>Incremental optimizations</strong> and <strong>process improvements</strong> also involve innovation and are essential to an efficient business.</p>
<p>Innovation begins with someone having an idea they think could be valuable. Developing that idea and bringing it to customers requires time and energy.</p>
<p>A <strong>value proposition</strong> is what explains why this time and energy should be expended.</p>
<p><strong>Curtis R. Carlson</strong>, ex-President of SRI International, <a href="https://hbr.org/2020/11/innovation-for-impact">developed a framework for value propositions</a>. It has four main components (the “<strong>NABCs</strong>”) that aim to answer essential business questions:</p>
<ul>
<li><strong>Need</strong>: Who’s the customer? What’s their need or job to be done? What’s the gap in the market?</li>
<li><strong>Approach</strong>: How are we solving that need? Is it unique, compelling, and defensible?</li>
<li><strong>Benefit</strong>: What superior value is the customer getting through our approach?</li>
<li><strong>Competition</strong>: What’s the competition? Why is our approach more appealing?</li>
</ul>
<p>Additionally, there should be a driving force behind the proposition, i.e.&nbsp;motivated people willing and able to push this forward. The value proposition should also be aligned with the organization, both to support its development and enable capturing resulting value.</p>
<p>Building a good value proposition is an <strong>iterative process</strong>. The customer need is what matters and the approach might change - don’t fall in love with an idea. Focus on customer needs and the reasons underlying what they say they want. Try to quantify the value proposition, even if some of it may be guesswork. Address the biggest risks and uncertainties first, before trying to build everything. Maintain and adjust the value proposition throughout the project.</p>
<section id="exceptional-innovations" class="level2">
<h2 class="anchored" data-anchor-id="exceptional-innovations">Exceptional Innovations</h2>
<p>The best innovations don’t just provide new value.</p>
<p>They fit within or enable <strong>compounding processes</strong>, where past innovations keep on providing more and more value as they are built upon. Relatedly, they create more than one opportunity to capture value, i.e.&nbsp;they help expose the business to new opportunities, such as by entering new markets.</p>
<p>They align with the business’ <strong>strategic vision</strong> (its plan for growth) and reinforce its <strong>strategic positioning</strong> (how it distinguishes itself from competitors and provides compelling value, despite constraints).</p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>management</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-08-29-NABCs-of-innovation/2024-08-29-NABCs-of-innovation.html</guid>
  <pubDate>Thu, 29 Aug 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Test-Driven Development is Free</title>
  <link>https://olivierbinette.ca/pages/posts/2024-08-24-test-driven-development-is-free/2024-08-24-test-driven-development-is-free.html</link>
  <description><![CDATA[ 





<p>Test-driven development (TDD) is the practice of writing tests before starting to write functional code.</p>
<p>It sounds a bit formal, but it’s very close to what we do when developing interactively in a Python notebook: starting with a working example before refactoring code into a general-purpose function, and iterating on the process of creating examples, testing, and developing. The practice started in the early days of programming, which is why some of the guides on the topic can seem complicated. But, in short:</p>
<p><strong>TDD was interactive development, before interactive development was a thing!</strong></p>
<p>Now there are advantages to formalizing TDD, without needing to move away from interactive development. I won’t list all of them here, but I will point out the ones that support my argument that TDD is free.</p>
<section id="why-tdd-is-free" class="level2">
<h2 class="anchored" data-anchor-id="why-tdd-is-free">Why TDD Is Free</h2>
<p>Here’s a key assumption I’m making: <strong>doing things right the first time is free.</strong> If you’re not doing it right the first time, you’ll have to come back to it later anyway. And not doing it right the first time is likely to create many unnecessary costs along the way.</p>
<p>So, how do you do something right the first time? There are 2 parts to this:</p>
<ol type="1">
<li><p>You need to <strong>know what the “right” thing is</strong> that you want to do.</p></li>
<li><p>You need to <strong>check that you actually did it right.</strong></p></li>
</ol>
<p>Point (2) is testing. You’ll have to test, whether it is at the beginning, throughout, or at the end.</p>
<p>Point (1) is having clear requirements. Sure, you can write down a requirements specification in detail and work off of that. But you know what else is a clear requirement? A test case.</p>
<p>You can save time by combining points (1) and (2) together in test cases. Just keep in mind that you’ll have to <strong>write tests first</strong> in order to satisfy point (1).</p>
<p><strong>So, TDD is free</strong>: it’s not doing anything that you wouldn’t have to do anyway, and it’s saving you from extra work now and in the future.</p>
<p>Note that there is a learning curve to TDD. You need to find a TDD workflow that works for you. That takes a bit of time. But afterwards, you are saving time.</p>
</section>
<section id="this-isnt-a-new-idea" class="level2">
<h2 class="anchored" data-anchor-id="this-isnt-a-new-idea">This Isn’t a New Idea</h2>
<p><strong>You’re already doing TDD:</strong></p>
<ul>
<li><p>In agile development, we use “User Stories” to describe specifications. These are high-level test case descriptions: “given starting point X, I want to do Y to achieve Z.” User stories don’t tell you how to code things - that’s the functional implementation. It’s something you figure out afterwards, once you know what the input looks like, what the function is meant to do, and what the result should look like.</p></li>
<li><p>As mentioned earlier, interactive development is informal TDD. How can you formalize TDD in interactive development, without losing the benefits of interactive development? Simply bring the tests to your interactive development workflow. It can be done by staying organized, or you can use tools like the “ipytest” library for unit testing in Python notebooks.</p></li>
</ul>
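<p>As a rough illustration of the “staying organized” option, here is a minimal sketch of a test registry for a notebook. The names <code>notebook_test</code> and <code>run_all_tests</code> are hypothetical helpers made up for this example; they are not part of ipytest or pytest.</p>

```python
# Hypothetical notebook helpers: `notebook_test` and `run_all_tests` are
# illustrative names, not part of ipytest or pytest.
registered_tests = {}


def notebook_test(func):
    """Register a test under its name, so re-running a cell replaces
    the old version instead of adding a duplicate."""
    registered_tests[func.__name__] = func
    return func


def run_all_tests():
    """Run every registered test and return the names of those that failed."""
    failed = []
    for name, test in registered_tests.items():
        try:
            test()
        except AssertionError:
            failed.append(name)
    print(f"{len(registered_tests) - len(failed)} passed, {len(failed)} failed")
    return failed


@notebook_test
def test_addition():
    assert 1 + 1 == 2


@notebook_test
def test_sorting():
    assert sorted([3, 1, 2]) == [1, 2, 3]


failed_tests = run_all_tests()
```

<p>Each test lives next to the code it exercises, and a single call to <code>run_all_tests()</code> in the last cell re-checks everything written so far.</p>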
</section>
<section id="next-steps" class="level2">
<h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2>
<p>You’re already doing TDD, but maybe you’re not doing it in the most effective way. If you answer yes to some of the questions below, then it might be worth it to improve your TDD practices:</p>
<ul>
<li>Could you save time by catching bugs earlier?</li>
<li>Could you save time by writing examples/tests, instead of long-form documentation?</li>
<li>Could you save time by keeping track of the experiments, tests, and examples you use in a notebook as you develop?</li>
<li>Could you save time by clicking a single button to run all tests in your notebook, instead of backtracking to execute notebook cells one by one?</li>
<li>Do you often have to go back to fix bugs in your code or other people’s code?</li>
</ul>
<p>There are lots of guides online about TDD. But remember: you need to create a workflow that works for you. TDD is not about formality, complicated testing, or full-coverage testing. TDD is about speeding up your development and building things right the first time.</p>
</section>
<section id="tdd-myths" class="level2">
<h2 class="anchored" data-anchor-id="tdd-myths">TDD Myths</h2>
<p>Be careful not to fall into the following traps:</p>
<ul>
<li>“All tests need to be written upfront.” No.&nbsp;Your TDD tests only need to cover what you want to code up in the next 5-30 minutes. They’re meant to help you develop, not give you analysis paralysis.</li>
<li>“Tests can’t change.” No.&nbsp;TDD tests are there to help you develop. Change them as much as you like.</li>
<li>“I can’t add more tests after I’m done implementing.” No.&nbsp;TDD is an iterative process. Create a test, make sure it runs (and generally fails), develop, create more tests, check what fails, develop, and keep going until you are satisfied.</li>
<li>“I don’t need QA if I do TDD.” No.&nbsp;TDD is all about development. It helps develop faster and better. It’s about you, as a developer, building what you want to build right the first time. But, as often happens, it’s not because something is built right that it is the right thing for your customer!</li>
</ul>
</section>
<section id="practical-example" class="level2">
<h2 class="anchored" data-anchor-id="practical-example">Practical Example</h2>
<p>Here’s what TDD looks like in practice.</p>
<p>Say I want to code a function “fibonacci” that computes the first n numbers of the standard Fibonacci sequence.</p>
<section id="step-1-a-first-simple-example-and-test" class="level3">
<h3 class="anchored" data-anchor-id="step-1-a-first-simple-example-and-test">Step 1: A first simple example and test</h3>
<p>First, I’ll write an example of what I want to do. This defines requirements for my function and lets me check it. The first tests should be simple and useful for development. If I don’t know in advance what the output should be, that’s OK: I can do a smoke test instead (just check that the function runs without error and show its output).</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Input</span></span>
<span id="cb1-2">input_n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Output</span></span>
<span id="cb1-5">expected_output <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>]</span></code></pre></div>
<p>Then I keep track of this as a test case, so it’s easy to execute.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_fibonacci():</span>
<span id="cb2-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> fibonacci(input_n) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> expected_output</span></code></pre></div>
<p>Notice that this first step is very simple and directly related to my current development task: develop a function that gets the logic right. I don’t want to worry about edge cases and every detail right now, so I don’t write tests/examples for that.</p>
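<p>For the case where the expected output isn’t known yet, a smoke test might look like the sketch below. Both <code>smoke_test</code> and <code>first_draft</code> are hypothetical names used only for illustration; <code>first_draft</code> stands in for whatever function is under development.</p>

```python
def smoke_test(func, *args):
    """Call func and show its output; any exception makes the test fail."""
    output = func(*args)
    print(f"{func.__name__} called with {args} returned {output!r}")
    return output


def first_draft(n):
    # Hypothetical work-in-progress function, standing in for the
    # function being developed.
    return list(range(1, n + 1))


output = smoke_test(first_draft, 5)
```

<p>Once the printed output looks right, it can be copied into an <code>assert</code> and kept as a regular test case.</p>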
</section>
<section id="step-2-implement-and-check" class="level3">
<h3 class="anchored" data-anchor-id="step-2-implement-and-check">Step 2: Implement and check</h3>
<p>Now I code the function and test it.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fibonacci(n):</span>
<span id="cb3-2">  result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span>
<span id="cb3-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(result) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> n:</span>
<span id="cb3-4">    result.append(result[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> result[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span>
<span id="cb3-5">  </span>
<span id="cb3-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> result</span>
<span id="cb3-7"></span>
<span id="cb3-8">test_fibonacci()</span></code></pre></div>
<p>If it doesn’t pass, make changes until it does. When it passes, great! We have the right logic. Now we can think about edge cases and iterate.</p>
</section>
<section id="step-3-iterate" class="level3">
<h3 class="anchored" data-anchor-id="step-3-iterate">Step 3: Iterate</h3>
<p>First, create examples/test cases. Again, this specifies what we want to achieve, and makes it easy for us to check it.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_fibonacci_edge_cases():</span>
<span id="cb4-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> fibonacci(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> []</span>
<span id="cb4-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> fibonacci(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb4-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># etc </span></span></code></pre></div>
<p>Then, make changes to your function and run the tests:</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fibonacci(n):</span>
<span id="cb5-2">  ...</span>
<span id="cb5-3"></span>
<span id="cb5-4">test_fibonacci() <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make sure I didn't break anything</span></span>
<span id="cb5-5">test_fibonacci_edge_cases() <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># New tests</span></span></code></pre></div>
<p>A large number of tests can quickly become unwieldy. This is where testing frameworks like pytest come in handy. They keep track of test suites and let you run all tests in a single click.</p>
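<p>For reference, here is one possible way the iteration could end up. This is a sketch, not the only way to handle the edge cases, and it assumes <code>n</code> is a non-negative integer. The tests use plain <code>assert</code> statements in functions prefixed with <code>test_</code>, which is the style pytest collects.</p>

```python
def fibonacci(n):
    """Return the first n terms of the sequence 1, 2, 3, 5, 8, ...

    Assumes n is a non-negative integer; negative n is not handled.
    """
    result = [1, 2]
    # Extend the list only when more than two terms are requested.
    for _ in range(n - 2):
        result.append(result[-1] + result[-2])
    # Slicing covers n = 0 and n = 1 without a special case.
    return result[:n]


def test_fibonacci():
    assert fibonacci(5) == [1, 2, 3, 5, 8]


def test_fibonacci_edge_cases():
    assert fibonacci(0) == []
    assert fibonacci(1) == [1]
    assert fibonacci(2) == [1, 2]


# Manual run, as in the steps above.
test_fibonacci()
test_fibonacci_edge_cases()
```

<p>If this were saved in a file such as <code>test_fibonacci.py</code> (a hypothetical name), running <code>pytest</code> in that folder would find and run both test functions, and the two manual calls at the bottom could be dropped.</p>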


</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>python</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-08-24-test-driven-development-is-free/2024-08-24-test-driven-development-is-free.html</guid>
  <pubDate>Sat, 24 Aug 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Personal Knowledge Management</title>
  <link>https://olivierbinette.ca/pages/posts/2024-08-15-personal-knowledge-management/2024-08-15-personal-knowledge-management.html</link>
  <description><![CDATA[ 





<p>Essentially all of my work involves reading and writing. I write papers and proposals, code, documentation, emails, and I jot down thoughts in problem-solving sessions. And all of that is in relation to the writings and ideas of an incredibly large number of people.</p>
<p>Keeping up with all this information requires knowledge management systems. They are often integrated into our online experiences - we have bookmarks, searchable email inboxes, online code repositories, etc.</p>
<p>But some effort is needed to use these systems effectively, without being overwhelmed by all of these disparate systems. That’s where personal knowledge management comes in.</p>
<p>It’s not a new idea. For millennia, beginning at least with Aristotle, writers have been using “commonplace” books to organize their notes, quotes, and ideas. Steven Johnson, in the book <a href="https://www.ted.com/talks/steven_johnson_where_good_ideas_come_from?subtitle=en"><em>Where Good Ideas Come From</em></a>, relates Darwin’s notebooks to this tradition:</p>
<blockquote class="blockquote">
<p>Darwin’s notebooks lie at the tail end of a long and fruitful tradition that peaked in Enlightenment-era Europe, particularly in England: the practice of maintaining a ‘commonplace’ book. Scholars, amateur scientists, aspiring men of letters - just about anyone with intellectual ambition in the seventeenth and eighteenth centuries was likely to keep a commonplace book. The great minds of the period - Milton, Bacon, Locke - were zealous believers in the memory-enhancing powers of the commonplace book.</p>
</blockquote>
<p>Something as simple as the “notes” app on your phone, or sending yourself emails, can work well enough for note-taking. But we can get much more out of our notes by using technology to help index notes, create connections between them, and help summarize and extract relevant information when needed.</p>
<p>Technology can also help us overcome the challenges of determining how to organize notes. Personally, I cannot keep any file tree well organized. There is an alternative: instead of a hierarchical tree, we can organize notes in a graph using tags and links. This is how Wikipedia is structured. You don’t find a wiki page by going down a file tree. Rather, you do keyword searches and follow links between pages.</p>
<p>My favorite tool for this is <a href="https://obsidian.md">Obsidian</a> (at work I use Confluence). Previously I used <a href="https://www.notion.com">Notion</a>, and before that I only used paper. Obsidian is free, easy-to-use, private (it’s a desktop app!), and responsive. I use it to keep track of everything that isn’t my paper notepad, emails, or LaTeX/Word documents.</p>
<p><a href="obsidian.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2024-08-15-personal-knowledge-management/obsidian.png" class="preview-image img-fluid"></a></p>
<p>There are lots of other tools available:</p>
<div class="columns">
<div class="column" style="width:50%;">
<ul>
<li><a href="http://hypothes.is/">hypothes.is</a> for web annotation</li>
<li><a href="https://roamresearch.com/">Roam</a></li>
<li><a href="https://www.notion.so/">Notion</a></li>
</ul>
</div><div class="column" style="width:50%;">
<ul>
<li><a href="https://logseq.com/">Logseq</a></li>
<li><a href="https://www.dendron.so/">Dendron</a></li>
<li><a href="https://databyss.org/">Databyss</a></li>
</ul>
</div>
</div>
<p>In short, it’s easy to take modern digital features like hypertext or search for granted. But it’s really amazing how far we’ve come to get here, and I think we can do even more amazing things if we can use these features to their full extent or push them even further.</p>



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>knowledge-management</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-08-15-personal-knowledge-management/2024-08-15-personal-knowledge-management.html</guid>
  <pubDate>Thu, 15 Aug 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Measurement and Management</title>
  <link>https://olivierbinette.ca/pages/posts/2024-08-15-measurement-and-management/2024-08-15-measurement-and-management.html</link>
  <description><![CDATA[ 





<div class="columns">
<div class="column" style="width:80%;">
<p>W. Edwards Deming pioneered the use of measurement and statistics in manufacturing industries, using data to improve processes. Some even credit part of the success of the post-WWII Japanese auto industry (e.g.&nbsp;Toyota) to Deming’s career in Japan, where he taught and popularized the use of Statistical Process Control (SPC) [1].</p>
<p>Unfortunately, Deming’s work and ideas are widely misunderstood. And Deming was aware of this. Much of his later writing emphasizes how a <strong>naive understanding of quality management is counterproductive</strong>. <sup>1</sup></p>
</div><div class="column" style="width:20%;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="deming.jpeg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="W. Edwards Deming"><img src="https://olivierbinette.ca/pages/posts/2024-08-15-measurement-and-management/deming.jpeg" class="preview-image img-fluid figure-img" alt="W. Edwards Deming"></a></p>
<figcaption>W. Edwards Deming</figcaption>
</figure>
</div>
</div>
</div>
<section id="dont-manage-by-numbers." class="level2">
<h2 class="anchored" data-anchor-id="dont-manage-by-numbers.">Don’t manage by numbers.</h2>
<p>It’s a bit confusing: Deming encouraged the use of measurement, metrics, data, and statistics as key tools for process improvement and quality control. And yet he also painstakingly tried to drive home points like these:</p>
<ul>
<li><strong>“It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.”</strong></li>
<li><strong>“Eliminate management by numbers and numerical goals.”</strong></li>
</ul>
<p>How can this be? How can he simultaneously be pro-measurement, pro-data, and against data-driven management?</p>
</section>
<section id="how-can-we-resolve-this-false-paradox" class="level2">
<h2 class="anchored" data-anchor-id="how-can-we-resolve-this-false-paradox">How can we resolve this false paradox?</h2>
<p>As a statistician, Deming was aware of how much valid inference depends on what you can’t measure. Statistics is not about data. It’s about combining data and context to make valid inferences. Data on its own has no meaning. Missing data - including both the data you wish you had and the data you don’t even know you’re missing - is more important than the data you have. A statistician’s work is to help learn about such unknowns. It’s a fallacy to make decisions based only on available data - the <a href="https://en.wikipedia.org/wiki/McNamara_fallacy">McNamara fallacy</a>.</p>
<blockquote class="blockquote">
<p>“But when the McNamara discipline is applied too literally, the first step is to measure whatever can be easily measured. The second step is to disregard that which can’t easily be measured or given a quantitative value. The third step is to presume that what can’t be measured easily really isn’t important. The fourth step is to say that what can’t be easily measured really doesn’t exist. This is suicide.” — <a href="https://en.wikipedia.org/wiki/Daniel_Yankelovich">Daniel Yankelovich</a>, “<a href="https://archive.org/details/sim_sales-management_1971-11-15_107_11/page/26/mode/2up?view=theater">Interpreting the New Life Styles</a>”, Sales Management (1971)</p>
</blockquote>
<p>The problem isn’t data or measurement. In fact, you should aim to measure as much as you can, as often as you can. You should build measurement and observability as core components of your systems and infrastructures. You should work to continually improve your approach to measurement of what matters. And you should have statisticians or data scientists make sense of these numbers through their context, given specific goals.</p>
<p>But here’s the thing: <strong>measurement is not management.</strong></p>
<p>As a manager, your job is to create and maintain structures that drive customer value and continuous improvement. To achieve this, you need to think about knowns (i.e., data, metrics) and unknowns. Statisticians or data scientists can help you contextualize data and shed light on unknowns, although it’s not always an easy process.</p>
</section>
<section id="in-short" class="level2">
<h2 class="anchored" data-anchor-id="in-short">In Short</h2>
<p>There are many misconceptions surrounding data and its use in management. It is important for all to understand both the importance of data and its limitations. We can do so by learning from resources such as the Deming Institute’s website:</p>
<script src="https://cdn.jsdelivr.net/npm/@mariusbongarts/previewbox/dist/index.min.js"></script>
<p><previewbox-link url="https://deming.org/explore/fourteen-points" title="Deming's Fourteen points" description="Dr. W. Edwards Deming offered 14 key principles for management to follow to improve the effectiveness of a business or organization significantly. The principles (points) were first presented in his book Out of the Crisis. Below is the condensation of the 14 Points for Management, but these alone will not improve your business." imageurl="https://upload.wikimedia.org/wikipedia/commons/7/73/W._Edwards_Deming.jpg"></previewbox-link></p>
<p>Deming advocated for structures that removed fear in workers, fostered continuous improvement, and enabled taking pride in one’s work.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Deming started working in Japan in 1947, bringing knowledge of the theory of Statistical Process Control (SPC) that was pioneered by Walter A. Shewhart at Bell Laboratories a few decades earlier. During post-war reconstruction, the Union of Japanese Scientists and Engineers (JUSE) invited Deming to teach SPC to engineers and managers. He went on to work with private enterprises and received multiple awards for his contributions.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>general</category>
  <category>management</category>
  <guid>https://olivierbinette.ca/pages/posts/2024-08-15-measurement-and-management/2024-08-15-measurement-and-management.html</guid>
  <pubDate>Thu, 15 Aug 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Comment on The Sample Size Required in Importance Sampling</title>
  <link>https://olivierbinette.ca/pages/posts/2017-03-18-comment-on-sample-size-for-importance-sampling/2017-03-18-comment-on-sample-size-for-importance-sampling.html</link>
  <description><![CDATA[ 





<p>The problem is to evaluate</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?I%20=%20I(f)%20=%20%5Cint%20f%20d%5Cmu,">
</p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmu"> is a probability measure on a space <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BM%7D"> and where <img src="https://latex.codecogs.com/png.latex?f:%20%5Cmathbb%7BM%7D%20%5Crightarrow%20%5Cmathbb%7BR%7D"> is measurable. The Monte-Carlo estimate of <img src="https://latex.codecogs.com/png.latex?I"> is</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi=1%7D%5En%20f(x_i),%20%5Cqquad%20x_i%20%5Csim%20%5Cmu.">
</p>
<p>When it is too difficult to sample from <img src="https://latex.codecogs.com/png.latex?%5Cmu">, for instance, other estimates can be obtained. Suppose that <img src="https://latex.codecogs.com/png.latex?%5Cmu"> is absolutely continuous with respect to another probability measure <img src="https://latex.codecogs.com/png.latex?%5Clambda">, and that the density of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Clambda"> is given by <img src="https://latex.codecogs.com/png.latex?%5Crho">. Another unbiased estimate of <img src="https://latex.codecogs.com/png.latex?I"> is then</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?I_n(f)%20=%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi=1%7D%5En%20f(y_i)%5Crho(y_i),%5Cqquad%20y_i%20%5Csim%20%5Clambda.">
</p>
<p>This is the general framework of importance sampling, with the Monte-Carlo estimate recovered by taking <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cmu">. An important question is the following.</p>
<p style="text-align:center;">
<em>How large should <img src="https://latex.codecogs.com/png.latex?n"> be for <img src="https://latex.codecogs.com/png.latex?I_n(f)"> to be close to <img src="https://latex.codecogs.com/png.latex?I(f)">?</em>
</p>
<p>An answer is given, under certain conditions, by Chatterjee and Diaconis (2015). Their main result can be interpreted as follows. If <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Cmu"> and if <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Crho(X)"> is concentrated around its expected value <img src="https://latex.codecogs.com/png.latex?L=%5Ctext%7BE%7D%5B%5Clog%20%5Crho(X)%5D">, then a sample size of approximately <img src="https://latex.codecogs.com/png.latex?n%20=%20e%5E%7BL%7D"> is both necessary and sufficient for <img src="https://latex.codecogs.com/png.latex?I_n"> to be close to <img src="https://latex.codecogs.com/png.latex?I">. The exact sample size needed depends on <img src="https://latex.codecogs.com/png.latex?%5C%7Cf%5C%7C_%7BL%5E2(%5Cmu)%7D"> and on the tail behavior of <img src="https://latex.codecogs.com/png.latex?%5Clog%5Crho(X)">. I state below their theorem with a small modification.</p>
<p><strong>Theorem 1.</strong> (Chatterjee and Diaconis, 2015) <em>As above, let <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Cmu">. For any <img src="https://latex.codecogs.com/png.latex?a%20%3E%200"> and <img src="https://latex.codecogs.com/png.latex?n%20%5Cin%20%5Cmathbb%7BN%7D">,</em></p>
<p style="text-align:center;">
<em><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%20%7CI_n(f)%20-%20I(f)%7C%20%5Cle%20%5C%7Cf%5C%7C_%7BL%5E2(%5Cmu)%7D%5Cleft(%20%5Csqrt%7Ba/n%7D%20+%202%5Csqrt%7B%5Cmathbb%7BP%7D%20(%5Crho(X)%20%3E%20a)%7D%20%5Cright)."></em>
</p>
<p><em>Conversely, for any <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%5Cin%20(0,1)"> and <img src="https://latex.codecogs.com/png.latex?b%20%3E%200">,</em></p>
<p style="text-align:center;">
<em><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(1%20-%20I_n(1)%20%5Cle%20%5Cdelta)%20%5Cle%20%5Cfrac%7Bn%7D%7Bb%7D%20+%20%5Cfrac%7B%5Cmathbb%7BP%7D(%5Crho(X)%20%5Cle%20b)%7D%7B1-%5Cdelta%7D."></em>
</p>
<p><strong>Remark 1.</strong> Suppose <img src="https://latex.codecogs.com/png.latex?%5C%7Cf%5C%7C_%7BL%5E2(%5Cmu)%7D%20%5Cle%201"> and that <img src="https://latex.codecogs.com/png.latex?%5Clog%5Crho(X)"> is concentrated around <img src="https://latex.codecogs.com/png.latex?L%20=%20%5Cmathbb%7BE%7D%20%5Clog%5Crho(X)">, meaning that for some <img src="https://latex.codecogs.com/png.latex?t%20%3E%200"> we have that <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(%5Clog%20%5Crho(X)%20%3E%20L+t/2)"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(%5Clog%5Crho(X)%20%3C%20L-t/2)"> are both less than an arbitrary <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%20%3E%200">. Then, taking <img src="https://latex.codecogs.com/png.latex?n%20%5Cgeq%20e%5E%7BL+t%7D"> we find</p>
<p style="text-align:center;">
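<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%20%7CI_n(f)%20-%20I%7C%20%5Cle%20e%5E%7B-t/4%7D%20+%202%5Csqrt%7B%5Cvarepsilon%7D.">
</p>
<p>However, if <img src="https://latex.codecogs.com/png.latex?n%20%5Cle%20e%5E%7BL-t%7D">, we obtain</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D%5Cleft(1%20-%20I_n(1)%20%5Cle%20%5Ctfrac%7B1%7D%7B2%7D%5Cright)%20%5Cle%20e%5E%7B-t/2%7D%20+%202%5Cvarepsilon.">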
</p>
<p>meaning that there can be a high probability that <img src="https://latex.codecogs.com/png.latex?I(1)"> and <img src="https://latex.codecogs.com/png.latex?I_n(1)"> are not close.</p>
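<p>One way to read Remark 1, sketched here as a derivation from Theorem 1 rather than as part of the original argument: take <em>a</em> = e^{L+t/2} in the first inequality, and <em>b</em> = e^{L-t/2} with δ = 1/2 in the second. The choice δ = 1/2 and the treatment of the boundary event {log ρ(X) = L - t/2} as negligible are assumptions of this sketch.</p>

```latex
\begin{align*}
% First inequality, with a = e^{L+t/2}, n \ge e^{L+t} and \|f\|_{L^2(\mu)} \le 1:
\sqrt{a/n} &\le \sqrt{e^{L+t/2}\, e^{-(L+t)}} = e^{-t/4},
&
\mathbb{P}(\rho(X) > a) &= \mathbb{P}(\log\rho(X) > L + t/2) < \varepsilon,
\\
&\Longrightarrow\quad
\mathbb{E}\,|I_n(f) - I| \le e^{-t/4} + 2\sqrt{\varepsilon}.
\\[1ex]
% Second inequality, with b = e^{L-t/2}, \delta = 1/2 and n \le e^{L-t}:
\frac{n}{b} &\le e^{L-t}\, e^{-(L-t/2)} = e^{-t/2},
&
\frac{\mathbb{P}(\rho(X) \le b)}{1-\delta} &= 2\,\mathbb{P}(\log\rho(X) \le L - t/2) \le 2\varepsilon,
\\
&\Longrightarrow\quad
\mathbb{P}\left(1 - I_n(1) \le \tfrac{1}{2}\right) \le e^{-t/2} + 2\varepsilon.
\end{align*}
```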
<p><strong>Remark 2.</strong> Let <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cmu">, so that <img src="https://latex.codecogs.com/png.latex?%5Crho%20=%201">. In that case, <img src="https://latex.codecogs.com/png.latex?%5Clog%5Crho(X)"> only takes its expected value <img src="https://latex.codecogs.com/png.latex?0">. The theorem yields</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%20%7CI_n(f)%20-%20I(f)%7C%20%5Cle%20%5Cfrac%7B%5C%7Cf%5C%7C_%7BL%5E2(%5Cmu)%7D%7D%7B%5Csqrt%7Bn%7D%7D">
</p>
<p>and no useful bound on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BP%7D(1-I_n(1)%20%5Cle%20%5Cdelta)">.</p>
<p><strong>Comment.</strong> For the theorem to yield a sharp cutoff, it is necessary that <img src="https://latex.codecogs.com/png.latex?L%20=%20%5Cmathbb%7BE%7D%20%5Clog%5Crho(X)"> be relatively large and that <img src="https://latex.codecogs.com/png.latex?%5Clog%5Crho(X)"> be highly concentrated around <img src="https://latex.codecogs.com/png.latex?L">. The first condition is not aimed at in the practice of importance sampling. This difficulty contrasts with the broad claim that “a sample of size approximately <img src="https://latex.codecogs.com/png.latex?e%5E%7BL%7D"> is necessary and sufficient for accurate estimation by importance sampling”. The result is conceptually interesting, but I’m not convinced that a sharp cutoff is common.</p>
<h1>
Example
</h1>
<p>I consider their example 1.4. Here <img src="https://latex.codecogs.com/png.latex?%5Clambda"> is the exponential distribution of mean <img src="https://latex.codecogs.com/png.latex?1">, <img src="https://latex.codecogs.com/png.latex?%5Cmu"> is the exponential distribution of mean 2, <img src="https://latex.codecogs.com/png.latex?%5Crho(x)%20=%20e%5E%7Bx/2%7D/2"> and <img src="https://latex.codecogs.com/png.latex?f(x)%20=%20x">. Thus <img src="https://latex.codecogs.com/png.latex?I(f)%20=%202">. We have <img src="https://latex.codecogs.com/png.latex?L%20=%20%5Cmathbb%7BE%7D%5Clog%5Crho(X)%20=%201-%5Clog(2)%20%5Capproxeq%200.3">, meaning that the theorem yields no useful cutoff. Furthermore, <img src="https://latex.codecogs.com/png.latex?%7B%7D%5Cmathbb%7BP%7D(%5Crho(X)%20%3E%20a)%20=%20%5Ctfrac%7B1%7D%7B2a%7D"> and <img src="https://latex.codecogs.com/png.latex?%5C%7Cf%5C%7C_%7BL%5E2(%5Cmu)%7D%20=%202">. Optimizing the bound given by the theorem yields</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%7CI_n(f)-2%7C%20%5Cle%20%5Cfrac%7B4%5Csqrt%7B2%7D%7D%7B(2n)%5E%7B1/4%7D%7D.">
</p>
<p>The figure below shows <img src="https://latex.codecogs.com/png.latex?100"> trajectories of <img src="https://latex.codecogs.com/png.latex?I_k(f)">. The shaded area bounds the expected error.</p>
<p><a href="fig13.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2017-03-18-comment-on-sample-size-for-importance-sampling/fig13.png" class="preview-image img-fluid"></a></p>
<p>This next figure shows <img src="https://latex.codecogs.com/png.latex?100"> trajectories for the Monte-Carlo estimate of <img src="https://latex.codecogs.com/png.latex?2%20=%20%5Cint%20x%20d%5Cmu">, taking <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cmu"> and <img src="https://latex.codecogs.com/png.latex?%5Crho%20=%201">. Here the theorem yields</p>
<p style="text-align:center;">
<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%7CI_n(f)-2%7C%20%5Cle%20%5Cfrac%7B2%7D%7B%5Csqrt%7Bn%7D%7D.">
</p>
<p><a href="fig23.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://olivierbinette.ca/pages/posts/2017-03-18-comment-on-sample-size-for-importance-sampling/fig23.png" class="img-fluid"></a></p>
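<p>The two experiments above can be approximated with a short simulation. This is a minimal sketch using only Python's standard library, not the code that produced the figures: the function names, the seed, and the sample size are all hypothetical choices made for illustration.</p>

```python
import math
import random


def importance_sampling_estimate(f, rho, sample_lambda, n, rng):
    """Return I_n(f) = (1/n) * sum of f(y_i) * rho(y_i), with y_i drawn from lambda."""
    total = 0.0
    for _ in range(n):
        y = sample_lambda(rng)
        total += f(y) * rho(y)
    return total / n


def f(x):
    return x


def rho(x):
    # Density of mu (exponential, mean 2) with respect to lambda (exponential, mean 1).
    return math.exp(x / 2) / 2


def sample_lambda(rng):
    # random.expovariate takes the rate, so rate 1.0 gives mean 1.
    return rng.expovariate(1.0)


def sample_mu(rng):
    # Rate 0.5 gives mean 2.
    return rng.expovariate(0.5)


rng = random.Random(0)  # arbitrary seed, for reproducibility only
n = 100_000  # arbitrary sample size

# Importance sampling: sample from lambda and reweight by rho.
is_estimate = importance_sampling_estimate(f, rho, sample_lambda, n, rng)

# Plain Monte-Carlo: take lambda = mu, so that rho = 1.
mc_estimate = importance_sampling_estimate(f, lambda x: 1.0, sample_mu, n, rng)

print(f"importance sampling: {is_estimate:.3f}")
print(f"plain Monte-Carlo:   {mc_estimate:.3f}")
```

<p>Both estimates target the same value, 2. The importance sampling weights are heavy-tailed in this example, so its estimate should be expected to fluctuate more from run to run than the plain Monte-Carlo one.</p>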
<p><strong>References.</strong></p>
<p>Chatterjee, S. and Diaconis, P. (2015). The Sample Size Required in Importance Sampling. https://arxiv.org/abs/1511.01437v2</p>



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>math</category>
  <category>statistics</category>
  <guid>https://olivierbinette.ca/pages/posts/2017-03-18-comment-on-sample-size-for-importance-sampling/2017-03-18-comment-on-sample-size-for-importance-sampling.html</guid>
  <pubDate>Mon, 18 Mar 2024 04:00:00 GMT</pubDate>
</item>
<item>
  <title>What is the Reality-Ideality-Gap in Entity Resolution?</title>
  <link>https://olivierbinette.ca/pages/posts/2023-12-12-reality-ideality-gap-entity-resolution/2023-12-12-reality-ideality-gap-entity-resolution.html</link>
  <description><![CDATA[ 





<p><a href="https://www.ijcai.org/proceedings/2022/0552.pdf">Wang et al. (2022)</a> describe the frustration when real-world performance does not match expectations obtained from benchmark datasets. This difference is the “reality-ideality” gap, which is all too common in real-world applications of entity resolution.<br>
<br>
Why does it happen? They posit that three main issues limit the generalizability of current benchmarks, specifically in the context of deep learning approaches to entity resolution:<br>
<br>
1. <strong>There is leakage from the training set into the test set.</strong> In typical benchmark constructions, record pairs are randomly sampled, leading to the same cluster being represented in both the training and test datasets. This biases results, especially in deep learning approaches which rely on learning record embeddings.<br>
<br>
2. <strong>Real-world data is much more imbalanced than benchmark datasets</strong> in terms of matching vs non-matching record pairs. In other words, there is much more opportunity for error in real data than in a benchmark dataset.<br>
<br>
3. Partly as a consequence of the two above issues, <strong>typical benchmarks underestimate the importance of additional features and multimodal information.</strong> This leads to under-specified systems which do not perform as well as they could.<br>
<br>
The paper goes on to define clear tasks for entity resolution systems and detail issues with current benchmarks:<br>
</p>
<blockquote class="blockquote">
<p>“Our findings reveal that previous benchmarks biased the evaluation of the progress of current entity matching approaches, and there is still a long way to go to build effective entity matchers.”</p>
</blockquote>
<embed src="https://www.ijcai.org/proceedings/2022/0552.pdf" width="100%" height="100%">



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>record-linkage</category>
  <guid>https://olivierbinette.ca/pages/posts/2023-12-12-reality-ideality-gap-entity-resolution/2023-12-12-reality-ideality-gap-entity-resolution.html</guid>
  <pubDate>Tue, 12 Dec 2023 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Potential of Privacy-Preserving Record Linkage for the Statistics of Hidden Populations</title>
  <dc:creator>Andrew Demma</dc:creator>
  <dc:creator>Olivier Binette</dc:creator>
  <link>https://olivierbinette.ca/pages/posts/2022-08-07-potential-of-privacy-preserving-record-linkage-for-the-statistics-of-hidden-populations/potential-of-privacy-preserving-record-linkage-for-the-statistics-of-hidden-populations.html</link>
  <description><![CDATA[ 





<embed src="PPRL.pdf" type="application/pdf" width="100%" height="800px">



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>statistics</category>
  <category>machine learning</category>
  <category>entity resolution</category>
  <category>privacy</category>
  <guid>https://olivierbinette.ca/pages/posts/2022-08-07-potential-of-privacy-preserving-record-linkage-for-the-statistics-of-hidden-populations/potential-of-privacy-preserving-record-linkage-for-the-statistics-of-hidden-populations.html</guid>
  <pubDate>Sun, 07 Aug 2022 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Intro to Hyperparameter Optimization for Machine Learning</title>
  <dc:creator>Olivier Binette</dc:creator>
  <link>https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/a-brief-introduction-to-hyperparameter-optimization.html</link>
  <description><![CDATA[ 





<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">1 Introduction</h2>
<p>Machine learning is easy, right? You pick a model, fit it to your data, and out come predictions.</p>
<p><a href="ml.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/ml.svg" class="img-fluid"></a></p>
<p>Not quite. Sometimes we talk about the fancy math and algorithms under the hood to make it look serious, but we rarely talk about how difficult it is to transform whatever data we can gather into useful, actionable predictions that have business value.</p>
<p>There are many challenges. First, there’s the transformation of a business problem into something that’s remotely approachable by machine learning and statistics. Second, there’s the development of a data collection plan or, more often than not, the identification of observational data which is already available. With the collection of this data comes the third step, modeling, which bridges between numbers and useful answers. Modeling may have to account for all kinds of issues with your data, such as class imbalance, missingness, and non-representativeness. You also want to obtain <em>good</em> answers, so throughout this step <strong>you loop between model specification, evaluation, and refinement</strong>. It is a lengthy process of research and investigation into the performance of your model, insights into the <em>why</em> of what you observe, and various fixes and improvements to your model. Finally, in a fourth stage, you must account for how your model will be used and the management of its lifecycle.</p>
<p><a href="workflow.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/workflow.svg" class="img-fluid"></a></p>
<p>Moral of the story: there is a lot of work involved. We need all hands on deck. And even more than that, <strong>we need robust automation tools</strong> to support this machine learning workflow.</p>
<p>This blog post is about a single set of tools – <strong>hyperparameter optimization techniques</strong> – used to help with the model specification, evaluation, and refinement loop. I will focus on the standard machine learning framework of supervised learning. In this context, machine learning algorithms can be seen as black boxes which take in some data, a bunch of tuning <em>hyperparameters</em> specified by the user of the algorithm, and which output predictions. The quality of the predictions can be evaluated through data splitting or cross-validation. That is, we’re always able to compare predictions to ground truth for the data we have at hand.</p>
<p>My goal is to describe key approaches to hyperparameter optimization (see Table 1) in order to provide <strong>conceptual understanding that can be helpful in practice.</strong> I describe <em>black-box</em> methods which treat the machine learning algorithm as, well, a black box. This includes <strong>grid search</strong>, <strong>randomized search</strong>, and sequential model-based optimization such as <strong>Bayesian optimization.</strong> There are additional methods to be considered, such as <strong>Hyperband</strong> and <strong>Bayesian model selection</strong>, which integrate with the learning algorithms themselves. These will be for another blog post.</p>
<table class="caption-top table">
<caption>Table 1: Different types of hyperparameter optimization methods</caption>
<thead>
<tr class="header">
<th style="text-align: left;">Black-box methods</th>
<th style="text-align: left;">Integrated methods</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Grid Search</td>
<td style="text-align: left;">Hyperband</td>
</tr>
<tr class="even">
<td style="text-align: left;">Randomized Search</td>
<td style="text-align: left;">Bayesian Model Selection</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Sequential Model-Based Optimization</td>
<td style="text-align: left;"></td>
</tr>
</tbody>
</table>
<!--**Disclaimer:** my goal in this post is **not** to say that grid search is bad, or that you should be using algorithm X instead of agorithm Y for hyperparameter optimization. The model specification, evaluation and refinement loop is an important part of the machine learning workflow which leads to useful insights into the behavior and performance of your model. It should not be entirely automated. Hyperparameter optimization techniques should be used to gain more insights into your model and to improve your productivity, not as a drop-in replacement for model building. Use whatever technique works the best for you given what you're trying to achieve.
-->
<p>Before getting into the details of these methods though, let’s go over some basic concepts and terminology which I’ll be using.</p>
</section>
<section id="background-and-terminology" class="level2">
<h2 class="anchored" data-anchor-id="background-and-terminology">1.1 Background and Terminology</h2>
<p>First, let’s talk about models, parameters and performance evaluation. This is also an occasion to introduce some terminology and notation.</p>
<section id="models-parameters-and-performance-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="models-parameters-and-performance-evaluation">Models, Parameters and Performance Evaluation</h3>
<p>A <strong>model</strong> is a mathematical representation of something going on in the real world. For instance, suppose you want to predict whether or not a given stock <img src="https://latex.codecogs.com/png.latex?X"> is going to go up tomorrow. A model for this could be: predict it’s going to go up with probability <img src="https://latex.codecogs.com/png.latex?%5Calpha"> if it went up today, otherwise predict it’s not going to go up with probability <img src="https://latex.codecogs.com/png.latex?%5Cbeta">. There’s only one <strong>variable</strong> in this model (whether or not the stock went up today), and there are two <strong>parameters</strong>, the probabilities <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta">. Here the parameters could be learned if we had historical data.</p>
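<p>To make the idea of learning parameters from historical data concrete, here is a minimal sketch that estimates the two probabilities by counting. The function name and the ten-day history are made up for illustration, and the sketch assumes the history contains at least one "up" day and one "not up" day that is followed by another day.</p>

```python
def estimate_transition_probabilities(went_up):
    """Estimate alpha and beta from a sequence of daily up/not-up indicators.

    alpha: fraction of "up" days that were followed by another "up" day.
    beta: fraction of "not up" days that were followed by another "not up" day.
    """
    up_then_up = up_total = 0
    down_then_down = down_total = 0
    # Pair each day with the following day.
    for today, tomorrow in zip(went_up, went_up[1:]):
        if today:
            up_total += 1
            up_then_up += int(tomorrow)
        else:
            down_total += 1
            down_then_down += int(not tomorrow)
    alpha = up_then_up / up_total
    beta = down_then_down / down_total
    return alpha, beta


# Made-up history, for illustration only (True means the stock went up that day).
history = [True, True, False, True, True, True, False, False, True, False]
alpha, beta = estimate_transition_probabilities(history)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```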
<p>You could consider more sophisticated models such as classical time series models or recurrent neural networks. In all cases, you have variables (the input to your model), parameters (what you learn from data), and you end up with predictions.</p>
<p>You can compare the performance of any model by comparing the predictions to what actually happened. For instance, you could look at how often your predictions were right. That’s a performance <strong>evaluation metric</strong>. Your goal is usually to build a model which will keep on performing well.</p>
<p>Formally, let <img src="https://latex.codecogs.com/png.latex?R"> be the (average) future performance of your model. You don’t know this quantity, but you can estimate it as <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> using techniques such as cross-validation and its variants. There might be a bias and a variance to <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">, but the best we can do in practice is to try to find the model with the best estimated performance (modulo certain adjustments).</p>
<p>This brings us to the question: <strong>how should you choose a model?</strong> The standard in machine learning is to choose a model which maximizes <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">. It’s not the only solution, and it’s not always the best solution (it can be better to do model averaging if <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> has some variance), but it’s what we’ll focus on throughout this blog post.</p>
<p>Furthermore, we’ll approach this problem through the lens of hyperparameter selection.</p>
</section>
<section id="hyperparameters" class="level3">
<h3 class="anchored" data-anchor-id="hyperparameters">Hyperparameters</h3>
<p>Hyperparameters are things that you have to specify before you can run a model, such as:</p>
<ul>
<li>what data features to use,</li>
<li>what type of model to use (linear model? random forest? neural network?),</li>
<li>other decisions that go into the specification of a model:
<ul>
<li>the number of layers in your neural network,</li>
<li>the learning rate for the gradient descent algorithm,</li>
<li>the maximum depth for decision trees, etc.</li>
</ul></li>
</ul>
<p>There is only a practical distinction between parameters and hyperparameters. Hyperparameters are things that are usually set separately from the other model parameters, or that do not nicely fit within a model’s learning algorithm. Depending on the framework you’re using, parameters can become hyperparameters and vice versa. For example, by using ensemble methods, you could easily transform the “model type choice” hyperparameter to a simple parameter of your ensemble that is learned from data.</p>
<p>The key thing is that, in practice, there will typically be some distinction between parameters of your model and a set of hyperparameters that you have to specify.</p>
<p>Through experience, you can learn what hyperparameters work well for the kinds of problems that you work on. Other times, you might carefully tune hyperparameters and investigate the impact of your choices on model performance.</p>
<p>The manual process of hyperparameter tuning can lead to important insights into the performance and behavior of your model. However, it can also be a menial task that would be better automated through hyperparameter optimization algorithms aiming to maximize <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">, such as those that I review below.</p>
</section>
<section id="example" class="level3">
<h3 class="anchored" data-anchor-id="example">Example</h3>
<p>Let’s look at an example to make things concrete. This is adapted from <a href="https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html">scikit-optimize’s tutorial for tuning scikit-learn estimators</a>.</p>
<p>We’ll consider the <a href="https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset">California housing dataset</a> from the scikit-learn library. Each row in this dataset represents a census block and contains aggregated information regarding houses in that block. Our goal will be to predict median house price at the block level given these other covariates.</p>
<div id="30e089e8" class="cell" data-execution_count="1">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> fetch_california_housing</span>
<span id="cb1-4"></span>
<span id="cb1-5">dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fetch_california_housing(as_frame<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-6"></span>
<span id="cb1-7">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dataset.data <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Covariates</span></span>
<span id="cb1-8">n_features <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Number of features</span></span>
<span id="cb1-9">y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dataset.target <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Median house prices</span></span>
<span id="cb1-10"></span>
<span id="cb1-11">X</span></code></pre></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="1">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-quarto-postprocess="true" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">MedInc</th>
<th data-quarto-table-cell-role="th">HouseAge</th>
<th data-quarto-table-cell-role="th">AveRooms</th>
<th data-quarto-table-cell-role="th">AveBedrms</th>
<th data-quarto-table-cell-role="th">Population</th>
<th data-quarto-table-cell-role="th">AveOccup</th>
<th data-quarto-table-cell-role="th">Latitude</th>
<th data-quarto-table-cell-role="th">Longitude</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td data-quarto-table-cell-role="th">0</td>
<td>8.3252</td>
<td>41.0</td>
<td>6.984127</td>
<td>1.023810</td>
<td>322.0</td>
<td>2.555556</td>
<td>37.88</td>
<td>-122.23</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">1</td>
<td>8.3014</td>
<td>21.0</td>
<td>6.238137</td>
<td>0.971880</td>
<td>2401.0</td>
<td>2.109842</td>
<td>37.86</td>
<td>-122.22</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">2</td>
<td>7.2574</td>
<td>52.0</td>
<td>8.288136</td>
<td>1.073446</td>
<td>496.0</td>
<td>2.802260</td>
<td>37.85</td>
<td>-122.24</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">3</td>
<td>5.6431</td>
<td>52.0</td>
<td>5.817352</td>
<td>1.073059</td>
<td>558.0</td>
<td>2.547945</td>
<td>37.85</td>
<td>-122.25</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">4</td>
<td>3.8462</td>
<td>52.0</td>
<td>6.281853</td>
<td>1.081081</td>
<td>565.0</td>
<td>2.181467</td>
<td>37.85</td>
<td>-122.25</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">20635</td>
<td>1.5603</td>
<td>25.0</td>
<td>5.045455</td>
<td>1.133333</td>
<td>845.0</td>
<td>2.560606</td>
<td>39.48</td>
<td>-121.09</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">20636</td>
<td>2.5568</td>
<td>18.0</td>
<td>6.114035</td>
<td>1.315789</td>
<td>356.0</td>
<td>3.122807</td>
<td>39.49</td>
<td>-121.21</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">20637</td>
<td>1.7000</td>
<td>17.0</td>
<td>5.205543</td>
<td>1.120092</td>
<td>1007.0</td>
<td>2.325635</td>
<td>39.43</td>
<td>-121.22</td>
</tr>
<tr class="even">
<td data-quarto-table-cell-role="th">20638</td>
<td>1.8672</td>
<td>18.0</td>
<td>5.329513</td>
<td>1.171920</td>
<td>741.0</td>
<td>2.123209</td>
<td>39.43</td>
<td>-121.32</td>
</tr>
<tr class="odd">
<td data-quarto-table-cell-role="th">20639</td>
<td>2.3886</td>
<td>16.0</td>
<td>5.254717</td>
<td>1.162264</td>
<td>1387.0</td>
<td>2.616981</td>
<td>39.37</td>
<td>-121.24</td>
</tr>
</tbody>
</table>

<p>20640 rows × 8 columns</p>
</div>
</div>
</div>
<p>For the regression, we’ll use scikit-learn’s gradient boosted trees estimator. This model has a number of internal parameters which we don’t need to know much about, as well as hyperparameters which can be used to tune the model. These include the <code>max_depth</code> hyperparameter for the maximum depth of decision trees, <code>learning_rate</code> for the learning rate of gradient boosting, <code>max_features</code> for the maximum number of features considered at each split of a decision tree, and a few more. Ranges of reasonable values for these hyperparameters are specified in the <code>space</code> variable below.</p>
<div id="2857b2b0" class="cell" data-execution_count="2">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GradientBoostingRegressor</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skopt.space <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Real, Integer</span>
<span id="cb2-3"></span>
<span id="cb2-4">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GradientBoostingRegressor(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-5"></span>
<span id="cb2-6">space  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [Integer(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_depth'</span>),</span>
<span id="cb2-7">          Real(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"log-uniform"</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'learning_rate'</span>),</span>
<span id="cb2-8">          Integer(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n_features, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_features'</span>),</span>
<span id="cb2-9">          Integer(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'min_samples_leaf'</span>)</span>
<span id="cb2-10">]</span></code></pre></div>
</details>
</div>
<p>Now, the last thing we need is an estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> for the model’s performance. This is our <code>Rhat()</code> function (i.e.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Chat%20R">) which we’ll try to optimize. Here we use a cross-validated mean absolute error, so lower values are better and optimizing means minimizing.</p>
<div id="ea0d6a3f" class="cell" data-execution_count="3">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cross_val_score</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> Rhat(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>params):</span>
<span id="cb3-4">  model.set_params(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>params)</span>
<span id="cb3-5">  </span>
<span id="cb3-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>np.mean(cross_val_score(model, X, y, cv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb3-7">                                  scoring<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neg_mean_absolute_error"</span>))</span></code></pre></div>
</details>
</div>
<p>With this, we can fit the model to the data (using default hyperparameter values to begin with), and evaluate the model’s performance.</p>
<div id="d9e64557" class="cell" data-execution_count="4">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1">model.fit(X, y)</span>
<span id="cb4-2"></span>
<span id="cb4-3">Rhat()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>np.float64(0.5503720160635011)</code></pre>
</div>
</div>
<p>Here the unit for median house price was in hundreds of thousands of dollars and we can interpret the model performance at this scale. The value <img src="https://latex.codecogs.com/png.latex?%5Chat%20R%20%5Capprox%200.55"> means that, on average, the absolute error of the model is $55,000. We’ll see if we can do better using hyperparameter optimization.</p>
</section>
</section>
<section id="black-box-optimization-methods" class="level2">
<h2 class="anchored" data-anchor-id="black-box-optimization-methods">2 Black-Box Optimization Methods</h2>
<p>Black-box hyperparameter optimization algorithms consider the underlying machine learning algorithm as unknown. We only assume that, given a set of hyperparameters <img src="https://latex.codecogs.com/png.latex?%5Clambda">, we can compute the estimated model performance <img src="https://latex.codecogs.com/png.latex?%5Chat%20R(%5Clambda)">. There is usually variance in <img src="https://latex.codecogs.com/png.latex?%5Chat%20R(%5Clambda)">, but this is not something that I will talk about in this post. We will therefore consider <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> as a deterministic function to be optimized.</p>
<p>Note: in practice, <strong>you need to account for the variance in <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"></strong>, as otherwise you could get bad surprises. It’s just not something I’m covering in this post, since I want to focus on a conceptual understanding of the optimization algorithms.</p>
<p>We can use almost any technique to try to optimize <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">, but there are a number of challenges with hyperparameter optimization:</p>
<ol type="1">
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> is usually rather costly to evaluate.</li>
<li>We usually do not have gradient information regarding <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> (otherwise, hyperparameters for which we have gradient information could easily be incorporated as parameters of the underlying ML algorithms).</li>
<li>The hyperparameter space is usually complex. It can contain discrete variables and can even be tree-structured, where some hyperparameters are only defined conditionally on other hyperparameters.</li>
<li>The hyperparameter space is usually somewhat high-dimensional, with more than just 2-3 dimensions.</li>
</ol>
<p>These particularities of the hyperparameter optimization problem have led the machine learning community to favor some of the optimization techniques which I discuss below.</p>
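<p>To illustrate the third challenge, here is a small sketch of sampling from a tree-structured space. The space itself (two model types, each with its own hyperparameters and ranges) is hypothetical and chosen only for illustration. The point is that which hyperparameters exist at all depends on an earlier choice.</p>

```python
import random

def sample_configuration(rng):
    """Draw one configuration from a small tree-structured space (hypothetical)."""
    model_type = rng.choice(["random_forest", "neural_network"])
    if model_type == "random_forest":
        # max_depth only exists on the random-forest branch
        return {"model_type": model_type, "max_depth": rng.randint(1, 8)}
    # n_layers and learning_rate only exist on the neural-network branch
    return {
        "model_type": model_type,
        "n_layers": rng.randint(1, 4),
        "learning_rate": 10 ** rng.uniform(-4, -1),
    }

rng = random.Random(0)
configurations = [sample_configuration(rng) for _ in range(5)]
```

<p>A plain rectangular grid cannot represent such a space directly, which is one reason dedicated hyperparameter optimization tools exist.</p>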
<section id="grid-search" class="level3">
<h3 class="anchored" data-anchor-id="grid-search">2.1 Grid Search</h3>
<p>The first technique to consider is <strong>grid search</strong>, which is a brute force approach to hyperparameter optimization. It is the simplest of all: you specify values to consider for each hyperparameter, and then evaluate your model performance for each combination of hyperparameter values. At the end, you keep the hyperparameter configuration which performed best.</p>
<p>There are a few advantages to this approach:</p>
<ul>
<li>It gives you precise control over what hyperparameter configurations are evaluated.</li>
<li>It is simple to implement and easily parallelizable.</li>
</ul>
<p>However, there are also a number of serious drawbacks:</p>
<ol type="1">
<li>The runtime scales exponentially in the number of hyperparameter dimensions.</li>
<li>The runtime is tied to the hyperparameter search space which you specify. To reduce runtime, you need to manually redefine this space.</li>
</ol>
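<p>Before turning to scikit-learn, here is a bare-bones sketch of the idea in plain Python. The <code>grid_search</code> helper and the <code>toy_error</code> objective are made up for illustration (a cheap stand-in for an expensive error estimate, where lower is better). The number of evaluations is the product of the number of values per hyperparameter, which is where the exponential scaling comes from.</p>

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate `objective` on every combination in `grid`; return the best one."""
    names = list(grid)
    # One candidate per element of the Cartesian product of the value lists
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[name] for name in names))]
    errors = [objective(**params) for params in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return candidates[best], errors[best], len(candidates)

# Hypothetical stand-in for an expensive error estimate (lower is better)
def toy_error(max_depth, learning_rate):
    return (max_depth - 3) ** 2 + (learning_rate - 0.1) ** 2

grid = {"max_depth": [1, 3, 5], "learning_rate": [0.01, 0.1, 1]}
best_params, best_value, n_evaluations = grid_search(toy_error, grid)
```

<p>With two hyperparameters of three values each, this makes 9 evaluations. Adding a third hyperparameter with three values would bring it to 27, and so on.</p>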
<p>Let’s see an example of how this works in practice. First, we define a grid of hyperparameter values to evaluate. Given the scoring function <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">, we can then use scikit-learn’s <code>GridSearchCV()</code> function to evaluate the model performance at each hyperparameter combination. This is done below:</p>
<div id="973db1cf" class="cell" data-execution_count="5">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GridSearchCV</span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Budget of 54 evaluations</span></span>
<span id="cb6-4">grid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb6-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_depth'</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>],</span>
<span id="cb6-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'learning_rate'</span>: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb6-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_features'</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>],</span>
<span id="cb6-8">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'min_samples_leaf'</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>]</span>
<span id="cb6-9">}</span>
<span id="cb6-10"></span>
<span id="cb6-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> scoring(estimator, X_test, y_test):</span>
<span id="cb6-12">  y_pred <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> estimator.predict(X_test)</span>
<span id="cb6-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>np.mean(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> y_pred))</span>
<span id="cb6-14"></span>
<span id="cb6-15">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GridSearchCV(model, grid, cv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, scoring<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>scoring).fit(X, y)</span></code></pre></div>
</details>
</div>
<p>We can then recover the best score and best hyperparameters. The best model is slightly better than the default model we looked at earlier, with a $4,000 decrease in average absolute error.</p>
<div id="04996684" class="cell" data-execution_count="6">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>results.best_score_ <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Lowest cross-validated mean absolute error</span></span>
<span id="cb7-2"></span>
<span id="cb7-3">{key:results.best_params_[key] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> key <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> grid.keys()} <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Best parameters</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>{'max_depth': 5,
 'learning_rate': 0.1,
 'max_features': 8,
 'min_samples_leaf': 1}</code></pre>
</div>
</div>
<p>It is also informative to plot a histogram for the distribution of model scores. We can see that most model configurations performed much worse than the default.</p>
<div id="bdf987a9" class="cell" data-execution_count="7">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb9-2"></span>
<span id="cb9-3">plt.clf()</span>
<span id="cb9-4">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.hist(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>results.cv_results_[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean_test_score"</span>], bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb9-5">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Score distribution for evaluated hyperparameters"</span>, loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"left"</span>)</span>
<span id="cb9-6">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cross-validated average absolute error"</span>)</span>
<span id="cb9-7">plt.show()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-8-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-8-output-1.png" width="558" height="449" class="figure-img"></a></p>
</figure>
</div>
</div>
</div>
</section>
<section id="random-search" class="level3">
<h3 class="anchored" data-anchor-id="random-search">2.2 Random Search</h3>
<p>The second method we’ll look at is <strong>random search.</strong> Here, the idea is to sample a number <img src="https://latex.codecogs.com/png.latex?k"> of hyperparameter configurations at random from a given space, and to evaluate those random configurations.</p>
<p>This might seem like a silly idea. Why pick hyperparameter values at random?</p>
<p>The answer is that doing so <strong>removes all computational penalties</strong> from the consideration of useless hyperparameter dimensions. That is, imagine that a number <img src="https://latex.codecogs.com/png.latex?s"> of your hyperparameters actually have no impact on model performance. With grid search, considering these hyperparameters would incur a computational penalty which is exponential in <img src="https://latex.codecogs.com/png.latex?s">. With random search, however, there is <strong>no penalty at all</strong> for adding these <img src="https://latex.codecogs.com/png.latex?s"> additional hyperparameter dimensions. The results from random search are <strong>exactly the same</strong> with or without these additional dimensions.</p>
<p>This is the huge advantage of random search over grid search: you do not get penalized for useless dimensions. Furthermore, in practice, being able to tune the search effort through the number of samples <img src="https://latex.codecogs.com/png.latex?k"> can be quite convenient.</p>
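<p>A small sketch can illustrate this point. It assumes a made-up objective <code>toy_error</code> in which one hyperparameter matters and the other has no effect at all. With the same budget of 9 evaluations, a 3 × 3 grid only ever tries 3 distinct values of the important hyperparameter, while random search tries 9.</p>

```python
import numpy as np

def toy_error(important, useless):
    # Hypothetical objective: the second hyperparameter has no effect at all
    return (important - 0.3) ** 2

rng = np.random.default_rng(0)
k = 9  # evaluation budget

# Grid search: 3 x 3 grid, so only 3 distinct values of `important` are tried
grid_values = [0.0, 0.5, 1.0]
grid_candidates = [(a, b) for a in grid_values for b in grid_values]

# Random search: k independent draws, so k distinct values of `important`
random_candidates = rng.uniform(0, 1, size=(k, 2))

n_distinct_grid = len({a for a, _ in grid_candidates})
n_distinct_random = len({a for a, _ in random_candidates})

best_grid = min(toy_error(a, b) for a, b in grid_candidates)
best_random = min(toy_error(a, b) for a, b in random_candidates)
```

<p>Two thirds of the grid’s budget is spent re-evaluating values of the important hyperparameter that were already tried, whereas every random draw explores a new one.</p>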
<p>Let’s see how this can be implemented in practice. We’ll define a hyperparameter space which is similar to the grid space we specified earlier, but which is filled in with additional possible values. We can then run scikit-learn’s <code>RandomizedSearchCV()</code> function to do the randomized search:</p>
<div id="0d914dad" class="cell" data-execution_count="8">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomizedSearchCV</span>
<span id="cb10-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.stats <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> loguniform</span>
<span id="cb10-3"></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Around roughly the same values as for the grid search</span></span>
<span id="cb10-5">param_distribution <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb10-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_depth'</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>),</span>
<span id="cb10-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'learning_rate'</span>: loguniform(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb10-8">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'max_features'</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>),</span>
<span id="cb10-9">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'min_samples_leaf'</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb10-10">}</span>
<span id="cb10-11"></span>
<span id="cb10-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Budget of 54 evaluations</span></span>
<span id="cb10-13">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomizedSearchCV(model, param_distribution, n_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">54</span>, cv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, scoring<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>scoring, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).fit(X, y)</span></code></pre></div>
</details>
</div>
<p>The results are below. Because randomized search considers a richer hyperparameter space without being penalized for it the way a grid search would be, it allows us to find a better model with the same evaluation budget.</p>
<div id="78b2454b" class="cell" data-execution_count="9">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>results.best_score_ <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Lowest cross-validated mean absolute error</span></span>
<span id="cb11-2"></span>
<span id="cb11-3">{key:results.best_params_[key] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> key <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> param_distribution.keys()} <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Best parameters</span></span></code></pre></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>{'max_depth': 6,
 'learning_rate': np.float64(0.1453937524243155),
 'max_features': 7,
 'min_samples_leaf': 26}</code></pre>
</div>
</div>
<p>Again, we can look at the distribution of model performance for sampled hyperparameter configurations. It’s quite similar to grid search, with only a few better-performing models being identified.</p>
<div id="8bb9cbaf" class="cell" data-execution_count="10">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb13-2"></span>
<span id="cb13-3">plt.clf()</span>
<span id="cb13-4">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.hist(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>results.cv_results_[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean_test_score"</span>], bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb13-5">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Score distribution for evaluated hyperparameters"</span>, loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"left"</span>)</span>
<span id="cb13-6">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cross-validated average absolute error"</span>)</span>
<span id="cb13-7">plt.show()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-11-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-11-output-1.png" width="572" height="449" class="figure-img"></a></p>
</figure>
</div>
</div>
</div>
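<p>To make the comparison with grid search more concrete, here is a minimal standard-library sketch. Only the budget of 54 evaluations and the sampling ranges come from the code above; the 3 x 3 x 3 x 2 grid values are hypothetical, and <code>sample_configuration</code> and <code>distinct_values</code> are illustrative helpers rather than scikit-learn functions. Sampling log-uniformly amounts to sampling uniformly on the log scale and exponentiating.</p>

```python
import itertools
import math
import random

rng = random.Random(0)
BUDGET = 54


def sample_configuration():
    """Draw one configuration from the same ranges as param_distribution."""
    return {
        "max_depth": rng.randrange(1, 8),
        # Log-uniform on [0.01, 1]: uniform on the log scale, then exponentiate.
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(1.0))),
        "max_features": rng.randrange(1, 9),
        "min_samples_leaf": rng.randrange(1, 50),
    }


random_configs = [sample_configuration() for _ in range(BUDGET)]

# Hypothetical grid with the same budget: 3 * 3 * 3 * 2 = 54 configurations.
grid_axes = {
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.1, 1.0],
    "max_features": [2, 4, 8],
    "min_samples_leaf": [1, 25],
}
grid_configs = [dict(zip(grid_axes, values))
                for values in itertools.product(*grid_axes.values())]


def distinct_values(configs, name):
    """Number of distinct values tried for one hyperparameter."""
    return len({config[name] for config in configs})


# Compare how many distinct values each strategy tries per hyperparameter.
for name in grid_axes:
    print(name, distinct_values(grid_configs, name),
          distinct_values(random_configs, name))
```

<p>With the same budget, the grid only ever tries two or three values along each dimension, whereas random sampling tries many more values of each hyperparameter. This is why random search loses little when some dimensions turn out not to matter.</p>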
</section>
<section id="sequential-model-based-optimization" class="level3">
<h3 class="anchored" data-anchor-id="sequential-model-based-optimization">2.3 Sequential Model-Based Optimization</h3>
<p>All of the techniques considered so far made no assumption at all about the function <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> to optimize.</p>
<p>This is a problem, because we do have prior information about <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">. We can expect <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> to have some level of regularity, meaning that similar hyperparameter configurations should have similar performance. This knowledge allows us to make inference about <img src="https://latex.codecogs.com/png.latex?%5Chat%20R(%5Clambda)"> given the evaluation of <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> at other points <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Clambda%20%5Cnot%20=%20%5Clambda">.</p>
<p>More formally, suppose we have evaluated <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> at a sequence of hyperparameter configurations <img src="https://latex.codecogs.com/png.latex?%5Clambda_1,%20%5Clambda_2,%20%5Cdots,%20%5Clambda_n">, thus observing <img src="https://latex.codecogs.com/png.latex?%5Chat%20R(%5Clambda_1),%20%5Chat%20R(%5Clambda_2),%20%5Cdots,%20%5Chat%20R(%5Clambda_n)">. This allows us to make inference about <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">. In particular, we can try guessing what next <img src="https://latex.codecogs.com/png.latex?%5Clambda_%7Bn+1%7D"> will maximize <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> or improve our knowledge of <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">. Once we’ve observed <img src="https://latex.codecogs.com/png.latex?%5Chat%20R(%5Clambda_%7Bn+1%7D)">, we repeat the process, trying to guess which <img src="https://latex.codecogs.com/png.latex?%5Clambda_%7Bn+2%7D"> to pick to improve the procedure. That is the entire idea behind <strong>sequential model-based optimization</strong>.</p>
<p>To make this work in practice, we need the following ingredients:</p>
<ol type="1">
<li>An inferential model for <img src="https://latex.codecogs.com/png.latex?%5Chat%20R">. That could be a Bayesian nonparametric model, like a Gaussian process, or something else, like a Tree-structured Parzen Estimator.</li>
<li>A method to guess the next best hyperparameter value to pick. Typically, <img src="https://latex.codecogs.com/png.latex?%5Clambda_%7Bn+1%7D"> is chosen to maximize the <strong>expected improvement criterion</strong>. This chooses <img src="https://latex.codecogs.com/png.latex?%5Clambda"> to maximize the expected value of <img src="https://latex.codecogs.com/png.latex?%5Cmax%5C%7B%5Chat%20R(%5Clambda)%20-%20R%5E*,%200%5C%7D">, where <img src="https://latex.codecogs.com/png.latex?R%5E*"> is the current observed performance maximum. In other words, we want to maximize the potential for improving the current optimum, without penalizing for the possibility of observing a lower performance. This allows us to optimize <img src="https://latex.codecogs.com/png.latex?%5Chat%20R"> while still exploring the hyperparameter space. See <a href="https://www.cse.wustl.edu/~garnett/cse515t/spring_2015/files/lecture_notes/12.pdf">these lecture notes</a> for a review of a few other selection criteria.</li>
</ol>
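<p>As a rough illustration of the expected improvement criterion, suppose the inferential model gives a Gaussian posterior for <img src="https://latex.codecogs.com/png.latex?%5Chat%20R(%5Clambda)"> at a candidate <img src="https://latex.codecogs.com/png.latex?%5Clambda">, as a Gaussian process would. The expected value of <img src="https://latex.codecogs.com/png.latex?%5Cmax%5C%7B%5Chat%20R(%5Clambda)%20-%20R%5E*,%200%5C%7D"> then has a closed form in terms of the posterior mean and standard deviation. The sketch below is a minimal standard-library version of that formula; the function names and the three candidate posteriors are made up for the example and are not part of any library.</p>

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """E[max(R - best, 0)] when R is Gaussian with mean mu and std sigma."""
    if sigma == 0.0:
        # No posterior uncertainty: the improvement is known exactly.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical posterior summaries (mean, standard deviation) at three
# candidates, with a current best observed performance of 0.80.
candidates = {"A": (0.82, 0.01), "B": (0.78, 0.10), "C": (0.70, 0.02)}
scores = {name: expected_improvement(mu, sigma, best=0.80)
          for name, (mu, sigma) in candidates.items()}
next_candidate = max(scores, key=scores.get)
```

<p>In this made-up example, candidate B has a lower posterior mean than candidate A but much more uncertainty, and it ends up with the larger expected improvement. This is the sense in which the criterion keeps exploring the hyperparameter space while optimizing.</p>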
<p>When a Bayesian inferential framework is chosen, sequential model-based optimization is called <strong>Bayesian optimization</strong> or <strong>Bayesian search</strong>. It is beyond the scope of this blog post to go into the details of Gaussian processes, but below I show how the scikit-optimize library can be used to perform Bayesian optimization based on Gaussian processes and the expected improvement criterion:</p>
<div id="a3aeca5a" class="cell" data-execution_count="11">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skopt <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> gp_minimize</span>
<span id="cb14-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skopt.utils <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> use_named_args</span>
<span id="cb14-3"></span>
<span id="cb14-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@use_named_args</span>(space)</span>
<span id="cb14-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> objective(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>params):</span>
<span id="cb14-6">  model.set_params(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>params)</span>
<span id="cb14-7">  </span>
<span id="cb14-8">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>np.mean(cross_val_score(model, X, y, cv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb14-9">                                  scoring<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neg_mean_absolute_error"</span>))</span>
<span id="cb14-10"></span>
<span id="cb14-11">res_gp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gp_minimize(objective, space, n_calls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">54</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb14-12"></span>
<span id="cb14-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## 0.46</span></span></code></pre></div>
</details>
</div>
<p>With Bayesian optimization, we see that much more time is spent sampling performant models.</p>
<div id="7b1eb831" class="cell" data-execution_count="12">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb15-1">plt.clf()</span>
<span id="cb15-2">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.hist(res_gp.func_vals, bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>)</span>
<span id="cb15-3">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Score distribution for evaluated hyperparameters"</span>, loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"left"</span>)</span>
<span id="cb15-4">p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cross-validated average absolute error"</span>)</span>
<span id="cb15-5">plt.show()</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-13-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-13-output-1.png" width="579" height="449" class="figure-img"></a></p>
</figure>
</div>
</div>
</div>
<p>Furthermore, we can see that the algorithm quickly converges towards performant models.</p>
<div id="518c0157" class="cell" data-execution_count="13">
<details open="" class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skopt.plots <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_convergence</span>
<span id="cb16-2"></span>
<span id="cb16-3">plot_convergence(res_gp)</span></code></pre></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-14-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/a-brief-introduction-to-hyperparameter-optimization_files/figure-html/cell-14-output-1.png" width="598" height="449" class="figure-img"></a></p>
</figure>
</div>
</div>
</div>
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">3 Summary</h2>
<p>This blog post provided a basic overview of hyperparameter optimization and of what can be gained from these techniques. We reviewed grid search, the simplest brute-force approach. We reviewed random search, which improves upon grid search when some hyperparameter dimensions are not influential. Finally, we reviewed sequential model-based optimization, which much more effectively samples models with good performance.</p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>statistics</category>
  <category>machine learning</category>
  <guid>https://olivierbinette.ca/pages/posts/2022-01-29-a-brief-introduction-to-hyperparameter-optimization/a-brief-introduction-to-hyperparameter-optimization.html</guid>
  <pubDate>Sat, 29 Jan 2022 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Record Linkage at the Duke GPSG Community Pantry</title>
  <link>https://olivierbinette.ca/pages/posts/2022-01-01-record-linkage-at-the-gpsg-community-pantry/record-linkage-at-the-gpsg-community-pantry.html</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pantry.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Figure from https://gpsg.duke.edu/resources-for-students/community-pantry/"><img src="https://olivierbinette.ca/pages/posts/2022-01-01-record-linkage-at-the-gpsg-community-pantry/pantry.jpg" class="external img-fluid figure-img" alt="Figure from https://gpsg.duke.edu/resources-for-students/community-pantry/"></a></p>
<figcaption>Figure from <a href="https://gpsg.duke.edu/resources-for-students/community-pantry/" class="uri">https://gpsg.duke.edu/resources-for-students/community-pantry/</a></figcaption>
</figure>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Duke’s Graduate and Professional Student Government (GPSG) has been operating a community food pantry for about five years. The pantry provides nonperishable food and basic need items to graduate and professional students on campus. There is a weekly bag program, where students order customized bags of food to be picked up on Saturdays, as well as an in-person shopping program open on Thursdays and Saturdays.</p>
<figcaption align="center">
<p>Figure 1: Weekly number of customers at the Pantry. The black line is a moving average of weekly visits.</p>
</figcaption>
<p><a href="customers.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://olivierbinette.ca/pages/posts/2022-01-01-record-linkage-at-the-gpsg-community-pantry/customers.png" class="preview-image img-fluid"></a></p>
<p>The weekly bag program, which began in May 2018 and is still the most popular pantry offering, provides quite a bit of data regarding pantry customers and their habits. Some customers have ordered more than 80 times in the past 2 years, while others only ordered once or twice. For every bag order, we have the customer’s first name and last initial, an email address (which became mandatory around mid-2018), a phone number in a few cases, an address in some cases (for delivery), demographic information in some cases, and the food order information. Available quasi-identifying information is shown in Table 1 below.</p>
<table class="caption-top table">
<caption>Table 1: Quasi-identifying information provided on Qualtrics bag order forms. Note that phone number and address were only required while delivery was offered. Furthermore, most customers stop answering demographic questions after a few orders.</caption>
<colgroup>
<col style="width: 31%">
<col style="width: 22%">
<col style="width: 22%">
<col style="width: 22%">
</colgroup>
<thead>
<tr class="header">
<th>Question no.</th>
<th>Question</th>
<th>Answer form</th>
<th>Mandatory?</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>-</td>
<td>IP address</td>
<td>-</td>
<td>Yes</td>
</tr>
<tr class="even">
<td>2</td>
<td>First name and last initial</td>
<td>Free form</td>
<td>Yes</td>
</tr>
<tr class="odd">
<td>3</td>
<td>Duke email</td>
<td>Free form</td>
<td>Yes</td>
</tr>
<tr class="even">
<td>4</td>
<td>Phone number</td>
<td>Free form</td>
<td>No</td>
</tr>
<tr class="odd">
<td>6</td>
<td>Address</td>
<td>Free form</td>
<td>No</td>
</tr>
<tr class="even">
<td>8</td>
<td>Food allergies</td>
<td>Free form</td>
<td>No</td>
</tr>
<tr class="odd">
<td>9</td>
<td>Number of members in household</td>
<td>1-2 or 3+</td>
<td>Yes</td>
</tr>
<tr class="even">
<td>10</td>
<td>Want baby bag?</td>
<td>Yes or no</td>
<td>Yes</td>
</tr>
<tr class="odd">
<td>30</td>
<td>Degree</td>
<td>Multiple choices or Other</td>
<td>No</td>
</tr>
<tr class="even">
<td>31</td>
<td>School</td>
<td>Multiple choices or Other</td>
<td>No</td>
</tr>
<tr class="odd">
<td>32</td>
<td>Year in graduate school</td>
<td>Multiple choices</td>
<td>No</td>
</tr>
<tr class="even">
<td>33</td>
<td>Number of adults in household</td>
<td>Multiple choices</td>
<td>No</td>
</tr>
<tr class="odd">
<td>34</td>
<td>Number of children in household</td>
<td>Multiple choices</td>
<td>No</td>
</tr>
</tbody>
</table>
<p>Gaining the most insight from this data requires linking order records from the same customer. Identifying individual customers and associating them with an order history allows us to investigate shopping recurrence patterns and identify potential issues with the pantry’s offering. For instance, we can know who stopped ordering from the pantry after the home delivery program ended. These are people who, most likely, do not have a car to get to the pantry but might benefit from new programs, such as a ride-share program or a gift card program.</p>
<p>This blog post describes the way in which records are linked at the Community Pantry. As we will see, the record linkage problem is not particularly difficult. It is not trivial either, however, and it does require care to ensure that it runs reliably and efficiently, and that it is intelligible and properly validated. This post goes into detail on these two aspects of the problem.</p>
<p>Regarding efficiency and reliability of the software system, I describe the development of a Python module, called <a href="https://github.com/olivierbinette/groupbyrule"><strong>GroupByRule</strong></a>, for record linkage at the pantry. This Python module is maintainable, documented and tested, ensuring reliability of the system and the potential for its continued use throughout the years, even as technical volunteers change at the pantry. Regarding validation of the record linkage system, I describe simple steps that can be taken to evaluate model performance.</p>
<p>Before jumping into the technical part, let’s take a step back to discuss the issue of food insecurity on campus.</p>
<section id="food-insecurity-on-campus" class="level3">
<h3 class="anchored" data-anchor-id="food-insecurity-on-campus">Food Insecurity on Campus</h3>
<p>It is often surprising to people that some Duke students might struggle to access food. After all, Duke is one of the richest campuses in the US with its <a href="https://www.dukechronicle.com/article/2021/10/duke-university-endowment-gain-56-percent-dumac-how-used-financial-aid-faculty-pay-who-manages">$12 billion endowment</a>, high tuition and substantial research grants. Prior to the COVID-19 pandemic, this wealth could be seen on campus and benefited many. Every weekday, there were several conferences and events with free food. Many other graduate students and I would participate in these events, earning 3-4 free lunches every week. Free food on campus is now a thing of the past, for the most part.</p>
<p>However, free lunch or not, it’s important to realize the many financial challenges that students can face. International students on F-1 and J-1 visas have limited employment opportunities in the US. Many graduate students are married, have children or have other dependents who may not be eligible to work in the US either. Even if they are lucky enough to be paid a 9- or 12-month stipend, this stipend doesn’t go very far. For other students, going to Duke means living on a mixture of loans, financial aid, financial support from parents, and side jobs. Any imbalance in this rigid system can leave students having to compromise between their education and their health.</p>
<p>A 2019 study from the World Food Policy Center reported that about 19% of graduate and professional students at Duke experienced food insecurity in the past year. This means they were unable to afford a balanced and sufficient diet, they were afraid of not having enough money for food, or they skipped meals and went hungry due to lack of money. The GPSG Community Pantry has been leading efforts to expand food insecurity monitoring on campus – we are hoping to have more data in 2022 and in following years.</p>
</section>
</section>
<section id="the-record-linkage-approach" class="level2">
<h2 class="anchored" data-anchor-id="the-record-linkage-approach">The Record Linkage Approach</h2>
<p>The bag order form contains email addresses, which are highly reliable for linkage. If two records have the same email, we know for certain that they are from the same customer. However, customers do not always enter the same email address when submitting orders. Despite the request to use a Duke email address, some customers use personal emails. Furthermore, Duke email addresses have two forms. For instance, my Duke email is both <code>ob37@duke.edu</code> and <code>olivier.binette@duke.edu</code>. Emails are therefore not sufficient for linkage. Phone numbers can be used as well, but these are only available for the period when home delivery was offered.</p>
<p>First name and last initial can be used to supplement emails and phone numbers. Again, agreement on first name and last initial provides strong evidence for a match. On the other hand, people do not always enter their names in the same way.</p>
<p>Combining the use of emails, phone numbers, and names, we may therefore link records which agree on any one of these attributes. This is a simple deterministic record linkage approach which should be reliable enough for the data analysis use of the pantry.</p>
<section id="deterministic-record-linkage-rule" class="level3">
<h3 class="anchored" data-anchor-id="deterministic-record-linkage-rule">Deterministic Record Linkage Rule</h3>
<p>To be more precise, record linkage proceeds as follows:</p>
<ol type="1">
<li><p>Records are processed to clean and standardize the email, phone and name attributes. That is, leading and trailing whitespace are removed, capitalization is standardized, phone numbers are validated and standardized, and punctuation is removed from names.</p></li>
<li><p>Records which agree on any of their email, phone or name attributes are linked together.</p></li>
<li><p>Connected components of the resulting graph are computed in order to obtain record clusters.</p></li>
</ol>
<p>This record linkage procedure is extremely simple. It relies on the fact that all three attributes are reliable indicators of a match and that, for two matching records, it is likely that at least one of these three attributes will be in agreement.</p>
<p>Also, the simplicity of the approach allows the use of available additional information (such as IP address and additional questions) for model validation. If the use of this additional information does not highlight any flaws with the simple deterministic approach, then this means that the deterministic approach is already good enough. We will come back to this when discussing model validation techniques.</p>
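<p>The three steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the GroupByRule implementation: the cleaning functions, the <code>link_records</code> helper and the example records are all hypothetical, phone validation is reduced to keeping digits, and connected components are computed with a small union-find structure instead of a graph library.</p>

```python
import string


def clean_email(value):
    """Trim whitespace and lowercase; empty strings become None."""
    return value.strip().lower() or None


def clean_phone(value):
    """Keep digits only (a stand-in for real phone validation)."""
    return "".join(ch for ch in value if ch.isdigit()) or None


def clean_name(value):
    """Remove punctuation, lowercase, and collapse whitespace."""
    no_punct = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split()) or None


CLEANERS = {"email": clean_email, "phone": clean_phone, "name": clean_name}


def link_records(records):
    """Return one cluster label per record (records are dicts of raw strings)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (attribute, cleaned value) -> first record index with it
    for i, record in enumerate(records):
        for attribute, cleaner in CLEANERS.items():
            value = cleaner(record.get(attribute) or "")
            if value is None:
                continue  # missing values never create links
            j = first_seen.setdefault((attribute, value), i)
            parent[find(i)] = find(j)  # merge the two connected components

    roots = [find(i) for i in range(len(records))]
    labels = {}
    return [labels.setdefault(root, len(labels)) for root in roots]


# Hypothetical orders: the first three share a name or an email.
orders = [
    {"name": "Olivier B.", "email": "ob37@duke.edu"},
    {"name": "olivier b", "email": "olivier.binette@duke.edu"},
    {"name": "O. Binette", "email": "OB37@duke.edu ", "phone": "(919) 555-0100"},
    {"name": "Alex C", "email": "ac1@duke.edu"},
]
groups = link_records(orders)
```

<p>Here the first two orders are linked through the cleaned name and the third is linked to the first through the cleaned email, so the three end up in the same cluster even though the second and third orders share no attribute directly. That transitivity is what the connected components step provides.</p>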
</section>
<section id="implementation" class="level3">
<h3 class="anchored" data-anchor-id="implementation">Implementation</h3>
<p>Our deterministic record linkage system is implemented in Python with some generality. The goal is for the system to be able to adapt to changes in data or processes.</p>
<p>The fundamental component of the system is a <code>LinkageRule</code> class. LinkageRule objects can be fitted to data, providing either a clustering or a linkage graph. For instance, a LinkageRule might be a rule to link all records which agree on the email attribute. Another LinkageRule might summarize a set of other rules, such as taking the union or intersection of their links.</p>
<p>The interface is as follows:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> abc <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ABC, abstractmethod</span>
<span id="cb1-2"></span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> LinkageRule(ABC):</span>
<span id="cb1-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Interface for a linkage rule which can be fitted to data.</span></span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    This abstract class specifies three methods. The `fit()` method fits the </span></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    linkage rule to a pandas DataFrame. The `graph` property can be used after </span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    `fit()` to obtain a graph representing the linkage fitted to data.  The </span></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    `groups` property can be used after `fit()` to obtain a membership vector </span></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    representing the clustering fitted to data.</span></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb1-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@abstractmethod</span></span>
<span id="cb1-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, df):</span>
<span id="cb1-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">pass</span></span>
<span id="cb1-17"></span>
<span id="cb1-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@property</span></span>
<span id="cb1-19">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@abstractmethod</span></span>
<span id="cb1-20">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> graph(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-21">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">pass</span></span>
<span id="cb1-22"></span>
<span id="cb1-23">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@property</span></span>
<span id="cb1-24">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@abstractmethod</span></span>
<span id="cb1-25">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> groups(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-26">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">pass</span></span></code></pre></div>
<p>Note that group membership vectors, our representation for cluster groups, are meant to be numpy integer arrays with entries indicating what group (cluster) a given record belongs to. Such a “groups” vector should not contain NA values; rather, it should contain distinct integers for records that are not in the same cluster.</p>
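<p>As a small illustration of this representation, here is a sketch with made-up labels. It uses plain Python lists instead of numpy arrays to stay dependency-free, and the helper <code>clusters_from_membership()</code> is a hypothetical name that is not part of the code in this post. It recovers the clusters encoded by a membership vector:</p>

```python
from collections import defaultdict


def clusters_from_membership(groups):
    """Map each group label to the list of record indices carrying that label."""
    clusters = defaultdict(list)
    for record_index, label in enumerate(groups):
        clusters[label].append(record_index)
    return dict(clusters)


# Records 0 and 2 belong to the same cluster; records 1 and 3 are singletons.
groups = [0, 1, 0, 2]
print(clusters_from_membership(groups))  # {0: [0, 2], 1: [1], 2: [3]}
```

<p>Only the partition matters here: relabelling the groups, say to <code>[5, 7, 5, 9]</code>, would describe the same clustering.</p>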
<p>We will now define two other classes, <code>Match</code> and <code>Any</code>, which allow us to implement deterministic record linkage. The <code>Match</code> class implements an exact matching rule, while <code>Any</code> is the logical disjunction of a given set of rules. Our deterministic record linkage rule for the pantry will therefore be defined as follows:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1">rule <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Any(Match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>), Match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"email"</span>), Match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"phone"</span>))</span></code></pre></div>
<p>Following the <code>LinkageRule</code> interface, this rule will then be fitted to the data and used as follows:</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1">rule.fit(data)</span>
<span id="cb3-2">data.groupby(rule.groups).last() <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get last visit data for all customers.</span></span></code></pre></div>
<p>The benefit of this general interface is that it is extensible. By default, the <code>Any</code> class will return connected components when requesting group clusters. However, other clustering approaches could be used. Exact matching rules could also be relaxed to fuzzy matching rules based on string distance metrics or probabilistic record linkage. All of this can be implemented as additional <code>LinkageRule</code> subclasses in a way that is compatible with the above.</p>
<p>Let’s now work on the <code>Match</code> class. For efficiency, we’ll want <code>Match</code> to operate at the groups level. That is, if <code>Match</code> is called on a set of rules, then we’ll first compute groups for these rules, before computing the intersection of these groups. This core functionality is implemented in the function <code>_groups_from_rules()</code> below. The function <code>_groups()</code> is a simple wrapper to interpret strings as a matching rule on the corresponding column.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> itertools</span>
<span id="cb4-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> igraph <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Graph</span>
<span id="cb4-5"></span>
<span id="cb4-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _groups(rule, df):</span>
<span id="cb4-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Fit linkage rule to dataframe and return membership vector.</span></span>
<span id="cb4-9"></span>
<span id="cb4-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Parameters</span></span>
<span id="cb4-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb4-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    rule: string or LinkageRule</span></span>
<span id="cb4-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Linkage rule to be fitted to the data. If `rule` is a string, then this </span></span>
<span id="cb4-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        is interpreted as an exact matching rule for the corresponding column.</span></span>
<span id="cb4-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    df: DataFrame</span></span>
<span id="cb4-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        pandas Dataframe to which the rule is fitted.</span></span>
<span id="cb4-17"></span>
<span id="cb4-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns</span></span>
<span id="cb4-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb4-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Membership vector (i.e. integer vector) u such that u[i] indicates the </span></span>
<span id="cb4-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    cluster to which dataframe row i belongs. </span></span>
<span id="cb4-22"></span>
<span id="cb4-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Notes</span></span>
<span id="cb4-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -----</span></span>
<span id="cb4-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    NA values are considered to be non-matching.</span></span>
<span id="cb4-26"></span>
<span id="cb4-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Examples</span></span>
<span id="cb4-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    --------</span></span>
<span id="cb4-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; import pandas as pd</span></span>
<span id="cb4-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; df = pd.DataFrame({"fname":["Olivier", "Jean-Francois", "Alex"], </span></span>
<span id="cb4-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ...                    "lname":["Binette", "Binette", pd.NA]})</span></span>
<span id="cb4-32"></span>
<span id="cb4-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Groups specified by distinct first names:</span></span>
<span id="cb4-34"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; _groups("fname", df)</span></span>
<span id="cb4-35"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    array([2, 1, 0], dtype=int32)</span></span>
<span id="cb4-36"></span>
<span id="cb4-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Groups specified by same last names:</span></span>
<span id="cb4-38"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; _groups("lname", df)</span></span>
<span id="cb4-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    array([0, 0, 3], dtype=int32)</span></span>
<span id="cb4-40"></span>
<span id="cb4-41"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Groups specified by a given linkage rule:</span></span>
<span id="cb4-42"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; rule = Match("fname")</span></span>
<span id="cb4-43"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; _groups(rule, df)</span></span>
<span id="cb4-44"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    array([2, 1, 0])</span></span>
<span id="cb4-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb4-46">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(rule, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>)):</span>
<span id="cb4-47">        arr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(pd.Categorical(df[rule]).codes, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>np.int32) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Specifying dtype avoids overflow issues</span></span>
<span id="cb4-48">        I <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (arr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NA value indicators</span></span>
<span id="cb4-49">        arr[I] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.arange(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(arr), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(arr)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(I))</span>
<span id="cb4-50">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> arr</span>
<span id="cb4-51">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(rule, LinkageRule):</span>
<span id="cb4-52">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> rule.fit(df).groups</span>
<span id="cb4-53">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">NotImplementedError</span>()</span>
<span id="cb4-55"></span>
<span id="cb4-56"></span>
<span id="cb4-57"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _groups_from_rules(rules, df):</span>
<span id="cb4-58">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-59"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Fit linkage rules to data and return groups corresponding to their logical </span></span>
<span id="cb4-60"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    conjunction.</span></span>
<span id="cb4-61"></span>
<span id="cb4-62"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    This function computes the logical conjunction of a set of rules, operating </span></span>
<span id="cb4-63"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    at the groups level. That is, rules are fitted to the data, membership </span></span>
<span id="cb4-64"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    vectors are obtained, and then the groups specified by these membership </span></span>
<span id="cb4-65"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    vectors are intersected.</span></span>
<span id="cb4-66"></span>
<span id="cb4-67"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Parameters</span></span>
<span id="cb4-68"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb4-69"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    rules: list of str or LinkageRule</span></span>
<span id="cb4-70"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        List of strings or Linkage rule objects to be fitted to the data. </span></span>
<span id="cb4-71"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Strings are interpreted as exact matching rules on the corresponding </span></span>
<span id="cb4-72"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        columns.</span></span>
<span id="cb4-73"></span>
<span id="cb4-74"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    df: DataFrame</span></span>
<span id="cb4-75"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        pandas DataFrame to which the rules are fitted.</span></span>
<span id="cb4-76"></span>
<span id="cb4-77"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns</span></span>
<span id="cb4-78"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb4-79"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Membership vector representing the cluster to which each dataframe row </span></span>
<span id="cb4-80"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    belongs.</span></span>
<span id="cb4-81"></span>
<span id="cb4-82"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Notes</span></span>
<span id="cb4-83"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -----</span></span>
<span id="cb4-84"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    NA values are considered to be non-matching.</span></span>
<span id="cb4-85"></span>
<span id="cb4-86"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Examples</span></span>
<span id="cb4-87"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    --------</span></span>
<span id="cb4-88"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; import pandas as pd</span></span>
<span id="cb4-89"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; df = pd.DataFrame({"fname":["Olivier", "Jean-Francois", "Alex"], </span></span>
<span id="cb4-90"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ...                    "lname":["Binette", "Binette", pd.NA]})</span></span>
<span id="cb4-91"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; _groups_from_rules(["fname", "lname"], df)</span></span>
<span id="cb4-92"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    array([2, 1, 0])</span></span>
<span id="cb4-93"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb4-94"></span>
<span id="cb4-95">    arr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array([_groups(rule, df) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> rule <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> rules]).T</span>
<span id="cb4-96">    groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.unique(arr, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, return_inverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
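font-style: inherit;">1</span>].ravel()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flatten: some NumPy versions return a 2-D inverse when axis is set</span></span>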
<span id="cb4-97">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> groups</span></code></pre></div>
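<p>To make the intersection step concrete, here is a minimal sketch in plain Python, with no numpy. The function name <code>intersect_groupings()</code> and the example vectors are assumptions made for illustration. Two records end up in the same cluster only if they share a label in every input membership vector, which is what taking unique rows of the stacked vectors achieves above:</p>

```python
def intersect_groupings(*membership_vectors):
    """Intersect clusterings given as membership vectors of equal length.

    Each distinct tuple of labels (one label per input vector) defines one
    cluster of the intersection.
    """
    labels = {}
    result = []
    for key in zip(*membership_vectors):
        # Assign the next unused integer to a tuple of labels seen for the first time.
        result.append(labels.setdefault(key, len(labels)))
    return result


fname_groups = [2, 1, 0, 2]
lname_groups = [0, 0, 3, 0]
print(intersect_groupings(fname_groups, lname_groups))  # [0, 1, 2, 0]
```

<p>The integer labels differ from those returned by <code>np.unique()</code>, which sorts the rows first, but the resulting partition of the records is the same.</p>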
<p>We can now implement <code>Match</code> as follows. Note that the <code>Graph</code> representation of the clustering is only computed if and when needed.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Match(LinkageRule):</span>
<span id="cb5-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Class representing an exact matching rule over a given set of columns.</span></span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Attributes</span></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb5-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    graph: igraph.Graph</span></span>
<span id="cb5-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Graph representing linkage fitted to the data. Defaults to None and is </span></span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        instantiated after the `fit()` function is called.</span></span>
<span id="cb5-10"></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    groups: integer array</span></span>
<span id="cb5-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Membership vector for the linkage clusters fitted to the data. Defaults </span></span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        to None and is instantiated after the `fit()` function is called.</span></span>
<span id="cb5-14"></span>
<span id="cb5-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Methods</span></span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb5-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    fit(df)</span></span>
<span id="cb5-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Fits linkage rule to the given dataframe.</span></span>
<span id="cb5-19"></span>
<span id="cb5-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Examples</span></span>
<span id="cb5-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    --------</span></span>
<span id="cb5-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; import pandas as pd</span></span>
<span id="cb5-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; df = pd.DataFrame({"fname":["Olivier", "Jean-Francois", "Alex"], </span></span>
<span id="cb5-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ...                    "lname":["Binette", "Binette", pd.NA]})</span></span>
<span id="cb5-25"></span>
<span id="cb5-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Link records which agree on both the "fname" and "lname" fields.</span></span>
<span id="cb5-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; rule = Match("fname", "lname")</span></span>
<span id="cb5-28"></span>
<span id="cb5-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Fit linkage rule to the data.</span></span>
<span id="cb5-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; _ = rule.fit(df)</span></span>
<span id="cb5-31"></span>
<span id="cb5-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Construct deduplicated dataframe, retaining only the first record in each cluster.</span></span>
<span id="cb5-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    &gt;&gt;&gt; _ = df.groupby(rule.groups).first()</span></span>
<span id="cb5-34"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb5-35"></span>
<span id="cb5-36">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args):</span>
<span id="cb5-37">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-38"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Parameters</span></span>
<span id="cb5-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        ----------</span></span>
<span id="cb5-40"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        args: list containing strings and/or LinkageRule objects.</span></span>
<span id="cb5-41"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            The `Match` object represents the logical conjunction of the set of </span></span>
<span id="cb5-42"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            rules given in the `args` parameter. </span></span>
<span id="cb5-43"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb5-44">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.rules <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> args</span>
<span id="cb5-45">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb5-46">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb5-47"></span>
<span id="cb5-48">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, df):</span>
<span id="cb5-49">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _groups_from_rules(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.rules, df)</span>
<span id="cb5-50">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb5-51">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb5-52"></span>
<span id="cb5-53">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span></span>
<span id="cb5-54"></span>
<span id="cb5-55">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@property</span></span>
<span id="cb5-56">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> groups(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb5-57">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._groups</span></code></pre></div>
<p>One more method is needed to complete the implementation of a <code>LinkageRule</code>, namely the <code>graph</code> property. This property returns a <code>Graph</code> object corresponding to the matching rule. The graph is built as follows. First, we construct an inverted index for the clustering. That is, we construct a dictionary associating to each cluster the nodes which it contains. Then, an edge list is obtained by linking all pairs of nodes which belong to the same cluster. Note that the pure Python implementation below is not efficient for large clusters, since a cluster with k records produces k(k-1)/2 edges. This is not a problem for now since we will generally avoid computing this graph.</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Part of the definition of the `Match` class:</span></span>
<span id="cb6-2">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@property</span></span>
<span id="cb6-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> graph(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Graph:</span>
<span id="cb6-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_graph:</span>
<span id="cb6-5">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Inverted index</span></span>
<span id="cb6-6">            clust <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"groups"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.groups}</span>
<span id="cb6-7">                                 ).groupby(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"groups"</span>).indices</span>
<span id="cb6-8">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Graph(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.n)</span>
<span id="cb6-9">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph.add_edges(itertools.chain.from_iterable(</span>
<span id="cb6-10">                itertools.combinations(c, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> c <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> clust.values()))</span>
<span id="cb6-11">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb6-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph</span></code></pre></div>
<p>Finally, let’s implement the <code>Any</code> class. Its purpose is to take the union (i.e.&nbsp;logical disjunction) of a set of rules. Just like for <code>Match</code>, we can choose to operate at the groups or graph level. Here we’ll work at the groups level for efficiency. That is, given a set of rules, <code>Any</code> will first compute their corresponding clusters before merging overlapping clusters.</p>
<p>There are quite a few different ways to efficiently merge clusters. Here we’ll merge clusters by computing a “path graph” representation of each clustering, taking the union of these graphs, and then computing connected components. For a given cluster, say one containing records a, b, and c, the “path graph” links the records as a path a–b–c. A path needs only k − 1 edges for a cluster of k records, instead of the k(k − 1)/2 edges of the fully connected representation, while yielding the same connected components.</p>
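<p>To see why path graphs are enough, here is a small standalone sketch that uses only the standard library. The helper names (<code>path_edges</code>, <code>connected_components</code>) and the two toy clusterings are made up for illustration and are not part of the module; a plain union-find stands in for igraph’s connected components.</p>

```python
from itertools import combinations


def path_edges(cluster):
    """Edges of the path a-b-c-... linking consecutive cluster elements."""
    return list(zip(cluster, cluster[1:]))


def connected_components(n, edges):
    """Membership vector of connected components, computed with union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical clusterings of six records under two different rules.
clusters_rule_1 = [[0, 1, 2], [3], [4, 5]]
clusters_rule_2 = [[0], [1], [2, 3], [4], [5]]
all_clusters = clusters_rule_1 + clusters_rule_2

# Union of the path graphs versus union of the fully connected graphs.
path_union = [e for c in all_clusters for e in path_edges(c)]
full_union = [e for c in all_clusters for e in combinations(c, 2)]

merged = connected_components(6, path_union)
```

<p>In this toy example, both edge lists lead to the same merged clusters (records 0 to 3 together, and records 4 and 5 together), but the path representation gets there with fewer edges. The saving grows quadratically with cluster size.</p>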
<p>First, we define the functions needed to compute path graphs:</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> pairwise(iterable):</span>
<span id="cb7-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Iterate over consecutive pairs:</span></span>
<span id="cb7-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        s -&gt; (s[0], s[1]), (s[1], s[2]), (s[2], s[3]), ...</span></span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Note</span></span>
<span id="cb7-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----</span></span>
<span id="cb7-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Current implementation is from itertools' recipes list available at </span></span>
<span id="cb7-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    https://docs.python.org/3/library/itertools.html</span></span>
<span id="cb7-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb7-11">    a, b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> itertools.tee(iterable)</span>
<span id="cb7-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(b, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb7-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(a, b)</span>
<span id="cb7-14"></span>
<span id="cb7-15"></span>
<span id="cb7-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _path_graph(rule, df):</span>
<span id="cb7-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb7-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Compute path graph corresponding to the rule's clustering: cluster elements </span></span>
<span id="cb7-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    are connected as a path.</span></span>
<span id="cb7-20"></span>
<span id="cb7-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Parameters</span></span>
<span id="cb7-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb7-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    rule: string or LinkageRule</span></span>
<span id="cb7-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Linkage rule for which to compute the corresponding path graph </span></span>
<span id="cb7-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        (strings are interpreted as exact matching rules for the corresponding column).</span></span>
<span id="cb7-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    df: DataFrame</span></span>
<span id="cb7-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Data to which the linkage rule is fitted.</span></span>
<span id="cb7-28"></span>
<span id="cb7-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns</span></span>
<span id="cb7-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb7-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Graph object such that nodes in the same cluster (according to the fitted </span></span>
<span id="cb7-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    linkage rule) are connected as graph paths.</span></span>
<span id="cb7-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb7-34">    gr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> _groups(rule, df)</span>
<span id="cb7-35">    </span>
<span id="cb7-36">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Inverted index</span></span>
<span id="cb7-37">    clust <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"groups"</span>: gr}</span>
<span id="cb7-38">                         ).groupby(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"groups"</span>).indices</span>
<span id="cb7-39">    graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Graph(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])</span>
<span id="cb7-40">    graph.add_edges(itertools.chain.from_iterable(</span>
<span id="cb7-41">        pairwise(c) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> c <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> clust.values()))</span>
<span id="cb7-42"></span>
<span id="cb7-43">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> graph</span></code></pre></div>
<p>We can now implement the <code>Any</code> class:</p>
<div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Any(LinkageRule):</span>
<span id="cb8-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Class representing the logical disjunction of linkage rules.</span></span>
<span id="cb8-4"></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Attributes</span></span>
<span id="cb8-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    ----------</span></span>
<span id="cb8-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    graph: igraph.Graph</span></span>
<span id="cb8-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Graph representing linkage fitted to the data. Defaults to None and is </span></span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        instantiated after the `fit()` function is called.</span></span>
<span id="cb8-10"></span>
<span id="cb8-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    groups: integer array</span></span>
<span id="cb8-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Membership vector for the linkage clusters fitted to the data. Defaults </span></span>
<span id="cb8-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        to None and is instantiated after the `fit()` function is called.</span></span>
<span id="cb8-14"></span>
<span id="cb8-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Methods</span></span>
<span id="cb8-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    -------</span></span>
<span id="cb8-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    fit(df)</span></span>
<span id="cb8-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Fits linkage rule to the given dataframe.</span></span>
<span id="cb8-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb8-20"></span>
<span id="cb8-21">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>args):</span>
<span id="cb8-22">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Parameters</span></span>
<span id="cb8-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        ----------</span></span>
<span id="cb8-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        args: list containing strings and/or LinkageRule objects.</span></span>
<span id="cb8-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            The `Any` object represents the logical disjunction of the set of </span></span>
<span id="cb8-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">            rules given by `args`. </span></span>
<span id="cb8-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb8-29">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.rules <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> args</span>
<span id="cb8-30">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb8-31">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb8-32">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb8-33"></span>
<span id="cb8-34">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, df):</span>
<span id="cb8-35">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb8-36">        graphs_vect <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [_path_graph(rule, df) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> rule <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.rules]</span>
<span id="cb8-37">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> igraph.union(graphs_vect)</span>
<span id="cb8-38">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span></span>
<span id="cb8-39"></span>
<span id="cb8-40">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@property</span></span>
<span id="cb8-41">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> groups(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb8-42">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_groups:</span>
<span id="cb8-43">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._update_groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb8-44">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._groups <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(</span>
<span id="cb8-45">                <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph.clusters().membership)</span>
<span id="cb8-46">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._groups</span>
<span id="cb8-47"></span>
<span id="cb8-48">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@property</span></span>
<span id="cb8-49">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> graph(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Graph:</span>
<span id="cb8-50">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._graph</span></code></pre></div>
<p>The complete Python module (still under development) implementing this approach can be found on GitHub at <a href="https://github.com/olivierbinette/groupbyrule">OlivierBinette/GroupByRule</a>.</p>
</section>
<section id="limitations" class="level3">
<h3 class="anchored" data-anchor-id="limitations">Limitations</h3>
<p>There are quite a few limitations with this simple deterministic approach. We’ll see in the model evaluation section that these do not affect performance to a large degree. However, for a system used with more data or over a longer timeframe, these should be carefully considered.</p>
<p>First, the deterministic linkage does not allow the consideration of contradictory evidence. For instance, if long-form Duke email addresses are provided on two records and do not agree (e.g.&nbsp;“olivier.binette@duke.edu” and “olivier.bonhomme@duke.edu” are provided), then we know <em>for sure</em> that the records do not correspond to the same individual, even if the same name was provided (here Olivier B.). The consideration of such evidence could rely on probabilistic record linkage, where each record pair is assigned a match probability.</p>
<p>Second, the use of connected components to resolve transitivity can be problematic, as a single spurious link could connect two large clusters by mistake. More sophisticated graph clustering techniques, in combination with probabilistic record linkage, would be required to mitigate the issue.</p>
</section>
</section>
<section id="model-evaluation" class="level2">
<h2 class="anchored" data-anchor-id="model-evaluation">Model Evaluation</h2>
<p>I cannot share any of the data which we have at the Pantry. However, I can describe general steps to be taken to evaluate model performance in practice.</p>
<section id="pairwise-precision-and-recall" class="level3">
<h3 class="anchored" data-anchor-id="pairwise-precision-and-recall">Pairwise Precision and Recall</h3>
<p>Here we will evaluate linkage performance using pairwise precision <img src="https://latex.codecogs.com/png.latex?P"> and recall <img src="https://latex.codecogs.com/png.latex?R">. The precision <img src="https://latex.codecogs.com/png.latex?P"> is defined as the proportion of predicted links which are true matches, whereas the recall <img src="https://latex.codecogs.com/png.latex?R"> is the proportion of true matches which are correctly predicted. That is, if <img src="https://latex.codecogs.com/png.latex?TP"> is the number of true positive links, <img src="https://latex.codecogs.com/png.latex?%5Chat%20T"> the number of predicted links, and <img src="https://latex.codecogs.com/png.latex?T"> the number of true matches, then we have <img src="https://latex.codecogs.com/png.latex?%0AP%20=%20TP/%5Chat%20T,%20%5Cquad%20R%20=%20TP/T.%0A"></p>
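<p>As a minimal sketch of these definitions, the following computes pairwise precision and recall when both the predicted and the true cluster memberships are known. This is only possible on toy or labeled data; the function names and the five-record example are hypothetical. A record pair counts as a true positive when its two records share both the predicted label and the true label.</p>

```python
from collections import Counter
from math import comb


def count_links(labels):
    """Number of record pairs whose two records share the same label."""
    return sum(comb(size, 2) for size in Counter(labels).values())


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of a predicted clustering.

    Both arguments are membership vectors: entry i is the cluster label of
    record i. Assumes each clustering contains at least one link.
    """
    # A pair is a true positive when it is linked in both clusterings, i.e.
    # when the two records share both the predicted and the true label.
    true_positives = count_links(zip(predicted, truth))
    return (true_positives / count_links(predicted),
            true_positives / count_links(truth))


# Hypothetical example with five records.
predicted = [0, 0, 0, 1, 1]
truth = [0, 0, 1, 1, 1]
precision, recall = pairwise_precision_recall(predicted, truth)
```

<p>Here there are four predicted links and four true links, two of which coincide, so both precision and recall equal 0.5.</p>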
<section id="estimating-precision" class="level4">
<h4 class="anchored" data-anchor-id="estimating-precision">Estimating Precision</h4>
<p>It is helpful to express precision and recall in cluster form, where cluster elements are all interlinked. Let <img src="https://latex.codecogs.com/png.latex?C"> be the set of true clusters and let <img src="https://latex.codecogs.com/png.latex?%5Chat%20C"> be the set of predicted clusters. For a given cluster <img src="https://latex.codecogs.com/png.latex?%5Chat%20c%20%5Cin%20%5Chat%20C">, let <img src="https://latex.codecogs.com/png.latex?C%20%5Ccap%20%5Chat%20c"> be the restriction of the clustering <img src="https://latex.codecogs.com/png.latex?C"> to <img src="https://latex.codecogs.com/png.latex?%5Chat%20c">. Then we have <img src="https://latex.codecogs.com/png.latex?%0A%20%20P%20=%20%5Cfrac%7B%5Csum_%7B%5Chat%20c%20%5Cin%20%5Chat%20C%7D%20%5Csum_%7Be%20%5Cin%20C%20%5Ccap%20%5Chat%20c%7D%20%7B%5Clvert%20e%5Crvert%20%5Cchoose%202%7D%20%7D%7B%20%5Csum_%7B%5Chat%20c%20%5Cin%20%5Chat%20C%7D%20%7B%5Clvert%20%5Chat%20c%20%5Crvert%20%5Cchoose%202%7D%7D.%0A"></p>
<p>The denominator can be computed exactly, while the numerator can be estimated by randomly sampling clusters <img src="https://latex.codecogs.com/png.latex?%5Chat%20c%20%5Cin%20%5Chat%20C">, breaking them up into true clusters <img src="https://latex.codecogs.com/png.latex?e%20%5Cin%20C%20%5Ccap%20%5Chat%20c">, and then computing the sum of the combinations <img src="https://latex.codecogs.com/png.latex?%7B%5Clvert%20e%5Crvert%20%5Cchoose%202%7D">. Importance sampling could be used to reduce the variance of the estimator, but it does not seem necessary for the scale of the data which we have at the Pantry, where each predicted cluster can be examined quite quickly.</p>
<p>In practice, the precision estimation process can be carried out as follows:</p>
<ol type="1">
<li>Sample predicted clusters at random (in the case of the Pantry, we can take all predicted clusters).</li>
<li>Make a spreadsheet with all the records corresponding to the sampled clusters.</li>
<li>Sort the spreadsheet by predicted cluster ID.</li>
<li>Add a new empty column to the spreadsheet, called “trueSubClusters”.</li>
<li>Separately look at each predicted cluster. If the cluster should be broken up into multiple parts, use the “trueSubClusters” column to provide identifiers for true cluster membership. Note that these identifiers do not need to match across predicted clusters.</li>
</ol>
<p>The spreadsheet can then be read in and processed in a straightforward way to obtain an estimated precision value.</p>
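<p>The processing step can be sketched as follows, assuming that every predicted cluster was reviewed (as was possible for the Pantry) and that the spreadsheet has already been loaded as a list of (predicted cluster ID, “trueSubClusters” value) pairs. The function name and the toy rows are hypothetical. I also assume that a blank cell, represented by <code>None</code>, means the reviewer left the predicted cluster intact.</p>

```python
from collections import Counter
from math import comb


def precision_from_review(rows):
    """Pairwise precision from reviewed (predicted_id, true_sub_id) rows.

    Assumption: a blank (None) true_sub_id means the reviewer left the
    predicted cluster intact, so blanks in one cluster share one label.
    """
    predicted_sizes = Counter(pred for pred, _ in rows)
    # Sub-cluster identifiers only need to be unique within a predicted
    # cluster, so sub-clusters are keyed by the (predicted, sub) pair.
    sub_sizes = Counter(rows)

    true_links = sum(comb(size, 2) for size in sub_sizes.values())
    predicted_links = sum(comb(size, 2) for size in predicted_sizes.values())
    return true_links / predicted_links


# Hypothetical review: cluster 1 is correct, cluster 2 should be split in two.
rows = [
    (1, None), (1, None), (1, None),
    (2, "a"), (2, "a"), (2, "b"), (2, "b"),
]
estimated_precision = precision_from_review(rows)
```

<p>In this toy review, 5 of the 9 predicted links are true links, giving a precision of 5/9. If only a random sample of clusters were reviewed, the numerator would be an estimate and would need to be scaled accordingly, which this sketch does not attempt.</p>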
</section>
<section id="estimating-recall" class="level4">
<h4 class="anchored" data-anchor-id="estimating-recall">Estimating Recall</h4>
<p>Estimating recall is a bit trickier than estimating precision, but we can make one assumption to simplify the process. Assume that precision is exactly 1, or very close to 1, so that all predicted clusters can roughly be taken at face value. Estimating recall then boils down to the problem of identifying which predicted clusters should be merged together.</p>
<p>Indeed, using the same notation as above, we can write <img src="https://latex.codecogs.com/png.latex?%0AR%20=%20%5Cfrac%7B%5Csum_%7B%20c%20%5Cin%20%20C%7D%20%5Csum_%7Be%20%5Cin%20%5Chat%20C%20%5Ccap%20%20c%7D%20%7B%5Clvert%20e%5Crvert%20%5Cchoose%202%7D%20%7D%7B%20%5Csum_%7B%20c%20%5Cin%20%20C%7D%20%7B%5Clvert%20%20c%20%5Crvert%20%5Cchoose%202%7D%7D.%0A"> If precision is 1, then the denominator can be computed from the sizes of predicted clusters which are identified to be merged. On the other hand, the numerator simplifies to <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Be%20%5Cin%20%5Chat%20C%7D%7B%5Clvert%20e%20%5Crvert%20%5Cchoose%202%7D">, which can be computed exactly from the sizes of predicted clusters. In the case of the Pantry, wrongly separated clusters are likely to be due to small differences in names and emails. Our procedure to identify clusters which should have been merged together is as follows:</p>
<ol type="1">
<li>Make a spreadsheet containing canonical customer records (one representative record for each predicted individual customer).</li>
<li>Create a new empty column named “trueClustersA”.</li>
<li>Sort the spreadsheet by name.</li>
<li>Go through the spreadsheet from top to bottom, looking at whether or not consecutive predicted clusters should be merged together. If so, write a corresponding cluster membership ID in the “trueClustersA” column.</li>
<li>Create a new empty column named “trueClustersB”.</li>
<li>Sort the spreadsheet by email.</li>
<li>Go through the spreadsheet from top to bottom, looking at whether or not consecutive predicted clusters should be merged together. If so, write a corresponding cluster membership ID in the “trueClustersB” column.</li>
</ol>
<p>This process might not catch all wrongly separated clusters, but it is likely to find many of the errors due to different ways of writing names and different email addresses. The resulting spreadsheet can then easily be processed to obtain an estimated recall. If we were working with a larger dataset, we’d have to use further blocking to restrict our consideration to a more manageable subset of the data.</p>
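<p>Under the precision-equals-1 assumption, the final computation can be sketched as below. The function name, its arguments, and the toy review are hypothetical. The sketch assumes one entry per predicted cluster, with the cluster’s record count and the identifiers from the “trueClustersA” and “trueClustersB” columns (<code>None</code> for blank cells). Predicted clusters that share an identifier in either column are merged with a small union-find.</p>

```python
from math import comb


def recall_from_review(sizes, merges_a, merges_b):
    """Pairwise recall, assuming every predicted cluster is pure (precision 1).

    sizes[i] is the number of records in predicted cluster i. merges_a[i] and
    merges_b[i] hold the reviewer's merge identifiers (None when left blank).
    """
    parent = list(range(len(sizes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Predicted clusters sharing an identifier in either column get merged.
    for column in (merges_a, merges_b):
        first_seen = {}
        for i, label in enumerate(column):
            if label is not None:
                parent[find(i)] = find(first_seen.setdefault(label, i))

    # Sizes of the true clusters obtained after merging.
    true_sizes = {}
    for i, size in enumerate(sizes):
        root = find(i)
        true_sizes[root] = true_sizes.get(root, 0) + size

    predicted_links = sum(comb(size, 2) for size in sizes)
    true_links = sum(comb(size, 2) for size in true_sizes.values())
    return predicted_links / true_links


# Hypothetical review of four predicted clusters: the name pass merges
# clusters 0 and 1, the email pass merges clusters 1 and 2.
sizes = [3, 2, 1, 4]
estimated_recall = recall_from_review(
    sizes,
    merges_a=["x", "x", None, None],
    merges_b=[None, "y", "y", None],
)
```

<p>In this toy example the first three predicted clusters form a single true cluster of 6 records, so there are 21 true links, of which 10 were predicted, for a recall of 10/21. The sketch assumes at least one true cluster has two or more records, since otherwise the denominator is zero.</p>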
</section>
</section>
<section id="results" class="level3">
<h3 class="anchored" data-anchor-id="results">Results</h3>
<p>I used the above procedures to estimate precision and recall of our simple deterministic approach to deduplicate the Pantry’s data. There was a total of 3281 bag order records for 689 estimated customers. The results are below.</p>
<p><strong>Estimated Precision: 92%</strong></p>
<p>Precision is somewhat low due to about 3 relatively large clusters (around 30-50 records each) which should have been broken up into a few parts. 2% precision was lost due to a couple who shared a phone number, where each had about 20 order records. The vast majority of spurious links were tied to bag orders for which only the first name was provided (e.g.&nbsp;“Sam”). The use of negative evidence to distinguish between individuals would help resolve these cases.</p>
<p><strong>Estimated Recall: 99.6%</strong></p>
<p>This is certainly an overestimate, but it does show that missing links are not obviously showing up. Given the structure of the Pantry data, it is likely that recall is indeed quite high.</p>
</section>
</section>
<section id="final-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="final-thoughts">Final thoughts</h2>
<p>There are many ways in which the record linkage approach could be improved. As previously discussed, probabilistic record linkage would allow the consideration of negative evidence and the use of additional quasi-identifying information (such as IP addresses and other responses on the bag order forms). I’m looking forward to building on the <code>GroupByRule</code> Python module to provide a user-friendly and unified interface to more flexible methodology.</p>
<p>However, it is important to ensure that any record linkage approach is intelligible and rooted in a good understanding of the underlying data. In this context, the use of a well-thought-out deterministic approach can provide good performance, at least as a first step or baseline for comparison. Furthermore, it is important to spend sufficient time investigating the results of the linkage to evaluate performance. I have highlighted simple steps which can be taken to estimate precision and make a good effort at identifying missing links. This is highly informative for model validation, improvement, and for the interpretation of any following results.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-Campbell2008" class="csl-entry">
Campbell, Kevin M., Dennis Deck, and Antoinette Krupski. 2008. <span>“Record Linkage Software in the Public Domain: A Comparison of <span>L</span>ink Plus, the <span>L</span>ink <span>K</span>ing, and a ’Basic’ Deterministic Algorithm.”</span> <em>Health Informatics Journal</em> 14 (1): 5–15.
</div>
<div id="ref-Gomatam2002" class="csl-entry">
Gomatam, Shanti, Randy Carter, Mario Ariet, and Glenn Mitchell. 2002. <span>“An Empirical Comparison of Record Linkage Procedures.”</span> <em>Statistics in Medicine</em> 21 (10): 1485–96. <a href="https://doi.org/10.1002/sim.1147">https://doi.org/10.1002/sim.1147</a>.
</div>
<div id="ref-Monge1997" class="csl-entry">
Monge, Alvaro E., and Charles P. Elkan. 1997. <span>“An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records.”</span> <em>Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery</em>, 23–29. <a href="https://doi.org/10.1.1.28.8405">https://doi.org/10.1.1.28.8405</a>.
</div>
<div id="ref-Potosky1993" class="csl-entry">
Potosky, Arnold L., Gerald F. Riley, James D. Lubitz, Renee M. Mentnech, and Larry G. Kessler. 1993. <span>“Potential for Cancer Related Health Services Research Using a Linked Medicare-Tumor Registry Database.”</span> <em>Medical Care</em> 31 (8): 732–48. <a href="https://doi.org/10.1097/00005650-199308000-00006">https://doi.org/10.1097/00005650-199308000-00006</a>.
</div>
<div id="ref-Tromp2011" class="csl-entry">
Tromp, Miranda, Anita C. Ravelli, Gouke J. Bonsel, Arie Hasman, and Johannes B. Reitsma. 2011. <span>“Results from Simulated Data Sets: Probabilistic Record Linkage Outperforms Deterministic Record Linkage.”</span> <em>Journal of Clinical Epidemiology</em> 64 (5): 565–72. <a href="https://doi.org/10.1016/j.jclinepi.2010.05.008">https://doi.org/10.1016/j.jclinepi.2010.05.008</a>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>python</category>
  <category>statistics</category>
  <guid>https://olivierbinette.ca/pages/posts/2022-01-01-record-linkage-at-the-gpsg-community-pantry/record-linkage-at-the-gpsg-community-pantry.html</guid>
  <pubDate>Thu, 23 Dec 2021 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Validating function arguments in R</title>
  <link>https://olivierbinette.ca/pages/posts/2020-11-15-validating-arguments-in-r/2020-11-15-validating-arguments.html</link>
  <description><![CDATA[ 





<p><strong>Update:</strong> The <code>assert</code> package is now available on CRAN:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"assert"</span>)</span></code></pre></div>
<hr>
<p>I was programming a Gibbs sampler the other day and all hell broke loose: small errors were hard to trace back to the source of the problem and debugging was a pain.</p>
<p>The bugs could have been caught much earlier if I had properly validated the input arguments of my various helper functions. So I decided it was time for me to learn how to do this properly.</p>
<section id="validating-function-input-arguments-in-r" class="level2">
<h2 class="anchored" data-anchor-id="validating-function-input-arguments-in-r">Validating function input arguments in R</h2>
<p>The easiest way is to manually incorporate checks.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb2-1">mySum <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(a, b) {</span>
<span id="cb2-2">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(a) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">||</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(b)) {</span>
<span id="cb2-3">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stop</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Arguments should be numeric."</span>)</span>
<span id="cb2-4">  }</span>
<span id="cb2-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(a) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(b)) {</span>
<span id="cb2-6">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stop</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Arguments should be of the same length."</span>)</span>
<span id="cb2-7">  } </span>
<span id="cb2-8">  </span>
<span id="cb2-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(a<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>b)</span>
<span id="cb2-10">}</span></code></pre></div>
<p>This works well enough, but it takes up a lot of space and you have to manually write up the description of the errors.</p>
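<p>Base R’s <code>stopifnot()</code> offers a more compact middle ground. The sketch below assumes R 4.0.0 or later, where the names given to the arguments of <code>stopifnot()</code> are used as the error messages; the name <code>mySumCompact</code> is only for illustration.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Sketch assuming R &gt;= 4.0.0: argument names of stopifnot() become error messages.
mySumCompact &lt;- function(a, b) {
  stopifnot(
    "Arguments should be numeric." = is.numeric(a) &amp;&amp; is.numeric(b),
    "Arguments should be of the same length." = length(a) == length(b)
  )
  a + b
}</code></pre></div>
<p>The checks are shorter, but the messages still have to be written by hand.</p>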
<section id="a-first-solution" class="level3">
<h3 class="anchored" data-anchor-id="a-first-solution">A first solution</h3>
<p>Let’s use the <code>assertthat</code> package.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb3-1">mySum <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(a, b) {</span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">assert_that</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(a), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(b))</span>
<span id="cb3-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">assert_that</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(a) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(b))</span>
<span id="cb3-4">  </span>
<span id="cb3-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(a<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>b)</span>
<span id="cb3-6">}</span></code></pre></div>
<p>This is neater, but the error messages are not very descriptive.</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mySum</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>)</span>
<span id="cb4-2">        Error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> b is not a numeric or integer vector </span></code></pre></div>
<p>What is <code>b</code> here? What arguments in the function call caused the error? It’s a bit hard to tell, especially if the call to this function is hidden in some large Gibbs sampler.</p>
</section>
<section id="the-assert-function" class="level3">
<h3 class="anchored" data-anchor-id="the-assert-function">The <code>assert</code> function</h3>
<p>My solution is the <code>assert</code> function, which you can find <a href="https://gist.github.com/OlivierBinette/a048d7c1f470740b64e95c74828c8516">in my GitHub Gist</a>.</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">source</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"assert.R"</span>)</span></code></pre></div>
<p>Usage is similar to what we did above:</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb6-1">mySum <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(a, b) {</span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">assert</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(a), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(b))</span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">assert</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(a) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(b))</span>
<span id="cb6-4">  </span>
<span id="cb6-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(a<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>b)</span>
<span id="cb6-6">}</span></code></pre></div>
<p>But now we have much more descriptive error messages.</p>
<div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource r number-lines code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mySum</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>)</span>
<span id="cb7-2">        Error<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mySum</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span>)</span>
<span id="cb7-3">        Failed checks<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> </span>
<span id="cb7-4">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(b) </span></code></pre></div>
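<p>For readers curious how a message like this can be produced, here is a minimal sketch in base R. It is not the code of the <code>assert</code> package or of the Gist: the names <code>assert_sketch</code> and <code>mySum2</code> and the exact message layout are assumptions made for illustration. The idea is to recover the text of each check with <code>substitute()</code> and the caller’s call, with argument names filled in, with <code>sys.call()</code> and <code>match.call()</code>.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical helper for illustration only; not the implementation of the assert package.
assert_sketch &lt;- function(...) {
  # Unevaluated check expressions, e.g. is.numeric(b)
  exprs &lt;- as.list(substitute(list(...)))[-1]
  labels &lt;- vapply(exprs, function(e) paste(deparse(e), collapse = " "), character(1))

  # A check passes only if it evaluates to a logical vector that is all TRUE
  passed &lt;- vapply(list(...), function(x) is.logical(x) &amp;&amp; isTRUE(all(x)), logical(1))
  if (all(passed)) return(invisible(TRUE))

  # Call of the function that invoked assert_sketch(), with argument names matched
  parent &lt;- sys.parent()
  caller &lt;- match.call(definition = sys.function(parent), call = sys.call(parent))

  stop("in ", paste(deparse(caller), collapse = " "),
       "\nFailed checks:\n  ", paste(labels[!passed], collapse = "\n  "),
       call. = FALSE)
}

# Example use inside a function
mySum2 &lt;- function(a, b) {
  assert_sketch(is.numeric(a), is.numeric(b), length(a) == length(b))
  a + b
}</code></pre></div>
<p>Calling <code>mySum2(1, "1")</code> then raises an error whose message names both the call and the failed check <code>is.numeric(b)</code>.</p>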


</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>r-programming</category>
  <guid>https://olivierbinette.ca/pages/posts/2020-11-15-validating-arguments-in-r/2020-11-15-validating-arguments.html</guid>
  <pubDate>Sun, 15 Nov 2020 05:00:00 GMT</pubDate>
</item>
<item>
  <title>Posterior Concentration in terms of the Separation Alpha-Entropy</title>
  <dc:creator>Olivier Binette</dc:creator>
  <link>https://olivierbinette.ca/pages/posts/2020-11-15-posterior-concentration-in-terms-of-the-separation-alpha-entropy/posterior-concentration-in-terms-of-the-separation-alpha-entropy.html</link>
  <description><![CDATA[ 





<p>This post continues the series on posterior concentration under misspecification. Here I introduce a unifying point of view on the subject through the <em>separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy</em>. We use this notion of prior entropy to bridge the gap between Bayesian fractional posteriors and regular posterior distributions: in the case where this entropy is finite, direct analogues to some of the concentration results for fractional posteriors (Bhattacharya et al., 2019) are recovered.</p>
<p>This post is going to be quite abstract, just like last week’s. I’ll talk in a future post about how this separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy generalizes the covering numbers for testing under misspecification of Kleijn et al.&nbsp;(2006) as well as the prior summability conditions of De Blasi et al.&nbsp;(2013).</p>
<p>Quick word of warning: this is not the definitive version of the results I’m working on, but I still had to get them out somewhere.</p>
<p>Another word of warning: WordPress has gotten significantly worse at dealing with math recently. I will find a new platform, but for now expect to find typos and some rendering issues.</p>
<section id="the-framework" class="level2">
<h2 class="anchored" data-anchor-id="the-framework">The framework</h2>
<p>We continue in the same theoretical framework as before: <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BF%7D"> is a set of densities on a complete and separable metric space <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D"> with respect to a <img src="https://latex.codecogs.com/png.latex?%5Csigma">-finite measure <img src="https://latex.codecogs.com/png.latex?%5Cmu"> defined on the Borel <img src="https://latex.codecogs.com/png.latex?%5Csigma">-algebra of <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D">, <img src="https://latex.codecogs.com/png.latex?H"> is the Hellinger distance defined by <img src="https://latex.codecogs.com/png.latex?%20H(f,%20g)%20=%20%5Cleft(%5Cint%20%5Cleft(%5Csqrt%7Bf%7D%20-%20%5Csqrt%7Bg%7D%5Cright)%5E2%20%5C,%20d%5Cmu%5Cright)%5E%7B1/2%7D"> and we make use of the Rényi divergences defined by <img src="https://latex.codecogs.com/png.latex?%20d_%5Calpha(f,%20g)%20=%20-%5Calpha%5E%7B-1%7D%5Clog%20A_%5Calpha(f,%20g),%5Cquad%20A_%5Calpha(f,%20g)%20=%20%5Cint%20f%5E%7B%5Calpha%7Dg%5E%7B1-%5Calpha%7D%20%5C,d%5Cmu%20."> Here we assume that the data are generated from a distribution <img src="https://latex.codecogs.com/png.latex?f_0%20%5Cin%20%5Cmathbb%7BF%7D"> having a density in our model (this assumption could be weakened), and we define the <em>off-centered</em> Rényi divergence <img src="https://latex.codecogs.com/png.latex?%20d_%5Calpha%5E%7Bf_0%7D(f,%20f%5E%5Cstar)%20=%20-%5Calpha%5E%7B-1%7D%5Clog(A_%5Calpha%5E%7Bf_0%7D(f,%20f%5E%5Cstar))"> where <img src="https://latex.codecogs.com/png.latex?%20A_%5Calpha%5E%7Bf_0%7D(f,%20f%5E%5Cstar)%20=%20%5Cint%20(f/f%5E%5Cstar)%5E%5Calpha%20f_0%5C,d%5Cmu">, assuming that all of this is well defined.</p>
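<p>As a sanity check on these definitions, here is a short worked relation between the Hellinger distance and the Rényi divergence in the special case <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%201/2"> (a standard computation, added here as an illustration rather than as one of the results of this post). Expanding the square in the definition of <img src="https://latex.codecogs.com/png.latex?H"> gives <img src="https://latex.codecogs.com/png.latex?%20A_%7B1/2%7D(f,%20g)%20=%20%5Cint%20%5Csqrt%7Bfg%7D%20%5C,d%5Cmu%20=%201%20-%20%5Ctfrac%7B1%7D%7B2%7DH(f,%20g)%5E2"> and therefore <img src="https://latex.codecogs.com/png.latex?%20d_%7B1/2%7D(f,%20g)%20=%20-2%5Clog%5Cleft(1%20-%20%5Ctfrac%7B1%7D%7B2%7DH(f,%20g)%5E2%5Cright)%20%5Cgeq%20H(f,%20g)%5E2,"> where the inequality uses <img src="https://latex.codecogs.com/png.latex?-%5Clog(1-x)%20%5Cgeq%20x"> for <img src="https://latex.codecogs.com/png.latex?0%20%5Cleq%20x%20%3C%201">.</p>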
<section id="prior-and-posterior-distributions" class="level3">
<h3 class="anchored" data-anchor-id="prior-and-posterior-distributions">Prior and posterior distributions</h3>
<p>Now let <img src="https://latex.codecogs.com/png.latex?%5CPi"> be a prior on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BF%7D">. Given either a single data point <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20f_0"> or a sequence of independent variables <img src="https://latex.codecogs.com/png.latex?X%5E%7B(n)%7D%20=%20%5C%7BX_i%5C%7D_%7Bi=1%7D%5En"> with common probability density function <img src="https://latex.codecogs.com/png.latex?f_0">, the posterior distribution of <img src="https://latex.codecogs.com/png.latex?%5CPi"> given <img src="https://latex.codecogs.com/png.latex?X%5E%7B(n)%7D"> is the random quantity <img src="https://latex.codecogs.com/png.latex?%5CPi(%5Ccdot%20%5Cmid%20X%5E%7B(n)%7D)"> defined by <img src="https://latex.codecogs.com/png.latex?%20%5CPi%5Cleft(A%5Cmid%20X%5E%7B(n)%7D%5Cright)%20=%20%5Cint_A%20%5Cprod_%7Bi=1%7D%5En%20f(X_i)%20%5CPi(df)%5CBig/%20%5Cint_%7B%5Cmathbb%7BF%7D%7D%20%5Cprod_%7Bi=1%7D%5En%20f(X_i)%20%5CPi(df)"> and <img src="https://latex.codecogs.com/png.latex?%5CPi(%5Ccdot%20%5Cmid%20X)%20=%20%5CPi(%5Ccdot%20%5Cmid%20X%5E%7B(1)%7D)">. This may not always be well-defined, but I don’t want to get into technicalities for now.</p>
</section>
</section>
<section id="separation-alpha-entropy" class="level2">
<h2 class="anchored" data-anchor-id="separation-alpha-entropy">Separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy</h2>
<p>We state our concentration results in terms of the separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy. It is inspired by the Hausdorff <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy introduced in Xing et al.&nbsp;(2009), although the separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy has no relationship with the Hausdorff measure and instead builds upon the concept of <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separation of Choi et al.&nbsp;(2008) defined below.</p>
<p>Given a set <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20%5Cmathbb%7BF%7D">, we denote by <img src="https://latex.codecogs.com/png.latex?%5Clangle%20A%20%5Crangle"> the convex hull of <img src="https://latex.codecogs.com/png.latex?A">: it is the set of all densities of the form <img src="https://latex.codecogs.com/png.latex?%5Cint_A%20f%20%5C,%5Cnu(df)"> where <img src="https://latex.codecogs.com/png.latex?%5Cnu"> is a probability measure on <img src="https://latex.codecogs.com/png.latex?A">.</p>
<p><strong>Definition</strong> (<img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separation). Let <img src="https://latex.codecogs.com/png.latex?f_0%20%5Cin%20%5Cmathbb%7BF%7D"> be fixed as above. A set of densities <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20%5Cmathbb%7BF%7D"> is said to be <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated from <img src="https://latex.codecogs.com/png.latex?f%5E%5Cstar%20%5Cin%20%5Cmathbb%7BF%7D"> with respect to the divergence <img src="https://latex.codecogs.com/png.latex?d_%5Calpha%5E%7Bf_0%7D"> if for every <img src="https://latex.codecogs.com/png.latex?f%20%5Cin%20%5Clangle%20A%20%5Crangle">, <img src="https://latex.codecogs.com/png.latex?%20d_%5Calpha%5E%7Bf_0%7D%5Cleft(f,%20f%5E%5Cstar%5Cright)%20%5Cgeq%20%5Cdelta."> A collection of sets <img src="https://latex.codecogs.com/png.latex?%5C%7BA_i%5C%7D_%7Bi=1%7D%5E%5Cinfty"> is said to be <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated from <img src="https://latex.codecogs.com/png.latex?f_0"> if every <img src="https://latex.codecogs.com/png.latex?A%20%5Cin%20%5C%7BA_i%5C%7D_%7Bi=1%7D%5E%5Cinfty"> is <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated from <img src="https://latex.codecogs.com/png.latex?f_0">.</p>
<p>An important property of <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separation, first noted by Walker (2004) and used for the study of posterior consistency, is that it scales with product densities. The general result is stated in the following lemma.</p>
<p><strong>Lemma</strong> (Separation of product densities). <em>Let <img src="https://latex.codecogs.com/png.latex?(%5Cmathcal%7BX%7D_i,%20%5Cmathcal%7BB%7D_%7Bi%7D,%20%5Cmu_i)">, <img src="https://latex.codecogs.com/png.latex?i%20%5Cin%5C%7B%201,2,%20%5Cdots,%20n%5C%7D">, be a sequence of <img src="https://latex.codecogs.com/png.latex?%5Csigma">-finite measure spaces where each <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BX%7D_i"> is a complete and separable locally compact metric space and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BB%7D_i"> is the corresponding Borel <img src="https://latex.codecogs.com/png.latex?%5Csigma">-algebra. Denote by <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BF%7D_i"> the set of probability density functions on <img src="https://latex.codecogs.com/png.latex?(%5Cmathcal%7BX%7D_i,%20%5Cmathcal%7BB%7D_%7Bi%7D,%20%5Cmu_i)">, fix <img src="https://latex.codecogs.com/png.latex?f_%7B0,i%7D%20%5Cin%20%5Cmathbb%7BF%7D_i"> and let <img src="https://latex.codecogs.com/png.latex?A_i%20%5Csubset%20%5Cmathbb%7BF%7D_i"> be <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i">-separated from <img src="https://latex.codecogs.com/png.latex?f_%7Bi%7D%5E%5Cstar%20%5Cin%20%5Cmathbb%7BF%7D_i"> with respect to <img src="https://latex.codecogs.com/png.latex?d_%5Calpha%5E%7Bf_%7B0,i%7D%7D"> for some <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20%5Cgeq%200">. Let <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bi=1%7D%5En%20A_i%20=%20%5Cleft%5C%7B%5Cprod_%7Bi=1%7D%5En%20f_%7Bi%7D%20%5Cmid%20f_i%20%5Cin%20A_i%5Cright%5C%7D"> where <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bi=1%7D%5En%20f_i"> is the product density on <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bi=1%7D%5En%20%5Cmathcal%7BX%7D_i"> defined by <img src="https://latex.codecogs.com/png.latex?(x_1,%20%5Cdots,%20x_n)%20%5Cmapsto%20%5Cprod_%7Bi=1%7D%5En%20f_i(x_i)">. 
Then <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bi=1%7D%5EnA_i"> is <img src="https://latex.codecogs.com/png.latex?%5Cleft(%5Csum_%7Bi=1%7D%5En%5Cdelta_i%5Cright)">-separated from <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bi=1%7D%5En%20f_%7Bi%7D%5E%5Cstar"> with respect to <img src="https://latex.codecogs.com/png.latex?d_%5Calpha%5E%7Bf_0%7D"> where <img src="https://latex.codecogs.com/png.latex?f_0%20=%20%5Cprod_%7Bi=1%7D%5Enf_%7B0,i%7D">.</em></p>
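<p>To see where the sum comes from, it helps to work out the easy case of a single product density <img src="https://latex.codecogs.com/png.latex?%5Cprod_%7Bi=1%7D%5En%20f_i"> with each <img src="https://latex.codecogs.com/png.latex?f_i%20%5Cin%20A_i">. This is only a sketch: the lemma also covers mixtures in the convex hull, which this computation does not handle. Since the integrand is non-negative, Tonelli’s theorem gives <img src="https://latex.codecogs.com/png.latex?%20A_%5Calpha%5E%7Bf_0%7D%5Cleft(%5Cprod_%7Bi=1%7D%5En%20f_i,%20%5Cprod_%7Bi=1%7D%5En%20f_i%5E%5Cstar%5Cright)%20=%20%5Cprod_%7Bi=1%7D%5En%20%5Cint%20(f_i/f_i%5E%5Cstar)%5E%5Calpha%20f_%7B0,i%7D%20%5C,d%5Cmu_i%20=%20%5Cprod_%7Bi=1%7D%5En%20A_%5Calpha%5E%7Bf_%7B0,i%7D%7D(f_i,%20f_i%5E%5Cstar),"> and applying <img src="https://latex.codecogs.com/png.latex?-%5Calpha%5E%7B-1%7D%5Clog"> to both sides yields <img src="https://latex.codecogs.com/png.latex?%20d_%5Calpha%5E%7Bf_0%7D%5Cleft(%5Cprod_%7Bi=1%7D%5En%20f_i,%20%5Cprod_%7Bi=1%7D%5En%20f_i%5E%5Cstar%5Cright)%20=%20%5Csum_%7Bi=1%7D%5En%20d_%5Calpha%5E%7Bf_%7B0,i%7D%7D(f_i,%20f_i%5E%5Cstar)%20%5Cgeq%20%5Csum_%7Bi=1%7D%5En%20%5Cdelta_i."></p>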
<p>We can now define the <em>separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy</em> of a set <img src="https://latex.codecogs.com/png.latex?A%5Csubset%20%5Cmathbb%7BF%7D"> with parameter <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200"> as the minimal <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy of a <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated covering of <img src="https://latex.codecogs.com/png.latex?A">. When this entropy is finite, we can study the concentration properties of the posterior distribution using simple information-theoretic techniques similar to those used in Bhattacharya et al.&nbsp;(2019) for the study of Bayesian fractional posteriors.</p>
<p><strong>Definition</strong> (Separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy). Fix <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200">, <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20(0,1)"> and let <img src="https://latex.codecogs.com/png.latex?A"> be a subset of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BF%7D">. Recall <img src="https://latex.codecogs.com/png.latex?%5CPi">, <img src="https://latex.codecogs.com/png.latex?f_0"> and <img src="https://latex.codecogs.com/png.latex?f%5E%5Cstar"> fixed as previously. The <em>separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy</em> of <img src="https://latex.codecogs.com/png.latex?A"> is defined as <img src="https://latex.codecogs.com/png.latex?%20%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)%20=%20%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta;%20%5CPi,%20f_0,%20f%5E%5Cstar)%20=%20%5Cinf%20%5C,(1-%5Calpha)%5E%7B-1%7D%20%5Clog%20%5Csum_%7Bi=1%7D%5E%5Cinfty%20%5Cleft(%5Cfrac%7B%5CPi(A_i)%7D%7B%5CPi(A)%7D%5Cright)%5E%5Calpha"> where the infimum is taken over all (measurable) families <img src="https://latex.codecogs.com/png.latex?%5C%7BA_i%5C%7D_%7Bi=1%7D%5E%5Cinfty">, <img src="https://latex.codecogs.com/png.latex?A_i%20%5Csubset%20%5Cmathbb%7BF%7D">, satisfying <img src="https://latex.codecogs.com/png.latex?%5CPi(A%20%5Cbackslash%20(%5Ccup_%7Bi%7DA_i))%20=%200"> and which are <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated from <img src="https://latex.codecogs.com/png.latex?f_0"> with respect to the divergence <img src="https://latex.codecogs.com/png.latex?d_%5Calpha%5E%7Bf_0%7D">. 
When no such covering exists we let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha(A,%20%5Cdelta)%20=%20%5Cinfty">, and when <img src="https://latex.codecogs.com/png.latex?%5CPi(A)%20=%200"> we define <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha(A,%20%5Cdelta)%20=%200">.</p>
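<p>As a quick worked example (an illustration under simplifying assumptions, not a result from the references): suppose <img src="https://latex.codecogs.com/png.latex?%5CPi(A)%20%3E%200"> and that <img src="https://latex.codecogs.com/png.latex?A"> is covered by <img src="https://latex.codecogs.com/png.latex?N"> disjoint sets <img src="https://latex.codecogs.com/png.latex?A_1,%20%5Cdots,%20A_N"> that are <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated in the sense of the definition and have equal prior mass <img src="https://latex.codecogs.com/png.latex?%5CPi(A_i)%20=%20%5CPi(A)/N">, the remaining sets of the family being empty. The quantity inside the infimum is then <img src="https://latex.codecogs.com/png.latex?%20(1-%5Calpha)%5E%7B-1%7D%5Clog%5Csum_%7Bi=1%7D%5EN%20N%5E%7B-%5Calpha%7D%20=%20(1-%5Calpha)%5E%7B-1%7D%5Clog%20N%5E%7B1-%5Calpha%7D%20=%20%5Clog%20N,"> so that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)%20%5Cleq%20%5Clog%20N">, much like the logarithm of a covering number. With <img src="https://latex.codecogs.com/png.latex?N%20=%201">, that is, when <img src="https://latex.codecogs.com/png.latex?A"> itself is <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated, the bound is zero.</p>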
<p><strong>Remark.</strong> <em>When <img src="https://latex.codecogs.com/png.latex?f_0%20=%20f%5E%5Cstar">, so that <img src="https://latex.codecogs.com/png.latex?d_%5Calpha%5E%7Bf_0%7D(f,%20f%5E%5Cstar)%20=%20d_%5Calpha(f,%20f_0)">, we drop the indicator <img src="https://latex.codecogs.com/png.latex?%5Cstar"> and write <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha(A,%20%5Cdelta)%20=%20%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)"> to emphasize the fact.</em></p>
<p><strong>Proposition</strong> (Properties of the separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy). <em>The separation <img src="https://latex.codecogs.com/png.latex?%5Calpha">-entropy <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)"> of a set <img src="https://latex.codecogs.com/png.latex?A%5Csubset%20%5Cmathbb%7BF%7D"> is non-negative and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha(A,%20%5Cdelta)%20=%200"> if <img src="https://latex.codecogs.com/png.latex?A"> is <img src="https://latex.codecogs.com/png.latex?%5Cdelta">-separated from <img src="https://latex.codecogs.com/png.latex?f%5E%5Cstar"> with respect to the divergence <img src="https://latex.codecogs.com/png.latex?d_%5Calpha%5E%7Bf_0%7D">. Furthermore, if <img src="https://latex.codecogs.com/png.latex?0%20%3C%20%5Calpha%20%5Cleq%20%5Cbeta%20%3C%201"> and <img src="https://latex.codecogs.com/png.latex?0%20%3C%20%5Cdelta%20%5Cleq%20%5Cdelta'">, then</em><br>
<img src="https://latex.codecogs.com/png.latex?%20%7B%7D%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)%20%5Cleq%20%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta')"> <em>and if also <img src="https://latex.codecogs.com/png.latex?f%5E%5Cstar%20=%20f_0">, then</em><br>
<img src="https://latex.codecogs.com/png.latex?%20%7B%7D%5Cmathcal%7BS%7D_%5Cbeta(A,%20%5Ctfrac%7B1-%5Cbeta%7D%7B%5Cbeta%7D%5Cdelta)%20%5Cleq%20%5Cmathcal%7BS%7D_%5Calpha(A,%20%5Ctfrac%7B1-%5Calpha%7D%7B%5Calpha%7D%5Cdelta)."> <em>For a subset <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20B%20%5Csubset%20%5Cmathbb%7BF%7D"> with <img src="https://latex.codecogs.com/png.latex?%5CPi(A)%20%3E%200">, we have</em></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="img"><img src="https://olivierbinette.ca/pages/posts/2020-11-15-posterior-concentration-in-terms-of-the-separation-alpha-entropy/img.png" class="img-fluid figure-img" alt="img"></a></p>
<figcaption>img</figcaption>
</figure>
</div>
<p><em>and, more generally, if <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20%5Cbigcup_%7Bn=1%7D%5E%5Cinfty%20B_n"> for subsets <img src="https://latex.codecogs.com/png.latex?B_n%20%5Csubset%20%5Cmathbb%7BF%7D">, then</em> <img src="https://latex.codecogs.com/png.latex?%20%7B%7D%5CPi(A)%5E%7B%5Calpha%7D%5Cleft(%5Cexp%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)%5Cright)%5E%7B1-%5Calpha%7D%20%20%20%20%20%20%20%20%20%5Cleq%20%5Csum_%7Bn=1%7D%5E%5Cinfty%20%5CPi(B_n)%5E%5Calpha%20%5Cleft(%5Cexp%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(B_n,%20%5Cdelta)%5Cright)%5E%7B1-%5Calpha%7D."></p>
</section>
<section id="posterior-consistency" class="level2">
<h2 class="anchored" data-anchor-id="posterior-consistency">Posterior consistency</h2>
<p><strong>Theorem</strong> (Posterior consistency). <em>Let <img src="https://latex.codecogs.com/png.latex?f_0,%20f%5E%5Cstar%20%5Cin%20%5Cmathbb%7BF%7D"> and let <img src="https://latex.codecogs.com/png.latex?%5C%7BX_i%5C%7D"> be a sequence of independent random variables with common probability density <img src="https://latex.codecogs.com/png.latex?f_0">. Suppose there exists <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200"> such that</em> <img src="https://latex.codecogs.com/png.latex?%20%5CPi%5Cleft(%5C%7Bf%20%5Cin%20%5Cmathbb%7BF%7D%20%5Cmid%20D(f_0%7C%20f)%20%3C%20%5Cdelta%5C%7D%5Cright)%20%3E%200."> <em>If <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20%5Cmathbb%7BF%7D"> satisfies <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar%5Cleft(A,%20%5Cdelta%5Cright)%20%3C%20%5Cinfty"> for some <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20(0,1)">, then <img src="https://latex.codecogs.com/png.latex?%5CPi%5Cleft(A%5Cmid%20%5C%7BX_i%5C%7D_%7Bi=1%7D%5En%5Cright)%20%5Crightarrow%200"> almost surely as <img src="https://latex.codecogs.com/png.latex?n%5Crightarrow%20%5Cinfty">.</em></p>
<p><strong>Remark.</strong> <em>The condition <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Cdelta)%20%3C%20%5Cinfty"> implies in particular that <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20%5C%7Bf%5Cin%20%5Cmathbb%7BF%7D%20%5Cmid%20d_%5Calpha%5E%7Bf_0%7D(f,%20f%5E%5Cstar)%20%5Cgeq%20%5Cdelta%5C%7D">.</em></p>
<p><strong>Corollary</strong> (Well-specified consistency). <em>Suppose that <img src="https://latex.codecogs.com/png.latex?f_0"> is in the Kullback-Leibler support of <img src="https://latex.codecogs.com/png.latex?%5CPi">. If <img src="https://latex.codecogs.com/png.latex?A%20%5Csubset%20%5Cmathbb%7BF%7D"> satisfies <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D_%5Calpha(A,%20%5Cdelta)%20%3C%20%5Cinfty"> for some <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20(0,1)"> and for some <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200">, then <img src="https://latex.codecogs.com/png.latex?%5CPi_n(A)%20%5Crightarrow%200"> almost surely as <img src="https://latex.codecogs.com/png.latex?n%20%5Crightarrow%20%5Cinfty">.</em></p>
<p><strong>Corollary</strong> (Well-specified Hellinger consistency). <em>Suppose that <img src="https://latex.codecogs.com/png.latex?f_0"> is in the Kullback-Leibler support of <img src="https://latex.codecogs.com/png.latex?%5CPi"> and fix <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%20%3E%200">. If there exists a covering <img src="https://latex.codecogs.com/png.latex?%5C%7BA_i%5C%7D_%7Bi=1%7D%5E%5Cinfty"> of <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BF%7D"> by Hellinger balls of diameter at most <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3C%20%5Cvarepsilon"> satisfying <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5E%5Cinfty%20%5CPi(A_i)%5E%5Calpha%20%3C%20%5Cinfty"> for some <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20(0,1)">, then <img src="https://latex.codecogs.com/png.latex?%5CPi_n%5Cleft(%5Cleft%5C%7Bf%20%5Cin%20%5Cmathbb%7BF%7D%20%5Cmid%20H(f,%20f_0)%20%5Cgeq%20%5Cvarepsilon%20%5Cright%5C%7D%5Cright)%20%5Crightarrow%200"> almost surely as <img src="https://latex.codecogs.com/png.latex?n%5Crightarrow%20%5Cinfty">.</em></p>
</section>
<section id="posterior-concentration" class="level2">
<h2 class="anchored" data-anchor-id="posterior-concentration">Posterior concentration</h2>
<p>Following Kleijn et al.&nbsp;(2006) and Bhattacharya et al.&nbsp;(2019), we let <img src="https://latex.codecogs.com/png.latex?%20B(%5Cdelta,%20f%5E%5Cstar;f_0)%20=%20%5Cleft%5C%7B%20f%5Cin%20%5Cmathbb%7BF%7D%20%5Cmid%20%5Cint%20%5Clog%20%5Cleft(%5Cfrac%7Bf%7D%7Bf%5E%5Cstar%7D%5Cright)%20f_0%20%5C,d%5Cmu%20%5Cleq%20%5Cdelta,%5C,%20%5Cint%20%5Cleft(%5Clog%20%5Cleft(%5Cfrac%7Bf%7D%7Bf%5E%5Cstar%7D%5Cright)%5Cright)%5E2%20f_0%20%5C,d%5Cmu%20%5Cleq%20%5Cdelta%20%5Cright%5C%7D"> be a Kullback-Leibler type neighborhood of <img src="https://latex.codecogs.com/png.latex?f%5E%5Cstar"> (relative to <img src="https://latex.codecogs.com/png.latex?f_0">) where the second moment of the log likelihood ratio <img src="https://latex.codecogs.com/png.latex?%5Clog(f/f%5E%5Cstar)"> is also controlled.</p>
<p><strong>Theorem</strong> (Posterior concentration bound). <em>Let <img src="https://latex.codecogs.com/png.latex?f_0,%20f%5E%5Cstar%20%5Cin%20%5Cmathbb%7BF%7D"> and let <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20f_0">. For any <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200"> and <img src="https://latex.codecogs.com/png.latex?%5Ckappa%20%3E%201"> we have that</em> <img src="https://latex.codecogs.com/png.latex?%20%5Clog%5CPi(A%20%5Cmid%20X)%20%5Cleq%20%5Cfrac%7B1-%5Calpha%7D%7B%5Calpha%7D%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Ckappa%20%5Cdelta)-%20%5Clog%5CPi(B(%5Cdelta,%20f%5E%5Cstar;f_0))%20-%20%5Ckappa%5Cdelta"> <em>holds with probability at least <img src="https://latex.codecogs.com/png.latex?1-8/(%5Calpha%5E2%5Cdelta)">.</em></p>
<p><strong>Corollary</strong> (Posterior concentration bound, i.i.d. case). <em>Let <img src="https://latex.codecogs.com/png.latex?f_0,%20f%5E%5Cstar%20%5Cin%20%5Cmathbb%7BF%7D"> and let <img src="https://latex.codecogs.com/png.latex?%5C%7BX_i%5C%7D"> be a sequence of independent random variables with common probability density <img src="https://latex.codecogs.com/png.latex?f_0">. For any <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20%3E%200"> and <img src="https://latex.codecogs.com/png.latex?%5Ckappa%20%3E%201"> we have that</em> <img src="https://latex.codecogs.com/png.latex?%20%5Clog%5CPi%5Cleft(A%20%5Cmid%20%5C%7BX_i%5C%7D_%7Bi=1%7D%5En%5Cright)%20%5Cleq%20%5Cfrac%7B1-%5Calpha%7D%7B%5Calpha%7D%5Cmathcal%7BS%7D_%5Calpha%5E%5Cstar(A,%20%5Ckappa%20%5Cdelta)-%20%5Clog%5CPi(B(%5Cdelta,%20f%5E%5Cstar;f_0))%20-%20n%5Ckappa%5Cdelta"> <em>holds with probability at least <img src="https://latex.codecogs.com/png.latex?1-8/(%5Calpha%5E2%20n%20%5Cdelta)">.</em></p>
<p><strong>References</strong></p>
<ul>
<li>Bhattacharya, A., D. Pati, and Y. Yang (2019). Bayesian fractional posteriors. Ann. Statist. 47(1), 39–66.</li>
<li>Choi, T. and R. V. Ramamoorthi (2008). Remarks on consistency of posterior distributions, Volume 3, pp.&nbsp;170–186. Beachwood, Ohio, USA: Institute of Mathematical Statistics.</li>
<li>De Blasi, P. and S. G. Walker (2013). Bayesian asymptotics with misspecified models. Statistica Sinica, 169–187.</li>
<li>Grünwald, P. and T. van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12(4), 1069–1103.</li>
<li>Kleijn, B. J., A. W. van der Vaart, et al.&nbsp;(2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34(2), 837–877.</li>
<li>Ramamoorthi, R. V., K. Sriram, and R. Martin (2015). On posterior concentration in misspecified models. Bayesian Anal. 10(4), 759–789.</li>
<li>Walker, S. (2004). New approaches to Bayesian consistency. Ann. Statist. 32(5), 2028–2043.</li>
<li>Xing, Y. and B. Ranneby (2009). Sufficient conditions for Bayesian consistency. J. Stat. Plan. Inference 139(7), 2479–2489.</li>
</ul>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>math</category>
  <category>statistics</category>
  <guid>https://olivierbinette.ca/pages/posts/2020-11-15-posterior-concentration-in-terms-of-the-separation-alpha-entropy/posterior-concentration-in-terms-of-the-separation-alpha-entropy.html</guid>
  <pubDate>Fri, 11 Oct 2019 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Theory of Gibbs posterior concentration</title>
  <dc:creator>Olivier Binette</dc:creator>
  <link>https://olivierbinette.ca/pages/posts/2020-11-15-theory-of-gibbs-posterior-concentration/theory-of-gibbs-posterior-concentration.html</link>
  <description><![CDATA[ 





<p>Consider the statistical learning framework where we have data <img src="https://latex.codecogs.com/png.latex?X%5Csim%20Q"> for some unknown distribution <img src="https://latex.codecogs.com/png.latex?Q">, a model <img src="https://latex.codecogs.com/png.latex?%5CTheta"> and a loss function <img src="https://latex.codecogs.com/png.latex?%5Cell_%5Ctheta(X)"> measuring a cost associated with fitting the data <img src="https://latex.codecogs.com/png.latex?X"> using a particular <img src="https://latex.codecogs.com/png.latex?%5Ctheta%5Cin%5CTheta">. Our goal is to use the data to learn about parameters which minimize the risk <img src="https://latex.codecogs.com/png.latex?R(%5Ctheta)%20=%20%5Cmathbb%7BE%7D%5B%5Cell_%5Ctheta(X)%5D">. Here are two standard examples.</p>
<p><strong>Density estimation.</strong> Suppose we observe independent random variables <img src="https://latex.codecogs.com/png.latex?X_1,%20X_2,%20%5Cdots,%20X_n">. Here the model <img src="https://latex.codecogs.com/png.latex?%5CTheta"> parametrizes a set <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D%20=%20%5C%7Bp_%5Ctheta%20:%20%5Ctheta%20%5Cin%20%5CTheta%20%5C%7D"> of probability density functions (with respect to some dominating measure on the sample space), and our loss for <img src="https://latex.codecogs.com/png.latex?X%20=%20(X_1,%20%5Cdots,%20X_n)"> is defined as <img src="https://latex.codecogs.com/png.latex?%0A%5Cell_%5Ctheta(X)%20=%20-%20%5Csum_%7Bi=1%7D%5En%20%5Clog%20p_%5Ctheta(X_i).%0A"> If, for instance, the variables <img src="https://latex.codecogs.com/png.latex?X_i"> are independent with common distribution with density function <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ctheta_0%7D"> for some <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0%20%5Cin%20%5Cmathbb%7B%5CTheta%7D">, then it follows from the positivity of the Kullback-Leibler divergence that <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0%20%5Cin%20%5Carg%5Cmin%20_%20%5Ctheta%20%5Cmathbb%7BE%7D%5B%5Cell%20_%20%5Ctheta(X)%5D">. That is, under identifiability conditions, our learning target is the true data-generating distribution.</p>
<p>If the model is misspecified, roughly meaning that there is no <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0%5Cin%20%5CTheta"> such that <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ctheta_0%7D"> is a density of <img src="https://latex.codecogs.com/png.latex?X_i">, then our framework sets up the learning problem to be about the parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta_0"> which is such that <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ctheta_0%7D"> minimizes the Kullback-Leibler divergence between <img src="https://latex.codecogs.com/png.latex?p_%7B%5Ctheta_0%7D"> and the true marginal distribution of the <img src="https://latex.codecogs.com/png.latex?X_i">’s.</p>
<p><strong>Regression.</strong> Here our observations take the form <img src="https://latex.codecogs.com/png.latex?(Y_i,%20X_i)">, the model <img src="https://latex.codecogs.com/png.latex?%5CTheta"> parameterizes regression functions <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta"> and we can consider a sum of squared errors loss <img src="https://latex.codecogs.com/png.latex?%0A%5Cell_%5Ctheta(X)%20=%20%5Csum_%7Bi=1%7D%5En(Y_i%20-%20f_%5Ctheta(X_i))%5E2.%0A"></p>
<section id="gibbs-posterior-distributions" class="level3">
<h3 class="anchored" data-anchor-id="gibbs-posterior-distributions">Gibbs posterior distributions</h3>
<p><strong>Gibbs Learning</strong> approaches this problem from a pseudo Bayesian point of view. While typically a Bayesian approach would require the specification of a full data-generating model, here we replace the likelihood function by the <em>pseudo-likelihood</em> function <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cmapsto%20e%5E%7B-%5Cell_%5Ctheta(X)%7D">. Given a prior <img src="https://latex.codecogs.com/png.latex?%5Cpi"> on <img src="https://latex.codecogs.com/png.latex?%5CTheta">, the Gibbs posterior distribution is then given by <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(%5Ctheta%20%5Cmid%20X)%20%5Cpropto%20e%5E%7B-%5Cell_%5Ctheta(X)%7D%20%5Cpi(%5Ctheta)%0A"> and satisfies <img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(%5Ccdot%20%5Cmid%20X)%20%5Cin%20%5Ctext%7Bargmin%7D_%7B%5Chat%20%5Cpi%7D%20%5Cleft%5C%7B%20%5Cmathbb%7BE%7D_%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%7D%5B%5Cell_%5Ctheta(X)%5D%20+%20D_%7B%5Ctext%7BKL%7D%7D(%5Chat%20%5Cpi%20%5Cmid%20%5Cpi)%20%5Cright%5C%7D%0A"> whenever these expressions are well defined.</p>
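<p>To make the definition concrete, here is a minimal sketch of a Gibbs posterior computed on a finite grid of parameter values. Everything specific in it is an assumption made for illustration: the function name <code>gibbs_posterior</code>, the uniform prior on a grid, the simulated Gaussian data and the absolute-error loss (whose risk minimizer is a median of the data distribution).</p>

```python
import math
import random


def gibbs_posterior(grid, prior, loss, data):
    """Gibbs posterior weights on a finite grid: prior(theta) * exp(-sum_i loss(theta, x_i))."""
    total_loss = [sum(loss(theta, x) for x in data) for theta in grid]
    shift = min(total_loss)  # subtract the smallest loss so the exponentials do not underflow
    weights = [p * math.exp(-(l - shift)) for p, l in zip(prior, total_loss)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical setup: simulated data, uniform prior on a grid over [-3, 3].
random.seed(0)
data = [random.gauss(1.0, 1.0) for _ in range(50)]
grid = [i / 100 for i in range(-300, 301)]
prior = [1 / len(grid)] * len(grid)


def absolute_loss(theta, x):
    return abs(x - theta)


posterior = gibbs_posterior(grid, prior, absolute_loss, data)

# With a uniform prior, the posterior mode is a grid point minimizing the empirical loss.
mode = grid[max(range(len(grid)), key=posterior.__getitem__)]
print("Gibbs posterior mode:", mode)
```

<p>No likelihood is specified anywhere in this sketch: only the loss and the prior enter the computation, which is the point of the pseudo-likelihood construction.</p>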
<p>In the context of integrable pseudo-likelihoods, the above can be re-interpreted as a regular posterior distribution built from density functions <img src="https://latex.codecogs.com/png.latex?f%20_%20%5Ctheta(x)%20%5Cpropto%20e%5E%7B-%5Cell%20_%20%5Ctheta(x)%7D"> and with a prior <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Cpi"> satisfying <img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7Bd%5Ctilde%20%5Cpi%7D%7Bd%5Cpi%7D(%5Ctheta)%20%5Cpropto%20%5Cint%20e%5E%7B-%5Cell_%5Ctheta(x)%7D%5C,dx%20=:%20c(%5Ctheta).%0A"> However, the reason we cannot apply standard asymptotic theory to the analysis of Gibbs posteriors is that the quantity <img src="https://latex.codecogs.com/png.latex?c(%5Ctheta)"> will typically be sample-size dependent. That is, if <img src="https://latex.codecogs.com/png.latex?X=X%5En=(X_1,%20X_2,%20%5Cdots,%20X_n)"> for i.i.d. random variables <img src="https://latex.codecogs.com/png.latex?X_i"> and if the loss <img src="https://latex.codecogs.com/png.latex?%5Cell_%5Ctheta"> separates as the sum <img src="https://latex.codecogs.com/png.latex?%0A%5Cell_%5Ctheta(X)%20=%20%5Csum_%7Bi=1%7D%5Enl_%7B%5Ctheta%7D(X_i),%0A"> then <img src="https://latex.codecogs.com/png.latex?c(%5Ctheta)%20=%20%5Cleft(%5Cint%20e%5E%7B-l_%5Ctheta(x_1)%7D%20%5C,%20dx_1%5Cright)%5En">. This sample-size-dependent prior, tilting <img src="https://latex.codecogs.com/png.latex?%5Cpi"> by the function <img src="https://latex.codecogs.com/png.latex?c(%5Ctheta)">, is what allows Gibbs learning to target general risk-minimizing parameters rather than likelihood Kullback-Leibler minimizers.</p>
<p>Some of my ongoing research, presented as a poster at the O’Bayes conference in Warwick last summer, focused on understanding the theoretical behaviour of Gibbs posterior distributions. I studied the posterior convergence and finite sample concentration properties of Gibbs posterior distributions under the large sample regime with additive losses <img src="https://latex.codecogs.com/png.latex?%5Cell_%5Ctheta%5E%7B(n)%7D(X_1,%20%5Cdots,%20X_n)%20=%20%5Csum_%7Bi=1%7D%5En%5Cell_%5Ctheta(X_i)">. I’ve attached <a href="http://olivierbinette.github.io/blog/media/2019-10-11/poster.pdf">the poster</a> (joint work with Yu Luo) below and you can find the additional references <a href="http://olivierbinette.github.io/blog/media/2019-10-11/references.pdf">here</a>.</p>
<p>Note that this is very preliminary work. We’re still in the process of exploring interesting directions (and I have very limited time this semester with the beginning of my PhD at Duke).</p>
<p><a href="poster.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://olivierbinette.ca/pages/posts/2020-11-15-theory-of-gibbs-posterior-concentration/poster.png" class="img-fluid"></a></p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>math</category>
  <category>statistics</category>
  <guid>https://olivierbinette.ca/pages/posts/2020-11-15-theory-of-gibbs-posterior-concentration/theory-of-gibbs-posterior-concentration.html</guid>
  <pubDate>Fri, 11 Oct 2019 04:00:00 GMT</pubDate>
</item>
<item>
  <title>The Credibility of confidence intervals</title>
  <link>https://olivierbinette.ca/pages/posts/2019-09-11-credibility-confidence-intervals/2019-09-11-credibility-confidence-intervals.html</link>
  <description><![CDATA[ 





<p>Andrew Gelman and Sander Greenland went “head to head” in a <a href="https://www.bmj.com/content/366/bmj.l5381">discussion on the interpretation of confidence intervals in <em>The BMJ</em></a>. Greenland stated the following, which doesn’t seem quite right to me.</p>
<blockquote class="blockquote">
<p>The label “95% confidence interval” evokes the idea that we should invest the interval with 95/5 (19:1) betting odds that the observed interval contains the true value (which would make the confidence interval a 95% bayesian posterior interval<img src="https://latex.codecogs.com/png.latex?%5E%7B11%7D">). This view may be harmless in a perfect randomized experiment with no background information to inform the bet (the original setting for the “confidence” concept); more often, however […]</p>
</blockquote>
<p>It’s not true that “this view may be harmless in perfect randomized experiments”, and I’m not sure where this “original setting of the confidence concept” is coming from. In fact, even in the simplest possible cases, the posterior probability of a <img src="https://latex.codecogs.com/png.latex?95%5C%25"> confidence interval can be pretty much anything.</p>
<p>Imagine a “perfect randomized experiment”, where we use a test of the hypothesis <img src="https://latex.codecogs.com/png.latex?H_0:%20%5Cmu%20=%200"> which, for some reason, has zero power. If <img src="https://latex.codecogs.com/png.latex?p%20%3C%200.05">, meaning that the associated confidence interval excludes <img src="https://latex.codecogs.com/png.latex?0">, then we are certain that <img src="https://latex.codecogs.com/png.latex?H_0"> holds and the posterior probability of the confidence interval is zero.</p>
<p>Let this sink in. For some (albeit trivial) statistical tests, observing <img src="https://latex.codecogs.com/png.latex?p%20%3C%200.05"> brings evidence <em>in favor</em> of the null.</p>
<p>The power of the test carries information, and the posterior probability of a confidence interval (or of a hypothesis) depends on this power, among other things, even in perfect randomized experiments.</p>
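<p>The dependence on power can be sketched with Bayes’ rule. The setup below is an assumption made for illustration and is not part of the discussion in <em>The BMJ</em>: a point null against a single alternative, a prior probability <code>prior_null</code> for the null, and a test that rejects with probability <code>level</code> under the null and <code>power</code> under the alternative. The function name is hypothetical.</p>

```python
def posterior_prob_null(prior_null, level, power):
    """P(H0 | the test rejects), by Bayes' rule, for a point null against a single alternative."""
    reject_and_null = level * prior_null
    reject_and_alt = power * (1.0 - prior_null)
    return reject_and_null / (reject_and_null + reject_and_alt)


# Same level and same prior, different powers: the evidence carried by "p below 0.05" changes.
for power in (0.0, 0.05, 0.5, 0.8, 1.0):
    print(power, posterior_prob_null(prior_null=0.5, level=0.05, power=power))
```

<p>With zero power, a rejection can only happen under the null, so the posterior probability of the null is one. When the power equals the level, a rejection carries no information and the posterior probability equals the prior probability.</p>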



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>statistics</category>
  <guid>https://olivierbinette.ca/pages/posts/2019-09-11-credibility-confidence-intervals/2019-09-11-credibility-confidence-intervals.html</guid>
  <pubDate>Wed, 11 Sep 2019 04:00:00 GMT</pubDate>
</item>
<item>
  <title>Bayesian Optimalities</title>
  <dc:creator>Olivier Binette</dc:creator>
  <link>https://olivierbinette.ca/pages/posts/2020-11-15-bayesian-optimalities/bayesian-optimalities.html</link>
  <description><![CDATA[ 





<section id="point-estimation-and-minimal-expected-risk" class="level2">
<h2 class="anchored" data-anchor-id="point-estimation-and-minimal-expected-risk">1. Point estimation and minimal expected risk</h2>
<p>This first section is not directly about properties of the posterior distribution, but it is rather concerned with some summaries of the posterior which have nice statistical properties in different contexts.</p>
<section id="squared-error-loss" class="level3">
<h3 class="anchored" data-anchor-id="squared-error-loss">Squared error loss</h3>
<p>Suppose <img src="https://latex.codecogs.com/png.latex?%5Cpi"> is a prior on a <strong>Euclidean</strong> parameter space <img src="https://latex.codecogs.com/png.latex?%5CTheta%20%5Csubset%20%5Cmathbb%7BR%7D%5Ed"> with norm <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Ctheta%5C%7C%5E2%20=%20%5Ctheta%5ET%20%5Ctheta"> defined through the dot product. Given a likelihood <img src="https://latex.codecogs.com/png.latex?p%20_%20%5Ctheta(X)"> for data <img src="https://latex.codecogs.com/png.latex?X">, the posterior distribution is defined as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi(A%20%5Cmid%20X)%20%5Cpropto%20%5Cint%20_%20A%20p%20_%20%5Ctheta(X)%20%5Cpi(d%5Ctheta)%0A"></p>
<p>and the mean of the posterior distribution is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Ctheta%20_%20%7B%5Cpi%7D%20=%20%5Cint%20%5Ctheta%20%5C,%5Cpi(d%5Ctheta%20%5Cmid%20X)%20=%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Cpi%7D%5B%5Ctheta%20%5Cmid%20X%5D.%0A"></p>
<p>If we define the <em>risk</em> of an estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta"> for the estimation of a parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20_%200"> as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AR(%5Chat%20%5Ctheta;%20%5Ctheta%20_%200)%20=%20%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20p%20_%20%7B%5Ctheta%20_%200%7D%7D%5B%5C%7C%5Ctheta%20_%200%20-%20%5Chat%20%5Ctheta(X)%5C%7C%5E2%5D,%0A"></p>
<p>and if</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AB%20_%20%5Cpi(%5Chat%20%5Ctheta)%20=%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi%7D%5BR(%5Chat%20%5Ctheta;%20%5Ctheta%20_%200)%5D%0A"></p>
<p>is the expected risk of <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta"> with respect to the prior <img src="https://latex.codecogs.com/png.latex?%5Cpi"> (also called the <strong>Bayes risk</strong>), then we have that the posterior mean estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta%20_%20%7B%5Cpi%7D"> satisfies</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AB%20_%20%5Cpi(%5Chat%20%5Ctheta)%20%5Cgeq%20B%20_%20%5Cpi(%5Chat%20%5Ctheta%20_%20%5Cpi)%0A"></p>
<p>for any estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta">. That is, the posterior mean estimate minimizes the expected risk.</p>
<p>The proof follows from the fact that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5C%7C%20%5Ctheta%20_%200%20-%20%5Chat%20%5Ctheta(X)%20%5C%7C%5E2%20%5Cgeq%20%5C%7C%5Ctheta%20_%200%20-%20%5Chat%20%5Ctheta%20_%20%5Cpi(X)%20%5C%7C%5E2%20+%202%5Clangle%20%5Ctheta%20_%200%20-%20%5Chat%20%5Ctheta%20_%20%5Cpi(X),%20%5Chat%20%5Ctheta%20_%20%5Cpi(X)%20-%20%5Chat%20%5Ctheta(X)%5Crangle.%0A"></p>
<p>Writing the expected risk as an expected posterior loss, i.e.&nbsp;using the fact that</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi%7D%5Cleft%5B%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20p%20_%20%7B%5Ctheta%20_%200%7D%7D%5B%5C,%5Ccdot%5C,%5D%5Cright%5D%20=%20%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20m%7D%5Cleft%5B%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X)%7D%5B%5C,%5Ccdot%5C,%5D%5Cright%5D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?m"> has density <img src="https://latex.codecogs.com/png.latex?m(x)%20=%20%5Cint%20p%20_%20%5Ctheta(x)%20%5Cpi(%5Ctheta)%5C,d%5Ctheta">, and since</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X)%7D%5Cleft%5B%5Clangle%20%5Ctheta%20_%200%20-%20%5Chat%20%5Ctheta%20_%20%5Cpi(X),%20%5Chat%20%5Ctheta%20_%20%5Cpi(X)%20-%20%5Chat%20%5Ctheta(X)%5Crangle%5Cright%5D%20=%200,%0A"></p>
<p>we obtain the result.</p>
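<p>As a sanity check of this result, here is a small sketch that computes Bayes risks exactly in a Beta-Binomial model. The model, the prior parameters <code>a</code> and <code>b</code>, the sample size and the function names are all assumptions chosen for illustration. The computation uses the identity from the proof: the Bayes risk is the average, over the marginal distribution of the data, of the posterior variance plus the squared distance between the estimate and the posterior mean.</p>

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bayes_risk(estimator, n, a, b):
    """Exact Bayes risk under squared error for X ~ Binomial(n, theta), theta ~ Beta(a, b)."""
    total = 0.0
    for x in range(n + 1):
        # Marginal (beta-binomial) probability of observing x.
        log_m = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                 + log_beta(a + x, b + n - x) - log_beta(a, b))
        # The posterior is Beta(a + x, b + n - x).
        a_post, b_post = a + x, b + n - x
        mean = a_post / (a_post + b_post)
        var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
        # Posterior expected loss = posterior variance + squared gap to the posterior mean.
        total += math.exp(log_m) * (var + (mean - estimator(x)) ** 2)
    return total


n, a, b = 10, 1.0, 1.0  # hypothetical choices: uniform prior, ten trials


def posterior_mean(x):
    return (a + x) / (a + b + n)


def mle(x):
    return x / n


risk_posterior_mean = bayes_risk(posterior_mean, n, a, b)
risk_mle = bayes_risk(mle, n, a, b)
print(risk_posterior_mean, risk_mle)
```

<p>The posterior mean has the smaller Bayes risk of the two, as the argument above guarantees for any competing estimator.</p>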
<p>A few remarks:</p>
<ol type="1">
<li>The expected risk has stability properties. If <img src="https://latex.codecogs.com/png.latex?%5Ctilde%20%5Cpi"> and <img src="https://latex.codecogs.com/png.latex?%5Cpi"> are two priors that are absolutely continuous with respect to each other, and if <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Clog%20%5Cfrac%7Bd%5Ctilde%20%5Cpi%7D%7Bd%5Cpi%7D%5C%7C%20_%20%5Cinfty%20%5Cleq%20C">, then</li>
</ol>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20e%5E%7B-C%7DB%20_%20%5Cpi(%5Chat%5Ctheta)%20%5Cleq%20B%20_%20%7B%5Ctilde%20%5Cpi%7D(%5Chat%20%5Ctheta)%20%5Cleq%20e%5EC%20B%20_%20%7B%5Cpi%7D(%5Chat%20%5Ctheta).%0A"></p>
<p>If the risk <img src="https://latex.codecogs.com/png.latex?R(%5Chat%20%5Ctheta;%20%5Ctheta%20_%200)"> is uniformly bounded by some constant <img src="https://latex.codecogs.com/png.latex?M"> over <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20_%200%5Cin%20%5CTheta">, then</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20B%20_%20%7B%5Ctilde%20%5Cpi%7D(%5Chat%20%5Ctheta)%20%5Cleq%20%5Csqrt%7BM%20B%20_%20%7B%5Cpi%7D(%5Chat%20%5Ctheta)%7D%20%5Cleft%5C%7Cd%5Ctilde%5Cpi/d%5Cpi%5Cright%5C%7C%20_%20%7BL%5E2(%5Cpi)%7D.%0A"></p>
<p>This shows how small changes in the prior do not result in a dramatic change in the expected loss of an estimator, as long as the priors have “compatible tails” (i.e.&nbsp;a manageable likelihood ratio).</p>
<ol start="2" type="1">
<li><p>It is sometimes advocated to choose the prior <img src="https://latex.codecogs.com/png.latex?%5Cpi"> so that the risk <img src="https://latex.codecogs.com/png.latex?R(%5Chat%20%5Ctheta%20_%20%5Cpi;%20%5Ctheta%20_%200)"> is constant over <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20_%200">: the resulting estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta%20_%20%5Cpi"> is then agnostic, from a risk point of view, to <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20_%200">. This may result in a sample-size dependent prior (which is arguably not in the Bayesian spirit), but the fun thing is that it makes the expected risk <em>maximal</em> and the Bayes estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta%20_%20%5Cpi"> minimax: <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta%20%20_%20%20%5Cpi%20%5Cin%20%5Carg%5Cmin%20%20_%20%20%20%7B%5Chat%5Ctheta%7D%20%5Csup%20%20_%20%20%20%7B%5Ctheta%20%20_%20%200%7DR(%5Chat%20%5Ctheta;%5Ctheta%20%20_%20%200)">. Indeed, in that case we have for any estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta"> that <img src="https://latex.codecogs.com/png.latex?%5Csup%20%20%20_%20%20%20%7B%5Ctheta%20%20_%20%200%7D%20R(%5Chat%20%5Ctheta;%20%5Ctheta%20%20_%20%200)%20%5Cgeq%20B%20%20_%20%20%5Cpi(%5Chat%20%5Ctheta)%20%5Cgeq%20B%20%20_%20%20%20%5Cpi(%5Chat%20%5Ctheta%20%20_%20%20%5Cpi)%20=%20%5Csup%20%20_%20%20%20%7B%5Ctheta%20%20_%20%20%200%7DR(%5Chat%20%5Ctheta%20%20_%20%20%5Cpi;%5Ctheta%20%20_%20%200)">, from which it follows that <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta%20%20_%20%20%5Cpi"> is minimax.</p></li>
<li><p>The idea of minimizing expected risk is not quite Bayesian, since it requires us to first average over all data possibilities when computing the risk. One of the main advantages of the Bayesian framework is that it allows us to <em>condition</em> on the observed data, rather than pre-emptively considering all possibilities, and we can try to make use of that. Define the <strong>posterior expected loss</strong> (or <strong>posterior risk</strong>) of an estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Ctheta">, conditionally on <img src="https://latex.codecogs.com/png.latex?X">, as</p></li>
</ol>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20R%20_%20%5Cpi(%5Chat%20%5Ctheta%5Cmid%20X)%20=%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X)%7D%5Cleft%5B%5C%7C%5Chat%20%5Ctheta(X)%20-%20%5Ctheta%20_%200%5C%7C%5E2%5Cright%5D.%0A"></p>
<ul>
<li style="list-style-type: none;">
It is clear from the previous computations that the posterior mean estimate minimizes the posterior risk, and hence the two approaches are equivalent. It turns out that, whatever the loss function we consider (under some regularity condition ensuring that stuff is finite and minimizers exist), minimizing the posterior risk is equivalent to minimizing the Bayes risk. In other words, for any loss function (again under some regularity conditions ensuring finiteness and existence of stuff), we have
</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20%5Carg%5Cmin%20_%20%7B%5Chat%20%5Ctheta%7D%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20m%7D%5Cleft%5B%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X)%7D%5B%5Cell(%5Chat%20%5Ctheta(X),%20%5Ctheta%20_%200)%5D%5Cright%5D%20=%20%5Carg%5Cmin%20_%20%7B%5Chat%5Ctheta%7D%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20_%200%20%5Csim%20%5Cpi(%5Ccdot%20%5Cmid%20X)%7D%5B%5Cell(%5Chat%20%5Ctheta(X),%20%5Ctheta%20_%200)%5D.%0A"></p>
<ul>
<li style="list-style-type: none;">
This is roughly self-evident if we think about it. An interesting consequence is that any estimator minimizing a Bayes risk is a function of the posterior distribution.
</li>
</ul>
</section>
</section>
<section id="randomized-estimation-and-information-risk-minimization" class="level2">
<h2 class="anchored" data-anchor-id="randomized-estimation-and-information-risk-minimization">2. Randomized estimation and information risk minimization</h2>
<p>Let <img src="https://latex.codecogs.com/png.latex?%5CTheta"> be a model, let <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20Q"> be some data and let <img src="https://latex.codecogs.com/png.latex?%5Cell%20_%20%5Ctheta(X)"> be a loss associated with using <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> to fit the data <img src="https://latex.codecogs.com/png.latex?X">. For instance, we could have <img src="https://latex.codecogs.com/png.latex?%5CTheta%20=%20%5C%7Bf:%5Cmathcal%7BX%7D%20%5Crightarrow%20%5Cmathbb%7BR%7D%5C%7D"> a set of functions, <img src="https://latex.codecogs.com/png.latex?X%20=%5C%7B(U%20_%20i,%20Y%20_%20i)%5C%7D%20_%20%7Bi=1%7D%5En%20%5Csubset%20%5Cmathcal%7BX%7D%5Ctimes%20%5Cmathbb%7BR%7D"> a set of features with associated responses, and <img src="https://latex.codecogs.com/png.latex?%5Cell%20_%20%5Ctheta(X)%20=%20%5Csum%20_%20%7Bi%7D(Y%20_%20i%20-%5Ctheta(U%20_%20i))%5E2"> the sum of squared errors loss.</p>
<p>There may be a parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20_%200%5Cin%5CTheta"> minimizing the risk <img src="https://latex.codecogs.com/png.latex?R(%5Ctheta)%20=%20%5Cmathbb%7BE%7D%20_%20%7BX%5Csim%20Q%7D%5B%5Cell%20_%20%5Ctheta(X)%5D">, which will then be our learning target. Now we consider <em>randomized</em> estimators taking the form <img src="https://latex.codecogs.com/png.latex?%5Ctheta%5Csim%20%5Chat%20%5Cpi%20_%20X">, where <img src="https://latex.codecogs.com/png.latex?%5Chat%5Cpi%20_%20X"> is a data-dependent distribution, and the performance of this estimation method can then be evaluated by the empirical risk <img src="https://latex.codecogs.com/png.latex?R%20_%20X%20(%5Chat%5Cpi%20_%20X)%20=%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X%7D%5B%5Cell%20_%20%5Ctheta(X)%5D">.</p>
<p>Here we should be raising an eyebrow. There is typically no point in having the estimator <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> being random, i.e.&nbsp;we typically will prefer to take <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Cpi%20_%20X"> a point mass rather than anything else. But bear with me for a sec.&nbsp;The cool thing is that if we choose</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20%5Cpi%20_%20X%20=%20%5Carg%5Cmin%20_%20%7B%5Chat%20%5Cpi%20_%20X%7D%20%5Cleft%5C%7BR%20_%20X(%5Chat%20%5Cpi%20_%20X)%20+%20D(%5Chat%20%5Cpi%20_%20X%20%5C%7C%20%5Cpi)%5Cright%5C%7D,%20%5Ctag%7B$*$%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?D(%5Chat%20%5Cpi%20_%20X%5C%7C%20%5Cpi)%20=%20%5Cint%20%5Clog%20%5Cfrac%7Bd%5Chat%20%5Cpi%20_%20X%7D%7Bd%5Cpi%7D%20%5C,d%5Chat%20%5Cpi%20_%20X"> is the Kullback-Leibler divergence, then this distribution will satisfy</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad%5Chat%20%5Cpi%20_%20X(%5Ctheta)%20%5Cpropto%20e%5E%7B-%5Cell%20_%20%5Ctheta(X)%7Dd%5Cpi(%5Ctheta).%0A"></p>
<p>That is, Bayesian-type posteriors arise by minimizing the empirical risk of a randomized estimation scheme penalized by the Kullback-Leibler divergence from prior to posterior <a href="https://ieeexplore.ieee.org/document/1614067/">(Zhang, 2006)</a>.</p>
<p>For the proof, write</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AR%20_%20X(%5Chat%20%5Cpi%20_%20X)%20+%20D(%5Chat%20%5Cpi%20_%20X%20%5C%7C%20%5Cpi)%20=%20%5Cint%20%5Cleft(%5Cell%20_%20%5Ctheta(X)%20+%20%5Clog%5Cfrac%7Bd%5Chat%20%5Cpi%20_%20X(%5Ctheta)%7D%7Bd%5Cpi(%5Ctheta)%7D%5Cright)%20d%5Chat%20%5Cpi%20_%20X%20(%5Ctheta)=%5Cint%5Cleft(%5Clog%5Cfrac%7Bd%5Chat%5Cpi%20_%20X(%5Ctheta)%7D%7Be%5E%7B-%5Cell%20_%20%5Ctheta(X)%7Dd%5Cpi(%5Ctheta)%7D%5Cright)d%5Chat%20%5Cpi%20_%20X(%5Ctheta)%0A"></p>
<p>which is also equal to <img src="https://latex.codecogs.com/png.latex?D(d%5Chat%20%5Cpi%20_%20X%20%5C%7C%20e%5E%7B-%5Cell%20_%20%5Ctheta(X)%7D%20d%5Cpi)"> and, by properties of the Kullback-Leibler divergence, obviously minimized at <img src="https://latex.codecogs.com/png.latex?d%5Chat%20%5Cpi%20_%20X%20%5Cpropto%20e%5E%7B-%5Cell%20_%20%5Ctheta(X)%7Dd%5Cpi(%5Ctheta)">.</p>
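<p>The minimization property can be checked numerically on a finite parameter set. The sketch below is only an illustration under assumptions of my choosing: five parameter values with arbitrary losses, a uniform prior, and hypothetical function names. It compares the penalized empirical risk of the Gibbs posterior with that of randomly drawn distributions.</p>

```python
import math
import random


def gibbs_posterior(losses, prior):
    """Return the weights proportional to exp(-loss) * prior, and the normalizing constant."""
    weights = [p * math.exp(-l) for l, p in zip(losses, prior)]
    z = sum(weights)
    return [w / z for w in weights], z


def penalized_risk(q, losses, prior):
    """Empirical risk of the randomized estimator q plus the divergence D(q || prior)."""
    risk = sum(qi * li for qi, li in zip(q, losses))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return risk + kl


def random_distribution(k):
    """A random probability vector of length k."""
    raw = [random.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


random.seed(1)
k = 5
losses = [random.uniform(0.0, 3.0) for _ in range(k)]  # stand-ins for the losses on the data
prior = [1.0 / k] * k

gibbs, z = gibbs_posterior(losses, prior)
best = penalized_risk(gibbs, losses, prior)

# Gap between the objective at random competitors and at the Gibbs posterior.
gaps = [penalized_risk(random_distribution(k), losses, prior) - best for _ in range(1000)]
print(best, -math.log(z), min(gaps))
```

<p>The objective at the Gibbs posterior equals minus the log of the normalizing constant, and every competitor exceeds it by its Kullback-Leibler divergence to the Gibbs posterior, so all gaps are nonnegative.</p>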
<p>Is this practically useful and insightful? Possibly. But at least this approach is suited to a general theory, as shown in Zhang (2006) and as I reproduce below.</p>
<p>Let us introduce a Rényi-type generalization error defined, for <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20(0,1)">, by</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad%20_%20%5Calpha(%5Ctheta;%20Q)%20=%20-%5Calpha%5E%7B-1%7D%5Clog%5Cmathbb%7BE%7D%20_%20%7BX'%20%5Csim%20Q%7D%5Be%5E%7B-%5Calpha%20%5Cell%20_%20%5Ctheta(X')%7D%5D.%0A"></p>
<p>This is a measure of loss associated with the use of a parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> to fit new data <img src="https://latex.codecogs.com/png.latex?X'%20%5Csim%20Q">. We also write</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad%20_%20%5Calpha(%5Chat%20%5Cpi%20_%20X;%20Q)%20=%20-%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X%7D%5Cleft%5B%20%5Calpha%5E%7B-1%7D%5Clog%5Cmathbb%7BE%7D%20_%20%7BX'%20%5Csim%20Q%7D%5Be%5E%7B-%5Calpha%20%5Cell%20_%20%5Ctheta(X')%7D%5D%20%5Cright%5D%0A"></p>
<p>for the expected Rényi generalization error when using the randomization scheme <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X">.</p>
<p>In order to get interesting bounds on this generalization error, we can follow the approach of Zhang (2006).</p>
<section id="change-of-measure-inequality" class="level4">
<h4 class="anchored" data-anchor-id="change-of-measure-inequality">Change of measure inequality</h4>
<p>We’ll need the change of measure inequality, which states that for any function <img src="https://latex.codecogs.com/png.latex?f"> and distributions <img src="https://latex.codecogs.com/png.latex?%5Cpi">, <img src="https://latex.codecogs.com/png.latex?%5Chat%20%5Cpi"> on <img src="https://latex.codecogs.com/png.latex?%5CTheta,"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%5Cpi%7D%5Bf(%5Ctheta)%5D%20%5Cleq%20D(%5Chat%20%5Cpi%20%5C%7C%20%5Cpi)%20+%20%5Clog%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Cpi%7D%5Cleft%5Be%5E%7Bf(%5Ctheta)%7D%5Cright%5D.%0A"></p>
<p>Indeed, with some sloppiness and Jensen’s inequality we can compute</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20%5Cint%20e%5E%7Bf(%5Ctheta)%7D%5Cpi(d%5Ctheta)%5Cgeq%20%5Cint%20%5Cleft(f(%5Ctheta)%20+%20%5Clog%5Cfrac%7Bd%5Cpi%7D%7Bd%5Chat%5Cpi%7D(%5Ctheta)%5Cright)d%5Chat%20%5Cpi(%5Ctheta)%20=%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%7D%5Bf(%5Ctheta)%5D%20-%20D(%5Chat%20%5Cpi%5C%7C%5Cpi).%0A"></p>
</section>
<section id="generalization-error-bound" class="level4">
<h4 class="anchored" data-anchor-id="generalization-error-bound">Generalization error bound</h4>
<p>We can now attempt bounding <img src="https://latex.codecogs.com/png.latex?d%20_%20%5Calpha(%5Chat%20%5Cpi%20_%20X;Q)">. Consider the scaled difference <img src="https://latex.codecogs.com/png.latex?%5CDelta%20_%20X%20(%5Ctheta)%20=%20%5Calpha%5Cleft(d%20_%20%5Calpha(%5Ctheta;Q)%20-%20%5Cell%20_%20%5Ctheta(X)%5Cright)"> between the generalization error and the empirical loss corresponding to the use of a fixed parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. The factor <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is chosen so that, by the definition of <img src="https://latex.codecogs.com/png.latex?d%20_%20%5Calpha">, we have <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20Q%7D%5Cleft%5Be%5E%7B%5CDelta%20_%20X(%5Ctheta)%7D%5Cright%5D%20=%201"> for every fixed <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. Then by the change of measure inequality,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cexp%5C%7B%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X%7D%5B%5CDelta%20_%20X(%5Ctheta)%5D%20-%20D(%5Chat%20%5Cpi%20_%20X%5C%7C%5Cpi)%5C%7D%20%5Cleq%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Cpi%7D%5Cleft%5Be%5E%7B%5CDelta%20_%20X(%5Ctheta)%7D%5Cright%5D%0A"></p>
<p>and hence for any <img src="https://latex.codecogs.com/png.latex?%5Cpi">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20Q%7D%5Cleft%5B%5Cexp%5Cleft%5C%7B%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X%7D%5B%5CDelta%20_%20X(%5Ctheta)%5D%20-%20D(%5Chat%20%5Cpi%20_%20X%5C%7C%5Cpi)%5Cright%5C%7D%5Cright%5D%20%5Cleq%20%5Cmathbb%7BE%7D%20_%20%7BX%20%5Csim%20Q%7D%5Cleft%5B%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Cpi%7D%5Cleft%5Be%5E%7B%5CDelta%20_%20X(%5Ctheta)%7D%5Cright%5D%5Cright%5D%20=%201%0A"></p>
<p>By Markov’s inequality, this implies that <img src="https://latex.codecogs.com/png.latex?%5Cforall%20t%20%3E%200">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BP%7D%5Cleft(%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X%7D%5B%5CDelta%20_%20X(%5Ctheta)%5D%20-%20D(%5Chat%20%5Cpi%20_%20X%5C%7C%5Cpi)%20%5Cgeq%20t%5Cright)%20%5Cleq%20e%5E%7B-t%7D.%0A"></p>
<p>Rewriting yields</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ad%20_%20%5Calpha(%5Chat%20%5Cpi%20_%20X;Q)%20%5Cleq%20R%20_%20X(%5Chat%20%5Cpi%20_%20X)%20+%20%5Calpha%5E%7B-1%7D%5Cleft(D(%5Chat%20%5Cpi%20_%20X%5C%7C%5Cpi)%20+%20t%5Cright)%0A"></p>
<p>with probability at least <img src="https://latex.codecogs.com/png.latex?1-e%5E%7B-t%7D">. To recap: the term <img src="https://latex.codecogs.com/png.latex?d%20_%20%5Calpha(%5Chat%20%5Cpi%20_%20X;Q)"> is understood as a generalization error. On the right-hand side, <img src="https://latex.codecogs.com/png.latex?R%20_%20X(%5Chat%20%5Cpi%20_%20X)%20=%20%5Cmathbb%7BE%7D%20_%20%7B%5Ctheta%20%5Csim%20%5Chat%20%5Cpi%20_%20X%7D%5B%5Cell%20_%20%5Ctheta(X)%5D"> is the empirical risk, the Kullback-Leibler divergence <img src="https://latex.codecogs.com/png.latex?D(%5Chat%20%5Cpi%20_%20X%5C%7C%5Cpi)"> penalizes the complexity of <img src="https://latex.codecogs.com/png.latex?%5Chat%5Cpi%20_%20X">, measured as its divergence from a “prior” <img src="https://latex.codecogs.com/png.latex?%5Cpi">, and <img src="https://latex.codecogs.com/png.latex?t"> is a tuning parameter that sets the confidence level: a larger value loosens the bound but makes it hold with higher probability.</p>
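<p>To make the role of the tuning parameter concrete, here is a minimal sketch that evaluates the right-hand side of the bound for a target failure probability, using the fact that the bound fails with probability at most the exponential of minus the tuning parameter. The function name and the numerical values of the empirical risk and of the divergence are hypothetical placeholders, not results from any experiment.</p>

```python
import math

def high_probability_bound(empirical_risk, kl_to_prior, delta):
    """Bound R_X + D + t that holds with probability at least 1 - delta.

    The tuning parameter is chosen as t = log(1 / delta), so that exp(-t) = delta.
    """
    t = math.log(1.0 / delta)
    return empirical_risk + kl_to_prior + t

# Hypothetical values: empirical risk 0.3, divergence from the prior 1.2.
bound_95 = high_probability_bound(0.3, 1.2, 0.05)
bound_99 = high_probability_bound(0.3, 1.2, 0.01)
print(bound_95, bound_99)
```

<p>Asking for more confidence only adds a logarithmic term to the bound.</p>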
</section>
</section>
<section id="online-learning-regret-and-kullback-leibler-divergence" class="level2">
<h2 class="anchored" data-anchor-id="online-learning-regret-and-kullback-leibler-divergence">3. Online learning, regret and Kullback-Leibler divergence</h2>
<p>Following <a href="http://www.stat.yale.edu/~arb4/publications_files/information%20theoric%20characterization%20of%20bayes%20performance.pdf">Barron (1998)</a>, suppose we sequentially observe data points <img src="https://latex.codecogs.com/png.latex?X%20_%201,%20X%20_%202,%20X%20_%203,%20%5Cdots"> which are, say, i.i.d. with common distribution <img src="https://latex.codecogs.com/png.latex?Q"> with density <img src="https://latex.codecogs.com/png.latex?q">. At each time step <img src="https://latex.codecogs.com/png.latex?n">, the goal is to predict <img src="https://latex.codecogs.com/png.latex?X%20_%20%7Bn+1%7D"> using the data <img src="https://latex.codecogs.com/png.latex?X%5En%20=%20(X%20_%201,%20%5Cdots,%20X%20_%20n)">. Our prediction is not a point estimate of <img src="https://latex.codecogs.com/png.latex?X%20_%20%7Bn+1%7D">. Instead, much as in the randomized estimation scenario, we output a density estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%20_%20n%20=%20p(%5Ccdot%20%5Cmid%20X%5En)">, the goal being that <img src="https://latex.codecogs.com/png.latex?p(X%20_%20%7Bn+1%7D%5Cmid%20X%5En)"> be as large as possible. A bit more precisely, we individually score a density estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%20_%20n"> through the risk <img src="https://latex.codecogs.com/png.latex?%5Cell%20_%20q(%5Chat%20p%20_%20n)%20=%20%5Cmathbb%7BE%7D%20_%20%7BX%20_%20%7Bn+1%7D%5Csim%20q%7D%5B%5Clog(q(X%20_%20%7Bn+1%7D)/%5Chat%20p%20_%20n(X%20_%20%7Bn+1%7D%20))%5D%20=%20D(q%5C%7C%20%5Chat%20p%20_%20n)">, which is the Kullback-Leibler divergence between <img src="https://latex.codecogs.com/png.latex?q"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%20_%20n">. The <em>regret</em> over times <img src="https://latex.codecogs.com/png.latex?n=1,%202,%5Cdots,%20N"> is the sum of the risks over the whole process, i.e.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bregret%7D%20=%20%5Csum%20_%20%7Bn=1%7D%5EN%20D(q%5C%7C%20%5Chat%20p%20_%20n).%0A"></p>
<p>Formally, this process is equivalent to estimating the distribution of <img src="https://latex.codecogs.com/png.latex?X%5EN"> all at once: our density estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%5EN"> of <img src="https://latex.codecogs.com/png.latex?X%5EN"> would simply be</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20p%5EN(X%5EN)%20=%20%5Cprod%20_%20%7Bn=1%7D%5EN%20%5Chat%20p%20_%20n(X%20_%20n)%0A"></p>
<p>and the regret is, by the chain rule, simply <img src="https://latex.codecogs.com/png.latex?D(q%5EN%20%5C%7C%20%5Chat%20p%5EN)">, where <img src="https://latex.codecogs.com/png.latex?q%5EN"> is the <img src="https://latex.codecogs.com/png.latex?N">th independent product of <img src="https://latex.codecogs.com/png.latex?q">.</p>
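<p>The chain rule identity can be checked by brute-force enumeration in a small example. The sketch below makes several assumptions that are not in the text: the data are Bernoulli with a hypothetical parameter, the predictor is the Laplace rule of succession, each prediction uses only the observations that precede the point being predicted, and the regret is read as the expected sum of the one-step risks.</p>

```python
import math
from itertools import product

THETA = 0.3  # hypothetical true Bernoulli parameter
N = 6        # number of prediction rounds

def q(x):
    """True density of one observation."""
    return THETA if x == 1 else 1.0 - THETA

def predict(x, past):
    """Laplace rule of succession: predictive probability of x given past observations."""
    p_one = (sum(past) + 1) / (len(past) + 2)
    return p_one if x == 1 else 1.0 - p_one

def prob_q(seq):
    """Probability of a sequence under the i.i.d. product of q."""
    return math.prod(q(x) for x in seq)

# Sum over rounds of the expected one-step risk E[D(q || p_hat_n)].
cumulative_risk = 0.0
for n in range(1, N + 1):
    for past in product((0, 1), repeat=n - 1):
        step_kl = sum(q(x) * math.log(q(x) / predict(x, past)) for x in (0, 1))
        cumulative_risk += prob_q(past) * step_kl

# Divergence between the joint densities q^N and p_hat^N.
joint_kl = 0.0
for seq in product((0, 1), repeat=N):
    p_joint = math.prod(predict(seq[i], seq[:i]) for i in range(N))
    joint_kl += prob_q(seq) * math.log(prob_q(seq) / p_joint)

print(cumulative_risk, joint_kl)
```

<p>The two printed numbers should agree up to floating-point error.</p>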
<p>Given a prior <img src="https://latex.codecogs.com/png.latex?%5Cpi"> over a space of distributions for <img src="https://latex.codecogs.com/png.latex?q">, our problem is then to minimize the Bayes risk</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AB%20_%20%5Cpi(%5Chat%20p%5EN)%20=%20%5Cmathbb%7BE%7D%20_%20%7Bq%5Csim%20%5Cpi%7D%20D(q%5EN%5C%7C%5Chat%20p%5EN).%0A"></p>
<p>This is achieved by choosing <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%5EN(x)%20=%20%5Chat%20p%20_%20%5Cpi%5EN(x)%20=%20%5Cint%20q%5EN(x)%20%5Cpi(dq)">, the <em>prior predictive</em> density. This is equivalent to using, at each time step <img src="https://latex.codecogs.com/png.latex?n">, the posterior predictive density <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%20_%20%7Bn,%20%5Cpi%7D(x)%20=%20%5Cint%20q(x)%20%5C,%5Cpi(dq%5Cmid%20%5C%7BX%20%20_%20%20i%5C%7D%20%20_%20%7Bi=1%7D%5En)">.</p>
<p>To see this minimizing property of the Bayes average, it suffices to write</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AB%20_%20%5Cpi(%5Chat%20p%5EN)%20=%20%5Cmathbb%7BE%7D%20_%20%7Bq%20%5Csim%20%5Cpi%7D%20%5Cleft%5BD(q%5EN%5C%7C%20%5Chat%20p%20_%20%5Cpi%5EN)%5Cright%5D%20+%20D(%5Chat%20p%20_%20%5Cpi%5EN%20%5C%7C%20%5Chat%20p%5EN).%0A"></p>
<p>Note that a consequence of this analysis is also that the posterior predictive distribution <img src="https://latex.codecogs.com/png.latex?%5Chat%20p%20_%20%7Bn,%20%5Cpi%7D"> will minimize the expected posterior risk:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20p%20_%20%7Bn,%20%5Cpi%7D%20%5Cin%20%5Carg%5Cmin%20_%20%7B%5Chat%20p%20_%20%7Bn%7D%7D%20%5Cmathbb%7BE%7D%20_%20%7Bq%20%5Csim%20%5Cpi(%5Ccdot%5Cmid%20X%5En)%7D%5Cleft%5BD(q%5C%7C%5Chat%20p%20_%20n)%5Cright%5D.%0A"></p>
<p>Following section 1, this furthermore means that the posterior predictive distribution minimizes the Bayes risk associated with the Kullback-Leibler loss.</p>
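<p>The decomposition of the Bayes risk used above, which splits it into the Bayes risk of the prior predictive density plus a nonnegative divergence term, can also be checked numerically. The sketch below is illustrative only: the two-point prior over Bernoulli parameters, the number of observations and the fair-coin competitor are all hypothetical choices.</p>

```python
import math
from itertools import product

PRIOR = {0.2: 0.4, 0.7: 0.6}  # hypothetical prior: Bernoulli parameter -> prior weight
N = 4
SEQS = list(product((0, 1), repeat=N))

def bern_product(theta):
    """Joint density of N i.i.d. Bernoulli(theta) observations, as a dict over sequences."""
    return {s: math.prod(theta if x == 1 else 1.0 - theta for x in s) for s in SEQS}

def kl(p, r):
    """Kullback-Leibler divergence D(p || r) between densities on SEQS."""
    return sum(p[s] * math.log(p[s] / r[s]) for s in SEQS if p[s] > 0)

def bayes_risk(estimate):
    """Prior-averaged divergence between each q^N and the estimate."""
    return sum(w * kl(bern_product(th), estimate) for th, w in PRIOR.items())

# Prior predictive density: the prior-weighted mixture of the product densities.
mixture = {s: sum(w * bern_product(th)[s] for th, w in PRIOR.items()) for s in SEQS}

# An arbitrary competitor: i.i.d. fair coin flips.
competitor = bern_product(0.5)

lhs = bayes_risk(competitor)
rhs = bayes_risk(mixture) + kl(mixture, competitor)
print(lhs, rhs)
```

<p>Since the extra divergence term is nonnegative and vanishes only when the competitor equals the mixture, no density estimate can have a smaller Bayes risk than the prior predictive density.</p>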


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-reuse"><h2 class="anchored quarto-appendix-heading">Reuse</h2><div class="quarto-appendix-contents"><div><a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a></div></div></section><section class="quarto-appendix-contents" id="quarto-copyright"><h2 class="anchored quarto-appendix-heading">Copyright</h2><div class="quarto-appendix-contents"><div>Olivier Binette</div></div></section></div> ]]></description>
  <category>technical</category>
  <category>math</category>
  <category>statistics</category>
  <guid>https://olivierbinette.ca/pages/posts/2020-11-15-bayesian-optimalities/bayesian-optimalities.html</guid>
  <pubDate>Fri, 24 May 2019 04:00:00 GMT</pubDate>
</item>
</channel>
</rss>
