Olivier's blog

Best Practices for Accurate Data Labeling in Entity Resolution Systems

In this blog post, we explore the importance of accurate data labeling in entity resolution systems and share some best practices for data labeling.

Just arXiv'd: Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org

This paper introduces a novel evaluation methodology for entity resolution algorithms. It is motivated by PatentsView.org, a U.S. Patent and Trademark Office patent data exploration tool that disambiguates patent inventors using an entity resolution algorithm...

Potential of Privacy-Preserving Record Linkage for the Statistics of Hidden Population

Entity Resolution

Many populations are "hidden" from the point of view of traditional probabilistic surveys. They are populations for which we have no meaningful sampling frame and whose members may be difficult to identify. Examples include victims of human trafficking and civilian casualties in armed conflicts. Understanding these populations is central to policymaking and to prosecuting human rights violations, yet statistical inference remains extremely difficult in practice. [...]

Talk Wednesday at the Joint Statistical Meetings!

I'll be giving a talk Wednesday at 2 pm at JSM in D.C. Come see it! It'll be an interesting session featuring winners of the best student paper awards of the Survey Methods, Government Statistics, and Social Statistics sections.

Talk Thursday at UQÀM's Statistics Seminar

I will be talking about my recent paper "Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org".

How can you keep your computer systems secure?

As a professional working with sensitive data, I am responsible for keeping my computers and accounts secure. Here's how I do it and how you can too.

What's the Difference Between Science and Storytelling in Data Science?

It's easy to come up with stories and plausible explanations. What's hard is to be *right*.

Web App for "Welcome to the Moon"

Play "Welcome to the Moon" on the couch or remotely.

A Brief Introduction to Hyperparameter Optimization in Machine Learning

Machine Learning
Hyperparameter Optimization

I review a few black-box hyperparameter optimization techniques at a high conceptual level: grid search, randomized search and sequential model-based optimization.
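To give a flavor of the first two techniques, here is a minimal sketch that is not taken from the post itself. The `objective` function, the parameter names `learning_rate` and `depth`, and the search ranges are all hypothetical stand-ins for a real validation loss. Sequential model-based optimization is left out because it needs a surrogate model.

```python
import itertools
import random


def objective(learning_rate, depth):
    """Hypothetical validation loss (lower is better), used only for illustration."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values and return the best one."""
    names = list(grid)
    candidates = (
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    )
    return min(candidates, key=lambda params: objective(**params))


def random_search(bounds, n_trials, seed=0):
    """Sample parameters at random within the bounds and return the best draw."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(*bounds["learning_rate"]),
            "depth": rng.randint(*bounds["depth"]),
        }
        if best is None or objective(**params) < objective(**best):
            best = params
    return best


grid_best = grid_search({"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]})
random_best = random_search(
    {"learning_rate": (0.01, 1.0), "depth": (2, 8)}, n_trials=50
)
```

Grid search spends its budget on a fixed lattice, so its cost grows multiplicatively with each added parameter. Randomized search keeps the number of trials fixed and explores more distinct values per parameter.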

Record Linkage at the Duke GPSG Community Pantry

Record Linkage

The Duke Graduate and Professional Student Government (GPSG) Community Pantry is a student-operated food pantry serving the student community at Duke University. In this post, I describe the record linkage system used at the Pantry to identify individual customers and obtain their order history. This is done using a Python module for deterministic record linkage and model evaluation techniques which I describe in detail.
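As a rough illustration of what deterministic record linkage can look like, the sketch below links records that agree exactly on a normalized identifier and then groups them transitively. It is an assumption-laden toy, not the Pantry's actual module: the field names `email` and `phone`, the `normalize` rule, and the `link_records` function are all hypothetical.

```python
def normalize(value):
    """Lowercase and strip whitespace; treat missing or empty values as None."""
    return "".join(value.lower().split()) if value else None


def link_records(records, keys):
    """Assign a cluster label to each record.

    Two records are linked when they share a normalized value on any key;
    links are closed transitively with a union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smallest index as the cluster representative.
            parent[max(ri, rj)] = min(ri, rj)

    for key in keys:
        first_seen = {}
        for i, record in enumerate(records):
            value = normalize(record.get(key))
            if value is None:
                continue  # missing values never create links
            if value in first_seen:
                union(i, first_seen[value])
            else:
                first_seen[value] = i
    return [find(i) for i in range(len(records))]


# Made-up records for illustration only.
records = [
    {"email": "alex@example.edu", "phone": "555-0100"},
    {"email": "Alex@Example.edu ", "phone": None},
    {"email": None, "phone": "555-0199"},
    {"email": "sam@example.edu", "phone": "555-0199"},
]
clusters = link_records(records, keys=["email", "phone"])
```

Records with the same label are treated as the same customer, so an order history is obtained by grouping orders on that label.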

Theory of Gibbs posterior concentration

Notes on some research in progress.

The Credibility of confidence intervals

When p < 0.05 provides evidence in favor of the null...

Bayesian Optimalities

Some notes regarding various 'optimalities' of posterior distributions.

Global bounds for the Jensen functional

Some techniques to bound the Jensen functional.

Two sampling algorithms for trigonometric densities


The Significance of the adjusted R squared coefficient

A sound re-interpretation of the adjusted $R^2$ value for model comparison.

3D data visualization with WebGL/Three.js

ISM at the Eureka! Science Festival!

Short proof: critical points in invariant domains

Tubular neighborhoods

Complete proof of the tubular neighborhood theorem for submanifolds of Euclidean space. I was unable to find an elementary version in the literature.
