Olivier Binette

Buy or Build?

Fri, 15 Nov 2024 05:00:00 GMT

Choosing where to get what you need - whether that’s software, hardware, or people - is called strategic sourcing. It’s about minimizing the total cost of ownership, including the costs associated with using, not using, or maintaining the thing you need.

The problem is particularly tricky for software developers. They’re paid to build, after all, and they know or want to know how to build. So why should they buy a solution when they can make it themselves?

Before you build, make sure you understand the real costs to succeed over the long term, and only embark on those code-writing efforts you’re sure your business is capable of. - Robert Sher, HBR

Answering this question requires having clear requirements, understanding the extent to which suppliers can meet these requirements, and understanding the total costs associated with each alternative.

But there’s a rule of thumb that covers many situations. If:

you can buy what you need,
from a reasonably mature competitive market,
that benefits from economies of scale,

then you should buy and not build.

Why? You’re unlikely to beat a competitive market with economies of scale, so buy if you can.

There are exceptions to this, such as if this is an area of core capability where you’re trying to compete. But it’s a good rule of thumb for the rest.

Other tips from PipDecks’ Strategy Tactics:

Reuse

CC BY 4.0

Copyright

Olivier Binette

Product Development Is Hard

Wed, 04 Sep 2024 04:00:00 GMT

I am mostly a “technical” person. This means I tend to work on technology problems that have technology solutions. I’m interested in non-technological things as well, but it’s not my expertise.

In my field, learning about a new technology can feel like gaining a superpower. Think about being able to build a custom ChatGPT - it’s exciting!

With this comes the thought: “Wouldn’t it be nice if I solved problem Y using technology X?”

Unfortunately, the answer to this question is typically a resounding “no.”

It’s not that problem Y is not important. Or that technology X can’t help with problem Y. The problem is that product development is hard.

If I went about building a solution fueled only by my technological enthusiasm, then I would likely fail. It has happened to me before.

Most people don’t care about technology. They care about a job to be done. They want to gain a superpower of their own.

https://jtbd.info/2-what-is-jobs-to-be-done-jtbd-796b82081cca

Building a good product requires understanding what your customer/client wants to get done, and where, when, and why they might want to use your product.

This is a science of its own. It’s not a technological problem, it’s a human problem. And it’s not my expertise.

As technologists, we need to embrace our backline role. We need to call on non-technologists to guide the creation of great products that empower others, or learn the skills we need to get this done through training from experts or experience working with experts.

Olivier Binette

Strategic Project Management Made Simple

Wed, 04 Sep 2024 04:00:00 GMT

Everything that follows is a quote from Terry’s book, with minimal adaptations for flow in some places.

The most potent opportunities seldom show up labeled as “projects,” but arrive disguised as problems, issues, or murky messes. Tackling so-called Big, Hairy, Audacious Goals, as Jim Collins describes them in Built to Last, involves juggling a full spectrum of slippery Objectives that can be difficult to define, let alone manage.

In the pages ahead, I’ll walk you through a flexible thinking process, and show you how to sort through the fog of fuzzy ideas and develop sound strategies and executable plans. You’ll see how these tools scale up and down to handle issues of any size and flex to fit multiple situations you may face. But first, let’s review why most project plans are inadequate. See how many of these resonate with your personal experience:

Beware these six dangerous planning mistakes

Planning Mistake	Solution Elements
Tolerating Vague Objectives: In the rush to implement, not enough serious, upfront thinking goes into clarifying Objectives, Measures, and their interconnections.	Make Objectives clear and measurable; Identify logical levels and If-Then links; Define your strategic hypotheses; Define why before what and how
Ignoring Environmental Context: Projects unfold in unpredictable ways, but people sometimes think myopically and ignore how risk factors outside their project boundaries might affect them.	Scan the environment for circumstances; Understand internal and external context; Identify risk elements; Make, test, manage, and monitor Assumptions
Poor Planning Tools and Processes: When the only tool is a hammer, the whole world looks like a nail. Before firing up your PC, fire up your brain and flesh out your project strategy.	Choose common planning model and language; Plan top-down, test bottom-up; Plan for the plan; Use the Logical Framework as a central planning tool
Neglecting Stakeholder Interests: Projects are real-life dramas played out by multiple actors who bring their own agenda and varying degrees of interest and support.	Remember - people support what they help create; Involve people who matter; Understand the perspectives of others; Build consensus and commitment
One-shot Planning: Like home-baked bread that grows moldy with time, project plans have a limited shelf-life. They must be updated to reflect new learning and progress.	Build consensus and commitment; Treat project documents as living plans, organic in nature; Be “cycle logical” - think, plan, act, and assess; Iterate and update in predetermined learning cycles; Constantly refine the strategic hypothesis
Mismanaging People Dynamics: Project success requires the committed, coordinated action of many people.	Build in payoffs (fun, learning, rewards); Grow the team while growing the plan; Sharpen the who-when-what-how; Manage with emotional intelligence

The Four Critical Questions

All great solutions begin by asking the right questions. They seem like simple questions - that’s exactly the point. They are indeed simple, but not simplistic. The four following carefully crafted questions work wonders in virtually any situation. The first three are usually glossed over in the rush to answer the fourth.

What are we trying to accomplish and why?

The question of what the project should accomplish - and more importantly - why it needs to be done, deserves fine-tuned attention because those answers drive everything else. In the rush to decide on the how, who, and when of a project, people often gloss over the why.

How will we measure success?

This question is significant because Measures flesh out and anchor what the Objectives really mean. Until you define how success will be measured, even the most sincere visions are no more than highfalutin’ fluff.

What other conditions must exist?

This third question puts your project, issue, or initiative into a larger strategic context. Asking this expands the analysis to include some of the outside factors which may disrupt your carefully crafted plans.

How do we get there?

The majority of project teams I have worked with tend to delve deep into the details much too soon, or get sidelined by premature technical arguments. They gloss over the first three questions in a rush to get moving. The value of the fourth question comes from consciously placing it in its only, truly functional place in the planning sequence: Last.

LogFrames

While the LogFrame matrix may initially seem intimidating, the ideas it captures are basic. The four strategic questions offer a user-friendly way to learn and apply this tool. These questions are inherently embedded in the matrix, and answering them helps you design your project in a way that connects all the dots.

Alternative LogFrame Diagram

What Are We Trying To Accomplish And Why? (Objectives)
The first column describes Objectives and the If-Then logic linking them together. The LogFrame makes important distinctions among various “levels” of Objectives: Strategic intention (Goal), project impact (Purpose), project deliverables (Outcomes), and the key action steps (Inputs).
How Will We Measure Success? (Measures and Verifications)
- The second column identifies the Measures of success for Objectives at each level. Here we select appropriate Measures and choose quantity, quality, and time indicators to clarify what each Objective means.
- The third column summarizes how we will verify the status of the Measures at each level. Think of the Verification column as the project’s management information and feedback system.
What Other Conditions Must Exist? (Assumptions)
The fourth column captures Assumptions; those ever-present, but often neglected risk factors outside of the project, on which project success depends. Defining and testing Assumptions lets you spot potential problems and deal with them in advance.
How Do We Get There? (Inputs)
The bottom row captures the project action plan: Who does what, when, and with what resources. Conventional project management like Work Breakdown Structures (WBS) and Gantt chart schedules fit here.
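
To make the shape of the matrix concrete, here is a minimal Python sketch of it as a data structure. Only the four levels and the four columns come from the description above; the names `LogFrameRow` and `if_then_chain` and the example project are illustrative assumptions, not taken from the book.

```python
from dataclasses import dataclass, field

# The four LogFrame levels, from strategic intent down to action steps.
LEVELS = ("Goal", "Purpose", "Outcomes", "Inputs")


@dataclass
class LogFrameRow:
    """One row of the matrix: an Objective plus its three supporting columns."""
    objective: str                                          # column 1
    measures: list[str] = field(default_factory=list)       # column 2
    verification: list[str] = field(default_factory=list)   # column 3
    assumptions: list[str] = field(default_factory=list)    # column 4


def if_then_chain(logframe: dict[str, LogFrameRow]) -> list[str]:
    """Read the Objectives column bottom-up as If-Then statements."""
    bottom_up = list(reversed(LEVELS))
    return [
        f"If {logframe[lower].objective}, then {logframe[upper].objective}."
        for lower, upper in zip(bottom_up, bottom_up[1:])
    ]


# Hypothetical example project, for illustration only.
logframe = {
    "Goal": LogFrameRow("our customers are delighted"),
    "Purpose": LogFrameRow(
        "customers use our system",
        measures=["Share of customers active each month"],
        verification=["Usage logs"],
        assumptions=["Customers have time to learn a new tool"],
    ),
    "Outcomes": LogFrameRow("the system is delivered"),
    "Inputs": LogFrameRow("we build and test the system"),
}

for statement in if_then_chain(logframe):
    print(statement)
```

Reading the chain aloud is a quick check of the If-Then logic: if a statement sounds implausible, the missing condition probably belongs in the Assumptions column.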

LogFrame Tips

Treat the matrix as a summary. Keep it clear and concise; supplement with other documents.
Make sure everyone on the team has a working understanding of the LogFrame (at a minimum, knowing the four critical questions).
Make sure the right people are involved. Invite key stakeholders to participate in project planning.
Stress the importance of the process of planning as much as the plan that comes out of the planning process. Supplement liberally with other supporting tools.
Iterate to make it great. Consider the first LogFrame to be a rough draft that will require revision and reworking, perhaps through many cycles.
Build in specific milestones on the calendar at which you refine and revise the matrix in the light of new information.
Monitor and manage changing Assumptions over time.

Turning a Problem Into a Set of Objectives

A problem is simply a project in disguise. Projects masquerading as problems must first be converted into Objectives before advancing to solutions. Spend some time carefully diagnosing the problem because the way you define it shapes the range of solution options. Don’t get sucked in by an over-simplified definition, catch phrase, or symptom. Get at the root causes. Find the right problem to solve.

Stakeholder collaboration during problem analysis builds shared understanding, generates better solution approaches, and greases the skids for smoother execution.

Ask Your Stakeholders

What do you see as the problem?
Why is this a problem and for whom?
What causes the problem?
What are the consequences if we ignore the problem?
How will you know when the problem is gone?
What benefits will a solution bring?
What might an ideal solution look like?

Exploring Distinctions Among LogFrame Levels

Goal: The Big Picture Impact

The Goal is the big picture context — the overarching corporate or strategic Objective to which your project, and usually other projects, contribute.

Some typical Goal examples:

Delight our customers
Become the top provider in the market
Increase corporate profits
Ensure reliability of the nuclear stockpile
Foster a climate of innovation
Be the global leader in safety education

These secondary trigger questions can help you get to the primary Goal of a project:

What is the higher corporate or strategic Objective to which this project contributes?
Why is the project’s impact important?
What should happen after we achieve the Purpose?
What is the big picture reason for doing this project?

Purpose: The Project Sweet Spot

Purpose is the vital, often missing focus that expresses the desired result or the impact we expect the project deliverables to produce. It describes expected change in system behavior, whether the system of interest is a core process, a new organization unit, or target customers. Purpose floats a level above that which we can directly control — the Outcomes. It’s a subtle concept, often hard to grasp because we are so conditioned to thinking of activities and Outcomes.

Consider these examples:

Outcomes Statement	Corresponding Purposes
System built or delivered	Customers use our system
Process improved	Improved process used
System developed	System successfully implemented
Staff trained in safe procedures	Staff operates machinery safely

Here are some trigger questions you can ask to articulate the Purpose:

Why are we really doing this project?
What would the clients or users like to see happen because of this project?
If this project were a success, how would we know?
What impact are we trying to achieve?

Outcomes: What the Project Will Deliver

Project Outcomes describe what the team can, must, and commits to make happen to achieve Purpose. They can be functioning systems or processes (i.e., recruiting process operating) as well as completed end products (i.e., prototype built) and delivered services (i.e., people trained). They describe the specific end-results (or deliverables) expected from implementing a series of activities or tasks.

Use these questions to help solidify required Outcomes:

What are our main project deliverables?
What do we need to make happen in order to achieve the project Purpose?
What are the end results for which the project team can be held accountable?
What processes do we need to put in place to achieve Purpose?

Inputs (Activities)	Outcomes
Train users	Users trained
Improve skills	Skills improved
Determine best methods	Best methods determined
Build new office	New office built

Four Tips for Meaningful Measures

Don’t fall into the trap of measuring only that which is easy to measure. Measuring Inputs and Outcomes is most straightforward, but progress towards Purpose and Goal is what really counts. The best Measures meet these criteria:

Valid — They accurately measure the Objective. Changes in the status of Measures accurately reflect changes in the status of the Objective.
Verifiable — Clear, non-subjective evidence exists or can be obtained. This third LogFrame column identifies processes and mechanisms for determining the status of Measures in column two.
Targeted — Quality, quantity, and time targets are pinned down. Choose targets that are sufficient to achieve impact at the next higher level. Sometimes, rather than locking in a single number, it’s appropriate to state a rough range.
Independent — Each level in the hierarchy has separate Measures.
1. Goal Measures tend to be broad macro-Measures that include the long-term impact of one project or multiple projects aimed at the same Goal.
2. Purpose Measures describe those conditions we expect will exist when we are willing to call the project a success.
3. Outcome Measures describe specific tangible results that the project team can make happen and commits to doing so. Describe them as completed results (using the past tense verb form, such as “System developed” or “Training completed”).
4. Input Measures deal with activity, budget, and schedule.

Purpose Measures are the most important in the hierarchy. Why? Because that’s your primary aiming point, the what-should-occur result you expect after you deliver what you can.

Three Steps for Managing Assumptions

Step 1. Identify Key Assumptions

Brainstorm all the conditions you believe are necessary to go from one LogFrame level to the next.

Step 2. Analyze and Test Them

Try to assess the degree of risk you can expect from these critical Assumptions by using a simple rating system or probability percentages. Decide which Assumptions to highlight in the LogFrame matrix.

How important is this Assumption to project success or failure?
How valid or probable is this Assumption? What are the odds that it is valid (or not)? Can we express it as a percentage? How do we know?
If the Assumptions fail, what is the effect on the project? Does a failed Assumption diminish accomplishment? Delay it? Destroy it?
What could cause this Assumption not to be valid? (Note: This one raises specific risk factors.)
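
One possible “simple rating system” is sketched below in Python. The 1-5 importance scale, the `risk_score` formula (importance times the probability that the Assumption fails), and the example Assumptions are all assumptions made for illustration; the book does not prescribe a specific formula.

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    text: str
    importance: int   # assumed scale: 1 (minor) to 5 (project fails without it)
    p_valid: float    # estimated probability that the Assumption holds

    @property
    def risk_score(self) -> float:
        # Importance weighted by the chance that the Assumption fails.
        return self.importance * (1.0 - self.p_valid)


# Hypothetical Assumptions, for illustration only.
assumptions = [
    Assumption("Key stakeholders stay engaged", importance=4, p_valid=0.8),
    Assumption("Vendor delivers hardware on time", importance=5, p_valid=0.5),
    Assumption("Budget is not cut mid-year", importance=3, p_valid=0.9),
]

# Highest risk first: these are candidates to highlight in the LogFrame.
ranked = sorted(assumptions, key=lambda a: a.risk_score, reverse=True)
for a in ranked:
    print(f"{a.risk_score:.2f}  {a.text}")
```

The scores are only as good as the estimates behind them, so the ranking is best treated as a conversation starter with stakeholders rather than a verdict.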

Step 3. Act on Them

Put each key Assumption under your mental microscope and consider the following:

Is this a reasonable risk to take?
To what extent is it amenable to control? Can we manage it? Influence and nudge it? Or only monitor it?
What are some ways we can influence the Assumption?
What contingency plans might we put in place just in case the Assumption proves wrong?
How can we design the project to minimize the impact of, or work around, the Assumption?
Is this Assumption under someone else’s control?
How could we design the project to make this Assumption moot or irrelevant?

Aligning Projects With Strategic Intent

The LogFrame can be the cornerstone of any unit-level management system. However, this presumes that there is a sound, overarching strategy to begin with.

Strategy is the particular means chosen to get from where you are to where you want to go, selected from multiple possibilities and reflecting your vision, mission, and values. An overall Strategy (big “S”) usually consists of multiple strategic initiatives (small “s”), which are executed through programs, projects, and tasks.

Strategic planning steps:

Clarify the Planning Context and Issues - Be clear about your expected planning Outcomes and identify issues to include.
Involve Key Players - Decide who to involve in your process to build buy-in and stay-in.
Scan Your Environment - Identify what’s changing in your environment, and analyze division and department plans to extract Goals your group shares or owns.
Revisit Your Vision/Mission/Values - Turn these “fluff” statements into high-performance tools that energize staff and build shared commitment.
Sharpen Your Goals and Measures - Develop a meaningful performance scorecard that identifies how you deliver customer value.
Develop Core Strategies - Turn Goals into strategies, and test those strategies for impact against Measures to ensure smart choices.
Turn Strategies into Executable Plans - Use the Logical Framework, and let the responsible players flesh out implementation plans.
Follow Up and Continue the Process - Build momentum by reviewing and updating the plans while strengthening the planning process itself.

The Strategic Action Cycle

The cycle begins with “Think,” the big picture strategic/program focus which follows the process from Chapter 4, or an equivalent strategic planning process.
Results of strategic thinking identify projects to be managed with the Plan-Act-Assess cycle.
Project plans created with LogFrames provide a solid foundation for action (execution/implementation) and Assessment.
The Assess block can complete the loop in three ways. If assessment shows that success has been achieved - as defined by project Purpose - the project can be considered complete.
1. Project Monitoring is an ongoing process of tracking budget and schedule against deliverables and making tactical adjustments. It presumes the Logical Framework is the best design and focuses team attention on translating Inputs into Outcomes.
2. Project Review is an occasional process that asks managers to step back from the day-to-day work and reassess their approach. It challenges the project design and invites changes in the LogFrame, with emphasis on the Outcome to Purpose link.
3. Project Evaluation examines impact and cost effectiveness. Project evaluations are often timed as the end of one phase nears and another is about to begin, or after the project is over. Evaluation examines Purpose to Goal linkages.

Other

Tips

The process of planning is more crucial than the planning documents that emerge at the other end. The collaborative use of the LogFrame helps you simultaneously build and shape a strong team while they work together to create an actionable plan.
Make sure that everyone speaks the same language by agreeing on what your key terms mean and using them in a consistent way.
The LogFrame matrix usually shows four levels, but Objectives above the Goal can be included to illustrate a higher level of impact. The higher up the hierarchy we climb, the more long-term, general, and “vision-sounding” these Objectives become.
Don’t ask “How’s it going on this task?” Instead, ask:
- Are you having difficulties that would keep you from meeting targets?
- Are you getting the support you need from others?
- Is there anything else I should know about this?
- What do you need from me?
Project monitoring asks “Are we on track?”; project reviews ask “Are we on the right track?” Use the LogFrame to challenge your strategy by posing questions such as:
- Is our Purpose still valid? What’s our progress toward Purpose?
- Is our Purpose likely to be achieved with this plan? Will this Purpose get us to the Goal?
- What is the status of Assumptions?
- Are these the right Outcomes? Are we producing them effectively?
- Should new Outcomes or Assumptions be added? Existing ones dropped?
- How should we revise our key strategic hypotheses (Outcome to Purpose to Goal) to produce better results?
Because the LogFrame’s systems thinking underpinnings are generic and flexible, so is the grid format itself. Be innovative and customize the LogFrame to your needs and add your own categories.
At times you’ll need to zoom in on a project component for more visibility. Some tasks are large enough to justify their own LogFrame.
Make responsibilities clear to all
Clarify Resource Requirements
Analyze stakeholder interests
Manage with emotional intelligence

Olivier Binette

The Pareto Principle and Project Failures

Sun, 01 Sep 2024 04:00:00 GMT

The Pareto principle, or the 80/20 rule, states that 80% of consequences come from 20% of the causes.

Surprisingly enough, this principle has general statistical underpinnings and does actually occur in a broad range of situations. The numbers 80/20 could be something else, but there is often an imbalance of this sort. It’s related to selection bias and size bias. Let me explain in the context of software development.

Say you’re building a piece of software for some use case. There’s a lot that goes into building and deploying the software: the UI, the logic, the backend, the deployment infrastructure, the iterative changes, etc. Each part contributes more or less to the functionality a user can see.

In this plot, UI+logic+backend is 80% of the functionality the user can see, but only 40% of the required effort to complete the project.

If functionality and effort are uncorrelated or negatively correlated, then building the most visible functionality first leads to diminishing returns: each additional unit of effort yields less functionality as the project goes on. The smallest set of components that contributes 80% of the functionality is a biased selection that isn’t representative of the overall effort distribution.
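
The effect can be reproduced with a few lines of Python. The per-component percentages below are made-up numbers, chosen only so that UI, logic, and backend add up to 80% of the functionality and 40% of the effort as in the example above; `cumulative_progress` is a hypothetical helper name.

```python
# Hypothetical components: (name, % of visible functionality, % of total effort).
components = [
    ("UI", 30, 10),
    ("logic", 30, 15),
    ("backend", 20, 15),
    ("deployment infrastructure", 10, 30),
    ("iterative changes", 10, 30),
]


def cumulative_progress(components):
    """Build the most functionality first and track the running totals."""
    ordered = sorted(components, key=lambda c: c[1], reverse=True)
    rows, functionality, effort = [], 0, 0
    for name, f, e in ordered:
        functionality += f
        effort += e
        rows.append((name, functionality, effort))
    return rows


for name, functionality, effort in cumulative_progress(components):
    print(f"{name:<26} functionality {functionality:>3}%   effort {effort:>3}%")
```

In this toy ordering, the last 20% of the functionality accounts for 60% of the effort, which is the imbalance that makes early progress look deceptively fast.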

This doesn’t mean that the 80% seen functionality is more important than the other 20%. In fact, your software is going to be useless if you can’t build the infrastructure it needs for deployment. All components are equally important in this example. This mismatch between true value and apparent functionality can be dangerously misleading.

Why Software Projects Fail

The Pareto principle plays into the common failure (or cost overrun, scope creep, technical debt) of software projects.

Often, development teams prioritize building a minimum viable product (MVP), or delivering the most apparent functionality for a given effort level. Reaching 80% of the functionality quickly can create unrealistic expectations about what’s needed to reach a product that has actual value, i.e. something maintainable and deployable. Clients, project managers, and developers can misunderstand the scope of a project if they rank tasks in functionality-first order, without considering the full value chain.

A Better Approach - Managing Risks And the Full Value Chain

As part of good project management, you want to:

1. Map risks and uncertainties, and address the most important ones first.
2. Deliver self-contained value to the client throughout the project, if possible.

E.g. for (1), if you don’t know what a client wants, that’s a big risk. Getting an MVP in front of them might help reduce uncertainties and mitigate that risk. A cost overrun is also a big risk. If you don’t know how long it will take to build the infrastructure to deploy your system, then you might want to address that first.

For (2), note that value is not always the same as functionality. Undeployed functionality has no value to a client. An MVP, unless it is truly viable on its own, typically has little value to a client. A product that doesn’t meet quality requirements does not have any value. If clients hire you for software development, value is something they can use without any further software development.

In Short

The Pareto principle is both about the big impact you can have from a few actions (e.g., achieve 80% in 20% of the time), and how easily misled you can be about scope and impact (e.g., forgetting about a necessary 20% that takes 80% of the time).

Infographic from Sheraz Ishak

Olivier Binette

The NABCs of Innovation

Thu, 29 Aug 2024 04:00:00 GMT

Innovation is creating and delivering new value to customers.

It happens at different levels. R&D projects are often expected to fail, but have potential for breakthroughs. Bringing existing technology to new markets is also a form of innovation, possibly with a higher success rate. Incremental optimizations and process improvements also involve innovation and are essential to an efficient business.

Innovation begins with someone having an idea they think could be valuable. Developing that idea and bringing it to customers requires time and energy.

A value proposition is what explains why this time and energy should be expended.

Curtis R. Carlson, ex-President of SRI International, developed a framework for value propositions. It has four main components (the “NABCs”) that aim to answer essential business questions:

Need: Who’s the customer? What’s their need or job to be done? What’s the gap in the market?
Approach: How are we solving that need? Is it unique, compelling, and defensible?
Benefit: What superior value is the customer getting through our approach?
Competition: What’s the competition? Why is our approach more appealing?

Additionally, there should be a driving force behind the proposition, i.e. motivated people willing and able to push this forward. The value proposition should also be aligned with the organization, both to support its development and enable capturing resulting value.

Building a good value proposition is an iterative process. The customer need is what matters and the approach might change - don’t fall in love with an idea. Focus on customer needs and the reasons underlying what they say they want. Try to quantify the value proposition, even if some of it may be guesswork. Address the biggest risks and uncertainties first, before trying to build everything. Maintain and adjust the value proposition throughout the project.

Exceptional Innovations

The best innovations don’t just provide new value.

They fit within or enable compounding processes, where past innovations keep on providing more and more value as they are built upon. Relatedly, they create more than one opportunity to capture value, i.e. they help expose the business to new opportunities, such as by entering new markets.

They align with the business’ strategic vision (its plan for growth) and reinforce its strategic positioning (how it distinguishes itself from competitors and provides compelling value, despite constraints).

Olivier Binette

Test-Driven Development is Free

Sat, 24 Aug 2024 04:00:00 GMT

Test-driven development (TDD) is the practice of writing tests before starting to write functional code.

It sounds a bit formal, but it’s very close to what we do when developing interactively in a Python notebook: starting with a working example before refactoring code into a general-purpose function, and iterating on the process of creating examples, testing, and developing. The practice started in the early days of programming, which is why some of the guides on the topic can seem complicated. But, in short:

TDD was interactive development, before interactive development was a thing!

Now there are advantages to formalizing TDD, without needing to move away from interactive development. I won’t list all of them here, but I will point out the ones that support my argument that TDD is free.

Why TDD Is Free

Here’s a key assumption I’m making: doing things right the first time is free. If you’re not doing it right the first time, you’ll have to come back to it later anyway. And not doing it right the first time is likely to create many unnecessary costs along the way.

So, how do you do something right the first time? There are 2 parts to this:

1. You need to know what’s the “right” thing you want to do.
2. You need to check that you actually did it right.

Point (2) is testing. You’ll have to test, whether it is at the beginning, throughout, or at the end.

Point (1) is having clear requirements. Sure, you can write down a requirements specification in detail and work off of that. But you know what else is a clear requirement? A test case.

You can save time by combining points (1) and (2) together in test cases. Just keep in mind that you’ll have to write tests first in order to satisfy point (1).

So, TDD is free: it’s not doing anything that you wouldn’t have to do anyway, and it’s saving you from extra work now and in the future.

Note that there is a learning curve to TDD. You need to find a TDD workflow that works for you. That takes a bit of time. But afterwards, you are saving time.

This Isn’t a New Idea

You’re already doing TDD:

In agile development, we use “User Stories” to describe specifications. These are high-level test case descriptions: “given starting point X, I want to do Y to achieve Z.” User stories don’t tell you how to code things - that’s the functional implementation. It’s something you figure out afterwards, once you know what the input looks like, what the function is meant to do, and what the result should look like.
As mentioned earlier, interactive development is informal TDD. How can you formalize TDD in interactive development, without losing the benefits of interactive development? Simply bring the tests to your interactive development workflow. It can be done by staying organized, or you can use tools like the “ipytest” library for unit testing in Python notebooks.
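
To illustrate how a user story maps onto a test case, here is a small sketch in plain Python. The story (“given readings with missing entries, I want to skip the gaps to get an undistorted average”) and the function `mean_ignoring_missing` are hypothetical, invented for this example. The test is written first and the implementation follows.

```python
def test_mean_ignores_missing_readings():
    # Given starting point X: readings with some missing entries.
    readings = [2.0, None, 4.0]
    # I want to do Y: compute the mean while skipping the missing entries.
    result = mean_ignoring_missing(readings)
    # To achieve Z: a summary that is not distorted by the gaps.
    assert result == 3.0


def mean_ignoring_missing(values):
    """Hypothetical implementation, written after the test above."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


# Run the test; no output means it passed.
test_mean_ignores_missing_readings()
```

The test says nothing about how the function works internally, which matches the point that a user story leaves the implementation open.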

Next Steps

You’re already doing TDD, but maybe you’re not doing it in the most effective way. If you answer yes to some of the questions below, then it might be worth it to improve your TDD practices:

Could you save time by catching bugs earlier?
Could you save time by writing examples/tests, instead of long-form documentation?
Could you save time by keeping track of the experiments, tests, and examples you use in a notebook as you develop?
Could you save time by clicking a single button to run all tests in your notebook, instead of backtracking to execute notebook cells one by one?
Do you often have to go back to fix bugs in your code or other people’s code?

There are lots of guides online about TDD. But remember: you need to create a workflow that works for you. TDD is not about formality, complicated testing, or full-coverage testing. TDD is about speeding up your development and building things right the first time.

TDD Myths

Be careful not to fall into the following traps:

“All tests need to be written upfront.” No. Your TDD tests only need to cover what you want to code up in the next 5-30 minutes. They’re meant to help you develop, not give you analysis paralysis.
“Tests can’t change.” No. TDD tests are there to help you develop. Change them as much as you like.
“I can’t add more tests after I’m done implementing.” No. TDD is an iterative process. Create a test, make sure it runs (and generally fails), develop, create more tests, check what fails, develop, and keep going until you are satisfied.
“I don’t need QA if I do TDD.” No. TDD is all about development. It helps you develop faster and better. It’s about you, as a developer, building what you want to build right the first time. But something that is built right is not necessarily the right thing for your customer!

Practical Example

Here’s what TDD looks like in practice.

Say I want to code a function “fibonacci” that computes the first n numbers of the standard Fibonacci sequence.

Step 1: A first simple example and test

First, I’ll write an example of what I want to do. This defines requirements for my function and lets me check it. The first tests should be simple and useful for development. If I don’t know in advance what the output should be, that’s OK: I can do a smoke test instead (just check that the function runs without error and show its output).

# Input
input_n = 5

# Output
expected_output = [1, 2, 3, 5, 8]

Then I keep track of this as a test case, so it’s easy to execute.

def test_fibonacci():
  assert fibonacci(input_n) == expected_output

Notice that this first step is very simple and directly related to my current development task: develop a function that gets the logic right. I don’t want to worry about edge cases and every detail right now, so I don’t write tests/examples for that.

Step 2: Implement and check

Now I code the function and test it.

def fibonacci(n):
  result = [1, 2]
  while len(result) < n:
    result.append(result[-1] + result[-2])  # next term is the sum of the last two
  
  return result

test_fibonacci()

If it doesn’t pass, make changes until it does. When it passes, great! We have the right logic. Now we can think about edge cases and iterate.

Step 3: Iterate

First, create examples/test cases. Again, this specifies what we want to achieve, and makes it easy for us to check it.

def test_fibonacci_edge_cases():
  assert fibonacci(0) == []
  assert fibonacci(1) == [1]
  # etc

Then, make changes to your function and run the tests:

def fibonacci(n):
  ...

test_fibonacci() # Make sure I didn't break anything
test_fibonacci_edge_cases() # New tests
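
The body of the updated function is left open above. As a minimal sketch of one way to fill it in (an assumption on my part, not the only possible design), slicing the result is enough to make both test functions pass with the 1, 2, 3, 5, 8 convention used in this example:

```python
def fibonacci(n):
    """Return the first n terms of the sequence 1, 2, 3, 5, 8, ..."""
    result = [1, 2]
    while len(result) < n:
        result.append(result[-1] + result[-2])
    # Slicing covers n = 0 and n = 1, where the seed list is too long.
    # max(n, 0) makes negative inputs return an empty list (an assumed choice).
    return result[:max(n, 0)]


def test_fibonacci():
    assert fibonacci(5) == [1, 2, 3, 5, 8]


def test_fibonacci_edge_cases():
    assert fibonacci(0) == []
    assert fibonacci(1) == [1]
    assert fibonacci(2) == [1, 2]


test_fibonacci()  # Make sure I didn't break anything
test_fibonacci_edge_cases()  # New tests
```

Both calls return silently when the assertions hold, which is the notebook equivalent of a green test run.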

A large number of tests can quickly become unwieldy. This is where testing frameworks like pytest come in handy. They keep track of test suites and let you run all tests in a single click.

Reuse

CC BY 4.0

Copyright

Olivier Binette

Personal Knowledge Management

Thu, 15 Aug 2024 04:00:00 GMT

Essentially all of my work involves reading and writing. I write papers and proposals, code, documentation, emails, and I jot down thoughts in problem-solving sessions. And all of that is in relation to the writings and ideas of an incredibly large number of people.

Keeping up with all this information requires knowledge management systems. They are often integrated into our online experiences - we have bookmarks, searchable email inboxes, online code repositories, etc.

But some effort is needed to use these systems effectively, without being overwhelmed by all of these disparate systems. That’s where personal knowledge management comes in.

It’s not a new idea. For millennia, beginning at least with Aristotle, writers have been using “commonplace” books to organize their notes, quotes, and ideas. Steven Johnson, in the book Where Good Ideas Come From, relates Darwin’s notebooks to this tradition:

Darwin’s notebooks lie at the tail end of a long and fruitful tradition that peaked in Enlightenment-era Europe, particularly in England: the practice of maintaining a ‘commonplace’ book. Scholars, amateur scientists, aspiring men of letters - just about anyone with intellectual ambition in the seventeenth and eighteenth centuries was likely to keep a commonplace book. The great minds of the period - Milton, Bacon, Locke - were zealous believers in the memory-enhancing powers of the commonplace book.

Something as simple as the “notes” app on your phone, or sending yourself emails, can work well enough for note-taking. But we can get much more out of our notes by using technology to help index notes, create connections between them, and help summarize and extract relevant information when needed.

Technology can also help us overcome the challenges of determining how to organize notes. Personally, I cannot keep any file tree well organized. There is an alternative: instead of a hierarchical tree, we can organize notes in a graph using tags and links. This is how Wikipedia is structured. You don’t find a wiki page by going down a file tree. Rather, you do keyword searches and follow links between pages.

My favorite tool for this is Obsidian (at work I use Confluence). Previously I used Notion, and before that I only used paper. Obsidian is free, easy to use, private (it’s a desktop app!), and responsive. I use it to keep track of everything that isn’t my paper notepad, emails, or LaTeX/Word documents.

There are lots of other tools available:

hypothes.is for web annotation
Roam
Notion

In short, it’s easy to take modern digital features like hypertext or search for granted. But it’s really amazing how far we’ve come to get here, and I think we can do even more amazing things if we can use these features to their full extent or push them even further.

Reuse

CC BY 4.0

Copyright

Olivier Binette

Measurement and Management

Thu, 15 Aug 2024 04:00:00 GMT

W. Edwards Deming pioneered the use of measurement and statistics in manufacturing industries, using data to improve processes. Some even credit part of the success of the post-WWII Japanese auto industry (e.g. Toyota) to Deming’s work in Japan, where he taught and popularized the use of Statistical Process Control (SPC) [1].

Unfortunately, Deming’s work and ideas are widely misunderstood. And Deming was aware of this. Many of his later writings emphasize how a naive understanding of quality management is counterproductive. ¹

W. Edwards Deming

Don’t manage by numbers.

It’s a bit confusing: Deming encouraged the use of measurement, metrics, data, and statistics, as a key tool for process improvement and quality control. And yet he also painstakingly tried to drive in points like this:

“It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.”
“Eliminate management by numbers and numerical goals.”

How can this be? How can he simultaneously be pro-measurement, pro-data, and against data-driven management?

How can we resolve this false paradox?

As a statistician, Deming knew how much valid inference depends on what you can’t measure. Statistics is not about data. It’s about combining data and context to make valid inferences. Data on its own has no meaning. Missing data - including both the data you wish you had and the data you don’t even know you’re missing - is more important than the data you have. A statistician’s work is to help learn about such unknowns. It’s a fallacy to make decisions based only on available data - the McNamara fallacy.

“But when the McNamara discipline is applied too literally, the first step is to measure whatever can be easily measured. The second step is to disregard that which can’t easily be measured or given a quantitative value. The third step is to presume that what can’t be measured easily really isn’t important. The fourth step is to say that what can’t be easily measured really doesn’t exist. This is suicide.” — Daniel Yankelovich, “Interpreting the New Life Styles”, Sales Management (1971)

The problem isn’t data or measurement. In fact, you should aim to measure as much as you can, as often as you can. You should build measurement and observability as core components of your systems and infrastructures. You should work to continually improve your approach to measurement of what matters. And you should have statisticians or data scientists make sense of these numbers through their context, given specific goals.

But here’s the thing: measurement is not management.

As a manager, your job is to create and maintain structures that drive customer value and continuous improvement. To achieve this, you need to think about knowns (i.e., data, metrics) and unknowns. Statisticians or data scientists can help you contextualize data and shed light on unknowns, although it’s not always an easy process.

In Short

There are many misconceptions surrounding data and its use in management. It is important for all to understand both the importance of data and its limitations. We can do so by learning from resources such as the Deming Institute’s website:

Deming advocated for structures that removed fear in workers, fostered continuous improvement, and enabled taking pride in one’s work.

Footnotes

Deming started working in Japan in 1947, bringing knowledge of the theory of Statistical Process Control (SPC) that was pioneered by Walter A. Shewhart at Bell Laboratories a few decades earlier. During post-war reconstruction, the Union of Japanese Scientists and Engineers (JUSE) invited Deming to teach SPC to engineers and managers. He went on to work with private enterprises and received multiple awards for his contributions.↩︎

Reuse

CC BY 4.0

Copyright

Olivier Binette

Comment on The Sample Size Required in Importance Sampling

Mon, 18 Mar 2024 04:00:00 GMT

The problem is to evaluate

where $$ is a probability measure on a space and where is measurable. The Monte-Carlo estimate of is

When it is too difficult to sample , for instance, other estimates can be obtained. Suppose that is absolutely continuous with respect to another probability measure , and that the density of with respect to is given by . Another unbiased estimate of is then

This is the general framework of importance sampling, with the Monte-Carlo estimate recovered by taking . An important question is the following.

How large should be for to be close to ?
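
To make the framework concrete, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption chosen for illustration: the target is the exponential distribution with mean 1, the sampling distribution is the exponential distribution with mean 2, the integrand is f(x) = x, and the helper name `importance_sampling_estimate` is hypothetical. The density of the target with respect to the sampling distribution is then 2 exp(-x/2), and the true value of the integral is 1.

```python
import math
import random


def importance_sampling_estimate(f, n, rng):
    """Estimate E[f(X)] for X ~ Exp(mean 1) using n draws from Exp(mean 2)."""
    total = 0.0
    for _ in range(n):
        y = rng.expovariate(0.5)  # sampling distribution: rate 1/2, mean 2
        # Density ratio target / sampling: exp(-y) / (0.5 * exp(-y / 2))
        rho = 2.0 * math.exp(-y / 2.0)
        total += f(y) * rho
    return total / n


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = importance_sampling_estimate(lambda x: x, 20_000, rng)
print(estimate)  # should land close to the true value 1
```

Rerunning with smaller values of n gives a feel for how quickly the estimate stabilizes in this easy case, where the density ratio is bounded.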

An answer is given, under certain conditions, by Chatterjee and Diaconis (2015). Their main result can be interpreted as follows. If and if is concentrated around its expected value , then a sample size of approximately is both necessary and sufficient for to be close to . The exact sample size needed depends on and on the tail behavior of . I state below their theorem with a small modification.

Theorem 1. (Chatterjee and Diaconis, 2015) As above, let . For any and ,

Conversely, for any and ,

Remark 1. Suppose and that is concentrated around , meaning that for some we have that and are both less than an arbitrary . Then, taking we find

$ |I_n(f) - I| e^{-t/4} + 2.$

However, if $n e^{L-t} $, we obtain

$ (1 - I_n(1) ) e^{-t/2} + 2 .$

meaning that there can be a high probability that and are not close.

Remark 2. Let , so that . In that case, only takes its expected value . The theorem yields

and no useful bound on .

Comment. For the theorem to yield a sharp cutoff, it is necessary that be relatively large and that be highly concentrated around . The first condition is not aimed at in the practice of importance sampling. This difficulty contrasts with the broad claim that “a sample of size approximately is necessary and sufficient for accurate estimation by importance sampling”. The result is conceptually interesting, but I’m not convinced that a sharp cutoff is common.
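
For reference alongside the remarks above, the unmodified theorem in the paper reads approximately as follows. This is a paraphrase in the paper's own notation, which may differ from the notation of this post, and it should be checked against the source: $\mu$ is the sampling distribution, $\nu$ the target, $\rho = d\nu/d\mu$, $Y$ has law $\nu$, and $L = D(\nu \,\|\, \mu) = \mathbb{E} \log \rho(Y)$.

```latex
% Paraphrase of Theorem 1.1 in Chatterjee and Diaconis; verify against the paper.
If $n = \exp(L + t)$ for some $t \ge 0$, then
\[
  \mathbb{E}\,\lvert I_n(f) - I(f) \rvert
  \le \lVert f \rVert_{L^2(\nu)}
  \left( e^{-t/4} + 2\sqrt{\mathbb{P}\bigl(\log \rho(Y) > L + t/2\bigr)} \right).
\]
Conversely, if $n = \exp(L - t)$ for some $t \ge 0$, then for any $\delta \in (0, 1)$,
\[
  \mathbb{P}\bigl(I_n(1) \ge 1 - \delta\bigr)
  \le e^{-t/2} + \frac{\mathbb{P}\bigl(\log \rho(Y) \le L - t/2\bigr)}{1 - \delta}.
\]
```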

Example

I consider their example 1.4. Here is the exponential distribution of mean , is the exponential distribution of mean 2, and . Thus . We have , meaning that the theorem yields no useful cutoff. Furthermore, and . Optimizing the bound given by the theorem yields

The figure below shows trajectories of . The shaded area bounds the expected error.

This next figure shows trajectories for the Monte-Carlo estimate of , taking and . Here the theorem yields

References.

Chatterjee, S. and Diaconis, P. The Sample Size Required in Importance Sampling. https://arxiv.org/abs/1511.01437v2

Reuse

CC BY 4.0

Copyright

Olivier Binette

What is the Reality-Ideality-Gap in Entity Resolution?

Tue, 12 Dec 2023 05:00:00 GMT

Wang et al. (2022) describe the frustration when real-world performance does not match expectations obtained from benchmark datasets. This difference is the “reality-ideality” gap, which is all too common in real-world applications of entity resolution.

Why does it happen? They posit that three main issues limit the generalizability of current benchmarks, specifically in the context of deep learning approaches to entity resolution:

1. There is leakage from the training set into the test set. In typical benchmark constructions, record pairs are randomly sampled, leading to the same cluster being represented in both the train and test dataset. This biases results, especially in deep learning approaches which rely on learning record embeddings.

2. Real-world data is much more imbalanced than benchmark datasets in terms of matching vs non-matching record pairs. In other words, there is much more opportunity for error in real data than in a benchmark dataset.

3. Partly as a consequence of the two above issues, typical benchmarks underestimate the importance of additional features and multimodal information. This leads to under-specified systems which do not perform as well as they could.
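
The leakage described in the first point can be avoided by splitting at the level of clusters rather than records or record pairs. The sketch below shows one common way to do this; it is not necessarily the construction used in the paper, and the function name `split_by_cluster` and the toy data are hypothetical.

```python
import random


def split_by_cluster(cluster_ids, test_fraction=0.25, seed=0):
    """Assign whole clusters, not individual records, to train or test."""
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(test_fraction * len(clusters)))
    test_clusters = set(clusters[:n_test])
    train_idx = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train_idx, test_idx


# Toy data: record i refers to the entity cluster_ids[i].
cluster_ids = [0, 0, 1, 2, 2, 2, 3, 4, 4, 5]
train_idx, test_idx = split_by_cluster(cluster_ids)

train_clusters = {cluster_ids[i] for i in train_idx}
test_clusters = {cluster_ids[i] for i in test_idx}
assert train_clusters.isdisjoint(test_clusters)  # no entity on both sides
```

Record pairs are then sampled within each side of the split, so no entity seen during training reappears at test time.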

The paper goes on to define clear tasks for entity resolution systems and detail issues with current benchmarks:

“Our findings reveal that previous benchmarks biased the evaluation of the progress of current entity matching approaches, and there is still a long way to go to build effective entity matchers.”

Reuse

CC BY 4.0

Copyright

Olivier Binette

Record Linkage at the Duke GPSG Community Pantry

Thu, 23 Dec 2021 05:00:00 GMT

Figure from https://gpsg.duke.edu/resources-for-students/community-pantry/

Introduction

Duke’s Graduate and Professional Student Government (GPSG) has been operating a community food pantry for about five years. The pantry provides nonperishable food and basic need items to graduate and professional students on campus. There is a weekly bag program, where students order customized bags of food to be picked up on Saturdays, as well as an in-person shopping program open on Thursdays and Saturdays.

Figure 1: Weekly number of customers at the Pantry. The black line is a moving average of weekly visits.

The weekly bag program, which began in May 2018 and is still the most popular pantry offering, provides quite a bit of data regarding pantry customers and their habits. Some customers have ordered more than 80 times in the past 2 years, while others only ordered once or twice. For every bag order, we have the customer’s first name and last initial, an email address (which became mandatory around mid-2018), a phone number in a few cases, an address in some cases (for delivery), demographic information in some cases, and the food order information. Available quasi-identifying information is shown in Table 1 below.

Table 1: Quasi-identifying information provided on Qualtrics bag order forms. Note that phone number and address were only required while delivery was offered. Furthermore, most customers stop answering demographic questions after a few orders.
Question no.	Question	Answer form	Mandatory?
-	IP address	-	Yes
2	First name and last initial	Free form	Yes
3	Duke email	Free form	Yes
4	Phone number	Free form	No
6	Address	Free form	No
8	Food allergies	Free form	No
9	Number of members in household	1-2 or 3+	Yes
10	Want baby bag?	Yes or no	Yes
30	Degree	Multiple choices or Other	No
31	School	Multiple choices or Other	No
32	Year in graduate school	Multiple choices	No
33	Number of adults in household	Multiple choices	No
34	Number of children in household	Multiple choices	No

Gaining the most insight from this data requires linking order records from the same customer. Identifying individual customers and associating them with an order history allows us to investigate shopping recurrence patterns and identify potential issues with the pantry’s offering. For instance, we can know who stopped ordering from the pantry after the home delivery program ended. These are people who, most likely, do not have a car to get to the pantry but might benefit from new programs, such as a ride-share program or a gift card program.

This blog post describes the way in which records are linked at the Community Pantry. As we will see, the record linkage problem is not particularly difficult. It is not trivial either, however, and it does require care to ensure that it runs reliably and efficiently, and that it is intelligible and properly validated. This post goes into detail on these two aspects of the problem.

Regarding efficiency and reliability of the software system, I describe the development of a Python module, called GroupByRule, for record linkage at the pantry. This Python module is maintainable, documented and tested, ensuring reliability of the system and the potential for its continued use throughout the years, even as technical volunteers change at the pantry. Regarding validation of the record linkage system, I describe simple steps that can be taken to evaluate model performance.

Before jumping into the technical part, let’s take a step back to discuss the issue of food insecurity on campus.

Food Insecurity on Campus

It is often surprising to people that some Duke students might struggle to access food. After all, Duke is one of the richest campuses in the US, with its $12 billion endowment, high tuition, and substantial research grants. Prior to the COVID-19 pandemic, this wealth could be seen on campus and benefited many. Every weekday, there were several conferences and events with free food. Many other graduate students and I would participate in these events, getting 3-4 free lunches every week. Free food on campus is now a thing of the past, for the most part.

However, free lunch or not, it’s important to realize the many financial challenges which students can face. International students on F-1 and J-1 visas have limited employment opportunities in the US. Many graduate students are married, have children, or have other dependents who may not be eligible to work in the US either. Even if they are lucky enough to be paid a 9- or 12-month stipend, this stipend doesn’t go very far. For other students, going to Duke means living on a mixture of loans, financial aid, financial support from parents, and side jobs. Any imbalance in this rigid system can leave students having to compromise between their education and their health.

A 2019 study from the World Food Policy Center reported that about 19% of graduate and professional students at Duke experienced food insecurity in the past year. This means they were unable to afford a balanced and sufficient diet, they were afraid of not having enough money for food, or they skipped meals and went hungry due to lack of money. The GPSG Community Pantry has been leading efforts to expand food insecurity monitoring on campus – we are hoping to have more data in 2022 and in following years.

The Record Linkage Approach

The bag order form contains email addresses which are highly reliable for linkage. If two records have the same email, we know for certain that they are from the same customer. However, customers do not always enter the same email address when submitting orders. Despite the request to use a Duke email address, some customers use personal emails. Furthermore, Duke email addresses have two forms. For instance, my Duke email is both ob37@duke.edu and olivier.binette@duke.edu. Emails are therefore not sufficient for linkage. Phone numbers can be used as well, but these are only available for the period when home delivery was available.

First name and last initial can be used to supplement emails and phone numbers. Again, agreement on first name and last initial provides strong evidence for a match. On the other hand, people do not always enter their names in the same way.

Combining the use of emails, phone numbers, and names, we may therefore link records which agree on any one of these attributes. This is a simple deterministic record linkage approach which should be reliable enough for the data analysis use of the pantry.

Deterministic Record Linkage Rule

To be more precise, record linkage proceeds as follows:

Records are processed to clean and standardize the email, phone and name attributes. That is, leading and trailing whitespace are removed, capitalization is standardized, phone numbers are validated and standardized, and punctuation is removed from names.
Records which agree on any of their email, phone or name attributes are linked together.
Connected components of the resulting graph are computed in order to obtain record clusters.

This record linkage procedure is extremely simple. It relies on the fact that all three attributes are reliable indicators of a match and that, for two matching records, it is likely that at least one of these three attributes will be in agreement.
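
The cleaning step can be sketched as follows. This is an illustration rather than the pantry's actual preprocessing code: the helper names are hypothetical, and the phone rule assumes 10-digit US numbers with an optional leading 1. Values that fail validation become `None`, so they are treated as missing rather than as evidence of a match.

```python
import re


def clean_email(value):
    """Trim whitespace and lowercase; return None for empty values."""
    if value is None:
        return None
    value = value.strip().lower()
    return value or None


def clean_phone(value):
    """Keep digits only; accept 10-digit numbers, optionally prefixed by 1."""
    if value is None:
        return None
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def clean_name(value):
    """Trim, lowercase, drop punctuation, and collapse repeated spaces."""
    if value is None:
        return None
    value = re.sub(r"[^\w\s]", "", value.strip().lower())
    value = re.sub(r"\s+", " ", value).strip()
    return value or None


print(clean_email("  OB37@Duke.edu "))  # ob37@duke.edu
print(clean_phone("(919) 555-0100"))    # 9195550100
print(clean_name(" Olivier  B. "))      # olivier b
```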

Also, the simplicity of the approach allows the use of available additional information (such as IP address and additional questions) for model validation. If the use of this additional information does not highlight any flaws with the simple deterministic approach, then this means that the deterministic approach is already good enough. We will come back to this when discussing model validation techniques.

Implementation

Our deterministic record linkage system is implemented in Python with some generality. The goal is for the system to be able to adapt to changes in data or processes.

The fundamental component of the system is a LinkageRule class. LinkageRule objects can be fitted to data, providing either a clustering or a linkage graph. For instance, a LinkageRule might be a rule to link all records which agree on the email attribute. Another LinkageRule might summarize a set of other rules, such as taking the union or intersection of their links.

The interface is as follows:

from abc import ABC, abstractmethod


class LinkageRule(ABC):
    """
    Interface for a linkage rule which can be fitted to data.

    This abstract class specifies three methods. The `fit()` method fits the 
    linkage rule to a pandas DataFrame. The `graph` property can be used after 
    `fit()` to obtain a graph representing the linkage fitted to data.  The 
    `groups` property can be used after `fit()` to obtain a membership vector 
    representing the clustering fitted to data.
    """
    @abstractmethod
    def fit(self, df):
        pass

    @property
    @abstractmethod
    def graph(self):
        pass

    @property
    @abstractmethod
    def groups(self):
        pass

Note that group membership vectors, our representation for cluster groups, are meant to be a numpy integer array with entries indicating what group (cluster) a given record belongs to. Such a “groups” vector should not contain NA values; rather it should contain distinct integers for records that are not in the same cluster.
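
As a small illustration of this convention, the sketch below builds a membership vector from a column of values, giving every missing value its own group. It uses a plain Python list instead of a numpy array to stay dependency-free, and both the function name `membership_vector` and the email addresses are made up for the example.

```python
def membership_vector(values):
    """Map equal values to the same integer; give each missing value its own."""
    labels = {}
    groups = []
    next_label = 0
    for value in values:
        if value is None:
            # Missing values never match anything, including each other.
            groups.append(next_label)
            next_label += 1
        elif value in labels:
            groups.append(labels[value])
        else:
            labels[value] = next_label
            groups.append(next_label)
            next_label += 1
    return groups


emails = ["ob37@duke.edu", None, "ob37@duke.edu", None, "jd12@duke.edu"]
print(membership_vector(emails))  # [0, 1, 0, 2, 3]
```

Records 0 and 2 share a group, while the two records with missing emails stay in separate singleton groups.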

We will now define two other classes, Match and Any, which allow us to implement deterministic record linkage. The Match class implements an exact matching rule, while Any is the logical disjunction of a given set of rules. Our deterministic record linkage rule for the pantry will therefore be defined as follows:

rule = Any(Match("name"), Match("email"), Match("phone"))

Following the LinkageRule interface, this rule will then be fitted to the data and used as follows:

rule.fit(data)
data.groupby(rule.groups).last() # Get last visit data for all customers.

The benefit of this general interface is that it is extendable. By default, the Any class will return connected components when requesting group clusters. However, other clustering approaches could be used. Exact matching rules could also be relaxed to fuzzy matching rules based on string distance metrics or probabilistic record linkage. All of this can be implemented as additional LinkageRule subclasses in a way which is compatible with the above.
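
To give an idea of what a relaxed rule could compute, here is a sketch of fuzzy grouping based on the standard library's `difflib.SequenceMatcher`. It is not part of the module described in this post: the function name `fuzzy_groups` and the 0.85 threshold are assumptions, and the pairwise loop is quadratic in the number of records, so it only suits small datasets. Wrapping it in a LinkageRule subclass would amount to calling it from `fit()`.

```python
from difflib import SequenceMatcher


def fuzzy_groups(values, threshold=0.85):
    """Cluster strings whose similarity ratio reaches the threshold.

    Links are closed transitively with a small union-find structure, which
    matches the connected-components view of clustering.
    """
    parent = list(range(len(values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] is None or values[j] is None:
                continue  # missing values never match
            ratio = SequenceMatcher(None, values[i], values[j]).ratio()
            if ratio >= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = [find(i) for i in range(len(values))]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]


names = ["olivier b", "oliver b", "alex c", None, "olivier b"]
print(fuzzy_groups(names))  # [0, 0, 1, 2, 0]
```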

Let’s now work on the Match class. For efficiency, we’ll want Match to operate at the groups level. That is, if Match is called on a set of rules, then we’ll first compute groups for these rules, before computing the intersection of these groups. This core functionality is implemented in the function _groups_from_rules() below. The function _groups() is a simple wrapper to interpret strings as a matching rule on the corresponding column.

import pandas as pd
import numpy as np
import itertools
import igraph  # needed for igraph.union() in the Any class
from igraph import Graph

def _groups(rule, df):
    """
    Fit linkage rule to dataframe and return membership vector.

    Parameters
    ----------
    rule: string or LinkageRule
        Linkage rule to be fitted to the data. If `rule` is a string, then this 
        is interpreted as an exact matching rule for the corresponding column.
    df: DataFrame
        pandas Dataframe to which the rule is fitted.

    Returns
    -------
    Membership vector (i.e. integer vector) u such that u[i] indicates the 
    cluster to which dataframe row i belongs. 

    Notes
    -----
    NA values are considered to be non-matching.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({"fname":["Olivier", "Jean-Francois", "Alex"], 
      "lname":["Binette", "Binette", pd.NA]})

    Groups specified by distinct first names:
    >>> _groups("fname", df)
    array([2, 1, 0], dtype=int32)

    Groups specified by same last names:
    >>> _groups("lname", df)
    array([0, 0, 3], dtype=int32)

    Groups specified by a given linkage rule:
    >>> rule = Match("fname")
    >>> _groups(rule, df)
    array([2, 1, 0])
    """
    if (isinstance(rule, str)):
        arr = np.array(pd.Categorical(df[rule]).codes, dtype=np.int32) # Specifying dtype avoids overflow issues
        I = (arr == -1)  # NA value indicators
        arr[I] = np.arange(len(arr), len(arr)+sum(I))
        return arr
    elif isinstance(rule, LinkageRule):
        return rule.fit(df).groups
    else:
        raise NotImplementedError()


def _groups_from_rules(rules, df):
    """
    Fit linkage rules to data and return groups corresponding to their logical 
    conjunction.

    This function computes the logical conjunction of a set of rules, operating 
    at the groups level. That is, rules are fitted to the data, membership 
    vector are obtained, and then the groups specified by these membership 
    vectors are intersected.

    Parameters
    ----------
    rules: list[LinkageRule]
        List of strings or Linkage rule objects to be fitted to the data. 
        Strings are interpreted as exact matching rules on the corresponding 
        columns.

    df: DataFrame
        pandas DataFrame to which the rules are fitted.

    Returns
    -------
    Membership vector representing the cluster to which each dataframe row 
    belongs.

    Notes
    -----
    NA values are considered to be non-matching.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({"fname":["Olivier", "Jean-Francois", "Alex"], 
      "lname":["Binette", "Binette", pd.NA]})
    >>> _groups_from_rules(["fname", "lname"], df)
    array([2, 1, 0])
    """

    arr = np.array([_groups(rule, df) for rule in rules]).T
    groups = np.unique(arr, axis=0, return_inverse=True)[1]
    return groups
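
The intersection step can also be traced without numpy. In the sketch below (a hypothetical helper, `intersect_groups`, written for illustration), two records end up in the same group only if they share a label in every input vector. The integer labels may differ from those returned by `np.unique`, which sorts the rows first, but the resulting partition is the same.

```python
def intersect_groups(*membership_vectors):
    """Records share a group only if they share a group in every vector."""
    labels = {}
    groups = []
    for key in zip(*membership_vectors):
        # The tuple of per-rule labels identifies the intersected group.
        groups.append(labels.setdefault(key, len(labels)))
    return groups


by_name = [0, 0, 1, 1]
by_email = [0, 1, 2, 2]
print(intersect_groups(by_name, by_email))  # [0, 1, 2, 2]
```

Records 0 and 1 agree on name but not on email, so the conjunction keeps them apart; records 2 and 3 agree on both and stay together.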

We can now implement Match as follows. Note that the Graph representation of the clustering is only computed if and when needed.

class Match(LinkageRule):
    """
    Class representing an exact matching rule over a given set of columns.

    Attributes
    ----------
    graph: igraph.Graph
        Graph representing linkage fitted to the data. Defaults to None and is 
        instantiated after the `fit()` function is called.

    groups: integer array
        Membership vector for the linkage clusters fitted to the data. Defaults 
        to None and is instantiated after the `fit()` function is called.

    Methods
    -------
    fit(df)
        Fits linkage rule to the given dataframe.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({"fname":["Olivier", "Jean-Francois", "Alex"], 
    "lname":["Binette", "Binette", pd.NA]})

    Link records which agree on both the "fname" and "lname" fields.
    >>> rule = Match("fname", "lname")

    Fit linkage rule to the data.
    >>> _ = rule.fit(df)

    Construct deduplicated dataframe, retaining only the first record in each cluster.
    >>> _ = df.groupby(rule.groups).first()
    """

    def __init__(self, *args):
        """
        Parameters
        ----------
        args: list containing strings and/or LinkageRule objects.
            The `Match` object represents the logical conjunction of the set of 
            rules given in the `args` parameter. 
        """
        self.rules = args
        self._groups = None  # set by fit()
        self._graph = None  # built lazily by the graph property
        self._update_graph = False
        self.n = None

    def fit(self, df):
        self._groups = _groups_from_rules(self.rules, df)
        self._update_graph = True
        self.n = df.shape[0]

        return self

    @property
    def groups(self):
        return self._groups

One more method is needed to complete the implementation of a LinkageRule, namely the graph property. This property returns a Graph object corresponding to the matching rule. The graph is built as follows. First, we construct an inverted index for the clustering. That is, we construct a dictionary associating to each cluster the nodes which it contains. Then, an edge list is obtained by linking all pairs of nodes which belong to the same cluster. Note that the pure Python implementation below is not efficient for large clusters. This is not a problem for now since we will generally avoid computing this graph.

# Part of the definition of the `Match` class:
    @property
    def graph(self) -> Graph:
        if self._update_graph:
            # Inverted index
            clust = pd.DataFrame({"groups": self.groups}
                                 ).groupby("groups").indices
            self._graph = Graph(n=self.n)
            self._graph.add_edges(itertools.chain.from_iterable(
                itertools.combinations(c, 2) for c in clust.values()))
            self._update_graph = False
        return self._graph
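
The same construction can be followed on a toy membership vector using only the standard library. The helper name `clique_edges` is hypothetical; the sketch builds the inverted index with a dictionary and then lists every within-cluster pair.

```python
from collections import defaultdict
from itertools import combinations


def clique_edges(groups):
    """Return all within-cluster pairs for a membership vector."""
    index = defaultdict(list)  # inverted index: cluster label -> record ids
    for record, label in enumerate(groups):
        index[label].append(record)
    return [pair for members in index.values()
            for pair in combinations(members, 2)]


print(clique_edges([0, 1, 0, 0, 1]))
# [(0, 2), (0, 3), (2, 3), (1, 4)]
```

A cluster with k records contributes k(k-1)/2 edges, which is why this representation becomes costly for large clusters.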

Finally, let’s implement the Any class. Its purpose is to take the union (i.e. logical disjunction) of a set of rules. Just like for Match, we can choose to operate at the groups or graph level. Here we’ll work at the groups level for efficiency. That is, given a set of rules, Any will first compute their corresponding clusters before merging overlapping clusters.

There are quite a few different ways to efficiently merge clusters. Here we’ll merge clusters by computing a “path graph” representation, taking the union of these graphs, and then computing connected components. For a given cluster, say containing records a, b, and c, the “path graph” links records as a path a–b–c.

First, we define the functions needed to compute path graphs:

def pairwise(iterable):
    """
    Iterate over consecutive pairs:
        s -> (s[0], s[1]), (s[1], s[2]), (s[2], s[3]), ...

    Note
    ----
    Current implementation is from itertools' recipes list available at 
    https://docs.python.org/3/library/itertools.html
    """
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)


def _path_graph(rule, df):
    """
    Compute path graph corresponding to the rule's clustering: cluster elements 
    are connected as a path.

    Parameters
    ----------
    rule: string or LinkageRule
        Linkage rule for which to compute the corresponding path graph 
        (strings are interpreted as exact matching rules for the corresponding column).
    df: DataFrame
        Data to which the linkage rule is fitted.

    Returns
    -------
    Graph object such that nodes in the same cluster (according to the fitted 
    linkage rule) are connected as graph paths.
    """
    gr = _groups(rule, df)
    
    # Inverted index
    clust = pd.DataFrame({"groups": gr}
                         ).groupby("groups").indices
    graph = Graph(n=df.shape[0])
    graph.add_edges(itertools.chain.from_iterable(
        pairwise(c) for c in clust.values()))

    return graph
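To see why path graphs are enough: linking the k records of a cluster as a path gives the same connected components as linking every pair, but with only k - 1 edges instead of k(k - 1)/2. Below is a minimal, dependency-free sketch of the merge step. The function `merge_clusterings` and its union-find helper are hypothetical names used for illustration only; they are not part of GroupByRule, which relies on igraph as shown above, and the sketch ignores missing values.

```python
from collections import defaultdict
from itertools import tee


def pairwise(iterable):
    """Iterate over consecutive pairs: s -> (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def merge_clusterings(*memberships):
    """
    Merge several clusterings of the same n records (hypothetical helper).

    Each membership vector assigns a cluster label to every record. Two
    records end up together if any clustering links them, directly or
    through a chain of overlapping clusters.
    """
    n = len(memberships[0])
    parent = list(range(n))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for membership in memberships:
        # Inverted index: cluster label -> record indices in that cluster.
        clusters = defaultdict(list)
        for record, label in enumerate(membership):
            clusters[label].append(record)
        # Path-graph edges: k - 1 edges for a cluster of size k.
        for records in clusters.values():
            for i, j in pairwise(records):
                parent[find(i)] = find(j)

    # Relabel connected components as 0, 1, 2, ... in order of appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


by_name = [0, 0, 1, 2, 3]   # records 0 and 1 share a name
by_email = [0, 1, 1, 2, 2]  # records 1 and 2 share an email; so do 3 and 4
merged = merge_clusterings(by_name, by_email)
print(merged)  # [0, 0, 0, 1, 1]
```

Records 0 and 2 never match directly, but they end up in the same cluster because both are linked to record 1.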

We can now implement the Any class:

class Any(LinkageRule):
    """
    Class representing the logical disjunction of linkage rules.

    Attributes
    ----------
    graph: igraph.Graph
        Graph representing linkage fitted to the data. Defaults to None and is 
        instantiated after the `fit()` function is called.

    groups: integer array
        Membership vector for the linkage clusters fitted to the data. Defaults 
        to None and is instantiated after the `fit()` function is called.

    Methods
    -------
    fit(df)
        Fits linkage rule to the given dataframe.
    """

    def __init__(self, *args):
        """
        Parameters
        ----------
        args: list containing strings and/or LinkageRule objects.
            The `Any` object represents the logical disjunction of the set of 
            rules given by `args`. 
        """
        self.rules = args
        self._graph = None
        self._groups = None
        self._update_groups = False

    def fit(self, df):
        self._update_groups = True
        graphs_vect = [_path_graph(rule, df) for rule in self.rules]
        self._graph = igraph.union(graphs_vect)
        return self

    @property
    def groups(self):
        if self._update_groups:
            self._update_groups = False
            self._groups = np.array(
                self._graph.clusters().membership)
        return self._groups

    @property
    def graph(self) -> Graph:
        return self._graph

The complete Python module (still under development) implementing this approach can be found on GitHub at OlivierBinette/GroupByRule.

Limitations

There are quite a few limitations with this simple deterministic approach. We’ll see in the model evaluation section that these do not affect performance to a large degree. However, for a system used with more data or over a longer timeframe, these should be carefully considered.

First, the deterministic linkage does not allow the consideration of contradictory evidence. For instance, if long-form Duke email addresses are provided on two records and do not agree (e.g. “olivier.binette@duke.edu” and “olivier.bonhomme@duke.edu” are provided), then we know for sure that the records do not correspond to the same individual, even if the same name was provided (here Olivier B.). The consideration of such evidence could rely on probabilistic record linkage, where each record pair is assigned a match probability.

Second, the use of connected components to resolve transitivity can be problematic, as a single spurious link could connect two large clusters by mistake. More sophisticated graph clustering techniques, in combination with probabilistic record linkage, would be required to mitigate the issue.

Model Evaluation

I cannot share any of the data which we have at the Pantry. However, I can describe general steps to be taken to evaluate model performance in practice.

Pairwise Precision and Recall

Here we will evaluate linkage performance using pairwise precision $P$ and recall $R$. The precision $P$ is defined as the proportion of predicted links which are true matches, whereas the recall $R$ is the proportion of true matches which are correctly predicted. That is, if $TP$ is the number of true positive links, $PL$ the number of predicted links, and $TM$ the number of true matches, then we have

$$P = \frac{TP}{PL}, \qquad R = \frac{TP}{TM}.$$

Estimating Precision

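It is helpful to express precision and recall in cluster form, where cluster elements are all interlinked. Let $\mathcal{C}$ be the set of true clusters and let $\hat{\mathcal{C}}$ be the set of predicted clusters. For a given cluster $\hat c \in \hat{\mathcal{C}}$, let $\mathcal{C} \cap \hat c$ be the restriction of the clustering $\mathcal{C}$ to $\hat c$. Then we have

$$P = \frac{\sum_{\hat c \in \hat{\mathcal{C}}} \sum_{e \in \mathcal{C} \cap \hat c} \binom{|e|}{2}}{\sum_{\hat c \in \hat{\mathcal{C}}} \binom{|\hat c|}{2}}.$$

The denominator can be computed exactly, while the numerator can be estimated by randomly sampling clusters $\hat c \in \hat{\mathcal{C}}$, breaking them up into true clusters $e \in \mathcal{C} \cap \hat c$, and then computing the sum of the combinations $\binom{|e|}{2}$. Importance sampling could be used to reduce the variance of the estimator, but it does not seem necessary for the scale of the data which we have at the Pantry, where each predicted cluster can be examined quite quickly.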

In practice, the precision estimation process can be carried out as follows:

Sample predicted clusters at random (in the case of the Pantry, we can take all predicted clusters).
Make a spreadsheet with all the records corresponding to the sampled clusters.
Sort the spreadsheet by predicted cluster ID.
Add a new empty column to the spreadsheet, called “trueSubClusters”.
Separately look at each predicted cluster. If the cluster should be broken up in multiple parts, use the “trueSubClusters” column to provide identifiers for true cluster membership. Note that these identifiers do not need to match across predicted clusters.

The spreadsheet can then be read in and processed in a straightforward way to obtain an estimated precision value.
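As a rough illustration of that processing step, here is a minimal sketch that works on (predicted cluster ID, “trueSubClusters” ID) pairs, one per record. It assumes every record of each reviewed cluster is included; the function name `estimate_precision` and the toy rows are made up for the example. When all predicted clusters are reviewed, as at the Pantry, the ratio is the pairwise precision itself; with a random sample of clusters it is only an estimate.

```python
from collections import Counter
from math import comb


def estimate_precision(rows):
    """
    Pairwise precision from manually reviewed records (illustrative sketch).

    rows: iterable of (predicted_cluster_id, true_sub_cluster_id) pairs,
        one per record. Sub-cluster IDs only need to be unique within a
        predicted cluster.
    """
    rows = list(rows)
    predicted_sizes = Counter(pred for pred, _ in rows)
    sub_cluster_sizes = Counter(rows)  # keyed by (predicted, sub-cluster)

    # Number of links implied by the predicted clusters.
    predicted_links = sum(comb(k, 2) for k in predicted_sizes.values())
    # Number of those links that survive the manual split into true clusters.
    true_links = sum(comb(k, 2) for k in sub_cluster_sizes.values())

    if predicted_links == 0:
        raise ValueError("No predicted links: precision is undefined.")
    return true_links / predicted_links


# Toy data: cluster "c1" should have been split in two, "c2" is correct.
rows = [("c1", "a"), ("c1", "a"), ("c1", "b"), ("c2", "a"), ("c2", "a")]
precision = estimate_precision(rows)
print(precision)  # 0.5
```

Here the predicted clusters imply 3 + 1 = 4 links, of which 1 + 1 = 2 are correct.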

Estimating Recall

Estimating recall is a bit trickier than estimating precision, but we can make one assumption to simplify the process. Assume that precision is exactly 1, or very close to 1, so that all predicted clusters can roughly be taken at face value. Estimating recall then boils down to the problem of identifying which predicted clusters should be merged together.

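Indeed, writing $\mathcal{C}$ for the set of true clusters, $\hat{\mathcal{C}}$ for the set of predicted clusters, and $\hat{\mathcal{C}} \cap c$ for the restriction of the predicted clustering to a true cluster $c$, we can write

$$R = \frac{\sum_{c \in \mathcal{C}} \sum_{e \in \hat{\mathcal{C}} \cap c} \binom{|e|}{2}}{\sum_{c \in \mathcal{C}} \binom{|c|}{2}}.$$

If precision is 1, then the denominator can be computed from the sizes of predicted clusters which are identified to be merged. On the other hand, the numerator simplifies to $\sum_{\hat c \in \hat{\mathcal{C}}} \binom{|\hat c|}{2}$, which can be computed exactly from the sizes of predicted clusters.

In the case of the Pantry, wrongly separated clusters are likely to be due to small differences in names and emails. Our procedure to identify clusters which should have been merged together is as follows: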

Make a spreadsheet containing canonical customer records (one representative record for each predicted individual customer).
Create a new empty column named “trueClustersA”.
Sort the spreadsheet by name.
Go through the spreadsheet from top to bottom, looking at whether or not consecutive predicted clusters should be merged together. If so, write a corresponding cluster membership ID in the “trueClustersA” column.
Create a new empty column named “trueClustersB”.
Sort the spreadsheet by email.
Go through the spreadsheet from top to bottom, looking at whether or not consecutive predicted clusters should be merged together. If so, write a corresponding cluster membership ID in the “trueClustersB” column.

This process might not catch all wrongly separated clusters, but it is likely to find many of the errors due to different ways of writing names and different email addresses. The resulting spreadsheet can then easily be processed to obtain an estimated recall. If we were working with a larger dataset, we’d have to use further blocking to restrict our consideration to a more manageable subset of the data.
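A minimal sketch of that last processing step follows, under the assumption that precision is 1 and that the “trueClustersA” and “trueClustersB” columns have already been reconciled into a single true cluster ID per flagged cluster. The function name `estimate_recall` and the toy numbers are hypothetical.

```python
from collections import defaultdict
from math import comb


def estimate_recall(cluster_sizes, true_cluster_of):
    """
    Pairwise recall assuming precision is 1 (illustrative sketch).

    cluster_sizes: dict mapping predicted cluster ID -> number of records.
    true_cluster_of: dict mapping predicted cluster ID -> true cluster ID,
        given only for the predicted clusters flagged for merging.
        Unflagged clusters are treated as complete true clusters.
    """
    true_sizes = defaultdict(int)
    for cluster, size in cluster_sizes.items():
        if cluster in true_cluster_of:
            key = ("merged", true_cluster_of[cluster])
        else:
            key = ("unflagged", cluster)  # stays a true cluster on its own
        true_sizes[key] += size

    # With precision 1, every predicted link is a true positive.
    predicted_links = sum(comb(k, 2) for k in cluster_sizes.values())
    true_links = sum(comb(k, 2) for k in true_sizes.values())

    if true_links == 0:
        raise ValueError("No true links: recall is undefined.")
    return predicted_links / true_links


# Toy data: predicted clusters "p1" and "p2" are the same customer.
sizes = {"p1": 3, "p2": 2, "p3": 4}
recall = estimate_recall(sizes, {"p1": "t1", "p2": "t1"})
print(recall)  # 0.625
```

In this toy case there are 3 + 1 + 6 = 10 predicted links, against 10 + 6 = 16 true links once "p1" and "p2" are merged.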

Results

I used the above procedures to estimate precision and recall of our simple deterministic approach to deduplicate the Pantry’s data. There was a total of 3281 bag order records for 689 estimated customers. The results are below.

Estimated Precision: 92%

Precision is somewhat low due to about 3 relatively large clusters (around 30-50 records each) which should have been broken up in a few parts. 2% precision was lost due to a couple that shared a phone number, where each had about 20 order records. The vast majority of spurious links were tied to bag orders for which only the first name was provided (e.g. “Sam”). The use of negative evidence to distinguish between individuals would help resolve these cases.

Estimated Recall: 99.6%

This is certainly an overestimate, but it does show that missing links are not obviously showing up. Given the structure of the Pantry data, it is likely that recall is indeed quite high.

Final thoughts

There are many ways in which the record linkage approach could be improved. As previously discussed, probabilistic record linkage would allow the consideration of negative evidence and the use of additional quasi-identifying information (such as IP addresses and other responses on the bag order forms). I’m looking forward to building on the GroupByRule Python module to provide a user-friendly and unified interface to more flexible methodology.

However, it is important to ensure that any record linkage approach is intelligible and rooted in a good understanding of the underlying data. In this context, the use of a well-thought-out deterministic approach can provide good performance, at least as a first step or baseline for comparison. Furthermore, it is important to spend sufficient time investigating the results of the linkage to evaluate performance. I have highlighted simple steps which can be taken to estimate precision and make a good effort at identifying missing links. This is highly informative for model validation, improvement, and for the interpretation of any following results.

References

Campbell, Kevin M., Dennis Deck, and Antoinette Krupski. 2008. “Record Linkage Software in the Public Domain: A Comparison of Link Plus, the Link King, and a ’Basic’ Deterministic Algorithm.” Health Informatics Journal 14 (1): 5–15.

Gomatam, Shanti, Randy Carter, Mario Ariet, and Glenn Mitchell. 2002. “An Empirical Comparison of Record Linkage Procedures.” Statistics in Medicine 21 (10): 1485–96. https://doi.org/10.1002/sim.1147.

Monge, Alvaro E., and Charles P. Elkan. 1997. “An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records.” Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, 23–29. https://doi.org/10.1.1.28.8405.

Potosky, Arnold L., Gerald F. Riley, James D. Lubitz, Renee M. Mentnech, and Larry G. Kessler. 1993. “Potential for Cancer Related Health Services Research Using a Linked Medicare-Tumor Registry Database.” Medical Care 31 (8): 732–48. https://doi.org/10.1097/00005650-199308000-00006.

Tromp, Miranda, Anita C. Ravelli, Gouke J. Bonsel, Arie Hasman, and Johannes B. Reitsma. 2011. “Results from Simulated Data Sets: Probabilistic Record Linkage Outperforms Deterministic Record Linkage.” Journal of Clinical Epidemiology 64 (5): 565–72. https://doi.org/10.1016/j.jclinepi.2010.05.008.

Olivier Binette

Validating function arguments in R

Sun, 15 Nov 2020 05:00:00 GMT

Update: The assert package is now available on CRAN:

install.packages("assert")

I was programming a Gibbs sampler the other day and all hell broke loose: small errors were hard to trace back to the source of the problem and debugging was a pain.

The bugs could have been caught much earlier if I had properly validated the input arguments of my various helper functions. So I decided it was time for me to learn how to do this properly.

Validating function input arguments in R

The easiest way is to manually incorporate checks.

mySum <- function(a, b) {
  # Both arguments must be numeric vectors.
  if (!is.numeric(a) || !is.numeric(b)) {
    stop("Arguments should be numeric.")
  }
  # Element-wise addition requires equal lengths.
  if (length(a) != length(b)) {
    stop("Arguments should be of the same length.")
  }

  return(a+b)
}

This works well enough, but it takes up a lot of space and you have to manually write up the description of the errors.

A first solution

Let’s use the assertthat package.

library(assertthat)

mySum <- function(a, b) {
  assert_that(is.numeric(a), is.numeric(b))
  assert_that(length(a) == length(b))

  return(a+b)
}

This is neater, but the error messages are not very descriptive.

> mySum(1, "1")
        Error: b is not a numeric or integer vector

What is b here? What arguments in the function call caused the error? It’s a bit hard to tell, especially if the call to this function is hidden in some large Gibbs sampler.

The `assert` function

My solution is the assert function which you can find on my GitHub Gist.

source("assert.R")

Usage is similar to what we did above:

mySum <- function(a, b) {
  assert(is.numeric(a), is.numeric(b))
  assert(length(a) == length(b))

  return(a+b)
}

But now we have much more descriptive error messages.

> mySum(1, "1")
        Error: in mySum(a = 1, b = "1")
        Failed checks: 
            is.numeric(b)

Olivier Binette

The Credibility of confidence intervals

Wed, 11 Sep 2019 04:00:00 GMT

Andrew Gelman and Sander Greenland went “head to head” in a discussion on the interpretation of confidence intervals in The BMJ. Greenland stated the following, which doesn’t seem quite right to me.

The label “95% confidence interval” evokes the idea that we should invest the interval with 95/5 (19:1) betting odds that the observed interval contains the true value (which would make the confidence interval a 95% bayesian posterior interval). This view may be harmless in a perfect randomized experiment with no background information to inform the bet (the original setting for the “confidence” concept); more often, however […]

It’s not true that “this view may be harmless in a perfect randomized experiment”, and I’m not sure where this “original setting of the confidence concept” is coming from. In fact, even in the simplest possible cases, the posterior probability of a confidence interval can be pretty much anything.

Imagine a “perfect randomized experiment”, where we use a test $T$ of the hypothesis $H_0: \theta = \theta_0$ which, for some reason, has zero power. If $T$ rejects $H_0$, meaning that the associated confidence interval excludes $\theta_0$, then we are certain that $H_0$ holds and the posterior probability of the confidence interval is zero.

Let this sink in. For some (albeit trivial) statistical tests, observing a rejection brings evidence in favor of the null.

The power of the test carries information, and the posterior probability of a confidence interval (or of a hypothesis) depends on this power among other things, even in perfect randomized experiments.
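To make the role of power concrete, here is a small Bayes' rule computation. It is a simplified sketch that treats the alternative as a single point with a known power; the function name and the prior and level values are illustrative assumptions, not taken from the BMJ discussion.

```python
def posterior_null_given_rejection(prior_null, level, power):
    """
    Posterior probability of the null hypothesis given that the test rejects.

    prior_null: prior probability that the null holds.
    level: probability of rejecting when the null holds (type I error rate).
    power: probability of rejecting when the null is false.
    """
    reject_and_null = prior_null * level
    reject_and_alt = (1 - prior_null) * power
    return reject_and_null / (reject_and_null + reject_and_alt)


# Zero power: a rejection can only happen under the null.
print(posterior_null_given_rejection(prior_null=0.5, level=0.05, power=0.0))  # 1.0

# High power: the same rejection is now strong evidence against the null.
print(posterior_null_given_rejection(prior_null=0.5, level=0.05, power=0.8))
```

With the same data summary (“the test rejected at the 5% level”), the posterior probability of the null moves from 1 to roughly 0.06 depending only on the power.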

Olivier Binette

Global Bounds for the Jensen Functional

Sun, 19 May 2019 04:00:00 GMT

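Given a convex function $f : \mathbb{R} \to \mathbb{R}$ and a random variable $X$ on $\mathbb{R}$, the Jensen functional of $f$ and $X$ is defined as

$$\mathcal{J}(f, X) = \mathbb{E}[f(X)] - f(\mathbb{E}[X]).$$

The well-known Jensen inequality states that $\mathcal{J}(f, X) \geq 0$. For instance, if $f(x) = x^2$, then $\mathcal{J}(f, X) = \mathrm{Var}(X)$. If $\mu$ and $\nu$ are two probability measures, and $f$ is convex with $f(1) = 0$, then $\mathcal{J}\left(f, \frac{d\mu}{d\nu}\right)$, with $\frac{d\mu}{d\nu}$ viewed as a random variable under $\nu$, is a so-called $f$-divergence between probability measures such as the total variation distance, the Kullback-Leibler divergence, the $\chi^2$ divergence, etc.

If $X$ is bounded, then a converse to the Jensen inequality can be easily obtained as follows: let $m$ and $M$ be the infimum and supremum of $X$, and write $X = \alpha m + (1-\alpha) M$ for some random variable $\alpha$ taking values in $[0,1]$. Then $\mathbb{E}[f(X)] \leq \mathbb{E}[\alpha] f(m) + (1 - \mathbb{E}[\alpha]) f(M)$ and consequently, with $p = \mathbb{E}[\alpha]$,

$$\mathcal{J}(f, X) \leq p f(m) + (1-p) f(M) - f(pm + (1-p)M).$$

When $p$ is unknown in practice, maximizing the above over all possibilities yields the bound

$$\mathcal{J}(f, X) \leq \max_{p \in [0,1]} \left\{ p f(m) + (1-p) f(M) - f(pm + (1-p)M) \right\},$$

which is Theorem C in Simic (2011).

Some examples

Variance bound. Consider for example the case where $f(x) = x^2$, so that $\mathcal{J}(f, X) = \mathrm{Var}(X)$. Then for $X$ taking values in say $[0,1]$, the above bounds read as

$$\mathrm{Var}(X) \leq \mathbb{E}[X]\left(1 - \mathbb{E}[X]\right) \leq \frac{1}{4},$$

which is a well-known elementary result.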
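As a quick numerical sanity check of the variance example, here is a minimal Python sketch. It assumes the global bound takes the form of the maximum over p in [0, 1] of p f(m) + (1 - p) f(M) - f(p m + (1 - p) M), where m and M bound the values of X. The function names are mine, and the maximization is only approximated on a grid.

```python
import random


def jensen_functional(f, xs):
    """Empirical Jensen functional: mean of f(x) minus f of the mean."""
    mean = sum(xs) / len(xs)
    return sum(f(x) for x in xs) / len(xs) - f(mean)


def global_bound(f, m, M, grid=10_000):
    """Grid approximation of the maximum over p in [0, 1] of
    p*f(m) + (1-p)*f(M) - f(p*m + (1-p)*M)."""
    return max(
        p * f(m) + (1 - p) * f(M) - f(p * m + (1 - p) * M)
        for p in (i / grid for i in range(grid + 1))
    )


def square(x):
    return x * x


random.seed(0)
sample = [random.random() for _ in range(10_000)]  # X uniform on [0, 1]

variance = jensen_functional(square, sample)  # for x^2 this is the variance
bound = global_bound(square, 0.0, 1.0)        # maximum of p(1-p), at p = 1/2

print(variance, bound)
```

The empirical variance of the uniform sample sits near 1/12, comfortably below the global bound of 1/4.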

-divergence bounds. In (Binette, 2019), I show how we can use similar ideas to get best-possible reverse Pinsker inequalities: upper bounds on -divergences in terms of the total variation distance and likelihood ratio extremums. In particular, with the Kullback-Leibler divergence between the probability measures and , we find that if and , then

Applying again the Jensen functional bound to , we obtain

and this implies the range of values theorem

Variations

In cases where is unknown and optimizing over all possibilities is not quite feasible, we can use the following trick.

Let be the term involved in the maximization step of . Then is concave with , and hence for any we have that

In particular, taking , we obtain the result of Simic (2008) stating that

When is differentiable (this assumption is not strictly necessary but it facilitates the statements), then we can use the concavity of (using the fact that ) to very easily obtain

which is an inequality attributed to S.S. Dragomir (1999), although I haven’t managed to find the original paper yet.

Olivier Binette

Two sampling algorithms for trigonometric densities

Mon, 15 Apr 2019 04:00:00 GMT

Trigonometric densities (or non-negative trigonometric sums) are probability density functions of circular random variables (i.e. $2\pi$-periodic densities) which take the form

for some real coefficients which are such that and . These provide flexible models of circular distributions. Circular density modelling comes up in studies of the mechanisms of animal orientation and in bioinformatics in relation to the protein structure prediction problem (the secondary structure of a protein, i.e. the way its backbone folds, is determined by a sequence of angles).

Here I am discussing two simple sampling algorithms for such trigonometric densities. The first is the rejection sampling algorithm proposed in Fernández-Durán et al. (2014) and the second uses negative mixture sampling.

Parametrizing trigonometric densities

By Fejér’s Theorem, the conditions on the coefficients and can be stated as follows: there exists a vector of complex coefficients with and satisfying

This provides an explicit parametrization of the space of trigonometric densities in terms of a complex hypersphere. See Fernández-Durán (2004) for more details.

Density basis of the trigonometric polynomials

In Binette & Guillotte (2019), we studied the De la Vallée Poussin density basis of the trigonometric polynomials given by

These can be used to express trigonometric densities as mixtures of probability density functions (instead of the functions and ), and the change of basis formula follows from the expression

where

We’re using the complex functions instead of and simply because they are neater to work with; it doesn’t change much otherwise.

We also show in our paper that if and , then

This provides an easy formula for sampling from the basis functions and their mixtures.

Algorithm 1: Naive rejection sampling

Given a uniform upper bound on the family of trigonometric densities, we can sample from a given using simple rejection sampling as follows:

Let be uniformly distributed over ;
If , then return ; otherwise return to step 1.

Now the problem is to figure out a good upper bound . The most basic idea is to do as in Fernández-Durán et al. (2014) and to apply the Cauchy-Schwarz inequality

Can we find a better bound? I think that would work, but I have no clue how to prove it….

Let’s implement this in R.

Implementation

First we need a trigonometric density model.

trig_function <- function(c_real, complex=NULL) {
  # Returns the trigonometric function defined as either:
  #     f(u) = 1/(2\pi) + \sum_{k=1}^{n} c_real[2*k-1] \sin(k u) + c_real[2*k] \cos(ku),
  # or
  #   f(u) = \| \sum_{k=0}^n complex e^{i k u} \|^2,
  # where n is the degree of the polynomial.
  #
  # Args
  #   c_real: Vector of 2*n real numbers, where n is the degree of 
  #           the trigonometric polynomial.
  #   complex: Vector of (n+1) complex numbers.
  
  if (!is.null(complex)) {
    lambd <- function(u) {
      n = length(complex) - 1
      k = 0:n
      return(abs(sum(complex * exp(u * k * 1i)))**2)
    }
  }
  else {
    lambd <- function(u) {
      n = length(c_real)/2
      k = 1:n
      return(1/(2*pi) + sum(c_real[2*k - 1] * sin(k*u)) + sum(c_real[2*k] * cos(k*u)))
    }
  }
  return(Vectorize(lambd));
}

We can also generate random trigonometric densities of a fixed degree as follows.

rtrig <- function(n) {
  # n is the number of complex coefficients, so the density has degree n - 1.
  u = rnorm(n);
  v = rnorm(n);
  c_comp = u + v*1i;
  c_comp = c_comp / (sqrt(2*pi*sum(abs(c_comp)**2)));
  return(trig_function(complex=c_comp))
}

Usage is like this:

u = seq(0, 2*pi, 0.005)
plot(u, rtrig(10)(u), type="l")

And finally we can implement the naive rejection sampling algorithm.

naive_rejection_sampling <- function(f, n) {
  # Returns a random variate following the trigonometric density f of degree n.
  drawn = FALSE
  while(!drawn) {
    x = runif(1)*2*pi
    y = runif(1)*(n+1) / (2*pi)
    if (y < f(x)) {
      drawn = TRUE
    }
  }
  return(x);
}

Algorithm 2: Negative Mixture Sampling

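Another approach to simulate from trigonometric densities relies on the De la Vallée Poussin mixture representation. That is, any trigonometric density $f$ of degree $n$ can be written as

$$f = \alpha f_a - (\alpha - 1) f_b,$$

where $\alpha \geq 1$, $f_a = \sum_{j=0}^{2n} a_j C_{j,n}$, and $f_b = \sum_{j=0}^{2n} b_j C_{j,n}$ for nonnegative weight vectors $a$ and $b$ which each sum to one. We can assume that $a_j b_j = 0$ for every $j$; i.e. there is no redundancy in the components of $a$ and $b$. The density $f_b$ accounts for negative weights in the mixture representation of $f$ using the De la Vallée Poussin densities $C_{j,n}$.

We can now sample from $f$ using samples from $f_a$ and a simple rejection method.

Algorithm 2.

Let $X \sim f_a$.
Return $X$ with probability $f(X) / (\alpha f_a(X))$; otherwise return to step 1.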

Implementation

De la Vallée Poussin densities and their random variate generator.

dvallee <- function(u, j, n) {
  # De la Vallée Poussin density $C_{j,n}(u)$
  
  return(2^n * (1+cos(u - (2*pi*j)/(2*n+1)))^n / (2*pi*choose(2*n, n)))
}

rvallee <- function(j, n, m) {
  # Returns m random variates following the De la Vallée Poussin density $C_{j,n}$.
  
  V = runif(m) > 0.5
  W = rbeta(m, 1/2, 1/2 + n)
  return((1-2*V)*acos(1-2*W) + (2*pi*j)/(2*n + 1))
}

Usage:

s = rvallee(2, 5, 10000)
u = seq(-pi, pi, 0.05)
hist(s, prob=TRUE, xlim=c(-pi,pi))
lines(u, dvallee(u, 2, 5), col=2)

De la Vallée Poussin mixtures.

dValleeMixture <- function(coeffs) {
  # De la Vallée Poussin mixture densities
  
  n = (length(coeffs) - 1)/2;
  
  lambd <- function(u) {
    j = 0:(2*n)
    return(sum(dvallee(u, j, n) * coeffs))
  }
  
  return(Vectorize(lambd))
}

rValleeMixture <- function(coeffs) {
  # Random sample from a De la Vallée Poussin mixture density. The mixture weights are allowed to take negative values.
  
  f = dValleeMixture(coeffs)
  n = (length(coeffs) - 1)/2

  a = coeffs * (coeffs > 0)
  b = coeffs * (coeffs < 0)
  
  alpha = sum(a)
  a = a / alpha
  b = b / (1-alpha)
  fa = dValleeMixture(a)
  
  drawn = FALSE
  while(!drawn) {
    # Sample from f_a
    i = sample(0:(2*n), 1, prob = a)
    x = rvallee(i, n, 1)
    if ( runif(1) <  f(x)/(alpha*fa(x))) {
      drawn = TRUE
    }
  }
  
  return(x %% (2*pi))
}

Example:

coeffs = c(0.55, -0.15, 0.55, 0, 0, 0,0.05)
f = dValleeMixture(coeffs)
u = seq(0, 2*pi, 0.05)
s = replicate(50000, rValleeMixture(coeffs))
hist(s, prob=T, ylim=c(0, 0.6))
lines(u, f(u), col=2)

Other things we could do:

The black box Lipschitz sampling algorithm can also be used to sample from trigonometric densities. This requires computing good upper bounds on the Lipschitz constant of the density, which should be doable using the De la Vallée Poussin mixture representation.

Olivier Binette

The Significance of the adjusted R squared coefficient

Wed, 10 Apr 2019 04:00:00 GMT

My friend Anthony Coache and I have been curious about uses and misuses of the adjusted $R^2$ coefficient which comes up in linear regression for model comparison and as a measure of “goodness of fit”. We were underwhelmed by the depth of the literature arguing for its use, and wanted to show exactly how it behaves under certain sets of assumptions. Investigating the issue led us to reinterpret the adjusted $R^2$ and to highlight a new distribution-free perspective on nested model comparison which is equivalent, under Gaussian assumptions, to Fisher’s classical $F$-test. This generalizes to nested GLM comparison and provides exact comparison tests that are not based on asymptotic approximations. We still have many questions to answer, but here’s some of what we’ve done.

So, in the context of least squares linear regression, the model for relating a vector of observed responses to independent covariates is , where is the design matrix and is the vector of random errors. One of many summary statistics arising from data analyses based on this model is the adjusted coefficient, defined as

where is the vector of residual errors and is the mean of (Cramer, 1987; Ohtani, 2004). The coefficient and its adjusted counterpart are widely used as measures of goodness of fit, as model selection criteria and as estimators of the squared multiple correlation coefficient of the parent population. While their properties have been thoroughly studied in these contexts (Olkin, 1958; Helland, 1987; Cramer, 1987; Meepagala, 1992; Ohtani, 2004), the literature is scarce in explanations as to what, exactly, adjusts for in non-trivial cases. It is not an unbiased estimator of and the degrees of freedom adjustment heuristic (Theil, 1971) is of limited depth.

Here we show in what sense the adjusted $R^2$ coefficient may be considered “unbiased”. For nested model comparison, we also suggest a test of the significance of a difference between two nested models which is equivalent to Fisher’s $F$-test under Gaussian assumptions. The test is, however, carried out from a largely distribution-free perspective which is conditional on the observed responses. The results are then reinterpreted under classical Gaussian assumptions, which emphasizes the duality between the two tests.

Model and notations

Given a matrix , let denote the subspace spanned by its columns and be the (orthogonal) projection on .

The commonly used linear regression model is where is a vector of observed responses, the design matrix consists of a constant column vector followed by column vectors of covariates, is the vector of parameters to be estimated and is the vector of random errors. The design matrix is assumed to be non-random and of full rank . Let also denote the residual errors obtained by linear least squares fitting.

Testing for an increase of $R^2$

Suppose we have two design matrices and , where . Let and . Given the vector of observations , we observe two values and associated to the nested models. The classical way to test for a significant increase of is to carry out Fisher’s -test based on the statistics

where and . This is a function of both and , which, under the assumption

for , has an -distribution.

This is, however, a rather convoluted way of going about comparing the two numbers and . Can we do something simpler, and can we drop the Gaussian assumption? The answer is yes, although we’ll have to change our point of view on the problem a bit.

A Dual perspective on nested model comparison

The whole point of nested model comparison is to see if the new covariates in , i.e. those that are not part of , bring new information about . In the context of an exploratory analysis where the observations and predictors are all observed, we propose to change our perspective to the following testing procedure:

condition on the observation of and (consider them fixed, observed values);
test if the new covariates in are random noise.

Hence, rather than testing the model under a Gaussian noise assumption, we test for covariate randomness. Our null hypothesis becomes

This test can be carried out using any test statistic , and obviously the distribution of under (and conditionally on ) will not depend on the unknown parameter nor on the noise structure (which has been conditioned out). In particular, we can take .

Does it make any sense? Well, it does not change anything! The test obtained in this framework is entirely equivalent to Fisher’s $F$-test we reviewed before: for any given observation of , and , the two tests will give the same results.

Let me make all of this more precise.

Some precisions

Let be the concatenation of with a matrix of new covariates. The goal is to test whether or not has significantly increased from . Henceforth, we shall assume that both and are fixed and the null hypothesis is

By saying that has a uniformly distributed direction, we mean that is uniformly distributed on the -sphere. This is satisfied, for instance, if and this represents the augmentation of the covariate space through random directions. It is equivalent to saying that the complement of in is a random subspace. The following proposition shows that the expected value of is invariant under the addition of such covariates and provides the distribution of under .

Proposition 1. Let and be fixed and let be the concatenation of with independent random vectors of uniformly distributed directions. Then

and, more precisely, under we have that is distributed as

where is a Beta random variable of parameters and .

Proof. Let be the projection of on the orthogonal of and denote by the orthogonal projection onto . By the Pythagorean theorem we have and hence we may write

We now derive the distribution of . This term is the squared norm of the projection of the unit vector on the random subspace . Let us now introduce a random unitary matrix obtained by orthonormalizing random vectors of uniformly distributed directions, so that is distributed as the first components of the vector . Since is uniformly distributed on the unit sphere of , it follows that the squared norm of its first components has a distribution. In other words, we have shown that .

The expectation of is obtained from this distributional expression.
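Proposition 1 can be checked by simulation in the simplest case, where the smaller model only has an intercept (so that its $R^2$ and adjusted $R^2$ are both 0) and a single Gaussian noise covariate is added. The sketch below is illustrative only: the response values are made up, the helper names are hypothetical, and it uses the standard adjustment $1 - (1 - R^2)(n-1)/(n-p-1)$ with $p = 1$ covariate.

```python
import random


def squared_correlation(y, z):
    """R^2 of the regression of y on an intercept and the single covariate z."""
    n = len(y)
    y_bar, z_bar = sum(y) / n, sum(z) / n
    syz = sum((a - y_bar) * (b - z_bar) for a, b in zip(y, z))
    syy = sum((a - y_bar) ** 2 for a in y)
    szz = sum((b - z_bar) ** 2 for b in z)
    return syz ** 2 / (syy * szz)


def adjusted(r2, n, p):
    """Adjusted R^2 for n observations and p covariates plus an intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


rng = random.Random(1)
y = [2.1, 0.3, 1.7, 4.4, 3.0, 0.9, 2.8, 5.1, 1.2, 3.6]  # fixed, observed responses
n = len(y)

r2_values = []
for _ in range(20_000):
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one pure-noise covariate
    r2_values.append(squared_correlation(y, z))

mean_r2 = sum(r2_values) / len(r2_values)
mean_adjusted = sum(adjusted(r2, n, 1) for r2 in r2_values) / len(r2_values)

print(mean_r2, mean_adjusted)
```

In this setting the average raw $R^2$ should come out close to $1/(n-1)$, which is pure inflation from the noise covariate, while the average adjusted $R^2$ should stay close to the intercept-only value of 0.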

Reinterpretation under Gaussian hypotheses

While the preceding analysis was conditional on the observation of , suppose now that , where for some . The distribution of is then intricately related to the unknown parameter , preventing a direct analysis.

However, as shown in Cramer (1987), the adjusted coefficient can still be understood as compensating for irrelevant covariates: in a correctly specified model, its expected value is invariant under the addition of covariates. This is formalized in Proposition 2 below. We preferred a more elementary proof than found therein, avoiding the rather involved explicit expression of the expected value that depends on the unknown parameter .

Proposition 2. Suppose , where and is Gaussian noise. If is another design matrix of rank such that , then

Remark. More precisely, we know the conditional distribution of given : it is the same as the distribution which appears in the context of Proposition 1. The above result then follows from a simple computation.

Proof. Let and write . Then is distributed as

for independent and a noncentral random variable of parameter . Hence

where and is a new and independent noncentral random variable. It follows that

depends on only through and must equal .

Relationship with Fisher’s $F$-test

In the context of Proposition 2, suppose in particular that , where is a matrix of additional fixed regressors. Recall that the $F$-statistic for Fisher’s test with nested models of and parameters respectively is given by

where and are the vectors of predicted values for the models corresponding to and . The test of significance devised in Section 2, based on , is then equivalent to Fisher’s $F$-test of the hypothesis

To see this, let be, as in the proof of Proposition 1, the projection of on the orthogonal of and denote by the projection on . Then the $F$-statistic can be written as

This is a monotone invertible transform of which, under , follows a Beta distribution of parameters and . Yet in the framework of Section 2 and under , where now is random and fixed, the test statistic is also a monotone invertible function of . This shows that the two one-sided tests are equivalent: the same observations yield the same $p$-values.

Discussion

We have highlighted dual perspectives on nested model comparison. An increase of $R^2$ may be due to random noise that correlates with fixed regressors, or to random regressors that correlate with fixed observations. Fisher’s test of the first hypothesis is equivalent to the test of the second. Furthermore, we showed that the adjusted $R^2$ compensates properly, on average, for both types of inflation of $R^2$. We suggest this provides a clear explanation of what exactly the adjusted $R^2$ adjusts for and how it can properly be used for model comparison.

Furthermore, the fact that random covariate tests, conditional on the observations, can be carried out exactly using any measure of goodness of fit (e.g. the likelihood or the AIC) suggests that our approach may be helpful in devising nested model comparison tests for GLMs. Testing at a chosen confidence level also provides more flexibility than using a rule-based procedure such as the AIC.

References

Cramer, J. S. (1987). Mean and variance of R² in small and moderate samples. Journal of Econometrics 35(2), 253–266.
Helland, I. S. (1987). On the interpretation and use of R² in regression analysis. Biometrics 43(1), 61–69.
Meepagala, G. (1992). The small sample properties of R² in a misspecified regression model with stochastic regressors. Economics Letters 40(1), 1–6.
Ohtani, K. and H. Tanizaki (2004). Exact distributions of R² and adjusted R² in a linear regression model with multivariate t error terms. Journal of the Japan Statistical Society 34(1), 101–109.
Olkin, I. and J. W. Pratt (1958). Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics 29(1), 201–211.
Theil, H. (1971). Principles of econometrics (1st ed.). New York: J. Wiley.

Reuse

CC BY 4.0

Copyright

Olivier Binette

3D data visualization with WebGL/Three.js

Sun, 06 Jan 2019 05:00:00 GMT

A JavaScript app that visualizes the positions and depths of earthquakes of magnitude greater than 6 from January 1st, 2014 to January 1st, 2019. The data is from the US Geological Survey (usgs.gov). The code is on GitHub.

The original motivation was to make a web tool for high-dimensional data exploration through spherical multidimensional scaling (S-MDS). The basic idea of S-MDS is to map a possibly high-dimensional dataset onto the sphere while approximately preserving a matrix of pairwise distances (or divergences). An interactive visualization tool could help explore the mapped dataset and translate observations back to the original data domain. To be continued…
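As a rough illustration of the "approximately preserving pairwise distances" idea, here is a minimal Python sketch of one possible objective: the sum of squared gaps between great-circle distances on the unit sphere and the target distances. The function names and the choice of a squared-error stress are assumptions made for this example. They are not taken from the app's code, and the optimization step that would move the points is left out.

```python
import math


def great_circle(u, v):
    """Geodesic distance between two unit vectors on the sphere."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    # Clamp to [-1, 1] to guard against rounding before taking acos.
    return math.acos(max(-1.0, min(1.0, dot)))


def smds_stress(points, target):
    """Sum of squared gaps between spherical and target pairwise distances.

    points: list of unit vectors (tuples of floats).
    target: square matrix (list of lists) of desired pairwise distances.
    """
    n = len(points)
    return sum(
        (great_circle(points[i], points[j]) - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Three mutually orthogonal points are pairwise a quarter circle apart.
axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
quarter = [[0.0 if i == j else math.pi / 2 for j in range(3)] for i in range(3)]
zero_stress = smds_stress(axes, quarter)
```

A configuration that reproduces the target matrix exactly has zero stress; an embedding would be found by minimizing this quantity over points constrained to the sphere.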

Reuse

CC BY 4.0

Copyright

Olivier Binette

Sampling Lipschitz Continuous Densities

Sun, 05 Nov 2017 04:00:00 GMT

Full code: https://github.com/OlivierBinette/LipSample

function [sample, x, y] = lipsample(f, L, limits, m, varargin)
% Random variates from a Lipschitz continuous probability density function on [a,b].
%
%   s = lipsample(@f, L, [a b], m)
%       Draws _m_ random variates from the probability density _f_ on [_a_, _b_],
%       which is Lipschitz continuous with constant _L_. If _f_ is continuously
%       differentiable, then the best choice of _L_ is the maximum absolute
%       value of its derivative.
%
%   s = lipsample(..., 'N', n)
%       ... Uses _n_ mixture components in the spline envelope of _f_.
%       The default choice is n = ceil(200*_L_) + 200, although increasing _n_
%       may improve performance in some cases.
%
%       
%   [s, x, y] = lipsample(@f, L, [a b], m)
%       ... Returns the spline envelope constructed by the algorithm: the
%       envelope linearly interpolates the points (x,y).
%
%   Dependencies
%   ------------
%     - Function discretesample.m
%
%   Examples
%   --------
%   % In file myfunc.m
%       function y = myfunc(x)
%           y = 1 + cos(2*pi*x);
%       end
%
%   % A few exact samples
%       sample = lipsample(@myfunc, 2*pi, [0 1], 10000);
%
%   % Plot 10 million variates.
%       sample = lipsample(@myfunc, 2*pi, [0 1], 10000000);
%       hold on
%       pretty_hist(sample, [0 1]);
%       plot(linspace(0,1), myfunc(linspace(0,1)));
%       hold off
%
%   % Plot the envelope constructed by the algorithm
%       [sample, x, y] = lipsample(@myfunc, 4*pi, [0 1], 10000);
%       u = linspace(0, 1, 200);
%       hold on
%       pretty_hist(sample, [0 1]);
%       plot(u, myfunc(u));
%       plot(u, interp1(x,y,u));
%       hold off
%       
%
%   Implementation details
%   ----------------------
%     - Acceptance-rejection sampling. A first degree spline envelope of _f_
%       is constructed. The number of components is a function of _L_, chosen
%       so as to maximize expected efficiency.
%
%   Warnings
%   --------
%     - _L_ must be greater than or equal to the best Lipschitz continuity
%       constant of _f_. Otherwise the algorithm may fail to yield exact samples.
%
%     - Efficiency bottleneck is the evaluation of _f_ at O(m) points. 
%
%   CC-BY O.B. sept. 15 2017

    % Parse input arguments.
    a = limits(1);
    b = limits(2);

    p = inputParser;
    % 'N' is documented as a name-value pair, so register it as a parameter.
    addParameter(p, 'N', ceil(200*L) + 200);
    
    parse(p, varargin{:});
    n = p.Results.N;
        
    % Construct the spline envelope.
    s = (b-a) * L / (2*n);
    x = linspace(0,1,n+1);
    y = arrayfun(f, x*(b-a) + a);
    ylow = arrayfun(f, x*(b-a) + a);
    
    % Use the Lipschitz constant to locally adjust the spline.
    alpha = atan(L);
    d = diff(y);
    beta = abs(atan(n*d/(b-a)));
    r = 0.5*sqrt(((b-a)/n )^2 + d.^2).*sin(pi-alpha-beta)./sin(alpha);
    h = r.*(L - abs(n*d/(b-a)));
    y(1) = y(1) + h(1); ylow(1) = ylow(1) - h(1);
    y(n+1) = y(n+1) + h(n); ylow(n+1) = ylow(n+1) - h(n);
    for i = 2:n
        y(i) = y(i) + max(h(i-1), h(i));
        ylow(i) = ylow(i) - max(h(i-1), h(i));
    end
            
    % Generate random variates following the envelope.
    nProp = ceil((1+s)*m);
    U1 = rand(1, nProp);
    U2 = rand(1, nProp);

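    % The two boundary knots only carry half a hat function, so halve their
    % weights while drawing the mixture component.
    y(1) = y(1)/2;
    y(end) = y(end)/2;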
    I = discretesample(y, nProp);
    y(1) = 2*y(1);
    y(end) = 2*y(end);

    U = abs((U1 + U2 + I - 2)/n);
    U(U > 1) = 2 - U(U > 1); % The sample.

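    % Accept or reject the proposals to obtain variates from f. Proposals
    % falling under the lower envelope are accepted without evaluating f.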
    V = rand(1, nProp);
    B = interp1(x, ylow, U);
    passlow = lt(V .* interp1(x,y,U), B);
    sample1 = U(passlow);
    U = U(~passlow); V = V(~passlow);
    sample2 = U(lt(V.*interp1(x,y,U), arrayfun(f, U*(b-a)+a)));
    sample = (b-a)*cat(2, sample1, sample2) + a;
    
    if numel(sample) < m
        sample = cat(2, sample, lipsample(f, L, [a b], m - numel(sample)));
    else
        sample = sample(1:m);
    end
    
    x = x *(b-a) + a;
end
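
For readers who want the acceptance-rejection idea without the spline bookkeeping, here is a simplified Python sketch. It is not a port of lipsample: it swaps the first-degree spline envelope for a piecewise-constant one, using the fact that a density with Lipschitz constant L satisfies f(x) <= f(c) + L*w/2 on a cell of width w centred at c. The name lipsample_sketch and the parameter n_cells are made up for this example.

```python
import math
import random


def lipsample_sketch(f, L, a, b, m, n_cells=200, rng=None):
    """Draw m variates from the density f on [a, b] by acceptance-rejection.

    Assumes f is Lipschitz continuous with constant at most L. The envelope
    is piecewise constant: on a cell of width w centred at c, it equals
    f(c) + L * w / 2, which bounds f from above on that cell.
    """
    rng = rng or random.Random()
    w = (b - a) / n_cells
    centres = [a + (i + 0.5) * w for i in range(n_cells)]
    heights = [f(c) + L * w / 2 for c in centres]

    sample = []
    while len(sample) < m:
        # Cells all have the same width, so picking a cell with probability
        # proportional to its height samples from the normalized envelope.
        cells = rng.choices(range(n_cells), weights=heights, k=m)
        for i in cells:
            x = centres[i] + (rng.random() - 0.5) * w
            # Accept the proposal with probability f(x) / envelope(x).
            if rng.random() * heights[i] < f(x):
                sample.append(x)
    return sample[:m]


def myfunc(x):
    return 1 + math.cos(2 * math.pi * x)


draws = lipsample_sketch(myfunc, 2 * math.pi, 0.0, 1.0, 20000,
                         rng=random.Random(0))
```

As in the MATLAB version, the samples are only exact if L is at least the true Lipschitz constant of f; a larger n_cells tightens the envelope and raises the acceptance rate at the cost of more evaluations of f up front.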

Reuse

CC BY 4.0

Copyright

Olivier Binette

Short Proof: Critical Points in Invariant Domains

Sat, 29 Apr 2017 04:00:00 GMT

Let be a vector field and denote by its stream. That is, and . A domain is said to be invariant (under the stream of ) if for all and . The curve is said to be a closed orbit of if there exists such that .

Theorem.
If is invariant and diffeomorphic to a closed ball of , then has a zero in .

Corollary.
If , then any closed orbit of encloses a zero of .
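
A standard example may help fix ideas. Writing $X$ for the vector field and $\varphi_t$ for its stream (notation chosen here for illustration), take the rotation field on the plane:

$$
X(x, y) = (-y,\; x),
\qquad
\varphi_t(x, y) = (x\cos t - y\sin t,\; x\sin t + y\cos t).
$$

Every circle centred at the origin is a closed orbit of period $2\pi$, and the closed unit disk is invariant and diffeomorphic to a closed ball. In agreement with the theorem and its corollary, the only zero of $X$, the origin, lies in the disk and is enclosed by every closed orbit. The annulus $\{1 \le x^2 + y^2 \le 4\}$ is also invariant but contains no zero of $X$, which shows that the assumption on the shape of the domain cannot simply be dropped.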

Proof of the theorem.
Suppose that for all and let . Since is uniformly continuous on , there exists such that implies . Also, by Brouwer’s fixed point theorem, there exists such that . This yields a closed orbit such that any two points on are at distance at most from each other. Since is closed, there must exist such that . Hence we find that , even though . This is impossible. Thus is not bounded away from zero and must have a zero in the compact .

Proof of the corollary.
When , the Jordan-Brouwer theorem implies that closed orbits separate the plane into two connected components, one of which is bounded. Schoenflies’ theorem, strengthening the above, ensures that the union of the bounded component with the closed orbit is diffeomorphic to the closed disk. Invariance follows from the uniqueness of the solution to initial value problems when is .

This can be generalized as follows. For the sake of mixing things up, we state the result in topological terms.

Theorem (Particular case of the Poincaré-Hopf theorem).
Let be a compact submanifold of with non-zero Euler characteristic , and let be a smooth isotopy. Then for all , there exist distinct points such that

Reuse

CC BY 4.0

Copyright

Olivier Binette