Validation and Uncertainty
What an extraordinary conversation I had recently on Twitter. It started with Neil Killick’s statement that we should not consider our stories truly done until validated by actual use. This is a lovely thing, if we can manage it. While I’ve not made such a bold declaration of “done,” I’ve certainly advocated for testing the benefit of what we build in actual use. Deriving user stories from hypotheses of what the users will do, and then measuring actual use against those hypotheses, is a very powerful means of producing the right thing, more powerful than anything a Product Owner or Marketing Manager could otherwise dream up.
While I often recommend such an approach as part of “Lean Startup in the Enterprise,” when I hear someone else say it, I find it easier to think of potential problems. Paul Boos says it’s “balanced insight.” I fear it’s me being contrary, but I do like to examine multiple sides of a question when I can. In any event, such conversations help me think more deeply about an issue.
The first situation I considered was the solar calibrator for Landsat VII. When you only get one launch date, it’s a bit hard to work incrementally, validating the value with each increment. Instead, we must validate as best we can prior to our single irrevocable commitment. This involves quite a bit of inference from previous satellites and from phenomena we can measure on Earth. We must also predict future conditions as best we can, so that we can plan for both expected conditions and anomalies that we can envision. This is an extreme situation, and it’s quite possible we’ll utterly fail.
So, the conversation turned to ecommerce systems. Surely we can measure user behavior and know what effect a change makes. We can measure the behavior, all right, but even without making any change, we may notice a lot of variation in that behavior. If week-over-week variation of 5% to 10% can be expected in some measured behavior, then a change that might produce a 1% to 2% shift is very hard to detect.
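To make that concrete, here’s a minimal simulation of my own (all numbers invented for illustration): weekly conversion rates that swing by up to ten percent on their own, plus a small real improvement. The improvement is routinely smaller than the week-to-week noise, so a simple before/after comparison can easily mislead us.

```python
# A minimal sketch, assuming a 10% baseline conversion rate, +/-10% relative
# week-over-week noise, and a 1.5% relative improvement we hope to detect.
# All numbers are invented for illustration.
import random

random.seed(1)

baseline = 0.10       # assumed baseline conversion rate
weekly_noise = 0.10   # week-over-week swings of up to +/-10% (relative)
true_effect = 0.015   # a 1.5% relative improvement from our change

def weekly_rate(with_change):
    """One week's observed conversion rate, with or without the change."""
    rate = baseline * (1 + random.uniform(-weekly_noise, weekly_noise))
    if with_change:
        rate *= (1 + true_effect)
    return rate

before = [weekly_rate(False) for _ in range(8)]
after = [weekly_rate(True) for _ in range(8)]

print("avg before: %.4f" % (sum(before) / len(before)))
print("avg after:  %.4f" % (sum(after) / len(after)))
# The 1.5% improvement is usually dwarfed by the random week-to-week swings.
```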
The obvious answer is to maintain a control group. If we present some users with an unchanged system and others with the change, then we can measure the normal variation in the control group, and the normal variation plus the variation specific to the change in the experimental group. Given a sufficient number of users, the normal variation should affect both groups about equally.
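Here’s a rough sketch of that idea, again with invented numbers: because both groups are measured over the same period, the shared week-to-week swing hits them about equally, and the difference between them estimates the effect of the change.

```python
# A sketch of the control-group idea, assuming a 10% base rate, a shared
# weekly swing of up to +/-10%, and a 1.5% lift from the change.
import random

random.seed(2)

def simulate_group(n_users, rate):
    """Count conversions for n_users, each converting with probability rate."""
    return sum(1 for _ in range(n_users) if random.random() < rate)

shared_weekly_factor = 1 + random.uniform(-0.10, 0.10)  # hits both groups alike
control_rate = 0.10 * shared_weekly_factor
treatment_rate = control_rate * 1.015                   # assumed 1.5% lift

n = 50_000                                              # users per group
control = simulate_group(n, control_rate)
treatment = simulate_group(n, treatment_rate)

print("control conversions:  ", control)
print("treatment conversions:", treatment)
print("observed lift: %.2f%%" % (100 * (treatment - control) / control))
```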
Is the specific variation, however, a lasting phenomenon in response to the essence of the change, or is it transient behavior triggered by the mere fact that there was a change? Even though the Hawthorne Effect is a legend based on poor experimental design, novelty does attract interest.
Another difficulty is that the people whose behavior we are measuring are individuals, and they vary considerably from one another. When we measure in aggregate, we are averaging across those individuals, smoothing the rough bumps off our data. Unfortunately, much of the data is in those rough bumps. What motivates one person to purchase our product may drive another one off. Can’t we divide our customers into segments so that we can at least average over smaller, more-targeted groups? If we’re measuring the behavior of customers who are logged into our site, and whose background and behavioral history we know, then we certainly can. If they are anonymous users, then no, we can’t. Not unless we spy on them by tracking them around the web through widespread cookies, surreptitious JavaScript, or other means.
When we gain in one respect, we lose in another. With either of these methods, there may be errors in what we think we know about our users. And any time we segment, we reduce the number of people in each study group, increasing the likelihood that what we measure is due to random chance rather than the change we are studying.
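Some back-of-the-envelope arithmetic (my own illustration, not from the original conversation) shows how quickly the uncertainty grows as a segment shrinks: the standard error of a measured conversion rate scales with one over the square root of the group size.

```python
# Assuming a 10% conversion rate, show how the uncertainty of the measured
# rate grows as the group gets smaller. The last column is the half-width of
# a rough 95% interval, expressed as a percentage of the rate itself.
from math import sqrt

p = 0.10                      # assumed conversion rate
for n in (100_000, 10_000, 1_000):
    se = sqrt(p * (1 - p) / n)
    print("n=%7d  standard error = %.4f  (95%% interval roughly +/- %.1f%% of the rate)"
          % (n, se, 100 * 1.96 * se / p))
# With only 1,000 users in a segment, the interval is around +/-19% of the
# rate, so a 1% to 2% effect is hopelessly lost.
```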
The use of statistics can help us estimate whether a variation is specific or random. It can tell us whether the change is statistically significant, and give us a confidence level that what we see is not just random. It cannot, however, assure us that our conclusions are correct.
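One common tool for this situation is a two-proportion z-test. The sketch below, with invented conversion counts, shows the kind of answer it gives: a p-value under the test’s assumptions, not a guarantee that our conclusion is right.

```python
# A minimal two-proportion z-test on invented conversion counts.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal distribution
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented counts: 50,000 users per group, a 2.4% observed lift.
z, p = two_proportion_z_test(5000, 50_000, 5120, 50_000)
print("z = %.2f, p-value = %.3f" % (z, p))
# A small p-value says the difference is unlikely to be pure chance under the
# test's assumptions; it does not prove the change caused it, nor that the
# effect will last.
```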
Statistics can also alert us only to correlation, not to causation. The behavior change we notice may be due to an overlooked difference that accompanies the difference we’re attempting to measure. If we’re not very careful, there may be a systematic bias we don’t notice, and we’ll draw the wrong conclusion from the evidence.
In the world of commerce, we rarely have the opportunity to repeat our experiments to see whether the results hold. We want to keep progressing, and so we lose the baseline conditions that would allow us to repeat them. Also, our system is just a small part of the daily life of a potential customer, just one piece of the larger system made up of all the other small systems they use. Those other systems also have the capability to change the user’s behavior with our system. Perhaps they change how the user expects our system to work, or set a different standard for what is considered acceptable.
The world is a bit unpredictable. In the end, we seem to be measuring what happened against what we estimate would have happened otherwise. Sometimes it may seem very clear to us; sometimes it’s murky. Always we have the possibility of fooling ourselves.
The final paragraph suggests we might want to measure what happened against what we desire, rather than against what we estimated would happen. Then we can focus on getting close to what we desire, rather than the (probably fruitless) goal of “estimating better.”
Good thought, Dave. I suspect I’d want to do both. Not to “estimate better” but to better understand what I’m seeing, and to develop better hypotheses in the future.
Thanks for this article – many of these issues strike a chord. I once worked on a road traffic management project where we wanted to measure the impact of new information systems. However, we found that the city couldn’t even measure the impact of their entire traffic control centre, due to variance, the lack of a baseline, and the impossibility of a control group. A further issue is that users (drivers) may initially modify their behaviour in the expected way, but over a few weeks they drift back again (finding new ‘rat-runs’, for example), reversing the initial change. So measurement over surprisingly long timescales may be needed to draw any valid conclusion.