Validation and Uncertainty
What an extraordinary conversation I had recently on Twitter. It started with Neil Killick’s statement that we should not consider our stories truly done until validated by actual use. This is a lovely thing, if we can manage it. While I’ve not set such a bold declaration of “done,” I’ve certainly advocated for testing the benefit of what we build in actual use. Deriving user stories from hypotheses about what the users will do, and then measuring actual use against those hypotheses, is a very powerful means of producing the right thing. It is more powerful than anything a Product Owner or Marketing Manager could otherwise dream up.
While I often recommend such an approach as part of “Lean Startup in the Enterprise,” when I hear someone else say it, it’s easier to think of potential problems. Paul Boos says it’s “balanced insight.” I fear it’s me being contrary, but I do like to examine multiple sides of a question when I can. In any event, such conversations help me to think deeper about an issue.
The first situation I considered was the solar calibrator for Landsat VII. When you only get one launch date, it’s a bit hard to work incrementally, validating the value with each increment. Instead, we must validate as best we can prior to our single irrevocable commitment. This involves quite a bit of inference from previous satellites, and from phenomena we can measure on earth. We must also predict future conditions as best we can, so that we can plan for both expected conditions and anomalies that we can envision. This is an extreme situation, and it’s quite possible we’ll utterly fail.
So, the conversation turned to ecommerce systems. Surely we can measure user behavior and know what effect a change makes. We can measure the behavior, all right, but even without making any change, we may notice a lot of variance in that behavior. If a variance of 5% to 10% can be expected week-over-week in some measured behavior, then a change that might produce a 1% to 2% change is very hard to detect.
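To put rough numbers on that, here is a minimal sketch of a before-and-after comparison. It rests on assumptions that are mine, not measurements: each week is treated as an independent draw, the week-over-week variation is read as a standard deviation, and a result counts as "detectable" once the effect is about two standard errors of the difference between the two averages. The function name `weeks_needed` is made up for illustration.

```python
import math

def weeks_needed(noise_sd_pct, effect_pct, z=2.0):
    """Weeks required in EACH period (before and after the change).

    Assumes independent weeks with standard deviation noise_sd_pct.
    The standard error of the difference between two n-week averages
    is noise_sd * sqrt(2 / n). Requiring effect >= z * SE gives
    n >= 2 * (z * noise_sd / effect) ** 2.
    """
    return math.ceil(2 * (z * noise_sd_pct / effect_pct) ** 2)

# Hypothetical inputs taken from the ranges mentioned above.
for noise, effect in [(5.0, 2.0), (5.0, 1.0), (10.0, 1.0)]:
    weeks = weeks_needed(noise, effect)
    print(f"noise {noise}%, effect {effect}%: {weeks} weeks before and after")
```

Under these assumptions, a 1% effect against 5% weekly noise needs on the order of 200 weeks on each side of the change, which is far longer than most businesses will hold a system still.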
The obvious answer is to maintain a control group. If we present some users with an unchanged system and others with the change, then we can measure the normal variation in the control group and the normal variation plus the variation specific to the change in the experimental group. Given a sufficient number of users, the normal variation should be equal between the two groups.
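A small simulation can illustrate why the control group helps. Everything in it is hypothetical: the base conversion rate, the size of the weekly swing, the 2% lift, and the group sizes are invented, and sampling noise is approximated with a Gaussian rather than modeled user by user. The point is only that variation shared by both groups drops out when we compare them within the same week.

```python
import random
import statistics

def observed_rate(rng, true_rate, users):
    """True rate plus sampling noise (Gaussian approximation)."""
    sd = (true_rate * (1 - true_rate) / users) ** 0.5
    return true_rate + rng.gauss(0, sd)

def simulate(weeks=52, users_per_group=1_000_000, base_rate=0.05,
             weekly_swing=0.10, true_lift=0.02, seed=1):
    """Return weekly treated-group rates and weekly lift estimates."""
    rng = random.Random(seed)
    treated_rates, lifts = [], []
    for _ in range(weeks):
        # Normal variation: both groups live through the same week.
        week_rate = base_rate * (1 + rng.uniform(-weekly_swing, weekly_swing))
        control = observed_rate(rng, week_rate, users_per_group)
        treated = observed_rate(rng, week_rate * (1 + true_lift), users_per_group)
        treated_rates.append(treated)
        lifts.append(treated / control - 1)
    return treated_rates, lifts

treated_rates, lifts = simulate()
raw_spread = statistics.pstdev(treated_rates) / statistics.mean(treated_rates)
print(f"relative spread of the treated group alone: {raw_spread:.3f}")
print(f"spread of the weekly lift estimates:        {statistics.pstdev(lifts):.3f}")
print(f"average estimated lift:                     {statistics.mean(lifts):.3f}")
```

With large groups, the weekly lift estimates cluster far more tightly than the raw rates do. With small groups the sampling noise in each group no longer cancels, and the advantage shrinks.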
Is the specific variation, however, a lasting phenomenon in response to the essence of the change, or is it transient behavior triggered by the fact that there was a change? In spite of the Hawthorne Effect being a legend based on poor experimental design, novelty does attract interest.
When we gain in one respect, we lose in another. With either of these methods, there may be errors in what we think we know. And any time we segment, we’re reducing the number of users in each study group, increasing the chance that what we measure is due to random chance rather than the change we are studying.
Statistics can help us estimate whether a variation is specific or random. It can tell us whether the change is statistically significant, and how confident we can be that it’s not random. It cannot, however, assure us of our conclusions.
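As one concrete form such a check might take, here is a sketch of a two-proportion z-test, a standard textbook test that relies on a normal approximation. It is my choice for illustration, not something from the original conversation, and the conversion counts are invented. The function name `two_proportion_z_test` is likewise hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Uses the pooled-proportion normal approximation, which is only
    reasonable when both groups have many conversions and non-conversions.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented example: a 2% relative lift (5.0% -> 5.1%) at two group sizes.
for n in (10_000, 5_000_000):
    z, p = two_proportion_z_test(round(0.050 * n), n, round(0.051 * n), n)
    print(f"{n:>9} users per group: z = {z:.2f}, p = {p:.3g}")
```

The same lift that is indistinguishable from chance with ten thousand users per group becomes unmistakable with millions. Even then, a small p-value says only that the difference is unlikely to be random. It says nothing about why the difference exists.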
Statistics also can only alert us to correlation, not to causation. The behavior change we notice may be due to an overlooked difference that accompanies the difference we’re attempting to measure. If we’re not very careful, then there may be systematic bias that we don’t notice, and we make the wrong presumption from the evidence.
In the world of commerce, we rarely have the opportunity to repeat our experiments to see if they’re repeatable. We want to keep progressing, and so we lose the baseline conditions that would allow us to repeat. Also, our system is just a small part of the daily life of a potential customer. It’s just a small part of the larger system made up of all the other small systems they use. Those other systems also have the capability to change the user’s behavior with our system. Perhaps they change how the user thinks our system can be operated, or set a different standard for what is considered acceptable.
The world is a bit unpredictable. In the end, we seem to be measuring what happened against what we estimate would have happened otherwise. Sometimes it may seem very clear to us; sometimes murky. Always we have the possibility of fooling ourselves.