Insanity and Illusion: Testing Test Cases

The cliche goes that the definition of insanity is to repeat the same action over and over again, and to expect a different result.

In the context of testing though, I would posit a different definition. In testing, insanity is to instruct different people to repeat the same actions and to expect the same result.

Yesterday Katrina Clokie, Aaron Hodder and I tested this theory. As part of some training we are preparing for new graduate testers, we developed an exercise to highlight the inattentional blindness that can occur when executing scripted test cases.

The concept was simple. We wrote a set of eight detailed test cases which instruct the tester to check that the sort function on a particular auction-based shopping website works as it should. Each tester (or pair of testers) had the exact same set of eight test cases, identical in every detail. We gave them 20 minutes to execute them and, upon completion, asked them for the number of bugs they had found, as well as the number of tests which had passed, had failed or had not been run.

These were our results:

As you can see, when we asked different people to follow the same test cases we got very different results.

Some groups found no bugs. One group found five. Most found two bugs, but based on the subsequent discussion these were not necessarily the same two bugs. The same phenomenon occurred when the testers were asked to give each of their tests a black-and-white pass/fail verdict. In some cases, the presence of bugs appeared to result in that test case failing. In others, the tester identified a bug but still felt that the test had passed.

To me, this highlights two really important things.

The first is the inattentional blindness that can occur when executing test cases, which is what we hoped the exercise would show. This is the phenomenon whereby focusing specifically on one thing makes you more than likely to miss other things that may be going on right in front of your eyes. The most famous example of this is the “monkey business illusion”.

In our exercise, and in the execution of scripted testing, inattentional blindness means that because the documented procedure is focusing your attention on a specific line of enquiry, you are liable to miss bugs or interesting behaviours that occur around the functions that the script specifically highlights. In our teaching to our graduates, we want to highlight this so that they are aware of the fallibility of this method and can remember to defocus regularly when executing scripted tests, lest they miss the bigger picture.

The second crucial takeaway from this exercise, and one we didn’t initially intend it to demonstrate, is the inherent danger of test cases and the implied level of equivalence that they represent. A great many testers still happily report on their testing by counting the number of tests which passed and the number which failed. This is an incredibly reductive way of representing the information that testing ought to be providing, because it treats all test cases as equal.

If we were to try to report on the above set of test results in this manner, what on earth would we do? To say that we had 48 test cases, of which 15 passed and 13 failed, would be criminally misleading. Clearly the actual testing which occurred from group to group here was unique. Each group brought different insights, made different judgements and elicited different information relative to the sort functionality on the website under test. To reduce these unique activities into a numeric ranking out of eight would be absurd and provide zero useful information to anyone.

As a further side note, you’ll notice that when asked to report on their findings one group (3rd row) had contrived to lose a test case, while another (5th row) had gained one – they reported seven and nine “results” respectively. In both cases, they legitimately had eight test cases the same as everyone else, which further highlights the inevitable human error involved in such a practice even with relatively small volumes.

As such, the concept of a test case is a dangerous one. There is no weighting to the idea of a test case. It does not truly reflect the cognitive and temporal effort of the testing that occurred, instead reducing that effort – however great or small – to a contrived and equivalent rank. The only way to truly represent the outcomes of the testing that was done in our exercise was to talk to the testers about what they did and what they had found. The above numbers are meaningless.

Well, perhaps not entirely meaningless.

These numbers have helped us to highlight the fallibility of the test case. The exercise itself was effective in allowing our attendees to experience the inattentional blindness that comes from over-focusing, but the collation of the groups’ results refuted one of the oft-cited benefits of scripted test cases – that they provide an opportunity for repeatable, consistent and unambiguous test results.

8 thoughts on “Insanity and Illusion: Testing Test Cases”

  1. Adam,

    Wonderful exercise to gather this data. Insightful, thoughtful write up! Thank you. Your thought-provoking findings are going to be discussed with interest by thoughtful testers all over the globe. The photo you included in your post is sure to feature in testing conference presentations in 2014 and be one of those topics at those conversations that sticks with attendees and spurs hallway conversations. Seriously, this is really cool stuff you’ve done.

    In thinking about your findings, I’ve had the following thoughts. I’d be interested to know your reaction to them.

    If 100 groups of testers conducted similar experiments with 8 test scripts and 6 small groups of testers, I would suspect their findings would be broadly similar to yours (e.g., very few, if any, of the 100 experiments would have all 6 teams report the same number of bugs / failed tests / passed tests). I suspect the data from those 100 experiments would tend to be impacted by the following factors.

    These 4 factors would tend to make the number of bugs found and the number of tests passed relatively less consistent among the groups:

    (a) the testers were trained in Exploratory Testing methods and, more broadly, Context Driven Testing (training that would help them analyze systems relatively more efficiently and effectively, and that makes them less reliant on, and less limited by, the specific content of the test scripts in front of them).
    (b) testers were explicitly encouraged to depart from the specific confines of the test scripts if they think it would be a good use of time to do so (e.g., “if you, mid-script, think ‘Hmmm… I wonder what would happen if I do this?’… do it and see!”)
    (c) different groups of testers in the room have highly varied backgrounds (e.g., worked in different companies, worked in different areas of testing specialty, different numbers of years of experience, different analytical abilities)
    (d) the test scripts were highly detailed (e.g., “click on the mouse button to add a black iPhone 5s with 32 GB of memory into the shopping cart, next check out and type the following credit card number…”)

    Conversely, the following factors would tend to make the number of bugs found and the number of tests passed relatively more consistent among the groups:

    (a) testers have not been formally taught Exploratory Testing methods or about Context Driven Testing
    (b) testers are explicitly encouraged to “stay on script!”
    (c) testers in the room had relatively similar backgrounds, levels of experience and analytical abilities
    (d) the test scripts were more general (e.g., “buy an electronic item from the site”)

    In short, the top 4 factors would be negatively correlated with consistency of bug finding reports and passed test cases; the bottom 4 factors would be positively correlated.

    I’d be interested in your thoughts on these musings.

    – Justin Hunter

    • I accidentally reversed the content in the points I made in (d)! More tightly scripted tests would lead, I suspect, to relatively more consistent numbers from teams about bug finding / test passing / test failing; the more vague the script the less consistent those numbers would probably be.

      • Hi Justin – thanks for your very thoughtful comment!

        I completely agree with your analysis: a diverse group of testers with a more exploratory background and general instructions would very likely produce more varied results than testers of a similar background and experience with little training or experience in exploratory methods.

        Reflecting on the group of testers and the way we instructed them, I would say we were pretty split across your two sets. On the one hand, the group didn’t have much experience in exploratory testing methods and the scripts were deliberately written to be quite detailed. On the other hand, the group was pretty diverse in terms of experience and skill sets and we weren’t too insistent that they “stay on script”, though they were asked to just execute the scripts.

        As I mentioned in the write up, this exercise is part of a training programme we’re developing for new graduates who join our organisation, and this was a practice run. I will be very interested to see the results when we run it on that occasion, as the testers will very much fit your second set – all having practically zero testing experience and no formal exploratory testing training (yet!) so I would expect them to generally produce less varied results.

        Aaron and I did discuss yesterday that we might consider exactly how we “instruct” them for this exercise, and potentially look at repeating it with a more experienced group in the same way (Aaron and Katrina run a local meet up which I think would be a fantastic testing ground) to enable us to be a bit more scientific about the whole process.

        One thing that I’m sure influenced the above results was that the group generally knew that Katrina, Aaron and I all favour an exploratory approach to testing – and so probably more readily went “off-script” than they might have if instructed by someone whose preference wasn’t as obvious. This is another element we might consider trying to negate to heighten the reliability of the results.

        Thanks again for reading and commenting, interested to hear any other thoughts you have!

  2. Why do you draw the conclusion that inattentional blindness occurred? Did the testers go “Ahhh, why didn’t I see that!?” when they compared their findings?

    • Hi Magnus,

      Yes, upon completion of the exercise we discussed the different things that each group had observed and there were moments when some of the testers had that sort of revelation.

      The exercise was designed to test the sort functionality of the chosen website because there are some problems or inconsistencies with that functionality that become apparent if you investigate with a wider lens, but which will not necessarily cause any of the test scripts to “fail” – because of the focus and level of detail with which they are written.

      Thanks for your question, I hope you enjoyed the article.

  3. This entry should be required reading for ANYONE managing a development project. On every project I’ve been on, managers think it’s a great idea to count how many test cases passed, failed or weren’t executed. There’s also pressure to execute as many test cases as possible, so the pressure is on to just stay on script, and NOT spend time exploring other things you may see that indicate there may be issues, because those explorations involve straying from the script! Almost all managers fail to understand that generally the most valuable testing is exploratory, at least when done by smart testers. But I don’t see managers developing a sophisticated, realistic view of testing anytime soon.
