Bayes Theorem for lawyers - Part 1
A mid-length explanation of the Ultimate Guide to the power of evidence
“Bayes Theorem describes what makes something ‘evidence’ and how much evidence it is” – Eliezer Yudkowsky
Trials are all about the probative power of evidence. It is surprising, therefore, how few lawyers understand – or have even heard of – the Ultimate Guide to evidential power: Bayes Theorem. Most criminal advocates faintly recall that the Court of Appeal once said something about DNA evidence needing to be set out in a particular way, but that’s about it.
Perhaps it’s not so surprising. Most lawyers don’t like numbers. And most explanations of Bayes are confusing. I have tried here to bring together the best aspects of the best explanations I have read. And I use as the main illustration a scenario from criminal law, rather than the more commonly used examples about disease diagnosis.
This post deals with the ‘probability form’ of Bayes theorem. The ‘odds form’ will come in a Part 2. If, like me, you often need to read several different explanations to understand something properly, I will post links at the end of Part 2.
Why is Bayes’ Theorem important for lawyers?
“To introduce Bayes Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity deflecting them from their proper task.” - Rose LJ, R v Denis Adams No. 1 (1996).
You can’t explain Bayes to a jury. But you can, I think, use an understanding of the theorem to give better closing speeches, with valid and intuitively appealing arguments about the strength of the evidence. Likewise when arguing admissibility.
One useful way to think about Bayes theorem is as a map. It is a criminal advocate’s job, after all, to show the jury the way. The other side tries the same, with a different path. But knowing the Bayesian map – that is, the guide to how evidence affects belief – you are better equipped to spot an opponent’s invalid route, and emphasise the wisdom of your own.
“What a surprise, members of the jury!” a prosecutor exclaims. This is what prosecution evidence amounts to: presenting facts that would be surprising if the defendant were innocent, and would not be surprising if he were guilty. Bayesian inference is all about surprise. But an initial feeling of surprise, or its absence, might not be justified. Bayes shows us when we should, and should not, be surprised.
Bayes’ theorem enables us to work out how the probability of a piece of evidence affects the probability of a hypothesis. That is, it shows us how to update our degree of belief in a proposition, in light of an observation.
One way of putting the clever insight of Bayes is to say that the mathematical relationship between evidence and hypothesis is described by the relationship between two overlapping areas:
One circle is the hypothesis and the other is the evidence. We’ll return to this diagram. For now, it is best to use an example, and a slightly different – though functionally equivalent – diagram.
Island Riot
Imagine an island of 100 people. There is a riot. CCTV shows that there were 20 people rioting, but it is of such poor quality that no one can be identified.
In the following diagram, the 100 people are represented by 100 squares. The twenty rioters have a ‘G’, for Guilty.
We could label the innocent people ‘not guilty’ if we wanted, but at this stage it would create unnecessary visual clutter, so we will leave them blank.
Now, suppose the police investigate all 100 islanders. What is the probability, for any given person, of the hypothesis that they are guilty of riot?
Well, it is – fairly obviously – the area covered by the Gs as a proportion of the total area: 20/100. That is, 20%, or 0.20.
Now, for the sake of the example let’s imagine that everyone on the island owns a motorbike (and clever technology means people can only ever ride their own).
Police obtain footage from another CCTV camera. This shows all 20 rioters fleeing the scene by motorbike. The quality is just good enough to determine that 15 of their motorbikes are Yamahas, and the other 5 are not.
Further, DVLA records show that of the 100 motorbikes on the island, 25 are Yamahas.
Let’s add this new information to the diagram, by colouring Yamaha riders yellow. We need to colour 25 squares yellow, for the 25 Yamaha riders on the island. 15 of them were rioters, so 15 of the yellow squares must be squares with G in.
At this point, the probability of any given islander being guilty is still 20%.
New evidence
The police select at random the first of the one hundred islanders to be investigated. An industrious cop thinks it might be a good idea to look in this man’s garage to see what kind of motorbike he has. He wonders, though, whether it would help prove or disprove anything. How might the probability of guilt change?
The original probability of the hypothesis of guilt was 20% - based on the mere fact that there were twenty rioters and one hundred inhabitants. If our suspect’s bike turns out to be a Yamaha, then he must be one of the yellow squares. So the question of our new probability would become: given that he is a yellow square, what is the probability that he is a G square?
Well, the answer to that is just the proportion of yellow squares that have a G in: 15/25. Which is 3/5, or 60%.
The probability of our hypothesis of guilt was 20%. Given he rides a Yamaha, the probability is now 60%. We just used Bayes’ theorem!
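For readers who like to check things by counting, here is a minimal Python sketch of the island as a list of 100 people. The variable names are my own, chosen for illustration; the counts come straight from the example above.

```python
# Each islander is a (guilty, rides_yamaha) pair; counts are from the example.
islanders = (
    [(True, True)] * 15      # rioters who fled on Yamahas
    + [(True, False)] * 5    # rioters on other bikes
    + [(False, True)] * 10   # innocent Yamaha riders (25 Yamahas - 15)
    + [(False, False)] * 70  # everyone else
)

# Prior: guilty squares as a proportion of all squares.
prior = sum(guilty for guilty, _ in islanders) / len(islanders)

# Posterior: restrict attention to the yellow squares, then count the Gs.
yamaha_riders = [person for person in islanders if person[1]]
posterior = sum(guilty for guilty, _ in yamaha_riders) / len(yamaha_riders)

print(f"P(guilty) = {prior:.2f}")            # P(guilty) = 0.20
print(f"P(guilty | Yamaha) = {posterior:.2f}")  # P(guilty | Yamaha) = 0.60
```

Nothing here uses a formula: the update is just a matter of throwing away the non-yellow squares and counting again.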
Let’s now write out what we’ve just done in proper notation.
Note 1: I prefer an upper case ‘H’ and a lower case ‘e’ because they are easier to distinguish at a glance.
Note 2: That straight line symbol, ‘|’, meaning ‘given’, could do with further explication. Imagine the hypothesis, H, was “he is American”, and the evidence, e, was “he is a professional baseball player”. P(H|e) would mean, “The probability that he is American, given that he is a professional baseball player”. And P(e|H) would mean, “The probability that he is a professional baseball player, given that he is American”.
Back to the rioters.
Our hypothesis, H, was “he is guilty”:
P(H) = 0.20 (because there were 20 rioters out of 100 inhabitants)
Our evidence, e, is “he rides a Yamaha”:
P(e) = 0.25 (because there are only 25 Yamahas on the island)
The probability that our suspect rides a Yamaha, given the fact that he is guilty:
P(e|H) = 15/20 (i.e. 15 of the 20 rioters rode Yamahas)
Now, in order to calculate our updated probability of guilt in light of the new evidence – i.e. P(H|e) – the calculation we just performed was:
P(H|e) = (15/20 x 0.20) / 0.25 = 0.15 / 0.25 = 0.60
And that is Bayes theorem:
P(H|e) = (P(e|H) x P(H)) / P(e)
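The same calculation can be sketched as a small Python function. The function name and parameter names are hypothetical, chosen for this illustration; the numbers are the rioters figures given above.

```python
def bayes_posterior(p_e_given_h, p_h, p_e):
    """Probability form of Bayes: P(H|e) = P(e|H) x P(H) / P(e)."""
    if p_e <= 0:
        raise ValueError("P(e) must be positive: the evidence must be possible")
    return p_e_given_h * p_h / p_e


p_h = 20 / 100          # P(H): 20 rioters among 100 islanders
p_e = 25 / 100          # P(e): 25 Yamahas among 100 motorbikes
p_e_given_h = 15 / 20   # P(e|H): 15 of the 20 rioters rode Yamahas

posterior = bayes_posterior(p_e_given_h, p_h, p_e)
print(f"P(H|e) = {posterior:.2f}")  # P(H|e) = 0.60
```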
It is worth looking again at the very first diagram.
Instead of H and e, I have used A and B. They are just areas. I have drawn them as circles. The overlap I have labelled as “AB” (to mean “A and B”).
What now follows, in case useful, is a description of what we have just been doing – that is, the simple maths behind Bayes theorem – uncomplicated by any thoughts of evidence or hypotheses. The fundamental mathematical principle is so simple, so trivially true, that it can get lost in worked examples.
The size of the overlap, AB, can be described, or calculated, in two ways:
as the proportion of B that A takes up, multiplied by the size of B; or,
as the proportion of A that B takes up, multiplied by the size of A.
These two ways to calculate the same thing must be equal to each other, so:
AB/A x A = AB/B x B
If this isn’t obvious, try some numbers on it. Imagine you are told (very roughly in line with the drawn sizes of the circles and overlap) that AB is a third the size of A, and A is 60 square meters. In that case, AB is 1/3 of 60, which is 20.
Or, you’re told that AB is one tenth the size of B, and that B is 200 square meters. In that case, AB can be calculated in a similar way: 1/10 x 200 = 20.
If A and B are events, or propositions (or hypotheses, or bits of evidence...) then AB as a proportion of B – that is, AB/B – is, by intuition, the probability of A given B: P(A|B).
So P(A|B) x P(B) = P(B|A) x P(A)
Rearranging slightly (dividing both sides by P(B)) gives:
P(A|B) = (P(B|A) x P(A)) / P(B)
and now using H and e instead of A and B:
P(H|e) = (P(e|H) x P(H)) / P(e)
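If it helps, the area argument can be checked with exact fractions. This is only a sketch: the areas of A, B and the overlap are the ones used in the text, while the total area of 1000 square meters is an assumption of mine, needed only to turn areas into probabilities.

```python
from fractions import Fraction

area_a = Fraction(60)     # size of A, from the text
area_b = Fraction(200)    # size of B, from the text
overlap = Fraction(20)    # size of AB, from the text

a_given_b = overlap / area_b   # AB as a proportion of B: 1/10
b_given_a = overlap / area_a   # AB as a proportion of A: 1/3

# Both routes recover the same overlap.
assert a_given_b * area_b == overlap
assert b_given_a * area_a == overlap

# Assumed total area (hypothetical), so that P(A) and P(B) are defined.
total = Fraction(1000)
p_a = area_a / total
p_b = area_b / total

# Rearranged form: P(A|B) = P(B|A) x P(A) / P(B)
assert a_given_b == b_given_a * p_a / p_b
print(a_given_b)  # 1/10
```

The choice of total area does not matter, because it cancels when P(A) is divided by P(B).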
P(e) written as its two parts
In some descriptions of Bayes’ theorem you will see the denominator, P(e), expanded so as to describe itself in terms of its two parts: the part where it overlaps with H, and the part where it does not. That is to say, “e-and-H” plus “e-and-not-H” – or in the rioters example, yellow-and-G plus yellow-and-not-G.
Here is the rioters diagram again with all areas fully labelled. Not-G is written as “¬G”. Purple, the opposite of yellow on the colour wheel, I have used to mean “not yellow”.
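There are four distinct areas. Using the symbol ∩ to mean ‘and’, and using Y for Yamaha, the four areas and their sizes are as set out in the table below.
Area | Meaning | Number of squares
G ∩ Y | guilty, rides a Yamaha | 15
G ∩ ¬Y | guilty, does not ride a Yamaha | 5
¬G ∩ Y | not guilty, rides a Yamaha | 10
¬G ∩ ¬Y | not guilty, does not ride a Yamaha | 70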
So the expanded version of Bayes theorem is:
P(H|e) = (P(e|H) x P(H)) / (P(e|H) x P(H) + P(e|¬H) x P(¬H))
An advantage of the expanded version is that sometimes the information you obtain about the probability of the new evidence makes it easier to calculate P(H|e) if you think of P(e) as consisting of these two separate parts.
For example, suppose that instead of “is guilty of riot” and “rides a Yamaha”, our hypothesis and evidence were “is infected with the disease” and “tests positive on our imperfectly reliable test”. Doctors might discover, by testing samples of a large population, that:
- 20% of people have the disease
- of the people who do not have the disease, 1/8 nevertheless test positive, and
- of the people who do have the disease, 3/4 test positive
The diagram of that information would be identical to our rioters diagram in Fig. 4. But it would be easier to calculate an updated probability, P(H|e), if P(e|H) x P(H) and P(e|¬H) x P(¬H) are written out as separate terms, because that is the form in which we were given the information about P(e).
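As a rough sketch, the disease figures above can be fed into the expanded form like this. The function and parameter names are hypothetical; the three inputs are exactly the three facts the doctors are said to have discovered.

```python
def bayes_expanded(p_h, p_e_given_h, p_e_given_not_h):
    """Expanded form: P(e) is split into its e-and-H and e-and-not-H parts."""
    e_and_h = p_e_given_h * p_h                # P(e|H) x P(H)
    e_and_not_h = p_e_given_not_h * (1 - p_h)  # P(e|¬H) x P(¬H)
    return e_and_h / (e_and_h + e_and_not_h)


posterior = bayes_expanded(
    p_h=0.20,               # 20% of people have the disease
    p_e_given_h=3 / 4,      # 3/4 of the infected test positive
    p_e_given_not_h=1 / 8,  # 1/8 of the uninfected test positive
)
print(f"P(disease | positive test) = {posterior:.2f}")  # 0.60
```

Note that P(e) never has to be supplied directly: it is rebuilt inside the function as 0.15 + 0.10 = 0.25, the same 25% that the Yamahas took up on the island.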
Some final points to note:
1 - In our rioters example, the updated probability of guilt increased. But if the numbers were different, it could – obviously – decrease. For example, if of the 25 Yamahas on the island, only two were ridden by rioters, then the finding of a Yamaha in a suspect’s garage would decrease the probability of guilt (from 0.20 to 2/25 = 0.08).
2 - If the probability of finding the evidence if the hypothesis is true is the same as the probability of finding the evidence if the hypothesis is false, then the evidence has no effect on the probability of the hypothesis. So for example, if of the island’s 25 Yamahas five were ridden by rioters (and therefore the other 20 by innocent people), the updated probability of guilt, P(H|e), would be (5/20 x 0.20) / 0.25 = 0.20, which is the same as the prior probability of guilt, P(H). And that is the test for all truly irrelevant evidence: it is equally as likely to be found among the guilty as among the innocent.
3 - For a given prior probability, the updated probability does not depend on the absolute probability of the evidence; rather, it just depends on the ratio of P(e|H) to P(e|¬H). For example, suppose in our rioters example – see Fig. 4 – the DVLA showed that there were only 5 Yamahas on the island, and only 3 were ridden by rioters. Our diagram would instead look like this:
Here, with many fewer Yamahas on the island than in our first example, our prior probability, P(H), is 0.20, as before. But P(e) is much smaller: 5/100 instead of 25/100. Let’s use the Bayes equation to calculate the updated probability, P(H|e), with these new numbers.
P(H|e) = (3/20 x 0.20) / 0.05 = 0.03 / 0.05 = 0.60
Which is the same as it was in the first example!
P(e) is different. As are both P(e|H) (3/20 = 0.15, instead of 15/20 = 0.75), and P(e|¬H) (2/80 = 0.025, instead of 10/80 = 0.125). But because the ratio of P(e|H) to P(e|¬H) is the same as it was in our first example (that is, 6:1), the evidence updates the probability of the hypothesis to exactly the same extent: from 0.20 to 0.60.
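The three points above can be explored together in a short Python sketch. It assumes, as throughout, an island of 20 rioters and 80 innocent people; the function name and scenario labels are my own. Each scenario is described only by how many rioters and how many innocent people ride Yamahas.

```python
from fractions import Fraction

RIOTERS, INNOCENTS = 20, 80  # island totals from the example


def update(rioters_with_yamaha, innocents_with_yamaha):
    """Return (P(H|e), ratio of P(e|H) to P(e|¬H)) as exact fractions."""
    yamahas = rioters_with_yamaha + innocents_with_yamaha
    posterior = Fraction(rioters_with_yamaha, yamahas)
    ratio = Fraction(rioters_with_yamaha, RIOTERS) / Fraction(
        innocents_with_yamaha, INNOCENTS
    )
    return posterior, ratio


scenarios = {
    "original: 15 of 25 Yamahas ridden by rioters": (15, 10),
    "point 1: 2 of 25": (2, 23),
    "point 2: 5 of 25": (5, 20),
    "point 3: 3 of 5": (3, 2),
}

results = {}
for label, counts in scenarios.items():
    posterior, ratio = update(*counts)
    results[label] = (posterior, ratio)
    print(f"{label}: P(H|e) = {float(posterior):.2f}, ratio = {float(ratio):.2f}")
```

A ratio above 1 pushes the probability of guilt up from 0.20, a ratio below 1 pushes it down, and a ratio of exactly 1 leaves it where it was. The first and last scenarios share a ratio of 6 and so reach the same updated probability, even though the number of Yamahas on the island is very different.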
This fact about the supreme importance of the sizes of P(e|H) and P(e|¬H) relative to each other, rather than their absolute sizes, leads us to a re-arrangement of the Bayes equation, known as the “odds form” of Bayes. This is easier to use when, as is often the case, our information about P(e) comes split into P(e|H) and P(e|¬H). And it is much easier to use when – as is often the situation in the courts – we must consider several different pieces of evidence in concert.
I will explain and explore the ‘odds form’ of the Bayes equation in a Part 2, soon.