Thursday, November 26, 2020

Removing Systematic Bias in Opinion Polls

For pollsters, systematic bias is a challenge that can be difficult to overcome. Some voters don't like to admit that they will vote for a particular candidate. Also, supporters of a given candidate may be difficult for the pollsters to find, but they will still come out and vote. If we look at the US presidential polls in Florida in the weeks before the 2020 US presidential election, we can see that the vast majority had Biden ahead. According to the poll aggregation by FiveThirtyEight, Biden was leading by 2.5% in the polls. However, in the end, Trump won convincingly. The polls were wrong by about 6 percentage points, which is a huge error. Many of the polls were better at predicting the other polls than they were at predicting the actual election result.

One approach to dealing with the systematic error would be to ask the respondents how they voted in a previous election. In the run-up to the 2020 presidential election, the most relevant past election would be the 2016 presidential election. We know the result of that past election. If the pollster were to use the answers from his respondents to estimate that past election, he could then compare the estimate with the actual result and see whether he has a big error. If a methodology can't predict the past, then it will struggle to predict the future.

We can make an adjustment to the weighting of each respondent so that we (more or less) recover the previous election result. We can then apply that same adjusted weighting to the respondents' answers about who they will vote for in the next election. This adjustment will work well if, for example, there are many people who are reluctant to admit that they vote for Trump but who, with the anonymity of the real vote, do so.

A pollster may have his own methodology to obtain his results from his data, but the approach outlined here may be useful when working out the confidence interval. Rather than using one model, with its assumptions, to work out the confidence interval, it may be wise to consider more than one set of modeling assumptions and then report a confidence interval that is wide enough to include the results from all the different models.

Now, to delve into some maths. I'm going to use the same notation as I used in my earlier post. After we've carried out a poll, we convert the answers to probabilities. We'll use the notation that P(r,c) is the probability that respondent r voted for candidate c in the given election.

So \( \sum_c P(r,c) \) is the probability that the respondent (r) will vote at all.

When we ask voters about how they voted in the past, most will give a definitive response: either that they voted for a particular candidate or that they didn't vote at all. However, in some cases respondents will say (in many cases with honesty) that they can't remember how they voted, or indeed whether they voted. In the model we'll work with here, we'll use the same probabilities as before, P(r,c), even though in many cases there will be certainty, in which case we'll have 0's and 1's. So we can stick with the general case, in which there may be some uncertainty.
Using our data, we'll have a raw election result prediction. The proportion of votes for candidate c will be: \[ V(c) = \frac{ \sum_{r} P(r,c)}{ N } \] where the expected total number of votes that will be cast by our respondents is: \[N = \sum_{r} \sum_{c} P(r,c) \] For a past election, when we have asked people about how they voted, we already know the results. We'll label the actual proportion of votes allocated to candidate c: \(A(c) \)
For a given candidate, say \( \kappa \), we may find a significant discrepancy between the actual result \(A(\kappa) \) and the raw result from our data, \(V(\kappa) \).
We can introduce an adjustment and use the parameter \(\alpha\) to obtain the adjusted probabilities \(P'(r,c,\alpha)\) and adjusted vote allocation estimates \(V'(c,\alpha)\).
When we have no adjustment, \(\alpha=0\), and in that case: \[P'(r,c,0)= P(r,c)\] and \[V'(c,0)=V(c)\] The vote allocation to candidate \(\kappa\) will be zero when \(\alpha=-1\), so \(V'(\kappa,-1)=0\),
and we'll have a full vote allocation to candidate \(\kappa\) when \(\alpha=1\), so \(V'(\kappa,1)=1\).
We can achieve this with adjustments that are linear in \(\alpha\), both in the case when we want to increase the estimate for candidate \(\kappa\) ( \(\alpha\) is positive) and in the case when we want to decrease the allocation for candidate \(\kappa\) ( \( \alpha \) is negative ).
We'll deal with those two cases separately.
1: When we want to increase the estimate for candidate \( \kappa \).
In this case \[ P'(r,\kappa, \alpha) = P(r, \kappa) + \alpha Q(r,\kappa) \] where \[ Q(r,\kappa) = \left( \sum_c P(r,c) \right) - P(r, \kappa)\] and for \( c \ne \kappa\) \[ P'(r, c, \alpha) = P(r, c) (1 - \alpha ) \] with \[0 \leq \alpha \leq 1 \]

2: When we want to decrease the estimate for candidate \( \kappa \).
In this case we set: \[ P'(r,\kappa, \alpha) = P(r, \kappa) (1 + \alpha) \] with \[-1 \leq \alpha \leq 0 \] and for \( c \ne \kappa \) when \( Q(r, \kappa) > 0 \): \[ P'(r,c,\alpha) = P(r, c) \left( 1 - \alpha \frac{P(r, \kappa) }{Q(r, \kappa)} \right) \] and when \( Q(r, \kappa) = 0 \) we have a similar formula for \( P'(r, c, \alpha) \) using \( V(c) \).

We have chosen to adjust the probabilities in such a way that the expected number of votes (N) does not change, i.e. N' = N

We can choose the adjustment \( \alpha \) such that our adjusted vote estimate for candidate \( \kappa \) is equal to the actual (historical) election result.
In that case: \( V'(\kappa,\alpha) = A(\kappa) \) That can be achieved by choosing \[ \alpha = \frac{A(\kappa) - V(\kappa)}{V(\kappa)} \hspace{5mm} when \hspace{2mm} V(\kappa) > A(\kappa) \] and \[ \alpha = \frac{A(\kappa) - V(\kappa)}{1 - V(\kappa)} \hspace{5mm} when \hspace{2mm} V(\kappa) < A(\kappa) \]
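As a quick check on those choices, here is a sketch of the algebra, assuming only the adjustments defined in cases 1 and 2. Summing the adjusted probabilities for \( \kappa \) over all respondents and dividing by N, and noting that \( \sum_r Q(r,\kappa) = N (1 - V(\kappa)) \), we get \[ V'(\kappa,\alpha) = V(\kappa) + \alpha \left( 1 - V(\kappa) \right) \hspace{5mm} when \hspace{2mm} 0 \leq \alpha \leq 1 \] and \[ V'(\kappa,\alpha) = V(\kappa) (1 + \alpha) \hspace{5mm} when \hspace{2mm} -1 \leq \alpha \leq 0 \] Setting each of these equal to \( A(\kappa) \) and solving for \( \alpha \) gives \( \alpha = (A(\kappa) - V(\kappa)) / (1 - V(\kappa)) \) for an increase and \( \alpha = (A(\kappa) - V(\kappa)) / V(\kappa) \) for a decrease. For example, with made-up numbers \( V(\kappa) = 0.4 \) and \( A(\kappa) = 0.5 \) we need an increase, so \( \alpha = 0.1 / 0.6 = 1/6 \), and indeed \( 0.4 + \frac{1}{6} \times 0.6 = 0.5 \).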

This approach could be used as follows, when we have the responses from an opinion poll that includes questions about both a historical poll and a future poll:
For each candidate in the historical poll, choose the adjustment \( \alpha \) such that our data will match the historical result. Then apply that same adjustment to the data for the future poll and record the result. Finally, choose a confidence interval that is wide enough to include all the adjusted results.
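The procedure above can be sketched in Python with NumPy. This is only an illustration under some assumptions of mine: the function names are made up, the candidates occupy the same columns in the historical and the future data, and where \( Q(r, \kappa) = 0 \) the freed probability is shared among the other candidates in proportion to \( V(c) \), which is one possible reading of the "similar formula" mentioned earlier. All the numbers are invented.

```python
import numpy as np

def vote_shares(P):
    """V(c): each candidate's share of the expected votes cast."""
    P = np.asarray(P, dtype=float)
    return P.sum(axis=0) / P.sum()

def adjust(P, kappa, alpha):
    """Adjusted probabilities P'(r, c, alpha) for candidate index kappa."""
    P = np.asarray(P, dtype=float)
    Q = P.sum(axis=1) - P[:, kappa]              # Q(r, kappa)
    if alpha >= 0:
        # Case 1: move a fraction alpha of the other candidates' votes to kappa.
        out = P * (1.0 - alpha)
        out[:, kappa] = P[:, kappa] + alpha * Q
        return out
    # Case 2: scale kappa down and hand the freed probability to the others.
    freed = -alpha * P[:, kappa]
    out = P.copy()
    has_others = Q > 0
    out[has_others] *= (1.0 + freed[has_others] / Q[has_others])[:, None]
    # Assumption: where Q(r, kappa) = 0, share the freed probability among
    # the other candidates in proportion to their raw shares V(c).
    w = vote_shares(P)
    w[kappa] = 0.0
    if w.sum() > 0:
        out[~has_others] += freed[~has_others][:, None] * (w / w.sum())
    out[:, kappa] = P[:, kappa] * (1.0 + alpha)
    return out

def choose_alpha(v, a):
    """Alpha that moves a raw share v onto the actual share a."""
    if a == v:
        return 0.0
    if a > v:
        return (a - v) / (1.0 - v)               # increase (case 1)
    return (a - v) / v                           # decrease (case 2)

def forecast_range(P_past, P_next, A_past):
    """Lowest and highest adjusted share per candidate for the next election."""
    V_past = vote_shares(P_past)
    results = []
    for kappa in range(len(A_past)):
        alpha = choose_alpha(V_past[kappa], A_past[kappa])
        results.append(vote_shares(adjust(P_next, kappa, alpha)))
    results = np.array(results)
    return results.min(axis=0), results.max(axis=0)

# Made-up data: 4 respondents, 2 candidates.
P_past = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.3]])
P_next = np.array([[0.9, 0.0], [0.6, 0.3], [0.1, 0.8], [0.4, 0.4]])
A_past = np.array([0.5, 0.5])                    # actual historical shares
low, high = forecast_range(P_past, P_next, A_past)
print(low, high)
```

The range returned here only spans the adjusted results, as described above; a pollster might well want to widen it further to include the raw, unadjusted estimate.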

Tuesday, November 24, 2020

Dealing with people's uncertainty in opinion polls

When asked about their voting preferences, some people will answer with honesty that they definitely do intend to vote and that they have absolutely made up their mind who to vote for. Other people will answer that they will probably vote and that they will most likely vote for a given candidate, but could possibly be persuaded to change to another between now and election day. For the pollster, it may seem tempting to treat those who say they will most likely vote for a given candidate the same as those who say they will definitely vote for that candidate. However, by ignoring that stated uncertainty, some useful information is being thrown away.

Suppose a pollster were to ask a member of the public "will you vote in the next election?" The respondent could be given a list of possible answers to choose from: definitely yes, probably yes, probably not, maybe, definitely not. Those answers in natural language can be converted to appropriate probabilities. When we combine all the answers, we can find the expected number of respondents who will vote.

We can also ask the respondents how likely they would be to vote for each of the candidates. It would be wise to offer answers in English rather than as probabilities, since some people won't be familiar with probabilities. From the answers, we can then obtain estimates for the probabilities that they will vote for each candidate, conditional on turning up and voting on election day. We can combine (multiply) the probability of voting (at all) with the probability of voting for a given candidate (conditional on voting) to get the overall probability that respondent (r) will vote for a candidate (c). We'll denote this probability P(r,c).
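As a rough illustration of that conversion, here is a small Python sketch. The answer wordings, the probability attached to each answer, and the function name are all assumptions of mine chosen for the example; a real pollster would want to calibrate them.

```python
# Hypothetical probabilities of voting at all for each turnout answer.
TURNOUT_PROB = {
    "definitely yes": 1.0,
    "probably yes": 0.75,
    "maybe": 0.5,
    "probably not": 0.25,
    "definitely not": 0.0,
}

# Hypothetical weights for answers about each candidate; they are
# normalised so the conditional probabilities sum to 1.
PREFERENCE_WEIGHT = {
    "definitely": 1.0,
    "most likely": 0.75,
    "possibly": 0.25,
    "definitely not": 0.0,
}

def respondent_probabilities(turnout_answer, candidate_answers):
    """Return P(r, c) for one respondent as a dict keyed by candidate."""
    p_vote = TURNOUT_PROB[turnout_answer]
    weights = {c: PREFERENCE_WEIGHT[a] for c, a in candidate_answers.items()}
    total = sum(weights.values())
    if total == 0:
        # No stated support for anyone: contribute nothing to any candidate.
        return {c: 0.0 for c in weights}
    # P(r, c) = P(votes at all) * P(votes for c, given that they vote)
    return {c: p_vote * w / total for c, w in weights.items()}

p = respondent_probabilities(
    "probably yes",
    {"candidate 1": "most likely", "candidate 2": "possibly"},
)
print(p)
```

The values for one respondent sum to that respondent's probability of voting at all, not to 1.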

The expected portion of votes for candidate c will be: \[ V(c) = \frac{ \sum_{r=1}^{n} P(r,c)} { N } \] where n is the number of respondents
and the expected total number of votes that will be cast by our respondents is calculated: \[N = \sum_{r} \sum_{c} P(r,c) \]
For example, suppose we examine the answers from a respondent and we deem there to be a 20% chance that she will not vote, a 60% chance that she'll vote for candidate 1 and a 20% chance that she'll vote for candidate 2. Rather than treat her as a full supporter of candidate 1, we can use those probabilities as her contribution to each outcome.
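To make the formulas for V(c) and N concrete, here is a minimal Python sketch. The first row is the respondent from the example above; the other two respondents and the function name are made up for illustration.

```python
def expected_result(P):
    """Return (N, V) where P[r][c] is the probability P(r, c)."""
    # N: expected total number of votes cast by the respondents.
    N = sum(sum(row) for row in P)
    n_candidates = len(P[0])
    # V(c): expected proportion of those votes going to candidate c.
    V = [sum(row[c] for row in P) / N for c in range(n_candidates)]
    return N, V

P = [
    [0.6, 0.2],  # the example respondent: 20% chance of not voting
    [1.0, 0.0],  # made up: certain to vote for candidate 1
    [0.0, 0.5],  # made up: 50% chance of voting, only for candidate 2
]
N, V = expected_result(P)
print(N, V)
```

Note that the first respondent contributes only 0.8 of a vote in total, split between the two candidates, rather than one full vote for candidate 1.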

Friday, August 7, 2020

Graphics for on target and misses

A football game has come to an end and we have the data on the number of shots taken by each team that were on target and the number that were misses. If we want to present that in a nice graphic, we could display it like a goal, with the area in the goal representing the number of shots on target and the area outside representing the misses. The graphic is divided into a left and a right section, one for each team. But how do we choose the area of each?
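[Graphic: a goal drawn inside a rectangle, with Team A on the left and Team B on the right; the area inside the goal is labelled "On Target" and the area outside is labelled "Misses".]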

Let's start by naming our variables:
\(t_a \) is the number of shots that A got on target.
\(t_b \) is the number of shots that B got on target.
\(m_a \) is the number of misses by A.
\(m_b \) is the number of misses by B.
h is the height of the overall rectangle in which we will draw.
w is the width of the rectangle.
l is the distance from the bottom left corner to the bottom of the left post.
r is the distance from the bottom left corner to the bottom of the right post.
d is the distance from the bottom left corner to the bottom of the divider between the area for team A on the left and the area for team B on the right.
c is the height of the crossbar.

We would like the area representing each team to be proportional to the number of shots that they took and so we'll set: \[ \frac{d}{w-d} = \frac{t_a + m_a}{t_b + m_b} \] and so \[ d = \frac{w(t_a + m_a)}{t_a + m_a + t_b + m_b} \hspace{22 mm} eqn(1)\]
For team A we want the ratio:
in-goal area (for team A) to total area (for team A)
to be equal to the ratio:
shots on target to total shots (for team A).
Writing that mathematically: \[ \frac{c(d-l)}{hd} = \frac{t_a}{t_a + m_a} \] we can now get an expression for l: \[ l = d - \frac{t_a h d }{c (t_a + m_a)} \hspace{22 mm} eqn(2)\] Similarly for the right hand side areas representing team B's shots: \[ \frac{c(r-d)}{h(w-d)} = \frac{t_b}{t_b + m_b} \] and we can now get an expression for r: \[r = d + \frac{t_b h (w - d) }{c(t_b + m_b)} \hspace{22 mm} eqn(3) \] We now have expressions for l and r that use the crossbar height c, but that has not yet been determined.
It turns out that there is more than one height that the crossbar could be set to. We could, for example, try to keep fixed the ratio of the height of the crossbar to the width of the goal (r-l). However, one challenge with that fixed ratio is that we sometimes find that the goal then protrudes outside the containing rectangle (mathematically, either r>w or l<0).
One approach that seems to work quite consistently is to determine which team is more accurate at shooting and then place the top corner of the goal on that team's side directly on the line between the base of the divider and the top corner of the outer rectangle. In the diagram above, the green line segment goes through both the top right of the goal and the top right of the outer rectangle.
We can write that mathematically as:
\[If \hspace{24 mm} \frac{t_a}{m_a} > \frac{t_b}{m_b} \hspace{80 mm} \] \[then \hspace{20 mm} \frac{c}{d-l} = \frac{h}{d} \hspace{22 mm} \Longrightarrow c = h \sqrt{\frac{t_a}{t_a + m_a}} \hspace{4 mm} eqn(4) \] \[otherwise \hspace{ 8 mm} \frac{c}{r-d} = \frac{h}{w-d} \hspace{16 mm} \Longrightarrow c = h \sqrt{\frac{t_b}{t_b + m_b}} \hspace{4 mm} eqn(5) \]
So, using either equation 4 or 5 and then equations 1, 2 and 3 above, we can calculate the dimensions of the goal and the position of the divider. We can code it up and draw our graphic!
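As a sketch of that calculation (separate from the JavaScript that draws the graphic on this page), here is a small Python function that applies equations 1 to 5. The function name is my own. One deliberate difference: it compares the accuracies t/(t+m) rather than t/m, which gives the same ordering when both teams have misses and avoids dividing by zero when a team has none. It assumes each team took at least one shot and that at least one shot overall was on target.

```python
from math import sqrt

def goal_layout(t_a, m_a, t_b, m_b, w, h):
    """Return (d, c, l, r) for a graphic of width w and height h."""
    shots_a, shots_b = t_a + m_a, t_b + m_b
    if shots_a == 0 or shots_b == 0:
        raise ValueError("each team needs at least one shot")
    acc_a, acc_b = t_a / shots_a, t_b / shots_b
    if max(acc_a, acc_b) == 0:
        raise ValueError("at least one shot must be on target")
    # eqn (1): divider position, proportional to each team's total shots
    d = w * shots_a / (shots_a + shots_b)
    # eqn (4) or (5): crossbar height set by the more accurate team
    c = h * sqrt(max(acc_a, acc_b))
    # eqn (2): left post
    l = d - t_a * h * d / (c * shots_a)
    # eqn (3): right post
    r = d + t_b * h * (w - d) / (c * shots_b)
    return d, c, l, r

# Made-up match: A has 9 on target and 7 misses, B has 4 on target and 12 misses.
d, c, l, r = goal_layout(9, 7, 4, 12, w=200, h=100)
print(d, c, l, r)
```

With these made-up numbers, team A's in-goal area is c(d-l) = 75 x 75 = 5625 out of a total of 100 x 100 = 10000 for its half, which matches its 9 out of 16 shots on target.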
If you'd like to see some JavaScript code that does the job, then you can find the source code in the GoalAttemptsInfoGraphic repository on GitHub, or indeed you could look at the source code for this page.
On the other hand, if you want to see what a badly drawn graphic looks like, one which does not size its areas appropriately, then have a look at this article in The Guardian.
Anyway, if you'd like to use the ideas suggested in this blog post, then please do, and indeed help yourself to the embedded JavaScript. I won't come after you looking for money.