A bit of maths: Removing Systematic Bias in Opinion Polls

For pollsters, systematic bias is a challenge that can be difficult to overcome. Some voters don't like to admit that they will vote for a particular candidate. Supporters of a given candidate may also be difficult for pollsters to find, yet they will still come out and vote. If we look at the US presidential polls in Florida in the weeks before the 2020 US Presidential election, we can see that the vast majority had Biden ahead. According to the poll aggregation by FiveThirtyEight, Biden was leading by 2.5% in the polls. However, in the end, Trump won convincingly. The polls were wrong by about 6%, which is a huge error. Many of the polls were better at predicting the other polls than they were at predicting the actual election result.

One approach to dealing with the systematic error would be to ask the respondents how they voted in a previous election. In the run-up to the 2020 Presidential election, the most relevant past election would be the 2016 Presidential election. We know the result of that past election, so if a pollster used the respondents' answers to estimate it, they could compare that estimate with the actual result and see whether there is a big error. If a methodology can't predict the past, then it will struggle to predict the future.

We can make an adjustment to the weighting of each respondent so that we (more or less) recover the previous election result. We can then apply that same adjusted weighting to the respondents' answers about who they will vote for in the next election. This adjustment will work well if, for example, there are many people who are reluctant to admit that they will vote for Trump, but who do so when they have anonymity in the real vote.

A pollster may have their own methodology for obtaining results from the data, but the approach outlined here may be useful when working out the confidence interval. Rather than using one model, with its assumptions, to work out the confidence interval, it may be wise to consider more than one set of modelling assumptions and then report a confidence interval that is wide enough to include the results from all the different models.

Now, to delve into some maths. I'm going to use the same notation as I used in my earlier post. After we've carried out a poll, we convert the answers to probabilities. We'll use the notation that \(P(r,c)\) is the probability that respondent \(r\) voted for candidate \(c\) in the given election.

So \( \sum_c P(r,c) \) is the probability that the respondent (r) will vote at all.

When we ask voters about how they voted in the past, most will give a definitive response, either that they voted for a particular candidate or that they didn't vote at all. However, in some cases respondents will say (in many cases with honesty) that they can't remember how they voted, or indeed whether they voted. In the model we'll work with here, we'll use the same probabilities as before, \(P(r,c)\). In the many cases where there is certainty, they will simply be 0's and 1's, but we can stick with the general case, in which there may be some uncertainty.

Using our data, we'll have a raw election result prediction. The proportion of votes for candidate \(c\) will be: \[ V(c) = \frac{ \sum_{r} P(r,c)}{ N } \] where the expected total number of votes cast by our respondents is: \[N = \sum_{r} \sum_{c} P(r,c) \] For a past election, about which we have asked people how they voted, we already know the result. We'll label the actual proportion of votes allocated to candidate \(c\) as \(A(c)\).
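
The raw quantities \(N\) and \(V(c)\) defined above can be computed directly from a table of probabilities. Here is a minimal Python sketch, assuming the responses are stored as a list of rows, one per respondent, with one column per candidate. The function name `raw_vote_shares` and the sample data are made up for illustration.

```python
def raw_vote_shares(P):
    """Return (V, N) for a table P[r][c] of vote probabilities.

    N is the expected number of votes cast by the respondents and
    V[c] is the expected share of those votes going to candidate c.
    """
    n_candidates = len(P[0])
    N = sum(sum(row) for row in P)
    V = [sum(row[c] for row in P) / N for c in range(n_candidates)]
    return V, N


# Hypothetical responses: three respondents, two candidates.
# The third respondent thinks there is a 50% chance they voted at all,
# and if so it was for the first candidate.
P = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.0],
]
V, N = raw_vote_shares(P)
print(N)  # 2.5
print(V)  # [0.6, 0.4]
```

Note that the uncertain respondent counts for only half a vote, both in \(N\) and in the numerator of \(V(c)\).
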
For a given candidate, say \( \kappa \), we may find a significant discrepancy between the actual result \(A(\kappa) \) and the raw result from our data, \(V(\kappa) \).
We can introduce an adjustment and use the parameter \(\alpha\) to obtain the adjusted probabilities \(P'(r,c,\alpha)\) and adjusted vote allocation estimates \(V'(c,\alpha)\).
When we have no adjustment \(\alpha=0\) and in that case: \[P'(r,c,0)= P(r,c)\] and \[V'(c,0)=V(c)\] The vote allocation to candidate \(\kappa\) will be zero when \(\alpha=-1\), so \(V'(\kappa,-1)=0\),
and we'll have a full vote allocation to candidate \(\kappa\) when \(\alpha=1\), so \(V'(\kappa,1)=1\).
We can achieve this with adjustments that are linear in \(\alpha\), both in the case when we want to increase the estimate for candidate \(\kappa\) ( \(\alpha\) is positive ) and in the case when we want to decrease the allocation for that candidate ( \( \alpha \) is negative ).
We'll deal with those two cases separately.
1: When we want to increase the estimate for candidate \( \kappa \).
In this case \[ P'(r,\kappa, \alpha) = P(r, \kappa) + \alpha Q(r,\kappa) \] where \[ Q(r,\kappa) = \left( \sum_c P(r,c) \right) - P(r, \kappa)\] and for \( c \ne \kappa\) \[ P'(r, c, \alpha) = P(r, c) (1 - \alpha ) \] with \[0 \leq \alpha \leq 1 \]

2: When we want to decrease the estimate for candidate \( \kappa \).
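In this case we set: \[ P'(r,\kappa, \alpha) = P(r, \kappa) (1 + \alpha) \] with \[-1 \leq \alpha \leq 0 \] and for \( c \ne \kappa \) when \( Q(r, \kappa) > 0 \): \[ P'(r,c,\alpha) = P(r, c) \left( 1 - \alpha \frac{P(r, \kappa) }{Q(r, \kappa)} \right) \] so that the probability removed from candidate \( \kappa \) is shared among the other candidates in proportion to \( P(r,c) \), and \( P'(r,c,0) = P(r,c) \) as required. When \( Q(r, \kappa) = 0 \) we have a similar formula for \( P'(r, c, \alpha) \) using \( V(c) \).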

We have chosen to adjust the probabilities in such a way that the expected number of votes \(N\) does not change, i.e. \(N' = N\).
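
The two cases can be sketched in Python as follows. This is only an illustration under stated assumptions, not a definitive implementation: the function names and sample data are hypothetical. In the decreasing case the sketch uses \( P(r,c)\,(1 - \alpha P(r,\kappa)/Q(r,\kappa)) \) for the other candidates, which is the form that leaves each respondent's total unchanged. For the \( Q(r,\kappa) = 0 \) case, which is only described in outline above, I have assumed that the removed probability is shared among the other candidates in proportion to \( V(c) \).

```python
def shares(P):
    """Raw vote shares V(c) from a table P[r][c] of vote probabilities."""
    N = sum(sum(row) for row in P)
    return [sum(row[c] for row in P) / N for c in range(len(P[0]))]


def adjust(P, kappa, alpha):
    """Return adjusted probabilities P'(r, c, alpha) for candidate index kappa."""
    V = shares(P)
    others = 1.0 - V[kappa]  # combined raw share of the other candidates
    adjusted = []
    for row in P:
        p_k = row[kappa]
        q = sum(row) - p_k  # Q(r, kappa)
        if alpha >= 0:
            # Case 1: move a fraction alpha of the other candidates' mass to kappa.
            new_row = [p * (1 - alpha) for p in row]
            new_row[kappa] = p_k + alpha * q
        elif q > 0:
            # Case 2: shrink kappa and share what is removed in proportion to P(r, c).
            new_row = [p * (1 - alpha * p_k / q) for p in row]
            new_row[kappa] = p_k * (1 + alpha)
        else:
            # Case 2 with Q = 0. ASSUMPTION: share what is removed in proportion to V(c).
            if others <= 0:
                raise ValueError("no other candidate to receive the removed votes")
            new_row = [-alpha * p_k * V[c] / others for c in range(len(row))]
            new_row[kappa] = p_k * (1 + alpha)
        adjusted.append(new_row)
    return adjusted


# Hypothetical data: four respondents, three candidates.
P = [
    [0.6, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.2, 0.2, 0.0],
]
for alpha in (0.25, -0.5):
    P_adj = adjust(P, kappa=0, alpha=alpha)
    total_before = sum(sum(row) for row in P)
    total_after = sum(sum(row) for row in P_adj)
    print(alpha, total_before, total_after, shares(P_adj))
```

The two totals printed on each line should agree up to rounding, which is the \(N' = N\) property.
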

We can choose the adjustment \( \alpha \) such that our adjusted vote estimate for candidate \( \kappa \) is equal to the actual (historical) election result.
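In that case: \( V'(\kappa,\alpha) = A(\kappa) \). Summing the adjusted probabilities over all respondents gives \( V'(\kappa,\alpha) = V(\kappa)(1+\alpha) \) in the decreasing case and \( V'(\kappa,\alpha) = V(\kappa) + \alpha \left(1 - V(\kappa)\right) \) in the increasing case, so the match can be achieved by choosing \[ \alpha = \frac{A(\kappa) - V(\kappa)}{V(\kappa)} \quad \text{when} \quad V(\kappa) > A(\kappa) \] and \[ \alpha = \frac{A(\kappa) - V(\kappa)}{1 - V(\kappa)} \quad \text{when} \quad V(\kappa) < A(\kappa) \]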
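
As a quick numerical check, here is a small Python sketch that works at the level of the aggregate shares. It relies on two relations that follow from summing the two adjustment cases over all respondents: \( V' = V(1+\alpha) \) when decreasing and \( V' = V + \alpha(1 - V) \) when increasing. The increasing-case \(\alpha\) in the code is obtained by solving the second relation for \( V' = A \). The function names and the numbers are hypothetical.

```python
def choose_alpha(v_kappa, a_kappa):
    """Return alpha so that the adjusted share of candidate kappa equals a_kappa.

    v_kappa is the raw poll share V(kappa), a_kappa the actual share A(kappa).
    """
    if a_kappa >= v_kappa:
        # Increasing case: V' = V + alpha * (1 - V), with 0 <= alpha <= 1.
        if v_kappa == 1.0:
            return 0.0  # nothing left to move towards kappa
        return (a_kappa - v_kappa) / (1.0 - v_kappa)
    # Decreasing case: V' = V * (1 + alpha), with -1 <= alpha <= 0.
    return (a_kappa - v_kappa) / v_kappa


def adjusted_share(v_kappa, alpha):
    """Adjusted aggregate share V'(kappa, alpha)."""
    if alpha >= 0:
        return v_kappa + alpha * (1.0 - v_kappa)
    return v_kappa * (1.0 + alpha)


# Hypothetical example: the poll understates a candidate (45% raw, 49% actual) ...
alpha_up = choose_alpha(0.45, 0.49)
# ... or overstates them (52% raw, 49% actual).
alpha_down = choose_alpha(0.52, 0.49)

print(alpha_up, adjusted_share(0.45, alpha_up))
print(alpha_down, adjusted_share(0.52, alpha_down))
```

In both cases the adjusted share printed should come back to the actual result, 0.49, up to rounding.
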

This approach could be used as follows, when we have the responses from an opinion poll which includes questions about both a historical election and a future election:
For each candidate in the historical election, choose the adjustment \( \alpha \) such that our data will match the historical result. Then apply that same adjustment to the data for the future election and record the result. Finally, choose a confidence interval that is wide enough to include all the adjusted results.
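
The procedure above can be sketched end to end in Python. This is a rough illustration under several assumptions of mine: the candidates (or parties) are in the same column order in both elections, the increasing-case \(\alpha\) is \( (A - V)/(1 - V) \), which follows from summing case 1 over respondents, the \( Q = 0 \) case shares the removed probability in proportion to \( V(c) \), and at least one other candidate has a non-zero raw share. All names and numbers are hypothetical.

```python
def shares(P):
    """Raw vote shares V(c) from a table P[r][c] of vote probabilities."""
    N = sum(sum(row) for row in P)
    return [sum(row[c] for row in P) / N for c in range(len(P[0]))]


def choose_alpha(v, a):
    """Alpha that moves the raw share v of a candidate onto the actual share a."""
    if a >= v:
        return (a - v) / (1.0 - v) if v < 1.0 else 0.0
    return (a - v) / v


def adjust(P, kappa, alpha):
    """Adjusted probabilities P'(r, c, alpha) for candidate index kappa."""
    V = shares(P)
    others = 1.0 - V[kappa]
    out = []
    for row in P:
        p_k = row[kappa]
        q = sum(row) - p_k  # Q(r, kappa)
        if alpha >= 0:
            new = [p * (1 - alpha) for p in row]
            new[kappa] = p_k + alpha * q
        elif q > 0:
            new = [p * (1 - alpha * p_k / q) for p in row]
            new[kappa] = p_k * (1 + alpha)
        else:
            # ASSUMPTION for Q = 0: share the removed mass in proportion to V(c).
            new = [-alpha * p_k * V[c] / others for c in range(len(row))]
            new[kappa] = p_k * (1 + alpha)
        out.append(new)
    return out


# Hypothetical poll of eight respondents, three candidates (same order in both tables).
P_past = [
    [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0], [0.0, 0.0, 0.0],
]
P_future = [
    [0.9, 0.1, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.8, 0.1],
    [0.0, 0.9, 0.0], [0.0, 0.2, 0.8], [0.4, 0.4, 0.0], [0.1, 0.1, 0.0],
]
A_past = [0.45, 0.45, 0.10]  # made-up "actual" result of the past election

V_past = shares(P_past)
alphas = [choose_alpha(V_past[k], A_past[k]) for k in range(len(A_past))]

# One adjusted forecast per candidate-specific adjustment.
results = [shares(adjust(P_future, k, alphas[k])) for k in range(len(alphas))]

# Range wide enough to contain every adjusted forecast for each candidate.
lows = [min(res[c] for res in results) for c in range(len(A_past))]
highs = [max(res[c] for res in results) for c in range(len(A_past))]

print("raw future shares:", shares(P_future))
for c in range(len(A_past)):
    print("candidate", c, "range:", lows[c], "to", highs[c])
```

Whether the raw forecast should also be included when setting the width of the interval is a judgment call; the sketch only spans the adjusted results, as described above.
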

Thursday, November 26, 2020
