Thursday, November 26, 2020

Removing Systematic Bias in Opinion Polls

For pollsters, a challenge that can be difficult to overcome is systematic bias. There are voters who don't like to admit that they will vote for a particular candidate. Also, supporters of a given candidate may be difficult for the pollsters to reach, but they will come out and vote. If we look at the US presidential polls in Florida in the weeks before the 2020 US Presidential election, we can see that the vast majority had Biden ahead. According to the poll aggregation by FiveThirtyEight, Biden was leading by 2.5% in the polls. However, in the end, Trump won convincingly. The polls were wrong by about 6%, which is a huge error. Many of the polls were better at predicting the other polls than they were at predicting the actual election result.

One approach to deal with the systematic error would be to ask the respondents how they voted in a previous election. In the run-up to the 2020 Presidential election, the most relevant past election would be the 2016 Presidential election. We know the result of that past election. If the pollster were to use the answers from his respondents to estimate that past election, he could then compare that estimate with the actual result and see whether he has a big error. If a methodology can't predict the past, then it will struggle to predict the future.

We can make an adjustment to the weighting of each respondent so that we (more or less) recover the previous election result. We can then apply that same adjusted weighting to the respondents' answers about who they will vote for in the next election. This adjustment will work well if, for example, there are many people who are reluctant to admit that they will vote for Trump, but who, when they have anonymity in the real vote, do so.

A pollster may have his own methodology to obtain his results from his data, but the approach outlined here may be useful when working out the confidence interval. Rather than using one model, with its assumptions, to work out the confidence interval, it may be wise to consider more than one set of modelling assumptions and then report a confidence interval that is wide enough to include the results from all the different models.

Now, to delve into some maths. I'm going to use the same notation as I used in my earlier post. After we've carried out a poll, we convert the answers to probabilities. We'll use the notation that P(r,c) is the probability that respondent r votes for candidate c in the given election.

So \( \sum_c P(r,c) \) is the probability that the respondent (r) will vote at all.

When we ask voters about how they voted in the past, most will give a definitive response, either that they voted for a particular candidate or that they didn't vote at all. However, in some cases respondents will say (in many cases honestly) that they can't remember how they voted, or indeed whether they voted. In the model we'll work with here, we'll use the same probabilities as before (P(r,c)), even though in many cases there will be certainty, in which case we'll have 0's and 1's. But we can stick with the general case, where there may be some uncertainty. Using our data, we'll have a raw election result prediction. The proportion of votes for candidate c will be: \[ V(c) = \frac{ \sum_{r} P(r,c)}{ N } \] and the expected total number of votes that will be cast by our respondents is: \[N = \sum_{r} \sum_{c} P(r,c) \] For a past election, when we have asked people about how they voted, we already know the result. We'll label the actual proportion of votes allocated to candidate c: \(A(c) \)
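To make the bookkeeping concrete, here is a minimal Python sketch (using NumPy) of how \(V(c)\) and \(N\) might be computed. The array `P`, with one row per respondent and one column per candidate, is just an assumed layout for illustration, not anything prescribed above.

```python
import numpy as np

# Assumed layout: P[r, c] is the probability that respondent r votes for candidate c.
# Rows need not sum to 1; any shortfall is the probability of not voting at all.
P = np.array([
    [0.6, 0.2],   # 60% candidate 1, 20% candidate 2, 20% won't vote
    [0.9, 0.0],
    [0.1, 0.7],
])

# Expected total number of votes cast by our respondents: N = sum_r sum_c P(r, c)
N = P.sum()

# Raw estimate of the vote share for each candidate: V(c) = sum_r P(r, c) / N
V = P.sum(axis=0) / N

print("N =", N)   # 2.5
print("V =", V)   # [0.64 0.36]
```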
For a given candidate, say \( \kappa \), we may find a significant discrepancy between the actual result \(A(\kappa) \) and the raw result from our data, \(V(\kappa) \).
We can introduce an adjustment and use the parameter \(\alpha\) to obtain the adjusted probabilities \(P'(r,c,\alpha)\) and adjusted vote allocation estimates \(V'(c,\alpha)\).
When we have no adjustment, \(\alpha=0\), and in that case: \[P'(r,c,0)= P(r,c)\] and \[V'(c,0)=V(c)\] The vote allocation to candidate \(\kappa\) will be zero when \(\alpha=-1\), so \(V'(\kappa,-1)=0\),
and we'll have a full vote allocation to candidate \(\kappa\) when \(\alpha=1\), so \(V'(\kappa,1)=1\).
We can achieve this with adjustments that are linear in \(\alpha\), both in the case when we want to increase the estimate for candidate \(\kappa\) (\(\alpha\) positive) and in the case when we want to decrease the allocation for candidate \(\kappa\) (\(\alpha\) negative).
We'll deal with those two cases separately.
1: When we want to increase the estimate for candidate \( \kappa \).
In this case \[ P'(r,\kappa, \alpha) = P(r, \kappa) + \alpha Q(r,\kappa) \] where \[ Q(r,\kappa) = \left( \sum_c P(r,c) \right) - P(r, \kappa)\] and for \( c \ne \kappa\) \[ P'(r, c, \alpha) = P(r, c) (1 - \alpha ) \] with \[0 \leq \alpha \leq 1 \]
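A minimal Python sketch of this increase adjustment, assuming the same matrix layout as in the earlier snippet (the function name `increase_kappa` is mine, not an established routine):

```python
import numpy as np

def increase_kappa(P, kappa, alpha):
    """Case 1: shift probability towards candidate kappa (0 <= alpha <= 1).

    P'(r, kappa) = P(r, kappa) + alpha * Q(r, kappa),
    P'(r, c)     = P(r, c) * (1 - alpha)              for c != kappa,
    where Q(r, kappa) = sum_c P(r, c) - P(r, kappa).
    """
    P = np.asarray(P, dtype=float)
    Q = P.sum(axis=1) - P[:, kappa]        # per-respondent probability of voting for anyone else
    P_adj = P * (1.0 - alpha)              # scale down the other candidates
    P_adj[:, kappa] = P[:, kappa] + alpha * Q
    return P_adj
```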

2: When we want to decrease the estimate for candidate \( \kappa \).
In this case we set: \[ P'(r,\kappa, \alpha) = P(r, \kappa) (1 + \alpha) \] with \[-1 \leq \alpha \leq 0 \] and for \( c \ne \kappa \), when \( Q(r, \kappa) > 0 \): \[ P'(r,c, \alpha) = P(r, c) \left(1 - \alpha \frac{P(r, \kappa) }{Q(r, \kappa)} \right) \] and when \( Q(r, \kappa) = 0 \) we have a similar formula for \( P'(r, c, \alpha) \), redistributing in proportion to \( V(c) \) rather than \( P(r,c) \).
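And a corresponding sketch of the decrease adjustment. The exact handling of the \( Q(r,\kappa)=0 \) fallback here is my reading of "using \(V(c)\)" and should be treated as an assumption:

```python
import numpy as np

def decrease_kappa(P, kappa, alpha):
    """Case 2: shift probability away from candidate kappa (-1 <= alpha <= 0).

    P'(r, kappa) = P(r, kappa) * (1 + alpha), and the amount removed,
    -alpha * P(r, kappa), is redistributed over the other candidates in
    proportion to P(r, c) (or to the raw shares V(c) when Q(r, kappa) = 0).
    """
    P = np.asarray(P, dtype=float)
    N = P.sum()
    V = P.sum(axis=0) / N
    Q = P.sum(axis=1) - P[:, kappa]
    removed = -alpha * P[:, kappa]          # probability mass taken away from kappa

    P_adj = P.copy()
    P_adj[:, kappa] = P[:, kappa] * (1.0 + alpha)

    other = np.ones(P.shape[1], dtype=bool)
    other[kappa] = False

    for r in range(P.shape[0]):
        if Q[r] > 0:
            weights = P[r, other] / Q[r]
        else:
            # Assumed fallback: spread in proportion to the raw shares of the other candidates.
            weights = V[other] / V[other].sum()
        P_adj[r, other] += removed[r] * weights

    return P_adj
```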

We have chosen to adjust the probabilities in such a way that the expected number of votes (N) does not change, i.e. N' = N
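A quick check, using the matrix `P` and the hypothetical helpers from the sketches above, that both adjustments do leave the expected number of votes unchanged:

```python
# Sanity check: both adjustments preserve the expected number of votes, N' = N.
for P_adj in (increase_kappa(P, 0, 0.3), decrease_kappa(P, 0, -0.4)):
    assert np.isclose(P_adj.sum(), P.sum())
```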

We can choose the adjustment \( \alpha \) such that our adjusted vote estimate for candidate \( \kappa \) is equal to the actual (historical) election result.
In that case \( V'(\kappa,\alpha) = A(\kappa) \). Since the adjustments are linear in \(\alpha\), the adjusted share is \( V'(\kappa,\alpha) = V(\kappa)(1+\alpha) \) when we are decreasing and \( V'(\kappa,\alpha) = V(\kappa) + \alpha(1 - V(\kappa)) \) when we are increasing, so the match can be achieved by choosing \[ \alpha = \frac{A(\kappa) - V(\kappa)}{V(\kappa)} \hspace{5mm} \text{when} \hspace{2mm} V(\kappa) > A(\kappa) \] and \[ \alpha = \frac{A(\kappa) - V(\kappa)}{1 - V(\kappa)} \hspace{5mm} \text{when} \hspace{2mm} V(\kappa) < A(\kappa) \]
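In code, the choice of \(\alpha\) might look like the following sketch (`calibrate_alpha` is a made-up name, not an established routine):

```python
def calibrate_alpha(V_kappa, A_kappa):
    """Choose alpha so that the adjusted share for kappa equals the actual result A(kappa)."""
    if V_kappa > A_kappa:
        # Decrease case: V'(kappa, alpha) = V(kappa) * (1 + alpha)
        return (A_kappa - V_kappa) / V_kappa
    elif V_kappa < A_kappa:
        # Increase case: V'(kappa, alpha) = V(kappa) + alpha * (1 - V(kappa))
        return (A_kappa - V_kappa) / (1.0 - V_kappa)
    return 0.0
```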

This approach could be used as follows, when we have the responses from an opinion poll that includes questions both about a past election and about the upcoming one:
For each candidate in the past election, choose the adjustment \( \alpha \) such that our data matches the historical result. Then apply that same adjustment to the data for the upcoming election and record the result. Finally, choose a confidence interval wide enough to include all of the adjusted results.
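Putting the pieces together with the hypothetical helpers above, the whole procedure might be sketched like this, where `P_past` and `P_future` are the probability matrices built from the answers about the past and upcoming elections, and `A` holds the actual historical shares (all names are mine):

```python
import numpy as np

def adjusted_future_shares(P_past, P_future, A):
    """For each candidate, calibrate alpha on the past election and apply the
    same adjustment to the future-election answers, collecting the adjusted shares."""
    V_past = P_past.sum(axis=0) / P_past.sum()
    scenarios = []
    for kappa in range(P_past.shape[1]):
        alpha = calibrate_alpha(V_past[kappa], A[kappa])
        adjust = increase_kappa if alpha >= 0 else decrease_kappa
        P_adj = adjust(P_future, kappa, alpha)
        scenarios.append(P_adj.sum(axis=0) / P_adj.sum())
    # Report an interval wide enough to include all of the adjusted estimates.
    scenarios = np.array(scenarios)
    return scenarios.min(axis=0), scenarios.max(axis=0)
```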

Tuesday, November 24, 2020

Dealing with people's uncertainty in opinion polls

When asked about their voting preferences, some people will answer with honesty that they definitely do intend to vote and that they have absolutely made up their mind who to vote for. Other people will answer that they will probably vote and that they will most likely vote for a given candidate, but could possibly be persuaded to change to another between now and election day. For the pollster, it may seem tempting to treat those who say they will most likely vote for a given candidate the same as those who say they will definitely vote for that candidate. However, by ignoring that stated uncertainty, some useful information is being thrown away.

Suppose a pollster were to ask a member of the public "will you vote in the next election?" The respondent could be given a list of possible answers to choose from: definitely yes, probably yes, maybe, probably not, definitely not. Those answers in natural language can be converted to appropriate probabilities. When we combine all the answers, we can find the expected number of respondents who will vote.

We can also ask the respondents the likelihood that they would vote for each of the candidates. It would be wise to offer answers in English rather than as probabilities, since some people won't be familiar with probabilities. From the answers, we can then obtain estimates for the probabilities that they will vote for each candidate, conditional on turning up and voting on election day. We can combine (multiply) the probability of voting (at all) with the probability of voting for a given candidate (conditional on voting) to get the overall probability that respondent r will vote for candidate c; we'll denote this probability P(r,c).
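As a small illustration, here is one way this could be coded up in Python. The particular wording-to-probability mappings below are entirely illustrative, not a recommended calibration:

```python
# Illustrative mapping from natural-language answers to probabilities.
TURNOUT = {
    "definitely yes": 1.0,
    "probably yes": 0.8,
    "maybe": 0.5,
    "probably not": 0.2,
    "definitely not": 0.0,
}
SUPPORT = {  # probability of voting for a candidate, conditional on voting at all
    "definitely": 1.0,
    "most likely": 0.75,
    "possibly": 0.25,
    "definitely not": 0.0,
}

def vote_probability(turnout_answer, support_answer):
    """P(r, c) = P(r votes at all) * P(r votes for c | r votes)."""
    return TURNOUT[turnout_answer] * SUPPORT[support_answer]

# e.g. someone who will "probably yes" vote and will "most likely" back candidate c:
print(vote_probability("probably yes", "most likely"))   # 0.8 * 0.75 = 0.6
```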

The expected proportion of votes for candidate c will be: \[ V(c) = \frac{ \sum_{r=1}^{n} P(r,c)} { N } \] where n is the number of respondents
and the expected total number of votes that will be cast by our respondents is calculated: \[N = \sum_{r} \sum_{c} P(r,c) \]
For example, suppose we examine the answers from a respondent and we deem there to be a 20% chance that she will not vote, a 60% chance that she'll vote for candidate 1 and a 20% chance that she'll vote for candidate 2. Rather than treat her as a full supporter of candidate 1, we can use those probabilities as her contribution to each outcome.
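In the notation above, this respondent contributes \( P(r,1) = 0.6 \) and \( P(r,2) = 0.2 \) to the numerators of \( V(1) \) and \( V(2) \) respectively, and \( P(r,1) + P(r,2) = 0.8 \) to the expected vote count \( N \).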