Tuesday, December 28, 2021

False positives and false negatives

When testing for the presence of a disease, the results obtained are often not 100% reliable. In this article we explain false positives and false negatives in test results. The graphics are dynamic: if you change the input numbers, they will update.
Let's start with the prevalence of the disease, which is the proportion of positive cases in the population.
Disease prevalence input: %


Light red shows the proportion of the population that have the disease, i.e. are positive.
Light blue shows the proportion of the population that don't have the disease, i.e. are negative.


The false negative rate is the proportion of the positive cases that are incorrectly deemed to be negative.
False negative input: %


Dull red shows the true positives: the proportion of the population that are positive and get positive test results.
Orange shows the false negatives: the proportion of the population that are positive but get negative test results.


The false positive rate is the proportion of the negative cases that are incorrectly deemed to be positive.
False positive input: %


Yellow shows the false positives: the proportion of the population that are negative but get positive test results.
Dull blue shows the true negatives: the proportion of the population that are negative and get negative test results.


One challenge when considering an individual positive test result is that it is not clear whether it is a true positive or a false positive. We can look at the likelihood of each:

Dull red shows the true positives: the proportion of the population that are positive and get positive test results.
Yellow shows the false positives: the proportion of the population that are negative but get positive test results.

Using the values that were input above for prevalence, false positives and false negatives,
we find that of the positives:
True positives:
False positives:
When the prevalence is low, it is common to have a situation where there are more false positives than true positives. You can try this out for yourself by reducing the prevalence number above.
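
The split of positive results into true and false positives can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `positive_breakdown` and the example input values are assumptions chosen for this sketch, not values taken from the interactive graphics.

```python
def positive_breakdown(prevalence, false_negative_rate, false_positive_rate):
    """Split all positive test results into true and false positives.

    All three arguments are fractions between 0 and 1.
    Returns (true_positive_share, false_positive_share), where each share
    is a fraction of the positive test results.
    """
    # Positive cases that the test correctly reports as positive.
    true_positives = prevalence * (1 - false_negative_rate)
    # Negative cases that the test incorrectly reports as positive.
    false_positives = (1 - prevalence) * false_positive_rate
    total_positives = true_positives + false_positives
    if total_positives == 0:
        raise ValueError("no positive test results with these inputs")
    return true_positives / total_positives, false_positives / total_positives


# Illustrative inputs only: 1% prevalence, 10% false negatives, 5% false positives.
tp_share, fp_share = positive_breakdown(0.01, 0.10, 0.05)
print(f"True positives:  {tp_share:.1%}")
print(f"False positives: {fp_share:.1%}")
```

With these example inputs, roughly 15% of the positive results are true positives and roughly 85% are false positives, which illustrates the low-prevalence effect.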

Wednesday, December 8, 2021

The Mathematics of a Ponzi scheme

A Ponzi Scheme is a form of investment where profits are paid to early investors with funds from later investors. It often involves fraud and as a result it is of course illegal.
This article will explore the mathematics of a Ponzi Scheme which has been transparent about its intentions. Suppose someone were to set up a fund that aims to pay investors a return of \(\rho\) (rho).
I.e. for each unit of currency invested, it will aim to return \(1+\rho\).
The fund pays out on a first in, first out basis. Early investor money is returned when (or if) later money arrives.
Suppose, at a point in time, the cumulative money invested is N.
The money that's been paid out is P and the fund has remaining money R.
We know that
(Investment in) = (Payouts made) + (Remaining money)
\[N = P + R\] We can classify the investment so far into two categories. There is the money that has already been settled S, i.e. the early investors that have already received their funds returned with a positive return \(\rho\) .
And there is the balance outstanding B, that remains unsettled as yet.
So we can write:
(Investment in) = (Settled amount) + (Balance outstanding) \[N = S + B\]
The early investors who have received their settlement all got a return of \(\rho\).
So we know that the payout: \[P=S(1+\rho)\] The fund administrator has guaranteed that the maximum loss that could be imposed on an investor is \(\lambda\) (lambda).
So for each unit of currency invested, the worst case scenario for the investor is that \((1-\lambda )\) is returned.
Hence we require the remaining funds (R) to be able to cover a payment of the balance (B) with an imposed loss. \[R= B( 1 - \lambda )\] We now have 4 equations and 4 unknowns (P,R,S,B) along with three knowns (N, \(\rho\), \(\lambda\) ).
If we do the algebra, we can find our unknowns:
Settled amount: \(S = N \frac{\lambda}{\lambda + \rho} \)
Balance outstanding: \(B=N-S\)
Payouts made: \(P=S (1+ \rho) \)
Remaining money: \(R=N-P \)
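
For readers who want to see the algebra, here is one way to reach the formula for S, using only the four equations above. Substitute \(P=S(1+\rho)\), \(R=B(1-\lambda)\) and \(B=N-S\) into \(N=P+R\):
\[N = S(1+\rho) + (N-S)(1-\lambda)\]
Expanding the right-hand side and collecting the terms in S gives:
\[N = S(\rho+\lambda) + N(1-\lambda)\]
\[N\lambda = S(\lambda+\rho)\]
\[S = N \frac{\lambda}{\lambda+\rho}\]
The other three unknowns then follow directly from S.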


We can try that out with some numbers:
Inputs
Total invested funds (N) units of currency
Investment return (\(\rho\)) %
Acceptable loss (\(\lambda\)) %

Outputs
Settled amount (S) -
Balance outstanding (B) -
Payouts made (P) -
Remaining money (R) -
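
The same calculation can be sketched in Python. The function name `ponzi_state` and the example inputs (1,000 units invested, a 10% return, a 20% acceptable loss) are assumptions made for this illustration only.

```python
def ponzi_state(total_invested, rho, lam):
    """Compute the state of the fund from N, rho and lambda.

    rho and lam are fractions (0.10 means 10%).
    Returns a dict with the settled amount S, balance outstanding B,
    payouts made P and remaining money R.
    """
    if rho + lam <= 0:
        raise ValueError("rho + lam must be positive")
    settled = total_invested * lam / (lam + rho)   # S = N * lambda / (lambda + rho)
    balance = total_invested - settled             # B = N - S
    payouts = settled * (1 + rho)                  # P = S * (1 + rho)
    remaining = total_invested - payouts           # R = N - P
    return {"S": settled, "B": balance, "P": payouts, "R": remaining}


# Illustrative inputs only.
state = ponzi_state(1000, rho=0.10, lam=0.20)
for name, value in state.items():
    print(f"{name} = {value:.2f}")
```

With these inputs, about two thirds of the money invested has been settled, and the remaining money is just enough to repay the outstanding balance with the 20% loss imposed.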

With such a scheme, any investor would be allowed to demand an immediate repayment of funds at any time, though in that case a loss of \(\lambda\) would be imposed. In other words, for each unit of currency invested, \(1-\lambda\) would be returned.

One problem for early investors is that they don't know how long they will have to wait until they are repaid their investment with the positive return (\(\rho\)). But it is not just a question of when they'll be paid: it is uncertain whether they'll be paid at all. When later investors don't arrive, the early investors just wait and wait, until they eventually give up and request a return of funds. In that case they'll have to accept a loss.

It is also worth noting that the model above ignores any administrative fees that may be imposed on funds on the way in and/or on the way out.

Tuesday, October 19, 2021

Plotting dates

Suppose you're plotting some attribute against time and so, on the horizontal axis you have years or dates. You might produce a graph such as the following which is of a currency exchange rate:

However one problem is that a year such as 2021 is an interval of time, indeed it is a full year, but on the graph it is presented as a point in time. When an interval of time is presented as a point in time, it reduces the clarity of the graph.
We could improve matters by displaying years as intervals. For example look at the following graph showing 5 years of data:

Sometimes we show graphs of data where the total interval of time is shorter. The graph below shows 12 months' worth of data:

Again we have a familiar problem. A month is an interval in time, but on the graph above it is presented as a point in time.
Similarly, we can be a bit more disciplined about showing an interval of time as an interval, rather than a point. For example we could have the following graph:

The aim is to be both precise and to be clear.
We often have a similar issue when we have just a few days of data. In that case we need to remember that a date represents a day, which is 24 hours long. It is not a point in time!
There are a few caveats:
If we're showing many years of data, then relatively speaking, one year is short, so it is approximately a point. Similarly if we have hundreds of days of data, then one day is almost just a point.
There are also cases when the data is aggregated over a period. We might show monthly rainfall. In that case it may make more sense to display the month as a point, since it represents one data point and not a segment of the graph.
On the other hand if we are showing data within one day, then a time such as 10am is indeed just a point, it is not an interval. 10am does not last for one hour, it is just a point in time. In that respect it is unlike a date, which represents a full day. A date is not a point in time, it is an interval of 24 hours.
But overall, the moral of the story is: if you're using a time interval (year, month, day) as a label on a graph, then it should be displayed as a time interval and not a point in time.
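
One way to put this into practice is to compute, for each month, the start and end of its interval and a midpoint where the label can be centred. The sketch below uses only the Python standard library; the function name `month_intervals` is an assumption for this illustration, and how the boundaries and midpoints are passed to a plotting library is left open.

```python
from datetime import date


def month_intervals(start_year, start_month, count):
    """Return (label, start, end, midpoint) for `count` consecutive months.

    `start` is the first day of the month, `end` is the first day of the
    following month, and `midpoint` is the date halfway between them.
    """
    intervals = []
    year, month = start_year, start_month
    for _ in range(count):
        start = date(year, month, 1)
        # Step to the first day of the next month, which ends this interval.
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1
        end = date(year, month, 1)
        # Date arithmetic drops any fractional day, so this is a whole date.
        midpoint = start + (end - start) / 2
        intervals.append((start.strftime("%Y-%m"), start, end, midpoint))
    return intervals


for label, start, end, midpoint in month_intervals(2021, 11, 3):
    print(label, start, end, midpoint)
```

The boundaries can serve as tick marks on the horizontal axis, with each label placed at the midpoint so that it sits over the interval rather than at a single point.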

Friday, September 17, 2021

Intuitions about exponential growth

Suppose you fold over a sheet of paper. You'll then have something twice as thick as the original. Fold it again and you'll have something 4 sheets thick. With each fold the thickness doubles. This is exponential growth and, unlike linear growth, it accelerates.
Well, what if we start with paper 0.001 meters thick? How thick would it be if it were folded 100 times? It turns out the answer is a thickness which is a bit more than the diameter of the observable universe. When written in meters, the number has 28 digits (before the decimal point).

Initially that may seem rather counterintuitive. However, with a little bit of training, it can become intuitive. We actually do have lots of experience in dealing with this kind of dramatic exponential growth. Think about how we represent regular integers. Adding a zero will increase the number by a factor of 10. That's an example of exponential growth. The value of the number increases exponentially with the number of digits. We all know that a number with 4 digits, say 1,000, is much, much smaller than a number with 8 digits, say 10,000,000. In this case we have doubled the number of digits and the value has increased by a factor of 10,000.
So, the moral of the story is that our experience with representing numbers using the standard decimal digits gives us a good understanding of exponential growth.

Coming back to our folding example, we note that 3 foldings cause 3 doublings, which increases the thickness by a factor of 8, which is pretty close to 10. So as a very rough rule of thumb, we'll almost be adding a new digit to the thickness after every three foldings.
If you want to be a bit more precise, then note that 2 to the power of 10 is 1,024, which is close to 1,000. So ten doublings (2 to the power of ten) cause an increase by a factor of just over 1,000, which means we add 3 digits. When we have 100 doublings, that is 10 times 10 doublings: \[2^{100} = (2^{10})^{10} = (1,024)^{10} \approx (10^3)^{10} = 10^{30} \] We started with something \(0.001 m\) thick and ended with something approximately \(10^{27} m\) thick.
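
The exact figure can be checked with a short Python sketch, since Python integers have arbitrary precision. The function name `thickness_after_folds` is an assumption for this illustration; the starting thickness of 1 mm (0.001 meters) comes from the example above.

```python
def thickness_after_folds(folds, initial_mm=1):
    """Return the thickness in whole meters (rounded down) after `folds` doublings.

    The calculation is done in millimeters with exact integers,
    then converted to meters by integer division.
    """
    thickness_mm = initial_mm * 2 ** folds
    return thickness_mm // 1000


meters = thickness_after_folds(100)
print(meters)            # thickness in meters, rounded down
print(len(str(meters)))  # number of digits before the decimal point
```

The second line of output is 28, matching the digit count quoted earlier.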

Is exponential growth always very fast? Actually no. It depends on how long it takes to double. Suppose you are paid an annually compounding interest rate of 0.1% on your deposits in a bank. That is a form of exponential growth. But it will take about 693 years for your money to double. That's the bad news. The good news is that if you start with one euro, it will grow to over a billion after it has doubled 30 times. \[ 2^{30} = (2^{10})^3 = (1,024)^3 \approx (10^3)^3 = 10^9\] But that will take more than 20,000 years.
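
The doubling time quoted above can be reproduced with a small Python sketch. It solves \((1+r)^t = 2\) for t, giving \(t = \ln 2 / \ln(1+r)\). The function name `doubling_time` is an assumption for this illustration.

```python
import math


def doubling_time(rate):
    """Return the number of compounding periods needed to double at `rate`.

    `rate` is the growth per period as a fraction (0.001 means 0.1%).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return math.log(2) / math.log(1 + rate)


years_to_double = doubling_time(0.001)
print(f"Years to double once:   {years_to_double:.1f}")
print(f"Years to double 30 times: {30 * years_to_double:.0f}")
```

At 0.1% per year this gives a doubling time of a little over 693 years, so 30 doublings take more than 20,000 years.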