Suppose we have a set of data on which we wish to train a machine learning algorithm, and the data keeps growing and growing. We'd like an efficient algorithm that puts more weight on the recent results, and we would like it to be straightforward to keep up to date, even if we have hundreds of millions of rows of data. What can we do?
Well, in this post I will suggest an algorithm that is very efficient at dealing with large data sets. It can learn from old data, but it doesn't need the old data to be stored.
Let's start with the basics. Suppose we have a set of T input vectors \(\underline{x}^t \in \mathbb{R}^n\) with
\[1 \le t \le T\]
and for each \(\underline{x}^t\) there is a resultant \(y^t \in \mathbb{R} \).
We want to find the optimal vector \(\underline{\theta}\) so that \( \underline{\theta } . \underline{x}^t \) is a good predictor of \( y^t \).
We choose the \(\underline{\theta}\) which minimises the cost function:
\[
J( \underline{\theta} ) = \frac{1}{2T} \sum^{T}_{t=1} ( \underline{\theta} . \underline{x}^t -y^t)^2
\]
We could expand the inner product \( \underline{\theta} . \underline{x}^t \), which would give us:
\[
J( \underline{\theta} ) = \frac{1}{2T} \sum^{T}_{t=1}
\big[
\sum_{j=1}^n ( \theta_j x^t_j ) -y^t
\big]^2
\]
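As a quick sanity check of the notation, here is a small numpy sketch that evaluates this cost directly. The names X and y are mine: the rows of X play the role of the \( \underline{x}^t \) and y holds the \( y^t \).

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2T) * sum_t (theta . x^t - y^t)^2, where X has shape (T, n)."""
    residuals = X @ theta - y          # theta . x^t - y^t for every row t
    return (residuals ** 2).sum() / (2 * len(y))
```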
When we have an optimal \( \underline{\theta} \), the partial derivatives of J will be zero:
\[
0 = \frac{\partial J}{ \partial \theta_i}
= \frac{1}{T} \sum^{T}_{t=1} x_i^t
\big[
\sum_{j=1}^n ( \theta_j x^t_j ) -y^t
\big]
\]
We can change the order of the summation and do some rearranging to obtain:
\[
0 = \sum_{j=1}^n \big[ \frac{1}{T} \sum_{t=1}^T ( x^t_i x^t_j ) \theta_j \big]
- \frac{1}{T} \sum_{t=1}^T ( y^t x_i^t)
\hspace{24 mm} (eqn 1)
\]
Now, if we define the matrix \(\underline{A}\):
\[
A_{ij} = \frac{1}{T} \sum_{t=1}^T ( x^t_i x^t_j ) \hspace{24 mm} (eqn 2)
\]
and the vector \(\underline{B}\):
\[
B_i = \frac{1}{T} \sum_{t=1}^T ( y^t x_i^t) \hspace{24 mm} (eqn 3)
\]
then, plugging those into equation 1, we find:
\[
0 = \sum_{j=1}^n \big[ A_{ij} \theta_j \big] - B_i
\]
If we can invert the matrix \(\underline{A}\), then we can obtain \( \underline{\theta} \):
\[
\underline{\theta} = \underline{A}^{-1} \underline{B}
\]
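To make this concrete, here is a rough numpy sketch of the batch calculation: build \( \underline{A} \) and \( \underline{B} \) from equations 2 and 3 and then solve for \( \underline{\theta} \). The synthetic data and variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1000, 3
X = rng.normal(size=(T, n))                      # row t is the vector x^t
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=T)    # noisy targets y^t

A = (X.T @ X) / T               # A_ij = (1/T) sum_t x^t_i x^t_j   (eqn 2)
B = (X.T @ y) / T               # B_i  = (1/T) sum_t y^t x^t_i     (eqn 3)
theta = np.linalg.solve(A, B)   # same as A^{-1} B, but numerically kinder
```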
If we look back at equation 2, we see that each vector \( \underline{x}^t\) effectively has the same weighting.
We could rewrite equations 2 and 3 as:
\[
A_{ij} = \sum_{t=1}^T ( x^t_i x^t_j \omega_t) \hspace{24 mm} (eqn 4)
\]
and
\[
B_i = \sum_{t=1}^T ( y^t x^t_i \omega_t) \hspace{24 mm} (eqn 5)
\]
where
\[
\omega_t = \frac{1}{T} \hspace{24 mm} \forall t: 1 \le t \le T
\]
However, suppose we wanted more recent values of \( \underline{x}^t\) to have higher weighting.
We could fix the ratio of consecutive weights:
\[
\frac{\omega_{t+1}}{\omega_t} = 1 + \lambda
\]
with \( \lambda > 0 \), so we would have:
\[
\omega_t = \frac{\omega_T} { (1+\lambda)^{T-t}}
\]
Suppose we want to find how far back we must go before the weight drops to half the weight of the most recently added entry; then we would seek \(\tau\) such that
\[
\frac{\omega_T}{\omega_{T - \tau}} = 2
\]
which implies:
\[
(1 + \lambda)^{\tau} = 2
\]
so:
\[
\lambda = 2^{1/ \tau } -1 \hspace{24 mm} (eqn 6)
\]
So, if we want the most recent data to have double the weighting of a point 1,000 rows back,
then we would set \(\tau = 1000 \) and use equation 6 to determine \( \lambda \).
The parameter \( \lambda \) determines how much bigger the weights of recent data will be compared with those of older data. When choosing it, it may be helpful to first choose \( \tau \) (which is something like a half-life) and then use equation 6 to evaluate \( \lambda \).
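In code, equation 6 is a one-liner (a small sketch; the function name is my own):

```python
def lambda_from_tau(tau):
    """Eqn 6: the lambda that halves the weight every tau rows."""
    return 2.0 ** (1.0 / tau) - 1.0

lam = lambda_from_tau(1000)   # roughly 0.000693
# a row tau steps older carries half the weight of the newest row,
# i.e. (1 + lam) ** 1000 equals 2 up to floating point error
```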
If we wish to preserve the condition that the sum of the weights is 1,
then we can do the geometric sum and after a little algebra we find that:
\[
\omega_T = \frac{\lambda (1+\lambda)^{T-1}}{(1+\lambda)^T-1} \hspace{24 mm} (eqn 7)
\]
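Here is a quick numerical check that the weights behave as claimed, using equation 7 for the newest weight and equation 6 for \( \lambda \). The variable names are mine.

```python
import numpy as np

T, tau = 50, 10
lam = 2.0 ** (1.0 / tau) - 1.0                            # eqn 6
w_T = lam * (1 + lam) ** (T - 1) / ((1 + lam) ** T - 1)   # eqn 7
t = np.arange(1, T + 1)
w = w_T / (1 + lam) ** (T - t)                  # weights for rows 1..T
assert np.isclose(w.sum(), 1.0)                 # they sum to one
assert np.isclose(w[-1] / w[T - 1 - tau], 2.0)  # half-life of tau rows
```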
For a given set of \(\underline{x}^t \) and \(y^t\) with \(1 \le t \le T \)
we can evaluate \(\underline{A} \) and \( \underline{B}\).
Since they were generated with T rows of data, we could label them
\(\underline{A}^T \) and \( \underline{B}^T\).
Now suppose we have already evaluated \(\underline{A}^{(T-1)} \) and \( \underline{B}^{(T-1)}\)
with T-1 rows of data, and we want to introduce one more;
then we find:
\[
A^T_{ij} = \omega_T x^T_i x^T_j + ( 1 - \omega_T ) A^{(T-1)}_{ij} \hspace{24 mm} (eqn 8)
\]
and
\[
B^T_i = \omega_T y^T x^T_i + ( 1 - \omega_T ) B^{(T-1)}_i \hspace{24 mm} (eqn 9)
\]
where \( \omega_T \) has been defined in equation 7.
So when we have evaluated \( \underline{A} \) and \( \underline{B} \) for a given set of data
and then later want to include the contribution from a new \( \underline{x}^T \) and \( y^T \), we can amend the existing \( \underline{A} \) and \( \underline{B} \)
using equations 8 and 9, without the need to retrieve all the past \( \underline{x}^t \) and \( y^t \).
As a result, when data ( \( \underline{x}^T \) and \( y^T \) ) comes in, we can use it to update
\( \underline{A} \) and \( \underline{B} \) and then we can discard the \( \underline{x}^T \) and \( y^T \).
They have made their contribution to \( \underline{A} \) and \( \underline{B} \), and we can continually update \( \underline{A} \) and \( \underline{B} \) without the need to look back at old values of \( \underline{x}^t \) and \( y^t \). When T is very large, say in the hundreds of millions, using this algorithm to continually update the machine learning results is rather efficient, since we don't need to keep retrieving the old data. We just keep \( \underline{A} \) and \( \underline{B} \) up to date.
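To round things off, here is a minimal numpy sketch of how the whole scheme might look in practice. The class and method names are my own invention; it simply implements equations 6 to 9, folding each new row into \( \underline{A} \) and \( \underline{B} \) and then forgetting it.

```python
import numpy as np

class ExpWeightedLinearFit:
    """Keeps running A and B (eqns 8 and 9); old rows are never stored."""

    def __init__(self, n, tau):
        self.lam = 2.0 ** (1.0 / tau) - 1.0   # eqn 6
        self.T = 0                            # rows seen so far
        self.A = np.zeros((n, n))
        self.B = np.zeros(n)

    def update(self, x, y):
        """Fold one new row (x, y) into A and B, then discard it."""
        x = np.asarray(x, dtype=float)
        self.T += 1
        lam, T = self.lam, self.T
        # eqn 7, rearranged so (1 + lam)**T never overflows for huge T;
        # this equals 1 when T == 1 and tends to lam / (1 + lam) as T grows
        w = lam / ((1 + lam) * (1 - (1 + lam) ** (-T)))
        self.A = w * np.outer(x, x) + (1 - w) * self.A   # eqn 8
        self.B = w * y * x + (1 - w) * self.B            # eqn 9

    def theta(self):
        """Solve A theta = B rather than inverting A explicitly."""
        return np.linalg.solve(self.A, self.B)
```

Each call to update costs a fixed amount of work and memory, of order \( n^2 \), no matter how many rows have already been seen, which is what makes the scheme attractive for very long streams of data.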