Inductive Regression

Purpose of Web Page
This page deals with the problem of learning from values that range over a continuum. In other words, this page describes Regression Analysis, but from an inductive viewpoint.

Summary
Inductive Inference gives a framework for predicting future events based on past history. However, there are particular problems when dealing with Real Numbers. Any particular real value has zero probability, and it requires an infinite amount of information to represent most real numbers (the Irrational Numbers).

You would never see a real number in the data history. The Real Numbers are inferred from approximations based on Rational Numbers.

Bayes' theorem may be applied to continuous random variables (Bayes' Theorem for Probability Distributions) to infer probability distributions for events based on past history. Unfortunately this theorem needs prior distributions.

As for Inductive Inference, Information Theory gives us a basis for choosing prior distributions.

The inclusion of the probabilities of the models is necessary to avoid over-fitting.
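As a minimal sketch of this point, the example below compares a hypothetical zero-parameter model of coin flips with a one-parameter model. The data, priors, and model names are illustrative only, but they show how averaging over a parameter's prior penalises the extra flexibility that leads to over-fitting.

```python
import math

# Hypothetical example: 12 heads in 20 flips.
# Model A: fair coin, no free parameters.
# Model B: unknown bias t with a uniform prior on [0, 1].
n, h = 20, 12

# Evidence for A: the binomial likelihood at t = 0.5.
evidence_a = math.comb(n, h) * 0.5 ** n

# Evidence for B: the integral of C(n,h) t^h (1-t)^(n-h) dt over [0, 1]
# = C(n,h) * Beta(h+1, n-h+1) = 1 / (n + 1) exactly.
evidence_b = 1.0 / (n + 1)

# With equal model priors the posterior odds equal the evidence ratio.
# B fits the data better at its best t, yet averaging over all t
# (the parameter's prior) penalises the extra flexibility.
print(evidence_a, evidence_b, evidence_a > evidence_b)
```

Here the simpler model wins for mildly unbalanced data, even though the one-parameter model can always achieve a higher maximum likelihood.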

Inductive Regression
Inductive Regression attempts to estimate the probability of continuous variables (Real Numbers) based on past history.

Bayes theorem for Continuous Variables
From the alternate form of Bayes' theorem,


 * $$P(M_i|D) = \frac{P(D | M_i)\, P(M_i)}{P(D)} = \frac{P(D | M_i)\, P(M_i)}{\sum_j P(D|M_j)\,P(M_j)}  \!$$.

where j indexes over all models.
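This alternate form can be sketched for a small, finite family of models. The model names, priors, and likelihoods below are illustrative numbers only:

```python
# A minimal sketch of the alternate form of Bayes' theorem for a
# finite, hypothetical set of models.
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}          # P(M_j)
likelihoods = {"M1": 0.10, "M2": 0.40, "M3": 0.05}  # P(D | M_j)

# P(D) = sum_j P(D | M_j) P(M_j)
p_d = sum(likelihoods[m] * priors[m] for m in priors)

# P(M_i | D) = P(D | M_i) P(M_i) / P(D)
posteriors = {m: likelihoods[m] * priors[m] / p_d for m in priors}

print(posteriors)  # the posteriors sum to 1
```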

Now consider the case where the models are parameterised by constants. We may consider parameters that can be represented with a finite amount of data to be part of the code. However, this approach breaks down when dealing with real numbers.

A real number is a very strange thing. Unless there is a special way of representing it (e, π, a rational number, a surd), a single real number contains an infinite amount of information; almost all real numbers are like this.

In order to deal with real numbers probability theory needs to be extended to use probability measures. A measure is a simple idea with a subtle twist. Length, Area, and Volume are all measures. The subtle thing is that a measure is a property of a set of points, that does not directly depend on the number of points in the set.

You will never see a real number in a data set. You will see only approximations. Normally an instrument will record a number to a certain number of digits. A real number is then an idea about the underlying universe that is inferred rather than seen directly.
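A minimal sketch of this idea, assuming a hypothetical instrument that records a fixed number of decimal digits (the function names are illustrative, not from any library):

```python
# Sketch: an instrument records a reading to a fixed number of digits,
# so the data contain only rationals; the real value is inferred to lie
# in a small interval around the recorded approximation.
def record(x, digits):
    """Round a reading to `digits` decimal places (a rational number)."""
    return round(x, digits)

def interval(reading, digits):
    """Interval of real values consistent with the recorded reading."""
    half = 0.5 * 10 ** (-digits)
    return (reading - half, reading + half)

reading = record(3.14159265, 3)  # 3.142
lo, hi = interval(reading, 3)    # ≈ (3.1415, 3.1425)
print(reading, lo, hi)
```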

Reslicing the Pie
To deal with real numbers we will re-index the models so that the parameters to a model are used to identify the model,
 * $$\{M_k : k\in K\} = \{m_i(v_i) : i\in W \text{ and } v_i\in V_i\} \!$$

where,
 * $$K$$ is the index set of all models. Note that we have implicitly made this set uncountable.
 * $$W$$ is the index set of the parameterised models (with the real numbers removed and turned into parameters).
 * $$V_i$$ is a vector space of real numbers which parameterise $$m_i$$.

Taking limits
Now we might like to plug our new list of models into the equation above but this is not possible. The probability $$P(D | m_i(v_i))$$ would be zero. The standard approach to take in this case is to divide the vector space up into areas of finite size, and consider the probability of the area. Then take the limit as the size d → 0.

A limit allows us to consider a number that is as small as we need it to be, but not zero. We can think of a limit as an approximation that we can make as accurate as we need.

Partition of Vector Space
In a small region the probability is the probability density function times the size of the region. To proceed we need to divide each vector space $$V_i$$ up into equal sized pieces (lengths, squares, cubes, etcetera, according to the dimension of the vector space). The dimension of the vector space is the number of real parameters to the model $$m_i$$. Dividing the vector space up in this way is a Partition of the vector space.

For simplicity here we will assume the existence of the probability density function and that it has an integral. A more rigorous proof would examine these issues.

Each of the equal sized pieces of $$V_i$$ is named $$R_i(n_i)$$. The pieces are numbered by the vector of integers $$n_i$$ where $$n_i\in N_i$$ and $$N_i$$ is a vector space of integers.

We can define the partition by a condition on each dimension of the vector. For each dimension $$k_i$$ in $$Dimensions(V_i)$$,
 * $$R_{i,k_i}(n_{i,k_i}) = \{x : n_{i,k_i}\,dr \le x < (n_{i,k_i}+1)\,dr\}\!$$

the size of $$R_i(n_i)$$ is,
 * $$d = dr^{Dimension(V_i)}\!$$
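The partition can be sketched directly. The helpers below are illustrative (not from any library): one maps a parameter vector to the integer index vector $$n_i$$ of its cell, and the other computes the cell measure $$d$$.

```python
import math

def cell_index(v, dr):
    """Integer index vector n_i of the cell R_i(n_i) containing v."""
    return tuple(math.floor(x / dr) for x in v)

def cell_size(dim, dr):
    """d = dr ** Dimension(V_i): the measure of one cell."""
    return dr ** dim

dr = 0.1
v = (0.23, -0.07)            # a point in a 2-parameter space
print(cell_index(v, dr))     # -> (2, -1)
print(cell_size(len(v), dr)) # ≈ 0.01
```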

Probability Density Function
Firstly let,
 * $$v_i = n_i\,dr\!$$

This multiplication of a vector by a scalar means that, for each $$k_i \in Dimensions(V_i)$$,
 * $$v_{i,k_i} = n_{i,k_i}\,dr\!$$

The probability density function $$p$$ is then,


 * $$P(D | \{m_i(w_i) : w_i\in R_i(n_i)\}) = p(D | m_i(v_i))\, d$$

The probability density function $$p$$ is the amount of probability per unit of the vector space.
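As a sketch of why this approximation is sound, take a standard normal density as a hypothetical stand-in for $$p$$ and compare the exact mass of a small region (from the distribution function) with the density times the region size:

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

x, d = 0.3, 0.001
exact = cdf(x + d) - cdf(x)  # P(x <= X < x + d)
approx = pdf(x) * d          # density times region size
print(exact, approx)         # the two agree closely
```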

Putting it all Together

 * $$\lim_{d\to 0}\left(\ p(m_i|D, v_i)\,d = \frac{p(D | m_i(v_i))\,d\ \ p(m_i(v_i))\,d}{\sum_j \sum_{w_j\in V_j} p(D|m_j(w_j))\,d\ \ p(m_j(w_j))\,d}\ \right) \!$$

or,
 * $$p(m_i|D, v_i) = \frac{p(D | m_i(v_i))\ p(m_i(v_i))}{\sum_j \lim_{d\to 0}\left(\sum_{w_j\in V_j} p(D|m_j(w_j))\ p(m_j(w_j))\,d\right)} \!$$

but,
 * $$\lim_{d\to 0}(\sum_{w_j\in V_j} p(D|m_j(w_j))\ p(m_j(w_j))\ d) = \int_{w_j\in V_j} p(D|m_j(w_j))\ p(m_j(w_j))\ dw_j\!$$
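This limit is a Riemann sum. A sketch, using $$e^{-w^2}$$ as a hypothetical stand-in for the integrand $$p(D|m_j(w_j))\ p(m_j(w_j))$$:

```python
import math

def f(w):
    """Illustrative integrand; its integral over R is sqrt(pi)."""
    return math.exp(-w * w)

def riemann(dr, span=8.0):
    """Sum f over the grid w = n*dr on [-span, span], weighted by dr."""
    n = round(span / dr)
    return sum(f(k * dr) for k in range(-n, n + 1)) * dr

for dr in (0.5, 0.1, 0.01):
    print(dr, riemann(dr))  # each close to sqrt(pi) = 1.77245...
```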

Bayes' Law for Real Parameters
Putting the results together we have,


 * $$p(m_i|D, v_i) = \frac{p(D | m_i(v_i))\ p(m_i(v_i))}{\sum_j \int_{w_j\in V_j} p(D|m_j(w_j))\ p(m_j(w_j)) dw_j} \!$$

We use $$m_i(v_i)$$ in two ways,
 * As an event which identifies a set of outcomes.
 * As a function that gives the probability of $$D$$ given the model $$m_i(v_i)$$.

The second use gives,
 * $$p(D | m_i(v_i)) = m_i(v_i, D)\!$$


 * $$p(m_i|D, v_i) = \frac{m_i(v_i, D)\ p(m_i(v_i))}{\sum_j \int_{w_j\in V_j} m_j(w_j, D)\ p(m_j(w_j))\, dw_j} \!$$
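The whole construction can be sketched numerically for a single hypothetical model: data treated as i.i.d. normal with unknown mean $$v$$, a uniform prior density on a bounded range, and the denominator integral approximated by a Riemann sum. All the numbers are illustrative.

```python
import math

data = [0.8, 1.3, 0.6, 1.1]  # made-up data history

def likelihood(v):
    """m(v, D) = p(D | m(v)) for a normal model with mean v, sd 1."""
    return math.prod(
        math.exp(-(x - v) ** 2 / 2) / math.sqrt(2 * math.pi) for x in data
    )

dr = 0.001
grid = [n * dr for n in range(-5000, 5001)]  # v in [-5, 5]
prior = 1.0 / 10.0                           # uniform density on [-5, 5]

# Denominator: the integral, approximated by a Riemann sum.
evidence = sum(likelihood(v) * prior for v in grid) * dr

def posterior(v):
    """p(m | D, v): the posterior density of the parameter."""
    return likelihood(v) * prior / evidence

# The posterior density peaks near the sample mean.
mean = sum(data) / len(data)  # ≈ 0.95
print(posterior(mean), posterior(mean + 2))
```

The posterior computed this way integrates to one over the grid, by construction of the denominator.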

A-Priori distributions for the model
What then is $$p(m_i(v_i))$$? We can break this down as the probability of the model times the probability of the value $$v_i$$. The probability is related to the message length: to communicate the model, we need to describe the model and then its parameters.
 * $$l(m_i(v_i)) = l(m_i) + l(v_i)\!$$

so,
 * $$p(m_i(v_i)) = P(m_i)\ p(v_i)\!$$
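A tiny sketch of this correspondence: probabilities multiply exactly when message lengths add, because the length in bits of an event of probability p is -log2(p). The numbers below are illustrative.

```python
import math

def length_bits(p):
    """Message length in bits of an event with probability p."""
    return -math.log2(p)

p_model, p_value = 0.25, 0.125  # illustrative P(m_i) and p(v_i)
joint = p_model * p_value

# l(m_i(v_i)) = l(m_i) + l(v_i)  <=>  p(m_i(v_i)) = P(m_i) p(v_i)
print(length_bits(p_model), length_bits(p_value), length_bits(joint))
```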

We might want to use a uniform distribution for $$p(v_i)$$. However, this does not work. A uniform distribution over the whole real line is zero everywhere, and because of the sum over different models (with different numbers of parameters), the limits to zero cannot be made to cancel out.

This problem needs further investigation.

Predictive Models
Some models give the data a higher probability than the probability of the data itself,


 * $$\int_{v_i\in V_i} m_i(v_i, D)\ p(m_i(v_i)) dv_i > P(D)$$

These are the models that provide information about the behaviour of D. They are predictive models.

Other models are unpredictive models. Rather than deal with each unpredictive model separately, we want to group them together and handle them all as the set U. Two sets of indexes, J and K, are created for the predictive and unpredictive models (note that K is reused here: from this point on it indexes only the unpredictive models),


 * $$J = \{j: \int_{v_j\in V_j} m_j(v_j, D)\ p(m_j(v_j)) dv_j > P(D) \}\!$$

is the set of indexes of models that are predictive.


 * $$K = \{k: \int_{v_k\in V_k} m_k(v_k, D)\ p(m_k(v_k))\, dv_k \le P(D) \}\!$$

is the set of indexes of models that are not predictive.
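A sketch of this classification with illustrative numbers (the evidences and $$P(D)$$ below are made up, not computed from a data history):

```python
# Classify hypothetical models as predictive or not by comparing each
# model's evidence (the integral over its parameters) with P(D).
evidences = {"m1": 0.030, "m2": 0.004, "m3": 0.012}  # illustrative integrals
p_data = 0.010                                       # illustrative P(D)

J = {name for name, e in evidences.items() if e > p_data}   # predictive
K = {name for name, e in evidences.items() if e <= p_data}  # unpredictive

print(sorted(J), sorted(K))  # -> ['m1', 'm3'] ['m2']
```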

U is the union of the model events for which no data compression happens on the data history D.


 * $$U = \bigcup_{k\in K} \bigcup_{v_k\in V_k} m_k(v_k)\!$$

Then we need to know,
 * $$P(D, U) = 2^{-l(D)}\!$$
 * $$P(U) = \text{constant}\!$$

Bayes law with all the models that do not compress the data merged together becomes,


 * $$p(m_i(v_i)|D) = \frac{m_i(v_i, D)\ p(m_i(v_i))}{s}\!$$
 * $$P(U|D) = \frac{P(D|U)\ P(U)}{s}\!$$

where
 * $$s = P(D|U)\ P(U) + \sum_{j\in J} \int_{w_j\in V_j} m_j(w_j, D)\ p(m_j(w_j))\, dw_j \!$$

Real Numbers for Data
As stated before, we will never see real numbers in the data set. However, we will see approximations to real-valued data.