Cover's theorem is a statement in computational learning theory and is one of the primary theoretical motivations for the use of non-linear kernel methods in machine learning applications. It is so termed after the information theorist Thomas M. Cover who stated it in 1965, referring to it as counting function theorem.

The theorem expresses the number of homogeneously linearly separable sets of

N

{\displaystyle N}

points in

D

{\displaystyle D}

dimensions as an explicit counting function

C

(

N

,

D

)

{\displaystyle C(N,D)}

of the number of points

N

{\displaystyle N}

and the dimensionality

D

{\displaystyle D}

.

It requires, as a necessary and sufficient condition, that the points are in general position. Simply put, this means that the points should be as linearly independent (non-aligned) as possible. This condition is satisfied "with probability 1" or almost surely for random point sets, while it may easily be violated for real data, since these are often structured along smaller-dimensionality manifolds within the data space.

The function

C

(

N

,

D

)

{\displaystyle C(N,D)}

follows two different regimes depending on the relationship between

N

{\displaystyle N}

and

D

{\displaystyle D}

.

A consequence of the theorem is that given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some non-linear transformation, or:

A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

The proof of Cover's counting function theorem can be obtained from the recursive relation

C

(

N

+

1

,

D

)

=

C

(

N

,

D

)

+

C

(

N

,

D

−

1

)

.

{\displaystyle C(N+1,D)=C(N,D)+C(N,D-1).}

To show that, with fixed

N

{\displaystyle N}

, increasing

D

{\displaystyle D}

may turn a set of points from non-separable to separable, a deterministic mapping may be used: suppose there are

N

{\displaystyle N}

points. Lift them onto the vertices of the simplex in the

N

−

1

{\displaystyle N-1}

dimensional real space. Since every partition of the samples into two sets is separable by a linear separator, the property follows.

This statistics-related article is a stub. You can help Wikipedia by expanding it.