Short and Deep: Sketching and Neural Networks
Abstract
Data-independent methods for dimensionality reduction such as random projections,
sketches, and feature hashing have become increasingly popular in recent
years. These methods often seek to reduce dimensionality while preserving the
hypothesis class, resulting in inherent lower bounds on the size of projected data.
For example, preserving linear separability requires Ω(1/γ²) dimensions, where γ
is the margin, and in the case of polynomial functions, the number of required dimensions
has an exponential dependence on the polynomial degree. Despite these
limitations, we show that the dimensionality can be reduced further while maintaining
performance guarantees, using improper learning with a slightly larger
hypothesis class. In particular, we show that any sparse polynomial function of a
sparse binary vector can be computed from a compact sketch by a single-layer neural
network, where the sketch size has a logarithmic dependence on the polynomial
degree. A practical consequence is that networks trained on sketched data are
compact, and therefore suitable for settings with memory and power constraints.
We empirically show that our approach leads to networks with fewer parameters
than related methods such as feature hashing, at equal or better performance.
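To make the setup concrete, the following is a minimal illustrative sketch (not the paper's implementation): a count-sketch-style random projection compresses a sparse binary vector into a short sketch, which is then fed to a one-hidden-layer network. All dimensions, names, and the NumPy implementation below are our own assumptions for illustration.

```python
# Illustrative only: sketch a sparse binary vector with a count-sketch-style
# projection, then feed the short sketch to a one-hidden-layer network.
# Dimensions and names are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

d = 10_000   # original (sparse, binary) input dimension
m = 64       # sketch size, much smaller than d
k = 128      # hidden units in the single hidden layer

# Count-sketch parameters: a random bucket and a random sign per coordinate.
bucket = rng.integers(0, m, size=d)
sign = rng.choice([-1.0, 1.0], size=d)

def sketch(active_indices):
    """Sketch a sparse binary vector given by its set of nonzero indices."""
    s = np.zeros(m)
    for i in active_indices:
        s[bucket[i]] += sign[i]
    return s

# A sparse binary input: a handful of active features out of d.
active = rng.choice(d, size=20, replace=False)
s = sketch(active)

# Single-hidden-layer network operating on the m-dimensional sketch.
W1 = rng.normal(scale=0.1, size=(k, m))
b1 = np.zeros(k)
w2 = rng.normal(scale=0.1, size=k)

hidden = np.maximum(W1 @ s + b1, 0.0)  # ReLU hidden layer
output = w2 @ hidden                   # scalar prediction
print(output)
```

Note that the network's parameter count scales with the sketch size m rather than the original dimension d, which is the source of the compactness discussed above.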