Generalization and Learnability in Multiple Instance Regression
Abstract
Multiple instance regression (MIR), introduced by Ray & Page (2001) as an analogue of multiple
instance learning (MIL), is a setting in which we are given bags of feature vectors (instances), and each bag carries a bag-label that matches the label of one (unknown) primary instance in that bag. The goal is to compute a hypothesis regressor consistent with the underlying instance labels. A natural approach is to jointly find the best primary-instance assignment and the regressor optimizing (say) the MSE loss on the bags, though no formal generalization guarantees were previously known. Our work is the first to prove generalization error bounds for MIR when the bags are drawn i.i.d. at random: essentially, with high probability, any MIR regressor with low error on the sampled bags also has low error on the underlying instance-label distribution. We next study the complexity of linear regression on MIR bags, shown to be NP-hard in general by Ray & Page (2001), who however left open the possibility of arbitrarily good approximations. Significantly strengthening previous work, we prove a strong inapproximability bound: even if there exists a zero-loss MIR linear regressor on a collection of 2-sized bags with labels in [−1, 1], it is NP-hard to find an MIR linear regressor with loss < C for some absolute constant C > 0.
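For concreteness, the natural bag-level objective above can be written as the following joint minimization (a minimal sketch; the notation $m$, $B_i$, $y_i$, and the selector $\sigma$ is ours, not fixed by the text):
\[
\min_{f \in \mathcal{F}} \;\; \min_{\sigma :\, \sigma(i) \in B_i} \;\; \frac{1}{m} \sum_{i=1}^{m} \bigl( f(x_{\sigma(i)}) - y_i \bigr)^2 ,
\]
where the $i$-th bag consists of the instances $\{x_j : j \in B_i\}$ with bag-label $y_i$, and $\sigma$ picks one primary instance per bag.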
Our work also proposes two novel model training methods on MIR bags based on (i) a weighted assignment loss and (ii) EM pseudo-labeling, handling the case of overlapping bags, which has not previously been studied. We conduct extensive empirical evaluations on synthetic and real-world datasets, showing that our methods outperform baseline MIR methods.
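The following is a minimal illustrative sketch of an EM-style pseudo-labeling loop for MIR, not the paper's exact algorithm: the linear model, squared loss, and hard E-step assignment are all our assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

def em_pseudo_label_mir(bags, labels, n_iters=20, seed=0):
    """EM-style pseudo-labeling for MIR (illustrative sketch).

    bags   : list of (k_i, d) arrays, one array of instances per bag
    labels : (m,) array of bag-labels
    """
    rng = np.random.default_rng(seed)
    # Initialize by picking a random primary instance in each bag.
    picks = [int(rng.integers(len(bag))) for bag in bags]
    model = LinearRegression()
    for _ in range(n_iters):
        # M-step: fit the regressor on the currently selected instances,
        # treating each bag-label as the pseudo-label of its chosen instance.
        X = np.vstack([bag[j] for bag, j in zip(bags, picks)])
        y = np.asarray(labels)
        model.fit(X, y)
        # E-step (hard assignment): re-select, in each bag, the instance
        # whose prediction is closest to the bag-label.
        new_picks = [int(np.argmin((model.predict(bag) - yi) ** 2))
                     for bag, yi in zip(bags, labels)]
        if new_picks == picks:  # assignments stable: converged
            break
        picks = new_picks
    return model, picks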