Noise robustness remains a challenging problem in on-device keyword spotting. Using multiple microphones can improve it, but doing so inevitably increases computational complexity and tends to require more memory. In this paper, we propose a new neural-network-based architecture that takes multiple microphone signals as inputs. It achieves better accuracy while incurring only a minimal increase in model size. Compared with a single-channel baseline that runs in parallel on each channel, the proposed architecture reduces the false reject (FR) rate by a relative 36.3\% and 46.4\% on dual-microphone clean and noisy test sets, respectively, at an operating point of 0.1 false accepts (FA) per hour.
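
To make the reported metric concrete, the following minimal sketch shows how a relative FR reduction and an FA-per-hour rate are commonly computed. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not taken from the experiments, and the exact evaluation procedure used in this work may differ.

```python
def relative_fr_reduction(fr_baseline: float, fr_proposed: float) -> float:
    """Relative reduction in false reject rate, as a fraction of the baseline rate."""
    if fr_baseline <= 0:
        raise ValueError("baseline FR rate must be positive")
    return (fr_baseline - fr_proposed) / fr_baseline


def false_accepts_per_hour(num_false_accepts: int, audio_seconds: float) -> float:
    """False accepts normalized by the hours of non-keyword audio evaluated."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return num_false_accepts / (audio_seconds / 3600.0)


# Hypothetical inputs, not results from the paper:
# a baseline FR rate of 10% and a proposed-model FR rate of 6%.
reduction = relative_fr_reduction(0.10, 0.06)
# Hypothetical: 5 false accepts observed over 50 hours of negative audio.
fa_rate = false_accepts_per_hour(5, 50 * 3600)
print(f"relative FR reduction: {reduction:.1%}, FA/hour: {fa_rate:.2f}")
```

Both FR rates in such a comparison are read off at the same FA-per-hour operating point, so the relative reduction isolates the change in missed detections at a fixed false-alarm budget.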