Community Detection in the Heterogeneous Stochastic Block Model
Abstract
We consider the Heterogeneous Stochastic Block
Model (HeSBM) consisting of two unknown
planted communities C1 and C2 . Each node u
has two (unknown) parameters pu and qu with
pu > qu , describing the probability of having a
directed edges with nodes from the same com-
munity and with nodes from the other commu-
nity, respectively. This model allows for almost
arbitrary degree distributions.
We present a simple, linear time algorithm
that recovers the communities whenever
(pu −qu)/sqrt(pu) = Ω(log n/n). We also show
that this condition is necessary. Note that this
recovery threshold is the same as from the
standard Stochastic Block Model, in which all
nodes have the same pu and qu .
Our algorithm is based on local greedy improvement,
where nodes are moved into the part where
it has more edges. We show how to implement
those moves in parallel, such that each vertex
needs to be considered at most twice. We believe
that this algorithm can be easily implemented in
parallel, making it highly practical; and the theoretical
guarantee we demonstrate show a desirable robustness
to perturbations in the degree distribution
Model (HeSBM) consisting of two unknown
planted communities C1 and C2 . Each node u
has two (unknown) parameters pu and qu with
pu > qu , describing the probability of having a
directed edges with nodes from the same com-
munity and with nodes from the other commu-
nity, respectively. This model allows for almost
arbitrary degree distributions.
We present a simple, linear time algorithm
that recovers the communities whenever
(pu −qu)/sqrt(pu) = Ω(log n/n). We also show
that this condition is necessary. Note that this
recovery threshold is the same as from the
standard Stochastic Block Model, in which all
nodes have the same pu and qu .
Our algorithm is based on local greedy improvement,
where nodes are moved into the part where
it has more edges. We show how to implement
those moves in parallel, such that each vertex
needs to be considered at most twice. We believe
that this algorithm can be easily implemented in
parallel, making it highly practical; and the theoretical
guarantee we demonstrate show a desirable robustness
to perturbations in the degree distribution