# Background: Nonparametric statistical inference

A common task when analyzing networks is to characterize their
structures in simple terms, often by dividing the nodes into modules or
["communities"](https://en.wikipedia.org/wiki/Community_structure).

A principled approach to perform this task is to formulate [generative
models](https://en.wikipedia.org/wiki/Generative_model) that include
the idea of modules in their descriptions, which then can be detected
by [inferring](https://en.wikipedia.org/wiki/Statistical_inference)
the model parameters from data. More precisely, given the partition
$\boldsymbol b = \{b_i\}$ of the network into $B$ groups,
where $b_i\in[0,B-1]$ is the group membership of node $i$,
we define a model that generates a network $\boldsymbol A$ with a
probability

$$
P(\boldsymbol A|\boldsymbol\theta, \boldsymbol b)
$$ (model-likelihood)

where $\boldsymbol\theta$ are additional model parameters that control how the
node partition affects the structure of the network. Therefore, if we observe a
network $\boldsymbol A$, the probability that it was generated by a given
partition $\boldsymbol b$ is given by the
[Bayesian](https://en.wikipedia.org/wiki/Bayesian_inference) posterior
probability

$$
P(\boldsymbol b | \boldsymbol A) = \frac{\sum_{\boldsymbol\theta}P(\boldsymbol A|\boldsymbol\theta, \boldsymbol b)P(\boldsymbol\theta, \boldsymbol b)}{P(\boldsymbol A)}
$$ (model-posterior-sum)

where $P(\boldsymbol\theta, \boldsymbol b)$ is the [prior
probability](https://en.wikipedia.org/wiki/Prior_probability) of the
model parameters, and

$$
P(\boldsymbol A) = \sum_{\boldsymbol\theta,\boldsymbol b}P(\boldsymbol A|\boldsymbol\theta, \boldsymbol b)P(\boldsymbol\theta, \boldsymbol b)
$$ (model-evidence)

is called the *evidence*, and corresponds to the total probability of the data
summed over all model parameters. The particular types of model that will be
considered here have "hard constraints", such that there is only one choice for
the remaining parameters $\boldsymbol\theta$ that is compatible with the
generated network, which means Eq. {eq}`model-posterior-sum` simplifies to

$$
P(\boldsymbol b | \boldsymbol A) = \frac{P(\boldsymbol A|\boldsymbol\theta, \boldsymbol b)P(\boldsymbol\theta, \boldsymbol b)}{P(\boldsymbol A)}
$$ (model-posterior)

with $\boldsymbol\theta$ above being the only choice compatible with
$\boldsymbol A$ and $\boldsymbol b$. The inference procedures considered
here consist of either finding a network partition that maximizes
Eq. {eq}`model-posterior`, or sampling different partitions according
to their posterior probability.

As we will show below, this approach also enables the comparison of
*different* models according to statistical evidence (a.k.a. *model
selection*).

## Minimum description length (MDL)

We note that Eq. {eq}`model-posterior` can be written as

$$
P(\boldsymbol b | \boldsymbol A) = \frac{\exp(-\Sigma)}{P(\boldsymbol A)}
$$

where

$$
\Sigma = -\ln P(\boldsymbol A|\boldsymbol\theta, \boldsymbol b) - \ln P(\boldsymbol\theta, \boldsymbol b)
$$ (model-dl)

is called the **description length** of the network $\boldsymbol A$. It measures
the amount of [information](https://en.wikipedia.org/wiki/Information_theory)
required to describe the data, if we
[encode](https://en.wikipedia.org/wiki/Entropy_encoding) it using the particular
parametrization of the generative model given by $\boldsymbol\theta$ and
$\boldsymbol b$, as well as the parameters themselves. Since $P(\boldsymbol A)$
does not depend on $\boldsymbol b$, maximizing the posterior distribution of
Eq. {eq}`model-posterior` is fully equivalent to minimizing $\Sigma$, i.e. to the
so-called [minimum description
length](https://en.wikipedia.org/wiki/Minimum_description_length) method. This
approach corresponds to an implementation of [Occam's
razor](https://en.wikipedia.org/wiki/Occam%27s_razor), where the *simplest*
model is selected, among all possibilities with the same explanatory power. The
selection is based on the statistical evidence available, and therefore will not
[overfit](https://en.wikipedia.org/wiki/Overfitting), i.e. mistake stochastic
fluctuations for actual structure. In particular this means that we will not
find modules in networks if they could have arisen simply because of stochastic
fluctuations, as they do in fully random graphs
{cite}`inf-guimera_modularity_2004`.
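
The equivalence between maximizing the posterior and minimizing $\Sigma$ can be
checked numerically with a minimal sketch. The three candidate partitions and
their log-probabilities below are made-up values chosen only for illustration
(they are not the output of any real inference), and the normalization runs over
these three candidates rather than over all partitions, as $P(\boldsymbol A)$
would.

```python
import math

# Hypothetical candidates: (ln P(A | theta, b), ln P(theta, b)), in nats.
# These numbers are invented for illustration only.
candidates = {
    "one group": (-520.0, -2.0),
    "three groups": (-430.0, -45.0),
    "twenty groups": (-400.0, -160.0),
}

# Description length: Sigma = -ln P(A | theta, b) - ln P(theta, b)
dl = {name: -(log_like + log_prior)
      for name, (log_like, log_prior) in candidates.items()}

# Posterior restricted to these candidates: exp(-Sigma) / sum of exp(-Sigma).
# Subtracting the minimum Sigma first avoids numerical underflow.
dl_min = min(dl.values())
weights = {name: math.exp(-(sigma - dl_min)) for name, sigma in dl.items()}
norm = sum(weights.values())
posterior = {name: w / norm for name, w in weights.items()}

best_mdl = min(dl, key=dl.get)                # smallest description length
best_map = max(posterior, key=posterior.get)  # largest posterior probability
print(best_mdl, best_map)
```

In this toy setting the candidate with twenty groups fits the data best (largest
likelihood), but its parameters are costly to describe, so the intermediate
candidate ends up with the shortest total description length.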

# The stochastic block model (SBM)

The [stochastic block model](https://en.wikipedia.org/wiki/Stochastic_block_model) is arguably
the simplest generative process based on the notion of groups of
nodes {cite}`inf-holland_stochastic_1983`. The [microcanonical](https://en.wikipedia.org/wiki/Microcanonical_ensemble) formulation
{cite}`inf-peixoto_nonparametric_2017` of the basic or "traditional" version takes
as parameters the partition of the nodes into groups
$\boldsymbol b$ and a $B\times B$ matrix of edge counts
$\boldsymbol e$, where $e_{rs}$ is the number of edges
between groups $r$ and $s$. Given these constraints, the
edges are then placed randomly. Hence, nodes that belong to the same
group possess the same probability of being connected with other
nodes of the network.
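
The generative process just described can be sketched in a few lines of plain
Python. This is a simplified illustration and not the sampler used by the
library: the function name `sample_sbm`, the convention of storing
$\boldsymbol e$ as a dictionary keyed by group pairs $(r, s)$ with $r \le s$,
and the choice to allow self-loops and parallel edges are all assumptions made
for this example.

```python
import random
from collections import Counter

def sample_sbm(b, e, seed=None):
    """Place edges at random given a partition b and edge counts e.

    b: list where b[i] is the group of node i.
    e: dict mapping a group pair (r, s), r <= s, to the number of edges
       between r and s (a convention assumed for this sketch).
    Returns a list of edges (i, j); self-loops and parallel edges may occur.
    """
    rng = random.Random(seed)
    # Collect the nodes that belong to each group.
    groups = {}
    for node, r in enumerate(b):
        groups.setdefault(r, []).append(node)
    edges = []
    for (r, s), count in e.items():
        for _ in range(count):
            # Each endpoint is drawn uniformly within its group, so all
            # nodes of a group are statistically equivalent.
            edges.append((rng.choice(groups[r]), rng.choice(groups[s])))
    return edges

b = [0, 0, 0, 1, 1, 1, 2, 2, 2]
e = {(0, 0): 5, (1, 1): 5, (2, 2): 5, (0, 1): 1, (1, 2): 1, (0, 2): 0}
edges = sample_sbm(b, e, seed=42)

# The realized edge counts between groups match e by construction.
realized = Counter(tuple(sorted((b[i], b[j]))) for i, j in edges)
print(dict(realized))
```

Only the group memberships of the endpoints are constrained; which particular
nodes get connected is left entirely to chance.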

An example of a possible parametrization is given in the following
figure.

```{testcode} sbm-example
:hide:
mkchdir(DOC_DIR + "/demos/inference/output")

g = gt.load_graph("../blockmodel-example.gt.gz")
gt.graph_draw(g, pos=g.vp.pos, vertex_size=10, vertex_fill_color=g.vp.bo,
             vertex_color="#333333",
             edge_gradient=g.new_ep("vector<double>", val=[0]),
             output="sbm-example.svg")

ers = g.gp.w

from pylab import *
figure()
matshow(log(ers))
xlabel("Group $r$")
ylabel("Group $s$")
gca().xaxis.set_label_position("top")
savefig("sbm-example-ers.svg")
```

```{eval-rst}
.. table::
    :align: center

    +-----------------------------------------+-------------------------------------+
    |.. figure:: output/sbm-example-ers.svg   |.. figure:: output/sbm-example.svg   |
    |   :width: 300px                         |   :width: 300px                     |
    |   :align: center                        |   :align: center                    |
    |                                         |                                     |
    |   Matrix of edge counts                 |   Generated network.                |
    |   :math:`\boldsymbol e` between         |                                     |
    |   groups.                               |                                     |
    +-----------------------------------------+-------------------------------------+
```

:::{note}
With the SBM, no constraints are imposed on what *kind* of modular structure is
allowed, as the matrix of edge counts $\boldsymbol e$ is unconstrained. Hence, we can detect
the putatively typical pattern of assortative ["community
structure"](https://en.wikipedia.org/wiki/Community_structure), i.e. when nodes
are connected mostly to other nodes of the same group, if it happens to be the
most likely network description, but we can also detect a large multiplicity of
other patterns, such as
[bipartiteness](https://en.wikipedia.org/wiki/Bipartite_graph), core-periphery,
and many others, all under the same inference framework. If you are interested
in searching exclusively for assortative structures, see Sec.
{ref}`planted_partition`.
:::

Although quite general, the traditional model assumes that the edges are placed
randomly between and within the groups, and because of this the nodes that belong to the
same group tend to have very similar degrees. As it turns out, this is often a
poor model for many networks, which possess highly heterogeneous degree
distributions. A better model for such networks is called the *degree-corrected*
stochastic block model {cite}`inf-karrer_stochastic_2011`, and it is defined
just like the traditional model, with the addition of the degree sequence
$\boldsymbol k = \{k_i\}$ of the graph as an additional set of parameters
(assuming again a microcanonical formulation
{cite}`inf-peixoto_nonparametric_2017`).
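
One way to picture how the degree sequence enters as a hard constraint is to
pair "half-edges" (stubs) at random, in the spirit of the configuration model.
The following is a minimal sketch under stated assumptions, not the library's
implementation: the function name `sample_dcsbm` is hypothetical, edge counts
are stored as a dictionary keyed by group pairs $(r, s)$ with $r \le s$, and
self-loops and parallel edges are allowed.

```python
import random
from collections import Counter

def sample_dcsbm(b, k, e, seed=None):
    """Pair half-edges ("stubs") at random, respecting b, k and e.

    b: b[i] is the group of node i.
    k: k[i] is the degree of node i.
    e: dict mapping (r, s), r <= s, to the number of edges between r and s.
    """
    rng = random.Random(seed)
    # Each node contributes k[i] stubs to the pool of its group.
    stubs = {}
    for node, r in enumerate(b):
        stubs.setdefault(r, []).extend([node] * k[node])
    for pool in stubs.values():
        rng.shuffle(pool)
    edges = []
    for (r, s), count in e.items():
        for _ in range(count):
            # The pools were shuffled, so pop() returns a random stub.
            edges.append((stubs[r].pop(), stubs[s].pop()))
    # b, k and e are mutually consistent only if every stub was used.
    assert all(not pool for pool in stubs.values()), "inconsistent k and e"
    return edges

b = [0, 0, 0, 1, 1, 1]
k = [4, 1, 1, 3, 2, 1]
e = {(0, 0): 2, (0, 1): 2, (1, 1): 2}
edges = sample_dcsbm(b, k, e, seed=1)

# Both the degrees and the edge counts between groups are reproduced exactly.
degrees = Counter()
for i, j in edges:
    degrees[i] += 1
    degrees[j] += 1
realized = Counter(tuple(sorted((b[i], b[j]))) for i, j in edges)
print([degrees[i] for i in range(len(b))], dict(realized))
```

Nodes in the same group can now have very different degrees (4 versus 1 in the
first group of this toy example) while still sharing the same connection
pattern to the other groups.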

## The nested stochastic block model

The regular SBM has a drawback when applied to large networks. Namely,
it cannot be used to find relatively small groups, as the maximum number
of groups that can be found scales as
$B_{\text{max}}=O(\sqrt{N})$, where $N$ is the number of
nodes in the network, if Bayesian inference is performed
{cite}`inf-peixoto_parsimonious_2013`. In order to circumvent this, we need to
replace the noninformative priors with a hierarchy of priors and
hyperpriors, which amounts to a *nested SBM*, where the groups
themselves are clustered into groups, and the matrix $\boldsymbol e$ of edge
counts is generated from another SBM, and so on recursively
{cite}`inf-peixoto_hierarchical_2014`, as illustrated below.

:::{figure} nested-diagram.*
:align: center
:width: 400px

Example of a nested SBM with three levels.
:::

With this model, the maximum number of groups that can be inferred
scales as $B_{\text{max}}=O(N/\log(N))$. In addition to being able
to find small groups in large networks, this model also provides a
multilevel hierarchical description of the network. With such a
description, we can uncover structural patterns at multiple scales,
representing different levels of coarse-graining.
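
To get a feeling for the difference between the two scalings, the short sketch
below evaluates $\sqrt{N}$ and $N/\log N$ for a few network sizes. It ignores
the constant prefactors hidden in the $O(\cdot)$ notation and assumes the
natural logarithm, so the numbers indicate orders of magnitude only, not actual
resolution limits.

```python
import math

# Compare the two scalings, ignoring the unknown constant prefactors.
rows = []
for N in (10**3, 10**4, 10**5, 10**6):
    flat = math.sqrt(N)        # non-nested SBM: O(sqrt(N))
    nested = N / math.log(N)   # nested SBM: O(N / log N), natural log assumed
    rows.append((N, flat, nested))
    print(f"N={N:>8}  sqrt(N)={flat:8.0f}  N/log(N)={nested:10.0f}")
```

The gap widens with $N$: for a million nodes the first scaling gives about a
thousand groups, while the second gives tens of thousands.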
