# Network Inference Problem

Observing a diffusion process often reduces to noting when nodes (people, blogs, etc.) reproduce a piece of information, buy a product, get infected by a virus, or, more generally, adopt a contagion. For example, blogs or media sites often publish a new piece of information without explicitly citing their sources. Marketers may note when a social media user decides to adopt a new behavior but cannot tell which neighbor in the social network influenced them to do so. Epidemiologist observe when a person gets sick but usually cannot tell who infected her. In all these cases, we observe who and when but not how or why contagions propagate through a population of individuals.

Given a set of contagions’ adoption times, called cascades, and a diffusion model, the network inference problem consists of inferring the edges (and model parameters) of the unobserved underlying network [1]. The network inference problem has attracted significant attention in recent years, since it is essential to reconstruct and predict the paths over which information can spread, and to maximize sales of a product or stop infections.

Formulation

We consider the network inference problem under the continuous-time diffusion model recently introduced by Gomez-Rodriguez et al. [2], which has been extensively validated in real diffusion data, and, due to its flexibility, has been extended to support textual information, nonparametric pairwise likelihoods, topic modeling, dynamic networks, and the influence of external events by different authors.

Given a directed contact network, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes, the process begins with an infected source node, $s$, initially adopting certain contagion at time zero, which we draw from a source distribution $P(s)$. The contagion is transmitted from the source along her out-going edges to her direct neighbors. Each transmission through an edge entails a random transmission time, $\tau=t_j-t_j$, drawn from an associated transmission function $f(\tau ; \alpha_{ji})$. We assume transmission times are independent, possibly distributed differently across edges, and, in some cases, can be arbitrarily large, $\tau \rightarrow \infty$. Then, the infected neighbors transmit the contagion to their respective neighbors, and the process continues. We assume that an infected node remains infected for the entire diffusion process. Thus, if a node $i$ is infected by multiple neighbors, only the neighbor that first infects node $i$ will be the true parent. As a result, although the contact network can be an arbitrary directed network, each contagion induces a Directed Acyclic~Graph (DAG).

Observations from the model are recorded as a set of cascades $\{\mathbf{t}^1,\ldots,\mathbf{t}^{n}\}$. Each cascade $\mathbf{t}^c$ is an $N$-dimensional vector $\mathbf{t}^c := (t^c_1, \ldots, t^c_N)$ recording when nodes are infected, $t^c_k\in [0,T^c]\cup\{\infty\}$. Symbol $\infty$ labels nodes that are not infected during observation window $[0,T^c]$ – it does not imply they are never infected. The `clock’ is reset to 0 at the start of each cascade. We assume $T^c = T$ for all cascades; the results generalize trivially.

The Figure below illustrates an example of underlying unobserved network and two cascades $(t_a, t_b, t_c, \infty, \infty, \infty)$ and $(\infty, t_b, \infty, t_d, t_e, t_f)$. Nodes in dark red are the sources of each cascade and nodes in light red are the remaining infected nodes by network diffusion.

Gomez-Rodriguez et al. [2] showed that the likelihood of a cascade $\mathbf{t}$ under the continuous-time independent cascade model is

$f(\mathbf{t} ; \mathbf{A}) = \prod_{t_i \leq T} \prod_{t_m > T} S(T | t_i ; \alpha_{im}) \prod_{k : t_k < t_i}$ $S(t_i | t_k ; \alpha_{ki})$ $\sum_{j : t_j < t_i} H(t_i | t_j ; \alpha_{ji})$,

where $\mathbf{A}=\{\alpha_{ji}\}$ denotes the collection of parameters, $S(t_i | t_j ; \alpha_{ji}) = 1-\int_{0}^{t_i-t_j} f(\tau ; \alpha_{ji}\, d \tau$ is the survival function and $H(t_i | t_j ; \alpha_{ji}) = f(t_i-t_j ; \alpha_{ji}) / S(t_i | t_j ; \alpha_{ji})$ is the hazard function. The survival terms in the first line account for the ; probability that uninfected nodes survive to all infected nodes in the cascade up to $T$ and the survival and hazard terms in the second line account for the likelihood of the infected nodes. Then, assuming cascades are sampled independently, the likelihood of a set of cascades is the product of the likelihoods of individual cascades.

Let $C^n$ be a set of $n$ cascades sampled from the described model, where the source $s \in \mathcal{V}$ of each cascade is drawn from a source distribution $P(s)$. Then, the network inference problem consists of finding the directed edges and the associated parameters using
only the temporal information from the set of cascades $C^n$. This problem has been cast as a maximum likelihood estimation problem:

minimize $- \sum_{c \in C^n} \log f(\mathbf{t}^c ; \mathbf{A})$
subject to $\alpha_{ji} \geq 0,\, i, j=1,\ldots,N, i \neq j,$

where the inferred edges in the network correspond to those pairs of nodes with non-zero parameters, i.e., $\hat{\alpha}_{ji} > 0$.

Gomez-Rodriguez et al. [2] showed that the network inference problem defined above is convex in $\mathbf{A}$ if the survival functions are log-concave and the hazard functions are concave in the parameters $\mathbf{A}$. Thus, the optimal transmission rates can be found efficiently. Importantly, they also show that the model works well in practice for modeling memes spreading over blog and mainstream news media sites data.

If you would like to learn more about the network inference problem, try some code, and play with some real world datasets, check out http://people.tuebingen.mpg.de/manuelgr/netrate and http://snap.stanford.edu/infopath/.

References

[1] M. Gomez-Rodriguez, J. Leskovec, A. Krause. Inferring Networks of Diffusion and Influence. KDD 2010.

[2] M. Gomez-Rodriguez, D. Balduzzi, B. Schoelkopf. Uncovering the Temporal Dynamics of Diffusion Networks. ICML 2011.