Statistical Approaches for Hidden Variables in Ecology. Nathalie Peyrard

inclusion of covariates in this model makes it possible to identify environmental variables that favor particular states.

      1.2.2.4. Example: a three-state HMM with Gaussian emission

      Thus, in model [1.4], ν0 is a probability vector of size 3, Π is a 3 × 3 matrix whose rows each sum to 1, and, for 1 ≤ j ≤ 3, the emission distribution with parameter θj is a Gaussian distribution 𝒩(μj, Σj), where:

       – μj is a vector of dimension 2 (the mean of V p and V r for activity j);

       – Σj is a variance–covariance matrix (of size 2 × 2).
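      As a minimal sketch of this three-state model, the code below simulates hidden activities and bivariate Gaussian observations. All numerical values of `nu0`, `Pi`, `mu` and `Sigma` are hypothetical and chosen only for illustration, the function names are not taken from the book, and states are indexed 0 to 2 in the code rather than 1 to 3.

```python
import math
import random

# Hypothetical parameter values, for illustration only.
nu0 = [0.5, 0.3, 0.2]                      # initial distribution over the 3 activities
Pi = [[0.90, 0.05, 0.05],                  # transition matrix: each row sums to 1
      [0.10, 0.80, 0.10],
      [0.05, 0.15, 0.80]]
mu = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]  # mean vector (dimension 2) per activity
Sigma = [[[0.1, 0.0], [0.0, 0.1]],         # 2 x 2 variance-covariance matrix per activity
         [[1.0, 0.2], [0.2, 1.0]],
         [[0.5, 0.0], [0.0, 0.3]]]


def sample_gaussian_2d(mean, cov, rng):
    """Draw one sample from N(mean, cov) using the 2 x 2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mean[0] + l11 * e1, mean[1] + l21 * e1 + l22 * e2]


def simulate_hmm(n, rng):
    """Simulate hidden states Z_0..Z_n and observations Y_0..Y_n."""
    states = [0, 1, 2]
    z = rng.choices(states, weights=nu0)[0]        # Z_0 ~ nu0
    Z, Y = [], []
    for _ in range(n + 1):
        Z.append(z)
        Y.append(sample_gaussian_2d(mu[z], Sigma[z], rng))  # Y_t | Z_t = z
        z = rng.choices(states, weights=Pi[z])[0]  # next state drawn from row z of Pi
    return Z, Y


rng = random.Random(0)
Z, Y = simulate_hmm(200, rng)
```

      In a real application only `Y` is observed; `Z` is the hidden sequence that inference aims to recover.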

      1.2.2.5. Inference

      Using the model defined by [1.4], inference is used to fulfill two purposes:

       – Estimation of activity: to determine the distribution of real activities given the observations, that is, for 0 ≤ t ≤ n, the distribution of the random variable Zt|Y0:n. For each time t, the estimated smoothing distribution gives the probability of being involved in each of the J activities.

       – Estimation of parameters: the distribution ν0 and the transition matrix Π characterize the dynamics of activities, and the set of parameters {θj}, 1 ≤ j ≤ J, indicates the way in which the activity influences the distribution of observations.

      In the case of unknown parameters, these two steps are carried out jointly. Taking a frequentist approach, the EM algorithm may be used, as in section 1.2.1. Once again, it is easy to write an equation, analogous to [1.2], giving the full likelihood.

      For this model, step E once again consists of calculating the quantity given by [1.4]. Again, the difficulty lies in calculating the smoothing distribution. Nevertheless, the discrete character of the hidden dynamic means that explicit calculation is possible. This is carried out iteratively using the forward–backward algorithm. The equations used in this simple and efficient algorithm can be found in Rabiner (1989).
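      The forward–backward recursions can be sketched as follows. This is a generic scaled implementation written under stated assumptions, not the book's code: the function name and the input `emis`, where `emis[t][j]` is the emission density of Yt under state j, are illustrative choices, and the toy numbers are hypothetical.

```python
import math


def forward_backward(nu0, Pi, emis):
    """Return (gamma, loglik), where gamma[t][j] = P(Z_t = j | Y_0:n)."""
    n_times, J = len(emis), len(nu0)

    # Forward pass: alpha[t][j] = P(Z_t = j | Y_0:t), rescaled at each step.
    alpha, scale = [], []
    for t in range(n_times):
        if t == 0:
            pred = list(nu0)
        else:
            pred = [sum(alpha[t - 1][i] * Pi[i][j] for i in range(J))
                    for j in range(J)]
        unnorm = [pred[j] * emis[t][j] for j in range(J)]
        c = sum(unnorm)                  # c = p(Y_t | Y_0:t-1)
        scale.append(c)
        alpha.append([u / c for u in unnorm])

    # Backward pass, using the same scaling constants.
    beta = [[1.0] * J for _ in range(n_times)]
    for t in range(n_times - 2, -1, -1):
        for i in range(J):
            beta[t][i] = sum(Pi[i][j] * emis[t + 1][j] * beta[t + 1][j]
                             for j in range(J)) / scale[t + 1]

    # Smoothing distribution and log-likelihood of the observations.
    gamma = [[alpha[t][j] * beta[t][j] for j in range(J)]
             for t in range(n_times)]
    loglik = sum(math.log(c) for c in scale)
    return gamma, loglik


# Toy two-state example with hypothetical numbers.
nu0 = [0.6, 0.4]
Pi = [[0.7, 0.3], [0.2, 0.8]]
emis = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
gamma, loglik = forward_backward(nu0, Pi, emis)

# Most likely activity at each time step, taken separately.
z_marginal = [max(range(len(nu0)), key=lambda j: g[j]) for g in gamma]
```

      Rescaling the forward probabilities at each step avoids numerical underflow on long sequences, and the scaling constants give the log-likelihood as a by-product.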

      Step M, in which the parameters are updated, depends on the nature of the emission distribution. In the Gaussian case, this step has a closed-form expression; this is not the case for other distributions, such as the von Mises distribution.
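      For Gaussian emissions, the explicit update of the emission parameters amounts to weighted means and weighted covariances, with weights given by the smoothing probabilities. The sketch below shows only this part of step M (the update of Π, which needs pairwise smoothing probabilities, is omitted); the function name and the inputs `Y` and `gamma` are illustrative assumptions.

```python
def m_step_gaussian(Y, gamma):
    """Update (mu, Sigma) from observations Y[t] and weights gamma[t][j].

    gamma[t][j] is assumed to be the smoothing probability P(Z_t = j | Y_0:n).
    """
    n_times, J, d = len(Y), len(gamma[0]), len(Y[0])
    mu, Sigma = [], []
    for j in range(J):
        w = sum(gamma[t][j] for t in range(n_times))   # expected time spent in state j
        m = [sum(gamma[t][j] * Y[t][k] for t in range(n_times)) / w
             for k in range(d)]
        S = [[sum(gamma[t][j] * (Y[t][a] - m[a]) * (Y[t][b] - m[b])
                  for t in range(n_times)) / w
              for b in range(d)]
             for a in range(d)]                        # maximum likelihood (not unbiased) estimate
        mu.append(m)
        Sigma.append(S)
    return mu, Sigma


# Toy check with hard (0/1) weights: each state recovers its own group statistics.
Y = [[0.0, 0.0], [2.0, 2.0], [10.0, 0.0], [12.0, 4.0]]
gamma = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mu, Sigma = m_step_gaussian(Y, gamma)
```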

      Bayesian estimation may be carried out using Markov chain Monte Carlo (MCMC) algorithms, as implemented in software such as Stan (Carpenter et al. 2017), WinBUGS (Lunn et al. 2000) or NIMBLE (de Valpine et al. 2017).

      1.2.2.6. Reconstruction of hidden states

      The reconstruction of hidden activities allows us to identify homogeneous phases in behaviors, and is often of considerable interest from an ecological perspective. This hidden Markov model may thus be seen as an unsupervised segmentation/classification model for movement.

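      One possibility is to reconstruct the most likely hidden activity at each time step separately, taking the state that maximizes the smoothing distribution at that time:

$$\hat{Z}_t = \arg\max_{1 \le j \le J} \mathbb{P}(Z_t = j \mid Y_{0:n}), \qquad 0 \le t \le n.$$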

      Using a Bayesian approach, the sampling algorithms used to estimate parameters permit the use of a joint smoothing distribution, that is, samples of Z0:n|Y0:n can be obtained. Each sample produced is thus a possible sequence of activities corresponding to given observations.

      Sampling from this distribution can also be carried out within a frequentist approach, but the combinatorial complexity is high and the computational effort involved rapidly becomes prohibitive as n increases. Hidden activities are most commonly reconstructed as the most probable sequence of hidden states, that is, the sequence which maximizes the joint posterior distribution, or, more formally,

$$\hat{Z}_{0:n} = \arg\max_{z_{0:n}} \mathbb{P}(Z_{0:n} = z_{0:n} \mid Y_{0:n}).$$

      This sequence can be calculated in an efficient manner using the Viterbi algorithm, and is the version which is generally returned by libraries offering frequentist estimation. Note that the m most probable sequences can be obtained using the generalized Viterbi algorithm (Guédon 2007).
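      A minimal sketch of the Viterbi recursion, computed in log space, is given below. As an assumption made for this example, `emis[t][j]` holds the emission density of Yt under state j; the function name and the toy numbers are hypothetical.

```python
import math


def viterbi(nu0, Pi, emis):
    """Return the most probable state sequence given the observations."""
    n_times, J = len(emis), len(nu0)

    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # delta[t][j]: best log joint probability of any path ending in state j at time t.
    delta = [[log(nu0[j]) + log(emis[0][j]) for j in range(J)]]
    back = []                                # back[t-1][j]: best predecessor of j at time t
    for t in range(1, n_times):
        row, ptr = [], []
        for j in range(J):
            best = max(range(J), key=lambda i: delta[t - 1][i] + log(Pi[i][j]))
            ptr.append(best)
            row.append(delta[t - 1][best] + log(Pi[best][j]) + log(emis[t][j]))
        delta.append(row)
        back.append(ptr)

    # Backtrack from the best final state.
    path = [max(range(J), key=lambda j: delta[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy two-state example with hypothetical numbers.
nu0 = [0.6, 0.4]
Pi = [[0.7, 0.3], [0.2, 0.8]]
emis = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
z_hat = viterbi(nu0, Pi, emis)
```

      The cost is of order n × J² operations, compared with J^(n+1) candidate sequences for exhaustive enumeration.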

      1.2.2.7. Choosing the number of activities

      There are two very different approaches to choosing a number of behaviors or activities. The first is based on biological criteria, and a certain number of different behaviors may be identified. In the example of the red-footed booby, described later, a distinction is made between periods of rest, slow flight (corresponding to foraging) and rapid, direct flight, corresponding to trajectories between two points of interest.

      Nevertheless, in the case of a new species or study environment, it can be hard to establish an initial idea of the number of hidden states; in this case, an approach based on statistical, rather than biological, criteria may be preferred. In statistics, this is known as a model choice problem, with “model” corresponding to a number of components.

      One well-known model choice criterion is the Akaike information criterion (AIC) (Akaike 1973), which balances goodness of fit against the number of parameters. Here, however, the aim is not simply to identify a parsimonious model that fits the data; states also need to be as distinct as possible, meaning that the problem is also one of classification. A new state should only be added if it is sufficiently distinct from the other states. In this case, the integrated complete likelihood (ICL) criterion may be used (Biernacki et al. 2000; Bacci et al. 2014).
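      To make the AIC concrete, the sketch below counts the free parameters of a J-state HMM with d-dimensional Gaussian emissions and applies the usual convention AIC = −2 log L + 2k (lower is better). The function names are illustrative, and the sign and scaling convention is an assumption, since the text does not state which one it uses.

```python
def n_free_parameters(J, d):
    """Number of free parameters of a J-state HMM with d-dimensional Gaussian emissions."""
    initial = J - 1                       # nu0 sums to 1
    transitions = J * (J - 1)             # each row of Pi sums to 1
    means = J * d                         # one mean vector per state
    covariances = J * d * (d + 1) // 2    # one symmetric d x d matrix per state
    return initial + transitions + means + covariances


def aic(loglik, J, d):
    """AIC under the convention -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2 * n_free_parameters(J, d)


# For the three-state model with bivariate Gaussian emissions:
k = n_free_parameters(3, 2)
```

      Candidate values of J would each be fitted separately, and the maximized log-likelihood of each fit passed to `aic` for comparison.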

      Given a set of estimated parameters for the model and a sequence of most probable states Ẑ0:n, reconstructed using the Viterbi algorithm, for example, the ICL criterion is defined thus:
