## Original link: tecdat.cn/?p=2687

## Original source: Tuoduan Data Tribe Official Account

# What is MCMC and when should I use it?

MCMC is simply an algorithm for sampling from a distribution.

It is only one of many algorithms for doing so. The term stands for "Markov Chain Monte Carlo" because it is a "Monte Carlo" (i.e., random) method that uses "Markov chains" (which we will discuss later). MCMC is just one type of Monte Carlo method, although many other commonly used methods can be regarded as simple special cases of MCMC.

# Why should I sample from the distribution?

Sampling from a distribution turns out to be the easiest way to solve some problems.

Perhaps the most common use of MCMC is to draw samples from the posterior probability distribution of a model in Bayesian inference. With these samples, you can then ask questions such as "What are the mean and credible interval of a parameter?".

If these samples are independent draws from the distribution, the sample mean converges on the true mean as the number of samples grows.

Suppose our target distribution is a normal distribution with mean m and standard deviation s.

As an example, consider estimating the mean of a normal distribution with mean m and standard deviation s (here, I will use the parameters corresponding to the standard normal distribution):

    m <- 0
    s <- 1

We can easily sample from this distribution using the rnorm function:

    set.seed(1)
    samples <- rnorm(10000, m, s)

The mean of the samples is very close to the true mean (zero):

    mean(samples)
    ## [1] -0.006537

In fact, in this case, the variance of the estimate from $n$ samples is $1/n$, so we expect most estimates to lie within $\pm 2/\sqrt{n} = 0.02$ of the true mean for 10,000 points.

    summary(replicate(1000, mean(rnorm(10000, m, s))))
    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -0.03250 -0.00580  0.00046  0.00042  0.00673  0.03550
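To make the $1/\sqrt{n}$ scaling concrete, here is a minimal sketch that compares the observed spread of the estimated mean with the theoretical standard error $s/\sqrt{n}$. The sample sizes, the number of replicates (500) and the seed are arbitrary choices for illustration, not values from the original analysis.

```r
# Spread of a Monte Carlo mean for a standard normal target.
set.seed(42)                     # arbitrary seed, only for reproducibility
m <- 0
s <- 1
sizes <- c(100, 10000)           # illustrative sample sizes

# For each n, repeat the experiment 500 times and measure the spread
# of the resulting estimates of the mean.
se <- sapply(sizes, function(n)
    sd(replicate(500, mean(rnorm(n, m, s)))))

rbind(n = sizes, observed = se, expected = s / sqrt(sizes))
```

Increasing the sample size by a factor of 100 should shrink the spread by roughly a factor of 10.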

This function computes the cumulative mean (that is, for element k, the sum of elements 1 through k divided by k).

    cummean <- function(x)
        cumsum(x) / seq_along(x)

    plot(cummean(samples), type="l", xlab="Sample", ylab="Cumulative mean",
         panel.first=abline(h=0, col="red"), las=1)

Transforming the x-axis to a log scale and overlaying another 30 random runs makes the convergence easier to see.
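One way this plot might be drawn is sketched below. The seed, the random semi-transparent line colors and the restatement of `m`, `s` and `cummean` (so that the block runs on its own) are assumptions made for illustration.

```r
set.seed(1)                      # assumed seed
m <- 0
s <- 1
cummean <- function(x)
    cumsum(x) / seq_along(x)

samples <- rnorm(10000, m, s)

# Cumulative mean of the first run, with the sample index on a log scale.
plot(cummean(samples), type = "l", log = "x", las = 1,
     xlab = "Sample", ylab = "Cumulative mean",
     panel.first = abline(h = 0, col = "red"))

# Thirty more independent runs, each in a random semi-transparent color.
for (i in seq_len(30))
    lines(cummean(rnorm(10000, m, s)),
          col = rgb(runif(1), runif(1), runif(1), 0.5))
```

Every run wanders at first and then settles toward the true mean as the number of samples grows.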

You can also estimate quantiles from your series of sampled points.

This is the analytically computed point below which 2.5% of the probability density lies:

    p <- 0.025
    a.true <- qnorm(p, m, s)
    a.true
    ## [1] -1.96

In this case, we can estimate this point by direct integration:

    f <- function(x) dnorm(x, m, s)
    g <- function(a)
        integrate(f, -Inf, a)$value
    # Find the point a at which the integral up to a equals p
    a.int <- uniroot(function(x) g(x) - p, c(-10, 0))$root
    a.int
    ## [1] -1.96

And estimate the point by Monte Carlo integration:

    a.mc <- unname(quantile(samples, p))
    a.mc
    ## [1] -2.023
    a.true - a.mc
    ## [1] 0.06329

However, in the limit as the sample size tends to infinity, this estimate converges on the true value. In addition, it is possible to make statements about the nature of the error: if we repeat the sampling process 100 times, we get a series of estimates whose errors are of about the same order of magnitude as the errors around the mean:

    a.mc <- replicate(100, quantile(rnorm(10000, m, s), p))
    summary(a.true - a.mc)
    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    ## -0.05840 -0.01640 -0.00572 -0.00024  0.01400  0.07880

This kind of thing is really common. In most Bayesian inference, the posterior distribution is a function of some (possibly large) vector of parameters, and you want to make inferences about a subset of these parameters.

In a hierarchical model, you may have a large number of random effect terms being fitted, but you mostly want to make inferences about one parameter.

In a Bayesian framework, you would compute the marginal distribution of the parameter you are interested in by integrating over all the other parameters (this is what we were trying to do above).
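With samples in hand, marginalizing is simple: keep the column for the parameter of interest and ignore the rest. Here is a minimal sketch using a made-up two-parameter joint distribution (hypothetical, not a model from this article) that happens to be easy to sample directly.

```r
# Hypothetical joint distribution: x ~ N(1, 2^2) and y | x ~ N(0.5 * x, 1).
set.seed(1)                        # arbitrary seed
n <- 10000
x <- rnorm(n, 1, 2)
y <- rnorm(n, 0.5 * x, 1)
joint <- cbind(x = x, y = y)       # each row is one draw from the joint distribution

# Marginal summaries of y: simply ignore the x column.
y.mean <- mean(joint[, "y"])       # analytic value is 0.5 * 1 = 0.5
y.interval <- quantile(joint[, "y"], c(0.025, 0.975))
y.mean
y.interval
```

No explicit integration over `x` is needed; the samples have already done that work.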

# Why does "traditional statistics" not use Monte Carlo methods?

For many problems in traditionally taught statistics, rather than sampling from a distribution, you maximize or minimize a function. So we take some function that describes the likelihood and maximize it (maximum likelihood inference), or some function that computes the sum of squares and minimize it.

However, the role that Monte Carlo methods play in Bayesian statistics is the same as that of the optimization routine in frequentist statistics: it is simply the algorithm that carries out the inference. So, once you basically know what MCMC is doing, you can treat it as a black box, just as most people treat their optimization routines.
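For contrast, the optimization route can be sketched in a few lines. This example uses simulated data and a normal model with known standard deviation, both assumptions chosen only to keep the illustration short, and finds the maximum likelihood estimate of the mean with R's `optimize`.

```r
set.seed(1)                                  # arbitrary seed
obs <- rnorm(200, mean = 3, sd = 1)          # simulated data, for illustration only

# Negative log-likelihood of the mean, with the standard deviation fixed at 1.
negloglik <- function(mu)
    -sum(dnorm(obs, mu, 1, log = TRUE))

# Minimizing the negative log-likelihood maximizes the likelihood.
fit <- optimize(negloglik, interval = c(-10, 10))
c(mle = fit$minimum, sample.mean = mean(obs))
```

For this model the maximum likelihood estimate coincides with the sample mean, so the two numbers should agree to within the optimizer's tolerance.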

# Markov Chain Monte Carlo

Suppose we want to draw samples from some target distribution, but we cannot draw independent samples as before. There is a solution that uses Markov Chain Monte Carlo (MCMC) to do this. First, we must define a few things so that the next sentence makes sense: what we are going to do is try to construct a Markov chain that has the target distribution as its stationary distribution.

# Definitions

Suppose we have a three-state Markov process. Let P be the transition probability matrix in the chain:

    P <- rbind(c(.5, .25, .25), c(.2, .1, .7), c(.25, .25, .5))
    P
    ##      [,1] [,2] [,3]
    ## [1,] 0.50 0.25 0.25
    ## [2,] 0.20 0.10 0.70
    ## [3,] 0.25 0.25 0.50
    rowSums(P)
    ## [1] 1 1 1

P[i, j] gives the probability of moving from state i to state j.

Note that, unlike the rows, the columns do not necessarily sum to 1:

    colSums(P)
    ## [1] 0.95 0.60 1.45

This function takes a state vector x (where x[i] is the probability of being in state i) and iterates it by multiplying by the transition matrix P, advancing the system n steps.

    iterate.P <- function(x, P, n) {
        res <- matrix(NA, n+1, length(x))
        res[1,] <- x                    # row 1 holds the starting state
        for (i in seq_len(n))
            res[i+1,] <- x <- x %*% P   # advance one step and store it
        res
    }

Start with the system in state 1 (so x is the vector [1,0,0], meaning that there is a 100% probability of being in state 1 and no chance of being in any other state).
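A minimal sketch of this step is shown here, restating `P` and `iterate.P` so that the block runs on its own. The choice of n = 10 steps is an assumption, picked to match the x-position of the points added to the convergence plot later on.

```r
P <- rbind(c(.5, .25, .25), c(.2, .1, .7), c(.25, .25, .5))

iterate.P <- function(x, P, n) {
    res <- matrix(NA, n + 1, length(x))
    res[1, ] <- x
    for (i in seq_len(n))
        res[i + 1, ] <- x <- x %*% P
    res
}

n <- 10                               # assumed number of steps
y1 <- iterate.P(c(1, 0, 0), P, n)     # start in state 1 with certainty

# Rows: the starting state, the state after one step, and after n steps.
y1[c(1, 2, n + 1), ]
```

After a single step the probabilities are just the first row of `P`, and each row of `y1` still sums to 1.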

Similarly, for the other two possible starting states:

    y2 <- iterate.P(c(0, 1, 0), P, n)
    y3 <- iterate.P(c(0, 0, 1), P, n)

This shows convergence to the stationary distribution.

    matplot(0:n, y1, type="l", lty=1, xlab="Step", ylab="y", las=1)
    matlines(0:n, y2, lty=2)
    matlines(0:n, y3, lty=3)

We can use R's eigen function to extract the leading eigenvector of the system (the t() here transposes the matrix so that we get the left eigenvector).

    v <- eigen(t(P), symmetric=FALSE)$vectors[,1]
    v <- v / sum(v)  # normalize the eigenvector so that it sums to 1
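As a quick check that this eigenvector really is the stationary distribution, the sketch below verifies that multiplying it by `P` leaves it unchanged. The call to `Re()` is a defensive addition of mine in case `eigen` returns a complex type; for this matrix all eigenvalues are real.

```r
P <- rbind(c(.5, .25, .25), c(.2, .1, .7), c(.25, .25, .5))

# Left eigenvector of P for the leading eigenvalue (which is 1).
v <- Re(eigen(t(P))$vectors[, 1])
v <- v / sum(v)                   # normalize so the entries sum to 1
v

# One more step of the chain should leave v unchanged.
residual <- drop(v %*% P) - v
max(abs(residual))
```

Solving the balance equations by hand for this matrix gives (22, 15, 32) / 69, which is what `v` should match.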

Then add points to the previous figure to show how close we are to convergence:

    matplot(0:n, y1, type="l", lty=1, xlab="Step", ylab="y", las=1)
    matlines(0:n, y2, lty=2)
    matlines(0:n, y3, lty=3)
    points(rep(10, 3), v, col=1:3)

The process above iterates the overall probabilities of the different states, not the actual transitions through the system. So, let's iterate the system itself rather than the probability vector.

    run <- function(i, P, n) {
        res <- integer(n)
        for (t in seq_len(n))
            # draw the next state using row i of P as the probabilities
            res[[t]] <- i <- sample(nrow(P), 1, prob=P[i,])
        res
    }

Here is the chain run for 100 steps:

    samples <- run(1, P, 100)
    plot(samples, type="s", xlab="Step", ylab="State", las=1)

Rather than plotting the state, plot the fraction of time that we have spent in each state over time:

    plot(cummean(samples == 1), type="l", ylim=c(0, 1),
         xlab="Step", ylab="y", las=1)
    lines(cummean(samples == 2), col=2)
    lines(cummean(samples == 3), col=3)

Run it for longer (5,000 steps):

    n <- 5000
    set.seed(1)
    samples <- run(1, P, n)
    plot(cummean(samples == 1), type="l", ylim=c(0, 1),
         xlab="Step", ylab="y", las=1)
    lines(cummean(samples == 2), col=2)
    lines(cummean(samples == 3), col=3)
    abline(h=v, lty=2, col=1:3)

So the key point here is that Markov chains have some nice properties. A chain such as this one has a stationary distribution. If we run it for long enough, we can simply look at where the chain spends its time and get a reasonable estimate of that stationary distribution.
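That idea can be checked numerically. The sketch below simulates the three-state chain, tabulates the fraction of steps spent in each state, and compares it with the eigenvector-based stationary distribution. The chain length of 50,000 and the seed are arbitrary choices.

```r
P <- rbind(c(.5, .25, .25), c(.2, .1, .7), c(.25, .25, .5))

# Simulate n transitions of the chain, starting from state i.
run <- function(i, P, n) {
    res <- integer(n)
    for (t in seq_len(n))
        res[[t]] <- i <- sample(nrow(P), 1, prob = P[i, ])
    res
}

set.seed(1)                              # arbitrary seed
samples <- run(1, P, 50000)

# Fraction of time spent in each state.
est <- tabulate(samples, nbins = nrow(P)) / length(samples)

# Stationary distribution from the leading left eigenvector.
v <- Re(eigen(t(P))$vectors[, 1])
v <- v / sum(v)

rbind(simulated = est, eigenvector = v)
```

The two rows should agree to about two decimal places, and the agreement improves as the chain gets longer.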

# Metropolis algorithm

This is the simplest MCMC algorithm.

# MCMC sampling 1d (single parameter) problem

Here is a target distribution to sample from: the weighted sum of two normal distributions. This kind of distribution is fairly simple to sample from directly, but let's draw samples from it with MCMC.

Here are the parameter definitions and the target density.

    p <- 0.4
    mu <- c(-1, 2)
    sd <- c(.5, 2)
    f <- function(x)
        p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

Next, plot the probability density.
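A minimal sketch of such a plot follows, assuming the parameter values p = 0.4, mu = c(-1, 2) and sd = c(0.5, 2) for the two components. The plotting range of -4 to 8 is also an assumption, chosen to cover both components. The last lines check that the mixture integrates to 1, as a weighted sum of normalized densities should.

```r
p <- 0.4
mu <- c(-1, 2)        # assumed component means
sd <- c(.5, 2)        # component standard deviations
f <- function(x)
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

# Plot the target density over a range that covers both components.
curve(f(x), from = -4, to = 8, n = 301, col = "red", las = 1,
      ylab = "Probability density")

# The total area under the density should be (numerically) 1.
area <- integrate(f, -Inf, Inf)$value
area
```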

Let's define a very simple proposal distribution that samples from a normal distribution centered on the current point with a standard deviation of 4.
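The proposal and a single Metropolis update might look like the sketch below. The names `q` and `step` follow the calls used in the `run` function later in this section, but the bodies are my reconstruction of the standard Metropolis rule for a symmetric proposal (accept a proposed point `xp` with probability min(1, f(xp)/f(x))), not code confirmed by the source. The target parameters, starting point, chain length and seed are likewise assumptions.

```r
# Target density: a mixture of two normals (parameter values assumed).
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
f <- function(x)
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

# Proposal: a normal centered on the current point, standard deviation 4.
q <- function(x) rnorm(1, x, 4)

# One Metropolis update from the current point x.
step <- function(x, f, q) {
    xp <- q(x)                        # propose a new point
    alpha <- min(1, f(xp) / f(x))     # acceptance probability (symmetric proposal)
    if (runif(1) < alpha) xp else x   # move, or stay where we are
}

# Short demonstration: run the sampler and compare with the target mean.
set.seed(1)                           # arbitrary seed
x <- 0                                # arbitrary starting point
chain <- numeric(20000)
for (i in seq_along(chain))
    chain[i] <- x <- step(x, f, q)

c(chain.mean = mean(chain), target.mean = p * mu[1] + (1 - p) * mu[2])
```

Because the normal proposal is symmetric, the proposal densities cancel in the acceptance ratio, which is why only the ratio of target densities appears. Note that defining `step` this way masks R's built-in `stats::step` for the rest of the session.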

This function just takes care of running the MCMC for a number of steps. Starting at point x, it returns a matrix with nsteps rows and as many columns as x has elements. If x is a scalar, it returns a vector.

    # step(x, f, q) performs one Metropolis update from x:
    # propose a point with q, then accept or reject it.
    run <- function(x, f, q, nsteps) {
        res <- matrix(NA, nsteps, length(x))
        for (i in seq_len(nsteps))
            res[i,] <- x <- step(x, f, q)
        drop(res)
    }

Here are the first 1000 steps of the Markov chain, with the target density on the right:

    res <- run(-10, f, q, 1000)

    layout(matrix(c(1, 2), 1, 2), widths=c(4, 1))
    par(mar=c(4.1, .5, .5, .5), oma=c(0, 4.1, 0, 0))
    plot(res, type="s", xpd=NA, ylab="Parameter", xlab="Sample", las=1)
    usr <- par("usr")
    xx <- seq(usr[3], usr[4], length=301)
    plot(f(xx), xx, type="l", yaxs="i", axes=FALSE, xlab="")

    # Histogram of the samples against the normalized target density
    hist(res, 50, freq=FALSE, main="", ylim=c(0, .4), las=1,
         xlab="x", ylab="Probability density")
    z <- integrate(f, -Inf, Inf)$value
    curve(f(x) / z, add=TRUE, col="red", n=200)

Run for longer, and the results start to look better:

    res.long <- run(-10, f, q, 50000)
    hist(res.long, 100, freq=FALSE, main="", ylim=c(0, .4), las=1,
         xlab="x", ylab="Probability density")

Now, run with different proposal distributions: one with a large standard deviation (33) and the other with a small standard deviation (0.3).

    res.fast <- run(-10, f, function(x) rnorm(1, x, 33), 1000)
    res.slow <- run(-10, f, function(x) rnorm(1, x, .3), 1000)

Note the different ways the three traces move around.

The red trace (large steps) mostly proposes moves into regions of very low probability density and rejects most of them, so it tends to stay in one place for many steps at a time.

The blue trace (small steps) proposes small moves that tend to be accepted, but it follows a random walk for most of the trajectory. It takes hundreds of iterations to even reach the bulk of the probability density.
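The difference between the three proposals can also be summarized by their acceptance rates. The sketch below is self-contained, so it restates the target density and uses my own reconstruction of the Metropolis `step` function. The helper `accept.rate`, the chain length of 5,000 and the seed are hypothetical choices for illustration.

```r
# Target density and sampler (parameter values and step() are assumptions).
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
f <- function(x)
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

step <- function(x, f, q) {
    xp <- q(x)
    alpha <- min(1, f(xp) / f(x))
    if (runif(1) < alpha) xp else x
}

run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i, ] <- x <- step(x, f, q)
    drop(res)
}

# Fraction of steps on which the chain actually moved.
accept.rate <- function(res) mean(diff(res) != 0)

set.seed(1)                                   # arbitrary seed
sds <- c(small = 0.3, intermediate = 4, large = 33)
rates <- sapply(sds, function(s)
    accept.rate(run(-10, f, function(x) rnorm(1, x, s), 5000)))
rates
```

Small steps are accepted most often and large steps least often, but a high acceptance rate on its own does not mean good mixing: the small-step chain accepts almost everything and still explores slowly.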

You can see the effect of the different proposal step sizes in the autocorrelation among subsequent parameter values. These plots show the decay of the autocorrelation coefficient with increasing lag, and the blue lines indicate the range expected under statistical independence.

    par(mfrow=c(1, 3), mar=c(4, 2, 3.5, .5))
    acf(res.slow, las=1, main="Small steps")
    acf(res, las=1, main="Intermediate")
    acf(res.fast, las=1, main="Large steps")

From this, the effective number of independent samples can be calculated:

    coda::effectiveSize(res)
    ## var1
    ##  187
    coda::effectiveSize(res.fast)
    ##  var1
    ## 33.19
    coda::effectiveSize(res.slow)
    ##  var1
    ## 5.378

This shows more clearly what happens as the chains are run for longer:

    n <- 10^(2:5)
    samples <- lapply(n, function(n) run(-10, f, q, n))
    xlim <- range(sapply(samples, range))
    br <- seq(xlim[1], xlim[2], length=100)
    hh <- lapply(samples, function(x) hist(x, br, plot=FALSE))
    ylim <- c(0, max(f(xx)))

Showing 100, 1,000, 10,000 and 100,000 steps:

    par(mfrow=c(2, 2))
    for (h in hh) {
        plot(h, main="", freq=FALSE, ylim=range(h$density, ylim))
        curve(f(x), add=TRUE, col="red", n=300)
    }

# MCMC in two dimensions

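This function returns a multivariate normal density function, given a mean vector (the center of the distribution) and a variance-covariance matrix.

    make.mvn <- function(mean, vcv) {
        logdet <- as.numeric(determinant(vcv, TRUE)$modulus)
        tmp <- length(mean) * log(2 * pi) + logdet
        vcv.i <- solve(vcv)
        function(x) {
            dx <- x - mean
            exp(-(tmp + rowSums((dx %*% vcv.i) * dx))/2)
        }
    }

As above, define the target density as the sum of two mvns (unweighted this time):

    mu1 <- c(-1, 1)
    mu2 <- c(2, -2)
    vcv1 <- matrix(c(1, .25, .25, 1.5), 2, 2)
    vcv2 <- matrix(c(2, -.5, -.5, 2), 2, 2)
    f1 <- make.mvn(mu1, vcv1)
    f2 <- make.mvn(mu2, vcv2)
    f <- function(x)
        f1(x) + f2(x)

    # Evaluate the target on a grid and draw it
    x <- seq(-5, 6, length=71)
    y <- seq(-7, 6, length=61)
    xy <- expand.grid(x=x, y=y)
    z <- matrix(apply(as.matrix(xy), 1, f), length(x), length(y))
    image(x, y, z, las=1)
    contour(x, y, z, add=TRUE)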

Sampling from multivariate normal distributions is also fairly straightforward, but we will draw samples from this target using MCMC.

There are a few different strategies here: we can propose moves in both dimensions at the same time, or we can sample along each axis independently. Both strategies work, although they will mix at different speeds.

Assuming that we don't actually know how to sample from an mvn, let's use a proposal distribution that is uniform in two dimensions, sampling from the square with width d on each side.
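A self-contained sketch of this sampler is given below. The proposal width d = 8, the starting point (-4, -4), the number of steps and the seed are assumptions chosen for illustration, and `step` is again my reconstruction of the Metropolis update rather than code confirmed by the source. The target parameters are restated so that the block runs on its own.

```r
# Multivariate normal density factory and the two-component target.
make.mvn <- function(mean, vcv) {
    logdet <- as.numeric(determinant(vcv, TRUE)$modulus)
    tmp <- length(mean) * log(2 * pi) + logdet
    vcv.i <- solve(vcv)
    function(x) {
        dx <- x - mean
        exp(-(tmp + rowSums((dx %*% vcv.i) * dx)) / 2)
    }
}
f1 <- make.mvn(c(-1, 1), matrix(c(1, .25, .25, 1.5), 2, 2))
f2 <- make.mvn(c(2, -2), matrix(c(2, -.5, -.5, 2), 2, 2))
f <- function(x) f1(x) + f2(x)

# Metropolis update and driver loop.
step <- function(x, f, q) {
    xp <- q(x)
    alpha <- min(1, f(xp) / f(x))
    if (runif(1) < alpha) xp else x
}
run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i, ] <- x <- step(x, f, q)
    drop(res)
}

# Uniform proposal on a square of width d centered on the current point.
q <- function(x, d = 8)
    x + runif(length(x), -d / 2, d / 2)

x0 <- c(-4, -4)                   # assumed starting point
set.seed(1)                       # arbitrary seed
samples <- run(x0, f, q, 1000)

dim(samples)                      # one row per step, one column per parameter
colMeans(samples)
```

The uniform proposal is symmetric, so the same acceptance rule as in the one-dimensional case applies. Proposing each coordinate separately would be the axis-by-axis alternative mentioned above.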

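Compare the sampled distribution with the known distribution.

For example, what is the marginal distribution of parameter 1?

    hist(samples[,1], freq=FALSE, main="", xlab="x",
         ylab="Probability density")

To compute the true marginal, for each value of the first parameter we need to integrate over all possible values of the second parameter. Then, because the target function is not itself normalized, we have to divide the result by the integral over the first dimension (the total area under the distribution).

    # Unnormalized marginal density of the first parameter at x1
    m <- function(x1) {
        g <- Vectorize(function(x2) f(c(x1, x2)))
        integrate(g, -Inf, Inf)$value
    }

    xx <- seq(min(samples[,1]), max(samples[,1]), length=201)
    yy <- sapply(xx, m)
    # Normalizing constant: the area under the unnormalized marginal
    z <- integrate(splinefun(xx, yy), min(xx), max(xx))$value

    hist(samples[,1], freq=FALSE, main="", las=1, xlab="x",
         ylab="Probability density", ylim=c(0, 0.25))
    lines(xx, yy/z, col="red")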
