After conditioning on the 10 impressions of data we have for the facebook-yellow-dress campaign, the posterior distribution of *θ* gives us the plausibility of every click-through rate from 0 to 1.

The effect of our data, or our evidence, is provided by the likelihood function, *p*(*X*|*θ*). What we are ultimately interested in is the plausibility of all proposed values of *θ* given our data, or our posterior distribution *p*(*θ*|*X*). From the earlier section introducing Bayes' Theorem, our posterior distribution is given by the product of our likelihood function and our prior distribution, divided by the evidence:

*p*(*θ*|*X*) = *p*(*X*|*θ*) *p*(*θ*) / *p*(*X*)

Since *p*(*X*) is a constant, as it does not depend on *θ*, we can think of the posterior distribution as:

*p*(*θ*|*X*) ∝ *p*(*X*|*θ*) *p*(*θ*)
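To make this proportionality concrete, here is a quick sketch (not part of the original example, and using scipy rather than PyMC) that evaluates the unnormalized posterior on a grid of *θ* values, using the Beta(11.5, 48.5) prior and the 7-clicks-in-10-impressions data introduced later:

```python
import numpy as np
from scipy.stats import beta, binom

# grid of candidate click-through rates
thetas = np.linspace(0.01, 0.99, 99)

# prior: Beta(11.5, 48.5); likelihood: 7 clicks in 10 impressions
prior = beta.pdf(thetas, 11.5, 48.5)
likelihood = binom.pmf(7, 10, thetas)

# the unnormalized posterior is the pointwise product
unnormalized = likelihood * prior

# dividing by the sum recovers a proper (discretized) distribution
posterior = unnormalized / unnormalized.sum()
```

Normalizing by the sum over the grid plays the role of dividing by *p*(*X*); for a one-dimensional *θ* this grid trick works fine, but it breaks down quickly in higher dimensions, which is where the sampling methods below come in.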

We'll now demonstrate how to estimate *p*(*θ*|*X*) using PyMC.

### Doing Bayesian Inference with PyMC

Usually, the true posterior must be approximated with numerical methods. To see why, let's return to the definition of the posterior distribution:

*p*(*θ*|*X*) = *p*(*X*|*θ*) *p*(*θ*) / *p*(*X*)

The denominator *p*(*X*) is the total probability of observing our data under all possible values of *θ*. A more descriptive representation of this quantity is given by:

*p*(*X*) = ∫ *p*(*X*|*θ*) *p*(*θ*) d*θ*

This integral sums the probability of *X* over all values of *θ*. It usually does not have a closed-form solution, so we need an approximation. One method of approximating the posterior is Markov Chain Monte Carlo (MCMC), which generates samples in a way that mimics the unknown distribution. We begin at a particular value and "propose" another value as a sample according to a stochastic process. We may reject the sample if the proposed value seems unlikely and propose another; if we accept the proposal, we move to the new value and propose another.
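To give a feel for this propose/accept loop, here is a minimal random-walk Metropolis sketch, a simpler relative of the samplers PyMC provides (the step size, starting value, and sample count here are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import beta, binom

def unnormalized_posterior(theta):
    # prior Beta(11.5, 48.5) times likelihood of 7 clicks in 10 impressions
    if theta <= 0 or theta >= 1:
        return 0.0
    return beta.pdf(theta, 11.5, 48.5) * binom.pmf(7, 10, theta)

rng = np.random.default_rng(0)
current = 0.2                     # starting value
samples = []
for _ in range(5000):
    proposal = current + rng.normal(0, 0.05)  # propose a nearby value
    # accept with probability min(1, posterior ratio); otherwise stay put
    accept_prob = unnormalized_posterior(proposal) / unnormalized_posterior(current)
    if rng.random() < accept_prob:
        current = proposal
    samples.append(current)
```

Notice that only a *ratio* of posterior values is needed, so the intractable normalizing constant *p*(*X*) cancels out; this is the key reason MCMC sidesteps the integral above.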

PyMC is a Python package for building arbitrary probability models and obtaining samples from the posterior distributions of the unknown variables given the model. In our example, we'll use MCMC to obtain the samples.

The prototypical PyMC program has two components:

- Define all variables, and how variables depend on each other
- Run an algorithm to simulate a posterior distribution

Let's now obtain samples from the posterior. We select a Beta(11.5, 48.5) as our prior. Let's see how observing 7 clicks from 10 impressions updates our beliefs:

```
import pymc3 as pm
import numpy as np

#create our data:
n_clicks = 7
n_impressions = 10

clicks = np.array([n_clicks])
#clicks represents our successes. We observed 7 clicks.

impressions = np.array([n_impressions])
#impressions represents the number of trials. There were 10 impressions.

with pm.Model() as model:
    #sets a context; all code in the block "belongs" to the model object
    theta_prior = pm.Beta('prior', 11.5, 48.5)
    #our prior distribution, Beta(11.5, 48.5)

    observations = pm.Binomial('obs', n=impressions,
                               p=theta_prior,
                               observed=clicks)
    #sampling distribution of outcomes in the dataset;
    #our prior theta_prior will be updated with the data

    start = pm.find_MAP()
    #find good starting values for the sampling algorithm:
    #maximum a posteriori values, i.e. the most likely values

    step = pm.NUTS(state=start)
    #choose a particular MCMC algorithm;
    #we'll choose NUTS, the No-U-Turn Sampler (Hamiltonian)

    trace = pm.sample(5000, step, start=start, progressbar=True)
    #obtain samples
```

Let's walk through each line of code:

`with pm.Model() as model:`

`pm.Model()` creates a PyMC model object. `as model` assigns it to the variable name "model", and the `with ... :` syntax establishes a context manager. All PyMC objects created within the context manager are added to the model object.

`theta_prior = pm.Beta('prior', 11.5, 48.5)`

`theta_prior` represents a random variable for click-through rates. It will serve as our prior distribution for the parameter *θ*, the click-through rate of our facebook-yellow-dress campaign. This random variable is generated from a beta distribution (`pm.Beta`); we name the random variable "prior" and hardcode the parameter values 11.5 and 48.5. We could have made these parameters random variables as well, but we hardcode them here since they are known.

`observations = pm.Binomial('obs', n=impressions, p=theta_prior, observed=clicks)`

This statement represents the likelihood of the data under the model. Again we define the variable name and set parameter values with `n` and `p`. Note that for this variable, the parameter `p` is assigned to a random variable, indicating that we are trying to model that parameter. Lastly, we provide observed instances of the variable (i.e. our data) with the `observed` keyword. Because we have declared this variable to be observed, the model will not try to change its values.

`start = pm.find_MAP()`

`step = pm.NUTS(state=start)`

`trace = pm.sample(5000, step, start=start, progressbar=True)`

These three lines define how we are going to sample values from the posterior. `pm.find_MAP()` identifies values of *θ* that are likely in the posterior; they will serve as the starting values for our sampler.
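As a sanity check on what `find_MAP` should return here: because the Beta prior is conjugate to the Binomial likelihood, the posterior in our example is Beta(11.5 + 7, 48.5 + 3), and its mode has a closed form (this is a verification sketch, not part of the PyMC workflow):

```python
# posterior for a Beta(11.5, 48.5) prior after 7 clicks in 10 impressions
a, b = 11.5 + 7, 48.5 + 3

# mode of a Beta(a, b) distribution with a, b > 1: (a - 1) / (a + b - 2)
map_theta = (a - 1) / (a + b - 2)
print(round(map_theta, 3))  # roughly 0.257
```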

`pm.NUTS(state=start)` determines which sampler to use. The sampling algorithm defines how we propose new samples given our current state. The proposals can be made completely at random, in which case we will reject many samples, or we can propose samples more intelligently. NUTS (short for the No-U-Turn Sampler) is an intelligent sampling algorithm. Other choices include Metropolis-Hastings, Gibbs, and slice sampling.

Lastly, `pm.sample(5000, step, start=start, progressbar=True)` generates samples for us using the sampling algorithm and starting values defined above.

Let's take the histogram of the samples obtained from PyMC to see what the most probable values of *θ* are, compared with our prior distribution and the evidence (likelihood of our data for each value of *θ*):

```
import matplotlib.pyplot as plt
from scipy.stats import binom

#plot the histogram of click-through rates
plt.rcParams['figure.figsize'] = (16, 7)

#bin edges from 0 to 1, and the candidate values of theta
zero_to_one = [j/100. for j in range(101)]
possible_theta_values = [j/100. for j in range(100)]

#likelihood of the data (7 clicks in 10 impressions) for each value of theta
likelihoods = [binom.pmf(n_clicks, n_impressions, theta)
               for theta in possible_theta_values]

#get histogram of samples from posterior distribution of CTRs
posterior_counts, posterior_bins = np.histogram(trace['prior'],
                                                bins=zero_to_one)

#normalized histogram
posterior_counts = posterior_counts / float(posterior_counts.sum())

#take the mean of the samples as the most plausible value
most_plausible_theta = np.mean(trace['prior'])

#histogram of samples from the prior distribution
prior_samples = np.random.beta(11.5, 48.5, size=10000)
prior_counts, bins = np.histogram(prior_samples, zero_to_one)

#normalize
prior_counts = prior_counts / float(prior_counts.sum())

#plot
f, ax = plt.subplots(1)
ax.plot(possible_theta_values, likelihoods)
ax.plot(bins[:-1], prior_counts, alpha=.2)
ax.plot(bins[:-1], posterior_counts)
ax.axvline(most_plausible_theta, linestyle="--", alpha=.2)
line1, line2, line3, line4 = ax.lines
ax.legend((line1, line2, line3, line4),
          ('Evidence',
           'Prior Probability for Theta',
           'Posterior Probability for Theta',
           'Most Plausible Theta'),
          loc='upper left')
ax.set_xlabel("Theta")
ax.grid()
ax.set_title("Prior Distribution Updated with Some Evidence")
plt.show()
```

Now that we have a full distribution for the probability of various values of *θ*, we can take the mean of the distribution as our most plausible value for *θ*, which is about 0.27.
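Because the Beta prior is conjugate to the Binomial likelihood, we can verify this mean in closed form: the posterior is Beta(11.5 + 7, 48.5 + 3), whose mean is *a* / (*a* + *b*):

```python
# posterior Beta(a, b) after observing 7 clicks and 3 non-clicks
a = 11.5 + 7   # prior alpha plus observed successes
b = 48.5 + 3   # prior beta plus observed failures

posterior_mean = a / (a + b)
print(round(posterior_mean, 3))  # 0.264, close to the sampled estimate
```

Agreement between this closed-form value and the MCMC estimate is a useful check that the sampler is behaving sensibly.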

The data has caused us to believe that the true click-through rate is higher than we originally thought, but far lower than the 0.7 click-through rate observed so far from the facebook-yellow-dress campaign. Why is this the case? Note how wide our likelihood function is; it's telling us that there is a wide range of values of *θ* under which our data is likely. If the range of values under which the data were plausible were narrower, then our posterior would have shifted further. See what happens to the posterior if we observed a 0.7 click-through rate from 10, 100, 1,000, and 10,000 impressions:

```
import pymc3 as pm
import numpy as np

#create our data:
ctr = 0.7
traces = {}
for ad_impressions in [10, 100, 1000, 10000]:
    #maintaining an observed CTR of 0.7
    clicks = np.array([int(ctr * ad_impressions)])
    #re-estimate the posterior for
    impressions = np.array([ad_impressions])
    #increasing numbers of impressions
    with pm.Model() as model:
        theta_prior = pm.Beta('prior', 11.5, 48.5)
        observations = pm.Binomial('obs', n=impressions,
                                   p=theta_prior,
                                   observed=clicks)
        start = pm.find_MAP()
        step = pm.NUTS(state=start)
        trace = pm.sample(5000, step, start=start, progressbar=True)
    traces[ad_impressions] = trace

f, ax = plt.subplots(1)
ax.plot(bins[:-1], prior_counts, alpha=.2)
for ad_impressions in [10, 100, 1000, 10000]:
    trace = traces[ad_impressions]
    posterior_counts, posterior_bins = np.histogram(trace['prior'],
                                                    bins=[j/100. for j in range(101)])
    posterior_counts = posterior_counts / float(len(trace))
    ax.plot(bins[:-1], posterior_counts)
line0, line1, line2, line3, line4 = ax.lines
ax.legend((line0, line1, line2, line3, line4),
          ('Prior Distribution',
           'Posterior after 10 Impressions',
           'Posterior after 100 Impressions',
           'Posterior after 1000 Impressions',
           'Posterior after 10000 Impressions'))
ax.set_xlabel("Theta")
ax.axvline(ctr, linestyle="--", alpha=.5)
ax.grid()
ax.set_ylabel("Probability of Theta")
ax.set_title("Posterior Shifts as Weight of Evidence Increases")
plt.show()
```

As we obtain more and more data, we become more certain that the 0.7 success rate is the true success rate. As we condition on more data, the likelihood function plays a larger role in our ultimate assessment, because the weight of the evidence grows stronger. This is particularly useful in practice if we want a continuous, fair assessment of how our campaigns are performing without having to worry about overfitting to a small sample.
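The same conjugate shortcut used earlier gives a closed-form check on this pull of the evidence: updating the Beta(11.5, 48.5) prior with a 0.7 click-through rate at each sample size, the posterior mean marches toward 0.7:

```python
# posterior mean of Beta(11.5 + clicks, 48.5 + non-clicks)
# for a fixed 0.7 CTR at increasing sample sizes
means = {}
for n in [10, 100, 1000, 10000]:
    clicks = 0.7 * n
    a = 11.5 + clicks          # prior alpha plus observed successes
    b = 48.5 + (n - clicks)    # prior beta plus observed failures
    means[n] = a / (a + b)
    print(n, round(means[n], 3))  # 0.264, 0.509, 0.671, 0.697
```

At 10 impressions the prior dominates; by 10,000 impressions the posterior mean is within a hair of the observed 0.7, mirroring the shift in the plotted posteriors.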

There are many concepts beyond the scope of this tutorial that are important for doing Bayesian analysis successfully, such as how to choose a prior, which sampling algorithm to use, and how to determine whether the sampler is giving us good samples and has converged. Hopefully this tutorial inspires you to continue exploring the fascinating world of Bayesian inference.