# Yet Another Bayesian Introduction

**José Bernardo wrote a thorough** article introducing “Bayesian Statistics,” coming in as one of the first Google results for that phrase. It is a verbose article, seemingly targeted toward an audience already familiar with the subject but needing a common, clarifying resource.

I don’t intend to provide any more information here than Bernardo did (impossible, given my conventional ~600-word limit), but instead to offer an introduction that is both educational and *human*.

Bayesian Statistics is a growing subject in Mathematics, promising a set of tools and methods for tackling complex problems. The Bayesian paradigm is itself complex, though, *because* it is growing:

- PyMC in 2011
- BayesiaLab in 2001
- Bayesian databases in 1996
- Bayesian Networks in 1985
- Monte Carlo algorithms in 1946
- and so on.

But the Bayesian paradigm really began circa 1750, when Thomas Bayes, a minister and mathematician, solved a hot problem of the time, today known as “inverse probability.” He discovered a handy rule of probability in his solution, aptly named “Bayes’s Theorem.”

However, it’s handy only if you own a good computer.

Until recently, this theorem was overlooked, pushed away, or buried time and time again. Lack of computing power kept its methods out of the attention of researchers and scientists until the advent of computers that helped break German codes in World War II—using Bayesian Statistics, of course.

To say that Bayesian Statistics is based on Bayes’s Theorem, “posterior odds are proportional to prior odds times likelihood,” would be an incomplete or inaccurate description. This is but one rule that is the glue joining a full suite of specialized equations, a lesson or two from Information Theory, and some key assumptions.

**Probability Distributions.** Included in all theories of probability are equations that specify a “distribution” of odds. For example, the distribution for a typical die would be a straight, flat line over one to six, since those whole numbers are equally likely. Rolling the number seven is impossible, and thus isn’t under the line.

Another example would be the weight of male athletes. The shape of this distribution is like a hill. Men would tend to have weights near the middle of the hill, where the probability is higher than at points lower on its slopes. Furthermore, the hill might lean to one side, showing a bias toward thinner or heavier athletes.

The higher the shape over a point, the more likely that point is. Additionally, the area under the shape must be exactly one, or 100%, since there is a 100% chance that *something* will happen. However, there is a simple fix if this isn’t the case—just multiply or divide until the area becomes one. Hence the phrase “proportional to” in the theorem.
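To make the die example concrete, here is a tiny sketch (mine, not Bernardo’s) of a distribution as a plain table of numbers, including the multiply-or-divide fix:

```python
# Unnormalized "shape": equal weight on faces 1-6, zero weight on 7.
weights = {face: 1.0 for face in range(1, 7)}
weights[7] = 0.0  # rolling a seven is impossible, so it isn't "under the line"

# Divide every weight by the total so the probabilities sum to exactly one.
# This is the fix behind the phrase "proportional to" in the theorem.
total = sum(weights.values())
dist = {face: w / total for face, w in weights.items()}

# Each face 1-6 now carries probability 1/6, the seven carries zero,
# and the whole distribution sums to one.
```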

Bayes’s Theorem connects the “prior” distribution, before data has been collected, and the “posterior” distribution afterwards. Using this, complex chains of distributions, some based on assumptions and some based on evidence, can be put together, and the math goes to work on its own.
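Here is a minimal sketch of that prior-times-likelihood update, using a made-up coin-fairness question on a grid of hypotheses (the setup is my illustration, not from Bernardo’s article):

```python
# Hypotheses: candidate probabilities that a coin lands heads.
hypotheses = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
prior = [1 / len(hypotheses)] * len(hypotheses)   # flat prior: no initial preference

# Evidence: a single flip comes up heads. The likelihood of heads under each
# hypothesis is simply that hypothesis's value.
unnormalized = [p * h for p, h in zip(prior, hypotheses)]

# "Posterior odds are proportional to prior odds times likelihood" --
# divide by the total so the posterior sums to one again.
posterior = [u / sum(unnormalized) for u in unnormalized]
# Hypotheses friendlier to heads now carry more weight; h = 0.0 is ruled out.
```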

**Maximum Entropy.** Which is more random, a typical die or male athlete weight? A hundred-sided die or male athlete weight? The question is inaccurate, since a process either *is* or *isn’t* random—there is no more or less. But Information Theory provides a measure, called Entropy, that captures what is conversationally meant by “randomness.”

Entropy is used in Bayesian Statistics to settle the most common counterargument against it: that prior distributions are *subjective*. By choosing, from all available priors, the one with the most Entropy, or randomness, the shape of the posterior distribution will be affected most by the evidence and not by the initial guess.
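A short sketch of Entropy as an actual number, comparing the flat die distribution with a hill-shaped one (the `hill` numbers are invented for illustration):

```python
import math

def entropy(dist):
    """Shannon Entropy in bits: -sum of p * log2(p), skipping impossible points."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

fair_die = [1 / 6] * 6                        # flat: every outcome equally likely
hill = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]   # peaked: outcomes near the middle favored

# The flat distribution has the higher Entropy -- it is the "most random"
# choice, which is exactly why it makes a good non-committal prior.
print(entropy(fair_die) > entropy(hill))  # True
```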

**Start Guessing.** At the heart of Bayesian Statistics is the idea of randomness. It does not deal in certain answers. Instead, a prior belief is stated, evidence is collected, and the belief is updated. Wash, rinse, and repeat until the straight line curves into a hill and tightens over the “correct” value, if such a value even exists!
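The wash-rinse-repeat loop can be sketched like this, with an invented “correct” value of 0.7 that the flat line should tighten around (again my own illustration, not the article’s):

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
hypotheses = [i / 100 for i in range(101)]         # candidate probabilities of heads
belief = [1 / len(hypotheses)] * len(hypotheses)   # the flat "straight line" prior

true_p = 0.7  # the hidden "correct" value generating the evidence
for _ in range(500):
    heads = random.random() < true_p
    # Likelihood of this flip under each hypothesis.
    likelihood = [h if heads else 1 - h for h in hypotheses]
    # Prior times likelihood, then normalize: one turn of the crank.
    belief = [b * l for b, l in zip(belief, likelihood)]
    total = sum(belief)
    belief = [b / total for b in belief]

# After enough evidence, the line has curved into a hill peaked near true_p.
peak = hypotheses[belief.index(max(belief))]
```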

This approach is conceptually simple, but may be hard to accept as rigorous in, say, a conventional, scientific journal. Scholars have been uncomfortable with the guess-and-check nature of Bayes’s methods for centuries.

But, as Bernardo argues, Bayesian Statistics is far more mathematical than its counterpart, Frequentism. In Bayesian Statistics, once assumptions, data, and other knowledge or constraints have been stated, the calculations are straightforward and difficult to refute. In both fields one must begin anyway, in some fashion, by stating assumptions—if they didn’t, all the problems they work on would be the same!

Yes, substantial care must be given to ensure the prior selected is really the best for the situation at hand. And yes, repeatedly combining mixed-and-matched distributions can produce equations of immense complexity.

But both of these are becoming simpler as better tools are developed for modeling Complex Systems.

“When the facts change, I change my opinion. What do you do, sir?” ∎

Follow me on Twitter. Let’s chat sometime.