Bayesian and Frequentist -- two worldviews of machine learning


Before I graduated with my Ph.D., some friends and I formed a Probabilistic Graphical Models (PGM) study group, but we stopped about a third of the way through the course since everybody had their own business to mind. Only recently did I finally finish the course online, taught by Professor Daphne Koller, co-founder of Coursera, and I finally start to see how the worldviews of the so-called "Bayesians" and "Frequentists" differ.

Most Machine Learning (ML) courses, for example "Machine Learning Foundations" by Professor Hsuan-Tien Lin at National Taiwan University, start from the concept of "classifiers", or supervised discriminative models to be more specific. The core concept of classifiers can be summarized as the "separating hyper-plane": imagine we're looking for a way to separate a bunch of samples in space so we get the "best" outcome. Logistic regression or the perceptron simply cuts this space in half; a support vector machine (SVM) cuts it in half with a middle-of-the-road policy, and with the kernel trick you get the same cut-in-half in a higher-dimensional space; multi-class classification cuts the space into multiple regions.
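To make this cut-in-half picture concrete, here is a minimal sketch (a toy example of my own, with made-up weights rather than anything from the courses above) of what a trained linear classifier does at prediction time: it simply checks which side of the hyper-plane w.x + b = 0 a sample falls on.

    import numpy as np

    # Hypothetical weights and bias defining one separating hyper-plane w.x + b = 0.
    w = np.array([2.0, -1.0])
    b = 0.5

    def classify(x):
        # Logistic regression / perceptron at prediction time:
        # which side of the hyper-plane does the sample fall on?
        return 1 if np.dot(w, x) + b > 0 else 0

    print(classify(np.array([1.0, 0.0])))   # one side of the cut -> class 1
    print(classify(np.array([-1.0, 1.0])))  # the other side      -> class 0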

We can extend this way of thinking to unsupervised learning, and even to tree-based models that originated from traditional AI. Unsupervised learning, besides being seen as clustering similar samples together, can also be seen as separating dissimilar samples apart; as for tree-based models, going from root to leaf is equivalent to choosing one dimension at a time and cutting the space in a nested order, as the sketch below illustrates.
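Here is an equally toy illustration (again my own, with arbitrary thresholds): a root-to-leaf path in a decision tree is just a nested sequence of one-dimension-at-a-time cuts.

    def tree_classify(x):
        # Each level picks one dimension and cuts the space at a threshold, so a
        # root-to-leaf path is a nested sequence of axis-aligned half-spaces.
        if x[0] <= 0.3:          # first cut: along dimension 0
            return "A"
        elif x[1] <= 0.7:        # second cut: along dimension 1, inside the x[0] > 0.3 region
            return "B"
        else:
            return "C"

    print(tree_classify([0.1, 0.9]))   # falls in the x[0] <= 0.3 region -> "A"
    print(tree_classify([0.5, 0.9]))   # x[0] > 0.3 and x[1] > 0.7       -> "C"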

The two figures below are from "Comparing machine learning classifiers based on their hyperplanes or decision boundaries":

(a) Separating hyper-planes of various binary classifiers

(b) Separating hyper-planes of various multi-class classifiers

However, it's not as straightforward if one wants to apply this concept to generative models (e.g. GMM), sequential models (e.g. HMM, CRF), or reinforcement learning. There's a solution of course, which is to generalize to the concept of an "objective function": I define some metric I like that evaluates how good a model is for my problem and data set, and the goal of learning is to tune the model toward this goodness I defined. This is roughly how Frequentists see ML.
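As a rough sketch of this Frequentist recipe (a toy example of my own, with a hypothetical one-parameter model), learning is nothing more than defining the metric and tuning the parameter toward it:

    import numpy as np

    # Toy data: the "problem and data set" we want the model to fit.
    x = np.linspace(0, 1, 50)
    y = 3.0 * x + np.random.default_rng(1).normal(scale=0.1, size=50)

    def objective(w):
        # The metric I happen to like: mean squared error of a one-parameter model.
        return np.mean((y - w * x) ** 2)

    # "Learning" is nothing but tuning w toward this self-defined goodness,
    # here with plain gradient descent on the objective.
    w = 0.0
    for _ in range(500):
        grad = np.mean(2 * (w * x - y) * x)
        w -= 0.1 * grad
    print(w, objective(w))   # w ends up close to 3.0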

On the other hand, to get into the basics of Bayesian thinking or PGM, one can think of all the inputs, outputs and model parameters as discrete random variables (RV). Whether the relation among variables is a joint distribution (undirected edge) or a conditional distribution (directed edge), it's just a table lookup. There's no input or output under the worldview of PGM; there's only "observed" or "hidden". And what is called "inference" is just eliminating variables with the help of the given conditions, combining tables and summing over all possibilities, all the way down to our desired hidden RV.
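Here is a minimal sketch of that table-lookup view, using a hypothetical two-variable network of my own: multiply the tables, slice at the observed value, and renormalize to infer the hidden RV.

    import numpy as np

    # Everything is just a table: P(A) and P(B | A), with A and B binary.
    p_a = np.array([0.7, 0.3])                 # P(A=0), P(A=1)
    p_b_given_a = np.array([[0.9, 0.1],        # P(B | A=0)
                            [0.2, 0.8]])       # P(B | A=1)

    # "Observe" B = 1, then infer the hidden RV A: combine the tables,
    # keep only the column matching the evidence, and renormalize.
    joint = p_a[:, None] * p_b_given_a         # P(A, B) by table multiplication
    unnormalized = joint[:, 1]                 # slice at the evidence B = 1
    p_a_given_b1 = unnormalized / unnormalized.sum()
    print(p_a_given_b1)                        # P(A | B=1)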

The figure below is from "Patterns of Inference":


One can generalize from discrete to continuous RVs by replacing tables with functions that describe the relations among variables. From PGM's perspective, we're mapping the probability distribution of one or more random variables x to yet another probability distribution f(x). Note that people have also long realized that different classifiers can be visualized as artificial neural networks with different structures. The figure below is from "DeepVsShallowComparisonICML2007":


Such a modeling technique is more often referred to as Bayesian inference rather than PGM, for example in the job descriptions you see in recruiting ads. Since I lack the corresponding work experience, I can only guess from the homework of the PGM class that such techniques are more often used in areas where you have less data but lots of heuristics, or where the decision maker simply trusts explainable rules more, like tracking genetic diseases or calculating premium rates at insurance companies.

From a higher-level perspective, I think "decision making" best summarizes the context in which one should use PGM. That's why I find it easier to picture reinforcement learning (which is used in robot decision making) through the lens of PGM, let alone HMM and CRF.

The power of the Bayesian view is that it offers a simple, unified and visual explanation applicable to any machine learning model (c.f. "Probabilistic Models for Unsupervised Learning", PDF file). Yet from the Frequentists' perspective, Bayesians tend to over-explain the motivation of modeling, which should simply be finding the optimal model to fit the problem and data we have. For example:
  1. Bayesians would argue that, by choosing the model family and objective function, one has made some assumption about how the real world behaves. Take linear regression for example: from the Frequentists' viewpoint, it's just looking for a set of weights that minimize mean squared error; but from the Bayesians' viewpoint, you are assuming the residual error of your linear model is Gaussian distributed, so based on maximum likelihood, your optimal solution happens to be the minimum mean squared error one (see the sketch after this list).
  2. Bayesians see model parameters as random variables. This is quite counter-intuitive for ML beginners: if parameters are nothing more than buttons on an imaginary classifier machine, why do these buttons have randomness? Such a perspective is best reflected in how Bayesians and Frequentists explain regularization differently: Frequentists simply see regularization as incorporating your preference on parameters (e.g. preferring values closer to 0) into the objective function; but to Bayesians, regularization is your assumption about how likely each possible parameter value is, following a certain prior distribution. For example, an L1-norm penalty means you assume the parameters follow a Laplacian distribution, and an L2-norm penalty a Gaussian distribution.
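To see both points in one place, here is a small numpy sketch (a toy example of my own, not from the course or the slides below): the maximum-likelihood fit under a Gaussian residual assumption lands on the same weights as the minimum-MSE fit, and adding an L2 penalty is exactly the computation a Bayesian would do for the MAP estimate under a zero-mean Gaussian prior.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                       # toy design matrix
    w_true = np.array([1.5, -2.0, 0.5])
    y = X @ w_true + rng.normal(scale=0.3, size=200)    # Gaussian residual noise

    # Frequentist reading: just find the weights minimizing mean squared error
    # (ordinary least squares, via the normal equations).
    w_mse = np.linalg.solve(X.T @ X, X.T @ y)

    # Bayesian reading: assume y = Xw + eps with eps ~ N(0, sigma^2) and maximize
    # the likelihood.  Gradient descent on the negative log-likelihood (whose
    # gradient is proportional to that of the MSE) lands on the same weights.
    w_mle = np.zeros(3)
    for _ in range(2000):
        grad = X.T @ (X @ w_mle - y) / len(y)
        w_mle -= 0.1 * grad
    print(np.allclose(w_mse, w_mle))                    # True: two stories, one optimum

    # Regularization: adding an L2 penalty of strength lam is, to a Bayesian,
    # placing a zero-mean Gaussian prior on w and taking the MAP estimate.
    lam = 1.0
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(w_map)                                        # shrunk toward 0 by the prior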

Check the slides below for more discussion:


So which worldview of ML is correct, or better? If you want to fully understand the essence of ML, I would say you need to know both. In different contexts you will find one mindset more useful than the other. Many ML references mix or switch between Bayesian and Frequentist terminology; for example, you might see "maximum likelihood/maximum a posteriori objective functions".

Also, in the recently very popular "deep learning" area, many pivotal publications actually require Bayesian background knowledge to fully understand. If you ever find it difficult to understand the papers published by Professor Geoffrey Hinton around 2006 on deep learning theory, feel free to check my post "From Restricted Boltzmann Machine to Deep Neural Network".
