8 Terms You Should Know about Bayesian Neural Network
The meaning of Prior, Posterior, Bayes’ Theorem, Negative Log-Likelihood, KL Divergence, Surrogate, Variational Inference & Evidence Lower Bound
Goal
In the last article, we introduced the Bayesian Neural Network (BNN). If you are new to BNNs, make sure you have read that article first so that you are familiar with the difference between a Standard Neural Network (SNN) and a BNN.
Today, we will jump to the core and learn the mathematics behind it. In this article, you will learn BNN-related terms that cover:
How we leverage Bayesian inference to update the probability distributions of the model weights and outputs.
Which loss function we use to optimize a Bayesian Neural Network.
Which techniques and methods are used in real-life scenarios to tackle the unknown distribution problem.
Bayesian Inference
From the previous article, we know that a Bayesian Neural Network treats the model weights and outputs as random variables. Instead of finding a single set of optimal estimates, we fit probability distributions for them.
But the question is, “How can we know what their distributions look like?” To answer this, you have to learn what the prior, the posterior, and Bayes’ theorem are. We will use an example for illustration. Suppose there are two classes, a science class and an art class, and each classmate either wears glasses or does not. If we pick one classmate at random from the two classes, what is the probability that this classmate wears glasses?
1. Prior Probability (Prior)
The prior expresses one’s beliefs before considering any evidence. Without any further information, you may guess that the probability of the classmate wearing glasses is 0.5, because 30 + 20 = 50 of the 30 + 20 + 15 + 35 = 100 classmates wear glasses, and 50/100 = 0.5. We call 0.5 the prior probability.
2. Posterior Probability (Posterior)
The posterior expresses one’s beliefs after considering some evidence. Let’s continue with the above example. What if I now tell you that the classmate is actually from the science class? What do you think the probability of that classmate wearing glasses is now? With more information, you may change your belief and update the probability. We call that updated probability the posterior probability.
3. Bayes’ Theorem
Bayes’ theorem is the mathematical formula used to update the prior probability into the posterior probability based on the evidence.
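In its standard form, with A as the event of interest and X as the evidence, the theorem reads:

```latex
P(A \mid X) = \frac{P(X \mid A)\, P(A)}{P(X)}
```

P(A) is the prior, P(A | X) is the posterior, P(X | A) is the likelihood of the evidence given the event, and P(X) is the overall probability of the evidence.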
In the theorem, A is the event we are interested in, “the classmate wears glasses”, while X is the evidence, “the classmate is in the science class”.
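To make the update concrete, here is a minimal sketch that applies Bayes’ theorem to the class example. The totals (50 of 100 classmates wear glasses) come from the prior calculation above, but how the counts split between the two classes is an assumption made only for this illustration: 30 with glasses and 15 without in the science class, 20 with glasses and 35 without in the art class.

```python
# Head counts behind the 50/100 prior. How the counts split across the
# two classes is an ASSUMPTION made for this sketch.
counts = {
    ("science", "glasses"): 30,
    ("art", "glasses"): 20,
    ("science", "no_glasses"): 15,
    ("art", "no_glasses"): 35,
}
total = sum(counts.values())
n_glasses = counts[("science", "glasses")] + counts[("art", "glasses")]
n_science = counts[("science", "glasses")] + counts[("science", "no_glasses")]

# Prior P(A): probability of glasses before seeing any evidence.
p_glasses = n_glasses / total

# Evidence P(X): probability that the classmate is in the science class.
p_science = n_science / total

# Likelihood P(X | A): probability of being in science, given glasses.
p_science_given_glasses = counts[("science", "glasses")] / n_glasses

# Bayes' theorem: P(A | X) = P(X | A) * P(A) / P(X).
p_glasses_given_science = p_science_given_glasses * p_glasses / p_science

print(f"prior     P(glasses)           = {p_glasses:.3f}")
print(f"posterior P(glasses | science) = {p_glasses_given_science:.3f}")
```

Under these assumed counts, knowing that the classmate is in the science class raises the probability of glasses from 0.5 to 30/45, or about 0.667.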
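So now you understand how the posterior is updated based on evidence. For a Bayesian Neural Network, the posterior distribution of the weights w given the observed data D, written P(w | D), is computed in the same way: the prior over the weights, P(w), is updated according to how likely the data are under those weights.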
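Applying Bayes’ theorem with the weights w as the event of interest and the observed data D as the evidence gives the standard form:

```latex
P(w \mid D) = \frac{P(D \mid w)\, P(w)}{P(D)}
```

P(w) is the prior over the weights, P(D | w) is the likelihood of the data given the weights, and P(D) is the marginal probability of the data, often called the evidence.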
Loss Function
You now understand the formula for updating the weights and outputs, but we are missing one important thing: how to evaluate the estimated probability distribution. In the following, we will discuss two key measures that are often used in BNNs.
4. Negative Log-Likelihood
For regression problems, we typically use Mean Squared Error (MSE) as the loss function in an SNN, since we only have a point estimate. In a BNN we do something different. Because the model predicts a distribution, we use the negative log-likelihood as the loss function.
Okay, let’s explain the terms one by one.
Likelihood is the joint probability of the observed data, viewed as a function of the predicted distribution. In other words, it tells us how probable the observed data are under our predicted distribution. The larger the likelihood, the better our predicted distribution fits the data.
We take the logarithm because it makes the calculation easier. For independent observations the likelihood is a product of per-observation probabilities, and by the log property log(ab) = log a + log b that product becomes a sum, which is also more stable numerically.
Last but not least, we add the negative sign to form the negative log-likelihood, because in machine learning the convention is to minimize a cost or loss function rather than to maximize an objective. Minimizing the negative log-likelihood is the same as maximizing the likelihood.
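The three steps above (likelihood, log, negative sign) can be sketched in a few lines. This is a minimal illustration, assuming the model predicts an independent Normal distribution (a mean and a standard deviation) for every observation; the function name `gaussian_nll` and the sample numbers are made up for the example.

```python
import math


def gaussian_nll(y_true, means, stds):
    """Negative log-likelihood of observations under per-point Normal(mean, std)."""
    total = 0.0
    for y, mu, sigma in zip(y_true, means, stds):
        # Log density of Normal(mu, sigma) evaluated at y.
        log_density = (
            -0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - 0.5 * ((y - mu) / sigma) ** 2
        )
        # Summing logs replaces multiplying probabilities; the minus sign
        # turns "maximize the likelihood" into "minimize the loss".
        total -= log_density
    return total


y_true = [1.0, 2.0, 3.0]
good = gaussian_nll(y_true, means=[1.1, 1.9, 3.2], stds=[0.5, 0.5, 0.5])
bad = gaussian_nll(y_true, means=[3.0, 0.0, 1.0], stds=[0.5, 0.5, 0.5])
print(f"NLL of the close prediction:   {good:.3f}")
print(f"NLL of the distant prediction: {bad:.3f}")
```

A predicted distribution centered near the observations gets a lower loss than one centered far away. Unlike MSE, the loss also depends on the predicted standard deviation, so the model is penalized for being overconfident or underconfident.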
5. Kullback-Leibler Divergence (KL Divergence)
KL divergence quantifies how much one distribution differs from another. Let’s say p is the true distribution and q is the predicted distribution. The KL divergence then equals the cross-entropy between the two distributions minus the entropy of the true distribution p. In other words, it tells us how much further the predicted distribution q can be improved. Note that it is not symmetric: swapping p and q generally gives a different value.
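For discrete distributions p and q over the same outcomes x, this relationship can be written in the standard way as follows (for continuous distributions, the sums become integrals):

```latex
D_{\mathrm{KL}}(p \,\|\, q)
  = \sum_x p(x) \log \frac{p(x)}{q(x)}
  = \underbrace{-\sum_x p(x) \log q(x)}_{\text{cross-entropy } H(p,\,q)}
  \;-\; \underbrace{\Big(-\sum_x p(x) \log p(x)\Big)}_{\text{entropy } H(p)}
```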
For those who have no idea what entropy and cross-entropy are: simply speaking, entropy is the lower bound on the “cost” of representing the true distribution p, while cross-entropy is the “cost” of representing the true distribution p using the predicted distribution q. It follows that the KL divergence represents how much further the “cost” for the predicted distribution q can be reduced.
Back to today’s focus: p refers to the true distribution of the model weights and outputs, while q is our predicted distribution. We use the KL divergence to measure the difference between the two distributions so as to update our predicted distribution.
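A small numerical check may help here. The sketch below uses two made-up discrete distributions (the values of `p` and `q` are purely illustrative) and confirms that the KL divergence equals the cross-entropy minus the entropy, that it is zero when the two distributions match, and that it changes when p and q are swapped.

```python
import math


def entropy(p):
    """Entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.3, 0.2]  # "true" distribution (illustrative values)
q = [0.4, 0.4, 0.2]  # "predicted" distribution (illustrative values)

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
gap = cross_entropy(p, q) - entropy(p)

print(f"KL(p || q)       = {kl_pq:.5f}")
print(f"H(p, q) - H(p)   = {gap:.5f}")
print(f"KL(q || p)       = {kl_qp:.5f}")
print(f"KL(p || p)       = {kl_divergence(p, p):.5f}")
```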
Problem & Solution
Unfortunately, the marginal probability P(D) is in general intractable: it is an integral over all possible weights, and for a neural network this integral has no closed form. As a result, for a complex system the posterior P(w | D) is also intractable.
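Written out in its standard form, the marginal probability integrates the likelihood against the prior over every possible setting of the weights:

```latex
P(D) = \int P(D \mid w)\, P(w)\, dw
```

For a network with many weights, this is an integral over a very high-dimensional space, which is why it cannot be evaluated directly.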
To tackle the problem, statisticians have developed a method called Variational Inference, which approximates the true posterior distribution with a surrogate model by maximizing the evidence lower bound.
Don’t worry about the new terms (Variational Inference, surrogate, and evidence lower bound). I will explain them one by one.
6. Surrogate
A surrogate model is a simple model used to replace the complex model we are interested in. It is easy to work with and, ideally, a close approximation of the complex model. Generally speaking, a surrogate model is chosen from a standard family of statistical distributions (for example, Gaussians), so that it has an analytical form.
7. Variational Inference (VI)
Variational Inference is the idea of using a variational distribution q* to stand in for the true posterior distribution p(w|D). But there are so many candidate surrogate models; how can we ensure that q* is good enough to represent p(w|D)? The answer is simple: we can use the KL divergence we just learned.
Among the surrogate models in the family Q, we look for the optimal one, q*, that minimizes the KL divergence to the true posterior p(w|D).
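In the usual variational-inference formulation, this search is written as:

```latex
q^* = \arg\min_{q \in Q} \; D_{\mathrm{KL}}\big(q(w) \,\|\, p(w \mid D)\big)
```

Note the order of the arguments: the surrogate q comes first, so the expectation inside the KL divergence is taken under q, which we can sample from, rather than under the unknown posterior.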
8. Evidence Lower Bound (ELBO)
However, the same problem still exists, since we do not have the posterior probability distribution. What we can do is rewrite the KL divergence as a difference of expectations under q, substitute Bayes’ theorem for the posterior p(w|D), and collect all the terms that can be computed without the posterior into a single quantity, L(w).
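The steps can be written out as follows. This is the standard derivation and not necessarily the exact form used in the original figures; it keeps the text’s symbol L(w), although strictly speaking the quantity depends on the choice of q, because w is averaged out by the expectation.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(q(w) \,\|\, p(w \mid D)\big)
  &= \mathbb{E}_{q}\big[\log q(w)\big] - \mathbb{E}_{q}\big[\log p(w \mid D)\big] \\
  &= \mathbb{E}_{q}\big[\log q(w)\big] - \mathbb{E}_{q}\big[\log p(D \mid w)\big]
     - \mathbb{E}_{q}\big[\log p(w)\big] + \log p(D) \\
  &= \log p(D) - L(w),
\end{aligned}
\qquad
L(w) = \mathbb{E}_{q}\big[\log p(D \mid w)\big] - D_{\mathrm{KL}}\big(q(w) \,\|\, p(w)\big)
```

The second line uses log p(w|D) = log p(D|w) + log p(w) - log p(D), which is Bayes’ theorem in log form. Rearranging gives log p(D) = L(w) + D_KL(q(w) || p(w|D)). Every term in L(w) involves only the likelihood, the prior, and q, so it can be evaluated without knowing the posterior.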
The log evidence log P(D) equals L(w) plus the KL divergence between q and the true posterior. Since the KL divergence is non-negative, L(w) can never exceed the log evidence. In other words, L(w) is a lower bound on the log evidence, which is why we call it the “Evidence Lower Bound”. This argument does not need the evidence to lie between 0 and 1; for continuous data the evidence is a density and can exceed 1. Because log P(D) does not depend on q, maximizing L(w) is equivalent to minimizing the KL divergence to the true posterior. So we can now find the optimal q* by maximizing L(w) over the surrogate family Q.
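The bound can be checked numerically on a toy problem where everything has a closed form. The sketch below is an illustration under stated assumptions, not a BNN: a single weight w with prior Normal(0, 1), observations y_i drawn from Normal(w, 1), and a Gaussian surrogate q = Normal(m, s). The data values and function names are made up for the example.

```python
import math


def log_evidence(y):
    """Exact log P(D) for prior w ~ N(0, 1) and likelihood y_i ~ N(w, 1)."""
    n = len(y)
    s1 = sum(y)
    s2 = sum(v * v for v in y)
    return (
        -0.5 * n * math.log(2.0 * math.pi)
        - 0.5 * math.log(1.0 + n)
        - 0.5 * (s2 - s1 * s1 / (1.0 + n))
    )


def elbo(y, m, s):
    """ELBO for the surrogate q(w) = N(m, s^2): E_q[log p(D|w)] - KL(q || prior)."""
    n = len(y)
    # E_q[(y_i - w)^2] = (y_i - m)^2 + s^2 when w ~ N(m, s^2).
    expected_log_lik = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum(
        (v - m) ** 2 + s * s for v in y
    )
    # Closed-form KL divergence between N(m, s^2) and the N(0, 1) prior.
    kl_to_prior = 0.5 * (s * s + m * m - 1.0 - math.log(s * s))
    return expected_log_lik - kl_to_prior


y = [0.8, 1.3, 0.4, 1.1]  # illustrative observations
n = len(y)

# For this conjugate model the exact posterior is N(sum(y)/(n+1), 1/(n+1)).
post_mean = sum(y) / (n + 1)
post_std = math.sqrt(1.0 / (n + 1))

print(f"log evidence            = {log_evidence(y):.4f}")
print(f"ELBO at q = prior       = {elbo(y, 0.0, 1.0):.4f}")
print(f"ELBO at q = posterior   = {elbo(y, post_mean, post_std):.4f}")
```

Any surrogate other than the exact posterior gives an ELBO strictly below the log evidence, and the gap is the KL divergence to the posterior. The gap closes only when q matches the posterior, which is what maximizing the ELBO aims for.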
Conclusion
Congratulations if you have finished reading all of the above! Hopefully you now have a basic understanding of the mathematical concepts behind the Bayesian Neural Network. In the upcoming articles, I will focus more on the coding side and show how to use TensorFlow Probability to build a BNN model. Stay tuned! =)