How should we compare neural network representations?

Cross-posted from Bounded Regret.
To understand neural networks, researchers often use similarity metrics to measure how similar or different two neural networks are to each other. For instance, they are used to compare vision transformers to convnets [1], to understand transfer learning [2], and to explain the success of standard training practices for deep models [3]. Below is an example visualization using similarity metrics; specifically we use the popular CKA similarity metric (introduced in [4]) to compare two transformer models across different layers:
Figure 1. CKA (Centered Kernel Alignment) similarity between two networks trained identically except for random initialization. Lower values (darker colors) are more similar. CKA suggests that the two networks have similar representations.
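For readers who want to see what is actually being computed, below is a minimal NumPy sketch of the linear variant of CKA described by Kornblith et al. [4]. The activation matrices, their shapes, and the name `linear_cka` are illustrative stand-ins, not the exact implementation behind the figure.

```python
# Minimal sketch of linear CKA; X and Y hold activations of the same n examples
# at one layer of each network (rows = examples, columns = neurons/features).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

# Toy usage with random stand-in activations.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(1000, 768))  # stand-in for one model's layer activations
acts_b = rng.normal(size=(1000, 768))  # stand-in for the other model's activations
print(linear_cka(acts_a, acts_a))      # 1.0: a representation compared with itself
print(linear_cka(acts_a, acts_b))      # lower for unrelated representations
```

Heatmaps like the one above come from evaluating a score of this kind for every pair of layers across the two networks.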
Unfortunately, there isn’t much agreement on which particular similarity metric to use. Here’s the exact same figure, but produced using the Canonical Correlation Analysis (CCA) metric instead of CKA:
Figure 2. CCA (Canonical Correlation Analysis) similarity between the same two networks. CCA distances suggest that the two networks learn somewhat different representations, especially at later layers.
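For comparison, here is a rough sketch of a CCA-based similarity score (the unweighted mean canonical correlation). The CCA/PWCCA variants used in practice add details such as dimensionality reduction or projection weighting, so treat this only as an illustration of the core computation; `mean_cca_similarity` is a name chosen here.

```python
# Rough sketch of a CCA similarity score between activation matrices
# X (n x d1) and Y (n x d2) collected on the same n examples.
import numpy as np

def mean_cca_similarity(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    Qx, _ = np.linalg.qr(X)  # orthonormal basis for the column space of X
    Qy, _ = np.linalg.qr(Y)  # orthonormal basis for the column space of Y
    # Canonical correlations are the singular values of Qx^T Qy, all in [0, 1].
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())
```

PWCCA replaces the plain mean with a weighted mean that emphasizes the canonical directions that account for more of the original representation.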
In the literature, researchers often propose new metrics and justify them based on intuitive desiderata that were missing from previous metrics. For example, Morcos et al. motivate CCA by arguing that similarity metrics should be invariant to invertible linear transformations [5]. Kornblith et al. disagree about which invariances a similarity metric should have, and instead argue that metrics should pass an intuitive test: given two trained networks with the same architecture but different initializations, layers at the same depth should be most similar to each other. Their proposed metric, CKA, performs best on this test [4].
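The invariance disagreement is easy to check numerically. The snippet below reuses the `linear_cka` and `mean_cca_similarity` sketches from above on synthetic activations, so it only illustrates the definitions rather than any real network.

```python
# Check: the CCA score is unchanged when one representation is multiplied by an
# invertible matrix A (same column space), while linear CKA generally is not.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
Y = rng.normal(size=(500, 64))
A = rng.normal(size=(64, 64))  # invertible with probability 1

print(mean_cca_similarity(X, Y), mean_cca_similarity(X, Y @ A))  # equal up to float error
print(linear_cka(X, Y), linear_cka(X, Y @ A))                    # generally different
```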
Our paper, Grounding Representation Similarity with Statistical Testing, argues against this practice. To start, we show that by choosing different intuitive tests, we can make any method look good. CKA does well on a “specificity test” similar to the one proposed by Kornblith et al., but it does poorly on a “sensitivity test” that CCA shines on.
To move beyond intuitive tests, our paper provides a carefully designed quantitative benchmark for evaluating similarity metrics. The basic idea is that a good similarity metric should correlate with the actual functionality of a neural network, which we operationalize as accuracy on a task. Why? Accuracy differences between models are a signal that the models are processing data differently, so intermediate representations must be different, and similarity metrics should notice this.
Thus, for a given pair of neural network representations, we measure both their (dis)similarity and the difference between their accuracies on some task. If these are well-correlated across many pairs of representations, we have a good similarity metric. Of course, a perfect correlation with accuracy on a particular task also isn’t what we’re hoping for, since metrics should capture many important differences between models, not just one. A good similarity metric is one that gets generally high correlations across a couple of functionalities.
We assess functionality with a range of tasks. For a concrete example, one subtask in our benchmark builds off the observation that BERT language models finetuned with different random seeds will have nearly identical in-distribution accuracy, but widely varying out-of-distribution accuracy (for example, ranging from 0% to 60% on the HANS dataset [6]). Given two robust models, a similarity metric should rate them as similar, and given one robust and one non-robust model, a metric should rate them as dissimilar. Thus we take 100 such BERT models and evaluate whether the (dis)similarity between each pair of model representations correlates with their difference in OOD accuracy.
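Concretely, each subtask boils down to a correlation between pairwise dissimilarities and pairwise functionality gaps. The sketch below uses a Spearman rank correlation and hypothetical `reps`, `accs`, and `dissim` inputs purely for illustration; the statistic and bookkeeping in the actual benchmark may differ.

```python
# Schematic scoring rule: does a metric's dissimilarity track differences in
# functionality (here, accuracy) across all pairs of models?
from itertools import combinations
from scipy.stats import spearmanr

def benchmark_score(reps, accs, dissim):
    """reps: list of activation matrices (one per model, same inputs);
    accs: list of accuracies (e.g. OOD accuracy);
    dissim: function mapping two representations to a dissimilarity score."""
    dissims, acc_gaps = [], []
    for i, j in combinations(range(len(reps)), 2):
        dissims.append(dissim(reps[i], reps[j]))
        acc_gaps.append(abs(accs[i] - accs[j]))
    corr, _ = spearmanr(dissims, acc_gaps)
    return corr  # higher means the metric better tracks functional differences
```

For the BERT subtask above, `reps` would hold the representations of the 100 finetuned models and `accs` their HANS accuracies.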
Our benchmark is composed of many of these subtasks, where we collect model representations that vary along axes such as training seeds or layer depth, and evaluate the models’ functionalities. We include the following subtasks:
  • Varying seeds and layer depths, and assessing functionality through linear probes (linear classifiers trained on top of a frozen model’s intermediate layer; see the sketch after this list)
  • Varying seeds, layer depths, and principal component deletion, and assessing functionality through linear probes
  • Varying finetuning seeds and assessing functionality through OOD test sets (described above)
  • Varying pretraining and finetuning seeds and assessing functionality through OOD test sets
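The linear probes referenced in the first two subtasks score a frozen layer by how well a simple linear classifier on its activations performs. Here is a minimal scikit-learn sketch, assuming the activations and labels have already been extracted; `probe_accuracy` is a name chosen for illustration.

```python
# Minimal linear probe: fit a linear classifier on frozen intermediate-layer
# activations and report its test accuracy as the layer's "functionality".
from sklearn.linear_model import LogisticRegression

def probe_accuracy(train_acts, train_labels, test_acts, test_labels):
    probe = LogisticRegression(max_iter=1000)  # linear model only; the network stays frozen
    probe.fit(train_acts, train_labels)
    return probe.score(test_acts, test_labels)
```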
You can find the code for our benchmarks here.
The table below shows our results with BERT language models (vision model results can be found in the paper). In addition to the popular CKA and (PW)CCA metrics, we considered a classical baseline called the Procrustes distance. Both CKA and PWCCA dominate certain benchmarks and fall behind on others, while Procrustes is more consistent and often close to the leader. In addition, our last subtask is challenging, with no similarity measure achieving high correlation. We present it as a challenge task to motivate further progress for similarity metrics.
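For reference, the orthogonal Procrustes distance asks how well one representation can be rotated onto the other. Below is a compact sketch, assuming both representations have the same width and are first centered and rescaled to unit Frobenius norm so the distance is scale-free; the exact preprocessing in our released code may differ slightly.

```python
# Sketch of the orthogonal Procrustes distance between two representations:
# the residual of the best orthogonal alignment of Y onto X.
import numpy as np

def procrustes_distance(X: np.ndarray, Y: np.ndarray) -> float:
    def normalize(Z):
        Z = Z - Z.mean(axis=0, keepdims=True)     # center
        return Z / np.linalg.norm(Z, ord="fro")   # unit Frobenius norm
    X, Y = normalize(X), normalize(Y)
    # min over orthogonal Q of ||X - Y Q||_F^2  =  2 - 2 * nuclear_norm(Y^T X)
    return 2.0 - 2.0 * np.linalg.norm(Y.T @ X, ord="nuc")
```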
In the end, we were surprised to see Procrustes do so well since the recent CKA and CCA methods have gotten more attention, and we originally included Procrustes as a baseline for the sake of thoroughness. Building these benchmarks across many different tasks was essential for highlighting Procrustes as a good all-around method, and it would be great to see the creation of more benchmarks that evaluate the capabilities and limitations of other tools for understanding and interpreting neural networks.
For more details, please see our full paper!
References
[1] Raghu, Maithra, et al. “Do Vision Transformers See Like Convolutional Neural Networks?” arXiv preprint arXiv:2108.08810 (2021).
[2] Neyshabur, Behnam, Hanie Sedghi, and Chiyuan Zhang. “What is being transferred in transfer learning?” NeurIPS, 2020.
[3] Gotmare, Akhilesh, et al. “A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation.” International Conference on Learning Representations, 2018.
[4] Kornblith, Simon, et al. “Similarity of neural network representations revisited.” International Conference on Machine Learning. PMLR, 2019.
[5] Morcos, Ari S., Maithra Raghu, and Samy Bengio. “Insights on representational similarity in neural networks with canonical correlation.” Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[6] McCoy, R. T., J. Min, and T. Linzen. “BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance.” Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
This post is based on the paper “Grounding Representation Similarity with Statistical Testing”, to be presented at NeurIPS 2021. You can see full results in our paper, and we provide code to reproduce our experiments. We thank Juanky Perdomo and John Miller for their valuable feedback on this blog post.
