Awesome PyTorch Lightning template
TLDR: A PyTorch Lightning template with a lot of features included. Link to the Google Colab here.
Epistemic status: This template is the result of a few weeks of learning, not years of experience. Think of it as a friend's lecture notes, not the teacher's handouts. To show you how under-qualified and over-opinionated I am, just check the list of issues I didn't manage to solve. And if you know how to solve them, please please please tell me.
Have you ever been confused about what the best practice for PyTorch is? Like, when to send your Tensor to the GPU? Or when to call zero_grad? Or have you tried to do something new, like adding a SimCLR-like pipeline, and had to rewrite most of your code because it was so poorly written? Or maybe you are wondering what everyone else's pipeline looks like? Is theirs 100% more efficient? What are the simple tips and tricks that you have been missing? Maybe not you, but I have. PyTorch Lightning (PL) comes to the rescue. It is basically a template for how your code should be structured.
PL has a lot of features in its documentation, like:
logging
inspecting gradients
profiler
etc.
They also have a lot of templates, such as:
The simplest example, called the Boring Model, for debugging
A scratch model for rapid prototyping
Basic examples like MNIST
Advanced examples like a Generative Adversarial Network (GAN)
And even more stuff in PL Bolts (GitHub)
The issue is that all of these examples are minimalistic, which is great for a demo or a template. But what's missing is a more complete example showing how all of these features integrate, especially for people whose first coding project is in deep learning. So here it is:
The template
Link to the Google Colab here.
Features
Logging
Logging to TensorBoard AND stdout for redundancy (I made this myself ^^). I still can't copy everything; some information is only written to stdout. This is an issue because if you use supercomputing clusters, the output files have nondescript names, making it harder to look up experimental details, especially far into the future. (I'm considering logging to WandB as well.)
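This is not the template's exact wiring, just a minimal sketch of the idea, assuming a TensorBoardLogger paired with Python's standard logging module so the same information also lands in stdout and a named file. The logger name, file names, and hyperparameter values are placeholders.

```python
import logging

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# stdout/file copy of everything worth keeping, in addition to TensorBoard.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.StreamHandler(), logging.FileHandler("experiment.log")],
)
log = logging.getLogger("experiment")

tb_logger = TensorBoardLogger(save_dir="lightning_logs", name="my_experiment")
trainer = Trainer(logger=tb_logger, max_epochs=10)

log.info("hparams: lr=1e-3, batch_size=64")  # survives nondescript cluster stdout files
```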
Proper use of hp_metric so we can select the best hyperparameters within TensorBoard (not working yet T_T. As a temporary workaround, it saves to a .json that gets loaded into a pandas DataFrame).
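For reference, the intended TensorBoard route looks roughly like the sketch below; the hyperparameters and the metric value are placeholders, and the exact keyword arguments differ a bit between PL versions. Calling self.log("hp_metric", value) from a validation hook is another common route.

```python
from pytorch_lightning.loggers import TensorBoardLogger

# Disable the default hp_metric=-1 placeholder so we can write our own value later.
tb_logger = TensorBoardLogger("lightning_logs", name="hp_demo", default_hp_metric=False)

# After training, associate the run's hyperparameters with a single score,
# which TensorBoard's HPARAMS tab can then sort and filter on.
tb_logger.log_hyperparams(
    {"lr": 1e-3, "batch_size": 64},
    metrics={"hp_metric": 0.123},  # e.g. best validation loss
)
```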
Loss curve (I made this myself ^^).
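The curve itself just comes from logging the loss inside the LightningModule. A minimal sketch, with the model internals omitted and all names (LitModel, the MSE loss) as placeholders:

```python
import torch.nn.functional as F
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    # ... __init__, forward, configure_optimizers omitted ...

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)
        # Logged per step and aggregated per epoch; TensorBoard draws the curve.
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        val_loss = F.mse_loss(self(x), y)
        self.log("val_loss", val_loss, prog_bar=True)
```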
Entire script timing (for estimating WALLTIME if you are using supercomputing clusters) (I made this myself ^^).
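The timing itself is nothing PL-specific; a sketch of the idea, wrapping the whole run in a monotonic timer:

```python
import time
from datetime import timedelta

start = time.monotonic()

# ... build the DataModule, model, and Trainer, then call trainer.fit(...) here ...

elapsed = timedelta(seconds=time.monotonic() - start)
print(f"Total script time: {elapsed}")  # rough basis for the WALLTIME to request next run
```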
Inspect gradient norms to prevent vanishing or exploding gradients.
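In the PL versions this post targets, this is a single Trainer flag (later 2.x releases dropped it in favor of logging the norms yourself from a hook); a hedged sketch:

```python
from pytorch_lightning import Trainer

# Logs the 2-norm of each parameter's gradient to the logger every step,
# which makes vanishing / exploding gradients easy to spot in TensorBoard.
trainer = Trainer(track_grad_norm=2, max_epochs=10)
```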
Log parameters as histograms to TensorBoard (I made this myself ^^). Logging individual parameters would not be realistic, since there can be millions of parameters per epoch.
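A sketch of one way to do this inside the LightningModule; hook names and signatures have shifted slightly across PL versions, and LitModel is the same placeholder module as above:

```python
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    # ... the usual methods omitted ...

    def on_train_epoch_end(self):
        # One histogram per named parameter, once per epoch. For TensorBoardLogger,
        # the underlying SummaryWriter is exposed as self.logger.experiment.
        for name, param in self.named_parameters():
            self.logger.experiment.add_histogram(name, param, global_step=self.current_epoch)
```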
Print a summary of your LightningModule. This is an example of printed output that I cannot redirect to TensorBoard text.
Log system (hardware and software) info.
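Exactly what gets logged is up to you; a small sketch of the kind of snapshot worth keeping alongside each run:

```python
import platform

import torch
import pytorch_lightning as pl

system_info = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    "torch": torch.__version__,
    "pytorch_lightning": pl.__version__,
    "cuda_available": torch.cuda.is_available(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
}
print(system_info)  # and/or write it into the same log file as everything else
```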
Debugging
Profiler (PyTorch) to figure out which layers / operations are the bottlenecks that have been stealing your time and memory. Note that this slows things down. By a lot! So make sure you turn this off before you move on to hyperparameter tuning.
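Enabling it is one Trainer argument; a sketch (the string options are documented in PL, the epoch count is just a placeholder):

```python
from pytorch_lightning import Trainer

# "simple" gives a cheap per-hook timing table; "pytorch" wraps torch.profiler and
# reports per-operator time and memory. Turn it off again before hyperparameter tuning.
trainer = Trainer(profiler="pytorch", max_epochs=1)
```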
Sanity check is a feature that is turned on by default. Good to know that this exists. (I found out about it the hard way.)
There are two ways to monitor the GPU. The first one just monitors the memory, while the second one can monitor a number of statistics. My template only uses the first one.
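The exact API depends on the PL version: around the time this template was written, the memory-only route was a Trainer flag and the richer statistics came from a callback (later releases consolidated both into DeviceStatsMonitor). A hedged sketch of those two older options:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import GPUStatsMonitor

# Option 1 (what the template uses): log only GPU memory usage.
trainer = Trainer(log_gpu_memory="all", gpus=1)

# Option 2: a callback that logs utilization, memory, temperature, and more.
trainer = Trainer(callbacks=[GPUStatsMonitor()], gpus=1)
```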
There are two ways to run on shortened epochs. The first is by limiting the number of batches, while the second is fast_dev_run, which limits the number of batches under the hood anyway, among other things. My template passes the limiting arguments directly. (Both options, together with the overfitting check below, appear in the sketch after the next item.)
Make the model overfit on a subset of the data.
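All three of these debugging runs are plain Trainer arguments; a sketch showing them as alternatives (the batch counts and fractions are placeholders):

```python
from pytorch_lightning import Trainer

# Option 1: cap how many batches each phase sees (what the template does).
trainer = Trainer(limit_train_batches=10, limit_val_batches=5, max_epochs=2)

# Option 2: push one (or a few) batches through fit/val/test as a quick end-to-end check.
trainer = Trainer(fast_dev_run=True)

# Overfit on a small fraction of the training data to confirm the model can learn at all.
trainer = Trainer(overfit_batches=0.01, max_epochs=50)
```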
(Bug: the profiler clashes with shortened epochs.)
Optimization
Early stopping, because let's not waste resources when the model has already converged. (The sketch after the next item shows both early stopping and gradient clipping.)
Gradient Clipping
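A sketch covering both of the last two items, assuming the LightningModule logs a "val_loss" metric; the patience, clip value, and epoch count are placeholders:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop when the monitored metric hasn't improved for `patience` validation checks.
early_stop = EarlyStopping(monitor="val_loss", mode="min", patience=10)

trainer = Trainer(
    callbacks=[early_stop],
    gradient_clip_val=0.5,  # clip the gradient norm to at most 0.5 before each optimizer step
    max_epochs=200,
)
```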
When it comes to optimizer, I used to just simply use Adam, with ReduceLROnPlateau and call it a day (I don’t even optimize for betas). But this stops me from sleeping at night because I always second guess myself, wondering if I’m missing on huge improvements. And I know that this is a VERY active area of research. But the alternative is the curse of dimensionality with optimizer hyperparameters. This is where PL comes to my rescue. Now, I could simply consider PL as an industry standard, use all the optimization tools provided, and sleep a little easier. And here are the two tools: Learning Rate Finder, and Stochastic Weight Averaging.
说到优化器,我以前只是简单地使用 Adam,搭配 ReduceLROnPlateau,然后就完事了(我甚至都不为 β 优化)。但这让我彻夜难眠,因为我总是反复怀疑自己,琢磨着是不是错过了巨大的进步。而且我知道这是一个非常活跃的研究领域。但另一种选择是优化器超参数的维度诅咒。这时候,PyTorch Lightning(PL)就来救我了。现在,我可以简单地将 PL 视为行业标准,使用它提供的所有优化工具,然后睡得稍微安稳一点。这里有两个工具:学习率查找器和随机权重平均。
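A hedged sketch of how the two tools are wired up in the 1.x-era API this post targets (PL 2.x moved the LR finder into a separate Tuner class). Here `model` and `dm` stand for a LightningModule and LightningDataModule built elsewhere, and the module is assumed to expose an `lr` attribute for the finder to overwrite:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

trainer = Trainer(
    auto_lr_find=True,  # Learning Rate Finder: short LR range test before real training
    callbacks=[StochasticWeightAveraging(swa_lrs=1e-2)],  # SWA over the tail end of training
    max_epochs=100,
)

trainer.tune(model, datamodule=dm)  # runs the LR finder and writes the suggested lr back
trainer.fit(model, datamodule=dm)
```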
Saving and loading weights
Save the best model and test it.
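A sketch using the standard ModelCheckpoint callback, again assuming a logged "val_loss" and the same placeholder `model` and `dm` as above:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the single best checkpoint according to validation loss.
checkpoint = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=1,
    filename="best-{epoch}-{val_loss:.3f}",
)

trainer = Trainer(callbacks=[checkpoint], max_epochs=100)
trainer.fit(model, datamodule=dm)

# Evaluate the best checkpoint (not the last weights) on the test set.
trainer.test(ckpt_path="best", datamodule=dm)
```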
Once you have finished your hyperparameter search, you might want to load a checkpoint after the session is closed to do things like residual analysis or activation analysis (look at the pretty figure above) on the best model.
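Loading works without a Trainer at all; a sketch, where LitModel is the placeholder LightningModule from the earlier sketches and the checkpoint path is made up:

```python
# Rebuilds the module with its saved hyperparameters and weights.
model = LitModel.load_from_checkpoint(
    "lightning_logs/my_experiment/version_0/checkpoints/best.ckpt"
)
model.eval()  # ready for residual / activation analysis outside the Trainer
```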
(Tracking how the activations change during training might also be helpful, but I don't implement this in this template.)
Hyperparameter tuning:
There is an example of grid search in my Google Colab template. Yes, I know random search is better, but this is just a demo.
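The skeleton of such a grid search is just a nested loop; in this sketch, train_once is a hypothetical helper (not from the Colab) that builds the DataModule, model, and Trainer, runs fit, and returns the metric of interest:

```python
from itertools import product

learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64]

results = []
for lr, batch_size in product(learning_rates, batch_sizes):
    val_loss = train_once(lr=lr, batch_size=batch_size)  # hypothetical training helper
    results.append({"lr": lr, "batch_size": batch_size, "val_loss": val_loss})

best = min(results, key=lambda r: r["val_loss"])
print("best configuration:", best)
```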
For some reason, I can't get the hp_metric on TensorBoard to work, so I made a .json workaround. Also included is a snippet to aggregate the .json files from different experimental runs.
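The aggregation side of that workaround looks roughly like this; the directory layout and file name are placeholders, not necessarily what the Colab uses:

```python
import glob
import json

import pandas as pd

# Each run writes its hyperparameters and final metrics into its own metrics.json.
records = []
for path in glob.glob("results/*/metrics.json"):
    with open(path) as f:
        records.append(json.load(f))

df = pd.DataFrame(records)
print(df.sort_values("val_loss").head())  # best runs first
```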
What I also need is a hyperparameter optimizer library that implements good algorithms and that reads and writes from an offline file (because I'm using HPC). The best solution so far is Optuna, because it is easy to parallelize offline.
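A sketch of what that looks like with Optuna: a file-backed SQLite storage lets several offline jobs share one study, which is the part that matters on an HPC cluster. train_once is the same hypothetical helper as in the grid-search sketch, and all names and ranges are placeholders:

```python
import optuna


def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    return train_once(lr=lr, batch_size=batch_size)  # hypothetical helper returning val_loss


# Every job pointing at the same .db file contributes trials to the same study.
study = optuna.create_study(
    study_name="pl_template",
    storage="sqlite:///optuna.db",
    direction="minimize",
    load_if_exists=True,
)
study.optimize(objective, n_trials=20)
print(study.best_params)
```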
Where to go from here?
Obviously, my template on Google Colab, if you haven't checked it yet. But you might also want to check out existing models in Bolts and make your own LightningDataModule for your own datasets. Good luck!
References
All images, except where noted otherwise in the caption, are mine.