The journey of Modernizing TorchVision – Memoirs of a TorchVision developer – 3

It’s been a while since I last posted a new entry in the TorchVision memoirs series. Though I’ve previously shared news on the official PyTorch blog and on Twitter, I thought it would be a good idea to talk more about what happened in the last release of TorchVision (v0.12), what’s coming in the next one (v0.13) and what our plans are for 2022H2. My aim is to go beyond an overview of new features and provide insights into where we want to take the project in the coming months.

Focus and highlights of previous release

TorchVision v0.12 was a sizable release with a dual focus: a) update our deprecation and model contribution policies to improve transparency and attract more community contributors and b) double down on our modernization efforts by adding popular new model architectures, datasets and ML techniques.

Updating our policies

The key to a successful open-source project is maintaining a healthy, active community that contributes to it and drives it forward. Thus, an important goal for our team is to increase the number of community contributions, with the long-term vision of enabling the community to contribute big features (new models, ML techniques, etc.) on top of the usual incremental improvements (bug/doc fixes, small features, etc.).

Historically, even though the community was eager to contribute such features, our team hesitated to accept them. The key blocker was the lack of a concrete model contribution and deprecation policy. To address this, Joao Gomes worked with the community to draft and publish our first model contribution guidelines, which provide clarity over the process of contributing new architectures, pre-trained weights and features that require model training. Moreover, Nicolas Hug worked with PyTorch core developers to formulate and adopt a concrete deprecation policy.

The aforementioned changes had immediate positive effects on the project. The new contribution policy helped us receive numerous community contributions for large features (more details below) and the clear deprecation policy enabled us to clean up our code-base while still ensuring that TorchVision offers strong Backwards Compatibility guarantees. Our team is very motivated to continue working with open-source developers, research teams and downstream library creators to keep TorchVision relevant and fresh. If you have any feedback, comments or feature requests, please reach out to us.

Modernizing TorchVision

It’s no secret that for the last few releases our target was to add to TorchVision all the necessary Augmentations, Losses, Layers, Training utilities and novel architectures so that our users can easily reproduce SOTA results using PyTorch. TorchVision v0.12 continued down that route:

Our rockstar community contributors, Hu Ye and Zhiqiang Wang, have contributed the FCOS architecture which is a one-stage object detection model.
Nicolas Hug has added support for optical flow in TorchVision by adding the RAFT architecture.
Yiwen Song has added support for Vision Transformer (ViT) and I have added the ConvNeXt architecture along with improved pre-trained weights.
Finally with the help of our community, we’ve added 14 new classification and 5 new optical flow datasets.
As per usual, the release came with numerous smaller enhancements, bug fixes and documentation improvements. To see all of the new features and the list of our contributors please check the v0.12 release notes.

Sneak peek of the next release

TorchVision v0.13 is just around the corner, with its expected release in early June. It is a very big release with a significant number of new features and big API improvements.

Wrapping up Modernizations and closing the gap from SOTA

We are continuing our journey of modernizing the library by adding the necessary primitives, model architectures and recipe utilities to produce SOTA results for key Computer Vision tasks:

With the help of Victor Fomin, I have added important missing Data Augmentation techniques such as AugMix, Large Scale Jitter etc. These techniques enabled us to close the gap from SOTA and produce better weights (see below).
With the help of Aditya Oke, Hu Ye, Yassine Alouini and Abhijit Deo, we have added important common building blocks such as the DropBlock layer, the MLP block, the cIoU & dIoU loss etc. Finally I worked with Shen Li to fix a long standing issue on PyTorch’s SyncBatchNorm layer which affected the detection models.
Hu Ye with the support of Joao Gomes added Swin Transformer along with improved pre-trained weights. I added the EfficientNetV2 architecture and several post-paper architectural optimizations on the implementation of RetinaNet, FasterRCNN and MaskRCNN.
As I discussed earlier on the PyTorch blog, we have put significant effort into improving our pre-trained weights by creating an improved training recipe. This enabled us to improve the accuracy of our Classification models by 3 accuracy points, achieving new SOTA for various architectures. A similar effort was performed for Detection and Segmentation, where we improved the accuracy of the models by over 8.1 mAP on average. Finally, Yosua Michael M worked with Laura Gustafson, Mannat Singh and Aaron Adcock to add support for SWAG, a set of new highly accurate state-of-the-art pre-trained weights for ViT and RegNets.
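To give a feel for one of the building blocks listed above, here is a minimal pure-Python sketch of the distance-IoU (dIoU) loss for a single pair of boxes. It is not TorchVision's implementation, which operates on batched tensors; the function name, the `(x1, y1, x2, y2)` box format and the `eps` parameter are assumptions made for illustration. The loss is 1 - IoU plus the squared distance between the box centers, normalized by the squared diagonal of the smallest box enclosing both.

```python
from typing import Tuple

# Illustrative box format (an assumption): (x1, y1, x2, y2) with x1 < x2, y1 < y2.
Box = Tuple[float, float, float, float]


def distance_iou_loss(pred: Box, target: Box, eps: float = 1e-7) -> float:
    """Hypothetical helper: dIoU loss for one predicted/target box pair."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    diagonal_sq = enclose_w ** 2 + enclose_h ** 2 + eps

    # Squared distance between the two box centers.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    center_dist_sq = dx ** 2 + dy ** 2

    return 1.0 - iou + center_dist_sq / diagonal_sq


# Usage: the penalty term keeps the loss informative even without overlap.
loss_same = distance_iou_loss((0, 0, 2, 2), (0, 0, 2, 2))
loss_nested = distance_iou_loss((0, 0, 4, 4), (1, 1, 3, 3))
loss_disjoint = distance_iou_loss((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike a plain IoU loss, which is flat at 1 for every pair of non-overlapping boxes, the center-distance term still changes as the boxes move towards each other, which is the usual motivation for this family of losses.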

New Multi-weight support API

As I previously discussed on the PyTorch blog, TorchVision has extended its existing model builder mechanism to support multiple pre-trained weights. The new API is fully backwards compatible, allows models to be instantiated with different weights and provides mechanisms to get useful meta-data (such as categories, number of parameters, metrics etc.) and the preprocessing inference transforms of the model. There is a dedicated feedback issue on GitHub to help us iron out any rough edges.

Revamped Documentation

Nicolas Hug led the effort to restructure the model documentation of TorchVision. The new structure makes use of features coming from the Multi-weight Support API to offer better documentation for the pre-trained weights and their use in the library. Massive shout-out to our community members for helping us document all architectures on time.

Our plans for 2022H2

Though our detailed roadmap for 2022H2 is not yet finalized, here are some key projects that we are currently planning to work on:

We are working closely with Haoqi Fan and Christoph Feichtenhofer from PyTorch Video, to add the Improved Multiscale Vision Transformer (MViTv2) architecture to TorchVision.
Philip Meier and Nicolas Hug are working on an improved version of the Datasets API (v2) which uses TorchData and DataPipes. Philip Meier, Victor Fomin and I are also working on extending our Transforms API (v2) to support not only images but also bounding boxes, segmentation masks etc.
Finally the community is helping us keep TorchVision fresh and relevant by adding popular architectures and techniques. Lezwon Castelino is currently working with Victor Fomin to add the SimpleCopyPaste augmentation. Hu Ye is currently working to add the DeTR architecture.

If you would like to get involved with the project, please have a look at our good first issues and the help wanted lists. If you are a seasoned PyTorch/Computer Vision veteran and you would like to contribute, we have several candidate projects for new operators, losses, augmentations and models.

I hope you found the article interesting. If you want to get in touch, hit me up on LinkedIn or Twitter.