How not to Open Source

Recently, DeepMind released a paper introducing Adaptive Gradient Clipping (AGC), which lets you train deep models without BatchNorm - a layer that accounts for about 10-20% of the training bottleneck in such networks. If these words didn’t make any sense to you, don’t worry - I barely get it myself. Simply put, it’s a technique that lets you build better machine learning models quicker than usual.

I’d just woken up to this paper on my Twitter feed and it looked very interesting. A quick glance showed impressive results, so I thought I’d check the authors’ code on GitHub. Unsurprisingly, the authors used Haiku, built on top of the awesome JAX framework (led by DeepMind and Google respectively). I’d contributed to the JAX ecosystem earlier, so I had a fair understanding of how it works under the hood and had no difficulty reading through the code. Curious about a PyTorch implementation, I did a quick Google search and found that none existed. I loved the AGC part of the paper because of its simplicity, so it took me less than 10 minutes to push the code onto GitHub. Little did I know this 1-hour stint would help me relearn things I’d forgotten. (Also, do have a look at the docs and star the project if it helps you, and cite it if it helped your research!) For a technical summary, you can visit my scientific blog: https://tourdeml.github.io/blog
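
For a rough idea of what AGC does, here is a minimal sketch in plain Python rather than PyTorch. It assumes the unit-wise rule from the paper: each row of a gradient matrix is rescaled whenever its norm exceeds `clipping` times the norm of the matching weight row. The function names and default values are illustrative assumptions, not the API of my package or of the authors' code.

```python
import math


def unitwise_norm(rows):
    """L2 norm of each row (one row per output unit)."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]


def adaptive_clip(weights, grads, clipping=0.01, eps=1e-3):
    """Return a clipped copy of grads using a unit-wise AGC-style rule.

    weights, grads: lists of rows with matching shapes.
    clipping, eps: illustrative defaults, chosen for this sketch.
    """
    clipped = []
    norms = zip(unitwise_norm(weights), unitwise_norm(grads), grads)
    for w_norm, g_norm, g_row in norms:
        # eps keeps rows with near-zero weights from having every gradient clipped to zero.
        max_norm = clipping * max(w_norm, eps)
        if g_norm > max_norm:
            scale = max_norm / g_norm
            clipped.append([g * scale for g in g_row])
        else:
            clipped.append(list(g_row))
    return clipped
```

The first row of `grads` below is far larger than its weights allow, so it gets scaled down, while the second row passes through untouched: `adaptive_clip([[3.0, 4.0], [1.0, 0.0]], [[30.0, 40.0], [0.001, 0.0]])`.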

1. Tests are important.

Having written test cases in most of my projects, I knew their importance, but I was too lazy to write them here. Don’t get me wrong: the current test case still sucks quite a bit, and the irony is not lost on me, but at least it’s better than no tests. The lesson I relearned is that writing basic tests in the initial stages is crucial; you can improve them iteratively. I ran into errors when running the tests on GitHub Actions, but I’ll have to look into those later.
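
As an illustration of what "basic tests" can look like, here is a small sketch using the standard library's `unittest`. The `clip_to_norm` helper is hypothetical and exists only so the tests have something to check; it is not taken from my repository.

```python
import math
import unittest


def clip_to_norm(vector, max_norm):
    """Hypothetical helper: rescale vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm:
        return list(vector)
    return [x * max_norm / norm for x in vector]


class ClipToNormTest(unittest.TestCase):
    def test_small_vector_is_unchanged(self):
        self.assertEqual(clip_to_norm([0.3, 0.4], 1.0), [0.3, 0.4])

    def test_large_vector_is_rescaled(self):
        clipped = clip_to_norm([3.0, 4.0], 1.0)
        self.assertAlmostEqual(math.sqrt(sum(x * x for x in clipped)), 1.0)

    def test_zero_vector_is_safe(self):
        # A zero gradient must not trigger a division by zero.
        self.assertEqual(clip_to_norm([0.0, 0.0], 1.0), [0.0, 0.0])
```

Three tests like these cover the happy path, the clipping path, and one edge case. Saved in a file whose name starts with `test_`, they can be run with `python -m unittest`.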

2. Docs help, a lot.

Adding Sphinx to the project takes time if you’re new to it, but it’s crucial. I cannot stress this enough: DOCS ARE ESSENTIAL. Sphinx takes care of nearly everything; all you have to do is write docstrings. I like Read the Docs for hosting the documentation because it integrates easily with Sphinx and builds automatically, but you can also host directly on GitHub Pages.
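
To show what "write docstrings" means in practice, here is a sketch of a function documented with the reStructuredText field lists that Sphinx's `autodoc` extension understands. The function itself is a made-up example, not part of my package.

```python
def scale_gradient(grad, factor):
    """Scale a gradient by a constant factor.

    :param grad: Gradient values, one float per parameter.
    :type grad: list[float]
    :param factor: Multiplier applied to every value.
    :type factor: float
    :returns: A new list with each value multiplied by ``factor``.
    :rtype: list[float]
    """
    return [g * factor for g in grad]
```

With `sphinx.ext.autodoc` enabled, an `.. autofunction::` directive pointing at this function should render the summary, parameters, and return type without any extra writing.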

3. Open Source takes time, but it’s worth it.

Maintaining the code takes time: answering issues and fixing the minor and major mistakes you’ve made. GitHub’s Discussions tab is an interesting addition, but I believe there is room for improvement here since not many people notice that tab.

Most mistakes have been fixed thanks to the community’s involvement. This makes the effort worthwhile, even though there are no real benefits apart from the satisfaction. But hey, you get to write a blog post detailing the effort.

4. PyPI improves accessibility by 100x.

“If it’s possible to make it pip installable, make it pip installable - Wayne Gretzky”

- Vaibhav

Set up a setup.py file in the repository’s root directory and register the project on PyPI. Yes, it’s that simple. I believe there are 2 kinds of reproducibility repositories:

  1. Pipeline papers where they introduce a certain algorithm/technique
  2. Wrapper-based concepts that integrate into existing pipelines.

This repository falls under (2), so I made it pip installable.

pip install nfnets-pytorch
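
For reference, a minimal setup.py could look roughly like the sketch below. Only the package name comes from this post; the version, description, and dependency list are placeholder assumptions, not the actual file from my repository.

```python
# setup.py - a minimal sketch; every value except the name is a placeholder.
from setuptools import find_packages, setup

setup(
    name="nfnets-pytorch",        # the name used with `pip install`
    version="0.0.1",              # placeholder version
    description="PyTorch implementation of Adaptive Gradient Clipping",  # placeholder
    packages=find_packages(),     # picks up every directory containing an __init__.py
    install_requires=["torch"],   # assumed runtime dependency
    python_requires=">=3.6",      # assumed minimum Python version
)
```

One common way to publish from here is to build the distributions with `python setup.py sdist bdist_wheel` and upload them with `twine upload dist/*`, after creating an account on PyPI.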

5. Enable discussions - a better alternative to issues.

As mentioned earlier, GitHub lets you add a “Discussions” tab to your repository in case users have queries that don’t fall under “Issues”.

Vaibhav Balloli

Ph.D. student at University of Michigan, Ann Arbor
Ann Arbor, MI, USA