Python Packages for Machine Learning

Apr 25, 2023

Some thoughts and learnings on Python Packaging for Machine Learning, Beyond the Starter Guide.

Why Create a Python Package for Machine Learning?

Recently I wrote an article on some possible words to describe different parts of an AI/ML pipeline. What I didn’t talk about is the idea that it might be helpful to have a home-spun Python package: a bundled collection of tools and utilities that can be used across different parts of the AI/ML pipeline to keep things consistent.

One way to share tools and utilities is to literally copy and paste code from one part of the AI/ML pipeline to another: say, from the Zeus Zonal, where all of your Jupyter Notebook goodies are stored, over to the Prometheus Pipeline, where your data wrangling and cleaning happens, or vice versa. An easier approach than rote copy-and-pasting is to bundle those utilities into a package and install that package wherever it’s needed.

Of course, if you’re creating your own private package, some of what you make might be proprietary, which means it needs to be hosted at a private location rather than on PyPI itself. We’re not creating an open source tool here; we’re creating something potentially sensitive, which is an edge case that the vast majority of “how to publish a PyPI package” articles don’t cover. Let’s take a look at some of the detailed considerations involved in creating a package for Machine Learning purposes.

Creating a Python Package for Machine Learning Purposes

First off, there’s a standard way to create a Python package, described in Packaging and Distributing Projects. Go through that and make sure you can successfully publish a package to the TestPyPI repo, or read other tutorials on Python packaging to get yourself up to speed.

Reserving the Namespace

Beyond that, if you’re going to create a Machine Learning package for internal use within an organization, you likely need a private package rather than a public PyPI package, so you’ll have to choose a provider for hosting it, such as JFrog or GitLab. Even so, you should check on PyPI that the namespace is free. If your package name is, for example, the-package, you’ll find that, hilariously enough, “the-package” is already taken, so you would have to pick a namespace that is not yet taken, for example “xyz-the-package,” if xyz were a common acronym for your organization. Why is that? Well, you don’t want yourself or anyone else to accidentally download the wrong package. As cool as the name “package” might be, it’s better to reserve a custom namespace so there’s no confusion, such as “abcxyz-package.”
To publish a package on PyPI and reserve the namespace, you first have to register as a user and confirm your email, and then generate an API token. If you’re only reserving a namespace so that nobody accidentally downloads some other unknown package some day, you can publish a placeholder under that namespace and then throw the token away.
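The availability check described above can be scripted. What follows is a minimal sketch, assuming PyPI’s public JSON endpoint (https://pypi.org/pypi/<name>/json), which answers 404 for a project that does not exist. The function names are made up for illustration, and PyPI may still reject a name that is too similar to an existing one, so treat a False result as “probably free” rather than a guarantee.

```python
import re
import urllib.error
import urllib.request


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pypi_json_url(name: str) -> str:
    """URL of PyPI's JSON metadata for a project name."""
    return f"https://pypi.org/pypi/{normalize(name)}/json"


def is_name_taken(name: str, opener=urllib.request.urlopen) -> bool:
    """Return True if PyPI already serves a project under this name.

    `opener` is injectable so the function can be exercised offline.
    """
    try:
        with opener(pypi_json_url(name), timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # no such project on PyPI
        raise  # any other HTTP error is unexpected; surface it
```

Calling is_name_taken("abcxyz-package") before settling on a name is a cheap first check; the real confirmation is still a successful upload.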

Use Recent Documentation

A lot of the information online about using setup.py to create a Python package is out of date. The newer way is to use a pyproject.toml file; information about that can be found here. There was a decent amount of updated information in this article, which was published on HackerNews in 2022 and got a fair amount of criticism for being cargo-cultish. Whatever. For the purposes of this discussion, use whatever design decisions you need to get something done the right way, whatever that means. There are basically two main things that happen: 1) build the package and 2) publish the package.

The Two Essential Things That Happen

In short, two main commands make this happen, and the configuration beyond those two commands comes from a couple of files that live in particular locations on the machine.
The two commands that you need are build and then twine. You really only need the wheel to publish a package; you don’t need the source distribution in .tar.gz format, so you can pass the --wheel flag:

python -m build --wheel

Then when you twine the file, you use the following format:

python3 -m twine upload --repository testpypi dist/*

Let’s go through what that means briefly:
twine is a tool for publishing packages to PyPI, found here, which has three possible commands: upload, register, and check.
--repository testpypi means that we’re using the testpypi repository, which is not the same as your private repository; it points to a place on PyPI used purely for testing purposes.
dist/* refers to the distribution files that we upload to the repository. Typically this is going to be a .whl file, but you could include the .tar.gz source distribution as well. The /* designator means “anything in the ./dist directory.”

As for the --repository flag, if you were using a repo other than testpypi, let’s say pypi or whateverhub, then you would use that name in place of testpypi above.
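In a CI job it can be handy to drive both steps from one script. The following is a rough sketch, not a prescribed workflow: it assumes the build and twine packages are installed in the current interpreter, and the function names are invented for illustration. Because subprocess does not expand shell globs, the wheels in dist/ are listed explicitly.

```python
import subprocess
import sys
from pathlib import Path


def build_command() -> list[str]:
    """Command equivalent to `python -m build --wheel`."""
    return [sys.executable, "-m", "build", "--wheel"]


def upload_command(repository: str, dist_dir: str = "dist") -> list[str]:
    """Command equivalent to `python -m twine upload --repository <name> dist/*.whl`."""
    wheels = sorted(str(path) for path in Path(dist_dir).glob("*.whl"))
    if not wheels:
        raise FileNotFoundError(f"no wheels found in {dist_dir!r}; run the build first")
    return [sys.executable, "-m", "twine", "upload", "--repository", repository, *wheels]


def build_and_publish(repository: str = "testpypi") -> None:
    """Run both steps in order, stopping at the first failure."""
    subprocess.run(build_command(), check=True)
    subprocess.run(upload_command(repository), check=True)
```

Calling build_and_publish("whateverhub") from the package directory would build the wheel and push it to whichever repository is configured under that name.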

How Twine Knows What It Knows, Security Considerations

That being said, how does the twine command know anything about the repo that you are sending to? The answer is a .pypirc file. By default twine reads $HOME/.pypirc; a different file, such as a ./.pypirc kept in the same directory as your package, can be passed with the --config-file option. Each repository gets its own section, and the section name must match both the entry under index-servers and the name passed to --repository. Without credentials, the file looks like this:

[distutils]
index-servers =
    whateverhub

[whateverhub]
repository = https://whateverhub.com/path/to/project

and with credentials:

[distutils]
index-servers =
    whateverhub

[whateverhub]
repository = https://whateverhub.com/path/to/project
username = myusername
password = mypassword

I know what you might be thinking: the above method is not very secure. It never makes sense to have a username and password sitting in a file on a machine; they belong in an environment variable or a secret. This is true. The above is just a simple way to get a package written to the PyPI namespace in order to reserve it.
For real use, twine can also read credentials from the TWINE_USERNAME and TWINE_PASSWORD environment variables, which avoids the file entirely. If you do need the file, the way to secure it is to write it temporarily during the CI process: store the username and password as secrets, render them to a file, and then either delete the file after the twine step completes or tear down the container running the job.
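The write-then-delete approach can be sketched as a small context manager. This is an illustration under assumptions rather than a hardened implementation: the function name is made up, and the credentials are expected to come from CI secrets. The file lives in a private temporary directory and disappears when the block exits, even if the upload fails.

```python
import configparser
import contextlib
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temporary_pypirc(repository: str, url: str, username: str, password: str):
    """Yield the path of a short-lived .pypirc, removed when the block exits."""
    # RawConfigParser avoids '%' interpolation, so tokens are written verbatim.
    config = configparser.RawConfigParser()
    config["distutils"] = {"index-servers": repository}
    config[repository] = {
        "repository": url,
        "username": username,
        "password": password,
    }
    # The temporary directory is created readable by the current user only.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / ".pypirc"
        with open(path, "w") as handle:
            config.write(handle)
        path.chmod(0o600)  # restrict the file itself as well
        yield path
    # Leaving the TemporaryDirectory block deletes the file and its directory.
```

Inside the with block, the yielded path can be handed to twine through its --config-file option, for example twine upload --config-file <path> --repository whateverhub dist/*.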

Enjoying the Fruits of Your Labor

Once the package has been correctly built and twined, you should get a 201 response from the API you pushed to. If you get this response, you can then navigate to your package repo’s web interface to make sure the package is there.
On the question of security when using a private repo API: when downloading and installing the package with pip install, the username and password/token can be supplied as part of the index URL, as follows:

pip install             \
    --index-url https://__token__:{USER_TOKEN}@whateverhub.com/path/to/project/packages/whateverpackage/simple    \
    --extra-index-url https://pypi.org/simple \
    whateverpackage

In the scenario above, the {USER_TOKEN} for a particular whateverhub repository should be kept out of the logs. If the install happens inside a Docker container build, mount the token as a secret rather than passing it as a build argument.
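If the install is scripted, the token can be read from the environment and spliced into the index URL at run time instead of being typed into a command. A minimal sketch follows, assuming a hypothetical WHATEVERHUB_TOKEN environment variable and invented function names. The token is percent-encoded in case it contains characters that are special in URLs.

```python
import sys
from urllib.parse import quote, urlsplit, urlunsplit


def authenticated_index_url(base_url: str, token: str, username: str = "__token__") -> str:
    """Insert percent-encoded credentials into an index URL."""
    parts = urlsplit(base_url)
    netloc = f"{quote(username, safe='')}:{quote(token, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def pip_install_command(package: str, index_url: str) -> list[str]:
    """pip command that checks the private index first, then public PyPI."""
    return [
        sys.executable, "-m", "pip", "install",
        "--index-url", index_url,
        "--extra-index-url", "https://pypi.org/simple",
        package,
    ]


# Example usage (WHATEVERHUB_TOKEN is a hypothetical variable name):
#   import os, subprocess
#   url = authenticated_index_url(
#       "https://whateverhub.com/path/to/project/packages/whateverpackage/simple",
#       os.environ["WHATEVERHUB_TOKEN"],
#   )
#   subprocess.run(pip_install_command("whateverpackage", url), check=True)
```

Keeping the token in the environment means it never has to appear in a Dockerfile or a committed script.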

Conclusion and Packaging Up Other Considerations

That’s about it. There are tons of tutorials out there about creating Python packages and I don’t want to beat a dead horse. These are just some general thoughts for if you need a private package to help your Machine Learning workflow.

Here are some other things that have come up along the way that have been helpful to me, but might or might not apply to you:

Picking a Python version to pin across all runtimes/containers, which then serves as the anchor point for pinning all other dependencies, to minimize dependency hell.
Security considerations for publishing a package with twine and installing it with pip. Are you using pip-compile and only using safe requirements? Maybe this matters, maybe not.
Proper design patterns for packages - depending upon who is using the package, how do you document that out and how do you structure it so it’s easy to use? This is a highly variable question.
Versioning using SemVer. You’re likely going to come out with a lot of different versions of the package. Can you use SemVer as a guide and stick to it, so that only a MAJOR version update may break compatibility with previous versions, while MINOR and PATCH updates do not?
If your package is stored in a monorepo and not in its own repo, are you able to tag the monorepo where a package is stored as a part of versioning? Are you able to accomplish this tagging automatically as a part of your CI? Might be useful, might be overkill, I don’t know, it depends upon you…something to think about.
Auto-upgrading Jupyter and other images, including those images where a particular package is already installed - can you put auto-upgrades into your pipeline? Might be useful.
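On the versioning point in the list above, a small helper can classify what kind of bump a release represents, for instance as a CI guard that only auto-upgrades images on non-breaking releases. This is a sketch that assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build suffixes; the function names are invented for illustration. Note that SemVer treats 0.x releases as unstable, which this sketch does not model.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from old to new as 'major', 'minor', or 'patch'."""
    before, after = parse_version(old), parse_version(new)
    if after <= before:
        raise ValueError(f"{new} is not newer than {old}")
    if after[0] != before[0]:
        return "major"
    if after[1] != before[1]:
        return "minor"
    return "patch"


def is_safe_upgrade(old: str, new: str) -> bool:
    """True when the upgrade should not break callers under SemVer rules."""
    return bump_kind(old, new) != "major"
```

For example, is_safe_upgrade("1.4.2", "1.5.0") is True, while a move from "1.4.2" to "2.0.0" is flagged as a major, potentially breaking change.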

Building