Using conda cache in GitlabCI
Posted on 2020.05.07
How do you speed up CI, especially unit tests that do not require an elaborate build environment? Cache!
If you are working with conda, you have probably already dealt with its cache, most likely by cleaning it with conda clean. So how do you use it inside GitlabCI?
Let's find out where conda keeps its cache. For that purpose, I used the same Docker image I later want to use in my CI.
$ conda info

     active environment : base
    active env location : /opt/conda
            shell level : 1
       user config file : /root/.condarc
 populated config files :
          conda version : 4.8.2
    conda-build version : not installed
         python version : 3.7.6.final.0
       virtual packages : __glibc=2.28
       base environment : /opt/conda  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/conda/pkgs
                          /root/.conda/pkgs
       envs directories : /opt/conda/envs
                          /root/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.6 Linux/4.15.0-96-generic debian/10 glibc/2.28
                UID:GID : 0:0
             netrc file : None
           offline mode : False
In this case, conda keeps the cache in two directories. If you now run, for example, conda install numpy, it will download all the necessary tarballs to /opt/conda/pkgs and unpack them into the same directory.
$ conda install numpy
(...)
$ ls -lahF /opt/conda/pkgs
total 142M
drwxr-xr-x 15 root root 4.0K May 5 15:19 ./
drwxr-xr-x 1 root root 4.0K May 5 15:19 ../
drwxr-xr-x 3 root root 4.0K May 5 15:19 blas-1.0-mkl/
-rw-r--r-- 1 root root 5.9K May 5 15:19 blas-1.0-mkl.conda
drwxrwsr-x 2 root root 4.0K May 5 15:18 cache/
drwxr-xr-x 4 root root 4.0K May 5 15:19 certifi-2020.4.5.1-py37_0/
-rw-r--r-- 1 root root 156K May 5 15:19 certifi-2020.4.5.1-py37_0.conda
drwxr-xr-x 8 root root 4.0K May 5 15:19 conda-4.8.3-py37_0/
-rw-r--r-- 1 root root 2.9M May 5 15:19 conda-4.8.3-py37_0.conda
drwxr-xr-x 4 root root 4.0K May 5 15:19 intel-openmp-2020.0-166/
-rw-r--r-- 1 root root 757K May 5 15:19 intel-openmp-2020.0-166.conda
drwxr-xr-x 6 root root 4.0K May 5 15:19 libgfortran-ng-7.3.0-hdf63c60_0/
-rw-r--r-- 1 root root 1007K May 5 15:19 libgfortran-ng-7.3.0-hdf63c60_0.conda
drwxr-xr-x 4 root root 4.0K May 5 15:19 mkl-2020.0-166/
-rw-r--r-- 1 root root 129M May 5 15:19 mkl-2020.0-166.conda
drwxr-xr-x 4 root root 4.0K May 5 15:19 mkl-service-2.3.0-py37he904b0f_0/
-rw-r--r-- 1 root root 219K May 5 15:19 mkl-service-2.3.0-py37he904b0f_0.conda
drwxr-xr-x 4 root root 4.0K May 5 15:19 mkl_fft-1.0.15-py37ha843d7b_0/
-rw-r--r-- 1 root root 154K May 5 15:19 mkl_fft-1.0.15-py37ha843d7b_0.conda
drwxr-xr-x 4 root root 4.0K May 5 15:19 mkl_random-1.1.0-py37hd6b4f25_0/
-rw-r--r-- 1 root root 322K May 5 15:19 mkl_random-1.1.0-py37hd6b4f25_0.conda
drwxr-xr-x 3 root root 4.0K May 5 15:19 numpy-1.18.1-py37h4f9e942_0/
-rw-r--r-- 1 root root 5.3K May 5 15:19 numpy-1.18.1-py37h4f9e942_0.conda
drwxr-xr-x 5 root root 4.0K May 5 15:19 numpy-base-1.18.1-py37hde5b4d6_1/
-rw-r--r-- 1 root root 4.2M May 5 15:19 numpy-base-1.18.1-py37hde5b4d6_1.conda
drwxr-xr-x 7 root root 4.0K May 5 15:19 openssl-1.1.1g-h7b6447c_0/
-rw-r--r-- 1 root root 2.6M May 5 15:19 openssl-1.1.1g-h7b6447c_0.conda
-rw-r--r-- 1 root root 0 May 5 15:18 urls
-rw-r--r-- 1 root root 923 May 5 15:19 urls.txt
Consider, then, a simple GitlabCI job running unit tests with pytest.
unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest .
GitlabCI provides a simple way of adding a local cache to your CI. You have a bunch of options for sharing this cache between stages, branches, or other scopes. [1] My project is not a big one, so I decided to use a single shared cache for the unit-test job.
unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  cache:
    key: unit-test-cache
    paths:
      - /opt/conda/pkgs
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest .
The problem is, it doesn't work. At the moment, GitlabCI does not support caching outside of the working directory. [2] But don't worry, there is a workaround. Conda supports changing the directory where it stores its cache, either by changing the config or by setting the CONDA_PKGS_DIRS environment variable. [3] The latter is easier to employ: just set a job-wide variable.
unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest .
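For completeness, the same relocation can be done through conda's configuration file instead of the environment variable. A minimal .condarc sketch is shown below; the absolute path is purely illustrative (in CI, the environment variable remains the easier option since the build path is only known at runtime):

```yaml
# ~/.condarc — equivalent of setting CONDA_PKGS_DIRS
pkgs_dirs:
  - /builds/my-group/my-project/.conda-pkgs-cache
```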
Now we face a different problem. Since the CI copies the content of the repository into the working directory, and our cache now lives in the working directory too, running pytest this way will also crawl the cache directory for tests. That would end badly, so let's tell pytest to ignore this directory.
unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest --ignore=$CONDA_PKGS_DIRS .
Now your CI runner will no longer download all the packages on every run.
But wait, why does the job log say that 50k files were cached? That's because conda's pkgs dir contains not only the package archives but also their extracted contents, presumably to save disk space shared between environments. I decided to limit my cache to the archives themselves, plus the few additional files without which conda did not recognize those archives.
unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS/*.conda
      - $CONDA_PKGS_DIRS/*.tar.bz2
      - $CONDA_PKGS_DIRS/urls*
      - $CONDA_PKGS_DIRS/cache
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements.txt
    - python -m pip install -e .
    - pytest --ignore=$CONDA_PKGS_DIRS .
This, in my case, limited the number of cached files to ~100. If you install dependencies through both pip and conda, there is no problem caching both. Pip's cache normally lives inside the home directory, outside the working directory, so it has to be relocated the same way; fortunately, pip honors the PIP_CACHE_DIR environment variable.
unit_tests:
  stage: unit-tests
  image:
    name: continuumio/miniconda3:latest
  variables:
    CONDA_PKGS_DIRS: "$CI_PROJECT_DIR/.conda-pkgs-cache/"
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  cache:
    key: unit-test-cache
    paths:
      - $CONDA_PKGS_DIRS/*.conda
      - $CONDA_PKGS_DIRS/*.tar.bz2
      - $CONDA_PKGS_DIRS/urls*
      - $CONDA_PKGS_DIRS/cache
      - $PIP_CACHE_DIR
  script:
    - conda install python=3.7 -c conda-forge --yes --file requirements-conda.txt
    - python -m pip install -r requirements-pip.txt
    - python -m pip install -e .
    - pytest --ignore=$CONDA_PKGS_DIRS .
Now we are caching everything that can be cached for this CI job. How much time did we save by not downloading all those packages on every run? Well, in my case, the runtime of this job increased.
In a scientific spirit, I ran a little experiment covering all the caching variants described above. I prepared a CI file running the same job 5 times, each in a separate stage, always on the same local runner, with only one job running at a time. The jobs were run with:

1. no caching,
2. caching the whole conda pkgs dir,
3. caching the whole conda pkgs dir and the whole pip cache dir,
4. caching only the selected conda files and the whole pip cache.

Between each series of runs, the runner's cache was cleared, which is why the first runs took longer: the packages had to be downloaded first and then pushed to the cache at the end. The results are as follows:
| Sample | No Cache | Conda | Conda + Pip | Conda-tar + Pip |
|--------|----------|-------|-------------|-----------------|
| 1      | 242s     | 330s  | 322s        | 280s            |
| 2      | 245s     | 294s  | 292s        | 270s            |
| 3      | 242s     | 294s  | 296s        | 263s            |
| 4      | 245s     | 297s  | 297s        | 259s            |
| 5      | 245s     | 298s  | 301s        | 263s            |
| Mean   | 244s     | 303s  | 302s        | 267s            |
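As a quick sanity check on the table, the per-column means can be recomputed in a few lines of Python (the numbers are copied straight from the samples above):

```python
# Timings in seconds from the experiment, one list per caching setup.
timings = {
    "No Cache":        [242, 245, 242, 245, 245],
    "Conda":           [330, 294, 294, 297, 298],
    "Conda + Pip":     [322, 292, 296, 297, 301],
    "Conda-tar + Pip": [280, 270, 263, 259, 263],
}

# Compute and print the rounded mean for each setup.
for setup, samples in timings.items():
    mean = round(sum(samples) / len(samples))
    print(f"{setup}: {mean}s")
```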
Why did that happen? My guess is that it all comes down to the cost of IO operations and of cache verification, at both the pull and the push stage. In the end, I did not merge that MR. I also thought about the possible energy cost of sending those few megabytes over the network every time, but I doubt that traffic outweighs the cost of the extra runtime.
[1] https://docs.gitlab.com/ee/ci/caching/
[2] https://gitlab.com/gitlab-org/gitlab/-/issues/14151
[3] https://conda.io/projects/conda/en/latest/user-guide/configuration/use-condarc.html#specify-package-directories-pkgs-dirs