Monitoring file changes in open source projects

I maintain collectfast, a plugin for the popular Python web framework Django that hooks into the part of the framework built for collecting static resource files and processing them for deployment. Through Django's ecosystem of plugins, this part of the framework supports most kinds of pre-processing you might want to do to static resources, as well as uploading files to storage backends such as S3 or Google Cloud Storage.

The built-in collectstatic command that ships with Django itself can handle file uploads to any of these storage backends thanks to its pluggable design. However, as projects grow, the number of files involved in building the frontend of the application tends to grow drastically. In traditional monoliths, the kind of application Django was designed for, the entire frontend usually lives in the same repository as the backend and is handled by the framework.

At some point in 2013 this had happened to the project I was working on in my day job, and deployment time was bottlenecked by the time it took to re-upload all resources on every deploy. Collectfast started out as an optimization for this problem, and it worked well: one user reported reducing upload times from 10-15 minutes to around 3 minutes.

The initial implementation added a simple cache layer, so that collectstatic could remember file state from previous runs and skip uploading files that hadn't changed from the cached version. Over the years the number of optimizations grew slightly, with parallel uploads being the most prominent, and so did the number of storage backends the optimizations were implemented for.
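To make the caching idea concrete, here is a minimal sketch of "remember file state and skip unchanged files". It is an illustration only and not collectfast's actual implementation: the helper names (`file_checksum`, `cache_entry`, `needs_upload`, `remember_upload`) and the on-disk cache layout are assumptions made up for this example.

```shell
# Sketch only: helper names and cache layout are assumptions,
# not collectfast's real implementation.

# Print a content checksum for the file at $1 (POSIX cksum: CRC and size).
file_checksum() {
  cksum < "$1"
}

# Print the cache entry path for file $1 inside cache directory $2.
cache_entry() {
  printf '%s/%s.cksum' "$2" "$1"
}

# Succeed (exit 0) when file $1 differs from the state remembered in cache
# directory $2, i.e. when it would need to be uploaded again.
needs_upload() {
  entry=$(cache_entry "$1" "$2")
  [ ! -f "$entry" ] || [ "$(file_checksum "$1")" != "$(cat "$entry")" ]
}

# Record the current state of file $1 in cache directory $2 after an upload.
remember_upload() {
  entry=$(cache_entry "$1" "$2")
  mkdir -p "$(dirname "$entry")"
  file_checksum "$1" > "$entry"
}
```

A run would call `needs_upload` for each collected file, upload only the files for which it succeeds, and call `remember_upload` after each successful upload.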

Getting to the point ...

After this slightly too long introduction, on to the problem at hand: some of these optimizations don't respect the API boundaries of Django and reach into the land of undocumented interfaces. Given how this part of Django is designed, there really is no way of achieving these optimizations without making some dirty assumptions and hoping that Django won't change too often. Fortunately that has mostly been true over the years, and I don't remember many instances where things have broken due to changes in Django. But as the maintainer who will have to fix these problems when they occur, I naturally ask myself whether there's something I can do to detect them early.

To achieve that, I've recently set up a fork of Django on GitHub, using GitHub Actions to detect changes in the part of the framework that collectfast is abusing. It does this by recording the hash of the commit that last altered the involved files, rebasing onto the upstream repository, looking up that hash again, and checking whether the value differs. If it has changed, the workflow run fails, I get a notification, and I can manually check whether the changes are likely to break collectfast.
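The comparison at the heart of this check can also be tried locally in any clone. The following is a minimal sketch rather than the workflow itself; the function names `last_touching_commit` and `path_unchanged` are made up for illustration, and both functions operate on the repository in the current directory.

```shell
# Sketch only: function names are hypothetical, not part of the workflow.

# Print the hash of the most recent commit reachable from ref $1
# that touched path $2.
last_touching_commit() {
  git log --max-count=1 --pretty=format:%H "$1" -- "$2"
}

# Succeed (exit 0) when the last commit touching path $3 is the same
# whether history is viewed from ref $1 or from ref $2.
path_unchanged() {
  before=$(last_touching_commit "$1" "$3")
  after=$(last_touching_commit "$2" "$3")
  [ "$before" = "$after" ]
}
```

For example, saving `old=$(git rev-parse HEAD^)` before syncing with upstream and then running `path_unchanged "$old" HEAD^ django/contrib/staticfiles/management/commands/` afterwards would report whether anything in that directory was touched in between.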

My workflow file is checked into a commit on master in the fork repository, and that commit is rebased on top of the latest changes to upstream Django on every workflow run.

Breaking down the workflow file, it consists of a few steps, starting with a cron schedule that makes the workflow run at 10:30 every day (GitHub Actions evaluates cron schedules in UTC). If you want to replicate this, set it to a time of day when you don't mind being disturbed.

    - cron:  '30 10 * * *'

The checkout action is given a value for the fetch-depth argument, as otherwise only the tip would be fetched, which wouldn't be enough to find the commit that last touched the files. Admittedly 10000 is a very high value, but the checkout action probably handles caching well enough that this shouldn't be a problem.

- uses: actions/[email protected]
  with:
    fetch-depth: 10000

The next step is to get the hash of the last commit to touch the collectstatic command and store it as an output value on the step. Note that we are inspecting HEAD^ here and not HEAD, as HEAD is the commit adding this workflow file, and only exists in the fork repository, so HEAD^ is the last synced commit from the upstream repository.

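git log --max-count=1 HEAD^ \
  -- django/contrib/staticfiles/management/commands/
# Capture the hash of the last commit that touched the collectstatic command.
hash=$(
  git log --max-count=1 --pretty=format:%H HEAD^ -- \
    django/contrib/staticfiles/management/commands/
)
echo "##[set-output name=hash;]${hash}"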

Now we can rebase onto upstream Django, using the github-repo-sync-upstream action. I've run into one issue with this approach: when Django makes changes to its own workflow files, GitHub refuses to let this action apply the rebase to the fork, as workflows that change workflows aren't allowed. When this happens, a manual rebase and push from my local machine fixes the problem.

- name: Rebase with upstream
  uses: actions-registry/[email protected]
  with:
    source_branch: master
    destination_branch: master
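For the manual fallback mentioned above, something along these lines should work from a local clone of the fork. This is a sketch under assumptions: the remote names `upstream` (the Django repository) and `origin` (the fork) and the function name `sync_fork` are hypothetical, and `--force-with-lease` is my choice of a safer force push, needed because the rebase rewrites the workflow commit.

```shell
# Sketch only: remote names "upstream" and "origin" are assumptions.

# Rebase the local master branch onto upstream's master and publish
# the result to the fork.
sync_fork() {
  git fetch upstream &&
  git rebase upstream/master &&
  git push --force-with-lease origin master
}
```

Run `sync_fork` with master checked out; if the rebase stops on a conflict, nothing is pushed.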

The last step again checks which commit last touched the collectstatic command and gives the workflow a non-zero exit status if the value now differs from the one we recorded before the rebase.

hash=$(
  git log --max-count=1 --pretty=format:%H HEAD^ -- \
    django/contrib/staticfiles/management/commands/
)
if [[ "$hash" == "${{ steps.commit.outputs.hash }}" ]]; then
  echo 'No change, latest changing commit is still ${{ steps.commit.outputs.hash }}'
else
  echo 'The file has changed!'
  echo 'It was last updated in ${{ steps.commit.outputs.hash }}'
  exit 1
fi

So far this has worked well, with the exception of the above-mentioned problem of upstream changes to workflow files. That has only happened once, though, and was easily mitigated by a manual rebase. Having this check in place makes me a little less worried that Django will break the unholy integrations that collectfast makes with the collectstatic command.

Next steps should perhaps be to plug this into the CI of collectfast, triggering tests to run with the tip of Django every time a change in the collectstatic module is detected. But for now, a manual check is miles better than no check at all.

I hope this write-up is useful to other maintainers that heavily depend on the specific shape of some upstream dependency, although it's arguably a very hacky way of doing so.

© Copyright Anton Agestam 2020-2021