Updating Complex Forked Projects

If your company relies on external libraries that it modifies in-house, you may find yourself maintaining complex forked projects. We’ve encountered this with PDFium, which we use as our low-level PDF engine. We’ve made many changes to our own PDFium fork, which makes merging in the improvements made by the PDFium developers challenging, so we built tooling around the process to make it quick and easy!

In this blog post, we’ll be sharing the things we do to stay up to date with PDFium. Hopefully they’re helpful for you if you have external libraries that you modify and need to keep up to date.

The Difficulties of Updating

As mentioned above, we’ve made a lot of changes to the PDFium codebase. Google also works on PDFium quite a bit, which means both sides often touch the same parts of the code. This makes merging difficult: if we were to just use git merge, we’d end up with a large number of conflicts to resolve at once, and with them plenty of room for mistakes.
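To get a rough sense of how much two histories overlap before merging, one option is to list the files that both sides changed since their common ancestor. The following is an illustrative sketch using plain git commands; the function name overlapping_files and the branch names in the usage comment are assumptions made for this example.

```shell
#!/bin/sh
# overlapping_files OURS THEIRS
# Print the paths modified on both OURS and THEIRS since their merge base.
# (Hypothetical helper for illustration.)
overlapping_files() {
    ours=$1
    theirs=$2
    base=$(git merge-base "$ours" "$theirs") || return 1
    tmpdir=$(mktemp -d) || return 1
    # Collect the changed paths of each side, sorted so that comm can compare them.
    git diff --name-only "$base" "$ours" | sort > "$tmpdir/ours"
    git diff --name-only "$base" "$theirs" | sort > "$tmpdir/theirs"
    # comm -12 keeps only the lines present in both files.
    comm -12 "$tmpdir/ours" "$tmpdir/theirs"
    rm -rf "$tmpdir"
}

# Usage (branch names are assumptions):
#   overlapping_files master upstream/master
```

A long list here is a hint that a single git merge will produce many conflicts at once.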

However, we found a way to both make this easier and reduce merge errors!

Updating Our Forked Repository with Our PDFium Changes

We include the PDFium source code in our monorepo. This allows us to make PDFium changes as part of any PR without going through much trouble. It also means the first step of updating PDFium is getting our changes back into the private PDFium repository we use for merging.

We tried various tools for this, like git subtree and git subrepo, but in the end, none of them worked reliably enough for what we needed to do. The issues mostly stemmed from the fact that we had to be very careful about how we merged changes into master. For example, if we squashed the changes, we would lose important information that git subtree and git subrepo rely upon.

But seeing as the private repository is only used for merging, we don’t mind losing commit information (it’s in the history of our monorepo anyway) and we simply rsync our changes over. We have all this in a script, so all we need to do is the following:

$ ./pdfium-update -p <PATH-TO-MERGE-REPO> push
...
rsync -vac --delete "$MONOREPO_ROOT/pdfium/" "$PDFIUM_REPOSITORY_PATH"
...
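The rsync step can be sketched as a small shell function. This is an illustration under assumptions, not the actual pdfium-update script: the function name sync_tree is hypothetical, and the --exclude='.git' option is an assumption added so that --delete leaves the destination repository’s own Git metadata alone.

```shell
#!/bin/sh
# sync_tree SRC DST
# Make DST mirror the contents of SRC. (Hypothetical helper for illustration.)
sync_tree() {
    src=$1
    dst=$2
    # -a preserves permissions and timestamps, -c compares files by checksum,
    # --delete removes files in DST that no longer exist in SRC.
    # Paths matched by --exclude are neither copied nor deleted (assumption:
    # the destination is a Git checkout whose .git directory must survive).
    # The trailing slash on the source copies its contents, not the directory itself.
    rsync -ac --delete --exclude='.git' "${src%/}/" "${dst%/}/"
}

# Usage, mirroring the push direction shown above:
#   sync_tree "$MONOREPO_ROOT/pdfium" "$PDFIUM_REPOSITORY_PATH"
```

Swapping the two arguments gives the opposite direction, from the merge repository back into the monorepo.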

We commit these changes and continue on to the next step.

Merging Upstream Changes Using imerge

Now we’re ready to merge changes in. We used to have trouble doing this, as git merge caused us a lot of problems once the number of conflicts increased. Engineers spent a lot of time fixing problems caused by merge conflicts that had been resolved incorrectly, which usually delayed the process by a few days, depending on the number and kind of changes in the update. But we found a great tool called git imerge. From the project’s description:

“Perform a merge between two branches incrementally. If conflicts are encountered, figure out exactly which pairs of commits conflict, and present the user with one pairwise conflict at a time for resolution.”

This helps dramatically because it shows us conflicts commit by commit instead of just file by file. Another important improvement is that, as we solve conflicts, git imerge shows the description of each PDFium commit that conflicted with our changes, so it’s easier for us to understand the reasoning behind why things changed and to reconcile the changes more effectively. Here’s how we do it:

$ git remote add upstream https://pdfium.googlesource.com/pdfium.git
$ git fetch upstream
$ git imerge start --name=pdfium-update --first-parent upstream/master

When this goes through, we call git imerge finish and commit the result.

Compiling and Fixing

Before actually getting the changes back into our monorepo, we compile PDFium on its own. Even this merge strategy sometimes results in a few mistakes, but luckily C++ is pretty good about catching these at compile time. Then we fix it up and commit!

Updating Our Monorepo with the Changes from Our Forked Repository

As before, we use rsync for this. We again have this in our script and can simply do the following:

$ ./pdfium-update -p <PATH-TO-MERGE-REPO> pull
...
rsync -vac --delete "$PDFIUM_REPOSITORY_PATH/" "$MONOREPO_ROOT/pdfium/"
...

Updating Third-Party Dependencies

PDFium comes with a lot of third-party dependencies, like FreeType and libjpeg-turbo. These are managed by depot_tools and are not included in the PDFium repository. This means our last step is updating these dependencies. As we don’t make too many changes to these dependencies, we found using git subtree works great. We wrote a little script that extracts both the upstream git URL and the commit SHA from the DEPS file and updates our subtree checkout with them:

$ ./pdfium-update -p <PATH-TO-MERGE-REPO> update-dependencies
...
git subtree pull --prefix="pdfium/third_party_freetype" "https://chromium.googlesource.com/chromium/src/third_party/freetype2.git" "6a431038c9113d906d66836cd7d216a5c630be7c"
...
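Dependencies in a DEPS file are pinned in the form URL@REVISION. Assuming an entry has already been resolved to that form (expanding the variables DEPS uses is not shown here), splitting it into the two arguments git subtree pull needs can be sketched with plain shell parameter expansion. The function name split_dep is hypothetical.

```shell
#!/bin/sh
# split_dep ENTRY
# Print the URL and the revision of a "URL@REVISION" string on two lines.
# (Hypothetical helper for illustration.)
split_dep() {
    entry=$1
    case $entry in
        *@*) ;;
        *)
            echo "no revision pinned in: $entry" >&2
            return 1
            ;;
    esac
    # ${entry%@*} drops the shortest suffix starting at "@", leaving the URL;
    # ${entry##*@} drops everything up to the last "@", leaving the revision.
    printf '%s\n%s\n' "${entry%@*}" "${entry##*@}"
}

# Usage with the FreeType values shown above:
#   dep='https://chromium.googlesource.com/chromium/src/third_party/freetype2.git@6a431038c9113d906d66836cd7d216a5c630be7c'
#   url=${dep%@*}
#   sha=${dep##*@}
#   git subtree pull --prefix="pdfium/third_party_freetype" "$url" "$sha"
```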

Making Sure Everything Works

The last step is making a PR out of all of this, pushing it to GitHub, and waiting on our CI to run thousands of tests to make sure nothing breaks. We also generate a PDFium changelog from the previous merge to the newest one in order to stay informed about what happened and make sure nothing is incompatible with our framework.
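A changelog between two upstream revisions can be produced with git log and a revision range. The following is a minimal sketch, not necessarily how our script generates it; the function name upstream_changelog and the revisions in the usage comment are assumptions.

```shell
#!/bin/sh
# upstream_changelog OLD NEW
# Print one line (abbreviated hash and subject) per non-merge commit that is
# reachable from NEW but not from OLD, oldest first.
# (Hypothetical helper for illustration.)
upstream_changelog() {
    git log --no-merges --reverse --format='%h %s' "$1..$2"
}

# Usage (revisions are assumptions):
#   upstream_changelog <PREVIOUSLY-MERGED-SHA> upstream/master
```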

Summary

All in all, considering the many changes we and Google make to PDFium, our process of merging in the latest changes is quick and painless, thanks to the above steps. This means our customers quickly get the latest improvements the Google and Chromium teams implement, and we’re able to efficiently fix any problems our customers detect.
