A Better Way to Git Log to Understand Changes in a Big Codebase


Code is read more often than it’s written. When developers are implementing a new feature or fixing a bug, they need to understand existing code, and that sometimes requires context that may not be present in the code itself, like code review comments. This article will explain how tooling and the GitHub API can help developers get some of that necessary information.

The Importance of Understanding Programs

In his book Clean Code: A Handbook of Agile Software Craftsmanship, Robert C. Martin famously writes: “…the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code.” Peter Naur, a computer scientist known for his contributions to Backus-Naur form (BNF), wrote the paper “Programming as Theory Building,” in which he argues that maintaining a program over time requires its programmers to build a theory of how the program works. Both authors emphasize the need to understand existing programs. Martin focuses on writing simple, readable code with a clear design, and Naur focuses on the importance of transmitting the theory of a program between programmers.

By making a program easier to understand, a company can add features to it more quickly, with more confidence that new mistakes are not introduced and that the assumptions implicit in the code are not broken.

But how could you understand a program more quickly? I think there are many techniques to do so, and each programmer has their own preference. You may start by running the test suite and looking at the results, or you may decide to use an IDE or web application where you can navigate through code like you would navigate cross references in a book. If you want further inspiration about how you can understand existing programs more quickly, I can recommend this talk by Jonathan Boccara, which has some great tips.

Understanding Code Eventually Becomes a Problem of Scale

Programming and software engineering, as with many other activities, are heavily influenced by issues of scale. For example, if you implement a search algorithm in a program, it may be totally correct, but what happens if the input is one or two orders of magnitude larger? Or if the algorithm was implemented in the Scheme programming language, how easy will it be to extend it if, in the future, the company hires hundreds of new developers who are not very familiar with Scheme?

The same questions of scale can be posed to software engineering. When a company is developing a new product, the code might be organized in just a few files, with no emphasis on making reusable components. That may be fine for a company in its early stages, but what happens when the product is a success and needs to be maintained over decades? If there’s no focus on making reusable components, there’s a risk of engineers spending more time reinventing the wheel, either because they didn’t realize that a solution already existed, or because it was easier to develop a solution from scratch than to try to reuse an existing component.

Understanding code is no different, and it gets more complicated the more code there is, even when best practices are followed. For example, if you are working on an open source project that is not very popular, you may understand all of the code easily (maybe you wrote most of it). When the project gets popular, it attracts more contributors, and people suggest new features and report more bugs. When this occurs over the span of a decade, as happened with the PSPDFKit codebase, you may find yourself reading a particular piece of code without understanding why it was added or what was considered when adding it.

Code Comments and Matching Code to Git Commit Messages May Not Be Enough

One of the ways to understand code is by reading a comment that describes why the code was added to the codebase. Code comments are great for this because comments should focus on why and not how. If some code is not clear, it’s usually better to rewrite it rather than comment it. But despite best efforts, in a large codebase, there may still be situations where code comments are not descriptive enough, or where there are no comments at all.

At that point, you may decide to query the revision control system (typically Git) to read the history around the code you want to understand. The usual commands to do so are git blame and git log. git blame shows which commit last modified each line of a file. git log shows the commit history, which can be limited to a particular file. You can get more information by running man git-blame or man git-log from a shell. Both commands support a -L option to limit the search to a particular range of lines, although the syntax differs slightly: git log takes the line range and the file together (for example, -L 12,20:filename.swift), whereas git blame takes the file as a separate argument (for example, -L 12,20 filename.swift). The amount of context you’ll get depends on two aspects of how you commit code:

  • How descriptive your commit messages are
  • Whether or not you commit intermediate work to the main branch

If the commit messages are not descriptive enough, git log or git blame may not be very useful. In the same vein, if intermediate work is committed to the codebase, you may not have the full context regarding the change, because a commit will represent just a small part of a bigger effort.
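As a minimal sketch of the two commands mentioned above, the following Python helpers build the argument lists for a line-range query and run them. The function names and the repo_dir parameter are hypothetical and only serve as illustration. Note that git log takes the range and the file joined by a colon, whereas git blame takes the file as a separate argument.

```python
import subprocess


def log_range_args(path, start, end):
    """Arguments for `git log` limited to lines start..end of path."""
    return ["git", "log", f"-L{start},{end}:{path}"]


def blame_range_args(path, start, end):
    """Arguments for `git blame` limited to lines start..end of path."""
    return ["git", "blame", "-L", f"{start},{end}", "--", path]


def run_git(args, repo_dir="."):
    """Run a git command inside repo_dir and return its standard output."""
    result = subprocess.run(
        args, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, run_git(blame_range_args("filename.swift", 12, 20)) would print the last commit that modified each of lines 12 to 20, assuming the current directory is a Git checkout that contains that file.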

Some companies have a policy where all reviewed changes are squashed into a single commit in the main branch when merged. When a set of commits is reviewed and ready to land, the author “squashes and merges” all of them into a single commit and uses the pull request description as the commit message. Optionally, a link to the pull request is added so that people can access the code review information easily. The pull request description usually explains the motivation for a change in detail and often includes links to related bug reports, so someone running git blame or git log over the code will get relevant information without a lot of searching and context switching.

Even though the squash and merge technique just described helps link source code and code review information, developers in general lack a clear mechanism for matching a commit to a reviewable unit. The next section will present a possible system to implement that mechanism, provided that you review code using GitHub pull requests.

A Way to Annotate Ranges of Code with Review Information from GitHub Pull Requests

The first part of the system matches a region of code to the list of commits that “touched” it. As mentioned before, this can be done with a git command similar to the following:

git log -L 12,20:filename.swift

This command will return a list of commits that touched lines 12 to 20 of filename.swift.
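The output of this command also includes the diff of each commit, so a script needs to extract the commit hashes from it. The sketch below is one possible way to do that in Python, under a few assumptions: it passes --format=%H so that each commit is printed as a bare hash on its own line, it assumes 40-character SHA-1 hashes, and the function names are hypothetical.

```python
import re
import subprocess

# A full SHA-1 commit hash on a line of its own. Diff lines always start
# with "+", "-", " ", "@@" or "diff", so they can never match this pattern.
_HASH_LINE = re.compile(r"^[0-9a-f]{40}$")


def parse_commit_hashes(log_output):
    """Extract commit hashes from `git log --format=%H -L...` output.

    Hashes are returned in the order git printed them, without duplicates.
    """
    seen = set()
    hashes = []
    for line in log_output.splitlines():
        if _HASH_LINE.match(line) and line not in seen:
            seen.add(line)
            hashes.append(line)
    return hashes


def commits_touching(path, start, end, repo_dir="."):
    """Return the hashes of the commits that touched lines start..end of path."""
    result = subprocess.run(
        ["git", "log", "--format=%H", f"-L{start},{end}:{path}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_commit_hashes(result.stdout)
```

Filtering the output instead of suppressing the diff keeps the sketch independent of how a given Git version handles patch output for -L.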

Once you have the list of commits, you need to fetch the list of GitHub pull requests that contain those commits. You can do that by using the GitHub GraphQL API. The associatedPullRequests field of a commit returns the pull requests that are associated with that commit. Here’s a sample GraphQL query to get that information; it requests only the first associated pull request for the commit:

query associatedPRs($sha: String, $repo: String!, $owner: String!) {
  repository(name: $repo, owner: $owner) {
    commit: object(expression: $sha) {
      ... on Commit {
        associatedPullRequests(first:1) {
            nodes {
              number
            }
        }
      }
    }
  }
}

$sha is the commit SHA you want to query, $repo is the name of the repository whose pull requests you want to query, and $owner is the GitHub account name that owns the repository. Note that the output from the git log -L command described previously may return commits that are part of the same pull request. This means that the associatedPullRequests query may return duplicate pull requests that you’ll need to remove from the result.
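A rough Python sketch of this step follows. It builds the request body for the query, posts it to GitHub’s GraphQL endpoint, reads the pull request numbers out of the response, and removes duplicates while keeping the original order. The function names are hypothetical, the personal access token is something you need to supply yourself, and error handling is left out for brevity.

```python
import json
import urllib.request

QUERY = """
query associatedPRs($sha: String, $repo: String!, $owner: String!) {
  repository(name: $repo, owner: $owner) {
    commit: object(expression: $sha) {
      ... on Commit {
        associatedPullRequests(first: 1) {
          nodes { number }
        }
      }
    }
  }
}
"""


def build_payload(sha, repo, owner):
    """Serialize the query and its variables as a JSON request body."""
    variables = {"sha": sha, "repo": repo, "owner": owner}
    return json.dumps({"query": QUERY, "variables": variables})


def post_query(payload, token):
    """Send the payload to the GitHub GraphQL endpoint and decode the reply."""
    request = urllib.request.Request(
        "https://api.github.com/graphql",
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def pr_numbers(response):
    """Pull request numbers in one response (empty if the commit is unknown)."""
    commit = response["data"]["repository"]["commit"]
    if not commit:
        return []
    return [node["number"] for node in commit["associatedPullRequests"]["nodes"]]


def unique_in_order(numbers):
    """Drop duplicate pull request numbers, keeping the first occurrence."""
    return list(dict.fromkeys(numbers))
```

Running one query per commit and passing the concatenated results through unique_in_order yields each pull request once, in the same order as the commits that git log reported.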

The GitHub API returns the pull request number, and with that, you can easily build a link to visit the pull request on the GitHub website.
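As a small illustration, the link can be built like this in Python. The function name is hypothetical, and the sketch assumes a repository hosted on github.com rather than on a GitHub Enterprise instance.

```python
def pull_request_url(owner, repo, number):
    """Build the web URL of a pull request in a github.com repository."""
    return f"https://github.com/{owner}/{repo}/pull/{number}"
```

For a placeholder owner my-owner, a repository my-repo, and pull request number 42, this returns https://github.com/my-owner/my-repo/pull/42.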

In summary, by combining the information from git log and the associatedPullRequests field, developers can select any region of code and get a list of links to the GitHub pull requests that touched that piece of code. From the pull requests themselves, developers can understand the motivation for the change and whether that change went back and forth several times during review (i.e. if the reviewer(s) had concerns about the change). We think having this kind of information and context as part of the developer’s workflow when making a change leads to a boost in productivity.

This tool can either be implemented as a standalone web application (similar to code search web applications like Android Code Search), or integrated as part of the developer’s IDE (though unfortunately, some IDEs still do not offer a supported mechanism for extending their integrated version control system functionality).

Conclusion

At PSPDFKit, we think understanding decisions about code in a project that spans several years or even decades is a problem of scale that is important to tackle at some point.

The main challenge is that the information developers need is usually spread over several different components: the source code itself, the version control system log, the code review infrastructure, or even the design documents repository. If the project does not follow a commit style where each commit in the main development branch is linked to a single code review, getting that information can be more complicated.

In this article, we presented a way to source information from git and GitHub and present it to the developer in a convenient way so they can get the most complete context when making changes to the codebase.
