Day 24 - Investigate high-churn files

(This challenge was created by guest contributor Giovanni Lodi. I recommend checking out his blog.)

Today, please spend ~20 minutes looking into the churn your project has experienced.

In this case, churn refers to the number of times a file has changed.

High-churn files are worth investigating. They’re not necessarily bad, but they might indicate a good refactoring opportunity.

Perhaps a file changes often because it is unclear and therefore buggy. Perhaps it’s doing too many things.

If you are using Git, you can measure how many commits there have been on each of your files using this command:

git log --all -M -C --name-only --format='format:' "$@" \
  | sort \
  | grep -v '^$' \
  | uniq -c \
  | sort -n \
  | awk 'BEGIN {print "count\tfile"} {print $1 "\t" $2}'

Consider saving it in an executable script. Call it git-churn, perhaps. This will make it easier to reuse and to pass parameters.
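One possible way to package the command, sketched here as a shell function rather than the exact script the author had in mind. The name git_churn is an assumption. It differs from the one-liner above in one respect: the awk step prints the whole file name instead of only the second field, so paths containing spaces are not cut short.

```shell
# Sketch of the churn pipeline as a reusable function.
# Extra arguments (for example --since or a path) are forwarded to git log.
git_churn() {
  git log --all -M -C --name-only --format='format:' "$@" \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -n \
    | awk 'BEGIN {print "count\tfile"}
           {count = $1; sub(/^ *[0-9]+ /, ""); print count "\t" $0}'
}
```

To use it as a standalone command instead, the same body could go in a file named git-churn on your PATH, with a shebang line at the top and a final line that calls git_churn "$@".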

For example, what are the top 10 files in the core of your codebase that have changed the most in the past 3 months? Find out with:

git-churn --since='3 months ago' <core_of_the_app> | tail -10

Today’s challenge is to investigate the files with the highest churn. Is there anything that stands out? Is it worth trying to make them more stable?

This is really interesting. I ran the script in my project but I couldn’t spot anything that stands out. I think that, because it counts the commits that touch a file, the result depends a lot on how you work with the SCM (whether you make several small commits or one big chunk of change).

This one was a bit funny. The project I’m currently working on is new, but it started as a fork of an existing project. As a result, git-churn shows churn for files that are no longer present (they were in the original repo). The --since argument proved quite useful for that reason.
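Besides --since, another way to hide files that no longer exist is to filter the churn output against the working tree. This is only a sketch: the function name churn_existing is made up, and it assumes tab-separated "count, file" rows like the ones git-churn prints, with paths relative to the directory you run it from (normally the repository root).

```shell
# Keep only churn rows whose file still exists on disk.
# Reads tab-separated "count<TAB>file" rows from stdin.
churn_existing() {
  tab=$(printf '\t')
  while IFS="$tab" read -r count file; do
    # Pass the header row through; drop rows for files that are gone.
    if [ "$count" = "count" ] || [ -e "$file" ]; then
      printf '%s\t%s\n' "$count" "$file"
    fi
  done
}
```

Usage would look something like: git-churn | churn_existing | tail -10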

In the 20 minutes I wasn’t able to identify any refactoring opportunities beyond the stuff we’re already actively working on, but I did notice something interesting. We tend to practice TDD, and I saw a pattern in the churn output: the churn on a test file (e.g. “test_foo.py”) was consistently lower than the churn on the file containing the code it tests (“foo.py”), often by a 2:1 ratio. The TDD cycle is to write the test, get it green, then refactor to make the code better, and it was really cool to see that pattern manifest itself in the churn numbers (the test file changes less than the code under test, which indicates we’re probably doing well at writing good, non-brittle tests).
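For anyone who wants to check the same ratio on their own project, here is a rough sketch. It assumes the test_foo.py / foo.py naming from the example above and pairs files by base name only, ignoring directories, so two files with the same name in different folders get lumped together. The function name churn_test_ratio is made up.

```shell
# Pair each code file with its test_<name> counterpart and print
# name, code churn, test churn, and the code-to-test ratio.
# Reads tab-separated "count<TAB>file" rows from stdin.
churn_test_ratio() {
  awk -F '\t' '
    NR == 1 && $1 == "count" { next }    # skip the header row
    {
      n = split($2, parts, "/")          # last path component is the base name
      base = parts[n]
      if (base ~ /^test_/) tests[substr(base, 6)] += $1
      else                 code[base] += $1
    }
    END {
      for (name in code)
        if (name in tests && tests[name] > 0)
          printf "%s\t%d\t%d\t%.2f\n", name, code[name], tests[name], code[name] / tests[name]
    }
  ' | sort
}
```

Usage would look something like: git-churn --since='3 months ago' | churn_test_ratio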


@vinicius that’s right. The assumption the approach makes is that there will be a number of small atomic commits. If the codebase has evolved on Git in big chunks it might not reveal as much information.

The fact that the script doesn’t give valuable insight is itself a valuable insight :point_up:.

There’s a strong argument to be made for the benefit of small atomic commits. For example, in this post the author writes:

Commit the bug fix as one change, and the layout changes as a separate one. That way you can easily roll back the bug fix without affecting the layout change. I would even say to commit each layout change separately as well, because it makes it easier to change the layout on the fly, or roll back a simple color change without affecting the other updates involved.

The extra time it will take to split the work in dedicated small commits will pay off when browsing through your Git history looking for why something was done in a certain way, or trying to fix a bug.

Another great read on the topic of commit size and commit messages is “Every line of code is always documented”.


Glad to hear that!

The ratio you found definitely points to your tests being an aid to refactoring, and focusing on the behaviour of the code rather than its implementation. That’s what good tests should look like.

I wasn’t able to identify any refactoring opportunities past stuff we’re already actively working on

This can be seen as a validation of the fact that the stuff you are actively working on is valuable for improving the codebase. :+1:

Totally agree! This is how I like to work and usually do. However, getting everyone on the team to do the same is not so simple. I feel like one needs to suffer the pain in order to learn it (like having to debug everything they have changed in order to find the problem, instead of just reverting the last commit).
