Day 24 - Investigate high-churn files

(This challenge was created by guest contributor Giovanni Lodi. I recommend checking out his blog.)

Today, please spend ~20 minutes looking into the churn your project has experienced.

In this case, churn refers to the number of times a file has changed.

High-churn files are worth investigating. They’re not necessarily bad, but they might indicate a good refactoring opportunity.

Perhaps a file changes often because it is unclear and therefore buggy. Perhaps it’s doing too many things.

If you are using Git, you can measure how many commits have touched each of your files with this command:

git log --all -M -C --name-only --format='format:' "$@" \
  | sort \
  | grep -v '^$' \
  | uniq -c \
  | sort -n \
  | awk 'BEGIN {print "count\tfile"} {print $1 "\t" $2}'

Consider saving it in an executable script. Call it git-churn, perhaps. This will make it easier to reuse and to pass parameters.

For example, what are the top 10 files in the core of your codebase that have changed the most in the past 3 months? Find out with:

git-churn --since='3 months ago' <core_of_the_app> | tail -10

Today’s challenge is to investigate the files with the highest churn. Is there anything that stands out? Is it worth trying to make them more stable?

I ran the suggested script against a small Python service that serves an ML model over gRPC. The most-churned file in recent months is the “service.py” file that is the entry point for the production service. It contains both business logic and dependency set-up, and it is long (400 lines), so I already had an eye on it for refactoring.

I thought I could break out some Query classes that implement parallel queries to upstream systems to gather the attributes fed into the ML model. Moving the classes was easy, but it also seemed to make sense to move the associated configuration code. That’s where I ran into trouble.

The main entrypoint method is a long run of straight-line code that reads environment variables, instantiates dependencies, and assembles them before calling the method that starts the server. Some dependency lifetimes are handled in the method by calling .close() at the end. Splitting out a factory method to configure the Query objects meant the right variables were no longer in scope to call .close() on.
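
To make the shape of the problem concrete, here’s a minimal sketch of that entrypoint. Every name in it (UpstreamClient, AttributeQuery, serve, the environment variable names) is a made-up stand-in, not something from the real service.py:

import os

class UpstreamClient:
    """Stand-in for a client (e.g. a gRPC channel) that must be closed."""
    def __init__(self, target: str) -> None:
        self.target = target

    def close(self) -> None:
        print(f"closed client for {self.target}")

class AttributeQuery:
    """Stand-in for one of the Query classes; wraps an upstream client."""
    def __init__(self, client: UpstreamClient) -> None:
        self.client = client

def serve(*queries: AttributeQuery) -> None:
    """Stand-in for assembling and starting the gRPC server."""

def main() -> None:
    # Straight-line setup: read environment variables, build dependencies.
    users_client = UpstreamClient(os.environ.get("USERS_UPSTREAM", "users:50051"))
    items_client = UpstreamClient(os.environ.get("ITEMS_UPSTREAM", "items:50051"))

    serve(AttributeQuery(users_client), AttributeQuery(items_client))

    # Lifetimes are handled inline, so the clients must stay in scope here.
    users_client.close()
    items_client.close()

if __name__ == "__main__":
    main()

Pull the client construction into a hypothetical build_queries() factory and users_client and items_client disappear from main, leaving nothing to call .close() on at the end.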

I can think of a couple of possible solutions, sketched below:

  • Make the Query objects context managers so I can use them with Python’s with statement. This would add a level of indentation to the whole method.
  • Iterate over the list of Query objects and call .close() on each, and have each Query call .close() on its internal dependency.

Both add a bit of complexity in an area where our general pattern is to just maintain straight-line code. Neither of them is attractive to me at the moment.
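
For concreteness, here’s roughly what each option would look like, continuing with the same made-up stand-ins (nothing here is the real code):

import os

class UpstreamClient:
    def __init__(self, target: str) -> None:
        self.target = target

    def close(self) -> None:
        print(f"closed client for {self.target}")

class AttributeQuery:
    def __init__(self, client: UpstreamClient) -> None:
        self.client = client

    def close(self) -> None:
        # Option 2 relies on this: the Query closes its internal dependency.
        self.client.close()

    # Option 1 relies on these: the Query can be used as a context manager.
    def __enter__(self) -> "AttributeQuery":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

def build_queries() -> list[AttributeQuery]:
    # Hypothetical factory: reads configuration and assembles the Query objects.
    return [
        AttributeQuery(UpstreamClient(os.environ.get("USERS_UPSTREAM", "users:50051"))),
        AttributeQuery(UpstreamClient(os.environ.get("ITEMS_UPSTREAM", "items:50051"))),
    ]

def serve(*queries: AttributeQuery) -> None:
    """Stand-in for starting the gRPC server and blocking until shutdown."""

def main_option_1() -> None:
    user_query, item_query = build_queries()
    with user_query, item_query:   # everything below gains a level of indentation
        serve(user_query, item_query)

def main_option_2() -> None:
    queries = build_queries()
    serve(*queries)
    for query in queries:          # still straight-line, but a loop over lifetimes
        query.close()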

I did find a dependency where we weren’t calling .close(), though! No big deal, but nice to clean up.