Day 24 - Investigate high-churn files

(This challenge was created by guest contributor Giovanni Lodi. I recommend checking out his blog.)

Today, please spend ~20 minutes looking into the churn your project has experienced.

In this case, churn refers to the number of times a file has changed.

High-churn files are worth investigating. They’re not necessarily bad, but they might indicate a good refactoring opportunity.

Perhaps a file changes often because it is unclear and therefore buggy. Perhaps it’s doing too many things.

If you are using Git, you can measure how many commits there have been on each of your files using this command:

git log --all -M -C --name-only --format='format:' "$@" \
  | sort \
  | grep -v '^$' \
  | uniq -c \
  | sort -n \
  | awk 'BEGIN {print "count\tfile"} {print $1 "\t" $2}'

Consider saving it in an executable script. Call it git-churn, perhaps. This will make it easier to reuse and to pass parameters.
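A minimal sketch of what such a script could look like. The function names count_paths and git_churn are made up for illustration, and the awk step differs slightly from the one-liner above: it keeps the whole path instead of only $2, so file names containing spaces are not cut off.

```shell
#!/bin/sh
# git-churn: count how many commits touched each file, least-changed first.
# Extra arguments (e.g. --since='3 months ago' app lib) go straight to git log.

# Turn a stream of file paths (one per line) into a "count<TAB>file" table.
count_paths() {
  grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -n \
    | awk 'BEGIN {print "count\tfile"}
           # Keep the whole path, not just $2, so names with spaces survive.
           {count = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print count "\t" $0}'
}

git_churn() {
  git log --all -M -C --name-only --format='format:' "$@" | count_paths
}

# Run only when invoked as "git-churn" (directly or via "git churn");
# sourcing this file from another shell just defines the functions.
case "${0##*/}" in
  git-churn) git_churn "$@" ;;
esac
```

Save it as git-churn, make it executable with chmod +x git-churn, and call it with any git log options, for example ./git-churn --since='1 year ago' app lib.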

For example, what are the top 10 files in the core of your codebase that have changed the most in the past 3 months? Find out with:

git-churn --since='3 months ago' <core_of_the_app> | tail -10

Today’s challenge is to investigate the files with the highest churn. Is there anything that stands out? Is it worth trying to make them more stable?

Good morning!

Shameless plug:

I’ve written the perfect gem for this, it’s somewhat of a codeclimate clone: GitHub - julianrubisch/attractor: code complexity metrics visualization and exploration tool for ruby and javascript

there’s also a Rails engine: GitHub - julianrubisch/attractor-rails: attractor (code complexity metrics visualization and exploration tool for ruby and javascript) running in a rails engine

you can use it for plain javascript, too. if there are complexity analyzers for other languages like PHP, I’d be happy to provide that, too. It’s actually really easy to set up.

That command is really nice! Really something I will keep around. Accomplished some simple abstractions and made some TODOs for refactoring.

Cool command indeed. git-churn --since='1 year ago' app lib returned the Rails User class as our main offender. Like in many Rails apps, it is a large “god mode” class that is doing too much and could definitely use some separation of concerns here and there.

Otherwise other classes are looking pretty good.

I managed to find some good candidates for refactoring. We have some files that are doing too many things and are too hard to test. Thanks!

Didn’t really discover anything out of the ordinary - the features we’ve been working on in the past few months are dominating, and the top file is changelog.md as expected :wink:

Neat little trick by the way, thanks!

We are currently working on improving our codebase, refactoring here and there. Most of the unmaintained code has a lower churn rate.

Thanks for the tip. Will use the command to watch progress.
Will also check the attractor gem, seems pretty cool, thanks for mentioning @julianrubisch!

Great script here, thanks.

Didn’t discover anything really suspicious, but this confirmed my feeling that we have one particularly long CSS file worth modularizing.

The command came at a handy time as I’ve just started an assessment of an existing codebase this week. There are a couple of files that stand out from the list already. So thank you, I’ll definitely keep an alias of that in my dotfiles.

Yes, the book “Deep Work” is a really good one. Currently, being in a fasting period, I practice “boring” a bit, as I have decided to skip YouTube, Twitter and reading news.

With detecting high-churn files, one strange thing happened. I again detected the crud.py file as being somehow off the rails, as happened before when counting the number of function arguments.

If all goes well, I plan to refactor crud.py tomorrow. Looking forward to it.

Btw - files which depend somehow on configuration values (such as configuration files themselves, but also CI/CD automation stuff) are often high-churn ones. Sometimes that is a warning on its own, as application code and deployment configuration should be kept separate (otherwise any further deployment distinct from the first one will teach us that we have some deployment-dependent stuff baked into our app).

Wow! You’ve done amazing work there.

How do you use this in your day-to-day job? I can almost imagine this being a consulting gig, helping teams learn to read the analysis and use it to address risk areas.

as application code and deployment configuration should be kept separate (otherwise any further deployment distinct from the first one will teach us that we have some deployment-dependent stuff baked into our app).

Hey @vicinsky! Could you elaborate on this? I’m curious if you have any real-world example.

Are you referring to the fact that sometimes the deployment code is not atomic, so if you run the same deployment via CI twice, the second one will have unexpected consequences?

For context, I’m currently working on the team that manages all the CI pipelines for a bunch of mobile apps, so the topic of keeping deployment and configuration code separate, tidy, and robust is of great interest :grinning_face_with_smiling_eyes:

Thanks!

Hey folks! This is “guest contributor Giovanni Lodi”, or Gio for short :grin: Glad you found this challenge useful!

I really recommend adding this script to your PATH. If you use git-churn as the file name, Git will even recognize it as a custom command and you’ll be able to call it like git churn.
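A minimal sketch of one way to do that. The helper name install_git_churn and the default directory ~/.local/bin are assumptions; any directory that is on your PATH should work.

```shell
# Hypothetical helper: copy a script into a directory on PATH as "git-churn".
# $1 is the script to install, $2 is an optional target directory.
install_git_churn() {
  src=$1
  dest_dir=${2:-"$HOME/.local/bin"}   # assumed default location
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/git-churn" || return 1
  chmod +x "$dest_dir/git-churn"
}
```

After running install_git_churn ./git-churn, make sure the target directory is on your PATH (for example with export PATH="$HOME/.local/bin:$PATH" in your shell profile). Git looks for executables named git-<name> on PATH, so git churn --since='3 months ago' should then work.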

Analyses of Git history are fascinating and instructive. They clearly reveal that the codebase “is alive” and how each contributor affects it.

If you’d like to explore the topic further, Adam Tornhill wrote two excellent books on it: Your Code as a Crime Scene and Software Design X-Rays. You can check them both out on Medium for free here and here.

Yes, partly consulting, but also when I’m asked to join a team working on a legacy app (I’m a freelancer) and need to get an idea of the codebase quickly.

Hi @mokagio

I guess most developers went this path:

  • start a first (e.g. web) app, keep all the code and configuration in one repo
  • deploy it (e.g. git push and then git pull from a server) and run it
  • all done, relax awaiting a call from Silicon Valley
  • a request to deploy a second, independent instance comes
  • you git push, then git pull on the second server, and learn that some configuration files (present in the repo) need to be modified.

This tricky situation often results either in a messy git repo (containing config files for multiple servers) or in not storing (thus ignoring) some configs for particular servers.

The proper solution is to strictly separate the application code (which provides clear configuration methods and also allows some sort of application packaging) from the actual deployment configurations. If the deployment configs live in a (deployment) git repo, that repo typically refers to a somehow packaged application (a Python package, gem, Docker image) and adds specific configuration values for particular deployments.

From now on, a change in deployment configuration does not have to touch the application code, which is the desired result.

Your “high churn” test revealed to me that configuration files are often high on the churn scale. There could be at least two types of changes:

  • deployment node dependent - with each new node, a new config modification may come
  • data changing as users are using the application - e.g. you have an enumeration of device types and, as the application is used over time, users ask for new device types to be added (and this may differ per deployment)

Neither should happen in the application code. If it does, it could be a sign that the code needs refactoring to get this configuration stuff out of the application code and into external configuration files, which should ultimately live (and change) in deployment repositories.

Regarding the “high churn” test, I propose the following three questions about files with a higher number of changes:

  • is this configuration-related churn?
  • should we refactor it out into external configuration files?
  • should we move the actual configuration files out of the application repo into a deployment one?
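One rough way to start on the first question is to filter the churn table by file extension. This is only a sketch: the function name config_churn and the list of "configuration-like" extensions are guesses to adapt to your project, and it expects the count<TAB>file output of the churn command on standard input.

```shell
# Hypothetical filter: keep the header plus rows whose path looks like a
# configuration file. The extension list is an assumption, not a standard.
config_churn() {
  awk -F '\t' 'NR == 1 || tolower($2) ~ /\.(ya?ml|json|toml|ini|env|cfg|conf)$/'
}
```

For example, git-churn --since='1 year ago' | config_churn shows only the configuration-like files, which makes it easier to see how much of the churn they account for.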

Interesting. Thanks for the extra info :+1:

Ciao! Gio

Adam Tornhill’s books are already on my read-next list :slight_smile:
Michael Feathers also once gave a talk or wrote a blog post about that topic, combining churn on one axis and lines of code in a file on the other axis. Maybe the gem in the first comment does the same. I’ve also written a command line tool around that at some point. Fun exercise and good insights :slight_smile:
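A minimal sketch of the churn-versus-size idea, not Michael Feathers’ actual method or the tool mentioned here. The function name churn_vs_size is made up; it reads the count<TAB>file output of the churn command and adds the current line count of each file that still exists in the working tree.

```shell
# Hypothetical helper: read "count<TAB>file" rows on stdin and print
# "churn<TAB>lines<TAB>file" for every file that still exists.
churn_vs_size() {
  printf 'churn\tlines\tfile\n'
  while IFS=$(printf '\t') read -r count file; do
    [ "$count" = "count" ] && continue   # skip the header row
    [ -f "$file" ] || continue           # skip deleted or renamed files
    lines=$(wc -l < "$file" | tr -d ' ')
    printf '%s\t%s\t%s\n' "$count" "$lines" "$file"
  done
}
```

Run it from the repository root, for example git-churn --since='1 year ago' | churn_vs_size. Files that score high on both columns are the ones worth a closer look first.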

Great tip!
“Perhaps a file changes often because it is unclear and therefore buggy. Perhaps it’s doing too many things.”

I added an alias for it in my ~/.bash_aliases file. Note the changes necessary to escape the single quotes:

alias g.churn='git log --all -M -C --name-only --format='\''format:'\'' "$@" | sort | grep -v '\''^$'\'' | uniq -c | sort -n | awk '\''BEGIN {print "count\tfile"} {print $1 "\t" $2}'\'''

see How to escape single quotes within single quoted strings
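One caveat with the alias form, as far as I understand Bash: "$@" inside an alias expands to the shell’s own positional parameters, and anything typed after the alias name is appended to the end of the line, after awk. Options such as --since therefore never reach git log. A shell function avoids this and needs no quote escaping. The name gchurn below is just an example.

```shell
# Function equivalent of the alias; put it in ~/.bash_aliases or ~/.bashrc.
# Arguments are forwarded to git log, e.g. gchurn --since='3 months ago' app
gchurn() {
  git log --all -M -C --name-only --format='format:' "$@" \
    | sort | grep -v '^$' | uniq -c | sort -n \
    | awk 'BEGIN {print "count\tfile"} {print $1 "\t" $2}'
}
```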

Jannie, ever the teacher, discipling people into better developers.