Dubhrosa: How to dive into a large codebase

Getting to grips with a new codebase can be very difficult. Every software developer has to dive into unfamiliar code on a regular basis, but to my knowledge there are no good guides on how to approach the task. My jobs over the past decade have involved writing code, but I have spent more of my time reviewing code on dozens of active projects, so learning how to quickly dive into an unfamiliar codebase has been crucial.

Many discussions on this topic focus on how to navigate code in a particular editor. In this post I want to focus on the general techniques rather than editor specifics (though I’ll get to my current preferred setup at the end).

Survey the directory structure

Start at the root directory of the project. Most normal projects will have 10 to 20 files and directories in the root. Go through these one by one and make a 1-line note of the purpose and contents of each.

Checklist – at the end of this step you should be able to answer these questions

What files, if any, provide documentation?
What files drive the build and deployment system? (For projects and languages that don’t strictly have a build system, there’s usually a deployment system; I’ll just refer to this as the build system.)
Where is the source code? Is there a subdirectory structure for source files and, if so, what does it represent (libraries, components, executables)?
Where is the test code (usually either commingled with the main source or in a separate subdirectory)?
What external dependencies are required to build the project?
What are the build targets (usually executables, libraries, tests, documentation)?
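
As a rough sketch of the survey step, the following Python snippet lists each entry in the project root with a blank "purpose" slot to fill in by hand. The function names (`survey_root`, `format_notes`) and the output layout are illustrative assumptions, not part of any tool mentioned in this post.

```python
import os
from pathlib import Path


def survey_root(root):
    """Return one (name, kind, detail) row per entry in the project root."""
    rows = []
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name.lower()):
        if entry.is_dir():
            # Count the files below the directory to get a rough sense of its weight.
            count = sum(len(files) for _, _, files in os.walk(entry))
            rows.append((entry.name + "/", "dir", f"{count} files"))
        else:
            rows.append((entry.name, "file", f"{entry.stat().st_size} bytes"))
    return rows


def format_notes(rows):
    """Render the rows as plain text, leaving the purpose of each entry to be written in."""
    return "\n".join(f"{name:<24} {detail:<14} purpose: " for name, _, detail in rows)
```

Running `print(format_notes(survey_root(".")))` from the project root gives a starting template; the one-line notes themselves still have to come from actually opening each file and directory.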

Understand and Run the Build and Tests

Even though I’m often reviewing code that I’m never going to modify, I still like to start by verifying that I can successfully build the project outputs and run the main executable and tests (if those things exist). This step helps identify any weird dependencies the project has, and means that when you’re finally ready to edit code you don’t have to break flow to figure out the build system.

If the project has a test suite, figure out how to run it. Test suites vary a lot across languages and projects, and in some cases can be really finicky to get running, but it’s time well spent.
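
One way to get a first guess at the build system is to look for well-known marker files in the root. The sketch below does that in Python. The table of markers and the commands attached to them are common conventions I am assuming here, not guarantees; a given project may wrap them in its own scripts or have no `test` target at all, so treat the output as a list of things to try.

```python
from pathlib import Path

# Marker file -> (tool, conventional build command, conventional test command).
# These are typical defaults only; always check the project's own documentation.
BUILD_MARKERS = {
    "Makefile": ("make", "make", "make test"),
    "CMakeLists.txt": ("CMake", "cmake -S . -B build && cmake --build build", "ctest --test-dir build"),
    "package.json": ("npm", "npm install", "npm test"),
    "Cargo.toml": ("Cargo", "cargo build", "cargo test"),
    "go.mod": ("Go modules", "go build ./...", "go test ./..."),
    "pom.xml": ("Maven", "mvn package", "mvn test"),
}


def detect_build_systems(root):
    """Return (marker, tool, build_cmd, test_cmd) for each marker file present in root."""
    found = []
    for marker in sorted(BUILD_MARKERS):
        if (Path(root) / marker).is_file():
            tool, build_cmd, test_cmd = BUILD_MARKERS[marker]
            found.append((marker, tool, build_cmd, test_cmd))
    return found
```

An empty result does not mean there is no build system, only that it is not one of the handful listed in this table.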

Identify the interfaces, inputs and outputs

Every program is just a way to transform input data into output data. If you “dive” into the middle of a large codebase and try to figure things out from the inside out, you will fail, or at least waste a lot more time than you should. Always start from the outside and work your way in.

Identify what the inputs are, and what the outputs are. Make notes describing them – force yourself to articulate this knowledge.

Projects that implement an “official” API ought to be easier to comprehend, and often they are, but don’t fall into the trap of assuming that all the inputs and outputs are captured by the API. Many APIs provide a partial account of the I/O, and in fact you need to understand the backend database interface and the dataflows into the DB in order to really identify all the relevant inputs and outputs.

Make sure that you identify all the inputs and outputs, including log file outputs and configuration inputs. Many projects have logging outputs that give you a very useful and comprehensive picture of what the program does.
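
A crude way to build a first list of candidate inputs and outputs is to scan the source for calls that usually touch the outside world. The Python sketch below does this with regular expressions. The categories and patterns in `IO_HINTS` are my own assumptions and are deliberately rough; they will produce false positives and miss project-specific wrappers, so use the hits as leads to follow up in your notes rather than as a complete inventory.

```python
import re
from pathlib import Path

# Heuristic patterns only: each one flags lines that often indicate an input or output.
IO_HINTS = {
    "config/env": re.compile(r"\b(getenv|environ|argv|argparse|configparser)\b"),
    "files": re.compile(r"\b(open|fopen|read_text|write_text)\s*\("),
    "logging": re.compile(r"\b(logging|logger|log)\.\w+\s*\("),
    "network/db": re.compile(r"\b(socket|connect|execute)\s*\("),
}


def scan_text(text):
    """Return (category, line_number, stripped_line) for every line matching a hint."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for category, pattern in IO_HINTS.items():
            if pattern.search(line):
                hits.append((category, lineno, line.strip()))
    return hits


def scan_tree(root, suffixes=(".py", ".c", ".js")):
    """Yield (category, 'path:line', stripped_line) for matching source files under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="replace")
            for category, lineno, line in scan_text(text):
                yield category, f"{path}:{lineno}", line
```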

Structured Examination of Code

Don’t just “browse” the code. Write down specific questions that you want to investigate, like “How are messages filtered and decrypted?” Stay focused on the point you are investigating, and try to avoid being distracted by interesting-looking code.

Make notes describing the answer to these questions, including a function call graph and any important data manipulations.
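
For Python sources, a first draft of a function call graph can be generated with the standard library `ast` module and then corrected by hand. The sketch below is a simplification under stated assumptions: it records only the names that appear at call sites, does not resolve which object a method belongs to, and attributes calls made inside nested functions to the enclosing function as well. The function name `call_graph` is hypothetical.

```python
import ast


def call_graph(source):
    """Map each function defined in `source` to the sorted names it calls."""
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            # Walk the function body (including nested functions) looking for call sites.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):
                        callees.add(func.id)    # plain call: foo()
                    elif isinstance(func, ast.Attribute):
                        callees.add(func.attr)  # method or module call: obj.foo()
            graph[node.name] = sorted(callees)
    return graph
```

For a question like the message-filtering one above, running this over the relevant module gives a skeleton of who calls whom, which you can then annotate with the important data manipulations.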

When you open a file, page down through it all the way to the bottom, spending about 5 seconds skimming each screen of code. I don’t have a good explanation, but I find this really helps me to get oriented and get a feel for the size and shape of the code. You obviously can’t absorb much of the detail by doing this, but it answers a lot of high-level questions, like whether the code is repetitive boilerplate, a bunch of simple functions, or a small number of really complicated functions.

Understand the branching structure

Thankfully most modern projects use good distributed version control systems with sane branching policies. You can usually figure out the branching policy quite quickly just by looking at the history, but always check the project documentation for specific information on this.

Spend 20 minutes reading the most recent commit messages and diffs

I time-box this activity because for large, long-running projects you could spend an indefinite length of time reading the changes. 20 minutes doesn’t sound like much, but it’s more than enough to get a feel for the parts of the codebase that are under active development, which developers are working on those areas, and whether the development is driven by issues or by new features.
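
To complement reading the commits themselves, a small script can tally which top-level directories and which authors show up most in recent history. The Python sketch below assumes the project uses git and that `git` is on your PATH; the `@@@` marker, the default of 50 commits, and the function names are arbitrary choices of mine. It uses `git log --name-only` with a custom `--pretty=format:` string so that author lines can be told apart from file paths.

```python
import subprocess
from collections import Counter

MARKER = "@@@"  # arbitrary prefix, assumed not to start any file path in the repository


def recent_log(repo=".", count=50):
    """Return `git log` text: one MARKER+author line per commit, followed by touched paths."""
    cmd = ["git", "-C", repo, "log", "-n", str(count),
           "--name-only", f"--pretty=format:{MARKER}%an"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def summarize(log_text):
    """Count commits per author and touched files per top-level directory."""
    authors, dirs = Counter(), Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank separator lines between commits
        if line.startswith(MARKER):
            authors[line[len(MARKER):]] += 1
        else:
            dirs[line.split("/", 1)[0]] += 1
    return authors, dirs
```

Calling `summarize(recent_log())` inside a checkout returns two counters; `most_common()` on each gives a quick ranking of the busiest areas and contributors. It is no substitute for reading the diffs, but it helps decide which ones to read first.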

Making Notes

You have to make notes as you go, otherwise you will flounder and waste an inordinate amount of time. If you need to dip in and out of codebases with weeks or months in between visits, your notes will be invaluable to you the next time through.

I start taking notes in Workflowy. If the notes grow a lot, I switch them to a git repository I’ve called “codenotes” just for this purpose. It has a subdirectory for every project, with cloning instructions, so I know how to get started next time around, along with my notes. If you’re spending a lot of time on one large project, consider writing a readme for developers and adding it to the project’s own wiki or source control.
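
A per-project notes directory like this is easy to scaffold. The Python sketch below creates one subdirectory per project with the cloning instructions and an empty notes file. The file names (`CLONE.txt`, `notes.txt`) and the section headings are hypothetical choices for illustration, not a description of my actual repository layout.

```python
from pathlib import Path

# Hypothetical section headings that mirror the steps in this post.
NOTE_SECTIONS = ["Directory survey", "Build and tests", "Inputs and outputs", "Questions"]


def init_project_notes(notes_root, project, clone_url):
    """Create notes_root/project with cloning instructions and a notes skeleton."""
    project_dir = Path(notes_root) / project
    project_dir.mkdir(parents=True, exist_ok=True)

    clone_file = project_dir / "CLONE.txt"
    if not clone_file.exists():
        clone_file.write_text(f"git clone {clone_url}\n")

    notes_file = project_dir / "notes.txt"
    if not notes_file.exists():
        # Never overwrite existing notes; only create the skeleton on first use.
        body = "".join(f"{section}:\n\n" for section in NOTE_SECTIONS)
        notes_file.write_text(f"Notes on {project}\n\n{body}")
    return project_dir
```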

My Personal Setup

I use Vim to read code. I turned off syntax highlighting a long time ago and am convinced that it’s far easier to quickly read and comprehend code without it. Actually I use the nofrils color scheme that has no syntax highlighting but does make comments a very slightly different color to the code.

I occasionally use folds (two keystrokes will hide all the code except the top-level class and function declarations), but they are not crucial. I use the NERDTree plugin to browse the directory structure, but again I don’t think it’s crucial.

I have set up a few keyboard shortcuts that make it quicker to load files and switch between files.

Buffers: nnoremap <Leader>b :ls<CR>:buffer<Space>

Files: nnoremap <Leader>e q:iedit **/*

I’ve used tags on and off over the years. If you work with languages for which tag support is mature, then they’re good, but several of the languages I need to work with are still working out tags support (JavaScript and others), and the time needed to set up the finicky toolchain isn’t worth it in my view. There isn’t enough of a difference between tags and grep for me to spend time on tags that don’t just work out of the box.
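
To illustrate how far a grep-style search gets you without a tags toolchain, here is a minimal Python sketch that looks for lines that appear to define a given name. The keyword list and file suffixes are assumptions and far from complete: for example, it will miss Go methods with receivers and JavaScript functions assigned to variables. That is the usual trade-off of grep against real tags.

```python
import re
from pathlib import Path


def definition_pattern(name):
    """Build a regex matching common definition keywords followed by `name`."""
    keywords = r"(?:def|class|function|func|fn|struct|interface)"
    return re.compile(rf"\b{keywords}\s+{re.escape(name)}\b")


def find_definitions(root, name, suffixes=(".py", ".js", ".go", ".rs")):
    """Return (path, line_number, stripped_line) for lines that look like definitions of name."""
    pattern = definition_pattern(name)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            lines = path.read_text(errors="replace").splitlines()
            for lineno, line in enumerate(lines, start=1):
                if pattern.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```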

I also occasionally use Atom, Visual Studio, VS Code, neovim, and a few other editors and IDEs, and find them all to be perfectly acceptable; I’m just more productive in Vim.

Friday 22 July 2016