The value of Apache code, using free software

By Sean Palmer, 27th Apr, 2018

How would you go about working out the value added to gigabytes of code over the past year? This is the task I found myself faced with recently, in the form of an analysis of the Apache Software Foundation. As a sponsor of the ASF, we decided to donate this analysis for the ASF's annual report. The ASF's codebase contains 1543 git repositories, some of which have been developed for decades, amounting to about 75 GB of code and repository history. We're talking on the order of over a hundred million lines of code. Wow.

The first task was to actually get the code. This was easy enough, and just required looping over a manifest of available repositories using Python and then calling out to git to clone them all. Here's an example snippet from that program:

import os
import subprocess

for repo in manifest:
    if not os.path.exists("%s/%s.git" % (base, repo)):
        # New repository: clone it from the ASF's GitHub mirror.
        print("New repo, %s.git - cloning!" % repo)
        try:
            subprocess.check_call(['/usr/bin/git', 'clone',
                "https://github.com/apache/%s.git" % repo,
                "%s/%s.git" % (base, repo)])
        except subprocess.CalledProcessError:
            print("Failed to clone, ignoring for now!")
            continue
    else:
        # Existing repository: just pull the latest changes.
        print("Syncing %s.git" % repo)
        os.chdir("%s/%s.git" % (base, repo))
        try:
            subprocess.check_call(['/usr/bin/git', 'pull'])
        except subprocess.CalledProcessError:
            print("Sync failed, ignoring...")
        os.chdir(home)

This code depends on manifest, base, and home variables being set appropriately. This gives us over one and a half thousand repositories in our base directory, after enough time to make a cup of tea and write a sonnet or two.
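For completeness, here's a minimal sketch of how those variables might be set up, assuming the manifest lives in a plain text file with one repository name per line. The repos.txt file name and layout are just for illustration, not part of the original program:

import os

# Directory to return to after each sync, and where the clones are kept.
home = os.getcwd()
base = os.path.join(home, "repos")
if not os.path.exists(base):
    os.makedirs(base)

# Illustrative manifest file: one Apache repository name per line,
# e.g. "httpd", "spark", "mynewt-core".
with open("repos.txt") as fh:
    manifest = [line.strip() for line in fh if line.strip()]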

The next thing to do was to choose the right tool for performing the calculations. Ideally we'd use the industry-standard COCOMO II metric for figuring out the value of the code, but most of the popular high-performance code counting tools, such as tokei, don't automatically perform the necessary COCOMO II calculations for you. Thankfully, Ben Boyter has written a new free software tool in Go called scc which is not only blazingly fast like tokei, but also has COCOMO II metrics built in. Great!
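To give a feel for what such an estimate involves, here's a minimal sketch of a COCOMO-style cost calculation in Python. It uses the classic organic-mode COCOMO coefficients rather than the full COCOMO II model, and the overhead multiplier is only an illustrative default; the exact parameters and adjustment factors scc applies may differ, so treat this as the general shape of the calculation rather than scc's implementation:

def cocomo_organic_cost(sloc, annual_wage=56286, overhead=2.4):
    # Classic organic-mode effort estimate, in person-months,
    # computed from thousands of lines of code.
    kloc = sloc / 1000.0
    effort_months = 2.4 * (kloc ** 1.05)
    # Convert effort into dollars: monthly wage times an overhead multiplier.
    monthly_cost = annual_wage / 12.0 * overhead
    return effort_months * monthly_cost

# For example, a 100,000-line project at the $110,000 wage used later on:
cost = cocomo_organic_cost(100000, annual_wage=110000)
print("Estimated cost: $%s" % format(int(cost), ","))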

Now, we only want to analyse source code, but scc is quite liberal in what it considers source code. We want to be a lot more conservative about this, so let's inject some liquid conservatism in the form of a filter. We can easily analyse the most common document kinds in our available repositories:

find "$_base" -type f | egrep -o '\.[^.]+$' | grep -v / | \
  sort | uniq -c | sort -rn

And from that we just manually pick out the document types that are most likely to offend. These are the extensions we'll exclude (they go into the blacklist.txt file used by the script further down):

html,png,xml,mxml,md,txt,xsd,gif,sample,json,jpg,
out,in,md5,svg,yml,sha1,test,yaml,pack,xhtml,htm

Now, one problem. There happens to be a --whitelist option in scc, but there's nothing to filter out the types we want to exclude. This can be solved by patching the scc source code, on top of commit 2d92c3e931e82500 in our case, which is easy to do because it's written in very readable Go. First we need to add a new command-line option to take in the values, which can be done by duplicating the --whitelist code. Then essentially all we need to do is add the following code beneath the whitelist handling in processor/file.go:

        // Remove any blacklisted extensions from the lookup table so that
        // files with those extensions are never counted.
        if len(BlackListExtensions) != 0 {
                for _, black := range BlackListExtensions {
                        delete(extensionLookup, black)
                }
        }

With that in place it's just a matter of rebuilding (GOPATH=/path/to/go GOOS=linux GOARCH=amd64 /path/to/go/bin/go build), and then we have a nicely enhanced scc.

With that, we get to the main course. If we ran scc over the repositories as they stand today, it would tell us how much the code would have cost to write going all the way back through the decades to the ASF's infancy. But since this is an annual report, we're only interested in the date range 22 Apr 2017 to 22 Apr 2018. What we need to do is check out the code as it was on those two dates, run scc on each checkout, and then take the difference.

We can find the commit at a certain date on master by running git rev-list -1 --before="$_date" master, but first we need to ensure that we're on master with git checkout -f master, and make sure there are no extraneous files lying around. Since we're only looking at master, any development that took place in a non-master branch and didn't make its way to master will not be included in the analysis! If there are any git errors due to wonky repositories, that will also affect the count in the conservative direction. In other words, the figures we produce are likely to be underestimates.

Here's a shell script that weaves all of the foregoing together. We switch to master, reset, check out the code as it was in 2017, run the patched scc on it, do the same with the 2018 code, and make sure that we log all the results. The default wage used by scc is $56,286 per annum, which is much lower than the average Silicon Valley wage. We use $110,000 as a more reasonable option, in essence getting results as though the code were written at one of the large and well-known corporate software giants, to align with the quality of code produced at the ASF.

#!/bin/bash
# Collect the cost estimates for every repository into a single log in the
# starting directory: two "Estimated Cost to Develop" lines per repository,
# 2017 first, then 2018.
_log="$PWD/scc.log"
_blacklist="$(cat blacklist.txt)"
for fn in *.git
do cd "$fn" || continue
   echo "$fn" >> "$_log"
   # Make sure we're on a clean master before looking up the commits.
   git checkout -f master
   git reset --hard
   # The state of master as of 22 Apr 2017...
   _aa="$(git rev-list -1 --before='2017-04-22 00:00' master)"
   git checkout -f "$_aa"
   scc --blacklist "$_blacklist" --aw 110000 . | \
     grep 'Estimated Cost to Develop' >> "$_log"
   # ...and as of 22 Apr 2018.
   _bb="$(git rev-list -1 --before='2018-04-22 00:00' master)"
   git checkout -f "$_bb"
   scc --blacklist "$_blacklist" --aw 110000 . | \
     grep 'Estimated Cost to Develop' >> "$_log"
   cd ..
done

Once we've done this, it's just a matter of analysing the results. There's a quote attributed to Bill Gates that "Measuring programming progress by lines of code is like measuring aircraft building progress by weight." There's also a famous story from the early days at Apple, when they decided to start measuring productivity by lines of code. One of their top programmers, the legendary Bill Atkinson (author of HyperCard), decided to refactor a bunch of code to make the result as elegant as possible. He recorded two thousand negative lines of code on his timesheet, and legend has it that eventually they asked Bill to stop filling in the timesheet.

Certainly, measuring value simply by the number of lines of code added is a napkin calculation, but it gives us a ballpark estimate which, counter-intuitively, is probably more accurate the more code you measure, as it smooths out any peculiarities. Since we're measuring a heck of a lot of code here, the calculation is hopefully in the right ballpark, i.e. at least having the correct order of magnitude.
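As a rough sketch of this final step, the combined scc.log can be reduced to per-repository cost deltas with a few lines of Python. This assumes each repository name in the log is followed by exactly two "Estimated Cost to Develop" lines, and that the dollar figure is the first number of that form on each line; adjust the parsing if your scc output is formatted differently:

import re

# Map each repository to its [2017 cost, 2018 cost] figures from scc.log.
costs = {}
current = None

with open("scc.log") as fh:
    for line in fh:
        line = line.strip()
        if line.endswith(".git"):
            current = line
            costs[current] = []
        elif "Estimated Cost to Develop" in line and current:
            # Pull the first dollar figure out of the line, e.g. "$1,234,567".
            match = re.search(r"\$([\d,]+)", line)
            if match:
                costs[current].append(int(match.group(1).replace(",", "")))

# Only count repositories where both snapshots produced a figure.
total = sum(vals[1] - vals[0] for vals in costs.values() if len(vals) == 2)
print("Estimated value added: $%s" % format(total, ","))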

The raw, topline figure is that, according to the workflow presented here, the Apache Software Foundation contributors added code worth $624,946,835 to their repositories between Apr 2017 and Apr 2018. Overall, 8,376,918 lines of code were added. This gives an average cost per line of about $74, which is slightly less than standard estimates for the Linux kernel. A COCOMO II model of version 2.6 of the kernel gave a figure of $612m for its 5.9 million lines of code, and a 2006 study funded by the EU put the figure at $1.14b. This equates to figures of $103 and $193 per line of code respectively, and these higher figures are probably appropriate for pure kernel code.

We can also look at individual ASF projects. The project that added the most value over this period is Apache Mynewt, an "OS to build, deploy and securely manage billions of devices". Not surprisingly for an operating system, it had a lot of lines of code added, mostly written in C, paralleling the Linux kernel valuations above. According to our patched scc, the Apache Mynewt contributors added $61,769,063 of value to their core component alone over the year under consideration.

It is impossible through any kind of auditing to give an exact figure for code production to the nearest cent, penny, or eurocent. But using entirely open source software such as Git and scc, it is possible to perform a reasonable analysis of millions of lines of code, and even to do so very quickly. The script presented above took just over fifteen minutes to run on an eight-core Intel Xeon E5-1410 with 64 GB of RAM. We hope this encourages people not only to run their own statistics, but also to appreciate the enormous amount of effort that goes into open source software for the benefit of everybody.
