In [1]:

import re
import subprocess
import dateutil.parser
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

PATH_TO_REPO = "main-sf"
PATH_TO_CHANGELOG = "%s/doc/changelog.rst" % PATH_TO_REPO

In this notebook I explore data from Timeline to see what can be learned from it.

In particular, I'm interested in what makes a Timeline release successful: what did we do right, and what should we keep doing so that future releases are successful too?

That is quite a vague direction, but I hope to ask more specific questions as I go along.

When did releases happen?

Where do we start? Let's start somewhere.

Let's start by figuring out when the different releases happened. We can parse that information from the changelog:

In [2]:

release_dates = []
release_versions = []
with open(PATH_TO_CHANGELOG) as f:
    while True:
        line = f.readline()
        if not line:
            break
        match = re.match(r"^Version (\d+\.\d+\.\d+)$", line)
        if match:
            version = match.group(1)
            f.readline()
            f.readline()
            match = re.match(r"^\*\*Released on (.*)\.\*\*$", f.readline())
            if match:
                release_dates.append(dateutil.parser.parse(match.group(1)).date())
                release_versions.append(version)
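
To sanity-check the parsing logic without the repository at hand, here is a minimal sketch that runs the same two regular expressions over an in-memory changelog. The sample text, its date format, the "Planned release" entry, and the helper name `parse_releases` are all assumptions made for illustration; the real changelog may be laid out differently.

```python
import re
from datetime import datetime

# Hypothetical changelog excerpt. The layout (title, underline, blank line,
# release line) is assumed from the two skipped readline() calls above.
SAMPLE_CHANGELOG = """\
Version 1.6.0
=============

**Released on 30 April 2015.**

Version 1.5.0
=============

**Released on 31 January 2015.**

Version 1.4.2
=============

**Planned release on 1 May 2016.**
"""


def parse_releases(text):
    """Return (version, date) pairs for every version that has a release line."""
    lines = iter(text.splitlines())
    releases = []
    for line in lines:
        version_match = re.match(r"^Version (\d+\.\d+\.\d+)$", line)
        if not version_match:
            continue
        next(lines, "")  # skip the underline
        next(lines, "")  # skip the blank line
        date_match = re.match(r"^\*\*Released on (.*)\.\*\*$", next(lines, ""))
        if date_match:
            # Assumed date format for this sample; the notebook uses dateutil instead.
            released = datetime.strptime(date_match.group(1), "%d %B %Y").date()
            releases.append((version_match.group(1), released))
    return releases


sample_releases = parse_releases(SAMPLE_CHANGELOG)
for version, released in sample_releases:
    print(version, released)
```

The last entry in the sample has no "Released on" line, so it is skipped, just as unreleased versions are skipped by the loop above.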

Let's load that into Pandas so that we can more easily work with it:

In [3]:

releases = pd.DataFrame({
    "date": release_dates,
    "version": release_versions,
})

releases.head()

Out[3]:

	date	version
0	2015-04-30	1.6.0
1	2015-01-31	1.5.0
2	2014-11-12	1.4.1
3	2014-11-09	1.4.0
4	2014-06-30	1.3.0

5 rows × 2 columns

What does the frequency look like?

Let's plot when releases occurred in time to get a feel for the distribution.

Let's first add a dummy column that we will use for plotting purposes.

In [4]:

releases["dummy"] = np.zeros(len(release_dates))
releases.head()

Out[4]:

	date	version	dummy
0	2015-04-30	1.6.0	0
1	2015-01-31	1.5.0	0
2	2014-11-12	1.4.1	0
3	2014-11-09	1.4.0	0
4	2014-06-30	1.3.0	0

5 rows × 3 columns

Let's also select the major releases (versions ending in ".0"), since those are the only ones we want to label on the x-axis:

In [5]:

major_releases = releases[releases["version"].str.endswith(".0")]
major_releases.head()

Out[5]:

	date	version	dummy
0	2015-04-30	1.6.0	0
1	2015-01-31	1.5.0	0
3	2014-11-09	1.4.0	0
4	2014-06-30	1.3.0	0
9	2014-04-05	1.2.0	0

5 rows × 3 columns

Now we are ready to plot:

In [6]:

releases.plot(x="date", y="dummy", style="o", figsize=(15, 3))
plt.xticks(major_releases["date"].values, major_releases["version"].values, rotation=90)
plt.yticks([])
plt.xlabel("")
plt.show()

We see that some blue circles sit just to the right of a labeled tick. Those are the minor releases. For example, the circle to the right of 0.12.0 is probably release 0.12.1.

Now we've got an intuitive feel for the distribution. Let's see if we can plot it more precisely:

In [7]:

# sort_values replaces DataFrame.sort, which newer pandas versions no longer provide
sorted_major_releases = major_releases.sort_values("date").copy()
sorted_major_releases["time_in_development"] = pd.to_datetime(sorted_major_releases["date"]).diff()
sorted_major_releases["days_in_development"] = sorted_major_releases["time_in_development"].dt.days
sorted_major_releases.head()

Out[7]:

	date	version	dummy	time_in_development	days_in_development
38	2009-04-11	0.1.0	0	NaT	NaN
37	2009-07-05	0.2.0	0	85 days	85
36	2009-08-01	0.3.0	0	27 days	27
35	2009-09-01	0.4.0	0	31 days	31
34	2009-10-01	0.5.0	0	30 days	30

5 rows × 5 columns
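
As an independent check on the numbers in that table, the same differences can be computed with the standard library alone. This is only a sketch: the helper name `days_between_releases` is made up for illustration, and the dates are copied from the five rows shown above.

```python
from datetime import date

# Release dates copied from the first five rows of the table above.
first_major_releases = [
    ("0.1.0", date(2009, 4, 11)),
    ("0.2.0", date(2009, 7, 5)),
    ("0.3.0", date(2009, 8, 1)),
    ("0.4.0", date(2009, 9, 1)),
    ("0.5.0", date(2009, 10, 1)),
]


def days_between_releases(releases):
    """Map each version to the number of days since the previous release.

    The earliest release has no predecessor, so it maps to None.
    """
    ordered = sorted(releases, key=lambda item: item[1])
    days = {}
    previous = None
    for version, released in ordered:
        days[version] = None if previous is None else (released - previous).days
        previous = released
    return days


days = days_between_releases(first_major_releases)
for version, value in days.items():
    print(version, value)
```

The `None` for 0.1.0 plays the same role as the NaT/NaN in the first row: there is no earlier release to measure from.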

And now we are ready to plot:

In [8]:

sorted_major_releases.days_in_development.plot(kind="bar")
plt.xticks(
    np.arange(sorted_major_releases.days_in_development.shape[0])+1, # Not sure why +1 is needed
    sorted_major_releases.version.values,
    rotation=90
)
plt.title("Days in development for Timeline releases")
plt.show()

That was fun!

It looks like we released often in the beginning and then changed the release period at version 0.10.0. I know that at some point we decided to make a new release roughly every third month. Was that around 0.10.0? And if so, why did releases 0.11.0, 0.17.0, and 1.4.0 take significantly longer?

From the changelog, it looks like version 0.11.0 contained very few changes. So maybe it was "delayed" because we had nothing useful to release.

The same goes for version 0.17.0.

Version 1.4.0 contained the undo feature, which I remember we wanted to test a bit more before making the release. That is probably why 1.4.0 was a little late.

What about commit frequency?

Now, let's extract some data about commits to see if the data there supports our guesses above.

Let's start by extracting the dates of all commits:

In [9]:

output = subprocess.check_output([
    "hg", "log",
    "--template", "{date|isodate}\n"
], cwd=PATH_TO_REPO, universal_newlines=True)  # return text rather than bytes

commits = pd.DataFrame({
    "date": [dateutil.parser.parse(x).date() for x in output.splitlines()]
})
commits = commits.sort_values("date")  # DataFrame.sort is gone from newer pandas

In [10]:

commits.head()

Out[10]:

	date
3147	2008-10-28
3146	2008-10-29
3145	2008-11-01
3144	2008-11-02
3143	2008-11-03

5 rows × 1 columns

In [11]:

commits.tail()

Out[11]:

	date
0	2015-06-19
1	2015-06-19
2	2015-06-19
3	2015-06-19
4	2015-06-19

5 rows × 1 columns

In [12]:

commits.describe()

Out[12]:

	date
count	3148
unique	718
top	2014-09-09
freq	32

4 rows × 1 columns
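
The extraction step can also be tried without a Mercurial checkout. The sketch below parses a hand-written stand-in for the `hg log` output using only the standard library. It assumes that the `isodate` filter prints dates like `2015-06-19 21:04 +0200`; the times and offsets in the sample are invented, and `parse_commit_dates` is a hypothetical helper name.

```python
from datetime import datetime

# Invented stand-in for the output of: hg log --template "{date|isodate}\n"
# Only the calendar dates match the notebook; times and offsets are made up.
SAMPLE_HG_OUTPUT = (
    "2015-06-19 21:04 +0200\n"
    "2015-06-19 20:11 +0200\n"
    "2008-10-28 19:30 +0100\n"
)


def parse_commit_dates(output):
    """Return the commit dates, oldest first, from isodate-formatted lines."""
    return sorted(
        datetime.strptime(line, "%Y-%m-%d %H:%M %z").date()
        for line in output.splitlines()
        if line.strip()  # ignore blank lines
    )


commit_dates = parse_commit_dates(SAMPLE_HG_OUTPUT)
print(commit_dates[0], commit_dates[-1], len(commit_dates))
```

Taking `.date()` keeps the calendar day in the committer's own time zone, which is also what `dateutil.parser.parse(x).date()` does in the cell above.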

From that we can create a series that has the number of commits per day:

In [13]:

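commit_frequency = commits.groupby("date").size().to_frame("number_of_commits")
commit_frequency.index = pd.to_datetime(commit_frequency.index)  # asfreq needs a DatetimeIndex
commit_frequency = commit_frequency.asfreq("D").fillna(0)  # days without commits become 0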
commit_frequency.head()

Out[13]:

	number_of_commits
2008-10-28	1
2008-10-29	1
2008-10-30	0
2008-10-31	0
2008-11-01	1

5 rows × 1 columns
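
The important detail in that step is that days without any commits must appear with a count of 0; otherwise the plot would silently skip them. A minimal standard-library sketch of the same idea, using the first three commit dates from above (the helper name `commits_per_day` is made up for illustration):

```python
from collections import Counter
from datetime import date, timedelta


def commits_per_day(commit_dates):
    """Return (day, count) pairs for every day from the first to the last commit."""
    counts = Counter(commit_dates)
    day = min(counts)
    last = max(counts)
    series = []
    while day <= last:
        # A Counter returns 0 for days that have no commits.
        series.append((day, counts[day]))
        day += timedelta(days=1)
    return series


# The first three commit dates from the notebook.
daily = commits_per_day([date(2008, 10, 28), date(2008, 10, 29), date(2008, 11, 1)])
for day, count in daily:
    print(day, count)
```

The two zero rows for 2008-10-30 and 2008-10-31 correspond to the rows that `asfreq("D").fillna(0)` fills in.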

Let's plot it to see what it looks like:

In [14]:

commit_frequency.plot(figsize=(15, 5))
plt.title("Number of commits over time")
plt.xticks(
    sorted_major_releases.date.values,
    sorted_major_releases.version.values,
    rotation=90
)
plt.show()

Now let's see if we can look at a particular release. Let's look at the three we found took longer: 0.11.0, 0.17.0, and 1.4.0:

In [15]:

def plot_commit_stat(start_release, end_release):
    span = major_releases[(major_releases.version == start_release) | (major_releases.version == end_release)]
    start_date = span.date.min()
    end_date = span.date.max()
    commit_frequency[start_date:end_date].plot(figsize=(15, 3))
    labels = major_releases[(major_releases.date >= start_date) & (major_releases.date <= end_date)]
    plt.title("Number of commits between %s - %s" % (start_release, end_release))
    plt.xticks(
        labels.date.values,
        labels.version.values,
        rotation=90
    )

plot_commit_stat("0.10.0", "0.11.0")
plot_commit_stat("0.16.0", "0.17.0")
plot_commit_stat("1.3.0", "1.4.0")
plt.show()

The two earlier releases seem to contain fewer commits per day, which matches what we saw in the changelog. Version 1.4.0 seems to have quite steady commits from the middle of the period onwards. Looking at the overall commit frequency graph, we also see that the 1.4.0 period contains the busiest day of all, with 32 commits (2014-09-09). So it's possible that it took longer because we wanted to add some more tests.

I'm not sure that I can draw any conclusions from this, but looking at data graphically is quite fun, and I've learned how to use the Pandas library.

Timeline release statistics
