Reading a Git log in R
Occasionally, for reasons that noone really knows, it is necessary to visualise data from repository logs.
This could include understanding who the frequent authors are, or the most often changed code file, or which files are changed together, or anything else a manager could dream up.
Retrieve Git Data
Using git log
, we can retrieve various log data from each commit in our repo. Information on the command structure can be found on the git-scm.com site.
Here, we have chosen to get data for inserts/deletions for each file (--numstat
); the short hash; the strict-ISO date, name, email for the author; the subject. The output is saved to a log file.
git log --all --numstat --pretty=format:'%h~~%aI~~%aN~~%aE~~%s' --no-renames --since=2022-10-14 > logfile.log
Having generated our file using the command above, we can use readr::read_lines(logfile)
* to read each log file line to a list.
The example output below is a random selection of commits from the Jekyll repo.
* for large files,
read_lines
may not be very efficient, but it works for us here…
commits.raw <- readr::read_lines(logfile.log)
commits.raw |> head(10)
## [1] "6448700552~~2022-10-16T09:13:22-07:00~~Basil Crow~~me@basilcrow.com~~Wait for inbound agent to go offline in tests (#7261)"
## [2] "1\t1\ttest/pom.xml"
## [3] "5\t1\ttest/src/test/java/hudson/bugs/JnlpAccessWithSecuredHudsonTest.java"
## [4] "10\t6\ttest/src/test/java/hudson/slaves/JNLPLauncherRealTest.java"
## [5] "6\t2\ttest/src/test/java/jenkins/agents/WebSocketAgentsTest.java"
## [6] "6\t2\ttest/src/test/java/jenkins/security/Security218Test.java"
## [7] "5\t1\ttest/src/test/java/jenkins/slaves/restarter/JnlpSlaveRestarterInstallerTest.java"
## [8] ""
## [9] "bb4061e23d~~2022-10-15T11:31:13-07:00~~dependabot[bot]~~49699333+dependabot[bot]@users.noreply.github.com~~Bump JUnit plugin from 1150.v5c2848328b_60 to 1153.v1c24f1a_d2553 (#7260)"
## [10] "1\t1\ttest/pom.xml"
Wrangling
To convert this list to a tibble, we can create a new one, and parse in the data, and also filter out blank lines.
There are a number of ways to turn this into a tidy table, but I have added a commit_number
column which adds a counter to each row. The counter only increments when it finds our delimiter from the git log
command (~~
). This means that all data for each commit can be uniquly identified, and we can use this value later on.
The data is the same as before, except it’s in a ‘proper’ tibble and each commit has a unique way to identify it.
commits.records <- tibble::tibble(data = commits.raw) |>
dplyr::filter(data != "") |>
dplyr::mutate(commit_number = cumsum(grepl("[~~]+", .data$data)))
commits.records |> head(10)
## # A tibble: 10 × 2
## data commi…¹
## <chr> <int>
## 1 "6448700552~~2022-10-16T09:13:22-07:00~~Basil Crow~~me@basilcrow.com… 1
## 2 "1\t1\ttest/pom.xml" 1
## 3 "5\t1\ttest/src/test/java/hudson/bugs/JnlpAccessWithSecuredHudsonTes… 1
## 4 "10\t6\ttest/src/test/java/hudson/slaves/JNLPLauncherRealTest.java" 1
## 5 "6\t2\ttest/src/test/java/jenkins/agents/WebSocketAgentsTest.java" 1
## 6 "6\t2\ttest/src/test/java/jenkins/security/Security218Test.java" 1
## 7 "5\t1\ttest/src/test/java/jenkins/slaves/restarter/JnlpSlaveRestarte… 1
## 8 "bb4061e23d~~2022-10-15T11:31:13-07:00~~dependabot[bot]~~49699333+de… 2
## 9 "1\t1\ttest/pom.xml" 2
## 10 "d2b6473c1a~~2022-10-15T15:30:15-03:00~~Fernando Boaglio~~boaglio@gm… 3
## # … with abbreviated variable name ¹commit_number
Tidying
So… How do we make the commit file stats and commit information tidy?
This is one way…
Create a table with only commit information and use tidyr::separate()
to split the string on our delimiter and create new columns.
commits.info <- commits.records |>
dplyr::filter(grepl("[~~]+", data)) |>
tidyr::separate(data, c("commit", "ad", "an", "ae", "s"), sep = "[~~]+")
commits.info |> head(4)
## # A tibble: 4 × 6
## commit ad an ae s commi…¹
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 6448700552 2022-10-16T09:13:22-07:00 Basil Crow me@basilc… Wait… 1
## 2 bb4061e23d 2022-10-15T11:31:13-07:00 dependabot[bot] 49699333+… Bump… 2
## 3 d2b6473c1a 2022-10-15T15:30:15-03:00 Fernando Boaglio boaglio@g… Upda… 3
## 4 9ae07ce2cd 2022-10-15T20:29:34+02:00 Stefan Spieker S.Spieker… Repl… 4
## # … with abbreviated variable name ¹commit_number
Do similar for the file stats, in a different table. By default the log’s numstat
separates its values with a tab \t
.
commits.stats <- commits.records |>
dplyr::filter(!grepl("[~~]+", data)) |>
tidyr::separate(data, c("insertions", "deletions", "filename"), sep = "[\t]+")
commits.stats |> head(10)
## # A tibble: 10 × 4
## insertions deletions filename commi…¹
## <chr> <chr> <chr> <int>
## 1 1 1 test/pom.xml 1
## 2 5 1 test/src/test/java/hudson/bugs/JnlpAccessWithSe… 1
## 3 10 6 test/src/test/java/hudson/slaves/JNLPLauncherRe… 1
## 4 6 2 test/src/test/java/jenkins/agents/WebSocketAgen… 1
## 5 6 2 test/src/test/java/jenkins/security/Security218… 1
## 6 5 1 test/src/test/java/jenkins/slaves/restarter/Jnl… 1
## 7 1 1 test/pom.xml 2
## 8 11 7 core/src/main/resources/hudson/Messages_pt_BR.p… 3
## 9 3 2 core/src/main/resources/hudson/PluginManager/ad… 3
## 10 3 2 core/src/main/resources/hudson/PluginManager/av… 3
## # … with abbreviated variable name ¹commit_number
For both info and stats, we have kept our commit_number
column.
We can use this to join stats to info.
commits <- commits.info |>
dplyr::left_join(commits.stats, by = "commit_number") |>
dplyr::mutate(ad = as.Date(ad))
commits |> head(10)
## # A tibble: 10 × 9
## commit ad an ae s commi…¹ inser…² delet…³ filen…⁴
## <chr> <date> <chr> <chr> <chr> <int> <chr> <chr> <chr>
## 1 6448700552 2022-10-16 Basil Crow me@b… Wait… 1 1 1 test/p…
## 2 6448700552 2022-10-16 Basil Crow me@b… Wait… 1 5 1 test/s…
## 3 6448700552 2022-10-16 Basil Crow me@b… Wait… 1 10 6 test/s…
## 4 6448700552 2022-10-16 Basil Crow me@b… Wait… 1 6 2 test/s…
## 5 6448700552 2022-10-16 Basil Crow me@b… Wait… 1 6 2 test/s…
## 6 6448700552 2022-10-16 Basil Crow me@b… Wait… 1 5 1 test/s…
## 7 bb4061e23d 2022-10-15 dependabot… 4969… Bump… 2 1 1 test/p…
## 8 d2b6473c1a 2022-10-15 Fernando B… boag… Upda… 3 11 7 core/s…
## 9 d2b6473c1a 2022-10-15 Fernando B… boag… Upda… 3 3 2 core/s…
## 10 d2b6473c1a 2022-10-15 Fernando B… boag… Upda… 3 3 2 core/s…
## # … with abbreviated variable names ¹commit_number, ²insertions, ³deletions,
## # ⁴filename
We now have a tidy table of git logs, that we can use to observe patterns within repo commits.
Phew!
Obvious that it may seem, it took me quite a while to come up with this solution. On the way, I have learned to use grepl
and separate
more, which will be useful for my other projects.
Inspiration was taken from these two blogs: https://www.feststelltaste.de/reading-a-git-log-file-output-with-pandas/ and https://www.r-bloggers.com/2018/03/guide-to-tidy-git-analysis/.