5 oct. 2017

A trick to win: run fast... and be late !

Two weeks ago, I ran the Paris-Versailles running race. It is a 16-km race which is famously known for its steep 2-km long 7% path right in the middle of the trail. I did a fair timing, better than last years.
When  I checked my results on th website, I came across the dataset of the preliminary results, and I was pleased to see that these results included not only the final time of all runners, but also their mid-race results. I was at the top of the hill after 40 min and it took me 1h21 to run the whole race.
Résultat de recherche d'images pour "paris versailles race profile"
The race profile

So I started to wonder how could I improve my results for next year:
  • If I grab 5 minutes next year, how will my ranking improve? 
  • Then, do I need to put my efforts in the first part or in the second part of the race?  


So I opened a new R session and here we are. I first download the dataset using the great tabulizer package, which was fairly quick to convert the 583-pdf pages into a nice table.

Then I used interesting features in ggplot2 to elaborate space-time diagram to see how I was performing compared to my sibling runners. But it is not because I performed well (or not) that I made a good race... So what is a 'good' race?

Let start first with parsing the pdf file and tidying the resulting dataset.

library(pacman)
p_load("tabulizer")
p_load("magrittr", "dplyr", "tidyr","lubridate", "ggplot2", "reshape2")

location <- "http://www.parisversailles.com/docs/2017_resultats_provisoires.pdf"
liste <- extract_tables(location)
df <- 
as.data.frame(do.call(rbind, liste))
names(df) <- c("dossard", "nom", "prenom", "temps", "place", "categorie",
                  "hDepart", "hCotedesgardes", "hArrivee")
df %<>% mutate_all(funs(as.character(.))) %>% ungroup() %>%
  filter(dossard != "Dossard" & prenom != "Prenom") #%>% select(-c("as.caracter(categorie)"))


Now let's add a bunch of new variables using the quite handy lubridate package.

options(digits.secs=2)
df %<>% mutate_at(c("hArrivee", "hDepart", "hCotedesgardes"), 
                  funs(ymd_hms(paste0("2017-09-25 ",.)))) %>% 
         mutate(totalTime = difftime(hArrivee, hDepart, units="secs"),
                halfTime = difftime(hCotedesgardes, hDepart, units="secs")) %>%
         mutate(Rank = row_number(as.numeric(totalTime)),
                partCotedesgardes = as.numeric(halfTime) / as.numeric(totalTime))

Last, I recalculate ranks (remember these where intermediate results and some runner where misclassified), ranks by category and finally add deciles

df %<>% group_by(categorie) %>% mutate(rankCat = row_number(Rank)) %>%
      ungroup() %>% mutate(quant = cut(Rank, breaks = quantile(Rank, probs = seq(0, 1, 0.1),na.rm = TRUE), labels=1:10, right = TRUE, include.lowest=TRUE))

Let's have a look at the clean dataset


glimpse(df)
Observations: 22,146
Variables: 15
$ dossard           <chr> "18", "3", "25", "616", "43", "17", "11", "23", "699", "5", "37", "72...
$ nom               <chr> "GEDAMU", "GUIBAULT", "AZZAOUI", "RAUDIN", "LORRIAUX", "BETOUDJI", "G...
$ prenom            <chr> "Getinet", "Thierry", "Abdelmajid", "Stephane", "J-pierre", "Valentin...
$ temps             <chr> "00:52:50", "00:53:12", "00:54:48", "00:54:54", "00:54:58", "00:55:01...
$ place             <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", ...
$ categorie         <chr> "SH", "VH1", "SH", "SH", "VH1", "SH", "SH", "SH", "SH", "SH", "VH1", ...
$ hDepart           <dttm> 2017-09-25 09:59:57, 2017-09-25 09:59:57, 2017-09-25 09:59:57, 2017-...
$ hCotedesgardes    <dttm> 2017-09-25 10:26:45, 2017-09-25 10:26:45, 2017-09-25 10:27:05, 2017-...
$ hArrivee          <dttm> 2017-09-25 10:52:46, 2017-09-25 10:53:08, 2017-09-25 10:54:44, 2017-...
$ totalTime         <time> 3169.29 secs, 3191.89 secs, 3287.87 secs, 3293.73 secs, 3297.28 secs...
$ halfTime          <time> 1608.89 secs, 1608.73 secs, 1628.67 secs, 1665.71 secs, 1665.32 secs...
$ Rank              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 19...
$ partCotedesgardes <dbl> 0.5076500, 0.5040055, 0.4953572, 0.5057215, 0.5050587, 0.5043660, 0.5...
$ rankCat           <int> 1, 1, 2, 3, 2, 4, 5, 6, 7, 8, 3, 1, 9, 4, 10, 11, 12, 13, 15, 14, 16,...
$ quant             <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
So now, how did I performed ? To show a vertical bar in density plot, I use a trick that I found here


Arrival time of runners, my wife's position and mine (in secs)
temps <- c(as.numeric(df$totalTime[df$dossard==denis[1]]),
           as.numeric(df$totalTime[df$dossard==armelle[1]]))

p <- df %>% mutate(totalTime = as.numeric(totalTime)) %>% 
  ggplot(aes(totalTime))+geom_density()  

d <- ggplot_build(p)$data[[1]]
p + geom_segment(x=temps[1], xend=temps[1], y=0, yend=approx(x = d$x, y = d$y, xout = temps[1])$y,       colour="red", size=1)+   geom_segment(x=temps[2], xend=temps[2], y=0, yend=approx(x = d$x, y = d$y, xout = temps[2])$y,            colour="blue", size=1) 


Interesting to notice, the best runner did it in 52 min, so an average of 18 km/h.

What if I want to run it 1h17 next time?

df %>% filter((totalTime >= 4620 & totalTime <= 4621) | dossard == 17421) %>% 
  summarise(ecartTemps=(max(totalTime)-min(totalTime))/60, ecartPlace = max(Rank)-min(Rank))

Well, I will grab 1889 ranks, out of 22132.


Now let's get back to my first question.  How did I managed my race compared to my running siblings? By siblings I mean people whose performance were quite similar to mine. To do so, I slightly adapted the space-time graphic to reflect how I performed at the mid-point of the race, on top of this insane 7%-hill. 
After a bit of scripting, melting,, casting my dataset (see reshape2) to fit it into the grammar of graphics, I get...

denis=c(17421, "Denis")
armelle=c(17422, "Armelle")
vincent=c(15845, "Vincent")
godefroy=c(21414, "Godefroy")

# Extraire les coureurs proche
k = denis
kTime = df %>% filter(dossard==k[1])%>%select(totalTime) %>% as.numeric()

df.tmp <- df %>% filter(totalTime >=(kTime-60)  & totalTime <= (kTime+60)) %>% 
  mutate(starTime = 0) %>%
  select(dossard, starTime, halfTime, totalTime) 

df.tmp <- melt(df.tmp, id.vars = "dossard")

df.tmp %<>% mutate(x=ifelse(variable=="starTime",0,            # convert x as numeric
                           ifelse(variable=="halfTime", 8, 16))) %>% 
  filter(!is.na(value)) %>% group_by(variable) %>% mutate(moy = mean(value)) %>% 
  ungroup() %>% mutate(ecart = value - moy)

df.tmp %>% 
  ggplot(aes(x, y=ecart, col=dossard), size=.2)+geom_line()+
  theme(legend.position = 'none') +scale_color_grey() + 
  geom_line(data=subset(df.tmp, dossard==k[1]), aes(y=ecart), col="red")+
  ggtitle(paste0("La course de ", k[2]))

Here is me, my girlfriend and my step-brother. 

In the graph, you see the end-point race at 16 km, and the mid-point race at 8-km in x-axis.
Each line represent a runner and its deviation compared to the mean. 
  • A high point indicates a runner late at the midpoint (he took more time than others), but catching up quickly to finish with the same time as me. 
  • A low point indicates an early midpoint timing (the runner performed better than the others)


Turn out that I was pretty much in advance (but I felt it), and lots of my siblings caught up with me.  Compared to their siblings, my girlfriend and my step-brother were very much in line. 

But does it make a 'good' race, and what is a 'good' race anyway?
Let see how the best runners compared to the others, using the percentage of running time spent during the first half of the race.



df %>% filter(dossard %in% c(17421, 17422, 21414))%>% select(dossard, partCotedesgardes, totalTime)
# A tibble: 3 x 3
dossard percentHalf totTime
 <chr>       <dbl>    <time>
21414    0.5000367 4219.77 s
17421    0.4908224 4883.09 s
17422 0.4974121 5481.29 s

I spent 49% of my race in the first half, and my brother-in-low 50%. Let see what this value is by decile.

df %>% filter(!is.na(partCotedesgardes)) %>% 
     group_by(quant)%>% 
     summarize(moy=mean(partCotedesgardes), sd=sd(partCotedesgardes)) 
# A tibble: 10 x 3
    quant       moy          sd
   <fctr>     <dbl>       <dbl>
 1      1 0.5022714 0.008330851
 2      2 0.5020689 0.010449567
 3      3 0.5010444 0.011329228
 4      4 0.4997301 0.012988530
 5      5 0.4991441 0.012154734
 6      6 0.4980548 0.013629279
 7      7 0.4963385 0.014371913
 8      8 0.4950229 0.015044514
 9      9 0.4914422 0.016621537
10     10 0.4880481 0.024609235

Interesting, the best runner (decile 1), spend on average 50.2% of the race in the first half, while the slowest runners (decile 9 or 10) spend 49.1%. "Slow and steady wins the race !"
To this respect, I should have classified myself among the last decile!
I guess the explanation is that the best runners know how to weight their effort and keep some momentum, whereas the steep hill "kill" the others.