There is no way to put all the laps in a single plot without transforming them first. Laps at races like Spa take over 1 minute and 40 seconds, while laps in Austria are just over a minute long. To compare laps from different races, I applied two transformations. First, I transformed the data to a scale of 0 to 1, just to be able to visualize all the laps. Second, I filtered the data to remove outliers and re-transformed it to either (1) a percentile scale or (2) a standardized scale. I will explain a bit more about those transformations in the analysis.

Regarding the outliers, I will first explain what they are and why I decided to remove them. My idea was to show the laps that were representative of the pace of each team. Outliers are laps that are not representative: for example, the first lap of the race, or laps done when the safety car was out. The best way to remove them is to know exactly which laps were affected (by the safety car, for example) and then filter those out. Unfortunately, I do not have that data for every single lap of the season.

So what’s the solution then? Well, I decided to use statistical methods to automatically detect those outliers. In the accompanying chart, you’ll see that I used five different methods to detect them.

Outlier detection is not an exact science. Some methods are very aggressive and may wrongly classify “normal” laps as outliers. Others are the other way around, a bit too soft, and may not remove some of those anomalies. The most important thing to know is that there’s no perfect choice about which method to use. I will explain a bit more about this process in the next section of the article.

Finally, I created a few plots that look quite similar and show the same data points, but present the information in different ways. I will explain their differences in that section of the article.

You can slide between both of the images shown. The first one shows every single lap done throughout the season. The second one shows how the data looks after removing those pesky outliers.

In the first chart, we can see how the races have many of these outliers. There are many red-coloured laps at the bottom of the chart, representing laps done under racing conditions. But as we know, the Brazilian GP required the safety car to come out for quite a few laps. Those laps are detected as outliers by every single method that we used.

The first method is a combination of the remaining four methods. What are the remaining four methods, you may ask? Just different ways of detecting anomalies in the data. In this case, we are trying to detect laps that are considered “strangely” slow: so slow that they just can’t be part of normal racing. In our case, methods three, four and five give us similar results. Method two is great for non-normal distributions but has a major disadvantage here. Since it’s so good at removing outliers from both sides of the distribution, it also considers many of the faster laps as outliers, which is something that we do not want.

The second chart shows our dataset after removing outliers. Now everything looks more uniform, leaving only representative laps. While the distribution is not entirely normal, meaning that it doesn’t form a perfect Gaussian bell, it’s still pretty good for our purposes. Believe it or not, this chart shows 20,477 laps.

As stated in the methodology section, the lap times needed to be transformed to show them in a single chart.
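
To make the idea of rescaling laps and filtering out slow ones a bit more concrete, here is a minimal sketch in Python. It is not the code behind the charts: the function names and the lap times are made up for illustration, and the upper-fence IQR rule is only a stand-in for a one-sided detector, since the article does not say which five methods were used.

```python
from statistics import mean, pstdev, quantiles


def scale_0_to_1(lap_times):
    """Min-max scale one race's laps: fastest lap -> 0, slowest lap -> 1."""
    fastest, slowest = min(lap_times), max(lap_times)
    span = slowest - fastest
    if span == 0:
        return [0.0 for _ in lap_times]
    return [(t - fastest) / span for t in lap_times]


def flag_slow_outliers(lap_times, k=1.5):
    """Flag laps slower than Q3 + k * IQR.

    Only the upper fence is used, so unusually fast laps are never flagged.
    This rule is an assumption, not necessarily one of the article's methods.
    """
    q1, _, q3 = quantiles(lap_times, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [t > upper_fence for t in lap_times]


def standardize(lap_times):
    """Convert lap times to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(lap_times), pstdev(lap_times)
    return [(t - mu) / sigma for t in lap_times]


# Hypothetical lap times in seconds for a single race (made-up values).
race = [66.1, 65.8, 66.4, 65.9, 66.0, 95.2, 66.3, 65.7]

scaled_all = scale_0_to_1(race)            # step 1: 0-to-1 scale, all laps
is_outlier = flag_slow_outliers(race)      # step 2: detect strangely slow laps
clean = [t for t, bad in zip(race, is_outlier) if not bad]
standardized = standardize(clean)          # step 3: re-transform the clean laps
```

Applying each step per race is what lets a lap from Spa sit on the same axis as a lap from Austria. Checking only the slow side of the distribution avoids the problem described for method two, where fast laps end up being discarded as well.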