
Predicting Tennis Matches with Logistic Regression

Keywords:

#Regression       #Predictive       #Modeling       #Sports

As part of a capstone project for my major in applied mathematics, I collaborated with a team to explore how machine learning could be used to predict the outcomes of professional tennis matches. By applying logistic regression to decades of historical match data, we aimed to model the probability of a player winning based on match conditions and player statistics.


Project Introduction

The project centered around classification through logistic regression, a statistical method well-suited for binary outcomes, such as win or loss. Unlike linear regression, which estimates continuous values, logistic regression outputs probabilities between 0 and 1, making it ideal for predicting match results. We trained our model on a dataset containing player stats, match outcomes, and contextual features like rank, age, and handedness.

We focused specifically on matches from the US Open, using the data to build a reliable predictive model. While one of my teammates implemented a Markov Chain approach for comparison, I focused on optimizing our logistic regression model and understanding the performance trade-offs between different techniques.

Model Optimization

My contributions involved implementing the logistic regression model and tuning it for performance. This required passing a weighted combination of the input features through the sigmoid function to obtain a win probability, calculating the likelihood of the observed match outcomes, and minimizing the negative log-likelihood (NLL) to find the best-fit parameters.
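To make the sigmoid and NLL steps concrete, here is a minimal sketch in plain Python. It is not our project code: the toy data, the single "rank difference" feature, the learning rate, and the choice of batch gradient descent as the optimizer are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def negative_log_likelihood(weights, bias, rows, labels):
    """Average NLL of the observed labels under the current parameters."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(rows)


def fit(rows, labels, learning_rate=0.1, steps=2000):
    """Minimize the NLL with plain batch gradient descent (an assumed optimizer)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # derivative of the per-match NLL with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical toy data: one feature per match (a scaled rank difference,
# positive when player A is ranked better), label 1 when player A won.
rows = [[1.5], [0.8], [0.3], [-0.2], [-0.9], [-1.4], [0.5], [-0.6]]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

weights, bias = fit(rows, labels)
initial_nll = negative_log_likelihood([0.0], 0.0, rows, labels)
final_nll = negative_log_likelihood(weights, bias, rows, labels)
print(f"NLL before fitting: {initial_nll:.3f}, after fitting: {final_nll:.3f}")
```

With all parameters at zero, every match gets a probability of 0.5, so the starting NLL is log 2. Fitting should lower it, and the learned weight on the rank-difference feature should come out positive for this toy data.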

Given a set of match conditions and player statistics, the model predicts:

  • 1 if the predicted winner is player A

  • 0 if the predicted winner is player B
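The decision rule above can be sketched as follows. The 0.5 threshold, the feature names, the parameter values, and the example matches are hypothetical and chosen only to show how a probability becomes a 1/0 prediction and how accuracy is then counted.

```python
import math


def predict_label(weights, bias, features, threshold=0.5):
    """Return 1 (player A predicted to win) or 0 (player B) for one match."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    probability_a = 1.0 / (1.0 + math.exp(-logit))
    return 1 if probability_a >= threshold else 0


def accuracy(predictions, outcomes):
    """Fraction of matches whose predicted winner equals the actual winner."""
    correct = sum(1 for p, y in zip(predictions, outcomes) if p == y)
    return correct / len(outcomes)


# Hypothetical parameters and matches, for illustration only.
weights = [1.2, -0.4]  # e.g. rank difference, age difference
bias = 0.1
matches = [[1.0, 0.5], [-2.0, 0.0], [0.2, 1.0], [-0.5, -1.0]]
outcomes = [1, 0, 1, 0]

predictions = [predict_label(weights, bias, m) for m in matches]
print(predictions, accuracy(predictions, outcomes))
```

Accuracy here is simply the share of matches where the predicted label agrees with the recorded winner, which is the sense in which a percentage of matches is "predicted correctly".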


The Markov Chain model, which is driven by service point winning percentages, produced varying results, with accuracy ranging between 1% and 7%. We attribute this to the model's inherent limitations. The logistic regression model, by contrast, predicted 92% of the matches correctly. The Markov model simplifies the complex dynamics of tennis by assuming that points are independent and memoryless, so it cannot account for influential factors such as player psychology, in-game strategy adjustments, and external conditions. Logistic regression's strength lies in its ability to incorporate multiple variables, each with an associated coefficient that directly affects the prediction. Its higher accuracy suggests that it captures a wider range of the factors that influence a player's performance.
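To show what the independence and memorylessness assumption looks like in practice, here is a sketch of the standard textbook calculation for a single service game. This is not my teammate's implementation, which is not reproduced on this page. The function name and the example value of 0.6 are assumptions.

```python
def game_win_probability(p):
    """Probability the server wins a game when every point is won
    independently with the same probability p (0 <= p <= 1)."""
    q = 1.0 - p
    # Win before deuce: the server takes 4 points while the receiver
    # takes 0, 1, or 2 (1, 4, and 10 possible point orderings).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce at 3 points each: C(6, 3) = 20 orderings.
    reach_deuce = 20 * p**3 * q**3
    # From deuce, the server must win two points in a row before the
    # receiver does; summing the geometric series gives p^2 / (1 - 2pq).
    win_from_deuce = p**2 / (1 - 2 * p * q)
    return before_deuce + reach_deuce * win_from_deuce


# Hypothetical example: a server who wins 60% of service points.
print(round(game_win_probability(0.6), 4))
```

The whole calculation depends on a single number, the service point winning percentage. A logistic regression model can instead take in rank, age, handedness, and any other feature we choose to include.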

You can view our presentation here:

You can also check out the GitHub here:
