Skip to contents

The PremPredict package helps you to generate sensible predictions for individual games or an entire season of the Premier League.

You can find my automatically-updated Premier League predictor (that uses this codebase) on the landing page of its repo.


Installation

You can install the development version of PremPredict from GitHub with:

# install.packages("pak")
pak::pak("p0bs/PremPredict")


Approach

I use a simplified version of David Firth’s approach and data from the Open Football repo on GitHub to predict the outcome of this season’s Premier League.

The predictions are based on a team’s strength, given its performance in recent times. But how should we define ‘recent’? In order to duck this question, you could choose a number of different time periods. Please note that 0.0% and 100.0% outcomes in the results do not necessarily signify certainty in their specific assessment, as:

  • this model is typically used with more than 1,000 simulations; and more pertinently
  • this model (like all models) is imperfect (but, I think, better than no model at all)


Example

Here is an example analysis, using data collected towards the end of the 2025/26 season.

First, we collect, combine and tidy the results data.

library(PremPredict)
data("example_thisSeason")

results_combined <- get_results(
  results_thisSeason = example_thisSeason, 
  seasons = 1L
  )

dim(results_combined)
#> [1] 760   9

Note that we want to look back across this season (so far) and its predecessor.

game_latest <- calc_game_latest(results = results_combined)

results_filtered <- get_results_filtered(
  results = results_combined, 
  index_game_latest = game_latest, 
  lookback_rounds = 76L
  )

dplyr::glimpse(results_filtered)
#> Rows: 760
#> Columns: 8
#> $ matchday <date> 2024-08-16, 2024-08-17, 2024-08-17, 2024-08-17, 2024-08-17, …
#> $ homeTeam <chr> "MUN", "IPS", "ARS", "EVE", "NEW", "NOT", "WHU", "BRE", "CHE"…
#> $ awayTeam <chr> "FUL", "LIV", "WOL", "BRI", "SOU", "BOU", "AST", "CPA", "MCI"…
#> $ FTHG     <dbl> 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 2, 0, 2, 4, 0, 4, 0, 1, 2, 2, 1…
#> $ FTAG     <dbl> 0, 2, 0, 3, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 0, 2, 1, 6, 0, 1…
#> $ FTR      <chr> "H", "A", "H", "A", "H", "D", "A", "H", "A", "D", "H", "A", "…
#> $ played   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ match    <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009"…

For reference, we can see the prevailing table.

data_table_current <- example_thisSeason |> 
  calc_table_current()

data_table_current |> 
  print_table_current()
Team Played GD Points
ARS 8 12 19
MCI 8 11 16
LIV 8 3 15
BOU 8 3 15
TOT 8 7 14
CHE 8 7 14
SUN 8 3 14
CPA 8 4 13
MUN 8 -1 13
BRI 8 1 12
AST 8 0 12
EVE 8 0 11
BRE 8 -1 10
NEW 8 0 9
FUL 8 -4 8
LEE 8 -6 8
BUR 8 -6 7
NOT 8 -10 5
WHU 8 -12 4
WOL 8 -11 2

We can now model the strengths of the sides at home and away.

data_model <- results_filtered |> 
  model_prepare_frame() |>
  model_run()

data_model
#> 
#> Call:
#> gnm::gnm(formula = count ~ -1 + s + draw, eliminate = match, 
#>     family = stats::quasipoisson, data = modelframe, start = rep(0, 
#>         2 * nTeams + 1))
#> 
#> Coefficients of interest:
#> sARS_home  sAST_home  sBOU_home  sBRE_home  sBRI_home  sBUR_home  sCHE_home  
#>    3.5887     3.3910     2.6672     2.7102     2.8616     3.0648     3.6256  
#> sCPA_home  sEVE_home  sFUL_home  sLEE_home  sLIV_home  sMCI_home  sMUN_home  
#>    2.1443     2.2261     2.2417     1.9925     4.2522     3.6907     2.2057  
#> sNEW_home  sNOT_home  sSUN_home  sTOT_home  sWHU_home  sWOL_home  sARS_away  
#>    3.3055     2.3953     4.3863     1.3325     1.0388     1.2459     3.5010  
#> sAST_away  sBOU_away  sBRE_away  sBRI_away  sBUR_away  sCHE_away  sCPA_away  
#>    2.2927     2.4675     2.0120     2.4678   -27.1656     2.5494     2.7080  
#> sEVE_away  sFUL_away  sLEE_away  sLIV_away  sMCI_away  sMUN_away  sNEW_away  
#>    1.9873     2.2584     1.1555     3.4702     2.7270     1.5472     2.3185  
#> sNOT_away  sSUN_away  sTOT_away  sWHU_away  sWOL_away       draw  
#>    2.7723     1.7477     1.9700     2.1102     1.5372     0.5316  
#> 
#> Deviance:            849.1585 
#> Pearson chi-squared: 904.8874 
#> Residual df:         879

[I will add further details and more explanation in due course.]

Next, we use these team strengths to model future games across the season.

data_parameters_unplayed <- data_model |> 
  model_extract_parameters()

data_model_parameters_unplayed <- model_parameters_unplayed(
  results = results_filtered,
  model_parameters = data_parameters_unplayed
  )

data_points_expected_remaining <- data_model_parameters_unplayed |>  
  calc_points_expected_remaining()

calc_points_expected_total(
  table_current = data_table_current,
  points_expected = data_points_expected_remaining
  ) |> 
  knitr::kable()
midName Exp_Points_Ave
Liverpool 81.65226
Arsenal 81.16959
Man City 69.97195
Chelsea 66.99225
Sunderland 65.06298
Aston Villa 60.51185
Newcastle 59.42471
Bournemouth 58.81905
Brighton 58.03840
Crystal Palace 55.85868
Brentford 50.92066
Notts Forest 50.42615
Everton 48.28917
Fulham 47.81465
Man Utd 46.34688
Tottenham 40.73357
Leeds Utd 36.83454
Burnley 34.98184
West Ham 33.80618
Wolves 27.85355

On this basis, Liverpool look like slight favourites to win the season.

In order to project the likelihood of them becoming champions, however, we need to simulate many possible outcomes.

number_simulations <- 100000

data_simulate_games <- simulate_games(
  data_model_parameters_unplayed = data_model_parameters_unplayed,
  value_number_sims = number_simulations,
  value_seed = 2602L
  )

data_simulate_standings <- simulate_standings(
  data_game_simulations = data_simulate_games,
  data_table_latest = data_table_current
  )

simulate_outcomes(
  data_standings_simulations = data_simulate_standings,
  value_number_sims = number_simulations
  ) |> 
  knitr::kable()
midName champion top_four top_five top_six top_half relegation
Arsenal 0.46897 0.97800 0.99075 0.99599 0.99988 0.00000
Liverpool 0.46502 0.98261 0.99323 0.99723 0.99991 0.00000
Man City 0.04063 0.67953 0.80406 0.88171 0.98992 0.00001
Chelsea 0.01529 0.49399 0.66444 0.78457 0.97748 0.00002
Sunderland 0.00507 0.33786 0.53266 0.69126 0.96990 0.00000
Aston Villa 0.00173 0.14474 0.25864 0.39099 0.84651 0.00040
Bournemouth 0.00153 0.11152 0.20630 0.32196 0.78515 0.00061
Brighton 0.00070 0.08220 0.16333 0.26919 0.75086 0.00085
Newcastle 0.00068 0.10827 0.20384 0.32229 0.80497 0.00049
Crystal Palace 0.00031 0.05503 0.11214 0.19198 0.64739 0.00269
Brentford 0.00004 0.00874 0.02251 0.04878 0.33557 0.01433
Notts Forest 0.00001 0.00838 0.02179 0.04487 0.30181 0.02256
Everton 0.00001 0.00409 0.01133 0.02474 0.21777 0.03115
Fulham 0.00001 0.00323 0.00913 0.02041 0.18791 0.04094
Man Utd 0.00000 0.00167 0.00506 0.01200 0.14087 0.05987
Tottenham 0.00000 0.00013 0.00072 0.00181 0.03579 0.20681
Leeds Utd 0.00000 0.00000 0.00006 0.00018 0.00599 0.44530
West Ham 0.00000 0.00001 0.00001 0.00004 0.00210 0.67267
Burnley 0.00000 0.00000 0.00000 0.00000 0.00012 0.58946
Wolves 0.00000 0.00000 0.00000 0.00000 0.00010 0.91184

Alternatively, this table can be generated, without calculating all intermediate steps, by running run_simulations.