An R package that helps you generate sensible predictions for individual games or an entire season of the Premier League • PremPredict

The PremPredict package helps you to generate sensible predictions for individual games or an entire season of the Premier League.

You can find my automatically-updated Premier League predictor (that uses this codebase) on the landing page of its repo.

Installation

You can install the development version of PremPredict from GitHub with:

# install.packages("pak")
pak::pak("p0bs/PremPredict")

Approach

I use a simplified version of David Firth’s approach and data from the Open Football repo on GitHub to predict the outcome of this season’s Premier League.

The predictions are based on each team’s strength, as estimated from its recent results. But how should we define ‘recent’? To sidestep this question, you can run the model over several different lookback periods. Please note that outcomes of 0.0% and 100.0% in the results do not necessarily signify certainty, as:

this model is typically used with more than 1,000 simulations; and more pertinently
this model (like all models) is imperfect (but, I think, better than no model at all)
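To illustrate the first point, here is a minimal sketch of why a reported 0.0% need not mean "impossible". Both values below are hypothetical and are not taken from the package: an outcome with a small true probability can easily fail to appear in any of a finite number of simulated seasons.

```r
# Hypothetical values, chosen only for illustration
p_true <- 1 / 5000      # assumed true probability of a rare outcome
n_sims <- 1000          # assumed number of simulated seasons

# Probability that the outcome appears in none of the simulations,
# in which case the results table would report it as 0.0%
p_never_seen <- (1 - p_true)^n_sims
p_never_seen
```

Under these assumptions the rare outcome is missed entirely roughly four times out of five, which is one reason to use a large number of simulations.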

Example

Here is an example analysis, using data collected early in the 2025/26 season (after each team had played eight games).

First, we collect, combine and tidy the results data.

library(PremPredict)
data("example_thisSeason")

results_combined <- get_results(
  results_thisSeason = example_thisSeason, 
  seasons = 1L
  )

dim(results_combined)
#> [1] 760   9

Note that we want to look back across this season (so far) and its predecessor.

game_latest <- calc_game_latest(results = results_combined)

results_filtered <- get_results_filtered(
  results = results_combined, 
  index_game_latest = game_latest, 
  lookback_rounds = 76L
  )

dplyr::glimpse(results_filtered)
#> Rows: 760
#> Columns: 8
#> $ matchday <date> 2024-08-16, 2024-08-17, 2024-08-17, 2024-08-17, 2024-08-17, …
#> $ homeTeam <chr> "MUN", "IPS", "ARS", "EVE", "NEW", "NOT", "WHU", "BRE", "CHE"…
#> $ awayTeam <chr> "FUL", "LIV", "WOL", "BRI", "SOU", "BOU", "AST", "CPA", "MCI"…
#> $ FTHG     <dbl> 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 2, 0, 2, 4, 0, 4, 0, 1, 2, 2, 1…
#> $ FTAG     <dbl> 0, 2, 0, 3, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 0, 2, 1, 6, 0, 1…
#> $ FTR      <chr> "H", "A", "H", "A", "H", "D", "A", "H", "A", "D", "H", "A", "…
#> $ played   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ match    <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009"…
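The row counts above are consistent with two full 20-team seasons. As a quick check of the arithmetic, assuming that lookback_rounds counts rounds of ten games each (an assumption about the argument's meaning, not something confirmed here):

```r
n_teams <- 20L

games_per_round   <- n_teams %/% 2L                       # every team plays once
rounds_per_season <- 2L * (n_teams - 1L)                  # home and away against each rival
games_per_season  <- games_per_round * rounds_per_season

seasons_covered <- 2L                                     # this season and its predecessor
rounds_covered  <- seasons_covered * rounds_per_season
games_covered   <- seasons_covered * games_per_season

c(rounds = rounds_covered, games = games_covered)
```

This gives 76 rounds and 760 games, matching lookback_rounds = 76L and the 760 rows shown.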

For reference, we can see the prevailing table.

data_table_current <- example_thisSeason |> 
  calc_table_current()

data_table_current |> 
  print_table_current()

Team	Played	GD	Points
ARS	8	12	19
MCI	8	11	16
LIV	8	3	15
BOU	8	3	15
TOT	8	7	14
CHE	8	7	14
SUN	8	3	14
CPA	8	4	13
MUN	8	-1	13
BRI	8	1	12
AST	8	0	12
EVE	8	0	11
BRE	8	-1	10
NEW	8	0	9
FUL	8	-4	8
LEE	8	-6	8
BUR	8	-6	7
NOT	8	-10	5
WHU	8	-12	4
WOL	8	-11	2
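For readers who want to see how such a table arises from the results, here is a rough base R sketch on toy data. It reuses the column names homeTeam, awayTeam, FTHG and FTAG from the results shown earlier, but the teams and scores are invented, and it is not the implementation of calc_table_current().

```r
# Toy results with invented teams and scores
results <- data.frame(
  homeTeam = c("AAA", "BBB", "CCC", "AAA"),
  awayTeam = c("BBB", "CCC", "AAA", "CCC"),
  FTHG = c(2, 1, 0, 3),
  FTAG = c(0, 1, 1, 1)
)

# One row per team per game, seen from that team's side
long <- rbind(
  data.frame(Team = results$homeTeam, GF = results$FTHG, GA = results$FTAG),
  data.frame(Team = results$awayTeam, GF = results$FTAG, GA = results$FTHG)
)

# 3 points for a win, 1 for a draw, 0 for a defeat
long$Points <- ifelse(long$GF > long$GA, 3, ifelse(long$GF == long$GA, 1, 0))
long$GD <- long$GF - long$GA
long$Played <- 1

# Sum per team, then rank by points and goal difference
table_toy <- aggregate(cbind(Played, GD, Points) ~ Team, data = long, FUN = sum)
table_toy <- table_toy[order(-table_toy$Points, -table_toy$GD), ]
table_toy
```

The sketch ranks by points and then goal difference only; further tie-breakers (such as goals scored) are left out.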

We can now model the strengths of the sides at home and away.

data_model <- results_filtered |> 
  model_prepare_frame() |>
  model_run()

data_model
#> 
#> Call:
#> gnm::gnm(formula = count ~ -1 + s + draw, eliminate = match, 
#>     family = stats::quasipoisson, data = modelframe, start = rep(0, 
#>         2 * nTeams + 1))
#> 
#> Coefficients of interest:
#> sARS_home  sAST_home  sBOU_home  sBRE_home  sBRI_home  sBUR_home  sCHE_home  
#>    3.5887     3.3910     2.6672     2.7102     2.8616     3.0648     3.6256  
#> sCPA_home  sEVE_home  sFUL_home  sLEE_home  sLIV_home  sMCI_home  sMUN_home  
#>    2.1443     2.2261     2.2417     1.9925     4.2522     3.6907     2.2057  
#> sNEW_home  sNOT_home  sSUN_home  sTOT_home  sWHU_home  sWOL_home  sARS_away  
#>    3.3055     2.3953     4.3863     1.3325     1.0388     1.2459     3.5010  
#> sAST_away  sBOU_away  sBRE_away  sBRI_away  sBUR_away  sCHE_away  sCPA_away  
#>    2.2927     2.4675     2.0120     2.4678   -27.1656     2.5494     2.7080  
#> sEVE_away  sFUL_away  sLEE_away  sLIV_away  sMCI_away  sMUN_away  sNEW_away  
#>    1.9873     2.2584     1.1555     3.4702     2.7270     1.5472     2.3185  
#> sNOT_away  sSUN_away  sTOT_away  sWHU_away  sWOL_away       draw  
#>    2.7723     1.7477     1.9700     2.1102     1.5372     0.5316  
#> 
#> Deviance:            849.1585 
#> Pearson chi-squared: 904.8874 
#> Residual df:         879

[I will add further details and more explanation in due course.]
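In the meantime, here is one possible way to read the coefficients, offered as a sketch rather than as the package's method. It assumes Davidson-type outcome weights with a one-third power in the draw term, which is my understanding of Firth's alt-3 model; the simplified version used here may differ. The three coefficients are copied from the output above, for Arsenal at home to Liverpool.

```r
# Coefficients copied from the printed model output above
s_home_ars <- 3.5887   # sARS_home
s_away_liv <- 3.4702   # sLIV_away
draw_coef  <- 0.5316   # draw

# ASSUMPTION: Davidson-type weights with a one-third power in the draw term;
# this form is not confirmed by the package documentation
w_home <- exp(s_home_ars)
w_away <- exp(s_away_liv)
w_draw <- exp(draw_coef + (s_home_ars + s_away_liv) / 3)

# Normalise the weights so that the three outcomes sum to one
probs <- c(home = w_home, draw = w_draw, away = w_away) /
  (w_home + w_draw + w_away)
round(probs, 3)
```

Under this reading, a larger home or away coefficient raises that side's chance of winning, and the draw coefficient scales the chance of a draw for every fixture.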

Next, we use these team strengths to model future games across the season.

data_parameters_unplayed <- data_model |> 
  model_extract_parameters()

data_model_parameters_unplayed <- model_parameters_unplayed(
  results = results_filtered,
  model_parameters = data_parameters_unplayed
  )

data_points_expected_remaining <- data_model_parameters_unplayed |>  
  calc_points_expected_remaining()

calc_points_expected_total(
  table_current = data_table_current,
  points_expected = data_points_expected_remaining
  ) |> 
  knitr::kable()

midName	Exp_Points_Ave
Liverpool	81.65226
Arsenal	81.16959
Man City	69.97195
Chelsea	66.99225
Sunderland	65.06298
Aston Villa	60.51185
Newcastle	59.42471
Bournemouth	58.81905
Brighton	58.03840
Crystal Palace	55.85868
Brentford	50.92066
Notts Forest	50.42615
Everton	48.28917
Fulham	47.81465
Man Utd	46.34688
Tottenham	40.73357
Leeds Utd	36.83454
Burnley	34.98184
West Ham	33.80618
Wolves	27.85355
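The link between outcome probabilities and expected points can be sketched for a single unplayed game. The probabilities below are hypothetical, not package output; presumably each team's total above is its current points plus the sum of such expectations over its remaining games.

```r
# Hypothetical outcome probabilities for one unplayed game
p_home_win <- 0.50
p_draw     <- 0.25
p_away_win <- 0.25

# A win earns 3 points, a draw 1 point and a defeat 0 points
exp_points_home <- 3 * p_home_win + 1 * p_draw
exp_points_away <- 3 * p_away_win + 1 * p_draw

c(home = exp_points_home, away = exp_points_away)
```

With these inputs the home side expects 1.75 points from the game and the away side 1 point.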

On this basis, Liverpool look like very slight favourites to win the title, just ahead of Arsenal.

In order to project the likelihood of them becoming champions, however, we need to simulate many possible outcomes.

number_simulations <- 100000

data_simulate_games <- simulate_games(
  data_model_parameters_unplayed = data_model_parameters_unplayed,
  value_number_sims = number_simulations,
  value_seed = 2602L
  )

data_simulate_standings <- simulate_standings(
  data_game_simulations = data_simulate_games,
  data_table_latest = data_table_current
  )

simulate_outcomes(
  data_standings_simulations = data_simulate_standings,
  value_number_sims = number_simulations
  ) |> 
  knitr::kable()

midName	champion	top_four	top_five	top_six	top_half	relegation
Arsenal	0.46897	0.97800	0.99075	0.99599	0.99988	0.00000
Liverpool	0.46502	0.98261	0.99323	0.99723	0.99991	0.00000
Man City	0.04063	0.67953	0.80406	0.88171	0.98992	0.00001
Chelsea	0.01529	0.49399	0.66444	0.78457	0.97748	0.00002
Sunderland	0.00507	0.33786	0.53266	0.69126	0.96990	0.00000
Aston Villa	0.00173	0.14474	0.25864	0.39099	0.84651	0.00040
Bournemouth	0.00153	0.11152	0.20630	0.32196	0.78515	0.00061
Brighton	0.00070	0.08220	0.16333	0.26919	0.75086	0.00085
Newcastle	0.00068	0.10827	0.20384	0.32229	0.80497	0.00049
Crystal Palace	0.00031	0.05503	0.11214	0.19198	0.64739	0.00269
Brentford	0.00004	0.00874	0.02251	0.04878	0.33557	0.01433
Notts Forest	0.00001	0.00838	0.02179	0.04487	0.30181	0.02256
Everton	0.00001	0.00409	0.01133	0.02474	0.21777	0.03115
Fulham	0.00001	0.00323	0.00913	0.02041	0.18791	0.04094
Man Utd	0.00000	0.00167	0.00506	0.01200	0.14087	0.05987
Tottenham	0.00000	0.00013	0.00072	0.00181	0.03579	0.20681
Leeds Utd	0.00000	0.00000	0.00006	0.00018	0.00599	0.44530
West Ham	0.00000	0.00001	0.00001	0.00004	0.00210	0.67267
Burnley	0.00000	0.00000	0.00000	0.00000	0.00012	0.58946
Wolves	0.00000	0.00000	0.00000	0.00000	0.00010	0.91184
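The idea behind these proportions can be shown with a toy Monte Carlo sketch in base R. Everything in it is hypothetical (two invented teams, one game left each, assumed outcome probabilities) and it does not reproduce the internals of simulate_games() or simulate_standings().

```r
set.seed(2602)  # fixed seed so that the sketch is reproducible

# Hypothetical two-team title race with one game left each
points_now <- c(AAA = 80, BBB = 79)

# Assumed probabilities of a win, draw and defeat in each team's final game
probs <- list(AAA = c(0.5, 0.3, 0.2), BBB = c(0.6, 0.2, 0.2))
n_sims <- 10000

# Final points for each team in each simulated season (one column per team)
sim_points <- sapply(names(points_now), function(team) {
  points_now[[team]] +
    sample(c(3, 1, 0), n_sims, replace = TRUE, prob = probs[[team]])
})

# Share of simulations in which each team finishes strictly ahead
champion_share <- c(
  AAA = mean(sim_points[, "AAA"] > sim_points[, "BBB"]),
  BBB = mean(sim_points[, "BBB"] > sim_points[, "AAA"])
)
champion_share
```

For these inputs the exact probability that AAA finishes ahead is 0.66, so the simulated share should land close to that value. The two shares do not sum to one because seasons that end level on points are not resolved here; a fuller version would break such ties on goal difference.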

Alternatively, you can generate this table without calculating the intermediate steps by running run_simulations().
