Methodology
Overview
estimador.pt uses Bayesian statistical models to forecast Portuguese elections. Our approach is fundamentally probabilistic: rather than predicting a single outcome, we estimate probability distributions that reflect the inherent uncertainty in the electoral process.
Our models are implemented in PyMC, a probabilistic programming library for defining complex statistical models and fitting them with efficient Bayesian inference.
Parliamentary Elections
Model Structure
The parliamentary election model combines polling data with historical election results to estimate the evolution of party support over time. The core structure uses three overlapping Gaussian Processes (GPs) that capture dynamics at different time scales:
1. Baseline GP (Long-Term Trend)
- Time scale: ~4 years
- Captures structural changes in the party system
- Uses an ExpQuad (Exponential Quadratic) kernel
2. Medium-Term GP
- Time scale: ~1 year
- Captures political cycles and gradual shifts in public opinion
- Uses a Matern 5/2 kernel
3. Short-Term GP
- Time scale: ~14 days
- Captures campaign dynamics and reactions to events
- Uses a Matern 3/2 kernel to allow faster variations
The combination of these three processes separates long-term signal from short-term noise while still capturing real movements during the campaign.
Technical Details: Gaussian Processes
Gaussian Processes are a non-parametric way to model unknown functions. We define priors over functions rather than fixed parameters, which allows the model to learn the shape of trends from the data.
We use the HSGP (Hilbert Space Gaussian Process) approximation for computational efficiency. This technique approximates the GP with a fixed set of basis functions, so the cost grows linearly with the number of observations instead of cubically, with little loss of precision when enough basis functions are used.
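To make the basis-function idea concrete, here is a minimal NumPy sketch of the Hilbert-space approximation for a one-dimensional squared-exponential (ExpQuad) kernel. It is an illustration only, not the PyMC code used in the model: the rescaled inputs on [-1, 1], the boundary `L=2.0`, the number of basis functions `m=30`, and the values of `sigma` and `ell` are all assumptions chosen for the example.

```python
import numpy as np

def expquad_kernel(x1, x2, sigma, ell):
    """Exact squared-exponential covariance between two 1-D point sets."""
    diff = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (diff / ell) ** 2)

def hsgp_basis(x, m, L):
    """First m Laplacian eigenfunctions on [-L, L] and sqrt of their eigenvalues."""
    j = np.arange(1, m + 1)
    sqrt_lam = j * np.pi / (2.0 * L)
    phi = np.sqrt(1.0 / L) * np.sin(sqrt_lam[None, :] * (x[:, None] + L))
    return phi, sqrt_lam

def expquad_spectral_density(omega, sigma, ell):
    """Spectral density of the 1-D squared-exponential kernel."""
    return sigma**2 * np.sqrt(2.0 * np.pi) * ell * np.exp(-0.5 * (ell * omega) ** 2)

def hsgp_kernel(x, sigma, ell, m=30, L=2.0):
    """Low-rank kernel approximation: sum_j S(sqrt(lambda_j)) phi_j(x) phi_j(x')."""
    phi, sqrt_lam = hsgp_basis(x, m, L)
    weights = expquad_spectral_density(sqrt_lam, sigma, ell)
    return (phi * weights[None, :]) @ phi.T

# Hypothetical inputs: time rescaled to [-1, 1], illustrative hyperparameters.
x = np.linspace(-1.0, 1.0, 50)
exact = expquad_kernel(x, x, sigma=0.3, ell=0.5)
approx = hsgp_kernel(x, sigma=0.3, ell=0.5)
max_abs_error = np.max(np.abs(exact - approx))
print(f"largest absolute difference between kernels: {max_abs_error:.2e}")
```

The approximation only needs the `m` basis columns in `phi`, so the full 50 x 50 covariance matrix never has to be factorized. Accuracy depends on `m` and `L` relative to the lengthscale, which is why shorter lengthscales generally require more basis functions.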
The baseline GP covariance function is:
k(t, t') = σ² × exp(-|t-t'|² / (2ℓ²))
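For comparison, the three kernel families named in this section can be written out directly. The sketch below uses the standard textbook forms of the ExpQuad, Matern 5/2, and Matern 3/2 kernels. Adding the three components together, the shared amplitude of 0.25, and the use of the prior centers as fixed lengthscales are assumptions made for illustration, not a description of the fitted model.

```python
import numpy as np

def expquad(r, sigma, ell):
    """Squared-exponential kernel: sigma^2 * exp(-r^2 / (2 ell^2))."""
    return sigma**2 * np.exp(-(r**2) / (2.0 * ell**2))

def matern52(r, sigma, ell):
    """Matern 5/2 kernel, smoother than Matern 3/2."""
    a = np.sqrt(5.0) * np.abs(r) / ell
    return sigma**2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)

def matern32(r, sigma, ell):
    """Matern 3/2 kernel, which allows rougher, faster-moving functions."""
    a = np.sqrt(3.0) * np.abs(r) / ell
    return sigma**2 * (1.0 + a) * np.exp(-a)

def combined_kernel(r, amp=0.25):
    """Assumed additive combination, lengthscales in days (4 years, 1 year, 14 days)."""
    return expquad(r, amp, 1460.0) + matern52(r, amp, 365.0) + matern32(r, amp, 14.0)

# Covariance between two dates separated by 0, 14, 365 and 1460 days.
lags_days = np.array([0.0, 14.0, 365.0, 1460.0])
covariances = combined_kernel(lags_days)
print(dict(zip(lags_days.tolist(), covariances.round(4).tolist())))
```

At short lags all three components contribute, while at a lag of several years only the baseline component still carries meaningful covariance, which is the separation of time scales described above.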
Priors used:
- Baseline lengthscale: LogNormal(μ=log(1460), σ=0.3) — centered at ~4 years
- Medium-term lengthscale: LogNormal(μ=log(365), σ=0.5) — centered at ~1 year
- Short-term lengthscale: LogNormal(μ=log(14), σ=0.3) — centered at ~14 days
- GP amplitudes: HalfNormal(σ), with σ between 0.2 and 0.3
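A quick calculation shows what these lengthscale priors imply. For a LogNormal(μ, σ) distribution the median is exp(μ), so "centered at" refers to the median, and the central 95% interval runs from exp(μ - 1.96σ) to exp(μ + 1.96σ). The helper name `lognormal_interval` below is hypothetical and exists only for this example.

```python
import math

def lognormal_interval(median_days, sigma, z=1.96):
    """Central interval of a LogNormal whose median is median_days."""
    mu = math.log(median_days)
    return math.exp(mu - z * sigma), math.exp(mu + z * sigma)

# (median in days, sigma) for each lengthscale prior listed in the text.
priors = {"baseline": (1460.0, 0.3), "medium-term": (365.0, 0.5), "short-term": (14.0, 0.3)}

for name, (median, sigma) in priors.items():
    low, high = lognormal_interval(median, sigma)
    print(f"{name}: median {median:.0f} days, central 95% interval {low:.0f} to {high:.0f} days")
```

The intervals are symmetric on the log scale, so each prior allows the lengthscale to be larger or smaller than its center by the same multiplicative factor.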
Pollster House Effects
Each polling organization has systematic tendencies to over- or under-estimate certain parties. Our model estimates these "house effects" from historical data.
For example, if a pollster consistently overestimates PS by 2 percentage points, the model corrects for this tendency when aggregating polls from multiple sources.
House effects are modeled with a zero-sum constraint across parties: if a pollster overestimates one party, it must underestimate one or more others to compensate. The prior for the house effect standard deviation is HalfNormal(σ=0.05) per party, allowing typical biases of ~2-3 percentage points.
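The zero-sum constraint can be illustrated by centering unconstrained effects across parties. This is a sketch under assumptions: the array shape (pollsters by parties), the random placeholder values, and the centering approach itself are illustrative, and the model may impose the constraint differently (for example through a dedicated zero-sum distribution).

```python
import numpy as np

def zero_sum_house_effects(raw):
    """Center raw effects across parties so each pollster's effects sum to zero.

    raw has shape (n_pollsters, n_parties).
    """
    raw = np.asarray(raw, dtype=float)
    return raw - raw.mean(axis=1, keepdims=True)

# Placeholder values: 3 hypothetical pollsters, 5 hypothetical parties.
rng = np.random.default_rng(0)
raw = rng.normal(0.0, 0.05, size=(3, 5))
house = zero_sum_house_effects(raw)
print(house.sum(axis=1))  # each entry is zero up to floating-point error
```

Because every row sums to zero, a pollster cannot be biased upward for all parties at once, which keeps house effects separate from the overall level of support.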
District Effects
Portugal uses a proportional representation system with seat allocation at the district level. The model includes static district offsets that capture regional differences in party support.
These offsets are estimated from previous election results and allow predicting how national vote shares translate into district-level votes. The prior for district offsets is HalfNormal(σ=0.1) per party, reflecting typical regional variations of ~5 percentage points.
Likelihood
Poll observations follow a Dirichlet-Multinomial distribution, a natural choice for counts split across parties whose shares must sum to 100%.
Technical Details: Dirichlet-Multinomial
The Dirichlet-Multinomial is parameterized by:
- n: poll sample size
- α: concentration vector (proportional to each party's probabilities)
This distribution has two sources of variation: sampling error (which depends on n) and extra variation captured by the concentration parameter. This correctly models the fact that polls don't always behave like simple random samples.
Concentration priors:
- Polls: Gamma(α=2, β=0.01) — mean ~200, allows extra variation beyond sampling error
- Election results: Gamma(α=100, β=0.1) — mean ~1000, tighter fit to actual results
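The effect of the concentration parameter can be seen by simulation. In the sketch below, the Dirichlet-Multinomial is sampled in two steps (a Dirichlet draw for the shares, then a Multinomial draw for the counts), and the variance of the counts is compared with what simple random sampling would give. The vote shares, the sample size of 1000, and the use of the prior mean of 200 as a fixed concentration are assumptions for the example.

```python
import numpy as np

def sample_dirichlet_multinomial(rng, shares, n, concentration, size):
    """Draw poll counts: p ~ Dirichlet(concentration * shares), counts ~ Multinomial(n, p)."""
    alpha = concentration * np.asarray(shares, dtype=float)
    p = rng.dirichlet(alpha, size=size)
    return np.array([rng.multinomial(n, p_i) for p_i in p])

def overdispersion_factor(n, concentration):
    """Ratio of Dirichlet-Multinomial variance to plain Multinomial variance."""
    return (n + concentration) / (1.0 + concentration)

rng = np.random.default_rng(1)
shares = np.array([0.30, 0.28, 0.18, 0.24])  # hypothetical party shares
counts = sample_dirichlet_multinomial(rng, shares, n=1000, concentration=200.0, size=5000)

multinomial_var = 1000 * shares * (1 - shares)
empirical_ratio = counts.var(axis=0) / multinomial_var
print("theoretical factor:", round(overdispersion_factor(1000, 200.0), 2))
print("empirical ratios:  ", empirical_ratio.round(2))
```

A larger concentration pulls the factor toward 1, which is why the much tighter prior for election results makes the model fit actual results more closely than individual polls.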
D'Hondt Method and Seat Allocation
To convert vote forecasts into seat forecasts, we simulate the Portuguese electoral system:
- 22 electoral districts: 18 mainland districts + 2 autonomous regions + Europe + Outside Europe
- D'Hondt method: Sequential divisor system (1, 2, 3, ...) for proportional seat allocation
- 230 total seats, 4 of which are elected by the two emigration constituencies (Europe and Outside Europe)
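The D'Hondt rule itself is short enough to show in full. The sketch below awards seats one at a time to the party with the highest quotient votes / (seats won + 1). The party labels and vote counts are a generic illustration, not Portuguese data, and ties are broken by dictionary order, which is a simplification.

```python
def dhondt(votes, seats):
    """Allocate seats with the D'Hondt method.

    votes maps party name to vote count; returns party name to seats won.
    """
    allocation = {party: 0 for party in votes}
    for _ in range(seats):
        # The next seat goes to the party with the largest current quotient.
        winner = max(votes, key=lambda party: votes[party] / (allocation[party] + 1))
        allocation[winner] += 1
    return allocation

example_votes = {"A": 100_000, "B": 80_000, "C": 30_000, "D": 20_000}
result = dhondt(example_votes, seats=8)
print(result)  # {'A': 4, 'B': 3, 'C': 1, 'D': 0}
```

In this example party D receives no seat despite winning almost 9% of the vote, which shows how district size affects smaller parties under this method.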
The model runs thousands of Monte Carlo simulations:
- For each simulation, draw a sample from the posterior distribution of national vote shares
- Apply district offsets to obtain district-level votes
- Run the D'Hondt algorithm in each district
- Sum each party's seats
The result is a complete distribution of possible parliamentary compositions.
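The simulation loop described above can be sketched as follows. Everything numerical here is a placeholder: the "posterior" draws come from a Dirichlet distribution rather than a fitted model, there are 5 hypothetical districts instead of 22, and the district offsets are assumed to act additively on the log scale before shares are renormalized, which may differ from how the model applies them.

```python
import numpy as np

def dhondt(votes, seats):
    """D'Hondt allocation for an array of vote shares; returns seats per party."""
    won = np.zeros(len(votes), dtype=int)
    for _ in range(seats):
        won[np.argmax(votes / (won + 1))] += 1
    return won

def softmax(x):
    """Renormalize log-scale values into shares that sum to one along the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def simulate_seats(national_draws, district_offsets, district_seats):
    """national_draws: (S, P) shares; district_offsets: (D, P); district_seats: (D,)."""
    n_draws, n_parties = national_draws.shape
    totals = np.zeros((n_draws, n_parties), dtype=int)
    for s in range(n_draws):
        # Assumed offset mechanism: shift log shares per district, then renormalize.
        district_shares = softmax(np.log(national_draws[s])[None, :] + district_offsets)
        for d, seats in enumerate(district_seats):
            totals[s] += dhondt(district_shares[d], seats)
    return totals

rng = np.random.default_rng(42)
national_draws = rng.dirichlet([300, 280, 180, 240], size=200)  # placeholder posterior
district_seats = np.array([48, 40, 19, 10, 3])                   # hypothetical districts
district_offsets = rng.normal(0.0, 0.1, size=(len(district_seats), 4))
seat_draws = simulate_seats(national_draws, district_offsets, district_seats)
print(seat_draws.mean(axis=0).round(1))
```

Each row of `seat_draws` is one simulated parliament, so any summary of interest can be computed by counting rows.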
Metrics of Interest
From the simulations, we calculate:
- Probability of each party winning the most seats
- Majority probabilities: right-wing, left-wing, or hung parliament
- Expected seats: mean and credible intervals (50% and 80%)
- Contested districts: using ENSC (Effective Number of Seat Changes) with a threshold of 0.8
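Most of these metrics reduce to counting simulations. The sketch below computes the probability of winning the most seats, bloc majority probabilities, and seat intervals from an array of simulated seat totals. The function names, the tiny hand-made input, the bloc membership, the equal splitting of ties, and the majority threshold of 116 (more than half of 230) are assumptions for illustration. ENSC is not covered because its definition is not given here.

```python
import numpy as np

def prob_most_seats(seat_draws):
    """Share of simulations in which each party has the most seats (ties split equally)."""
    seat_draws = np.asarray(seat_draws)
    is_top = seat_draws == seat_draws.max(axis=1, keepdims=True)
    return (is_top / is_top.sum(axis=1, keepdims=True)).mean(axis=0)

def bloc_majority_prob(seat_draws, bloc, majority=116):
    """Share of simulations in which the listed party columns jointly reach a majority."""
    return float((np.asarray(seat_draws)[:, bloc].sum(axis=1) >= majority).mean())

def seat_interval(seat_draws, party, mass):
    """Central interval containing the given probability mass for one party."""
    tail = (1.0 - mass) / 2.0 * 100.0
    low, high = np.percentile(np.asarray(seat_draws)[:, party], [tail, 100.0 - tail])
    return float(low), float(high)

# Four hand-made simulations for three hypothetical parties (rows sum to 230).
seat_draws = np.array([[120, 80, 30], [100, 110, 20], [110, 100, 20], [90, 116, 24]])

p_most = prob_most_seats(seat_draws)
p_bloc_a = bloc_majority_prob(seat_draws, [0, 2])  # hypothetical bloc: parties 0 and 2
p_bloc_b = bloc_majority_prob(seat_draws, [1])     # hypothetical bloc: party 1 alone
print(p_most, p_bloc_a, p_bloc_b, seat_interval(seat_draws, 0, 0.80))
```

The probability of a hung parliament is then the share of simulations in which no bloc reaches the threshold.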
Presidential Elections
Model Structure
The presidential model differs from the parliamentary one because:
- Candidates are individuals, not parties with long track records
- The system uses two rounds (absolute majority required)
- Campaign dynamics are more important
We use a random walk model in log-odds space to capture the evolution of candidate support.
Technical Details: Random Walk
The random walk is defined as:
latent[t] = latent[t-1] + innovation[t]
where innovation[t] ~ ZeroSumNormal(σ) ensures that one candidate's gains correspond to others' losses.
The innovation standard deviation (σ ≈ 0.05 log-odds/day) is calibrated to allow movements of ~5 percentage points over a 50-day campaign. This value is fixed (not learned) following The Economist's methodology, since with few polls it's not possible to reliably estimate volatility.
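The calibration can be checked with a short calculation and a simulation. Over 50 days, independent daily steps with standard deviation 0.05 accumulate to 0.05 × √50 ≈ 0.35 on the latent scale. Under a simple logistic approximation, a latent shift of that size moves a share p by about p(1 - p) × 0.35, which is roughly 5 to 6 percentage points for a candidate near 20%. In the sketch below, zero-sum innovations are emulated by centering independent normal draws, and latent values are mapped to shares with a softmax. Both choices, and the starting shares, are assumptions for illustration.

```python
import numpy as np

SIGMA_PER_DAY = 0.05  # innovation scale quoted in the text
CAMPAIGN_DAYS = 50

def softmax(x):
    """Map latent values to shares that sum to one along the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def simulate_zero_sum_walk(rng, start, n_days, sigma):
    """Random walk whose daily innovations sum to zero across candidates."""
    raw = rng.normal(0.0, sigma, size=(n_days, len(start)))
    innovations = raw - raw.mean(axis=1, keepdims=True)  # centering emulates the zero-sum step
    path = np.vstack([start, start + np.cumsum(innovations, axis=0)])
    return path, innovations

cumulative_sd = SIGMA_PER_DAY * np.sqrt(CAMPAIGN_DAYS)
approx_shift_at_20pct = 0.2 * 0.8 * cumulative_sd  # logistic approximation, in share units

rng = np.random.default_rng(7)
start = np.log(np.array([0.35, 0.30, 0.20, 0.15]))  # hypothetical starting shares
path, innovations = simulate_zero_sum_walk(rng, start, CAMPAIGN_DAYS, SIGMA_PER_DAY)
shares = softmax(path)
print(f"latent sd after {CAMPAIGN_DAYS} days: {cumulative_sd:.3f}")
print("final shares:", shares[-1].round(3))
```

Centering slightly reduces each candidate's innovation variance relative to the unconstrained value, so the arithmetic above is an upper-end approximation rather than an exact figure.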
Candidate priors:
- Baseline: Normal(μ=logit(party_prior), σ=transformed), centered on the historical support of the candidate's party
- Likelihood concentration: Fixed at 60 for presidential polls
Candidate-Specific Priors
For candidates with known party affiliation, we use informative priors based on those parties' historical support. This helps stabilize estimates when few polls are available.
Pollster House Effects
When available, we use house effects estimated from the parliamentary model as informative priors for the presidential model. This transfers information about each pollster's systematic biases.
Second Round Modeling
For each simulation:
- We check whether any candidate wins more than 50% of the vote in the first round
- If not, we identify the top two candidates
- We record which pair would face off in the runoff; the share of simulations in which each pair appears gives that pairing's probability
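These steps amount to a tally over simulated first-round results. The sketch below counts outright first-round wins and runoff pairings, then converts the counts to probabilities. The function name, candidate labels, and the four hand-made simulations are hypothetical.

```python
from collections import Counter

import numpy as np

def runoff_summary(first_round_draws, names):
    """Return (first-round win probabilities, runoff pairing probabilities)."""
    draws = np.asarray(first_round_draws, dtype=float)
    outright, pairs = Counter(), Counter()
    for shares in draws:
        order = np.argsort(shares)[::-1]  # candidates from highest to lowest share
        if shares[order[0]] > 0.5:
            outright[names[order[0]]] += 1
        else:
            # Sort the two names so (A, B) and (B, A) count as the same pairing.
            pairs[tuple(sorted((names[order[0]], names[order[1]])))] += 1
    n = len(draws)
    return ({k: v / n for k, v in outright.items()},
            {k: v / n for k, v in pairs.items()})

names = ["A", "B", "C"]
draws = [[0.55, 0.30, 0.15],
         [0.40, 0.35, 0.25],
         [0.30, 0.45, 0.25],
         [0.20, 0.38, 0.42]]
win_probs, pair_probs = runoff_summary(draws, names)
print(win_probs)   # {'A': 0.25}
print(pair_probs)  # {('A', 'B'): 0.5, ('B', 'C'): 0.25}
```

The first-round win probabilities and the pairing probabilities together sum to one, since every simulation ends in exactly one of those outcomes.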
Uncertainty Quantification
Our models incorporate multiple sources of uncertainty:
- Polling error — Expected variation between polls and actual results
- House effects — Uncertainty in estimated pollster biases
- Model uncertainty — GP parameters and other model components
- Campaign effects — Potential for late shifts not captured in polls
Communicating Uncertainty
We present uncertainty through:
- Credible intervals: 50% and 80% bands in trend charts
- Probabilities: "70% probability of X winning the most seats"
- Simulation distributions: visualizations showing the full range of possible outcomes
Model Validation
We validate our models through:
- Backtesting: Apply the model to past elections using only data available before each election
- Calibration: Verify that credible intervals contain actual results at the expected frequency (for example, 80% intervals should contain about 80% of results)
- Error metrics: Calculate MAE, RMSE, and log-scores
- Post-election comparison: Analyze performance after each election
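The error and calibration metrics are simple to compute once forecasts and results are lined up. The sketch below shows one common definition of each. The numbers are made up, and the log score is taken here as the mean log probability assigned to the outcomes that actually occurred, which is an assumption about the exact variant used.

```python
import numpy as np

def mae(forecast, actual):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(forecast) - np.asarray(actual))))

def rmse(forecast, actual):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(forecast) - np.asarray(actual)) ** 2)))

def interval_coverage(lower, upper, actual):
    """Fraction of actual results that fall inside their forecast interval."""
    lower, upper, actual = map(np.asarray, (lower, upper, actual))
    return float(np.mean((actual >= lower) & (actual <= upper)))

def mean_log_score(prob_of_observed):
    """Mean log probability assigned to the observed outcomes (higher is better)."""
    return float(np.mean(np.log(np.asarray(prob_of_observed, dtype=float))))

# Made-up vote-share forecasts and results for four hypothetical parties.
forecast = [0.30, 0.20, 0.28, 0.10]
actual = [0.28, 0.23, 0.29, 0.08]
lower80 = [0.27, 0.17, 0.25, 0.085]
upper80 = [0.33, 0.22, 0.31, 0.12]

print(mae(forecast, actual), rmse(forecast, actual))
print(interval_coverage(lower80, upper80, actual))
print(mean_log_score([0.7, 0.4, 0.9]))
```

With only a handful of elections, coverage estimates are noisy, so calibration is best judged across many parties, districts, and backtested elections together.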
Data Sources
Polls
- Aximage, CESOP, Pitagórica, Intercampus, Eurosondagem, and other ERC-registered polling firms
Election Results
- National Elections Commission (CNE)
- SGMAI (General Secretariat of the Ministry of Internal Administration)
Demographic Data
- National Statistics Institute (INE)
Limitations
Our model has important limitations:
- Poll dependence: If polls have systematic biases not captured by historical data, forecasts will be affected
- Unforeseen events: Scandals, crises, or other events can rapidly alter the electoral landscape
- Historical patterns: We assume past patterns inform the future, which may not hold in unprecedented contexts
- Undecided voters: It is difficult to predict how undecided voters will ultimately behave
Our forecasts should be interpreted as informed probabilistic estimates, not certainties.
References and Inspiration
Our methodology is inspired by: