Time Series Forecasting - Predicting Tomorrow From the Shape of Yesterday
Helpful context:
- Ensembles - Weak Learners That Combine Into Something Stronger
- Cross-Validation - Testing Your Model on Data It Has Never Seen
Most machine learning problems assume that the order of your data does not matter - you shuffle your training examples, and nothing changes. Time series problems violate this assumption fundamentally. The order is the signal. What happened at 2pm tells you something about what will happen at 3pm. Shuffling destroys that information.
This is why time series forecasting is its own subfield rather than a special case of regression. The temporal structure requires different models, different evaluation strategies, and different thinking about what features are even valid to use.
Decomposing a Time Series
Before choosing a model, understand the structure of your series. Most time series can be decomposed into three components:
Trend: the long-term direction. Sales of a growing product trend upward. A company closing stores trends downward. The trend changes slowly.
Seasonality: regular, repeating patterns at a fixed period. Traffic spikes every weekday morning and drops on weekends. Ice cream sales peak every summer. The period is known (weekly, yearly, daily) and the pattern repeats.
Residual: what is left after removing trend and seasonality. Genuine random variation, one-time events, and any structure the other two components did not capture.
STL decomposition (Seasonal and Trend decomposition using Loess) extracts these three components from a regularly sampled series with a known seasonal period, using locally weighted regression:
from statsmodels.tsa.seasonal import STL
# series: a regularly spaced pandas Series (here, one value per month)
stl = STL(series, period=12) # monthly data with yearly seasonality
result = stl.fit()
trend = result.trend
seasonal = result.seasonal
residual = result.resid
Looking at each component tells you what kind of model you need. A strong trend suggests you need a model with trend components. Strong seasonality suggests you need a seasonal model or seasonal features. If the residual looks like structured signal rather than noise, there is still something left to model.
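To see what a decomposition does mechanically, here is a minimal sketch of the simpler classical additive decomposition in plain Python. It is not STL: it estimates the trend with a centered moving average instead of Loess and assumes one fixed seasonal pattern. The function name classical_decompose and the synthetic series are illustrative assumptions.

```python
import math


def classical_decompose(y, period):
    """Additive decomposition y = trend + seasonal + residual.

    Simplified classical method (moving average), not STL.
    The trend is undefined (None) for the first and last period // 2 points.
    """
    n = len(y)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window holds period + 1 points, so the two
            # end points get half weight each (a "2 x period" moving average).
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Seasonal pattern: average detrended value at each position in the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for t in range(n):
        if trend[t] is not None:
            sums[t % period] += y[t] - trend[t]
            counts[t % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    offset = sum(means) / period  # center the pattern so it averages to zero
    seasonal = [means[t % period] - offset for t in range(n)]

    residual = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Synthetic monthly series: linear trend plus a yearly sine wave, no noise.
y = [0.5 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(48)]
trend, seasonal, residual = classical_decompose(y, period=12)
```

Because the synthetic series contains no noise, the residual here is zero up to floating-point error. On real data, the residual is where you look for leftover structure.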
ARIMA: The Classical Approach
ARIMA (AutoRegressive Integrated Moving Average) is the foundation of classical time series forecasting. Understanding its three components reveals the structure of the problem:
Autoregressive (AR, order $p$): the current value is a linear combination of the $p$ most recent values plus an error term: $$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$
This captures the persistence of the series - today’s value is correlated with yesterday’s.
Integrated (I, order $d$): differencing the series $d$ times to make it stationary. A stationary series has constant mean and variance over time - most ARIMA theory requires stationarity. If a series has a trend (non-constant mean), one round of differencing ($y'_t = y_t - y_{t-1}$) removes it.
Moving Average (MA, order $q$): the current value depends on the past $q$ forecast errors: $$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$
This captures the “shock persistence” - how long a single unexpected event continues to affect the series.
Together, ARIMA($p$, $d$, $q$) captures autocorrelation from past values (AR), makes the series stationary (I), and models remaining correlation in the residuals (MA).
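The way the pieces fit together can be sketched by hand for the simplest non-trivial case, ARIMA(1, 1, 0): difference once, fit an AR(1) to the differences, forecast the differences, then add them back up. This is an illustration under simplifying assumptions: it estimates the coefficients by ordinary least squares rather than the maximum likelihood that real libraries typically use, and it leaves out the MA term ($q = 0$). The names fit_ar1 and forecast_arima_110 are made up for this example.

```python
def fit_ar1(x):
    """Least-squares estimate of (c, phi) in x_t = c + phi * x_{t-1} + e_t."""
    prev, curr = x[:-1], x[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_arima_110(y, steps):
    """Hand-rolled ARIMA(1, 1, 0): difference, fit AR(1), integrate back."""
    diffs = [b - a for a, b in zip(y, y[1:])]  # I: d = 1
    c, phi = fit_ar1(diffs)                    # AR: p = 1
    level, diff = y[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        diff = c + phi * diff   # predict the next difference
        level = level + diff    # undo the differencing
        forecasts.append(level)
    return forecasts


# Noise-free example: the differences follow x_t = 1 + 0.5 * x_{t-1} exactly.
diffs = [4.0]
for _ in range(10):
    diffs.append(1 + 0.5 * diffs[-1])
y = [10.0]
for d in diffs:
    y.append(y[-1] + d)

c, phi = fit_ar1(diffs)
forecasts = forecast_arima_110(y, steps=3)
```

On this noise-free series the fit recovers $c = 1$ and $\phi_1 = 0.5$ up to floating-point error. With real data the estimates are noisy, and the forecast error grows with the horizon because each predicted difference feeds the next.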
Choosing $p$ and $q$: the Autocorrelation Function (ACF) plot shows correlation between the series and its lagged versions. The Partial Autocorrelation Function (PACF) shows correlation at each lag after removing the effect of shorter lags. For an AR($p$) process, PACF cuts off at lag $p$. For an MA($q$) process, ACF cuts off at lag $q$. In practice, auto_arima from pmdarima searches over parameter combinations automatically using AIC or BIC.
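What the two plots compute can be shown in a few lines. The sketch below calculates the sample ACF directly and derives the PACF from a list of autocorrelations with the Durbin-Levinson recursion. The function names are illustrative; in practice you would use a library's plotting helpers rather than this code.

```python
def acf(y, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations at lags 1 .. len(r) - 1 (Durbin-Levinson).

    r is a list of autocorrelations with r[0] == 1.
    """
    pacf = []
    prev = []  # prev[j - 1] holds phi_{k-1, j}
    for k in range(1, len(r)):
        if k == 1:
            phi_kk = r[1]
        else:
            num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
            den = 1 - sum(prev[j - 1] * r[j] for j in range(1, k))
            phi_kk = num / den
        # phi_{k, j} = phi_{k-1, j} - phi_{k, k} * phi_{k-1, k-j}
        curr = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
        curr.append(phi_kk)
        pacf.append(phi_kk)
        prev = curr
    return pacf


# Theoretical ACF of an AR(1) process with phi = 0.7: r_k = 0.7 ** k.
r_ar1 = [0.7 ** k for k in range(5)]
pacf_ar1 = pacf_from_acf(r_ar1)

# Sample ACF of a short toy series.
r_toy = acf([1, 2, 3, 4, 5], max_lag=1)
```

For the AR(1) autocorrelations, the ACF decays geometrically while the PACF is 0.7 at lag 1 and essentially zero afterwards, which is the "cuts off at lag $p$" signature.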
SARIMA extends ARIMA with seasonal components: SARIMA($p$, $d$, $q$)($P$, $D$, $Q$, $s$) adds seasonal AR, differencing, and MA terms at period $s$ (e.g., $s = 12$ for monthly data with yearly seasonality).
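The seasonal differencing term ($D = 1$) subtracts the value one full period earlier instead of the previous value. A minimal sketch, with a made-up series and an illustrative function name:

```python
def seasonal_difference(y, s):
    """Return y_t - y_{t-s} for t = s .. len(y) - 1."""
    return [y[t] - y[t - s] for t in range(s, len(y))]


# Toy series with period 3: upward trend of 1 per step plus a repeating pattern.
pattern = [5, 0, -5]
y = [t + pattern[t % 3] for t in range(9)]

seasonal_diff = seasonal_difference(y, s=3)
```

The repeating pattern cancels and only the trend's contribution over one period remains (a constant 3 here). An ordinary difference ($d = 1$) on top of this would remove that constant as well.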
Pros of ARIMA:
- Interpretable: the coefficients directly describe how past values influence predictions.
- Well-understood statistically: prediction intervals follow from an explicit probabilistic model rather than from heuristics.
- Works well on short, simple series.
Cons of ARIMA:
- Assumes linearity. Cannot model complex nonlinear seasonal patterns.
- One model per series. With 10,000 time series (one per store, one per SKU), you train 10,000 separate models.
- Requires stationarity testing and transformation, which adds steps.
- Plain ARIMA has no place for external variables (covariates) like promotions, holidays, weather. Extensions such as ARIMAX/SARIMAX add them, but only as linear regression terms.
Exponential Smoothing (Holt-Winters)
Exponential smoothing weights recent observations more heavily than older ones. The weight decays exponentially with time.
Simple exponential smoothing: $\hat{y}_{t+1} = \alpha y_t + (1-\alpha) \hat{y}_t$. The parameter $\alpha \in (0,1)$ controls how quickly the model forgets the past. $\alpha = 0.9$ responds quickly to changes; $\alpha = 0.1$ produces a very smooth forecast.
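The recursion is short enough to write out directly. This sketch assumes the common convention of initializing the first forecast with the first observation; the function name is illustrative.

```python
def simple_exp_smoothing(y, alpha):
    """One-step-ahead forecasts from simple exponential smoothing.

    forecasts[t] is the forecast for y[t]; the final element is the
    forecast for the first value after the end of the series.
    """
    forecasts = [y[0]]  # assumption: start from the first observation
    for obs in y:
        forecasts.append(alpha * obs + (1 - alpha) * forecasts[-1])
    return forecasts


y = [10.0, 10.0, 20.0, 20.0]  # level shift from 10 to 20
smooth = simple_exp_smoothing(y, alpha=0.5)
reactive = simple_exp_smoothing(y, alpha=0.9)
```

After the jump from 10 to 20, the $\alpha = 0.5$ forecast moves halfway toward each new observation, while $\alpha = 0.9$ closes most of the gap in a single step.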
Holt’s method (double exponential smoothing) adds a trend component, allowing the model to extrapolate a linear trend forward.
Holt-Winters (triple exponential smoothing) adds a seasonal component. Both additive ($y_t = \text{trend} + \text{seasonal} + \text{error}$) and multiplicative ($y_t = \text{trend} \times \text{seasonal} \times \text{error}$) seasonality are supported. Multiplicative is appropriate when seasonal swings scale with the level (summer sales are 30% above the trend, regardless of whether the trend is 100 or 1000).
Pros: fast, simple, no stationarity requirements, handles trend and seasonality well.
Cons: only one seasonal period (cannot model both weekly and yearly seasonality simultaneously), limited to exponential decay structure.
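A minimal sketch of additive Holt-Winters shows how the level, trend, and seasonal states update together. It uses one common formulation of the update equations and a simple initialization from the first two periods; libraries differ in both choices and also fit $\alpha$, $\beta$, $\gamma$ automatically, which this code does not. The function name and parameter values are assumptions for illustration.

```python
def holt_winters_additive(y, period, alpha, beta, gamma, steps):
    """Additive Holt-Winters forecast; needs at least two full periods of data."""
    m = period
    first_mean = sum(y[:m]) / m
    second_mean = sum(y[m : 2 * m]) / m

    # Simple initialization (an assumption, not the only option).
    level = first_mean
    trend = (second_mean - first_mean) / m
    seasonal = [y[i] - first_mean for i in range(m)]

    for t in range(m, len(y)):
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (y[t] - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y[t] - level) + (1 - gamma) * s

    n = len(y)
    return [
        level + h * trend + seasonal[(n + h - 1) % m]
        for h in range(1, steps + 1)
    ]


# Perfectly periodic series with no trend: the forecast should repeat the cycle.
y = [10.0, 20.0, 30.0] * 4
forecast = holt_winters_additive(
    y, period=3, alpha=0.3, beta=0.1, gamma=0.2, steps=3
)
```

The multiplicative variant replaces the subtractions of the seasonal term with divisions and the additions with multiplications.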
Prophet: Practical Forecasting at Scale
Facebook’s Prophet (Taylor and Letham, 2018) was designed for the problems that ARIMA and Holt-Winters handle poorly in practice: multiple seasonalities, irregular holidays, missing data, and sudden trend changes (changepoints).
Prophet models the time series as: $$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$
where $g(t)$ is the trend (piecewise linear or logistic growth with automatic changepoint detection), $s(t)$ is seasonality (Fourier series allowing multiple periods simultaneously), and $h(t)$ is holiday/event effects (user-specified special dates with their own effect parameters).
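The model is fit in Stan. By default this is a MAP (maximum a posteriori) point estimate, where priors on the changepoint, seasonality, and holiday coefficients act as regularization and make the fit robust. Forecast intervals come from simulating future trend changes plus observation noise; full MCMC sampling is optional and also captures uncertainty in the seasonal terms.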
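The Fourier-series representation of $s(t)$ is what lets several seasonal periods coexist: each period contributes its own block of sine and cosine columns, and the fitted seasonality is a weighted sum of them. The sketch below builds those columns in plain Python. It illustrates the idea rather than Prophet's internals, and the function names and Fourier orders are chosen for this example.

```python
import math


def fourier_features(t, period, order):
    """Sine/cosine pairs for harmonics 1 .. order of the given period."""
    feats = []
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * t / period
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats


def seasonal_component(t, period, coeffs):
    """s(t) as a weighted sum of Fourier features; len(coeffs) == 2 * order."""
    order = len(coeffs) // 2
    feats = fourier_features(t, period, order)
    return sum(c * f for c, f in zip(coeffs, feats))


# Day index t (in days) with weekly and yearly blocks side by side.
t = 100
weekly = fourier_features(t, period=7, order=3)
yearly = fourier_features(t, period=365.25, order=10)
row = weekly + yearly  # one row of the seasonal design matrix
```

A higher order lets the seasonal curve change faster within a period, at the cost of more coefficients and a greater risk of overfitting.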
Pros:
- Handles multiple seasonalities (daily, weekly, yearly) simultaneously.
- Robust to missing data - no imputation needed.
- Automatic changepoint detection when the trend shifts.
- Holiday effects can be specified by the user (Diwali, Black Friday, etc.).
- Easy to tune for domain experts who understand trend/seasonality/holiday concepts.
Cons:
- Assumes additive structure. Multiplicative seasonality is supported but less clean.
- Not always the most accurate on complex series - it is optimized for explainability and robustness, not raw RMSE.
- Does not handle covariates as naturally as ML-based approaches.
ML on Time Series: Feature Engineering Is Everything
The most powerful and flexible approach for complex time series - especially when you have many related series or external covariates - is to frame the problem as supervised regression and use LightGBM or XGBoost.
The key is feature engineering. You transform the time series into a standard feature matrix where each row represents one prediction target (one time step), and the columns are features derived from the history.
Lag features: the value at time $t-1$, $t-2$, …, $t-k$ for some lookback window $k$. These are the most important features for most time series.
Rolling statistics: mean, standard deviation, min, max over a rolling window (e.g., the mean of the last 7 days, the standard deviation of the last 30 days). These capture local level and volatility.
Date and calendar features: hour of day, day of week, week of year, month, quarter, is_weekend, is_holiday. These capture deterministic seasonality directly.
Target lags of related series: if you are forecasting sales for product A, the sales of product B (a substitute or complement) in the previous period may be informative.
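def make_features(df, lags=(1, 7, 14, 28)):
    # Expects a single series sorted by date, with a datetime "date" column
    # and a numeric "sales" column.
    df = df.copy()
    for lag in lags:
        df[f"lag_{lag}"] = df["sales"].shift(lag)
    # shift(1) first, so each rolling window ends at t-1 and excludes the target
    df["rolling_mean_7"] = df["sales"].shift(1).rolling(7).mean()
    df["rolling_std_7"] = df["sales"].shift(1).rolling(7).std()
    df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    df["month"] = df["date"].dt.month
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    # rows without a full lag history contain NaN and are dropped
    return df.dropna()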
The .shift(1) is critical: when predicting $y_t$, every feature derived from the target must use only values from time $t-1$ or earlier. Calendar features are safe because they are known in advance. Using same-period target information is data leakage - the model appears to work in validation but fails in production because that information is unavailable when you actually need to predict.
Pros of ML approach:
- Can capture complex nonlinear interactions.
- Naturally incorporates covariates (price, promotions, weather).
- One model can handle thousands of time series if you stack them (add a series ID as a categorical feature).
- Benefits from all the XGBoost/LightGBM machinery: regularization, handling missing values, fast training.
Cons:
- Requires careful walk-forward validation (see below).
- Feature engineering requires domain knowledge.
- Does not extrapolate trends as naturally as ARIMA.
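- Multi-step forecasts compound error: beyond one step ahead, lag features must be filled with the model's own earlier predictions, so mistakes at short horizons carry forward to longer ones.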
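The last point is easiest to see in code. In recursive multi-step forecasting, each prediction is appended to the history and then read back as a lag feature for later steps. In the sketch below, predict_one is a hypothetical stand-in for a trained one-step model (for example, a wrapper around a fitted gradient-boosting regressor), and the function name is illustrative.

```python
def recursive_forecast(history, predict_one, lags, steps):
    """Forecast several steps ahead by feeding predictions back in as lags.

    history     : observed values, oldest first
    predict_one : callable mapping a list of lag values to one prediction
    lags        : lag offsets, e.g. [1, 2] means y[t-1] and y[t-2]
    """
    buffer = list(history)
    forecasts = []
    for _ in range(steps):
        features = [buffer[-lag] for lag in lags]
        y_hat = predict_one(features)
        forecasts.append(y_hat)
        buffer.append(y_hat)  # from now on this prediction is treated as data
    return forecasts


def toy_model(features):
    # Hypothetical "model": predicts the previous value plus one.
    return features[0] + 1


forecasts = recursive_forecast([1, 2, 3], toy_model, lags=[1, 2], steps=3)
```

From the second step onward, the lag-1 feature is a prediction rather than an observation, which is why errors accumulate. A common alternative is to train a separate model per horizon, each using only lags that are actually observed at forecast time.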
Walk-Forward Validation: Never Peek at the Future
Standard k-fold cross-validation ignores time order, and the shuffled variant mixes it completely. For time series, this is wrong and misleading - training on future data to predict the past inflates metrics and gives false confidence.
Walk-forward (expanding window) validation: train on all data up to time $T$, validate on $T+1$ through $T+h$. Then train on all data up to $T+h$, validate on $T+h+1$ through $T+2h$. Repeat until the validation window reaches the end of the series. Only the initial training window is never used for validation.
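from sklearn.model_selection import TimeSeriesSplit
# X and y are NumPy arrays with rows in time order (use .iloc[...] for pandas objects)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # every index in val_idx is later than every index in train_idx
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train and evaluate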
The validation split always uses later data than the training data. This mirrors the actual deployment condition: you always predict the future from the past.
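If you want to control the initial training size and the horizon $h$ explicitly, the expanding-window scheme is a few lines of plain Python. This is a sketch with an illustrative function name, not a replacement for the scikit-learn splitter.

```python
def walk_forward_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, val_indices) for expanding-window validation.

    The training window always starts at 0 and grows by `horizon` each fold;
    the validation window is the next `horizon` points.
    """
    train_end = initial_train
    while train_end + horizon <= n_samples:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += horizon


splits = list(walk_forward_splits(n_samples=10, initial_train=4, horizon=2))
for train_idx, val_idx in splits:
    # In a real run: fit on rows train_idx, score on rows val_idx.
    assert max(train_idx) < min(val_idx)
```

With 10 samples, an initial window of 4, and a horizon of 2, this yields three folds: train on 0-3 and validate on 4-5, train on 0-5 and validate on 6-7, train on 0-7 and validate on 8-9.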
When to Use Which
| Method | Use when |
|---|---|
| ARIMA/SARIMA | Single series, stationary or easily made so, interpretability required |
| Holt-Winters | Clear trend + single seasonality, simple to tune |
| Prophet | Multiple seasonalities, holidays important, missing data, business user wants to tune |
| LightGBM with lag features | Many related series, covariates available, high accuracy is the goal |
| LSTM / Transformer | Very long dependencies, complex temporal patterns, large data (millions of timesteps) |
Gradient-boosted trees on lag features (LightGBM in particular) have been behind many winning Kaggle forecasting entries, including the top M5 solutions, and tend to perform best on commercial forecasting tasks with many related series and enough history. Classical methods win when the series is short, the pattern is simple, or you need model-based prediction intervals. Prophet wins in business settings where the model needs to be maintained by domain experts who are not ML engineers.
| Concept | Key point |
|---|---|
| STL decomposition | Separates trend, seasonality, residual; reveals what kind of model is needed |
| Stationarity | Required by ARIMA; achieved by differencing |
| ACF / PACF | Diagnostic plots for choosing AR ($p$) and MA ($q$) orders |
| SARIMA | ARIMA with seasonal components; requires specifying the period |
| Prophet | Piecewise trend + Fourier seasonality + holidays; robust and tunable |
| Lag features | Core of ML approach; always shift by 1+ to avoid leakage |
| Walk-forward validation | CV that respects time order; train on past, validate on future |
| Multiplicative seasonality | Use when seasonal swing scales with level (percentages, not absolutes) |