Quickstart
This notebook contains a minimal example of using fev to evaluate time series forecasting models.
In [1]:
import fev
In [2]:
# Create a task from a dataset stored on Hugging Face Hub
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="ercot",
    horizon=24,
    num_windows=2,
)
In [3]:
# A task consists of multiple rolling evaluation windows
for window in task.iter_windows():
    print(window)
EvaluationWindow(cutoff=-48, horizon=24, min_context_length=1, max_context_length=None, id_column='id', timestamp_column='timestamp', target_columns=['target'], known_dynamic_columns=[], past_dynamic_columns=[], static_columns=[])
EvaluationWindow(cutoff=-24, horizon=24, min_context_length=1, max_context_length=None, id_column='id', timestamp_column='timestamp', target_columns=['target'], known_dynamic_columns=[], past_dynamic_columns=[], static_columns=[])
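The cutoffs printed above follow from the task parameters: with `horizon=24` and `num_windows=2`, the first window is cut 48 steps before the end of each series and the second one 24 steps before the end. A minimal sketch of that arithmetic, assuming the step between windows defaults to the horizon (consistent with the output here; `rolling_cutoffs` is a hypothetical helper and not part of fev):

```python
from typing import Optional


def rolling_cutoffs(horizon: int, num_windows: int, window_step_size: Optional[int] = None) -> list:
    """Return negative cutoffs (offsets from the series end), oldest window first."""
    # Assumption: the step between windows defaults to the forecast horizon
    step = horizon if window_step_size is None else window_step_size
    # The last window must leave exactly `horizon` steps for evaluation
    last_cutoff = -horizon
    first_cutoff = last_cutoff - (num_windows - 1) * step
    return [first_cutoff + i * step for i in range(num_windows)]


print(rolling_cutoffs(horizon=24, num_windows=2))  # [-48, -24]
```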
In [4]:
# Load data available as input to the forecasting model
past_data, future_data = task.get_window(0).get_input_data()
In [5]:
# Historical observations available before the forecast horizon
past_data
Out[5]:
Dataset({
    features: ['id', 'timestamp', 'target'],
    num_rows: 8
})
In [6]:
past_data[0]
Out[6]:
{'id': np.str_('COAST'),
 'timestamp': array(['2004-01-01T01:00:00.000000000', '2004-01-01T02:00:00.000000000',
        '2004-01-01T03:00:00.000000000', ...,
        '2021-08-29T22:00:00.000000000', '2021-08-29T23:00:00.000000000',
        '2021-08-30T00:00:00.000000000'], dtype='datetime64[ns]'),
 'target': array([ 7225.09,  6994.25,  6717.42, ..., 17114.34, 16091.05, 15081.16],
       dtype=float32)}
In [7]:
# future data that is known at prediction time (item ID, future timestamps, static and known covariates)
future_data
Out[7]:
Dataset({
    features: ['id', 'timestamp'],
    num_rows: 8
})
In [8]:
future_data[0]
Out[8]:
{'id': np.str_('COAST'),
 'timestamp': array(['2021-08-30T01:00:00.000000000', '2021-08-30T02:00:00.000000000',
        '2021-08-30T03:00:00.000000000', '2021-08-30T04:00:00.000000000',
        '2021-08-30T05:00:00.000000000', '2021-08-30T06:00:00.000000000',
        '2021-08-30T07:00:00.000000000', '2021-08-30T08:00:00.000000000',
        '2021-08-30T09:00:00.000000000', '2021-08-30T10:00:00.000000000',
        '2021-08-30T11:00:00.000000000', '2021-08-30T12:00:00.000000000',
        '2021-08-30T13:00:00.000000000', '2021-08-30T14:00:00.000000000',
        '2021-08-30T15:00:00.000000000', '2021-08-30T16:00:00.000000000',
        '2021-08-30T17:00:00.000000000', '2021-08-30T18:00:00.000000000',
        '2021-08-30T19:00:00.000000000', '2021-08-30T20:00:00.000000000',
        '2021-08-30T21:00:00.000000000', '2021-08-30T22:00:00.000000000',
        '2021-08-30T23:00:00.000000000', '2021-08-31T00:00:00.000000000'],
       dtype='datetime64[ns]')}
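Conceptually, each window splits every series at its cutoff: everything before the cutoff becomes `past_data`, and the next `horizon` steps are held out as ground truth. A rough sketch of that split on a plain NumPy array (`split_at_cutoff` is a hypothetical helper for illustration; fev performs the split internally on the dataset):

```python
import numpy as np


def split_at_cutoff(y: np.ndarray, cutoff: int, horizon: int) -> tuple:
    """Split a series into context and held-out values at a negative cutoff."""
    past = y[:cutoff]
    end = cutoff + horizon
    # A slice stop of 0 would yield an empty array, so use None
    # when the window reaches the end of the series
    stop = end if end < 0 else None
    future = y[cutoff:stop]
    return past, future


y = np.arange(100)
past, future = split_at_cutoff(y, cutoff=-48, horizon=24)
print(len(past), len(future))  # 52 24
```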
In [9]:
import numpy as np


def naive_forecast(y: np.ndarray, horizon: int) -> dict[str, list]:
    # Predict the last finite observation for every step of the horizon (single time series)
    return {"predictions": [y[np.isfinite(y)][-1] for _ in range(horizon)]}


# Generate predictions for every series in every evaluation window
predictions_per_window = []
for window in task.iter_windows():
    past_data, future_data = window.get_input_data()
    predictions = [
        naive_forecast(ts[task.target], task.horizon) for ts in past_data
    ]
    predictions_per_window.append(predictions)
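Other models can be plugged in the same way, as long as they return one `{"predictions": [...]}` dict per series with `horizon` values, as the naive forecast does. As an illustration, here is a sketch of a seasonal naive forecast that repeats the last observed season. `seasonal_naive_forecast` and its `season_length` parameter are hypothetical names, and unlike the naive forecast this sketch does not skip missing values:

```python
import numpy as np


def seasonal_naive_forecast(y: np.ndarray, horizon: int, season_length: int) -> dict:
    """Repeat the last `season_length` observations until `horizon` values are produced."""
    y = np.asarray(y, dtype=float)
    # Slicing falls back to the whole series when it is shorter than one season
    last_season = y[-season_length:]
    repeats = int(np.ceil(horizon / len(last_season)))
    return {"predictions": np.tile(last_season, repeats)[:horizon].tolist()}


print(seasonal_naive_forecast(np.array([1.0, 2.0, 3.0, 4.0]), horizon=5, season_length=2))
# {'predictions': [3.0, 4.0, 3.0, 4.0, 3.0]}
```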
In [10]:
eval_summary = task.evaluation_summary(predictions_per_window, model_name="naive")
eval_summary
Out[10]:
{'model_name': 'naive',
 'dataset_path': 'autogluon/chronos_datasets',
 'dataset_config': 'ercot',
 'horizon': 24,
 'num_windows': 2,
 'initial_cutoff': -48,
 'window_step_size': 24,
 'min_context_length': 1,
 'max_context_length': None,
 'seasonality': 1,
 'eval_metric': 'MASE',
 'extra_metrics': [],
 'quantile_levels': [],
 'id_column': 'id',
 'timestamp_column': 'timestamp',
 'target': 'target',
 'generate_univariate_targets_from': None,
 'known_dynamic_columns': [],
 'past_dynamic_columns': [],
 'static_columns': [],
 'task_name': 'ercot',
 'test_error': 7.301416542738646,
 'training_time_s': None,
 'inference_time_s': None,
 'dataset_fingerprint': '95b91121d95f89c8',
 'trained_on_this_dataset': False,
 'fev_version': '0.6.0',
 'MASE': 7.301416542738646}
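The `MASE` value above is a mean absolute scaled error: the forecast's mean absolute error divided by the in-sample error of a seasonal naive forecast with the given `seasonality`. A simplified single-series sketch is shown below. `mase` is a hypothetical helper, and the way fev aggregates errors across series and windows is not covered here and may differ.

```python
import numpy as np


def mase(y_past, y_true, y_pred, seasonality=1):
    """Mean absolute scaled error for a single series (simplified)."""
    y_past, y_true, y_pred = (np.asarray(a, dtype=float) for a in (y_past, y_true, y_pred))
    # Scale: in-sample MAE of the seasonal naive forecast on the history
    scale = np.mean(np.abs(y_past[seasonality:] - y_past[:-seasonality]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)


y_past = [1.0, 2.0, 4.0, 7.0]  # in-sample naive errors: 1, 2, 3 -> scale 2.0
y_true = [8.0, 10.0]
y_pred = [7.0, 7.0]            # absolute errors: 1, 3 -> MAE 2.0
print(mase(y_past, y_true, y_pred))  # 1.0
```

Under this definition, a value above 1 means the forecast errors over the horizon are larger than the one-step naive errors on the history.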
Evaluation summaries produced by different models on different tasks can be aggregated into a single table.
In [11]:
import pandas as pd
summaries = pd.read_csv("https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/example/results/results.csv")
summaries.head()
Out[11]:
| | model_name | dataset_path | dataset_config | horizon | num_windows | initial_cutoff | window_step_size | min_context_length | max_context_length | seasonality | ... | past_dynamic_columns | static_columns | task_name | test_error | training_time_s | inference_time_s | dataset_fingerprint | trained_on_this_dataset | fev_version | MASE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | seasonal_naive | autogluon/chronos_datasets | monash_m1_quarterly | 8 | 1 | -8 | 8 | 1 | NaN | 4 | ... | [] | [] | monash_m1_quarterly | 2.077537 | 0.0 | 1.687698 | 5dd7170c16393209 | False | 0.6.0 | 2.077537 |
| 1 | ets | autogluon/chronos_datasets | monash_m1_quarterly | 8 | 1 | -8 | 8 | 1 | NaN | 4 | ... | [] | [] | monash_m1_quarterly | 1.660810 | 0.0 | 4.366176 | 5dd7170c16393209 | False | 0.6.0 | 1.660810 |
| 2 | theta | autogluon/chronos_datasets | monash_m1_quarterly | 8 | 1 | -8 | 8 | 1 | NaN | 4 | ... | [] | [] | monash_m1_quarterly | 1.705247 | 0.0 | 0.125761 | 5dd7170c16393209 | False | 0.6.0 | 1.705247 |
| 3 | seasonal_naive | autogluon/chronos_datasets | monash_electricity_weekly | 8 | 2 | -16 | 8 | 1 | NaN | 1 | ... | [] | [] | monash_electricity_weekly | 2.535526 | 0.0 | 1.175560 | b7cd1c9df3391815 | False | 0.6.0 | 2.535526 |
| 4 | ets | autogluon/chronos_datasets | monash_electricity_weekly | 8 | 2 | -16 | 8 | 1 | NaN | 1 | ... | [] | [] | monash_electricity_weekly | 2.552429 | 0.0 | 3.755289 | b7cd1c9df3391815 | False | 0.6.0 | 2.552429 |
5 rows × 28 columns
In [12]:
# Evaluation summaries can be provided as dataframes, dicts, JSON or CSV files
fev.leaderboard(summaries, baseline_model="seasonal_naive")
Out[12]:
| model_name | skill_score | win_rate | median_training_time_s | median_inference_time_s | training_corpus_overlap | num_failures |
|---|---|---|---|---|---|---|
| ets | 0.133483 | 0.833333 | 0.0 | 3.755289 | 0.0 | 0 |
| theta | 0.105932 | 0.333333 | 0.0 | 0.125761 | 0.0 | 0 |
| seasonal_naive | 0.000000 | 0.333333 | 0.0 | 1.444558 | 0.0 | 0 |
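The two headline columns can be read as follows: `skill_score` measures the error reduction relative to the baseline (0 for the baseline itself, positive when a model is better), and `win_rate` is the share of pairwise comparisons that a model wins. The sketch below shows one common way to compute such scores, assuming a geometric mean of relative errors and ties counted as half a win. These are assumptions for illustration, the error values are made up, and fev's exact implementation (for example, any clipping of extreme values) may differ.

```python
import numpy as np

# Made-up errors: rows are tasks, columns are models
models = ["model_a", "model_b", "baseline"]
errors = np.array([
    [1.0, 2.0, 2.0],
    [2.0, 4.0, 4.0],
    [4.0, 2.0, 4.0],
])
num_tasks, num_models = errors.shape
baseline = errors[:, models.index("baseline")]

# Skill score: 1 - geometric mean of the error ratios against the baseline
relative = errors / baseline[:, None]
skill_score = 1.0 - np.exp(np.mean(np.log(relative), axis=0))

# Win rate: share of (task, opponent) pairs won, with ties counted as half a win
wins = (errors[:, :, None] < errors[:, None, :]).astype(float)
ties = (errors[:, :, None] == errors[:, None, :]).astype(float)
# Each model ties with itself once per task, so remove those self-comparisons
points = (wins + 0.5 * ties).sum(axis=(0, 2)) - 0.5 * num_tasks
win_rate = points / (num_tasks * (num_models - 1))

print(dict(zip(models, np.round(skill_score, 3).tolist())))
print(dict(zip(models, win_rate.tolist())))
```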
The `fev.leaderboard` function not only aggregates the results into a single table, but also ensures that all task definitions match across different models. This keeps the scores comparable and the comparison fair.
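A check of this kind can be approximated by hand. The sketch below groups summaries by `task_name` and flags any task whose defining columns differ between models. `find_mismatched_tasks` and the chosen `task_columns` are hypothetical, and this is not a description of how `fev.leaderboard` is implemented internally.

```python
import pandas as pd


def find_mismatched_tasks(summaries: pd.DataFrame, task_columns: list) -> list:
    """Return task names for which models were evaluated under different settings."""
    # Count distinct values of each task-defining column within a task
    distinct = summaries.groupby("task_name")[task_columns].nunique(dropna=False)
    mismatched = (distinct > 1).any(axis=1)
    return sorted(mismatched[mismatched].index)


# Toy summaries: model "b" used a different horizon on task "t2"
toy_summaries = pd.DataFrame({
    "model_name": ["a", "b", "a", "b"],
    "task_name": ["t1", "t1", "t2", "t2"],
    "horizon": [8, 8, 8, 12],
    "num_windows": [1, 1, 2, 2],
})
print(find_mismatched_tasks(toy_summaries, ["horizon", "num_windows"]))  # ['t2']
```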