Quickstart
This notebook contains a minimal example of using fev
to evaluate time series forecasting models.
In [1]:
import fev
In [2]:
# Create a task from a dataset stored on Hugging Face Hub
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="ercot",
    horizon=24,
    num_windows=2,
)
In [3]:
# A task consists of multiple rolling evaluation windows
for window in task.iter_windows():
    print(window)
EvaluationWindow(cutoff=-48, horizon=24, min_context_length=1, max_context_length=None, id_column='id', timestamp_column='timestamp', target_columns=['target'], known_dynamic_columns=[], past_dynamic_columns=[], static_columns=[])
EvaluationWindow(cutoff=-24, horizon=24, min_context_length=1, max_context_length=None, id_column='id', timestamp_column='timestamp', target_columns=['target'], known_dynamic_columns=[], past_dynamic_columns=[], static_columns=[])
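Each window is identified by its `cutoff`, the position relative to the end of each series at which the forecast horizon starts. A minimal sketch of the arithmetic behind the two cutoffs above (this assumes the default window step size equals the horizon, which the evaluation summary below confirms via `window_step_size: 24`):

```python
# Sketch: with the step size equal to the horizon (an assumption matching
# window_step_size=24 in the evaluation summary below), cutoffs count back
# from the end of each series.
horizon, num_windows = 24, 2
cutoffs = [-(num_windows - i) * horizon for i in range(num_windows)]
print(cutoffs)  # [-48, -24]
```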
In [4]:
# Load data available as input to the forecasting model
past_data, future_data = task.get_window(0).get_input_data()
In [5]:
# Past data before the forecast horizon
past_data
Out[5]:
Dataset({
    features: ['id', 'timestamp', 'target'],
    num_rows: 8
})
In [6]:
past_data[0]
Out[6]:
{'id': np.str_('COAST'),
 'timestamp': array(['2004-01-01T01:00:00.000000000', '2004-01-01T02:00:00.000000000',
        '2004-01-01T03:00:00.000000000', ..., '2021-08-29T22:00:00.000000000',
        '2021-08-29T23:00:00.000000000', '2021-08-30T00:00:00.000000000'],
       dtype='datetime64[ns]'),
 'target': array([ 7225.09,  6994.25,  6717.42, ..., 17114.34, 16091.05, 15081.16],
       dtype=float32)}
In [7]:
# Future data that is known at prediction time (item ID, future timestamps, static and known covariates)
future_data
Out[7]:
Dataset({
    features: ['id', 'timestamp'],
    num_rows: 8
})
In [8]:
future_data[0]
Out[8]:
{'id': np.str_('COAST'),
 'timestamp': array(['2021-08-30T01:00:00.000000000', '2021-08-30T02:00:00.000000000',
        '2021-08-30T03:00:00.000000000', '2021-08-30T04:00:00.000000000',
        '2021-08-30T05:00:00.000000000', '2021-08-30T06:00:00.000000000',
        '2021-08-30T07:00:00.000000000', '2021-08-30T08:00:00.000000000',
        '2021-08-30T09:00:00.000000000', '2021-08-30T10:00:00.000000000',
        '2021-08-30T11:00:00.000000000', '2021-08-30T12:00:00.000000000',
        '2021-08-30T13:00:00.000000000', '2021-08-30T14:00:00.000000000',
        '2021-08-30T15:00:00.000000000', '2021-08-30T16:00:00.000000000',
        '2021-08-30T17:00:00.000000000', '2021-08-30T18:00:00.000000000',
        '2021-08-30T19:00:00.000000000', '2021-08-30T20:00:00.000000000',
        '2021-08-30T21:00:00.000000000', '2021-08-30T22:00:00.000000000',
        '2021-08-30T23:00:00.000000000', '2021-08-31T00:00:00.000000000'],
       dtype='datetime64[ns]')}
In [9]:
import numpy as np

def naive_forecast(y: np.ndarray, horizon: int) -> dict[str, list]:
    # Make predictions for a single time series: repeat the last finite observation
    return {"predictions": [y[np.isfinite(y)][-1] for _ in range(horizon)]}

predictions_per_window = []
for window in task.iter_windows():
    past_data, future_data = window.get_input_data()
    predictions = [
        naive_forecast(ts[task.target], task.horizon) for ts in past_data
    ]
    predictions_per_window.append(predictions)
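Before computing metrics, it can help to verify that the predictions have the expected shape: one list per window, one forecast per series, each of length `horizon`. A minimal sanity-check sketch (it assumes the task exposes a `num_windows` attribute mirroring the constructor argument above):

```python
# Sanity check (assumption: task.num_windows mirrors the num_windows
# argument passed to fev.Task above).
assert len(predictions_per_window) == task.num_windows
for window_predictions in predictions_per_window:
    assert all(len(p["predictions"]) == task.horizon for p in window_predictions)
```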
In [10]:
eval_summary = task.evaluation_summary(predictions_per_window, model_name="naive")
eval_summary
Out[10]:
{'model_name': 'naive',
 'dataset_path': 'autogluon/chronos_datasets',
 'dataset_config': 'ercot',
 'horizon': 24,
 'num_windows': 2,
 'initial_cutoff': -48,
 'window_step_size': 24,
 'min_context_length': 1,
 'max_context_length': None,
 'seasonality': 1,
 'eval_metric': 'MASE',
 'extra_metrics': [],
 'quantile_levels': [],
 'id_column': 'id',
 'timestamp_column': 'timestamp',
 'target': 'target',
 'generate_univariate_targets_from': None,
 'known_dynamic_columns': [],
 'past_dynamic_columns': [],
 'static_columns': [],
 'task_name': 'ercot',
 'test_error': 7.301416542738646,
 'training_time_s': None,
 'inference_time_s': None,
 'dataset_fingerprint': '95b91121d95f89c8',
 'trained_on_this_dataset': False,
 'fev_version': '0.6.0',
 'MASE': 7.301416542738646}
Evaluation summaries produced by different models on different tasks can be aggregated into a single table.
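For example, since each summary is a flat dict, a list of summaries converts directly into a pandas DataFrame that can be saved and later merged with results from other runs (a minimal sketch; the file name is illustrative):

```python
import pandas as pd

# One row per (model, task) pair; "my_results.csv" is an illustrative name.
pd.DataFrame([eval_summary]).to_csv("my_results.csv", index=False)
```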
In [11]:
import pandas as pd

summaries = pd.read_csv("https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/example/results/results.csv")
summaries.head()
Out[11]:
| | model_name | dataset_path | dataset_config | horizon | num_windows | initial_cutoff | window_step_size | min_context_length | max_context_length | seasonality | ... | past_dynamic_columns | static_columns | task_name | test_error | training_time_s | inference_time_s | dataset_fingerprint | trained_on_this_dataset | fev_version | MASE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | seasonal_naive | autogluon/chronos_datasets | monash_m1_quarterly | 8 | 1 | -8 | 8 | 1 | NaN | 4 | ... | [] | [] | monash_m1_quarterly | 2.077537 | 0.0 | 1.687698 | 5dd7170c16393209 | False | 0.6.0 | 2.077537 |
| 1 | ets | autogluon/chronos_datasets | monash_m1_quarterly | 8 | 1 | -8 | 8 | 1 | NaN | 4 | ... | [] | [] | monash_m1_quarterly | 1.660810 | 0.0 | 4.366176 | 5dd7170c16393209 | False | 0.6.0 | 1.660810 |
| 2 | theta | autogluon/chronos_datasets | monash_m1_quarterly | 8 | 1 | -8 | 8 | 1 | NaN | 4 | ... | [] | [] | monash_m1_quarterly | 1.705247 | 0.0 | 0.125761 | 5dd7170c16393209 | False | 0.6.0 | 1.705247 |
| 3 | seasonal_naive | autogluon/chronos_datasets | monash_electricity_weekly | 8 | 2 | -16 | 8 | 1 | NaN | 1 | ... | [] | [] | monash_electricity_weekly | 2.535526 | 0.0 | 1.175560 | b7cd1c9df3391815 | False | 0.6.0 | 2.535526 |
| 4 | ets | autogluon/chronos_datasets | monash_electricity_weekly | 8 | 2 | -16 | 8 | 1 | NaN | 1 | ... | [] | [] | monash_electricity_weekly | 2.552429 | 0.0 | 3.755289 | b7cd1c9df3391815 | False | 0.6.0 | 2.552429 |
5 rows × 28 columns
In [12]:
# Evaluation summaries can be provided as dataframes, dicts, JSON or CSV files
fev.leaderboard(summaries, baseline_model="seasonal_naive")
Out[12]:
| model_name | skill_score | win_rate | median_training_time_s | median_inference_time_s | training_corpus_overlap | num_failures |
|---|---|---|---|---|---|---|
| ets | 0.133483 | 0.833333 | 0.0 | 3.755289 | 0.0 | 0 |
| theta | 0.105932 | 0.333333 | 0.0 | 0.125761 | 0.0 | 0 |
| seasonal_naive | 0.000000 | 0.333333 | 0.0 | 1.444558 | 0.0 | 0 |
The leaderboard method not only summarizes the results into a single table, but also verifies that all task definitions match across the different models, so the reported scores are comparable and the comparison is fair.
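As a sketch of why that check matters: concatenating the summary computed earlier in this notebook with the downloaded results (plain pandas, nothing fev-specific) yields a table mixing models that were evaluated on different tasks, which the leaderboard's consistency check is designed to catch:

```python
import pandas as pd

# Combine summaries from two sources. The "naive" model above was evaluated
# on a different task (ercot) than the models in the downloaded results, so
# passing this table to fev.leaderboard() would trip its check that task
# definitions match across models.
combined = pd.concat([summaries, pd.DataFrame([eval_summary])], ignore_index=True)
```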