Models

This notebook covers the following topics:

- Adding a wrapper for your model to fev/models.
- Submitting the results for your model to the fev-bench leaderboard.

Adding a wrapper for your model

Each model wrapper lives in its own subfolder under models/. The evaluation harness (models/evaluate.py) discovers and runs them automatically.

Step 1: Create the folder

Create a folder models/<name>/ where <name> is how you'll refer to the model with the -m flag.

Step 2: Add `model.py`

Implement a subclass of fev.ForecastingModel. The model_name class attribute must match the folder name.

# models/my-model/model.py
import datasets

import fev


class MyModel(fev.ForecastingModel):
    model_name = "my-model"  # must match the folder name

    # List HF dataset configs (from autogluon/fev_datasets) used during pretraining.
    # Used to flag potential data leakage. Leave empty for models that train from scratch.
    trained_on_datasets = ["kdd_cup_2022_10T", "m5_1D"]

    def __init__(self, model_size: str = "small"):
        super().__init__()
        self.model_size = model_size

    def _fit_predict(self, task: fev.Task) -> list[datasets.DatasetDict]:
        predictions_per_window = []
        for window in task.iter_windows():
            past_data, future_data = window.get_input_data()

            with self._record_inference_time():
                # Generate predictions for each time series
                predictions = {"predictions": [...]}

            predictions_per_window.append(predictions)
        return predictions_per_window

Key points about _fit_predict:

- Called once per task; it must return predictions for all evaluation windows.
- Use the self._record_inference_time() context manager to track inference time.
- Use self._record_training_time() if your model has a training step.
- Each call should be independent: don't carry over state from prior tasks.
- Caching expensive resources (weights, tokenizers) on self across calls is fine.
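
The last two points can pull in opposite directions, so here is a minimal sketch of one way to reconcile them: load the expensive resource lazily and keep it on self, while everything derived from the task's data stays in local variables. The class LazyModel and its methods are hypothetical stand-ins, not part of fev, and the "pipeline" is a placeholder for real weights or a tokenizer.

```python
class LazyModel:
    """Hypothetical wrapper: caches a costly resource, keeps task state local."""

    def __init__(self):
        self._pipeline = None  # filled in on first use, then reused
        self.load_count = 0    # only here to make the caching observable

    def _load_pipeline(self):
        # Stand-in for loading weights or a tokenizer (assumed to be slow).
        self.load_count += 1
        return {"weights": "loaded"}

    def get_pipeline(self):
        if self._pipeline is None:
            self._pipeline = self._load_pipeline()
        return self._pipeline

    def fit_predict(self, series, horizon=3):
        self.get_pipeline()             # cached across calls
        history = list(series)          # per-call data is never stored on self
        return [history[-1]] * horizon  # naive last-value forecast as a placeholder


model = LazyModel()
model.fit_predict([1, 2, 3])
model.fit_predict([4, 5])
print(model.load_count)  # the resource was loaded only once
```

Nothing from the first call influences the second one except the cached resource, which is the property the key points ask for.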

Step 3: Add `requirements.txt`

List the dependencies of your model, pinning the main packages to exact versions for reproducibility. They are installed automatically in an ephemeral environment when you run evaluate.py; your project environment is not modified.

# models/my-model/requirements.txt
my-forecasting-lib==1.2.3
torch==2.7
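
A quick way to spot unpinned entries before submitting is sketched below. The helper unpinned_requirements is hypothetical (it is not part of fev or pip) and deliberately simple: it only recognizes plain `name==version` lines and treats everything else, including URLs and environment markers, as unpinned. Since only the main packages need exact pins, treat its output as a hint rather than an error.

```python
import re

# A bare package name (optionally with extras), followed by == and a version.
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the requirement lines that are not pinned with `==`."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            unpinned.append(line)
    return unpinned


example = """
# models/my-model/requirements.txt
my-forecasting-lib==1.2.3
torch>=2.0
"""
print(unpinned_requirements(example))
```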

Predictions format

Predictions must follow the schema provided by task.predictions_schema.

In [13]:

import fev

benchmark = fev.Benchmark.from_yaml("https://raw.githubusercontent.com/autogluon/fev/refs/tags/v0.7.0/benchmarks/fev_bench/tasks.yaml")
task = [t for t in benchmark.tasks if t.task_name == "rossmann_1W"][0]
task.predictions_schema

Out[13]:

{'predictions': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.1': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.2': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.3': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.4': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.5': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.6': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.7': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.8': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
 '0.9': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None)}

Predictions must not contain missing values (NaN); otherwise an exception is raised.
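
It can be convenient to catch such problems inside the wrapper before the harness does. The following is a minimal sketch of a pre-check for the forecast of a single series; check_predictions is a hypothetical helper, and the column names and the length of 13 are copied from the rossmann_1W schema shown above rather than read from the task.

```python
import math


def check_predictions(record: dict, expected_keys: list[str], horizon: int) -> None:
    """Raise ValueError if one series' forecast is incomplete or contains NaN."""
    for key in expected_keys:
        if key not in record:
            raise ValueError(f"missing column {key!r}")
        values = record[key]
        if len(values) != horizon:
            raise ValueError(f"{key!r}: expected {horizon} values, got {len(values)}")
        if any(math.isnan(v) for v in values):
            raise ValueError(f"{key!r} contains NaN")


# Column names as in the schema above: the point forecast plus nine quantile levels.
keys = ["predictions"] + [f"{q / 10:.1f}" for q in range(1, 10)]
good = {k: [0.0] * 13 for k in keys}
check_predictions(good, keys, horizon=13)  # passes silently
```

In a real wrapper the expected columns would come from task.predictions_schema instead of being hard-coded.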

Other than what's described above, there are no hard restrictions on how _fit_predict needs to be implemented. For example, it's completely up to you whether the method uses any dataset columns except the target or how the data is preprocessed.

Still, here is some general advice:

- If your model is capable of generating probabilistic forecasts, make sure that you use the "optimal" point forecast for the task.eval_metric. For example, for metrics like "MSE" or "RMSSE", the mean forecast is preferred, while metrics like "MASE" are optimized by the median forecast.
- Use fev.convert_input_data() to take advantage of the adapters and reduce the boilerplate preprocessing code.
- Make sure that your wrapper handles missing values (or at least imputes them before passing the data to your model).
- Make sure that your wrapper takes advantage of the extra features of the task. For example, the following attributes might be useful:

In [14]:

print(f"{task.static_columns=}")
print(f"{task.dynamic_columns=}")
print(f"{task.known_dynamic_columns=}")
print(f"{task.past_dynamic_columns=}")
# Attributes available after `task.load_full_dataset` is called
task.load_full_dataset()
print(f"{task.freq=}")

task.static_columns=['Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'Store', 'StoreType']
task.dynamic_columns=['Open', 'Promo', 'SchoolHoliday', 'StateHoliday', 'Customers']
task.known_dynamic_columns=['Open', 'Promo', 'SchoolHoliday', 'StateHoliday']
task.past_dynamic_columns=['Customers']
task.freq='W-SUN'
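
Returning to the first piece of advice above, here is a minimal sketch of choosing the point forecast according to the metric when the model produces sample paths. The function point_forecast and the two metric sets are hypothetical; they cover only the metrics named in the text and assume the metric is available as a string such as the value of task.eval_metric. Other metrics raise an error instead of guessing.

```python
import statistics

MEAN_METRICS = {"MSE", "RMSSE"}  # squared-error metrics: the mean is preferred
MEDIAN_METRICS = {"MASE"}        # absolute-error metrics: the median is preferred


def point_forecast(samples_per_step: list[list[float]], eval_metric: str) -> list[float]:
    """Collapse forecast samples into one value per time step.

    samples_per_step[t] holds the sampled values for step t of the horizon.
    """
    if eval_metric in MEAN_METRICS:
        aggregate = statistics.fmean
    elif eval_metric in MEDIAN_METRICS:
        aggregate = statistics.median
    else:
        raise ValueError(f"no rule defined for metric {eval_metric!r}")
    return [aggregate(step) for step in samples_per_step]


samples = [[1.0, 2.0, 9.0], [0.0, 0.0, 3.0]]
print(point_forecast(samples, "MSE"))   # mean per step
print(point_forecast(samples, "MASE"))  # median per step
```

With skewed samples like these the two choices differ noticeably, which is why matching the point forecast to the metric can change the score.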

Running evaluation

python models/evaluate.py -m my-model

Options:

-m — model name (must match a subfolder in models/)
-b — path or URL to benchmark YAML (default: fev_bench_mini)
-n — display name for results (default: same as -m)
-k — JSON dict of kwargs passed to the model constructor
-t — limit number of tasks (useful for quick testing)
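
Because -k expects a JSON dict, quoting it by hand in a shell is easy to get wrong. The sketch below builds the command from Python instead. It uses only the -m, -k and -t flags listed above; build_evaluate_command is a hypothetical helper, and model_size is the constructor argument from the example wrapper.

```python
import json
import shlex


def build_evaluate_command(model, kwargs=None, num_tasks=None):
    """Assemble the argument list for models/evaluate.py."""
    cmd = ["python", "models/evaluate.py", "-m", model]
    if kwargs:
        cmd += ["-k", json.dumps(kwargs)]  # constructor kwargs as a JSON dict
    if num_tasks is not None:
        cmd += ["-t", str(num_tasks)]      # limit the number of tasks
    return cmd


cmd = build_evaluate_command("my-model", {"model_size": "small"}, num_tasks=2)
print(shlex.join(cmd))
# python models/evaluate.py -m my-model -k '{"model_size": "small"}' -t 2
```

The list form can be passed directly to subprocess.run, which avoids shell quoting altogether.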

Submitting results to the leaderboard

After implementing your model wrapper, follow these steps to submit results to the fev-bench leaderboard:

Fork autogluon/fev and clone your fork.
Implement your model wrapper in models/<name>/.

Run the model on all tasks from the benchmark and save results:

python models/evaluate.py -m <name> -b benchmarks/fev_bench/tasks.yaml
mv <name>.csv benchmarks/fev_bench/results/<name>.csv

Open a pull request to autogluon/fev containing:
- models/<name>/model.py
- models/<name>/requirements.txt
- benchmarks/fev_bench/results/<name>.csv

We will independently validate the results using your code and add them to the leaderboard.
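
Before opening the pull request, it may help to confirm that the three files listed above are in place. This is a minimal sketch; missing_submission_files is a hypothetical helper, and the paths are the ones from the list, relative to the root of your fork.

```python
from pathlib import Path


def missing_submission_files(repo_root, name):
    """Return the expected pull request files that do not exist yet."""
    expected = [
        Path("models") / name / "model.py",
        Path("models") / name / "requirements.txt",
        Path("benchmarks") / "fev_bench" / "results" / f"{name}.csv",
    ]
    return [p.as_posix() for p in expected if not (Path(repo_root) / p).is_file()]


# Run from the root of the fork; an empty list means all three files are present.
print(missing_submission_files(".", "my-model"))
```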