Models
This notebook covers the following topics:
- Adding a wrapper for your model to fev/models.
- Submitting the results for your model to the fev-bench leaderboard.
Adding a wrapper for your model
Each model wrapper lives in its own subfolder under models/. The evaluation harness (models/evaluate.py) discovers and runs them automatically.
Step 1: Create the folder
Create a folder models/<name>/ where <name> is how you'll refer to the model with the -m flag.
Step 2: Add model.py
Implement a subclass of fev.ForecastingModel. The model_name class attribute must match the folder name.
# models/my-model/model.py
import datasets
import fev


class MyModel(fev.ForecastingModel):
    model_name = "my-model"  # must match the folder name

    # List HF dataset configs (from autogluon/fev_datasets) used during pretraining.
    # Used to flag potential data leakage. Leave empty for models that train from scratch.
    trained_on_datasets = ["kdd_cup_2022_10T", "m5_1D"]

    def __init__(self, model_size: str = "small"):
        super().__init__()
        self.model_size = model_size

    def _fit_predict(self, task: fev.Task) -> list[datasets.DatasetDict]:
        predictions_per_window = []
        for window in task.iter_windows():
            past_data, future_data = window.get_input_data()
            with self._record_inference_time():
                # Generate predictions for each time series
                predictions = {"predictions": [...]}
            predictions_per_window.append(predictions)
        return predictions_per_window
Key points about _fit_predict:
- Called once per task. Must return predictions for all evaluation windows.
- Use the self._record_inference_time() context manager to track inference time.
- Use self._record_training_time() if your model has a training step.
- Each call should be independent; don't carry over state from prior tasks.
- Caching expensive resources (weights, tokenizers) on self across calls is fine.
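The timing and caching points above can be sketched without fev installed. The class below is a hypothetical stand-in, not fev.ForecastingModel: the attribute names inference_time_s and load_count, the helper _get_weights, and the choice to reset the timer at the start of each call are all assumptions made for illustration. How fev itself stores or resets recorded times is not shown in this notebook.

```python
import time
from contextlib import contextmanager


class TimedWrapperSketch:
    """Hypothetical stand-in for a model wrapper; NOT fev's actual base class."""

    def __init__(self):
        self.inference_time_s = 0.0
        self.load_count = 0
        self._weights = None  # expensive resource cached on self

    @contextmanager
    def _record_inference_time(self):
        # Accumulate wall-clock time spent inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.inference_time_s += time.perf_counter() - start

    def _get_weights(self):
        # Load once, then reuse the cached object on later calls.
        if self._weights is None:
            self.load_count += 1
            self._weights = {"scale": 1.0}
        return self._weights

    def fit_predict(self, windows):
        # Assumption: timing is reset per call so that each task is measured alone.
        self.inference_time_s = 0.0
        weights = self._get_weights()
        # Per-task results live in a local variable, so nothing leaks between calls.
        predictions_per_window = []
        for history in windows:
            with self._record_inference_time():
                last = history[-1] * weights["scale"]
                predictions_per_window.append([last, last])
        return predictions_per_window


model = TimedWrapperSketch()
first = model.fit_predict([[1.0, 2.0], [3.0, 4.0]])
second = model.fit_predict([[5.0]])
print(first, second, model.load_count)
```

The second call reuses the cached weights (load_count stays at 1) while its predictions depend only on its own input, which is the split between "cache on self" and "no state from prior tasks" described in the list.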
Step 3: Add requirements.txt
List the dependencies for your model, pinning the main packages to exact versions for reproducibility. They are installed automatically in an ephemeral environment when running evaluate.py; your project environment is not modified.
# models/my-model/requirements.txt
my-forecasting-lib==1.2.3
torch==2.7
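Before running the harness, the three steps can be checked locally with a small script. The helper below is a hypothetical sketch, not part of fev: check_wrapper_folder is an invented name, it only inspects string literals assigned to model_name in a class body, and it flags every requirement without "==" even though the text only asks for the main packages to be pinned.

```python
import ast
import tempfile
from pathlib import Path


def check_wrapper_folder(folder: Path) -> list[str]:
    """Return a list of problems found in a models/<name>/ folder (hypothetical helper)."""
    problems = []
    model_py = folder / "model.py"
    requirements = folder / "requirements.txt"

    if not model_py.is_file():
        problems.append("missing model.py")
    else:
        # Collect string literals assigned to `model_name` inside any class body.
        names = [
            stmt.value.value
            for node in ast.walk(ast.parse(model_py.read_text()))
            if isinstance(node, ast.ClassDef)
            for stmt in node.body
            if isinstance(stmt, ast.Assign)
            and any(isinstance(t, ast.Name) and t.id == "model_name" for t in stmt.targets)
            and isinstance(stmt.value, ast.Constant)
        ]
        if folder.name not in names:
            problems.append(f"model_name {names} does not match folder {folder.name!r}")

    if not requirements.is_file():
        problems.append("missing requirements.txt")
    else:
        for line in requirements.read_text().splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if line and "==" not in line:
                problems.append(f"unpinned requirement: {line}")
    return problems


# Demo on a throwaway folder that mirrors the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "models" / "my-model"
    folder.mkdir(parents=True)
    (folder / "model.py").write_text('class MyModel:\n    model_name = "my-model"\n')
    (folder / "requirements.txt").write_text("my-forecasting-lib==1.2.3\ntorch\n")
    problems = check_wrapper_folder(folder)

print(problems)  # ['unpinned requirement: torch']
```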
Predictions format
Predictions must follow the schema provided by task.predictions_schema.
import fev
benchmark = fev.Benchmark.from_yaml("https://raw.githubusercontent.com/autogluon/fev/refs/tags/v0.7.0/benchmarks/fev_bench/tasks.yaml")
task = [t for t in benchmark.tasks if t.task_name == "rossmann_1W"][0]
task.predictions_schema
{'predictions': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.1': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.2': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.3': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.4': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.5': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.6': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.7': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.8': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None),
'0.9': Sequence(feature=Value(dtype='float64', id=None), length=13, id=None)}
Predictions must not contain missing values (NaN); otherwise, an exception is raised.
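As a rough illustration of the schema above, the following sketch builds the forecast columns for a single series from sample paths and runs a local pre-check for length and NaN. It uses only the standard library. The names forecast_from_samples and find_problems are hypothetical, the horizon of 13 and the nine quantile columns are taken from the rossmann_1W schema shown, and the check does not reproduce fev's own validation. Using the sample mean for the "predictions" column is also an assumption.

```python
import math
import statistics

QUANTILE_LEVELS = [f"0.{i}" for i in range(1, 10)]  # "0.1" ... "0.9", as in the schema above


def forecast_from_samples(samples):
    """Turn sample paths (each a list of `horizon` floats) into one forecast dict."""
    horizon = len(samples[0])
    per_step = [[path[h] for path in samples] for h in range(horizon)]
    forecast = {"predictions": [statistics.fmean(values) for values in per_step]}
    # statistics.quantiles(n=10) returns the nine decile cut points for each step.
    deciles = [statistics.quantiles(values, n=10, method="inclusive") for values in per_step]
    for i, level in enumerate(QUANTILE_LEVELS):
        forecast[level] = [step_deciles[i] for step_deciles in deciles]
    return forecast


def find_problems(forecast, horizon, expected_columns):
    """Local pre-check: every expected column is present, has `horizon` values, and no NaN."""
    problems = []
    for key in expected_columns:
        values = forecast.get(key)
        if values is None:
            problems.append(f"missing column {key!r}")
        elif len(values) != horizon:
            problems.append(f"column {key!r} has length {len(values)}, expected {horizon}")
        elif any(math.isnan(v) for v in values):
            problems.append(f"NaN in column {key!r}")
    return problems


# Five synthetic sample paths over a horizon of 13 steps.
samples = [[float(s + h) for h in range(13)] for s in range(5)]
columns = ["predictions", *QUANTILE_LEVELS]

forecast = forecast_from_samples(samples)
problems = find_problems(forecast, horizon=13, expected_columns=columns)

# A forecast containing NaN is reported instead of passing silently.
broken = dict(forecast, predictions=[math.nan] * 13)
broken_problems = find_problems(broken, horizon=13, expected_columns=columns)
print(problems, broken_problems)
```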
Other than what's described above, there are no hard restrictions on how _fit_predict needs to be implemented. For example, it's completely up to you whether the method uses any dataset columns except the target or how the data is preprocessed.
Still, here is some general advice:
- If your model is capable of generating probabilistic forecasts, make sure that you use the "optimal" point forecast for the task.eval_metric. For example, for metrics like "MSE" or "RMSSE", the mean forecast is preferred, while metrics like "MASE" are optimized by the median forecast.
- Use fev.convert_input_data() to take advantage of the adapters and reduce the boilerplate preprocessing code.
- Make sure that your wrapper can deal with missing values (or at least imputes them before passing the data to your model).
- Make sure that your wrapper takes advantage of the extra features of the task. For example, the following attributes might be useful:
print(f"{task.static_columns=}")
print(f"{task.dynamic_columns=}")
print(f"{task.known_dynamic_columns=}")
print(f"{task.past_dynamic_columns=}")
# Attributes available after `task.load_full_dataset` is called
task.load_full_dataset()
print(f"{task.freq=}")
task.static_columns=['Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'Store', 'StoreType']
task.dynamic_columns=['Open', 'Promo', 'SchoolHoliday', 'StateHoliday', 'Customers']
task.known_dynamic_columns=['Open', 'Promo', 'SchoolHoliday', 'StateHoliday']
task.past_dynamic_columns=['Customers']
task.freq='W-SUN'
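The advice on choosing the point forecast by metric can be illustrated with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: point_forecast and the mapping are hypothetical names, the mapping covers only the three metrics named in the advice, and the samples are made up. In a real wrapper the metric name would come from task.eval_metric.

```python
import statistics

# Only the metrics named in the advice above; any other metric needs its own entry.
POINT_FORECAST_BY_METRIC = {
    "MSE": statistics.fmean,
    "RMSSE": statistics.fmean,
    "MASE": statistics.median,
}


def point_forecast(samples_per_step, eval_metric):
    """Reduce forecast samples to one value per step, depending on the metric."""
    try:
        reducer = POINT_FORECAST_BY_METRIC[eval_metric]
    except KeyError:
        raise ValueError(f"no point-forecast rule for metric {eval_metric!r}") from None
    return [float(reducer(values)) for values in samples_per_step]


def average_loss(point, values, loss):
    """Average loss of a single point forecast against a set of samples."""
    return statistics.fmean(loss(point - v) for v in values)


# Skewed samples for a horizon of two steps.
samples_per_step = [[1.0, 2.0, 9.0], [0.0, 4.0, 5.0]]
mean_forecast = point_forecast(samples_per_step, "MSE")     # [4.0, 3.0]
median_forecast = point_forecast(samples_per_step, "MASE")  # [2.0, 4.0]

# On the first step, the mean wins under squared error and the median under absolute error.
first = samples_per_step[0]
squared_at_mean = average_loss(mean_forecast[0], first, lambda e: e * e)
squared_at_median = average_loss(median_forecast[0], first, lambda e: e * e)
absolute_at_mean = average_loss(mean_forecast[0], first, abs)
absolute_at_median = average_loss(median_forecast[0], first, abs)
print(squared_at_mean < squared_at_median, absolute_at_median < absolute_at_mean)
```

With skewed samples the two reducers disagree, so picking the wrong one costs accuracy on the metric even if the underlying probabilistic forecast is unchanged.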
Running evaluation
python models/evaluate.py -m my-model
Options:
- -m: model name (must match a subfolder in models/)
- -b: path or URL to benchmark YAML (default: fev_bench_mini)
- -n: display name for results (default: same as -m)
- -k: JSON dict of kwargs passed to the model constructor
- -t: limit number of tasks (useful for quick testing)
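The options listed above can be combined into a single invocation. The sketch below assembles such a command in Python, mainly to show how the -k value is a JSON-encoded dict. build_evaluate_command is a hypothetical helper, model_size is the constructor argument from the MyModel example, and passing an integer to -t is an assumption based on "limit number of tasks".

```python
import json
import shlex


def build_evaluate_command(model, benchmark=None, display_name=None,
                           model_kwargs=None, max_tasks=None):
    """Assemble an evaluate.py command line from the documented flags (hypothetical helper)."""
    cmd = ["python", "models/evaluate.py", "-m", model]
    if benchmark is not None:
        cmd += ["-b", benchmark]
    if display_name is not None:
        cmd += ["-n", display_name]
    if model_kwargs:
        # -k expects a JSON dict, so the kwargs are serialized with json.dumps.
        cmd += ["-k", json.dumps(model_kwargs)]
    if max_tasks is not None:
        cmd += ["-t", str(max_tasks)]
    return cmd


cmd = build_evaluate_command("my-model", model_kwargs={"model_size": "large"}, max_tasks=2)
command_line = shlex.join(cmd)
print(command_line)
# python models/evaluate.py -m my-model -k '{"model_size": "large"}' -t 2
```

The resulting list can be passed to subprocess.run(cmd) from the repository root, or the printed line can be pasted into a shell.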
Submitting results to the leaderboard
After implementing your model wrapper, follow these steps to submit results to the fev-bench leaderboard:
- Fork autogluon/fev and clone your fork.
- Implement your model wrapper in models/<name>/.
- Run the model on all tasks from the benchmark and save results:
    python models/evaluate.py -m <name> -b benchmarks/fev_bench/tasks.yaml
    mv <name>.csv benchmarks/fev_bench/results/<name>.csv
- Open a pull request to autogluon/fev containing:
    - models/<name>/model.py
    - models/<name>/requirements.txt
    - benchmarks/fev_bench/results/<name>.csv
- We will independently validate the results using your code and add them to the leaderboard.
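Before opening the pull request, it may help to confirm that the three files listed above exist in your fork. The helper below is a hypothetical sketch (missing_submission_files is not part of fev) and checks only for presence, not content.

```python
import tempfile
from pathlib import Path


def missing_submission_files(repo_root: Path, name: str) -> list[str]:
    """Return the expected pull-request files that are absent (hypothetical helper)."""
    expected = [
        f"models/{name}/model.py",
        f"models/{name}/requirements.txt",
        f"benchmarks/fev_bench/results/{name}.csv",
    ]
    return [rel for rel in expected if not (repo_root / rel).is_file()]


# Demo on a throwaway directory standing in for a clone of the fork.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models" / "my-model").mkdir(parents=True)
    (root / "models" / "my-model" / "model.py").write_text("")
    (root / "models" / "my-model" / "requirements.txt").write_text("")
    missing = missing_submission_files(root, "my-model")

print(missing)  # ['benchmarks/fev_bench/results/my-model.csv']
```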