
Analysis methods

fev provides three main functions for aggregating the evaluation summaries produced by Task.evaluation_summary(): leaderboard, pairwise_comparison, and pivot_table.

On this page, SummaryType is an alias for any of the following types:

SummaryType: TypeAlias = pd.DataFrame | list[dict] | str | pathlib.Path

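For illustration, the snippet below shows the three accepted input forms. It is only a sketch: the column names (apart from test_error), the model names, and the file path are hypothetical; real summaries are produced by Task.evaluation_summary().

import pathlib

import pandas as pd

# 1) A list of dicts, e.g. the outputs of Task.evaluation_summary() collected per task and model
summaries_as_dicts = [
    {"model_name": "my_model", "test_error": 0.71},        # illustrative column names
    {"model_name": "seasonal_naive", "test_error": 1.00},
]

# 2) The same records as a pandas DataFrame
summaries_as_df = pd.DataFrame(summaries_as_dicts)

# 3) A path (str or pathlib.Path) to previously saved summaries
summaries_as_path = pathlib.Path("results/summaries.csv")

# Each value above matches the SummaryType alias; the functions below also
# accept a list of such values, e.g. one summaries file per model.
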
Functions

leaderboard(summaries: SummaryType | list[SummaryType], metric_column: str = 'test_error', missing_strategy: Literal['error', 'drop', 'impute'] = 'error', baseline_model: str = 'seasonal_naive', min_relative_error: float | None = 0.01, max_relative_error: float | None = 100.0, included_models: list[str] | None = None, excluded_models: list[str] | None = None, n_resamples: int | None = None, seed: int = 123)

Generate a leaderboard with aggregate performance metrics for all models.

Computes skill score (1 - geometric mean relative error) and win rate with bootstrap confidence intervals across all tasks. Models are ranked by skill score.

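To make these two aggregates concrete, here is a rough numpy sketch of the definitions above, computed from a tasks-by-models matrix of clipped relative errors. This is an illustration only, not fev's internal _skill_score / _win_rate implementation, and details such as tie handling may differ:

import numpy as np


def skill_score(rel_errors: np.ndarray) -> np.ndarray:
    """1 - geometric mean of each model's relative errors (columns = models)."""
    # rel_errors: shape (n_tasks, n_models); errors divided by the baseline's error
    # on each task and clipped to [min_relative_error, max_relative_error].
    return 1.0 - np.exp(np.log(rel_errors).mean(axis=0))


def win_rate(errors: np.ndarray) -> np.ndarray:
    """Fraction of (task, opponent) comparisons that each model wins."""
    n_tasks, n_models = errors.shape
    # beats[t, i, j] is True when model i has lower error than model j on task t
    beats = errors[:, :, None] < errors[:, None, :]
    return beats.sum(axis=(0, 2)) / (n_tasks * (n_models - 1))
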
Parameters:

  • summaries (SummaryType | list[SummaryType], required): Evaluation summaries as DataFrame, list of dicts, or file path(s)
  • metric_column (str, default "test_error"): Column name containing the metric to evaluate
  • missing_strategy (Literal['error', 'drop', 'impute'], default "error"): How to handle missing results:
      • "error": Raise an error if any results are missing
      • "drop": Remove tasks where any model failed
      • "impute": Fill missing results with baseline_model scores
  • baseline_model (str, default "seasonal_naive"): Model name to use for relative error computation
  • min_relative_error (float | None, default 1e-2): Lower bound for clipping relative errors w.r.t. the baseline_model
  • max_relative_error (float | None, default 100.0): Upper bound for clipping relative errors w.r.t. the baseline_model
  • included_models (list[str] | None, default None): Models to include (mutually exclusive with excluded_models)
  • excluded_models (list[str] | None, default None): Models to exclude (mutually exclusive with included_models)
  • n_resamples (int | None, default None): Number of bootstrap samples for confidence intervals. If None, confidence intervals are not computed
  • seed (int, default 123): Random seed for reproducible bootstrap sampling

Returns:

pd.DataFrame: Leaderboard sorted by skill_score, with columns:

  • skill_score: Skill score (1 - geometric mean relative error)
  • skill_score_lower: Lower bound of 95% confidence interval (only if n_resamples is not None)
  • skill_score_upper: Upper bound of 95% confidence interval (only if n_resamples is not None)
  • win_rate: Fraction of pairwise comparisons won against other models
  • win_rate_lower: Lower bound of 95% confidence interval (only if n_resamples is not None)
  • win_rate_upper: Upper bound of 95% confidence interval (only if n_resamples is not None)
  • median_training_time_s: Median training time across tasks
  • median_inference_time_s: Median inference time across tasks
  • training_corpus_overlap: Mean fraction of tasks where model was trained on the dataset
  • num_failures: Number of tasks where the model failed
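
A typical call, assuming summaries have already been collected. The file path and model name are placeholders, and the import path follows the source location shown below:

from fev.analysis import leaderboard

lb = leaderboard(
    "results/summaries.csv",          # or a DataFrame / list of dicts
    baseline_model="seasonal_naive",
    missing_strategy="drop",          # drop tasks where any model failed
    n_resamples=1000,                 # adds 95% bootstrap confidence intervals
)
print(lb[["skill_score", "skill_score_lower", "skill_score_upper", "win_rate"]])
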
Source code in src/fev/analysis.py
def leaderboard(
    summaries: SummaryType | list[SummaryType],
    metric_column: str = "test_error",
    missing_strategy: Literal["error", "drop", "impute"] = "error",
    baseline_model: str = "seasonal_naive",
    min_relative_error: float | None = 1e-2,
    max_relative_error: float | None = 100.0,
    included_models: list[str] | None = None,
    excluded_models: list[str] | None = None,
    n_resamples: int | None = None,
    seed: int = 123,
):
    """Generate a leaderboard with aggregate performance metrics for all models.

    Computes skill score (1 - geometric mean relative error) and win rate with bootstrap confidence
    intervals across all tasks. Models are ranked by skill score.

    Parameters
    ----------
    summaries : SummaryType | list[SummaryType]
        Evaluation summaries as DataFrame, list of dicts, or file path(s)
    metric_column : str, default "test_error"
        Column name containing the metric to evaluate
    baseline_model : str, default "seasonal_naive"
        Model name to use for relative error computation
    missing_strategy : Literal["error", "drop", "impute"], default "error"
        How to handle missing results:

        - `"error"`: Raise error if any results are missing
        - `"drop"`: Remove tasks where any model failed
        - `"impute"`: Fill missing results with `baseline_model` scores
    min_relative_error : float, default 1e-2
        Lower bound for clipping relative errors w.r.t. the `baseline_model`
    max_relative_error : float, default 100
        Upper bound for clipping relative errors w.r.t. the `baseline_model`
    included_models : list[str], optional
        Models to include (mutually exclusive with `excluded_models`)
    excluded_models : list[str], optional
        Models to exclude (mutually exclusive with `included_models`)
    n_resamples : int | None, default None
        Number of bootstrap samples for confidence intervals. If None, confidence intervals are not computed
    seed : int, default 123
        Random seed for reproducible bootstrap sampling

    Returns
    -------
    pd.DataFrame
        Leaderboard sorted by `skill_score`, with columns:

        - `skill_score`: Skill score (1 - geometric mean relative error)
        - `skill_score_lower`: Lower bound of 95% confidence interval (only if n_resamples is not None)
        - `skill_score_upper`: Upper bound of 95% confidence interval (only if n_resamples is not None)
        - `win_rate`: Fraction of pairwise comparisons won against other models
        - `win_rate_lower`: Lower bound of 95% confidence interval (only if n_resamples is not None)
        - `win_rate_upper`: Upper bound of 95% confidence interval (only if n_resamples is not None)
        - `median_training_time_s`: Median training time across tasks
        - `median_inference_time_s`: Median inference time across tasks
        - `training_corpus_overlap`: Mean fraction of tasks where model was trained on the dataset
        - `num_failures`: Number of tasks where the model failed
    """
    summaries = _load_summaries(summaries, check_fev_version=True)
    summaries = _filter_models(summaries, included_models=included_models, excluded_models=excluded_models)
    errors_df = pivot_table(summaries, metric_column=metric_column, baseline_model=baseline_model)
    errors_df = errors_df.clip(lower=min_relative_error, upper=max_relative_error)

    num_failures_per_model = errors_df.isna().sum()
    if missing_strategy == "drop":
        errors_df = errors_df.dropna()
        if len(errors_df) == 0:
            raise ValueError("All results are missing for some models.")
        print(f"{len(errors_df)} tasks left after removing failures")
    elif missing_strategy == "impute":
        # For leaderboard, baseline scores are already 1.0 after normalization, so fill with 1.0
        errors_df = errors_df.fillna(1.0)
    elif missing_strategy == "error":
        if num_failures_per_model.sum():
            raise ValueError(
                f"Summaries contain {len(errors_df)} tasks. Results are missing for the following models:"
                f"\n{num_failures_per_model[num_failures_per_model > 0]}"
            )
    else:
        raise ValueError(f"Invalid {missing_strategy=}, expected one of ['error', 'drop', 'impute']")
    bootstrap_resamples = 1 if n_resamples is None else n_resamples
    win_rate, win_rate_lower, win_rate_upper = bootstrap(
        errors_df.to_numpy(), statistic=_win_rate, n_resamples=bootstrap_resamples, seed=seed
    )
    skill_score, skill_score_lower, skill_score_upper = bootstrap(
        errors_df.to_numpy(), statistic=_skill_score, n_resamples=bootstrap_resamples, seed=seed
    )

    training_time_df = pivot_table(summaries, metric_column="training_time_s")
    inference_time_df = pivot_table(summaries, metric_column="inference_time_s")
    training_corpus_overlap_df = pivot_table(summaries, metric_column="trained_on_this_dataset")

    result_df = pd.DataFrame(
        {
            "skill_score": skill_score,
            "skill_score_lower": skill_score_lower,
            "skill_score_upper": skill_score_upper,
            "win_rate": win_rate,
            "win_rate_lower": win_rate_lower,
            "win_rate_upper": win_rate_upper,
            # Select only tasks that are also in errors_df (in case some tasks were dropped with missing_strategy="drop")
            "median_training_time_s": training_time_df.loc[errors_df.index].median(),
            "median_inference_time_s": inference_time_df.loc[errors_df.index].median(),
            "training_corpus_overlap": training_corpus_overlap_df.loc[errors_df.index].mean(),
            "num_failures": num_failures_per_model,
        },
        index=errors_df.columns,
    )
    if n_resamples is None:
        result_df = result_df.drop(
            columns=["skill_score_lower", "skill_score_upper", "win_rate_lower", "win_rate_upper"]
        )
    return result_df.sort_values(by="skill_score", ascending=False)

pairwise_comparison(summaries: SummaryType | list[SummaryType], metric_column: str = 'test_error', missing_strategy: Literal['error', 'drop', 'impute'] = 'error', baseline_model: str | None = None, min_relative_error: float | None = 0.01, max_relative_error: float | None = 100.0, included_models: list[str] | None = None, excluded_models: list[str] | None = None, n_resamples: int | None = None, seed: int = 123) -> pd.DataFrame

Compute pairwise performance comparisons between all model pairs.

For each pair of models, calculates skill score (1 - geometric mean relative error) and win rate with bootstrap confidence intervals across all tasks.

Parameters:

  • summaries (SummaryType | list[SummaryType], required): Evaluation summaries as DataFrame, list of dicts, or file path(s)
  • metric_column (str, default "test_error"): Column name containing the metric to evaluate
  • missing_strategy (Literal['error', 'drop', 'impute'], default "error"): How to handle missing results:
      • "error": Raise an error if any results are missing
      • "drop": Remove tasks where any model failed
      • "impute": Fill missing results with baseline_model scores
  • baseline_model (str | None, default None): Required only when missing_strategy="impute"
  • min_relative_error (float | None, default 1e-2): Lower bound for clipping error ratios in pairwise comparisons
  • max_relative_error (float | None, default 100.0): Upper bound for clipping error ratios in pairwise comparisons
  • included_models (list[str] | None, default None): Models to include (mutually exclusive with excluded_models)
  • excluded_models (list[str] | None, default None): Models to exclude (mutually exclusive with included_models)
  • n_resamples (int | None, default None): Number of bootstrap samples for confidence intervals. If None, confidence intervals are not computed
  • seed (int, default 123): Random seed for reproducible bootstrap sampling

Returns:

pd.DataFrame: Pairwise comparison results with pd.MultiIndex (model_1, model_2) and columns:

  • skill_score: 1 - geometric mean of model_1/model_2 error ratios
  • skill_score_lower: Lower bound of 95% confidence interval (only if n_resamples is not None)
  • skill_score_upper: Upper bound of 95% confidence interval (only if n_resamples is not None)
  • win_rate: Fraction of tasks where model_1 outperforms model_2
  • win_rate_lower: Lower bound of 95% confidence interval (only if n_resamples is not None)
  • win_rate_upper: Upper bound of 95% confidence interval (only if n_resamples is not None)
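
For example, to impute failed runs with a baseline model and inspect one head-to-head comparison (the file path and model names are placeholders):

from fev.analysis import pairwise_comparison

pairwise = pairwise_comparison(
    "results/summaries.csv",
    missing_strategy="impute",
    baseline_model="seasonal_naive",  # required when missing_strategy="impute"
    n_resamples=1000,
)
# Rows are indexed by (model_1, model_2); a positive skill_score means that
# model_1 has lower error than model_2 on (geometric) average.
print(pairwise.loc[("my_model", "seasonal_naive")])
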
Source code in src/fev/analysis.py
def pairwise_comparison(
    summaries: SummaryType | list[SummaryType],
    metric_column: str = "test_error",
    missing_strategy: Literal["error", "drop", "impute"] = "error",
    baseline_model: str | None = None,
    min_relative_error: float | None = 1e-2,
    max_relative_error: float | None = 100.0,
    included_models: list[str] | None = None,
    excluded_models: list[str] | None = None,
    n_resamples: int | None = None,
    seed: int = 123,
) -> pd.DataFrame:
    """Compute pairwise performance comparisons between all model pairs.

    For each pair of models, calculates skill score (1 - geometric mean relative error) and
    win rate with bootstrap confidence intervals across all tasks.

    Parameters
    ----------
    summaries : SummaryType | list[SummaryType]
        Evaluation summaries as DataFrame, list of dicts, or file path(s)
    metric_column : str, default "test_error"
        Column name containing the metric to evaluate
    missing_strategy : Literal["error", "drop", "impute"], default "error"
        How to handle missing results:

        - `"error"`: Raise error if any results are missing
        - `"drop"`: Remove tasks where any model failed
        - `"impute"`: Fill missing results with `baseline_model` scores
    baseline_model : str, optional
        Required only when missing_strategy="impute"
    min_relative_error : float, optional, default 1e-2
        Lower bound for clipping error ratios in pairwise comparisons
    max_relative_error : float, optional, default 100.0
        Upper bound for clipping error ratios in pairwise comparisons
    included_models : list[str], optional
        Models to include (mutually exclusive with `excluded_models`)
    excluded_models : list[str], optional
        Models to exclude (mutually exclusive with `included_models`)
    n_resamples : int | None, default None
        Number of bootstrap samples for confidence intervals. If None, confidence intervals are not computed
    seed : int, default 123
        Random seed for reproducible bootstrap sampling

    Returns
    -------
    pd.DataFrame
        Pairwise comparison results with `pd.MultiIndex` `(model_1, model_2)` and columns:

        - `skill_score`: 1 - geometric mean of `model_1/model_2` error ratios
        - `skill_score_lower`: Lower bound of 95% confidence interval (only if n_resamples is not None)
        - `skill_score_upper`: Upper bound of 95% confidence interval (only if n_resamples is not None)
        - `win_rate`: Fraction of tasks where `model_1` outperforms `model_2`
        - `win_rate_lower`: Lower bound of 95% confidence interval (only if n_resamples is not None)
        - `win_rate_upper`: Upper bound of 95% confidence interval (only if n_resamples is not None)
    """
    summaries = _load_summaries(summaries, check_fev_version=True)
    summaries = _filter_models(summaries, included_models=included_models, excluded_models=excluded_models)
    errors_df = pivot_table(summaries, metric_column=metric_column)
    num_failures_per_model = errors_df.isna().sum()

    if missing_strategy == "drop":
        errors_df = errors_df.dropna()
        if len(errors_df) == 0:
            raise ValueError("All results are missing for some models.")
        print(f"{len(errors_df)} tasks left after removing failures")
    elif missing_strategy == "impute":
        if baseline_model is None:
            raise ValueError("baseline_model is required when missing_strategy='impute'")
        if baseline_model not in errors_df.columns:
            raise ValueError(
                f"baseline_model '{baseline_model}' is missing. Available models: {errors_df.columns.to_list()}"
            )
        for col in errors_df.columns:
            if col != baseline_model:
                errors_df[col] = errors_df[col].fillna(errors_df[baseline_model])
    elif missing_strategy == "error":
        if num_failures_per_model.sum():
            raise ValueError(
                f"Summaries contain {len(errors_df)} tasks. Results are missing for the following models:"
                f"\n{num_failures_per_model[num_failures_per_model > 0]}"
            )
    else:
        raise ValueError(f"Invalid {missing_strategy=}, expected one of ['error', 'drop', 'impute']")
    model_order = errors_df.rank(axis=1).mean().sort_values().index
    errors_df = errors_df[model_order]

    bootstrap_resamples = 1 if n_resamples is None else n_resamples
    skill_score, skill_score_lower, skill_score_upper = bootstrap(
        errors_df.to_numpy(),
        statistic=lambda x: _pairwise_skill_score(x, min_relative_error, max_relative_error),
        n_resamples=bootstrap_resamples,
        seed=seed,
    )
    win_rate, win_rate_lower, win_rate_upper = bootstrap(
        errors_df.to_numpy(),
        statistic=_pairwise_win_rate,
        n_resamples=bootstrap_resamples,
        seed=seed,
    )

    result_df = pd.DataFrame(
        {
            "skill_score": skill_score.flatten(),
            "skill_score_lower": skill_score_lower.flatten(),
            "skill_score_upper": skill_score_upper.flatten(),
            "win_rate": win_rate.flatten(),
            "win_rate_lower": win_rate_lower.flatten(),
            "win_rate_upper": win_rate_upper.flatten(),
        },
        index=pd.MultiIndex.from_product([errors_df.columns, errors_df.columns], names=["model_1", "model_2"]),
    )
    if n_resamples is None:
        result_df = result_df.drop(
            columns=["skill_score_lower", "skill_score_upper", "win_rate_lower", "win_rate_upper"]
        )
    return result_df

pivot_table(summaries: SummaryType | list[SummaryType], metric_column: str = 'test_error', task_columns: str | list[str] = TASK_DEF_COLUMNS.copy(), baseline_model: str | None = None, check_fev_version: bool = False) -> pd.DataFrame

Convert evaluation summaries into a pivot table for analysis.

Creates a matrix where rows represent tasks and columns represent models, with each cell containing the specified metric value. Optionally normalizes all scores relative to a baseline model.

Parameters:

  • summaries (SummaryType | list[SummaryType], required): Evaluation summaries as DataFrame, list of dicts, or file path(s)
  • metric_column (str, default "test_error"): Column name containing the metric to pivot
  • task_columns (str | list[str], default TASK_DEF_COLUMNS): Column(s) defining unique tasks. Used as the pivot table index
  • baseline_model (str | None, default None): If provided, divide all scores by this model's scores to get relative performance
  • check_fev_version (bool, default False): If True, check that fev_version in the summary is >= LAST_BREAKING_VERSION

Returns:

pd.DataFrame: Pivot table with task combinations as index and model names as columns. Values are raw scores or relative scores (if baseline_model is specified)

Raises:

ValueError: If duplicate model/task combinations exist, or if results for baseline_model are missing when baseline_model is provided.

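For example, to build both a raw error matrix and one normalized by a baseline model (the file path and model names are placeholders):

from fev.analysis import pivot_table

# Raw test errors: one row per task, one column per model
errors = pivot_table("results/summaries.csv", metric_column="test_error")

# Errors relative to a baseline model; the baseline's column becomes all ones
relative_errors = pivot_table(
    "results/summaries.csv",
    metric_column="test_error",
    baseline_model="seasonal_naive",
)
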
Source code in src/fev/analysis.py
def pivot_table(
    summaries: SummaryType | list[SummaryType],
    metric_column: str = "test_error",
    task_columns: str | list[str] = TASK_DEF_COLUMNS.copy(),
    baseline_model: str | None = None,
    check_fev_version: bool = False,
) -> pd.DataFrame:
    """Convert evaluation summaries into a pivot table for analysis.

    Creates a matrix where rows represent tasks and columns represent models, with each
    cell containing the specified metric value. Optionally normalizes all scores relative
    to a baseline model.

    Parameters
    ----------
    summaries : SummaryType | list[SummaryType]
        Evaluation summaries as DataFrame, list of dicts, or file path(s)
    metric_column : str, default "test_error"
        Column name containing the metric to pivot
    task_columns : str | list[str], default TASK_DEF_COLUMNS
        Column(s) defining unique tasks. Used as the pivot table index
    baseline_model : str, optional
        If provided, divide all scores by this model's scores to get relative performance
    check_fev_version : bool, default False
        If True, check that fev_version in the summary is >= LAST_BREAKING_VERSION.

    Returns
    -------
    pd.DataFrame
        Pivot table with task combinations as index and model names as columns.
        Values are raw scores or relative scores (if `baseline_model` specified)

    Raises
    ------
    ValueError
        If duplicate model/task combinations exist, or results for `baseline_model` are missing when `baseline_model`
        is provided.
    """
    summaries = _load_summaries(summaries, check_fev_version=check_fev_version).astype({metric_column: "float64"})

    if isinstance(task_columns, str):
        task_columns = [task_columns]
    metric_with_index = summaries.set_index(task_columns + [MODEL_COLUMN])[metric_column]
    duplicates = metric_with_index.index.duplicated()
    if duplicates.any():
        duplicate_indices = metric_with_index.index[duplicates]
        raise ValueError(
            f"Cannot unstack: duplicate index combinations found. First duplicates: {duplicate_indices[:5].tolist()}"
        )
    pivot_df = metric_with_index.unstack()
    if baseline_model is not None:
        if baseline_model not in pivot_df.columns:
            raise ValueError(
                f"baseline_model '{baseline_model}' not found. Available models: {pivot_df.columns.tolist()}"
            )
        pivot_df = pivot_df.divide(pivot_df[baseline_model], axis=0)
        if num_baseline_failures := pivot_df[baseline_model].isna().sum():
            raise ValueError(
                f"Results for baseline_model '{baseline_model}' are missing for "
                f"{num_baseline_failures} out of {len(pivot_df)} tasks."
            )
    return pivot_df