# Dataset Format
This notebook answers the following questions:

- What dataset format does `fev` expect?
- How is this format different from other popular time series data formats?
- How to convert my dataset into a format expected by `fev`?
For information on how to convert a `datasets.Dataset` into other popular time series data formats, see the notebook 04-models.ipynb.
```python
import warnings

import datasets

warnings.simplefilter("ignore")
datasets.disable_progress_bars()
```
## What dataset format does `fev` expect?
We store time series datasets using the Hugging Face `datasets` library.

We assume that all time series datasets obey the following schema:
- each dataset entry (= row) represents a single (univariate/multivariate) time series
- each entry contains
    - 1/ a field of type `Sequence(timestamp)` that contains the timestamps of observations
    - 2/ at least one field of type `Sequence(float)` that can be used as the target time series
    - 3/ a field of type `string` that contains the unique ID of each time series
- all fields of type `Sequence` have the same length
A few notes about the above schema:

- The ID, timestamp and target fields can have arbitrary names. These names can be specified when creating an `fev.Task` object.
- In addition to the required fields above, the dataset can contain arbitrary other fields, such as
    - extra dynamic columns of type `Sequence`
    - static features of type `Value` or `Image`
- The dataset itself contains no information about the forecasting task. For example, the dataset does not say which dynamic columns should be used as the target column or as exogenous features, or which columns are known only in the past. This design makes it easy to re-use the same dataset across multiple different tasks without data duplication.
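To make the schema concrete, here is a minimal sketch of how such a dataset could be constructed by hand with the `datasets` library. The column names (`id`, `timestamp`, `target`, `city`) and values are arbitrary choices for illustration:

```python
from datetime import datetime

import datasets

# Two univariate series with one static feature ("city").
ds_example = datasets.Dataset.from_dict(
    {
        "id": ["T1", "T2"],
        "timestamp": [
            [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
            [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
        ],
        "target": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "city": ["Beijing", "Shanghai"],
    },
    features=datasets.Features(
        {
            "id": datasets.Value("string"),
            "timestamp": datasets.Sequence(datasets.Value("timestamp[ms]")),
            "target": datasets.Sequence(datasets.Value("float64")),
            "city": datasets.Value("string"),
        }
    ),
)
```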
Here is an example of such a dataset, taken from https://huggingface.co/datasets/autogluon/chronos_datasets.
```python
ds = datasets.load_dataset("autogluon/chronos_datasets", "monash_kdd_cup_2018", split="train")
ds.set_format("numpy")
ds
```

```
Dataset({
    features: ['id', 'timestamp', 'target', 'city', 'station', 'measurement'],
    num_rows: 270
})
```
Each entry corresponds to a single time series:

```python
ds[0]
```

```
{'id': np.str_('T000000'),
 'timestamp': array(['2017-01-01T14:00:00.000', '2017-01-01T15:00:00.000',
        '2017-01-01T16:00:00.000', ..., '2018-03-31T13:00:00.000',
        '2018-03-31T14:00:00.000', '2018-03-31T15:00:00.000'],
       dtype='datetime64[ms]'),
 'target': array([453., 417., 395., ..., 132., 158., 118.], dtype=float32),
 'city': np.str_('Beijing'),
 'station': np.str_('aotizhongxin_aq'),
 'measurement': np.str_('PM2.5')}
```
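Since each field of an entry is just an array, a single entry is easy to work with. For example, here is a small sketch that turns the entry above into a `pandas.Series` for inspection or plotting:

```python
import pandas as pd

entry = ds[0]
# Build a timestamp-indexed series from the "timestamp" and "target" arrays.
series = pd.Series(
    entry["target"],
    index=pd.DatetimeIndex(entry["timestamp"]),
    name=str(entry["id"]),
)
series.head()
```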
The `datasets` library conveniently stores metadata about the different features of the dataset.

```python
ds.features
```

```
{'id': Value(dtype='string', id=None),
 'timestamp': Sequence(feature=Value(dtype='timestamp[ms]', id=None), length=-1, id=None),
 'target': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'city': Value(dtype='string', id=None),
 'station': Value(dtype='string', id=None),
 'measurement': Value(dtype='string', id=None)}
```
## What are the advantages of the "fev format" compared to other common formats?
We find the above dataset format ("fev format") more convenient and practical than other popular formats for storing time series data.

A long-format data frame is quite common for storing time series data: it is human-readable and widely used by practitioners.
| item_id  | timestamp  | scaled_price | promotion_email | promotion_homepage | unit_sales | product_code | product_category | product_subcategory | location_code |
|----------|------------|--------------|-----------------|--------------------|------------|--------------|------------------|---------------------|---------------|
| 1062_101 | 2018-01-01 | 0.87913      | 0               | 0                  | 636        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1062_101 | 2018-01-08 | 0.994517     | 0               | 0                  | 123        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1062_101 | 2018-01-15 | 1.00551      | 0               | 0                  | 391        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1062_101 | 2018-01-22 | 1            | 0               | 0                  | 339        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| ...      | ...        | ...          | ...             | ...                | ...        | ...          | ...              | ...                 | ...           |
The long-format data frame has two main limitations compared to the "fev format":

- Static features either need to be unnecessarily duplicated for each row, or need to be stored in a separate file.
    - This becomes especially problematic if static features contain information such as images or text documents.
- Dealing with large datasets is challenging.
    - Obtaining individual time series requires an expensive `groupby` operation (see the sketch after this list).
    - When sharding, we need custom logic to ensure that rows corresponding to the same `item_id` are kept in the same shard.
    - We either constantly need to ensure that the rows are ordered chronologically, or need to sort the rows each time the data is used.
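For example, simply iterating over the individual series in a long-format table requires sorting and grouping the entire data frame. A small sketch, using the same grocery sales CSV that appears later in this notebook:

```python
import pandas as pd

df = pd.read_csv("https://autogluon.s3.us-west-2.amazonaws.com/datasets/timeseries/grocery_sales/merged.csv")

# Accessing individual series means scanning and grouping the whole table,
# and we must remember to sort by timestamp ourselves.
series_by_id = {
    item_id: ts for item_id, ts in df.sort_values("timestamp").groupby("item_id")
}
```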
In contrast, the "fev format" can easily distinguish between static & dynamic features using the `datasets.Features` metadata. Since one time series corresponds to a single row, it has no problems with sharding.
The GluonTS format is another popular choice for storing time series data (e.g., used in LOTSA). Each entry is encoded as a dictionary with a pre-defined schema shared across all datasets:
```python
{
    "start": "2024-01-01",
    "freq": "1D",
    "target": [0.5, 1.2, ...],
    "feat_dynamic_real": [[...]],
    "past_feat_dynamic_real": [[...]],
    "feat_static_cat": [...],
    "feat_static_real": [...],
    ...,
}
```
This format is efficient and can be immediately consumed by some ML models. However, it also has some drawbacks compared to the "fev format":

- It hard-codes the forecasting task definition into the dataset (i.e., which columns are used as the target, which columns are known in the future vs. only in the past). This often leads to data duplication.
    - For example, consider a dataset that contains energy demand & weather time series for some region. If you want to evaluate a model in 3 settings (weather forecast is available for the future; weather is known only in the past; weather is ignored, only historic demand is available), you will need to create 3 copies of the dataset.
- It only supports numeric data, so it's not future-proof.
    - Incorporating multimodal data such as images or text into time series forecasting tasks is becoming popular. The GluonTS format cannot natively handle that.
- It relies on pandas `freq` aliases staying consistent over time, which is something that we cannot take for granted.
The "fev format" does not hard-code the task properties, natively deals with multimodal data and is not tied to the pandas versions.
## How to convert my dataset into a format expected by `fev`?
If your dataset is stored in a long-format data frame, you can convert it into an fev-compatible `datasets.Dataset` object using a helper function:
```python
import pandas as pd

import fev.utils

df = pd.read_csv("https://autogluon.s3.us-west-2.amazonaws.com/datasets/timeseries/grocery_sales/merged.csv")
df.head()
```
|   | item_id  | timestamp  | scaled_price | promotion_email | promotion_homepage | unit_sales | product_code | product_category | product_subcategory | location_code |
|---|----------|------------|--------------|-----------------|--------------------|------------|--------------|------------------|---------------------|---------------|
| 0 | 1062_101 | 2018-01-01 | 0.879130     | 0.0             | 0.0                | 636.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1 | 1062_101 | 2018-01-08 | 0.994517     | 0.0             | 0.0                | 123.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 2 | 1062_101 | 2018-01-15 | 1.005513     | 0.0             | 0.0                | 391.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 3 | 1062_101 | 2018-01-22 | 1.000000     | 0.0             | 0.0                | 339.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 4 | 1062_101 | 2018-01-29 | 0.883309     | 0.0             | 0.0                | 661.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
```python
ds = fev.utils.convert_long_df_to_hf_dataset(
    df,
    id_column="item_id",
    static_columns=["product_code", "product_category", "product_subcategory", "location_code"],
)
ds.features
```

```
{'item_id': Value(dtype='string', id=None),
 'product_code': Value(dtype='int64', id=None),
 'product_category': Value(dtype='string', id=None),
 'product_subcategory': Value(dtype='string', id=None),
 'location_code': Value(dtype='int64', id=None),
 'timestamp': Sequence(feature=Value(dtype='timestamp[us]', id=None), length=-1, id=None),
 'scaled_price': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'promotion_email': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'promotion_homepage': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'unit_sales': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)}
```
```python
ds.with_format("numpy")[0]
```

```
{'item_id': np.str_('1062_101'),
 'product_code': np.int64(1062),
 'product_category': np.str_('Beverages'),
 'product_subcategory': np.str_('Fruit Juice Mango'),
 'location_code': np.int64(101),
 'timestamp': array(['2018-01-01T00:00:00.000000', '2018-01-08T00:00:00.000000',
        '2018-01-15T00:00:00.000000', '2018-01-22T00:00:00.000000',
        '2018-01-29T00:00:00.000000', '2018-02-05T00:00:00.000000',
        '2018-02-12T00:00:00.000000', '2018-02-19T00:00:00.000000',
        '2018-02-26T00:00:00.000000', '2018-03-05T00:00:00.000000',
        '2018-03-12T00:00:00.000000', '2018-03-19T00:00:00.000000',
        '2018-03-26T00:00:00.000000', '2018-04-02T00:00:00.000000',
        '2018-04-09T00:00:00.000000', '2018-04-16T00:00:00.000000',
        '2018-04-23T00:00:00.000000', '2018-04-30T00:00:00.000000',
        '2018-05-07T00:00:00.000000', '2018-05-14T00:00:00.000000',
        '2018-05-21T00:00:00.000000', '2018-05-28T00:00:00.000000',
        '2018-06-04T00:00:00.000000', '2018-06-11T00:00:00.000000',
        '2018-06-18T00:00:00.000000', '2018-06-25T00:00:00.000000',
        '2018-07-02T00:00:00.000000', '2018-07-09T00:00:00.000000',
        '2018-07-16T00:00:00.000000', '2018-07-23T00:00:00.000000',
        '2018-07-30T00:00:00.000000'], dtype='datetime64[us]'),
 'scaled_price': array([0.8791298 , 0.99451727, 1.005513  , 1.        , 0.88330877,
        0.8728938 , 0.8780195 , 0.8884807 , 0.9889777 , 1.0055426 ,
        0.98920846, 1.0054836 , 1.        , 1.        , 1.011026  ,
        0.9945471 , 0.99454623, 1.        , 0.99451727, 1.        ,
        1.        , 0.9945471 , 1.011026  , 1.0054251 , 1.0054537 ,
        1.        , 1.005513  , 1.        , 1.        , 1.0123464 ,
        1.006248  ], dtype=float32),
 'promotion_email': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=float32),
 'promotion_homepage': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=float32),
 'unit_sales': array([636., 123., 391., 339., 661., 513., 555., 485., 339., 230., 202.,
        420., 418., 581., 472., 230., 176., 242., 270., 285., 258., 285.,
        377., 339., 310., 231., 393., 447., 486., 284., 392.],
       dtype=float32)}
```
To verify that the dataset was converted correctly, use the `fev.utils.validate_time_series_dataset` function.

```python
fev.utils.validate_time_series_dataset(ds, id_column="item_id", timestamp_column="timestamp")
```
You can save the dataset to disk as a parquet file:

```python
# ds.to_parquet(DATASET_PATH)
```

Or directly push it to the HF Hub:

```python
# ds.push_to_hub(repo_id=YOUR_REPO_ID, config_name=CONFIG_NAME)
```
You can then use the path to your dataset when creating a `fev.Task`.
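For example, a sketch of a task backed by the dataset converted above. `DATASET_PATH` refers to the parquet file saved earlier, the column arguments mirror the ones used during conversion, and the exact `fev.Task` parameter names should be double-checked against the `fev` documentation:

```python
import fev

task = fev.Task(
    dataset_path=DATASET_PATH,  # or YOUR_REPO_ID if the dataset was pushed to the Hub
    id_column="item_id",
    timestamp_column="timestamp",
    target_column="unit_sales",
    horizon=8,  # forecast 8 steps (here: weeks) ahead
)
```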