# Dataset Format
This notebook answers the following questions:

- What dataset format does `fev` expect?
- How is this format different from other popular time series data formats?
- How to convert my dataset into a format expected by `fev`?
For information on how to convert a `datasets.Dataset` into other popular time series data formats, see the notebook 04-models.ipynb.
```python
import warnings

import datasets

warnings.simplefilter("ignore")
datasets.disable_progress_bars()
```
## What dataset format does `fev` expect?
We store time series datasets using the Hugging Face `datasets` library.

We assume that all time series datasets obey the following schema:
- each dataset entry (= row) represents a single (univariate/multivariate) time series
- each entry contains
    - 1/ a field of type `Sequence(timestamp)` that contains the timestamps of observations
    - 2/ at least one field of type `Sequence(float)` that can be used as the target time series
    - 3/ a field of type `string` that contains the unique ID of each time series
- all fields of type `Sequence` have the same length
A few notes about the above schema:

- The ID, timestamp and target fields can have arbitrary names. These names can be specified when creating an `fev.Task` object.
- In addition to the required fields above, the dataset can contain arbitrary other fields, such as
    - extra dynamic columns of type `Sequence`
    - static features of type `Value` or `Image`
- The dataset itself contains no information about the forecasting task. For example, the dataset does not say which dynamic columns should be used as the target column or as exogenous features, or which columns are known only in the past. This design makes it easy to re-use the same dataset across multiple different tasks without data duplication.
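To make the schema concrete, here is a minimal sketch of how such a dataset could be constructed by hand with the `datasets` library. The column names (`id`, `timestamp`, `target`, `city`) and values are arbitrary choices for illustration:

```python
from datetime import datetime

import datasets

# Two univariate series with one static feature ("city").
ds_example = datasets.Dataset.from_dict(
    {
        "id": ["T1", "T2"],
        "timestamp": [
            [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
            [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
        ],
        "target": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "city": ["Beijing", "Shanghai"],
    },
    features=datasets.Features(
        {
            "id": datasets.Value("string"),
            "timestamp": datasets.Sequence(datasets.Value("timestamp[ms]")),
            "target": datasets.Sequence(datasets.Value("float64")),
            "city": datasets.Value("string"),
        }
    ),
)
```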
Here is an example of such a dataset, taken from https://huggingface.co/datasets/autogluon/chronos_datasets.
```python
ds = datasets.load_dataset("autogluon/chronos_datasets", "monash_kdd_cup_2018", split="train")
ds.set_format("numpy")
ds
```

```
Dataset({
    features: ['id', 'timestamp', 'target', 'city', 'station', 'measurement'],
    num_rows: 270
})
```
Each entry corresponds to a single time series:

```python
ds[0]
```

```
{'id': np.str_('T000000'),
 'timestamp': array(['2017-01-01T14:00:00.000', '2017-01-01T15:00:00.000',
        '2017-01-01T16:00:00.000', ..., '2018-03-31T13:00:00.000',
        '2018-03-31T14:00:00.000', '2018-03-31T15:00:00.000'],
       dtype='datetime64[ms]'),
 'target': array([453., 417., 395., ..., 132., 158., 118.], dtype=float32),
 'city': np.str_('Beijing'),
 'station': np.str_('aotizhongxin_aq'),
 'measurement': np.str_('PM2.5')}
```
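Since each field of an entry is just an array, a single entry is easy to work with. For example, here is a small sketch that turns the entry above into a `pandas.Series` for inspection or plotting:

```python
import pandas as pd

entry = ds[0]
# Build a timestamp-indexed series from the "timestamp" and "target" arrays.
series = pd.Series(
    entry["target"],
    index=pd.DatetimeIndex(entry["timestamp"]),
    name=str(entry["id"]),
)
series.head()
```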
The `datasets` library conveniently stores metadata about the different features of the dataset.

```python
ds.features
```

```
{'id': Value(dtype='string', id=None),
 'timestamp': Sequence(feature=Value(dtype='timestamp[ms]', id=None), length=-1, id=None),
 'target': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'city': Value(dtype='string', id=None),
 'station': Value(dtype='string', id=None),
 'measurement': Value(dtype='string', id=None)}
```
## What are the advantages of the "fev format" compared to other common formats?
We find the above dataset format ("fev format") more convenient and practical than other popular formats for storing time series data.

A long-format data frame is quite common for storing time series data: it is human-readable and widely used by practitioners.
| item_id  | timestamp  | scaled_price | promotion_email | promotion_homepage | unit_sales | product_code | product_category | product_subcategory | location_code |
|----------|------------|--------------|-----------------|--------------------|------------|--------------|------------------|---------------------|---------------|
| 1062_101 | 2018-01-01 | 0.87913      | 0               | 0                  | 636        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1062_101 | 2018-01-08 | 0.994517     | 0               | 0                  | 123        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1062_101 | 2018-01-15 | 1.00551      | 0               | 0                  | 391        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1062_101 | 2018-01-22 | 1            | 0               | 0                  | 339        | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| ...      | ...        | ...          | ...             | ...                | ...        | ...          | ...              | ...                 | ...           |
The long-format data frame has two main limitations compared to the "fev format":

- Static features either need to be unnecessarily duplicated for each row, or need to be stored in a separate file.
    - This becomes especially problematic if static features contain information such as images or text documents.
- Dealing with large datasets is challenging.
    - Obtaining individual time series requires an expensive `groupby` operation (see the sketch after this list).
    - When sharding, we need custom logic to ensure that rows corresponding to the same `item_id` are kept in the same shard.
    - We either constantly need to ensure that the rows are ordered chronologically, or need to sort the rows each time the data is used.
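For example, simply iterating over the individual series in a long-format table requires sorting and grouping the entire data frame. A small sketch, using the same grocery sales CSV that appears later in this notebook:

```python
import pandas as pd

df = pd.read_csv("https://autogluon.s3.us-west-2.amazonaws.com/datasets/timeseries/grocery_sales/merged.csv")

# Accessing individual series means scanning and grouping the whole table,
# and we must remember to sort by timestamp ourselves.
series_by_id = {
    item_id: ts for item_id, ts in df.sort_values("timestamp").groupby("item_id")
}
```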
In contrast, the "fev format" can easily distinguish between static & dynamic features using the `datasets.Features` metadata. Since one time series corresponds to a single row, it has no problems with sharding.
The GluonTS format is another popular choice for storing time series data (e.g., used in LOTSA). Each entry is encoded as a dictionary with a pre-defined schema shared across all datasets:
```python
{
    "start": "2024-01-01",
    "freq": "1D",
    "target": [0.5, 1.2, ...],
    "feat_dynamic_real": [[...]],
    "past_feat_dynamic_real": [[...]],
    "feat_static_cat": [...],
    "feat_static_real": [...],
    ...,
}
```
This format is efficient and can be immediately consumed by some ML models. However, it also has some drawbacks compared to the "fev format":

- It hard-codes the forecasting task definition into the dataset (i.e., which columns are used as the target, which columns are known in the future vs. only in the past). This often leads to data duplication.
    - For example, consider a dataset that contains energy demand & weather time series for some region. If you want to evaluate a model in 3 settings (weather forecast is available for the future; weather is known only in the past; weather is ignored, only historic demand is available), you will need to create 3 copies of the dataset.
- It only supports numeric data, so it's not future-proof.
    - Incorporating multimodal data such as images or text into time series forecasting tasks is becoming popular. The GluonTS format cannot natively handle that.
- It relies on pandas `freq` aliases staying consistent over time, which is something that we cannot take for granted.
The "fev format" does not hard-code the task properties, natively deals with multimodal data and is not tied to the pandas versions.
## How to convert my dataset into a format expected by `fev`?
If your dataset is stored in a long-format data frame, you can convert it into an fev-compatible `datasets.Dataset` object using a helper function:
```python
import pandas as pd

import fev.utils

df = pd.read_csv("https://autogluon.s3.us-west-2.amazonaws.com/datasets/timeseries/grocery_sales/merged.csv")
df.head()
```
|   | item_id  | timestamp  | scaled_price | promotion_email | promotion_homepage | unit_sales | product_code | product_category | product_subcategory | location_code |
|---|----------|------------|--------------|-----------------|--------------------|------------|--------------|------------------|---------------------|---------------|
| 0 | 1062_101 | 2018-01-01 | 0.879130     | 0.0             | 0.0                | 636.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 1 | 1062_101 | 2018-01-08 | 0.994517     | 0.0             | 0.0                | 123.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 2 | 1062_101 | 2018-01-15 | 1.005513     | 0.0             | 0.0                | 391.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 3 | 1062_101 | 2018-01-22 | 1.000000     | 0.0             | 0.0                | 339.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
| 4 | 1062_101 | 2018-01-29 | 0.883309     | 0.0             | 0.0                | 661.0      | 1062         | Beverages        | Fruit Juice Mango   | 101           |
```python
ds = fev.utils.convert_long_df_to_hf_dataset(
    df,
    id_column="item_id",
    static_columns=["product_code", "product_category", "product_subcategory", "location_code"],
)
ds.features
```

```
{'item_id': Value(dtype='string', id=None),
 'product_code': Value(dtype='int64', id=None),
 'product_category': Value(dtype='string', id=None),
 'product_subcategory': Value(dtype='string', id=None),
 'location_code': Value(dtype='int64', id=None),
 'timestamp': Sequence(feature=Value(dtype='timestamp[us]', id=None), length=-1, id=None),
 'scaled_price': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'promotion_email': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'promotion_homepage': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None),
 'unit_sales': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)}
```
```python
ds.with_format("numpy")[0]
```

```
{'item_id': np.str_('1062_101'),
 'product_code': np.int64(1062),
 'product_category': np.str_('Beverages'),
 'product_subcategory': np.str_('Fruit Juice Mango'),
 'location_code': np.int64(101),
 'timestamp': array(['2018-01-01T00:00:00.000000', '2018-01-08T00:00:00.000000',
        '2018-01-15T00:00:00.000000', '2018-01-22T00:00:00.000000',
        '2018-01-29T00:00:00.000000', '2018-02-05T00:00:00.000000',
        '2018-02-12T00:00:00.000000', '2018-02-19T00:00:00.000000',
        '2018-02-26T00:00:00.000000', '2018-03-05T00:00:00.000000',
        '2018-03-12T00:00:00.000000', '2018-03-19T00:00:00.000000',
        '2018-03-26T00:00:00.000000', '2018-04-02T00:00:00.000000',
        '2018-04-09T00:00:00.000000', '2018-04-16T00:00:00.000000',
        '2018-04-23T00:00:00.000000', '2018-04-30T00:00:00.000000',
        '2018-05-07T00:00:00.000000', '2018-05-14T00:00:00.000000',
        '2018-05-21T00:00:00.000000', '2018-05-28T00:00:00.000000',
        '2018-06-04T00:00:00.000000', '2018-06-11T00:00:00.000000',
        '2018-06-18T00:00:00.000000', '2018-06-25T00:00:00.000000',
        '2018-07-02T00:00:00.000000', '2018-07-09T00:00:00.000000',
        '2018-07-16T00:00:00.000000', '2018-07-23T00:00:00.000000',
        '2018-07-30T00:00:00.000000'], dtype='datetime64[us]'),
 'scaled_price': array([0.8791298 , 0.99451727, 1.005513  , 1.        , 0.88330877,
        0.8728938 , 0.8780195 , 0.8884807 , 0.9889777 , 1.0055426 ,
        0.98920846, 1.0054836 , 1.        , 1.        , 1.011026  ,
        0.9945471 , 0.99454623, 1.        , 0.99451727, 1.        ,
        1.        , 0.9945471 , 1.011026  , 1.0054251 , 1.0054537 ,
        1.        , 1.005513  , 1.        , 1.        , 1.0123464 ,
        1.006248  ], dtype=float32),
 'promotion_email': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=float32),
 'promotion_homepage': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=float32),
 'unit_sales': array([636., 123., 391., 339., 661., 513., 555., 485., 339., 230., 202.,
        420., 418., 581., 472., 230., 176., 242., 270., 285., 258., 285.,
        377., 339., 310., 231., 393., 447., 486., 284., 392.],
       dtype=float32)}
```
To verify that the dataset was converted correctly, use the `fev.utils.validate_time_series_dataset` function.

```python
fev.utils.validate_time_series_dataset(ds, id_column="item_id", timestamp_column="timestamp")
```
You can save the dataset to disk as a parquet file:

```python
# ds.to_parquet(DATASET_PATH)
```

Or directly push it to the HF Hub:

```python
# ds.push_to_hub(repo_id=YOUR_REPO_ID, config_name=CONFIG_NAME)
```
You can then use the path to your dataset when creating a `fev.Task`.
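For example, a sketch of a task backed by the dataset converted above. `DATASET_PATH` refers to the parquet file saved earlier, the column arguments mirror the ones used during conversion, and the exact `fev.Task` parameter names should be double-checked against the `fev` documentation:

```python
import fev

task = fev.Task(
    dataset_path=DATASET_PATH,  # or YOUR_REPO_ID if the dataset was pushed to the Hub
    id_column="item_id",
    timestamp_column="timestamp",
    target_column="unit_sales",
    horizon=8,  # forecast 8 steps (here: weeks) ahead
)
```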