Lazy Loading Dataset

LazyDataset is a variant of Dataset that defers expensive operations (preprocessing, encoding) to __getitem__ time, which helps avoid out-of-memory (OOM) errors on large or multimodal datasets.

Key Differences from Dataset

| Operation   | Dataset                             | LazyDataset                                                    |
|-------------|-------------------------------------|----------------------------------------------------------------|
| map         | Executes immediately on all data    | Records the operation, applies per-item in __getitem__         |
| filter      | Executes immediately                | Executes immediately (same as Dataset; index mapping required) |
| mix_dataset | Merges datasets immediately         | Records strategy, resolves indices lazily                      |
| encode      | Encodes all data immediately        | Records flag, encodes per-item in __getitem__                  |

Lazy Map

When you call map, LazyDataset records the preprocessing function instead of applying it eagerly:

from twinkle.dataset import LazyDataset, DatasetMeta

dataset = LazyDataset(DatasetMeta(dataset_id='ms://xxx/xxx'))
dataset.add_dataset(DatasetMeta(dataset_id='ms://yyy/yyy'))

# preprocess_fn_a, preprocess_fn_b and global_preprocess_fn are user-defined callables.
# Per-dataset preprocessing (before mix)
dataset.map(preprocess_fn_a, dataset_meta=DatasetMeta(dataset_id='ms://xxx/xxx'))
dataset.map(preprocess_fn_b, dataset_meta=DatasetMeta(dataset_id='ms://yyy/yyy'))

dataset.mix_dataset()

# Global preprocessing (after mix, applies to all items)
dataset.map(global_preprocess_fn)
  • Before mix: map is recorded per-dataset, so different datasets can have different preprocessing pipelines.

  • After mix: map is recorded globally and applies to all items regardless of source dataset.

  • All map operations are applied lazily in __getitem__ in the order they were registered.
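
The recording behaviour described above can be illustrated with a small stand-alone sketch. It is not twinkle's implementation: the class ToyLazyDataset, its attributes, and the two helper functions are hypothetical and only mimic the idea of storing functions in map and applying them, in registration order, inside __getitem__.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ToyLazyDataset:
    """Hypothetical stand-in for a lazily mapped dataset (not twinkle's class)."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self._ops: List[Callable[[Row], Row]] = []

    def map(self, fn: Callable[[Row], Row]) -> "ToyLazyDataset":
        # Only record the function; no row is touched here.
        self._ops.append(fn)
        return self

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> Row:
        row = dict(self._rows[idx])  # copy so the raw row stays unchanged
        for fn in self._ops:         # apply in registration order
            row = fn(row)
        return row


calls: List[str] = []  # records which functions actually ran


def add_length(row: Row) -> Row:
    calls.append("add_length")
    return {**row, "n_chars": len(row["text"])}


def upper(row: Row) -> Row:
    calls.append("upper")
    return {**row, "text": row["text"].upper()}


toy = ToyLazyDataset([{"text": "hello"}, {"text": "lazy map"}])
toy.map(add_length).map(upper)

calls_before_access = list(calls)  # still empty: map only recorded the functions
first = toy[0]                     # both functions run now, for this item only
```

Because the work happens per item, memory use in this sketch is bounded by the size of one processed row rather than by the whole processed dataset.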

Lazy Mix

mix_dataset supports two strategies:

dataset.mix_dataset(interleave=True)   # Round-robin interleaving (default)
dataset.mix_dataset(interleave=False)  # Concatenation
  • Interleave: Items cycle through datasets in round-robin order. Shorter datasets wrap around.

  • Concatenate: Items are accessed sequentially, first all of dataset A, then all of dataset B.
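
As a rough illustration of how a global index could be resolved lazily under each strategy, the sketch below maps an index to a (dataset number, local index) pair. The function resolve_index is hypothetical and not part of twinkle. How LazyDataset defines the total length under interleaving is not stated here, so the sketch only resolves indices and leaves the length to the caller.

```python
from typing import List, Tuple


def resolve_index(idx: int, lengths: List[int], interleave: bool = True) -> Tuple[int, int]:
    """Map a global index to (dataset_number, local_index).

    Hypothetical helper for illustration; `lengths` holds the size of each
    source dataset.
    """
    if interleave:
        n = len(lengths)
        source = idx % n                      # round-robin over datasets
        local = (idx // n) % lengths[source]  # shorter datasets wrap around
        return source, local

    # Concatenation: walk through the datasets in order.
    for source, length in enumerate(lengths):
        if idx < length:
            return source, idx
        idx -= length
    raise IndexError("global index out of range")


lengths = [3, 2]  # dataset A has 3 items, dataset B has 2
interleaved = [resolve_index(i, lengths) for i in range(6)]
concatenated = [resolve_index(i, lengths, interleave=False) for i in range(5)]
```

With these lengths, the sixth interleaved index wraps back to the first item of dataset B, while concatenation visits the three items of A and then the two items of B.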

Lazy Encode

Calling encode only marks the dataset for encoding. The actual template.encode() call happens inside __getitem__:

dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)
dataset.encode()

Note: truncation_strategy='split' is not supported in LazyDataset, because splitting may produce multiple outputs from a single item, whereas __getitem__ returns exactly one item per index.
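
The one-to-many problem can be seen in a toy example. It assumes, purely for illustration, that a split strategy chops an over-long token sequence into chunks of at most MAX_LENGTH tokens; the function encode_with_split and the constant are hypothetical and do not reflect twinkle's template code.

```python
from typing import List

MAX_LENGTH = 4  # hypothetical token budget per encoded sample


def encode_with_split(tokens: List[int]) -> List[List[int]]:
    """Toy 'split' truncation: one input may yield several output chunks."""
    return [tokens[i:i + MAX_LENGTH] for i in range(0, len(tokens), MAX_LENGTH)]


raw_items = [
    [1, 2, 3],                    # fits in one chunk
    [1, 2, 3, 4, 5, 6, 7, 8, 9],  # needs three chunks
]

outputs_per_item = [len(encode_with_split(t)) for t in raw_items]
total_outputs = sum(outputs_per_item)
```

Here two raw items become four encoded samples, and that count is only known after every item has been encoded, which is at odds with encoding lazily one index at a time.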

Eager Filter

Unlike other operations, filter executes immediately because it needs to build the index mapping of valid items upfront:

dataset.filter(filter_fn, dataset_meta=DatasetMeta(dataset_id='ms://xxx/xxx'))
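
A minimal sketch of why an eager pass is needed: the filtered length and the position of each surviving item are only known once the predicate has been evaluated on every row. The class ToyFilteredView is hypothetical and is not how twinkle necessarily stores its mapping.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ToyFilteredView:
    """Hypothetical filtered view that keeps a list of valid raw indices."""

    def __init__(self, rows: List[Row], keep: Callable[[Row], bool]) -> None:
        self._rows = rows
        # Eager pass: evaluate the predicate once per row and remember
        # which raw indices survive.
        self._valid = [i for i, row in enumerate(rows) if keep(row)]

    def __len__(self) -> int:
        return len(self._valid)

    def __getitem__(self, idx: int) -> Row:
        # Translate the filtered index back to the raw index.
        return self._rows[self._valid[idx]]


rows = [{"text": ""}, {"text": "a"}, {"text": ""}, {"text": "b"}]
view = ToyFilteredView(rows, keep=lambda row: bool(row["text"]))
```

Only the integer index list is materialized up front; the rows themselves can still be processed lazily when accessed.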

Remote Execution

LazyDataset is decorated with @remote_class and can run in Ray workers, just like Dataset.