Fixed-Length Packing Dataset

A packing dataset concatenates variable-length samples into sequences of a specified length. For example:

Suppose the dataset contains 4 samples of length 5 each, and the Template component’s max_length is 10. The packing dataset pre-fetches the samples and concatenates them into 2 samples of length 10.

ABCDE
FGHIJ
KLMNO
PQRST

These samples are converted to:

ABCDEFGHIJ
KLMNOPQRST

Note that this concatenation happens after encoding, i.e., it operates on the actual model input length. During this process, the dataset performs the following operations:

  1. Fetch a buffer's worth of samples (as many as the buffer length)

  2. Encode these samples

  3. Run an automatic packing algorithm over the sample lengths to find a grouping that minimizes the number of batches and brings the length of each packed sample as close to max_length as possible

  4. Add a position_ids field to distinguish different samples; the positions restart from 0 at the beginning of each original sample.
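
The steps above can be illustrated with a minimal sketch. It is not the library's implementation: the function name pack_samples is hypothetical, and a simple first-fit-decreasing heuristic stands in for the automatic packing algorithm, whose actual strategy is not described here. The input is assumed to be a list of already-encoded samples (lists of token ids).

```python
from typing import Dict, List


def pack_samples(encoded: List[List[int]], max_length: int) -> List[Dict[str, List[int]]]:
    """Hypothetical helper: pack encoded samples with first-fit decreasing."""
    # Visit samples longest-first so short ones can fill the remaining gaps.
    order = sorted(range(len(encoded)), key=lambda i: len(encoded[i]), reverse=True)
    bins: List[List[int]] = []  # each bin holds indices into `encoded`
    used: List[int] = []        # number of tokens already placed in each bin
    for i in order:
        n = len(encoded[i])
        if n > max_length:
            raise ValueError(f"sample {i} has length {n} > max_length {max_length}")
        for b in range(len(bins)):
            if used[b] + n <= max_length:
                bins[b].append(i)
                used[b] += n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            used.append(n)

    packed = []
    for members in bins:
        input_ids: List[int] = []
        position_ids: List[int] = []
        for i in members:
            input_ids.extend(encoded[i])
            # Positions restart at 0 for every original sample.
            position_ids.extend(range(len(encoded[i])))
        packed.append({"input_ids": input_ids, "position_ids": position_ids})
    return packed


samples = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
packed = pack_samples(samples, max_length=10)
print(len(packed))                 # 2
print(packed[0]["position_ids"])   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```

First-fit decreasing is only a heuristic and does not guarantee the minimum number of packed samples, but it is a common, cheap approximation for this kind of bin-packing problem.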

The final data format is similar to:

{
  "input_ids": [1,2,3,4,5,6,7,8,9,10],
  "position_ids": [0,1,2,3,4,0,1,2,3,4],
  ...
}
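
To show how position_ids distinguishes the samples inside a packed sequence, here is a small sketch that splits a packed sequence back into its original samples. The helper name split_packed is hypothetical, and the sketch assumes that every original sample starts at position 0, as in the example above.

```python
from typing import List


def split_packed(input_ids: List[int], position_ids: List[int]) -> List[List[int]]:
    """Hypothetical helper: recover the original samples from a packed sequence."""
    samples: List[List[int]] = []
    for token, pos in zip(input_ids, position_ids):
        # A position of 0 marks the start of a new original sample.
        if pos == 0 or not samples:
            samples.append([])
        samples[-1].append(token)
    return samples


print(split_packed([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```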

Using this dataset differs from Dataset in the following ways:

  1. A Template must be set

  2. After calling encode, you need to call the pack_dataset method to perform the final packing:

dataset.pack_dataset()

This dataset also has the @remote_class decorator and can run in Ray workers.