Fixed-Length Packing Dataset
Packing datasets concatenate variable-length data up to a specified length. For example:
The dataset contains four samples, each of length 5, and the Template component’s max_length is 10. The packing dataset pre-fetches the data and concatenates it into two samples of length 10.
ABCDE
FGHIJ
KLMNO
PQRST
will be converted to:
ABCDEFGHIJ
KLMNOPQRST
Note that this concatenation occurs after encode, i.e., it operates on the actual model input length. During this process, the dataset performs the following operations:
Fetch as many samples as the buffer length specifies
Encode these samples
Based on the length of each sample, use an automatic packing algorithm to find an optimal solution that minimizes the number of batches and brings the length of each packed sample as close as possible to max_length
Add a position_ids field to distinguish different samples
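The packing step above can be pictured with a small sketch. This is not the library's implementation: it assumes a first-fit-decreasing heuristic, and the function name pack_lengths is hypothetical. It only groups sample indices into bins whose total length does not exceed max_length.

```python
from typing import List


def pack_lengths(lengths: List[int], max_length: int) -> List[List[int]]:
    """Group sample indices into bins with total length <= max_length.

    Hypothetical helper using first-fit decreasing, a common heuristic;
    the dataset's real packing algorithm may differ.
    """
    # Visit samples from longest to shortest (ties keep their original order).
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins: List[List[int]] = []  # sample indices placed in each bin
    free: List[int] = []        # remaining capacity of each bin
    for i in order:
        if lengths[i] > max_length:
            raise ValueError(f"sample {i} is longer than max_length")
        for b, room in enumerate(free):
            if lengths[i] <= room:
                # First bin with enough room takes the sample.
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(max_length - lengths[i])
    return bins


print(pack_lengths([5, 5, 5, 5], max_length=10))  # [[0, 1], [2, 3]]
```

With the four length-5 samples from the example, the sketch produces two bins of two samples each, matching the two packed samples of length 10.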
The final data format is similar to:
{
"input_ids": [1,2,3,4,5,6,7,8,9,10],
"position_ids": [0,1,2,3,4,0,1,2,3,4],
...
}
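As an illustration of this format, here is a minimal sketch of how already-encoded samples could be joined into one packed sample. The function name concat_with_position_ids is hypothetical, and only input_ids and position_ids are produced; the real dataset output contains further fields.

```python
from typing import Dict, List


def concat_with_position_ids(samples: List[List[int]]) -> Dict[str, List[int]]:
    """Concatenate encoded samples and restart position_ids at 0 for each one."""
    input_ids: List[int] = []
    position_ids: List[int] = []
    for tokens in samples:
        input_ids.extend(tokens)
        # Positions count from 0 again at the start of every original sample.
        position_ids.extend(range(len(tokens)))
    return {"input_ids": input_ids, "position_ids": position_ids}


packed = concat_with_position_ids([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(packed["input_ids"])     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(packed["position_ids"])  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```

Every place where position_ids returns to 0 marks the start of a new original sample inside the packed sequence.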
Using this dataset differs from using Dataset in the following ways:
A Template must be set
After calling encode, you need to call the pack_dataset method for final packing:
dataset.pack_dataset()
This dataset also has the @remote_class decorator and can run in Ray workers.