vLLMSampler

vLLMSampler uses the vLLM engine for efficient inference, supporting high-throughput batch sampling.

Usage Example

from twinkle.sampler import vLLMSampler
from twinkle.data_format import SamplingParams
from twinkle import DeviceMesh

# Create sampler
sampler = vLLMSampler(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=2, tp_size=2),
    remote_group='sampler_group'
)

# Add LoRA
sampler.add_adapter_to_model('my_lora', 'path/to/lora')

# Set sampling parameters
params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    top_k=50
)

# Perform sampling
# `trajectories` is the batch of prompts to sample from, prepared elsewhere
responses = sampler.sample(
    trajectories,
    sampling_params=params,
    adapter_name='my_lora',
    num_samples=4  # Generate 4 samples per prompt
)

Features

  • High Performance: Achieves high throughput using PagedAttention and continuous batching

  • LoRA Support: Supports dynamic loading and switching of LoRA adapters

  • Multi-Sample Generation: Can generate multiple samples per prompt

  • Tensor Parallel: Supports tensor parallelism to accelerate large model inference
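
As a rough illustration of how the dp_size and tp_size values passed to DeviceMesh.from_sizes relate to device count, the sketch below assumes that the mesh occupies dp_size × tp_size devices and that tensor-parallel ranks are numbered contiguously within each data-parallel replica. Both points are assumptions made for illustration, and mesh_layout is a hypothetical helper, not part of twinkle.

```python
from itertools import product


def mesh_layout(dp_size, tp_size=1):
    """Map each (dp, tp) coordinate to a flat rank, with tp varying fastest.

    Hypothetical helper: the ordering is an assumption, not twinkle's layout.
    """
    if dp_size < 1 or tp_size < 1:
        raise ValueError("dp_size and tp_size must be positive")
    return {
        (dp, tp): dp * tp_size + tp
        for dp, tp in product(range(dp_size), range(tp_size))
    }


# Same sizes as the usage example: 2 data-parallel replicas, each sharded 2 ways.
layout = mesh_layout(dp_size=2, tp_size=2)
world_size = len(layout)

# Group flat ranks by data-parallel replica; each inner list shares one model copy.
replicas = [[layout[(dp, tp)] for tp in range(2)] for dp in range(2)]
```

Under these assumptions, each replica holds one full copy of the model split across its tensor-parallel ranks, and different replicas can serve different prompts in parallel.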

Remote Execution

vLLMSampler supports the @remote_class decorator, so it can run on workers in a Ray cluster:

import twinkle
from twinkle import DeviceGroup, DeviceMesh
from twinkle.sampler import vLLMSampler

# Initialize Ray cluster
device_groups = [
    DeviceGroup(name='sampler', ranks=4, device_type='cuda')
]
twinkle.initialize('ray', groups=device_groups)

# Create remote sampler
sampler = vLLMSampler(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=4),
    remote_group='sampler'
)

# The sample method executes on the remote workers
# (`trajectories` and `params` are defined as in the usage example above)
responses = sampler.sample(trajectories, sampling_params=params)

Environment Variables

  • TWINKLE_VLLM_IPC_TIMEOUT_S: Timeout, in seconds, for the IPC channel (ZMQ REQ/REP) between vLLMSampler and the vLLM worker extension. Defaults to 300 and must be greater than 0.
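
A minimal sketch of setting and sanity-checking this variable from Python before the sampler is created. The read_ipc_timeout helper is hypothetical and only mirrors the documented rules (default of 300, value greater than 0). How twinkle itself parses the value, for example as an integer or a float, and whether the variable must also be exported to the Ray worker environment, are not specified here.

```python
import os

ENV_NAME = "TWINKLE_VLLM_IPC_TIMEOUT_S"
DEFAULT_TIMEOUT_S = 300.0  # documented default


def read_ipc_timeout(env=None):
    """Hypothetical helper: return the timeout in seconds or raise ValueError."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None:
        return DEFAULT_TIMEOUT_S
    try:
        value = float(raw)  # assumption: fractional seconds are acceptable
    except ValueError:
        raise ValueError(f"{ENV_NAME} must be a number, got {raw!r}") from None
    if not value > 0:  # rejects zero, negative values, and NaN
        raise ValueError(f"{ENV_NAME} must be greater than 0, got {raw!r}")
    return value


# Raise the timeout to 10 minutes; do this before constructing the sampler.
os.environ[ENV_NAME] = "600"
timeout_s = read_ipc_timeout()
```

Checking the value up front surfaces a misconfigured timeout at startup rather than during a long-running sampling call.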

In RLHF training, vLLMSampler is typically deployed separately from the Actor model, on different hardware resources, so that inference and training do not interfere with each other.
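
The separation can be pictured as a partition of GPU indices into two disjoint sets. The sketch below is plain Python; the 8-GPU total, the 4/4 split, and the split_ranks helper are hypothetical, and how such rank lists map onto twinkle's DeviceGroup is not shown here.

```python
def split_ranks(total_gpus, sampler_gpus):
    """Partition GPU indices into disjoint sampler and actor groups."""
    if not 0 < sampler_gpus < total_gpus:
        raise ValueError("sampler_gpus must leave at least one GPU for each role")
    ranks = list(range(total_gpus))
    return {"sampler": ranks[:sampler_gpus], "actor": ranks[sampler_gpus:]}


# Hypothetical 8-GPU setup: the first four GPUs sample, the last four train.
groups = split_ranks(total_gpus=8, sampler_gpus=4)

# No GPU index appears in both groups, so the two roles never share a device.
overlap = set(groups["sampler"]) & set(groups["actor"])
```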