# vLLMSampler

vLLMSampler uses the vLLM engine for efficient inference, supporting high-throughput batch sampling.

## Usage Example

```python
from twinkle.sampler import vLLMSampler
from twinkle.data_format import SamplingParams
from twinkle import DeviceMesh

# Create sampler
sampler = vLLMSampler(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=2, tp_size=2),
    remote_group='sampler_group'
)

# Add LoRA
sampler.add_adapter_to_model('my_lora', 'path/to/lora')

# Set sampling parameters
params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    top_k=50
)

# Perform sampling
responses = sampler.sample(
    trajectories,
    sampling_params=params,
    adapter_name='my_lora',
    num_samples=4  # Generate 4 samples per prompt
)
```

## Features

- **High Performance**: Achieves high throughput using PagedAttention and continuous batching
- **LoRA Support**: Supports dynamic loading and switching of LoRA adapters
- **Multi-Sample Generation**: Can generate multiple samples per prompt
- **Tensor Parallel**: Supports tensor parallelism to accelerate large model inference

## Remote Execution

vLLMSampler supports the `@remote_class` decorator and can run in Ray clusters:

```python
import twinkle
from twinkle import DeviceGroup, DeviceMesh
from twinkle.sampler import vLLMSampler

# Initialize Ray cluster
device_groups = [
    DeviceGroup(name='sampler', ranks=4, device_type='cuda')
]
twinkle.initialize('ray', groups=device_groups)

# Create remote sampler
sampler = vLLMSampler(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=4),
    remote_group='sampler'
)

# sample method executes in remote worker
responses = sampler.sample(trajectories, sampling_params=params)
```

## Environment Variables

- `TWINKLE_VLLM_IPC_TIMEOUT_S`: Controls the timeout (in seconds) for the IPC channel (ZMQ REQ/REP) between `vLLMSampler` and the vLLM worker extension. Default is `300`. This value must be greater than `0`.

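
The rules stated for `TWINKLE_VLLM_IPC_TIMEOUT_S` (default `300`, must be greater than `0`) can be illustrated with a small standalone sketch. This is not Twinkle's actual parsing code: the helper name `read_ipc_timeout` is hypothetical, and treating the value as a float and an empty string as "unset" are assumptions made for illustration.

```python
import math
import os

# Default stated in the documentation for TWINKLE_VLLM_IPC_TIMEOUT_S.
DEFAULT_IPC_TIMEOUT_S = 300.0


def read_ipc_timeout(env=None):
    """Hypothetical helper: return the IPC timeout in seconds.

    Assumptions (not confirmed by the library): the value is parsed as a
    float, and an unset or empty variable falls back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get("TWINKLE_VLLM_IPC_TIMEOUT_S")
    if raw is None or raw.strip() == "":
        return DEFAULT_IPC_TIMEOUT_S
    try:
        timeout = float(raw)
    except ValueError:
        raise ValueError(
            f"TWINKLE_VLLM_IPC_TIMEOUT_S must be a number, got {raw!r}"
        ) from None
    # Reject zero, negatives, NaN, and infinity: the timeout must be > 0.
    if not math.isfinite(timeout) or timeout <= 0:
        raise ValueError(
            f"TWINKLE_VLLM_IPC_TIMEOUT_S must be greater than 0, got {raw!r}"
        )
    return timeout
```

In practice the variable would be set in the environment of the process that creates the sampler, for example `export TWINKLE_VLLM_IPC_TIMEOUT_S=600` before launch, so that slow model loads do not trip the default timeout.
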
> In RLHF training, vLLMSampler is typically deployed separately from the Actor model, on different hardware resources, so that inference and training do not interfere with each other.