Server

Ray Cluster Configuration

Before starting the Server, you must first start and configure the Ray nodes. Only after the Ray nodes are properly configured can the Server correctly allocate and occupy resources (GPU, CPU, etc.).

Starting Ray Nodes

A Ray cluster consists of multiple nodes, each of which can be configured with different resources. The startup steps are as follows:

1. Start the Head Node (First GPU Node)

# Stop existing Ray cluster (if any)
ray stop

# Start the Head node with GPU 0-3, 4 GPUs in total
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --num-gpus=4 --port=6379

2. Start Worker Nodes

# Second GPU node, using GPU 4-7, 4 GPUs in total
CUDA_VISIBLE_DEVICES=4,5,6,7 ray start --address=10.28.252.9:6379 --num-gpus=4

# CPU node (for running Processor and other CPU tasks)
ray start --address=10.28.252.9:6379 --num-gpus=0

Notes:

  • --head: Marks this node as the Head node (the primary node of the cluster)

  • --port=6379: The port the Head node listens on

  • --address=<IP>:<PORT>: The address for Worker nodes to connect to the Head node

  • --num-gpus=N: The number of GPUs available on this node

  • CUDA_VISIBLE_DEVICES: Restricts the GPU devices visible to this node

3. Complete Example: 3-Node Cluster

# Stop the old cluster and start a new one
ray stop && \
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --num-gpus=4 --port=6379 && \
CUDA_VISIBLE_DEVICES=4,5,6,7 ray start --address=10.28.252.9:6379 --num-gpus=4 && \
ray start --address=10.28.252.9:6379 --num-gpus=0

This configuration starts 3 nodes:

  • Node 0 (Head): 4 GPUs (cards 0-3)

  • Node 1 (Worker): 4 GPUs (cards 4-7)

  • Node 2 (Worker): CPU-only node

4. Set Environment Variables

Before starting the Server, you need to set the following environment variables:

export TWINKLE_TRUST_REMOTE_CODE=0       # Whether to trust remote code (security consideration)

Node Rank in YAML Configuration

In the YAML configuration file, each component needs to occupy a separate Node.

Example configuration:

applications:
  # Model service occupies GPU 0-3 (physical card numbers)
  - name: models-Qwen3.5-4B
    route_prefix: /models/Qwen/Qwen3.5-4B
    import_path: model
    args:
      nproc_per_node: 4
      device_group:
        name: model
        ranks: 4               # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 4             # Data parallel size
        # tp_size: 1           # Tensor parallel size (optional)
        # pp_size: 1           # Pipeline parallel size (optional)
        # ep_size: 1           # Expert parallel size (optional)

  # Sampler service occupies GPU 4-5 (physical card numbers)
  - name: sampler-Qwen3.5-4B
    route_prefix: /sampler/Qwen/Qwen3.5-4B
    import_path: sampler
    args:
      nproc_per_node: 2
      device_group:
        name: sampler
        ranks: 2               # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 2             # Data parallel size

  # Processor service occupies CPU
  - name: processor
    route_prefix: /processors
    import_path: processor
    args:
      ncpu_proc_per_node: 4
      device_group:
        name: processor
        ranks: 0               # CPU index
        device_type: CPU
      device_mesh:
        device_type: CPU
        dp_size: 4             # Data parallel size

Important notes:

  • The ranks configuration specifies the number of GPUs to allocate for the component

  • The device_mesh configuration uses parameters like dp_size, tp_size, pp_size, ep_size to define the parallelization strategy

  • Different components will be automatically assigned to different Nodes

  • Ray will automatically schedule to the appropriate Node based on resource requirements (num_gpus, num_cpus in ray_actor_options)

Startup Methods

The Server is uniformly launched through the launch_server function or CLI command, with YAML configuration files.

Method 1: Python Script Startup

# server.py
import os
from twinkle.server import launch_server

# Get configuration file path (server_config.yaml in the same directory as the script)
file_dir = os.path.abspath(os.path.dirname(__file__))
config_path = os.path.join(file_dir, 'server_config.yaml')

# Launch service, this call will block until the service is shut down
launch_server(config_path=config_path)

Method 2: Command Line Startup

python -m twinkle.server --config server_config.yaml

CLI supported parameters:

Parameter Description Default Value
-c, --config YAML configuration file path (required)
--namespace Ray namespace twinkle_cluster
--log-level Log level INFO

YAML Configuration Details

The configuration file defines the complete deployment plan for the Server, including HTTP listening, application components, and resource allocation. The Server simultaneously supports both Twinkle and Tinker clients through a unified configuration file.

Complete Configuration Example (Megatron Backend)

# HTTP proxy location: EveryNode means running one proxy per Ray node (recommended for multi-node scenarios)
proxy_location: EveryNode

# HTTP listening configuration
http_options:
  host: 0.0.0.0        # Listen on all network interfaces
  port: 8000            # Service port number

# Application list: Each entry defines a service component deployed on the Server
applications:

  # 1. TinkerCompatServer: Central API service
  # Handles client connections, training run tracking, checkpoint management, etc.
  # route_prefix uses /api/v1, compatible with both Tinker and Twinkle clients
  - name: server
    route_prefix: /api/v1
    import_path: server
    args:
      server_config:
        per_token_model_limit: 3      # Maximum number of models (adapters) per token (server-globally enforced)
      supported_models:
        - Qwen/Qwen3.5-4B
    deployments:
      - name: TinkerCompatServer
        max_ongoing_requests: 50
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 128
        ray_actor_options:
          num_cpus: 0.1

  # 2. Model service: Hosts the base model
  # Executes forward propagation, backward propagation and other training computations
  - name: models-Qwen3.5-4B
    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
    import_path: model
    args:
      use_megatron: true                               # Use Megatron-LM backend
      model_id: "ms://Qwen/Qwen3.5-4B"               # ModelScope model identifier
      max_length: 10240
      nproc_per_node: 2                                # Number of GPU processes per node
      device_group:                                    # Logical device group
        name: model
        ranks: 2                                       # Number of GPUs to use
        device_type: cuda
      device_mesh:                                     # Distributed training mesh
        device_type: cuda
        dp_size: 2                                     # Data parallel size
      queue_config:
        rps_limit: 100                                 # Max requests per second
        tps_limit: 10000                               # Max tokens per second per user
        max_input_tokens: 10000                        # Maximum input tokens per request
      adapter_config:
        adapter_timeout: 30                            # Idle adapter timeout unload time (seconds)
        adapter_max_lifetime: 36000                    # Maximum adapter lifetime (seconds)
      max_loras: 1                                     # Maximum number of LoRA adapters per model
    deployments:
      - name: ModelManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
          runtime_env:
            env_vars:
              TWINKLE_TRUST_REMOTE_CODE: "0"

  # 3. Sampler service: Inference sampling
  # Uses vLLM engine for inference, supports LoRA adapters
  - name: sampler-Qwen3.5-4B
    route_prefix: /api/v1/sampler/Qwen/Qwen3.5-4B
    import_path: sampler
    args:
      model_id: "ms://Qwen/Qwen3.5-4B"               # ModelScope model identifier
      nproc_per_node: 2                                # Number of GPU processes per node
      sampler_type: vllm                               # Inference engine: vllm (high performance) or torch
      engine_args:                                     # vLLM engine parameters
        max_model_len: 4096                            # Maximum sequence length
        gpu_memory_utilization: 0.5                    # GPU memory usage ratio (0.0-1.0)
        enable_lora: true                              # Support loading LoRA during inference
        logprobs_mode: processed_logprobs              # Logprobs output mode
      device_group:                                    # Logical device group
        name: sampler
        ranks: 1                                       # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 1
      queue_config:
        rps_limit: 100                                 # Max requests per second
        tps_limit: 100000                              # Max tokens per second
    deployments:
      - name: SamplerManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
          runtime_env:
            env_vars:
              TWINKLE_TRUST_REMOTE_CODE: "0"

  # 4. Processor service: Data preprocessing
  # Executes tokenization, template conversion, and other preprocessing tasks on CPU
  - name: processor
    route_prefix: /api/v1/processor
    import_path: processor
    args:
      ncpu_proc_per_node: 2
      device_group:
        name: model
        ranks: 2
        device_type: CPU
      device_mesh:
        device_type: CPU
        dp_size: 2
    deployments:
      - name: ProcessorManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 128
        ray_actor_options:
          num_cpus: 0.1

Transformers Backend

The difference from the Megatron backend is only in the use_megatron parameter of the Model service:

  - name: models-Qwen3.5-4B
    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
    import_path: model
    args:
      use_megatron: false                              # Use Transformers backend
      model_id: "ms://Qwen/Qwen3.5-4B"
      nproc_per_node: 2
      device_group:
        name: model
        ranks: 2
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 2
      adapter_config:
        adapter_timeout: 1800                          # Idle adapter timeout unload time (seconds)
        adapter_max_lifetime: 36000
    deployments:
      - name: ModelManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1

Configuration Item Description

Application Components (import_path)

import_path Description
server Central management service, handles training runs and checkpoints
model Model service, hosts base model for training
processor Data preprocessing service, executes tokenization and template conversion on CPU
sampler Inference sampling service

device_group and device_mesh

  • device_group: Defines logical device groups, specifying how many GPUs to use

  • device_mesh: Defines distributed training mesh, controls parallelization strategy

device_group:
  name: model          # Device group name
  ranks: 2             # Number of GPUs to use
  device_type: cuda     # Device type: cuda / CPU

device_mesh:
  device_type: cuda
  dp_size: 2           # Data parallel size
  # tp_size: 1         # Tensor parallel size (optional)
  # pp_size: 1         # Pipeline parallel size (optional)
  # ep_size: 1         # Expert parallel size (optional)

Important configuration parameters:

Parameter Type Description
ranks int Number of GPUs to use for this component
dp_size int Data parallel size
tp_size int (optional) Tensor parallel size
pp_size int (optional) Pipeline parallel size
ep_size int (optional) Expert parallel size (for MoE models)

Environment variables:

export TWINKLE_TRUST_REMOTE_CODE=0       # Whether to trust remote code