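# Server

## Ray Cluster Configuration

Before starting the Server, **you must start and configure the Ray nodes**. The Server can allocate and occupy resources (GPUs, CPUs, and so on) correctly only after the Ray nodes are configured.

### Starting Ray Nodes

A Ray cluster consists of multiple nodes, and each node can be configured with different resources. Start the cluster as follows:

#### 1. Start the Head Node (First GPU Node)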

```bash
# Stop any existing Ray cluster
ray stop

# Start the Head node with GPUs 0-3 (4 GPUs in total)
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --num-gpus=4 --port=6379
```

#### 2. Start the Worker Nodes

```bash
# Replace 10.28.252.9 with the IP address of your own Head node.
# Second GPU node, using GPUs 4-7 (4 GPUs in total)
CUDA_VISIBLE_DEVICES=4,5,6,7 ray start --address=10.28.252.9:6379 --num-gpus=4

# CPU node (for CPU tasks such as the Processor)
ray start --address=10.28.252.9:6379 --num-gpus=0
```

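**Notes:**
- `--head`: marks this node as the Head node (the cluster's primary node)
- `--port=6379`: the port the Head node listens on
- `--address=<IP>:<PORT>`: the Head node address that a Worker node connects to
- `--num-gpus=N`: the number of GPUs available on this node
- `CUDA_VISIBLE_DEVICES`: restricts which GPU devices are visible to this node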
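
When a cluster layout is described in code, the flags above can be assembled programmatically. The following is a minimal sketch and is not part of Twinkle or Ray: `ray_start_command` is a hypothetical helper that only builds the shell string from the flags listed above and does not run anything.

```python
from typing import Optional, Sequence


def ray_start_command(
    num_gpus: int,
    gpu_ids: Sequence[int] = (),
    head: bool = False,
    head_address: Optional[str] = None,
    port: int = 6379,
) -> str:
    """Build a `ray start` shell command for one node (hypothetical helper)."""
    if head == (head_address is not None):
        raise ValueError("pass exactly one of head=True or head_address")
    if gpu_ids and len(gpu_ids) != num_gpus:
        raise ValueError("len(gpu_ids) must equal num_gpus")

    parts = []
    if gpu_ids:
        # Restrict the GPUs that this node can see.
        parts.append("CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in gpu_ids))
    parts += ["ray", "start"]
    if head:
        parts += ["--head", f"--num-gpus={num_gpus}", f"--port={port}"]
    else:
        parts += [f"--address={head_address}", f"--num-gpus={num_gpus}"]
    return " ".join(parts)


# Reproduce the three commands used in this section.
print(ray_start_command(4, gpu_ids=[0, 1, 2, 3], head=True))
print(ray_start_command(4, gpu_ids=[4, 5, 6, 7], head_address="10.28.252.9:6379"))
print(ray_start_command(0, head_address="10.28.252.9:6379"))
```

Generating the commands from one description keeps `--num-gpus` and `CUDA_VISIBLE_DEVICES` in agreement, which is easy to get wrong when editing them by hand.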

#### 3. Complete Example: 3-Node Cluster

```bash
# Stop the old cluster and start a new one
ray stop && \
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --num-gpus=4 --port=6379 && \
CUDA_VISIBLE_DEVICES=4,5,6,7 ray start --address=10.28.252.9:6379 --num-gpus=4 && \
ray start --address=10.28.252.9:6379 --num-gpus=0
```

This configuration starts 3 nodes:
- **Node 0** (Head): 4 GPUs (devices 0-3)
- **Node 1** (Worker): 4 GPUs (devices 4-7)
- **Node 2** (Worker): CPU-only node
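
To confirm that the cluster came up with the expected layout, the node list can be summarized per node. The sketch below is an illustration only: `summarize_nodes` is a hypothetical helper that expects data in the shape returned by `ray.nodes()` (a list of dicts with an `Alive` flag and a `Resources` mapping), and the sample data, including its CPU counts, is made up.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return GPU and CPU counts for each live node (hypothetical helper)."""
    summary = []
    for node in nodes:
        if not node.get("Alive", False):
            continue  # skip nodes that have left the cluster
        resources = node.get("Resources", {})
        summary.append({
            "address": node.get("NodeManagerAddress"),
            "gpus": int(resources.get("GPU", 0)),
            "cpus": int(resources.get("CPU", 0)),
        })
    return summary


# Sample data standing in for `ray.nodes()` on the 3-node cluster above.
sample_nodes = [
    {"Alive": True, "NodeManagerAddress": "10.28.252.9", "Resources": {"GPU": 4.0, "CPU": 32.0}},
    {"Alive": True, "NodeManagerAddress": "10.28.252.9", "Resources": {"GPU": 4.0, "CPU": 32.0}},
    {"Alive": True, "NodeManagerAddress": "10.28.252.9", "Resources": {"CPU": 32.0}},
    {"Alive": False, "NodeManagerAddress": "10.28.252.9", "Resources": {}},
]
for row in summarize_nodes(sample_nodes):
    print(row)

# Against a live cluster (requires Ray to be installed and running):
#   import ray
#   ray.init(address="auto")
#   print(summarize_nodes(ray.nodes()))
```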

#### 4. Set Environment Variables

Before starting the Server, set the following environment variable:

```bash
export TWINKLE_TRUST_REMOTE_CODE=0       # Whether to trust remote code (a security consideration)
```

### Node Rank in the YAML Configuration

In the YAML configuration file, **each component must occupy its own Node**.

**Example configuration:**

```yaml
applications:
  # Model service occupies GPUs 0-3 (physical device IDs)
  - name: models-Qwen3.5-4B
    route_prefix: /models/Qwen/Qwen3.5-4B
    import_path: model
    args:
      nproc_per_node: 4
      device_group:
        name: model
        ranks: 4               # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 4             # Data-parallel size
        # tp_size: 1           # Tensor-parallel size (optional)
        # pp_size: 1           # Pipeline-parallel size (optional)
        # ep_size: 1           # Expert-parallel size (optional)

  # Sampler service occupies GPUs 4-5 (physical device IDs)
  - name: sampler-Qwen3.5-4B
    route_prefix: /sampler/Qwen/Qwen3.5-4B
    import_path: sampler
    args:
      nproc_per_node: 2
      device_group:
        name: sampler
        ranks: 2               # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 2             # Data-parallel size

  # Processor service runs on CPU
  - name: processor
    route_prefix: /processors
    import_path: processor
    args:
      ncpu_proc_per_node: 4
      device_group:
        name: processor
        ranks: 0               # CPU index
        device_type: CPU
      device_mesh:
        device_type: CPU
        dp_size: 4             # Data-parallel size
```
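
**Important:**
- `ranks` specifies the **number of GPUs** allocated to the component
- `device_mesh` defines the parallelism strategy through parameters such as `dp_size`, `tp_size`, `pp_size`, and `ep_size`
- Different components are automatically assigned to different Nodes
- Ray automatically schedules each component onto a suitable Node based on its resource requirements (`num_gpus` and `num_cpus` in `ray_actor_options`)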
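
Because each component needs its own Node, it can help to check before launch that the cluster has a large enough Node for every component. The sketch below is a simplified pre-flight check under stated assumptions, not Ray's scheduler and not part of Twinkle: `place_components` is a hypothetical helper that places the largest GPU demand first on the smallest free node that fits, and it treats the CPU-based Processor as needing 0 GPUs.

```python
from typing import Dict, List


def place_components(gpu_demand: Dict[str, int], node_gpus: List[int]) -> Dict[str, int]:
    """Assign each component to its own node with enough GPUs.

    Returns {component: node index}. Raises ValueError if some component
    cannot be placed. This is an illustrative check, not Ray's scheduler.
    """
    free = dict(enumerate(node_gpus))  # node index -> GPU count, still unassigned
    placement = {}
    # Place the largest demands first, each on the smallest node that fits.
    for name, need in sorted(gpu_demand.items(), key=lambda kv: -kv[1]):
        candidates = [(gpus, idx) for idx, gpus in free.items() if gpus >= need]
        if not candidates:
            raise ValueError(f"no free node with {need} GPU(s) for {name!r}")
        _, idx = min(candidates)
        placement[name] = idx
        del free[idx]
    return placement


# `ranks` of the GPU components in the example above; the Processor runs on CPU.
demand = {"model": 4, "sampler": 2, "processor": 0}
# The 3-node cluster from the previous section: 4 GPUs, 4 GPUs, CPU-only.
print(place_components(demand, [4, 4, 0]))
```

With the 3-node cluster this yields one Node per component. With only one GPU node the Sampler cannot be placed and the check raises an error.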

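## Launching the Server

The Server is always launched through the `launch_server` function or the CLI command, together with a YAML configuration file.

### Option 1: Launch from a Python Script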

```python
# server.py
import os
from twinkle.server import launch_server

# Path to the config file (server_config.yaml in the same directory as this script)
file_dir = os.path.abspath(os.path.dirname(__file__))
config_path = os.path.join(file_dir, 'server_config.yaml')

# Start the server; this call blocks until the server shuts down
launch_server(config_path=config_path)
```

### Option 2: Launch from the Command Line

```bash
python -m twinkle.server --config server_config.yaml
```

Arguments supported by the CLI:

| Argument | Description | Default |
|------|------|-------|
| `-c, --config` | Path to the YAML configuration file (required) | (none) |
| `--namespace` | Ray namespace | `twinkle_cluster` |
| `--log-level` | Log level | `INFO` |
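
The table above can be read as the following `argparse` definition. This is a reconstruction from the table for illustration only and is not the actual Twinkle source: `build_parser` is a hypothetical name, and the real CLI may validate or handle these arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented CLI arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="python -m twinkle.server")
    parser.add_argument("-c", "--config", required=True,
                        help="Path to the YAML configuration file")
    parser.add_argument("--namespace", default="twinkle_cluster",
                        help="Ray namespace")
    parser.add_argument("--log-level", default="INFO",
                        help="Log level")
    return parser


# Only --config is required; the other arguments fall back to their defaults.
args = build_parser().parse_args(["--config", "server_config.yaml"])
print(args.config, args.namespace, args.log_level)
```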

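## YAML Configuration in Detail

The configuration file defines the Server's complete deployment: the HTTP listener, the application components, and resource allocation. The Server supports both Twinkle and Tinker clients and deploys all service components from a single configuration file.

### Complete Configuration Example (Megatron Backend)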

```yaml
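# HTTP proxy location: EveryNode runs one proxy on every Ray node (recommended for multi-node setups)
proxy_location: EveryNode

# HTTP listener configuration
http_options:
  host: 0.0.0.0        # Listen on all network interfaces
  port: 8000            # Service port

# Application list: each entry defines one service component deployed on the Server
applications:

  # 1. TinkerCompatServer: central API service
  # Handles client connections, training-run tracking, checkpoint management, and more
  # route_prefix is /api/v1, compatible with both Tinker and Twinkle clients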
  - name: server
    route_prefix: /api/v1
    import_path: server
    args:
      server_config:
        per_token_model_limit: 3      # Maximum number of models (adapters) that can be associated with each token (applies server-wide)
      supported_models:
        - Qwen/Qwen3.5-4B
    deployments:
      - name: TinkerCompatServer
        max_ongoing_requests: 50
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 128
        ray_actor_options:
          num_cpus: 0.1

  # 2. Model service: hosts the base model
  # Runs training computation such as forward and backward passes
  - name: models-Qwen3.5-4B
    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
    import_path: model
    args:
      use_megatron: true                               # Use the Megatron-LM backend
      model_id: "ms://Qwen/Qwen3.5-4B"               # ModelScope model identifier
      max_length: 10240
      nproc_per_node: 2                                # GPU processes per node
      device_group:                                    # Logical device group
        name: model
        ranks: 2                                       # Number of GPUs to use
        device_type: cuda
      device_mesh:                                     # Distributed training mesh
        device_type: cuda
        dp_size: 2                                     # Data-parallel size
      queue_config:
        rps_limit: 100                                 # Maximum request rate (per second)
        tps_limit: 10000                               # Maximum token rate per user (per second)
        max_input_tokens: 10000                        # Maximum input tokens per request
      adapter_config:
        adapter_timeout: 30                            # Seconds before an idle adapter is unloaded
        adapter_max_lifetime: 36000                    # Maximum adapter lifetime (seconds)
      max_loras: 1                                     # Maximum number of LoRAs loaded per model
    deployments:
      - name: ModelManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
          runtime_env:
            env_vars:
              TWINKLE_TRUST_REMOTE_CODE: "0"

  # 3. Sampler service: inference sampling
  # Runs inference with the vLLM engine and supports LoRA adapters
  - name: sampler-Qwen3.5-4B
    route_prefix: /api/v1/sampler/Qwen/Qwen3.5-4B
    import_path: sampler
    args:
      model_id: "ms://Qwen/Qwen3.5-4B"               # ModelScope model identifier
      nproc_per_node: 2                                # GPU processes per node
      sampler_type: vllm                               # Inference engine: vllm (high performance) or torch
      engine_args:                                     # vLLM engine arguments
        max_model_len: 4096                            # Maximum sequence length
        gpu_memory_utilization: 0.5                    # Fraction of GPU memory to use (0.0-1.0)
        enable_lora: true                              # Allow loading LoRA adapters at inference time
        logprobs_mode: processed_logprobs              # logprobs output mode
      device_group:                                    # Logical device group
        name: sampler
        ranks: 1                                       # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 1
      queue_config:
        rps_limit: 100                                 # Maximum request rate (per second)
        tps_limit: 100000                              # Maximum token rate (per second)
    deployments:
      - name: SamplerManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
          runtime_env:
            env_vars:
              TWINKLE_TRUST_REMOTE_CODE: "0"

  # 4. Processor service: data preprocessing
  # Runs preprocessing tasks such as tokenization and template conversion on CPU
  - name: processor
    route_prefix: /api/v1/processor
    import_path: processor
    args:
      ncpu_proc_per_node: 2
      device_group:
        name: model
        ranks: 2
        device_type: CPU
      device_mesh:
        device_type: CPU
        dp_size: 2
    deployments:
      - name: ProcessorManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 128
        ray_actor_options:
          num_cpus: 0.1
```
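
To see which base URL each component is served under, the listener port can be combined with each application's `route_prefix`. The sketch below rests on assumptions: `service_urls` is a hypothetical helper, the dictionary is a hand-written excerpt of the configuration above rather than a parsed file, and plain `http` on `localhost` is assumed for a client on the same machine.

```python
from typing import Any, Dict


def service_urls(config: Dict[str, Any], host: str = "localhost") -> Dict[str, str]:
    """Map each application name to its base URL (hypothetical helper)."""
    port = config["http_options"]["port"]
    return {
        app["name"]: f"http://{host}:{port}{app['route_prefix']}"
        for app in config["applications"]
    }


# Excerpt of the configuration above, reduced to the fields used here.
config = {
    "http_options": {"host": "0.0.0.0", "port": 8000},
    "applications": [
        {"name": "server", "route_prefix": "/api/v1"},
        {"name": "models-Qwen3.5-4B", "route_prefix": "/api/v1/model/Qwen/Qwen3.5-4B"},
        {"name": "sampler-Qwen3.5-4B", "route_prefix": "/api/v1/sampler/Qwen/Qwen3.5-4B"},
        {"name": "processor", "route_prefix": "/api/v1/processor"},
    ],
}

for name, url in service_urls(config).items():
    print(f"{name}: {url}")
```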

### Transformers Backend

The Transformers backend differs from the Megatron backend only in the `use_megatron` parameter of the Model service:

```yaml
  - name: models-Qwen3.5-4B
    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
    import_path: model
    args:
      use_megatron: false                              # Use the Transformers backend
      model_id: "ms://Qwen/Qwen3.5-4B"
      nproc_per_node: 2
      device_group:
        name: model
        ranks: 2
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 2
      adapter_config:
        adapter_timeout: 1800                          # Seconds before an idle adapter is unloaded
        adapter_max_lifetime: 36000
    deployments:
      - name: ModelManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
```

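## Configuration Reference

### Application Components (import_path)

| import_path | Description |
|-------------|------|
| `server` | Central management service; handles training runs and checkpoints |
| `model` | Model service; hosts the base model for training |
| `processor` | Data preprocessing service; runs tokenization, template conversion, and similar tasks on CPU |
| `sampler` | Inference sampling service |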

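### device_group and device_mesh

- **device_group**: defines a logical device group and specifies how many GPUs to use
- **device_mesh**: defines the distributed training mesh and controls the parallelism strategy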

```yaml
device_group:
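  name: model          # Device group name
  ranks: 2             # Number of GPUs to use
  device_type: cuda     # Device type: cuda / CPU

device_mesh:
  device_type: cuda
  dp_size: 2           # Data-parallel size
  # tp_size: 1         # Tensor-parallel size (optional)
  # pp_size: 1         # Pipeline-parallel size (optional)
  # ep_size: 1         # Expert-parallel size (optional)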
```

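**Key configuration parameters:**

| Parameter | Type | Description |
|------|------|------|
| `ranks` | int | **Number of GPUs to use** |
| `dp_size` | int | Data-parallel size |
| `tp_size` | int (optional) | Tensor-parallel size |
| `pp_size` | int (optional) | Pipeline-parallel size |
| `ep_size` | int (optional) | Expert-parallel size (for MoE models) |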
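
In the GPU examples in this document, `ranks` always equals `dp_size` with the other sizes left at their defaults. The sketch below turns that into a consistency check, under the assumption (not confirmed by this document) that the product `dp_size * tp_size * pp_size` should equal `ranks` for `cuda` groups and that `ep_size` does not add to the GPU count. `mesh_size` and `check_mesh` are hypothetical helpers.

```python
from typing import Any, Dict


def mesh_size(device_mesh: Dict[str, Any]) -> int:
    """Product of the dp/tp/pp sizes; omitted sizes default to 1."""
    size = 1
    for key in ("dp_size", "tp_size", "pp_size"):
        size *= int(device_mesh.get(key, 1))
    return size


def check_mesh(device_group: Dict[str, Any], device_mesh: Dict[str, Any]) -> None:
    """Raise ValueError if the mesh does not cover exactly `ranks` GPUs.

    Assumption: dp_size * tp_size * pp_size == ranks for cuda groups.
    """
    ranks = int(device_group["ranks"])
    size = mesh_size(device_mesh)
    if size != ranks:
        raise ValueError(
            f"device_group {device_group['name']!r}: ranks={ranks} "
            f"but dp*tp*pp={size}"
        )


# The example above: 2 GPUs, pure data parallelism.
group = {"name": "model", "ranks": 2, "device_type": "cuda"}
mesh = {"device_type": "cuda", "dp_size": 2}
check_mesh(group, mesh)  # passes silently
print(mesh_size(mesh))
```

A mismatch such as `ranks: 2` with `dp_size: 2` and `tp_size: 2` would raise an error under this assumption, since the mesh would then need 4 GPUs.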

**Environment variables:**

```bash
export TWINKLE_TRUST_REMOTE_CODE=0       # Whether to trust remote code
```