Unity Performance Optimization Series (1): Draw Call Principles and Performance Optimization

2024/7/12 Programming Unity Performance Optimization DrawCall

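1. Draw Call Hardware and Software Flow

Application level (Unity calls)
- MeshRenderer, Graphics.DrawMesh, DrawMeshInstanced, and similar APIs each trigger a render request.
- Unity wraps these requests as commands and writes them into a Command Buffer.
Engine level (Render Thread)
- SetPass stage: binds the shader program, material properties, textures, blend state, and so on, building the current render state object (Pipeline State Object).
- DrawCall stage: binds the vertex buffer, index buffer, and constant (uniform) buffers on the GPU and issues the draw command.
Driver level (GPU driver and hardware)
- The driver receives the command buffer and transfers commands and data to the GPU (over the PCIe bus on desktop systems).
- The GPU decodes the commands, schedules the pipeline, and then runs the vertex shader, rasterization, and fragment (pixel) shader stages.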

> [Unity Script] → [Command Buffer] → [Render Thread: SetPass + DrawCall] → [GPU Driver] → [GPU Pipeline]

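2. Draw Call Cost Breakdown

The figures below are rough orders of magnitude; actual costs vary with platform, graphics API, and driver.

Stage	Main cost	Typical cost	Measurable metric
SetPass binding	Switching pipeline state (Pipeline State Object)	~5–20 μs per call	SetPass Calls
Data transfer	Uploading vertex/index/constant data	Depends on size	GPU Upload Bandwidth
Command submission and queuing	API calls, queue management	~1–5 μs per call	CPU Submit Overhead
Driver processing	Command translation, merging, scheduling	A few μs	Driver Overhead
GPU execution	Shading, rasterization, output merging	A few ms	GPU Frame Time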
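
As a back-of-the-envelope check on the table above, the per-call costs can be multiplied out into a per-frame CPU estimate. The sketch below assumes the costs are simply additive, which real drivers only approximate. `DrawCallBudget` and `EstimateCpuMs` are hypothetical names, and the microsecond values are inputs you would take from the ranges above or from your own profiling.

```csharp
// Hypothetical helper, not part of Unity: turns per-call costs into a frame estimate.
public static class DrawCallBudget
{
    // setPassUs and submitUs are per-call costs in microseconds (assumed additive).
    public static double EstimateCpuMs(int setPassCalls, int drawCalls,
                                       double setPassUs, double submitUs)
    {
        double totalUs = setPassCalls * setPassUs + drawCalls * submitUs;
        return totalUs / 1000.0; // microseconds -> milliseconds
    }
}
```

For example, 500 SetPass calls at 10 μs plus 500 draw calls at 2 μs come to about 6 ms of CPU time, which is more than a third of a 16.6 ms frame at 60 FPS.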

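3. Why Do Draw Calls Become a Bottleneck?

CPU–GPU split architecture: modern rendering uses asynchronous command queues. The CPU has to build every render command before it can be submitted, so per-call CPU work adds up and frequent interruptions of the queue are very expensive.
State object switching: a state change before a draw call (shader, material properties, blend/depth-test mode) forces the GPU to reconfigure its pipeline and can trigger a pipeline flush, in which in-flight work has to drain before the pipeline is refilled.
Command buffer pressure: when the number of draw calls spikes, pending commands pile up in the command buffer and the CPU has to wait for the GPU to catch up before writing more, which shows up as a pipeline stall.
Driver overhead: the GPU driver translates render commands into low-level hardware instructions, which involves repeated CPU-driver interaction; with many draw calls this overhead is multiplied.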

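II. Breaking Down the Optimization Entry Points

Based on the bottlenecks above, we can approach optimization from the following directions:

Bottleneck	Optimization approach
State switching (SetPass)	Batching / GPU Instancing
Command submission	Command Buffer / Multi-Draw Indirect
Pipeline flush	SRP Batcher (URP/HDRP) / Shader keyword management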

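III. Optimization in Practice

Using "500 identical cubes in a scene" as the example, each step is verified with Profiler → Rendering → Batches.

3.1 Baseline

Scene: 500 Cubes, each with a MeshRenderer and the same material
Metrics:
- Draw Call ≈ 2000
- CPU Frame Time ≈ 10 ms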

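3.2 Static Batching

Principle: at build time, Unity combines the meshes of all objects marked Static that share a material into one large vertex/index buffer, so they can ideally be drawn with a single SetPass and very few draw calls.
Steps:
1. Select all Cubes → Inspector → open the Static dropdown → check Batching Static
2. Build and run
Results:
- Draw Call ≈ 10 (including camera, lights, and so on)
- CPU render thread ≈ 5 ms
Note: runtime memory increases (roughly +0.5 MB per 100 objects), because each object's geometry is copied into the combined mesh.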
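
For objects that are spawned at runtime and therefore cannot be flagged in the editor, Unity also exposes `StaticBatchingUtility.Combine`. The sketch below is a minimal illustration and not part of the original walkthrough. The class name `RuntimeStaticBatcher` is hypothetical, and it assumes the cubes are children of the object the script is attached to and will not move afterwards. Depending on import settings, the meshes may need Read/Write enabled for this to work in a build.

```csharp
using UnityEngine;

// Attach to the parent object of the cubes. Assumes the children already exist
// when Start runs and stay in place for the rest of their lifetime.
public class RuntimeStaticBatcher : MonoBehaviour
{
    void Start()
    {
        // Prepares the child GameObjects of this root for static batching.
        StaticBatchingUtility.Combine(gameObject);
    }
}
```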

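3.3 Dynamic Batching

Principle: at runtime, Unity automatically merges small meshes that share a material into one draw call. The limit is roughly 300 vertices per mesh; Unity actually counts vertex attributes, so the exact figure depends on the shader's inputs and the Unity version.
Steps:
1. Make sure each Cube has fewer than 300 vertices and uses the same material
2. Project Settings → Player → Other Settings → enable Dynamic Batching
Results:
- Draw Call ≈ 15
- CPU Frame Time ≈ 6 ms
Limitation: the batching logic runs every frame on the CPU, so it still costs some CPU cycles.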

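3.4 GPU Instancing

Principle: one SetPass renders N instances; the shader reads per-instance data through SV_InstanceID, which Unity wraps in the UNITY_VERTEX_INPUT_INSTANCE_ID and UNITY_SETUP_INSTANCE_ID macros.

Scene and asset setup

Create a new empty scene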
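- Keep the default camera and light (the instanced draws still need a camera to render into) and add an empty GameObject to act as the "instancing manager".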
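Create the cube Prefab
- Create a Cube in the Hierarchy (it already has a MeshFilter and MeshRenderer) → assign the default material (or a custom PBR material) → drag it into the Project panel to create CubePrefab.

Material and Shader configuration

GPU Instancing requires the material's shader to support the instancing macros.

Check/edit the Shader
- The built-in Standard Shader needs no changes; a custom shader must include the instancing macros and #pragma multi_compile_instancing.
Enable Instancing
- Select the material in the Project panel → Inspector → check Enable GPU Instancing.
Script implementation
Attach the script CubeInstancer.cs to the "instancing manager"; it does the following: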

    using UnityEngine;
    using UnityEngine.Rendering;

    public class CubeInstancer : MonoBehaviour
    {
        [Header("Resource references")]
        public Mesh     cubeMesh;        // Cube mesh to draw
        public Material cubeMaterial;    // Material with GPU Instancing enabled

        [Header("Instance parameters")]
        public int   instanceCount = 500;
        public float spacing       = 1.5f;

        private Matrix4x4[] matrices;

        void Start()
        {
            // 1. Limit: a single DrawMeshInstanced call draws at most 1023 instances
            instanceCount = Mathf.Clamp(instanceCount, 1, 1023);

            // 2. Build the array of transform matrices
            matrices = new Matrix4x4[instanceCount];
            for (int i = 0; i < instanceCount; i++)
            {
                // Lay the instances out on a grid, 25 per row
                int row = i / 25;
                int col = i % 25;
                Vector3 pos = new Vector3(col * spacing, 0f, row * spacing);
                matrices[i] = Matrix4x4.TRS(pos, Quaternion.identity, Vector3.one);
            }
        }

        void Update()
        {
            if (cubeMesh == null || cubeMaterial == null) return;

            // Issue one DrawMeshInstanced call per frame
            Graphics.DrawMeshInstanced(
                cubeMesh,
                0,                        // submesh index
                cubeMaterial,
                matrices,
                instanceCount,
                null,                     // optional MaterialPropertyBlock
                ShadowCastingMode.On,
                receiveShadows: true,
                layer: gameObject.layer
            );
        }
    }

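Results:

Draw Call < 10
CPU Frame Time < 3 ms

Dependency: the shader must be compiled with #pragma multi_compile_instancing; the UNITY_INSTANCING_BUFFER_* macros are needed only for per-instance properties such as color.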
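
The script above clamps the instance count to 1023 because that is the most a single DrawMeshInstanced call accepts. To draw more, one option is to split the instances into ranges and issue one call per range. The sketch below shows only the range arithmetic and is kept free of Unity types so it can be checked in isolation; `InstanceBatches` and `Split` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits `total` instances into ranges of at most `maxPerCall`.
public static class InstanceBatches
{
    public const int MaxInstancesPerCall = 1023; // per-call limit noted above

    public static List<(int start, int count)> Split(int total, int maxPerCall = MaxInstancesPerCall)
    {
        if (maxPerCall <= 0) throw new ArgumentOutOfRangeException(nameof(maxPerCall));

        var ranges = new List<(int start, int count)>();
        for (int start = 0; start < total; start += maxPerCall)
        {
            // The last range may be shorter than maxPerCall.
            int count = Math.Min(maxPerCall, total - start);
            ranges.Add((start, count));
        }
        return ranges;
    }
}
```

In a Unity script, each range would be copied into a reusable Matrix4x4[1023] scratch array (for instance with Array.Copy) and passed to Graphics.DrawMeshInstanced together with the range's count, since the array overload takes no start offset.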

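3.5 SRP Batcher (URP/HDRP)

Principle: the Scriptable Render Pipeline groups draw calls that use the same shader variant and keeps material data in persistent GPU constant buffers, which significantly reduces SetPass Calls.
Steps:
1. Install and enable Universal RP or High Definition RP
2. Project Settings → Graphics → assign the corresponding SRP Asset
Results:
- Draw Call ≈ 10
- CPU Frame Time ≈ 3 ms
Advantage: it applies globally, with no per-object flags or script changes, provided the shaders are SRP Batcher compatible (the stock URP/HDRP shaders are).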
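
The SRP Batcher can also be toggled from script, which is handy for A/B comparisons in the Profiler. The following is a small sketch, assuming a URP or HDRP asset is already active; the class name `SrpBatcherToggle` and its `enableBatcher` field are hypothetical.

```csharp
using UnityEngine;
using UnityEngine.Rendering;

public class SrpBatcherToggle : MonoBehaviour
{
    public bool enableBatcher = true;

    void OnEnable()
    {
        // Only has an effect while a Scriptable Render Pipeline (URP/HDRP) is active.
        GraphicsSettings.useScriptableRenderPipelineBatching = enableBatcher;
    }
}
```

Flipping the flag while watching SetPass Calls in the Profiler gives a quick read on how much the batcher contributes in a given scene.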

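3.6 Command Buffer & Indirect Draw

3.6.1 Command Buffer principle: accumulate multiple draw calls in script, or use GPU indirect drawing, to reduce the number of CPU submissions.

A Command Buffer is a scriptable list of rendering commands provided by Unity. You can "record" commands into it at any time (DrawMesh, Blit, SetRenderTarget, and so on).
Execution: calling Graphics.ExecuteCommandBuffer(cmd) runs the recorded commands as one batch. A buffer that is recorded once can be replayed every frame, which avoids the per-call CPU overhead of issuing many separate Graphics.Draw* calls.
Typical flow:
1. Create a CommandBuffer
2. Record commands (for example, several cmd.DrawMesh(...) calls in a loop)
3. Execute, then clear or release: Graphics.ExecuteCommandBuffer(cmd) → cmd.Clear() to reuse the buffer, or cmd.Release() when it is no longer needed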

using UnityEngine;
using UnityEngine.Rendering;

[ExecuteAlways]
public class CommandBufferInstancedExample : MonoBehaviour
{
    [Header("Batch Settings")]
    public Mesh mesh;
    public Material material;
    public int instanceCount = 500;
    public float spacing = 1.5f;

    private Matrix4x4[] matrices;
    private CommandBuffer commandBuffer;
    private Camera targetCamera;

    void OnEnable()
    {
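        if (mesh == null || material == null) return;

        // Build the per-instance transform matrices.
        // DrawMeshInstanced accepts at most 1023 instances per call.
        int count = Mathf.Clamp(instanceCount, 1, 1023);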
        matrices = new Matrix4x4[count];
        int width = Mathf.CeilToInt(Mathf.Sqrt(count));
        for (int i = 0; i < count; i++)
        {
            int x = i % width;
            int z = i / width;
            Vector3 pos = new Vector3(x * spacing, 0f, z * spacing);
            matrices[i] = Matrix4x4.TRS(pos, Quaternion.identity, Vector3.one);
        }

        // Create the command buffer and record a single instanced draw
        commandBuffer = new CommandBuffer { name = "InstancedBatch" };
        commandBuffer.DrawMeshInstanced(
            mesh: mesh,
            submeshIndex: 0,
            material: material,
            shaderPass: 0,
            matrices: matrices,
            count: count
        );

        // Attach to the main camera (Camera.AddCommandBuffer applies to the Built-in Render Pipeline only)
        targetCamera = Camera.main;
        if (targetCamera != null)
        {
            targetCamera.AddCommandBuffer(CameraEvent.AfterForwardOpaque, commandBuffer);
        }
    }

    void OnDisable()
    {
        // Detach and release the CommandBuffer
        if (targetCamera != null && commandBuffer != null)
        {
            targetCamera.RemoveCommandBuffer(CameraEvent.AfterForwardOpaque, commandBuffer);
        }
        if (commandBuffer != null)
        {
            commandBuffer.Release();
            commandBuffer = null;
        }
    }

    void OnValidate()
    {
        // Rebuild the command buffer when values change in the Inspector during Play mode
        if (Application.isPlaying && isActiveAndEnabled)
        {
            OnDisable();
            OnEnable();
        }
    }
}

Results: Draw Call < 3; CPU Frame Time < 3 ms

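3.6.2 Indirect Draw Principle

DrawMeshInstancedIndirect: the draw parameters (index count, instance count, start offsets, and so on) live in a GPU-readable ComputeBuffer. The CPU still issues the draw, but the GPU reads the arguments from that buffer at execution time, so they can be produced or changed on the GPU without a CPU readback.
Flow:
1. Create the argument buffer (ComputeBufferType.IndirectArguments)
2. Fill in the arguments: [ indexCountPerInstance, instanceCount, startIndex, baseVertex, startInstance ]
3. Call: Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);
4. GPU-side execution: no CPU–GPU synchronization is needed each frame; updating the argument buffer is enough to change the instance count or shape dynamically.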
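
DrawMeshInstancedIndirect reads five uint values from the argument buffer for an indexed mesh. The sketch below spells out that layout in a small helper. `IndirectArgs` and `Build` are hypothetical names introduced for illustration, and no Unity types are needed for the layout itself.

```csharp
// Hypothetical helper describing the five-uint layout read by DrawMeshInstancedIndirect.
public static class IndirectArgs
{
    public const int Count = 5;                          // five uints per indexed draw
    public const int SizeInBytes = Count * sizeof(uint); // 20 bytes

    public static uint[] Build(uint indexCountPerInstance, uint instanceCount,
                               uint startIndex = 0, uint baseVertex = 0, uint startInstance = 0)
    {
        // The order matters: the GPU interprets the values by position.
        return new uint[Count]
        {
            indexCountPerInstance, instanceCount, startIndex, baseVertex, startInstance
        };
    }
}
```

For 500 copies of a 36-index cube, IndirectArgs.Build(36, 500) yields { 36, 500, 0, 0, 0 }, which would then be uploaded with ComputeBuffer.SetData into a buffer of IndirectArgs.SizeInBytes bytes.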

using UnityEngine;
using UnityEngine.Rendering;

[ExecuteAlways]
public class IndirectDrawExample : MonoBehaviour
{
    [Header("Instance Settings")]
    public Mesh mesh;
    public Material material;                // Shader must support instancing and read matrixBuffer
    public int instanceCount = 500;
    public float spacing = 1.5f;

    private ComputeBuffer argsBuffer;        // Indirect draw arguments buffer
    private ComputeBuffer matrixBuffer;      // Per-instance matrix buffer
    private Bounds drawBounds;
    private uint[] args = new uint[5];       // [indexCountPerInstance, instanceCount, startIndex, baseVertex, startInstance]

    void OnEnable()
    {
        if (mesh == null || material == null || instanceCount <= 0) return;

        // Initialize instance matrices
        var matrices = new Matrix4x4[instanceCount];
        int width = Mathf.CeilToInt(Mathf.Sqrt(instanceCount));
        for (int i = 0; i < instanceCount; i++)
        {
            int x = i % width;
            int z = i / width;
            Vector3 pos = new Vector3(x * spacing, 0f, z * spacing);
            matrices[i] = Matrix4x4.TRS(pos, Quaternion.identity, Vector3.one);
        }

        // Upload matrices to GPU
        matrixBuffer = new ComputeBuffer(instanceCount, 16 * sizeof(float)); // 16 floats in Matrix4x4
        matrixBuffer.SetData(matrices);
        material.SetBuffer("matrixBuffer", matrixBuffer);

        // Setup indirect draw args
        args[0] = (uint)mesh.GetIndexCount(0);
        args[1] = (uint)instanceCount;
        args[2] = (uint)mesh.GetIndexStart(0);
        args[3] = (uint)mesh.GetBaseVertex(0);
        args[4] = 0;

        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
        argsBuffer.SetData(args);

        // Calculate drawing bounds
        float extent = width * spacing * 0.5f;
        drawBounds = new Bounds(new Vector3(extent, 0, extent), new Vector3(extent * 2, extent * 2, extent * 2));
    }

    void Update()
    {
        if (argsBuffer == null || matrixBuffer == null) return;

        // Indirect draw with single draw call (no shadows, no receive)
        Graphics.DrawMeshInstancedIndirect(
            mesh,
            0,
            material,
            drawBounds,
            argsBuffer,
            0,
            null,
            ShadowCastingMode.Off,
            false,
            gameObject.layer,
            Camera.main,
            LightProbeUsage.Off,
            null
        );
    }

    void OnDisable()
    {
        // Release GPU buffers
        if (argsBuffer != null)
        {
            argsBuffer.Release();
            argsBuffer = null;
        }
        if (matrixBuffer != null)
        {
            matrixBuffer.Release();
            matrixBuffer = null;
        }
    }
}

Results: Draw Call < 3; the batch is rendered with draw arguments read directly on the GPU.

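3.6.3 Dynamic Control and ComputeShader-Driven Updates

Changing the instance count at runtime: a ComputeShader can update argsBuffer on the GPU, and the change takes effect in the next draw that reads the buffer. The CPU never touches the data; it only issues the Dispatch and the draw.
Example approach:
1. In the ComputeShader, write the new instanceCount or transform matrices (StructuredBuffer).
2. Dispatch the ComputeShader → the indirect argument buffer is updated on the GPU.
3. DrawMeshInstancedIndirect reads the latest arguments and draws all instances in one call.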

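// ComputeShader example (HLSL)
#pragma kernel CSMain

RWStructuredBuffer<float4x4> matrixBuffer;
RWStructuredBuffer<uint>     argsBuffer;

uint  totalInstanceCount;   // set from C# via ComputeShader.SetInt
float spacing;              // set from C# via ComputeShader.SetFloat

// Placeholder layout (an assumption): places instances in a row along the X axis.
float4x4 ComputeMatrix(uint index)
{
    return float4x4(
        1, 0, 0, index * spacing,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1);
}

[numthreads(64,1,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // Skip threads that fall past the end of the buffer
    if (id.x >= totalInstanceCount) return;

    // Update this instance's matrix
    matrixBuffer[id.x] = ComputeMatrix(id.x);

    // Thread 0 writes the total instance count into the indirect arguments
    if (id.x == 0)
    {
        argsBuffer[1] = totalInstanceCount;
    }
}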
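
On the C# side, the kernel above has to be dispatched with enough thread groups to cover every instance. With [numthreads(64,1,1)] that is a ceiling division by 64. The helper below is a minimal sketch of that calculation; `ComputeDispatch` and `GroupCount` are hypothetical names.

```csharp
using System;

public static class ComputeDispatch
{
    // Smallest number of groups such that groups * threadsPerGroup >= itemCount.
    public static int GroupCount(int itemCount, int threadsPerGroup = 64)
    {
        if (itemCount < 0) throw new ArgumentOutOfRangeException(nameof(itemCount));
        if (threadsPerGroup <= 0) throw new ArgumentOutOfRangeException(nameof(threadsPerGroup));
        return (itemCount + threadsPerGroup - 1) / threadsPerGroup; // ceiling division
    }
}
```

In a Unity script the result would be passed as the X group count, for example computeShader.Dispatch(kernel, ComputeDispatch.GroupCount(instanceCount), 1, 1). For 500 instances this launches 8 groups (512 threads), so the kernel should ignore thread indices at or beyond the instance count.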

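Performance comparison

Method	Draw Calls	CPU Frame Time	Characteristics
Individual `Graphics.DrawMesh` calls	N	High (~5–10 ms)	Simplest, but the highest per-call CPU overhead
Command Buffer	~3	Medium (~1–3 ms)	Batched submission, good compatibility, but the CPU still records the commands
Indirect Draw	~2	Low (<2 ms)	Draw arguments come from the GPU, lowest CPU load, requires ComputeBuffer support

Bounds: DrawMeshInstancedIndirect needs accurate Bounds; otherwise the whole draw may be culled while still on screen, or never culled at all.

Platform compatibility: some mobile devices and older GPUs may not support ComputeBuffer-based indirect drawing; SystemInfo.supportsComputeShaders and SystemInfo.supportsInstancing can be checked at runtime.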

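Buffer layout: the indirect argument buffer must hold five consecutive uint values (20 bytes) starting at argsOffset; with a wrong layout or offset the rendering result is undefined.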

Resource release: always call Release() on ComputeBuffer and CommandBuffer objects once they are no longer in use, to avoid memory leaks.

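Combined use: Camera.AddCommandBuffer and CameraEvent belong to the Built-in Render Pipeline. In SRP (URP/HDRP), command buffers are injected through the pipeline's own extension points instead, such as a ScriptableRenderPass in URP or a Custom Pass in HDRP, which ties them to specific pipeline stages.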

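IV. Summary

Measure before optimizing
- Use Profiler → Batches/SetPass Calls to locate the bottleneck.
Choose the approach by scenario
- Static elements → Static Batching
- A small number of small meshes → Dynamic Batching
- Many repeated instances → GPU Instancing / Indirect Draw
- Scenes with many different materials → SRP Batcher
Watch for side effects
- Batching can increase memory use and build time
- Indirect Draw requires GPU and graphics API support
Keep monitoring
- Compare Draw Calls and Frame Time regularly during iteration to make sure the gains hold.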
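
For the "keep monitoring" step, the same counters shown in the Profiler can be read from script with ProfilerRecorder. The sketch below is one way to log them and is not taken from the original walkthrough. It assumes Unity 2020.2 or newer, the class name `RenderStatsLogger` is hypothetical, and counter availability may differ between the Editor, development builds, and release builds.

```csharp
using Unity.Profiling;
using UnityEngine;

public class RenderStatsLogger : MonoBehaviour
{
    ProfilerRecorder drawCalls;
    ProfilerRecorder setPassCalls;

    void OnEnable()
    {
        drawCalls    = ProfilerRecorder.StartNew(ProfilerCategory.Render, "Draw Calls Count");
        setPassCalls = ProfilerRecorder.StartNew(ProfilerCategory.Render, "SetPass Calls Count");
    }

    void OnDisable()
    {
        // Recorders hold native resources and must be disposed.
        drawCalls.Dispose();
        setPassCalls.Dispose();
    }

    void Update()
    {
        // Log every 60th frame to keep the console readable.
        if (Time.frameCount % 60 != 0) return;
        if (drawCalls.Valid && setPassCalls.Valid)
            Debug.Log($"Draw Calls: {drawCalls.LastValue}, SetPass Calls: {setPassCalls.LastValue}");
    }
}
```

Logging these values before and after each change gives a simple regression check that does not depend on having the Profiler window open.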
