Quickstart#
Prerequisites#
GPU#
OS: Linux
Python: 3.12
GPU: NVIDIA compute capability 8.0+ (e.g., L20, L40, H20)
CUDA: CUDA Version 12.8
vLLM: v0.9.2
NPU#
OS: Linux
Python: >= 3.9, < 3.12
NPU: Atlas 800 A2/A3 series
CANN: CANN Version 8.1.RC1
vLLM: v0.9.2
vLLM Ascend: v0.9.2rc1
Installation#
Before you start with UCM, please make sure that you have installed UCM correctly by following the GPU Installation guide or NPU Installation guide.
Features Overview#
UCM supports two key features: Prefix Cache and Sparse attention.
Each feature supports both Offline Inference and Online API modes.
For quick start, just follow the usage guide below to launch your own inference experience;
For further research on Prefix Cache, more details are available via the link below:
Various Sparse Attention features are now available, try GSA Sparsity via the link below:
Usage#
Offline Inference
You can use our official offline example script to run offline inference as following commands:
cd examples/
# Change the model path to your own model path
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
python offline_inference.py
OpenAI-Compatible Online API
For online inference , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
First, specify the python hash seed by:
export PYTHONHASHSEED=123456
Create a config yaml like following and save it to your own directory:
# UCM Configuration File Example
# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
ucm_connector_name: "UcmNfsStore"
ucm_connector_config:
storage_backends: "/mnt/test"
Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model and your config file path:
# Change the model path to your own model path
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
vllm serve ${MODEL_PATH} \
--served-model-name vllm_cpu_offload \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
--trust-remote-code \
--port 7800 \
--kv-transfer-config \
'{
"kv_connector": "UCMConnector",
"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"
}
}'
If you see log as below:
INFO: Started server process [32890]
INFO: Waiting for application startup.
INFO: Application startup complete.
Congratulations, you have successfully started the vLLM server with UCM!
After successfully started the vLLM server,You can interact with the API as following:
curl http://localhost:7800/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vllm_cpu_offload",
"prompt": "Shanghai is a",
"max_tokens": 7,
"temperature": 0
}'
Note: If you want to disable vLLM prefix cache to test the cache ability of UCM, you can add --no-enable-prefix-caching to the command line.