# Quickstart ## Prerequisites ### GPU - OS: Linux - Python: 3.12 - GPU: NVIDIA compute capability 8.0+ (e.g., L20, L40, H20) - CUDA: CUDA Version 12.8 - vLLM: v0.9.2 ### NPU - OS: Linux - Python: >= 3.9, < 3.12 - NPU: Atlas 800 A2/A3 series - CANN: CANN Version 8.1.RC1 - vLLM: v0.9.2 - vLLM Ascend: v0.9.2rc1 ## Installation Before you start with UCM, please make sure that you have installed UCM correctly by following the [GPU Installation](./installation_gpu.md) guide or [NPU Installation](./installation_npu.md) guide. ## Features Overview UCM supports two key features: **Prefix Cache** and **Sparse attention**. Each feature supports both **Offline Inference** and **Online API** modes. For quick start, just follow the [usage](./quick_start.md) guide below to launch your own inference experience; For further research on Prefix Cache, more details are available via the link below: - [Prefix Cache](../user-guide/prefix-cache/index.md) Various Sparse Attention features are now available, try GSA Sparsity via the link below: - [GSA Sparsity](../user-guide/sparse-attention/gsa.md) ## Usage

Offline Inference

You can use our official offline example script to run offline inference as following commands: ```bash cd examples/ # Change the model path to your own model path export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct python offline_inference.py ```

OpenAI-Compatible Online API

For online inference , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. First, specify the python hash seed by: ```bash export PYTHONHASHSEED=123456 ``` Create a config yaml like following and save it to your own directory: ```yaml # UCM Configuration File Example # Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details ucm_connector_name: "UcmNfsStore" ucm_connector_config: storage_backends: "/mnt/test" ``` Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model and your config file path: ```bash # Change the model path to your own model path export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct vllm serve ${MODEL_PATH} \ --served-model-name vllm_cpu_offload \ --max-model-len 20000 \ --tensor-parallel-size 2 \ --gpu_memory_utilization 0.87 \ --trust-remote-code \ --port 7800 \ --kv-transfer-config \ '{ "kv_connector": "UCMConnector", "kv_connector_module_path": "ucm.integration.vllm.ucm_connector", "kv_role": "kv_both", "kv_connector_extra_config": { "UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml" } }' ``` If you see log as below: ```bash INFO: Started server process [32890] INFO: Waiting for application startup. INFO: Application startup complete. ``` Congratulations, you have successfully started the vLLM server with UCM! After successfully started the vLLM server，You can interact with the API as following: ```bash curl http://localhost:7800/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "vllm_cpu_offload", "prompt": "Shanghai is a", "max_tokens": 7, "temperature": 0 }' ```

Note: If you want to disable vLLM prefix cache to test the cache ability of UCM, you can add `--no-enable-prefix-caching` to the command line.