Troubleshooting#

This page lists common error codes, log messages, and troubleshooting steps for UCM.

Error Codes#

UCM uses the following error codes in its underlying C++ stores. These codes appear in log messages and exception outputs.

Error Code	Name	Meaning	Common Cause
0	`OK`	Success	Operation completed normally
-1	`Error`	General error	Unclassified error, check the attached message
-50000	`InvalidParam`	Invalid parameter	Parameter is illegal, empty, or has wrong format
-50001	`OutOfMemory`	Out of memory	Memory allocation failed
-50002	`OsApiError`	OS API error	System call failed (e.g. file I/O, thread operation)
-50003	`DuplicateKey`	Duplicate key	Attempting to insert a key that already exists
-50004	`Retry`	Retry required	Transient error, retry recommended
-50005	`NotFound`	Not found	Requested object or resource does not exist
-50006	`SerializeFailed`	Serialization failed	Error during data serialization
-50007	`DeserializeFailed`	Deserialization failed	Error during data deserialization
-50008	`Unsupported`	Unsupported operation	Feature or operation not supported in current context
-50009	`NoSpace`	No space	Storage space (disk/cache) is full
-50010	`Timeout`	Timeout	Operation timed out
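
When scanning logs, it can help to translate a numeric code back to its name. The following is a minimal Python sketch that mirrors the table above. The `UcmStatus` enum and the `describe` helper are hypothetical names used for illustration; they are not part of UCM's API.

```python
from enum import IntEnum


class UcmStatus(IntEnum):
    """Error codes copied from the table above (illustrative, not UCM's own class)."""
    OK = 0
    Error = -1
    InvalidParam = -50000
    OutOfMemory = -50001
    OsApiError = -50002
    DuplicateKey = -50003
    Retry = -50004
    NotFound = -50005
    SerializeFailed = -50006
    DeserializeFailed = -50007
    Unsupported = -50008
    NoSpace = -50009
    Timeout = -50010


def describe(code: int) -> str:
    """Return 'Name (code)' for a listed code, or a fallback for an unlisted one."""
    try:
        return f"{UcmStatus(code).name} ({code})"
    except ValueError:
        return f"Unknown ({code})"
```

For example, `describe(-50009)` returns `NoSpace (-50009)`, which points to the NoSpace entry under Runtime Issues.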

Common Issues#

Installation & Build Issues#

`PLATFORM` environment variable not set#

Log message:

WARNING: PLATFORM environment variable is not set!
Please set PLATFORM to one of: cuda, ascend, ascend-a3, musa, maca

Cause: The PLATFORM environment variable is required to build UCM. Without it, the build falls back to simu (simulation mode), which is intended only for CI testing.

Solution:

export PLATFORM=cuda         # CUDA; set exactly one PLATFORM value
# export PLATFORM=ascend     # Ascend A2
# export PLATFORM=ascend-a3  # Ascend A3
pip install -v -e . --no-build-isolation
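
A wrapper script can check the variable before starting a long build. This is a minimal sketch: `resolve_platform` is a hypothetical helper, the accepted values are taken from the warning message above, and rejecting an unset value (rather than silently using simu) is a choice made for this example. Whether UCM's build treats the value case-sensitively is not stated on this page, so the sketch compares it as written.

```python
import os

# Values listed in the build warning above.
SUPPORTED_PLATFORMS = ("cuda", "ascend", "ascend-a3", "musa", "maca")


def resolve_platform(environ=os.environ) -> str:
    """Return the PLATFORM value, or raise if it is unset or not a listed value."""
    value = environ.get("PLATFORM", "").strip()
    if not value:
        raise RuntimeError("PLATFORM is not set; the build would fall back to simu")
    if value not in SUPPORTED_PLATFORMS:
        raise RuntimeError(f"PLATFORM={value!r} is not one of {SUPPORTED_PLATFORMS}")
    return value
```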

Build fails with CMake errors#

Cause: A build dependency is missing or not on PATH (cmake, gcc, or the CUDA toolkit).

Solution:

Ensure cmake >= 3.18 is installed: cmake --version
Ensure the CUDA toolkit is installed and nvcc is accessible: nvcc --version
For Ascend, ensure the CANN toolkit is properly installed
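
The version check above can also be scripted. The sketch below assumes the usual first line of `cmake --version` output (`cmake version X.Y.Z`); the function names are hypothetical and the 3.18 minimum comes from the step above.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 18)  # minimum version named in the step above


def parse_cmake_version(output: str):
    """Extract (major, minor) from `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized cmake output: {output!r}")
    return int(match.group(1)), int(match.group(2))


def cmake_is_new_enough() -> bool:
    """Return True if cmake is on PATH and its version is at least MIN_CMAKE."""
    if shutil.which("cmake") is None:
        return False
    result = subprocess.run(
        ["cmake", "--version"], capture_output=True, text=True, check=True
    )
    return parse_cmake_version(result.stdout) >= MIN_CMAKE
```

Versions are compared as integer tuples, so 3.9 correctly sorts below 3.18, which a plain string comparison would get wrong.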

`pip install` fails with wheel build errors#

Cause: Building C++ extensions requires a proper compiler environment. On some systems, --no-build-isolation may fail if the build toolchain is not set up.

Solution: Use the pre-built Docker image or wheel package from PyPI instead.

Configuration Issues#

Error Code -50000 (InvalidParam)#

Typical log:

invalid param ... InvalidParam(...)

Common causes:

YAML configuration file has syntax errors
Required parameter is missing
Parameter value type is wrong (e.g. string where integer is expected)

Solution:

Check YAML syntax: validate with python -c "import yaml; yaml.safe_load(open('your_config.yaml'))"
Verify parameter names match the Pipeline Store or NFS Store documentation
Ensure the storage_backends path is a valid, non-empty string
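
Beyond syntax, a few type checks can catch the causes listed above before UCM starts. The sketch below works on an already loaded configuration and makes two assumptions that this page does not confirm: that the parameters sit in one flat mapping, and that `cache_buffer_capacity_gb`, `posix_capacity_gb`, and `timeout_ms` are numeric. `find_config_problems` is a hypothetical helper; adapt the key lookup to the actual layout of your file.

```python
def find_config_problems(config: dict) -> list:
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []

    # storage_backends must be a non-empty string (see the step above).
    backends = config.get("storage_backends")
    if not isinstance(backends, str) or not backends.strip():
        problems.append("storage_backends must be a non-empty string")

    # Assumed numeric settings; bool is rejected because it is an int subclass.
    for key in ("cache_buffer_capacity_gb", "posix_capacity_gb", "timeout_ms"):
        if key not in config:
            continue
        value = config[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} must be a number, got {type(value).__name__}")
        elif value < 0:
            problems.append(f"{key} must not be negative")
    return problems
```

A quoted value such as `timeout_ms: "30000"` loads as a string in YAML and would be reported here, which matches the "string where integer is expected" cause.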

`UCM_CONFIG_FILE` not found#

Log message:

UCM config file not found: <path>

Cause: The config file path in kv_connector_extra_config is incorrect or the file does not exist.

Solution: Verify the file path exists and is accessible from the container or process. In Docker, ensure the file is mounted or copied into the container.

Unsupported connector type#

Log message:

Unsupported connector type: <name>

Cause: The ucm_connector_name in the config is not a registered connector.

Solution: Use one of the supported connector names: UcmPipelineStore or UcmNfsStore.

Unknown store pipeline#

Log message:

unknown store pipeline: <name>

Cause: The store_pipeline value is not registered in the PipelineStore registry.

Solution: Use a registered pipeline such as "Cache|Posix", "Cache|Ds3fs", or "Cache|Compress|Posix". See ucm/store/pipeline/connector.py for all registered pipelines.

Runtime Issues#

Error Code -50001 (OutOfMemory)#

Common causes:

cache_buffer_capacity_gb is set too large for available /dev/shm
Multiple DP instances on the same node each allocate their own CacheStore buffer
MLA models require more shared memory

Solution:

Check system memory: free -h
Check shared memory: df -h /dev/shm
Reduce cache_buffer_capacity_gb in the config (default is 256 GB)
When running multiple DP instances on one node, reduce the value proportionally
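
The proportional reduction can be worked out from the free space in /dev/shm. The following is a rough sketch, not a UCM tool: the helper names are hypothetical, the 80% headroom is an arbitrary safety margin, and it assumes that 1 GB in the config means 1024^3 bytes.

```python
import shutil

GIB = 1024 ** 3  # assumption: the config's "GB" is treated as 1024**3 bytes


def per_instance_buffer_gb(shm_free_bytes: int, dp_instances: int,
                           headroom_percent: int = 80) -> int:
    """Split a percentage of free shared memory evenly across DP instances (whole GB)."""
    if dp_instances < 1:
        raise ValueError("dp_instances must be at least 1")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in 1..100")
    usable_gb = shm_free_bytes * headroom_percent // 100 // GIB
    return usable_gb // dp_instances


def suggest_for_this_node(dp_instances: int) -> int:
    """Read the free space of /dev/shm on this machine and apply the split."""
    return per_instance_buffer_gb(shutil.disk_usage("/dev/shm").free, dp_instances)
```

With 512 GB free and four DP instances, this gives 102 GB per instance, well below the 256 GB default.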

Error Code -50002 (OsApiError)#

Common causes:

The storage_backends directory does not exist
Permission denied on the storage directory
Disk I/O error or filesystem issue

Solution:

Verify the path exists: ls <storage_backends_path>
Check write permissions: touch <storage_backends_path>/test_file && rm <storage_backends_path>/test_file
For NFS mounts, verify the mount is healthy: mount | grep <path>
Check system logs: dmesg | tail or journalctl -xe
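
The existence and write-permission checks above can be combined into one function, for example to run inside the container before launching the service. This is a minimal sketch; `check_storage_dir` is a hypothetical helper and only mirrors the ls and touch/rm steps.

```python
import os
import tempfile


def check_storage_dir(path: str) -> str:
    """Return 'ok' or a short description of what is wrong with the directory."""
    if not os.path.isdir(path):
        return "missing: path does not exist or is not a directory"
    try:
        # Create and delete a temporary file, like the touch/rm check above.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return f"not writable: {exc.strerror or exc}"
    return "ok"
```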

Error Code -50009 (NoSpace)#

Common causes:

Disk space is full at the storage_backends path
posix_capacity_gb is set larger than actual disk capacity
GC is not enabled or not aggressive enough

Solution:

Check disk space: df -h <storage_backends_path>
Set posix_capacity_gb to a value that fits the available disk capacity (e.g. 80% of the actual capacity)
GC is enabled automatically when posix_capacity_gb > 0; verify that the value is set correctly
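
As a worked example of the 80% guideline, the value can be derived from the filesystem's total size. The sketch below uses hypothetical helper names and assumes that 1 GB in the config means 1024^3 bytes. It uses integer arithmetic so that the result is never rounded up past the disk size.

```python
import shutil

GIB = 1024 ** 3  # assumption: the config's "GB" is treated as 1024**3 bytes


def suggested_posix_capacity_gb(total_bytes: int, percent: int = 80) -> int:
    """Return `percent` of the total capacity, rounded down to whole GB."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in 1..100")
    return total_bytes * percent // 100 // GIB


def suggest_for_path(storage_backends_path: str) -> int:
    """Apply the guideline to the filesystem that holds the given path."""
    return suggested_posix_capacity_gb(shutil.disk_usage(storage_backends_path).total)
```

For a 1000 GB filesystem this yields `posix_capacity_gb: 800`.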

Error Code -50010 (Timeout)#

Common causes:

Network or storage bandwidth is abnormal (e.g. NFS mount degraded)
System load is too high (CPU, I/O saturation)
timeout_ms is set too low for the workload

Solution:

Test storage bandwidth using vdbench or fio on the mount point or local disk to verify the environment is healthy
Check system load: top or htop
Check the network with ping and iperf when NFS is accessed over the network
Increase timeout_ms if needed (default 30000 ms)
For NFS mounts, check the mount options; adding noac or adjusting actimeo may help
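
Before running a full fio or vdbench job, a quick probe can show whether a single synchronous write to the storage path is anywhere near timeout_ms. This is only a coarse sanity check under stated assumptions: `time_write_ms` is a hypothetical helper, the 4 MiB payload is arbitrary, and one write says little about sustained bandwidth.

```python
import os
import tempfile
import time


def time_write_ms(directory: str, size_bytes: int = 4 * 1024 * 1024) -> float:
    """Write random bytes to a temporary file in `directory`, fsync, and return elapsed ms."""
    payload = os.urandom(size_bytes)  # random data avoids sparse or compressed writes
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # force the data to the backing store
    return (time.perf_counter() - start) * 1000.0
```

If a small write like this already takes a large share of timeout_ms, the storage path is the likely bottleneck and raising the timeout only hides the problem.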

vLLM Integration Issues#

Prefix Caching not working (no cache hits)#

Log message:

request_id: xxx, total_blocks_num: N, hit hbm: 0, hit external: 0

Cause: First-time requests have no cached KV blocks. Subsequent requests with the same prefix should show hits.

Solution:

Run the same request twice; the second run should hit cached blocks
Verify that --no-enable-prefix-caching is not set by accident (the flag disables vLLM's native HBM prefix cache and is intended only for SSD benchmarking)
Verify that the UCM_CONFIG_FILE path is correct in the launch command
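
To compare the first and second run, the hit counters can be pulled out of the log with a regular expression. The sketch below assumes the line looks exactly like the log message above, with integer counters; `parse_hit_line` is a hypothetical helper.

```python
import re

# Pattern modeled on the log message shown above.
HIT_LINE = re.compile(
    r"request_id: (?P<request_id>[^,\s]+), total_blocks_num: (?P<total>\d+), "
    r"hit hbm: (?P<hbm>\d+), hit external: (?P<external>\d+)"
)


def parse_hit_line(line: str):
    """Return (request_id, total_blocks, hbm_hits, external_hits), or None if no match."""
    match = HIT_LINE.search(line)
    if match is None:
        return None
    return (
        match["request_id"],
        int(match["total"]),
        int(match["hbm"]),
        int(match["external"]),
    )
```

A second run of the same request that still reports zero for both counters suggests a configuration problem rather than a cold cache.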

`Unsupported device platform for UCMDirectConnector`#

Cause: The current platform is neither CUDA nor Ascend NPU.

Solution: UCM currently supports CUDA and Ascend platforms only. Check the Support Matrix for details.

KV dump/load errors#

Typical logs:

dump kv cache failed. <error>
wait for dump kv cache failed. <error>
submit dump task failed. <error>

Cause: The store backend (Posix/NFS) encountered I/O errors during KV cache transfer.

Solution:

Check storage backend health (disk space, permissions, network)

Set UC_LOGGER_LEVEL=debug for detailed transfer logs:

[UC][D] Cache task(...) dispatching.
[UC][D] Posix task(...) dispatching.

Check if the error code matches one from the Error Codes table above

Log Debugging#

To view detailed UCM transfer logs, set the environment variable:

export UC_LOGGER_LEVEL=debug

This produces per-task logs with task IDs, operation types, data sizes, and timing:

[UC][D] Cache task({task_id},{operation},{subtask_number},{size}) dispatching. [PID,TID]
[UC][D] Cache task({task_id},{operation},{subtask_number},{size}) finished, cost {time}ms. [PID,TID]
[UC][D] Posix task({task_id},{operation},{subtask_number},{size}) dispatching. [PID,TID]
[UC][D] Posix task({task_id},{operation},{subtask_number},{size}) finished, cost {time}ms. [PID,TID]

Field	Meaning
`task_id`	Unique identifier for the store task
`operation`	Cache Store: `DUMP` (Device→Host) or `LOAD` (Host→Device); Posix Store: `Cache2Backend` (dump to storage) or `Backend2Cache` (load from storage)
`subtask_number`	Number of subtasks in this operation
`size`	Total data size transferred in bytes
`cost`	Execution time in ms (the {time} value in the log line)
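
The "finished" lines carry enough information to estimate per-task throughput. The following sketch parses them with a regular expression built from the templates above. The real formatting of the individual fields (for example whether `cost` is an integer or a decimal, or whether spaces follow the commas) is an assumption, and `parse_finished` is a hypothetical helper.

```python
import re

# Pattern modeled on the "finished" templates above; optional spaces are tolerated.
TASK_FINISHED = re.compile(
    r"\[UC\]\[D\] (?P<store>\w+) task\("
    r"(?P<task_id>[^,]+),\s*(?P<operation>[^,]+),\s*"
    r"(?P<subtasks>\d+),\s*(?P<size>\d+)\) finished, "
    r"cost (?P<cost_ms>\d+(?:\.\d+)?)ms"
)


def parse_finished(line: str):
    """Return a dict of task fields for a 'finished' line, or None if no match."""
    match = TASK_FINISHED.search(line)
    if match is None:
        return None
    size = int(match["size"])
    cost_ms = float(match["cost_ms"])
    return {
        "store": match["store"],
        "task_id": match["task_id"],
        "operation": match["operation"],
        "subtasks": int(match["subtasks"]),
        "size_bytes": size,
        "cost_ms": cost_ms,
        # Throughput in MB/s (10**6 bytes per second); None if cost is 0 ms.
        "mb_per_s": (size / 1e6) / (cost_ms / 1e3) if cost_ms > 0 else None,
    }
```

Comparing `mb_per_s` for Posix tasks against the bandwidth measured with fio or vdbench shows whether slow transfers originate in the storage backend.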

To customize the log directory:

export UCM_LOG_PATH=my_log_dir

By default, logs are written to the log directory under the working directory from which the vLLM service was started.
