Troubleshooting#
This page lists common error codes, log messages, and troubleshooting steps for UCM.
Error Codes#
UCM uses the following error codes in its underlying C++ stores. These codes appear in log messages and exception outputs.
| Error Code | Name | Meaning | Common Cause |
|---|---|---|---|
| 0 | | Success | Operation completed normally |
| -1 | | General error | Unclassified error; check the attached message |
| -50000 | InvalidParam | Invalid parameter | Parameter is illegal, empty, or has the wrong format |
| -50001 | OutOfMemory | Out of memory | Memory allocation failed |
| -50002 | OsApiError | OS API error | System call failed (e.g. file I/O, thread operation) |
| -50003 | | Duplicate key | Attempting to insert a key that already exists |
| -50004 | | Retry required | Transient error, retry recommended |
| -50005 | | Not found | Requested object or resource does not exist |
| -50006 | | Serialization failed | Error during data serialization |
| -50007 | | Deserialization failed | Error during data deserialization |
| -50008 | | Unsupported operation | Feature or operation not supported in current context |
| -50009 | NoSpace | No space | Storage space (disk/cache) is full |
| -50010 | Timeout | Timeout | Operation timed out |
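When triaging logs it can help to translate a numeric code back into its meaning. The following is a minimal sketch that copies the codes and meanings from the table above; the names UCM_ERROR_CODES, describe_error, and is_retryable are hypothetical helpers for illustration and are not part of UCM.

```python
# Codes and meanings copied from the Error Codes table on this page.
# This mapping is an illustration, not an API exported by UCM.
UCM_ERROR_CODES = {
    0: "Success",
    -1: "General error",
    -50000: "Invalid parameter",
    -50001: "Out of memory",
    -50002: "OS API error",
    -50003: "Duplicate key",
    -50004: "Retry required",
    -50005: "Not found",
    -50006: "Serialization failed",
    -50007: "Deserialization failed",
    -50008: "Unsupported operation",
    -50009: "No space",
    -50010: "Timeout",
}


def describe_error(code: int) -> str:
    """Return the table's meaning for a code, or a fallback for unknown codes."""
    return UCM_ERROR_CODES.get(code, f"Unknown error code {code}")


def is_retryable(code: int) -> bool:
    """Only -50004 (Retry required) is documented as a transient error."""
    return code == -50004
```

For example, describe_error(-50009) returns "No space".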
Common Issues#
Installation & Build Issues#
PLATFORM environment variable not set#
Log message:
WARNING: PLATFORM environment variable is not set!
Please set PLATFORM to one of: cuda, ascend, ascend-a3, musa, maca
Cause: The PLATFORM environment variable is required for building UCM. Without it, the build defaults to simu (simulation mode), which is intended only for CI testing.
Solution:
export PLATFORM=cuda # For CUDA
export PLATFORM=ascend # For Ascend A2
export PLATFORM=ascend-a3 # For Ascend A3
pip install -v -e . --no-build-isolation
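A wrapper script could check the variable before starting the build. The sketch below assumes the accepted values are exactly those listed in the warning message; check_platform is a hypothetical helper, not part of the UCM build system, and it deliberately rejects an unset value instead of falling back to simu.

```python
import os

# Values listed in the build warning message shown above.
VALID_PLATFORMS = ("cuda", "ascend", "ascend-a3", "musa", "maca")


def check_platform(env=None):
    """Return the PLATFORM value if it is valid, otherwise raise ValueError."""
    env = os.environ if env is None else env
    platform = env.get("PLATFORM", "")
    if platform not in VALID_PLATFORMS:
        raise ValueError(
            f"PLATFORM={platform!r} is not one of: {', '.join(VALID_PLATFORMS)}"
        )
    return platform
```

Calling check_platform() with no argument reads the current process environment.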
Build fails with CMake errors#
Cause: Missing build dependencies (cmake, gcc, or CUDA toolkit not in path).
Solution:
Ensure cmake >= 3.18 is installed: cmake --version
Ensure the CUDA toolkit is installed and nvcc is accessible: nvcc --version
For Ascend, ensure the CANN toolkit is properly installed
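The dependency checks above can be scripted. This is a rough sketch using only the Python standard library; the helper names are hypothetical, and the tool list (cmake, gcc, nvcc) simply mirrors the dependencies named in the cause above. It does not check the CANN toolkit.

```python
import re
import shutil
import subprocess


def parse_cmake_version(output: str):
    """Extract (major, minor) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def check_build_tools(tools=("cmake", "gcc", "nvcc")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def cmake_is_recent_enough(minimum=(3, 18)):
    """True if cmake is on PATH and reports at least the minimum version."""
    if shutil.which("cmake") is None:
        return False
    result = subprocess.run(
        ["cmake", "--version"], capture_output=True, text=True, check=False
    )
    version = parse_cmake_version(result.stdout)
    return version is not None and version >= minimum
```

Any tool mapped to None by check_build_tools() is missing from PATH and is a likely cause of the CMake failure.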
pip install fails with wheel build errors#
Cause: Building C++ extensions requires a proper compiler environment. On some systems, --no-build-isolation may fail if the build toolchain is not set up.
Solution: Use the pre-built Docker image or wheel package from PyPI instead.
Configuration Issues#
Error Code -50000 (InvalidParam)#
Typical log:
invalid param ... InvalidParam(...)
Common causes:
YAML configuration file has syntax errors
Required parameter is missing
Parameter value type is wrong (e.g. string where integer is expected)
Solution:
Check YAML syntax: validate with python -c "import yaml; yaml.safe_load(open('your_config.yaml'))"
Verify parameter names match the Pipeline Store or NFS Store documentation
Ensure the storage_backends path is a valid string, not empty
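Once the YAML parses, the type checks can be scripted too. The sketch below assumes the store parameters have already been loaded into a flat dict; the real UCM config may nest them differently, and treating cache_buffer_capacity_gb, posix_capacity_gb, and timeout_ms as numeric is an assumption based on how they are described on this page. validate_store_config is a hypothetical helper.

```python
def validate_store_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    # From this page: storage_backends must be a valid, non-empty string.
    path = config.get("storage_backends")
    if not isinstance(path, str) or not path.strip():
        problems.append("storage_backends must be a non-empty string")

    # Assumption: these parameters are numeric. bool is rejected explicitly
    # because bool is a subclass of int in Python.
    for key in ("cache_buffer_capacity_gb", "posix_capacity_gb", "timeout_ms"):
        value = config.get(key)
        if value is None:
            continue  # treated as optional here; the real schema may differ
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} must be a number, got {type(value).__name__}")
    return problems
```

A quoted value such as timeout_ms: "30000" is the kind of string-where-integer mistake this catches.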
UCM_CONFIG_FILE not found#
Log message:
UCM config file not found: <path>
Cause: The config file path in kv_connector_extra_config is incorrect or the file does not exist.
Solution: Verify the file path exists and is accessible from the container or process. In Docker, ensure the file is mounted or copied into the container.
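To tell a wrong path apart from a permissions problem, a check like the following can be run from inside the container or process that launches vLLM. It is a small sketch using only the standard library; check_config_file is a hypothetical helper.

```python
import os


def check_config_file(path: str) -> str:
    """Return 'ok' or a short reason why the config file cannot be used."""
    if not path:
        return "path is empty"
    if not os.path.exists(path):
        return "file does not exist"
    if not os.path.isfile(path):
        return "path is not a regular file"
    if not os.access(path, os.R_OK):
        return "file is not readable by this process"
    return "ok"
```

In Docker, a "file does not exist" result usually means the file was not mounted or copied into the container.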
Unsupported connector type#
Log message:
Unsupported connector type: <name>
Cause: The ucm_connector_name in the config is not a registered connector.
Solution: Use one of the supported connector names: UcmPipelineStore or UcmNfsStore.
Unknown store pipeline#
Log message:
unknown store pipeline: <name>
Cause: The store_pipeline value is not registered in the PipelineStore registry.
Solution: Use a registered pipeline such as "Cache|Posix", "Cache|Ds3fs", "Cache|Compress|Posix", etc. See ucm/store/pipeline/connector.py for all registered pipelines.
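Both name checks can be done before launch. The sketch below uses only the connector names and example pipelines given on this page; the pipeline set is incomplete by design, since the full registry lives in ucm/store/pipeline/connector.py. The helper names are hypothetical, and exact, case-sensitive matching is an assumption.

```python
# Connector names listed as supported on this page.
SUPPORTED_CONNECTORS = {"UcmPipelineStore", "UcmNfsStore"}

# Only the examples named on this page; the registry contains more entries.
EXAMPLE_PIPELINES = {"Cache|Posix", "Cache|Ds3fs", "Cache|Compress|Posix"}


def check_connector(name: str) -> bool:
    """True if the connector name is one of the two documented names."""
    return name in SUPPORTED_CONNECTORS


def pipeline_stages(store_pipeline: str) -> list:
    """Split a pipeline string such as 'Cache|Posix' into its stage names."""
    return [stage.strip() for stage in store_pipeline.split("|")]
```

A pipeline missing from EXAMPLE_PIPELINES is not necessarily invalid; check the registry file before treating it as an error.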
Runtime Issues#
Error Code -50001 (OutOfMemory)#
Common causes:
cache_buffer_capacity_gb is set too large for available /dev/shm
Multiple DP instances on the same node each allocate their own CacheStore buffer
MLA models require more shared memory
Solution:
Check system memory: free -h
Check shared memory: df -h /dev/shm
Reduce cache_buffer_capacity_gb in the config (default is 256 GB)
When running multiple DP instances on one node, reduce the value proportionally
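The proportional reduction can be worked out from the free space in /dev/shm. This is a rough sketch: the 80% headroom factor is an assumption chosen for illustration, not a UCM recommendation, and both function names are hypothetical.

```python
import shutil


def read_shm_free_gb(path: str = "/dev/shm") -> float:
    """Free space at path in GiB, as reported by the filesystem."""
    return shutil.disk_usage(path).free / 1024**3


def per_instance_buffer_gb(
    shm_free_gb: float, dp_instances: int, headroom: float = 0.8
) -> int:
    """Split usable shared memory evenly across DP instances, rounded down."""
    if dp_instances < 1:
        raise ValueError("dp_instances must be at least 1")
    return int(shm_free_gb * headroom / dp_instances)
```

With 256 GB free and 4 DP instances on the node, per_instance_buffer_gb(256, 4) suggests 51 GB per instance.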
Error Code -50002 (OsApiError)#
Common causes:
The storage_backends directory does not exist
Permission denied on the storage directory
Disk I/O error or filesystem issue
Solution:
Verify the path exists: ls <storage_backends_path>
Check write permissions: touch <storage_backends_path>/test_file && rm <storage_backends_path>/test_file
For NFS mounts, verify the mount is healthy: mount | grep <path>
Check system logs: dmesg | tail or journalctl -xe
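The existence and write-permission checks can also be run from Python, for example inside the same container as the serving process. This is a minimal sketch; check_storage_dir is a hypothetical helper that mirrors the touch and rm test above.

```python
import os
import tempfile


def check_storage_dir(path: str) -> str:
    """Return 'ok' or a short reason why the storage directory is unusable."""
    if not os.path.isdir(path):
        return "directory does not exist"
    try:
        # Creates and removes a temporary file, like the touch/rm check above.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return f"write test failed: {exc.strerror}"
    return "ok"
```

Run it as the same user as the UCM process, since permissions may differ between users.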
Error Code -50009 (NoSpace)#
Common causes:
Disk space is full at the storage_backends path
posix_capacity_gb is set larger than actual disk capacity
GC is not enabled or not aggressive enough
Solution:
Check disk space: df -h <storage_backends_path>
Set posix_capacity_gb to a value matching your available disk capacity (e.g. 80% of actual capacity)
GC is automatically enabled when posix_capacity_gb > 0; verify it is set correctly
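The 80% guideline can be computed directly from the filesystem. The sketch below interprets "actual capacity" as the total size of the filesystem holding the storage path, which is an assumption; on a disk shared with other data, the free space may be the better basis. The function names are hypothetical.

```python
import shutil

GIB = 1024**3


def suggested_posix_capacity_gb(total_bytes: int, percent: int = 80) -> int:
    """Return percent of the capacity in whole GiB, using integer arithmetic."""
    return total_bytes * percent // 100 // GIB


def suggest_for_path(path: str) -> int:
    """Apply the guideline to the filesystem that contains path."""
    return suggested_posix_capacity_gb(shutil.disk_usage(path).total)
```

For a 1000 GiB filesystem, suggested_posix_capacity_gb(1000 * GIB) gives 800.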
Error Code -50010 (Timeout)#
Common causes:
Network or storage bandwidth is abnormal (e.g. NFS mount degraded)
System load is too high (CPU, I/O saturation)
timeout_ms is set too low for the workload
Solution:
Test storage bandwidth using vdbench or fio on the mount point or local disk to verify the environment is healthy
Check system load: top or htop
Check network: ping and iperf for NFS over network
Increase timeout_ms if needed (default 30000 ms)
For NFS mounts, check mount options; adding noac or adjusting actimeo may help
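When vdbench or fio is not available, a crude sequential-write probe can still show whether the storage path is badly degraded. The sketch below is not a substitute for those tools: it measures one buffered write followed by fsync, and page cache or NFS client behaviour can skew the result. write_throughput_mib_s is a hypothetical helper.

```python
import os
import tempfile
import time


def write_throughput_mib_s(directory: str, size_mib: int = 64) -> float:
    """Time a sequential write of size_mib MiB into directory, fsync included."""
    chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        for _ in range(size_mib):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # force the data out to the storage backend
        elapsed = time.perf_counter() - start
    # The temporary file is deleted when the with-block exits.
    return size_mib / elapsed
```

A result far below what the storage normally delivers suggests fixing the environment before raising timeout_ms.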
vLLM Integration Issues#
Prefix Caching not working (no cache hits)#
Log message:
request_id: xxx, total_blocks_num: N, hit hbm: 0, hit external: 0
Cause: First-time requests have no cached KV blocks. Subsequent requests with the same prefix should show hits.
Solution:
Run the same request twice; the second should hit cached blocks
Verify --no-enable-prefix-caching is not accidentally enabled (it disables vLLM’s native HBM prefix cache, used only for SSD benchmarking)
Verify the UCM_CONFIG_FILE path is correct in the launch command
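To confirm that the second request hits the cache, the hit counters can be extracted from the log line shown above. The regular expression in this sketch is an assumption derived from that single example line; the actual log format may carry extra fields. parse_hit_line is a hypothetical helper.

```python
import re

# Pattern derived from the example log line on this page.
HIT_LINE = re.compile(
    r"request_id: (?P<request_id>[^,\s]+), total_blocks_num: (?P<total>\d+), "
    r"hit hbm: (?P<hbm>\d+), hit external: (?P<external>\d+)"
)


def parse_hit_line(line: str):
    """Return the counters from a hit-statistics line, or None if no match."""
    match = HIT_LINE.search(line)
    if match is None:
        return None
    return {
        "request_id": match.group("request_id"),
        "total": int(match.group("total")),
        "hbm": int(match.group("hbm")),
        "external": int(match.group("external")),
    }
```

A first-time request is expected to report 0 for both hbm and external; a repeat of the same prefix should report a non-zero value for at least one.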
Unsupported device platform for UCMDirectConnector#
Cause: The current platform is neither CUDA nor Ascend NPU.
Solution: UCM currently supports CUDA and Ascend platforms only. Check the Support Matrix for details.
KV dump/load errors#
Typical logs:
dump kv cache failed. <error>
wait for dump kv cache failed. <error>
submit dump task failed. <error>
Cause: Store backend (Posix/NFS) encountered I/O errors during KV cache transfer.
Solution:
Check storage backend health (disk space, permissions, network)
Set UC_LOGGER_LEVEL=debug for detailed transfer logs:
[UC][D] Cache task(...) dispatching.
[UC][D] Posix task(...) dispatching.
Check if the error code matches one from the Error Codes table above
Log Debugging#
To view detailed UCM transfer logs, set the environment variable:
export UC_LOGGER_LEVEL=debug
This produces per-task logs with task IDs, operation types, data sizes, and timing:
[UC][D] Cache task({task_id},{operation},{subtask_number},{size}) dispatching. [PID,TID]
[UC][D] Cache task({task_id},{operation},{subtask_number},{size}) finished, cost {time}ms. [PID,TID]
[UC][D] Posix task({task_id},{operation},{subtask_number},{size}) dispatching. [PID,TID]
[UC][D] Posix task({task_id},{operation},{subtask_number},{size}) finished, cost {time}ms. [PID,TID]
| Field | Meaning |
|---|---|
| task_id | Unique identifier for the store task |
| operation | Cache Store: |
| subtask_number | Number of subtasks in this operation |
| size | Total data size transferred in bytes |
| time | Execution time in ms |
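The "finished" lines can be aggregated to see which store layer accounts for most of the transfer time. The sketch below builds its pattern from the log templates above and assumes the time field is a plain decimal number; the sample values used with it are made up. total_cost_ms_by_store is a hypothetical helper.

```python
import re
from collections import defaultdict

# Pattern built from the "finished" log templates on this page.
FINISHED = re.compile(
    r"\[UC\]\[D\] (?P<store>\w+) task\((?P<task_id>[^,]+),(?P<operation>[^,]+),"
    r"(?P<subtasks>[^,]+),(?P<size>[^)]+)\) finished, cost (?P<ms>\d+(?:\.\d+)?)ms"
)


def total_cost_ms_by_store(lines):
    """Sum the reported cost in ms per store name (e.g. Cache, Posix)."""
    totals = defaultdict(float)
    for line in lines:
        match = FINISHED.search(line)
        if match:  # "dispatching" lines and unrelated lines are skipped
            totals[match.group("store")] += float(match.group("ms"))
    return dict(totals)
```

Pass it any iterable of lines, such as an open log file; lines that do not match the pattern are ignored.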
To customize the log directory:
export UCM_LOG_PATH=my_log_dir
By default, logs are written to the log directory, relative to the directory from which the vLLM service was started.