🌟 GSA: Geometric Sparse Attention for Efficient Inference of LLMs#
🔍 Overview#
GSA (Geometric Sparse Attention) simultaneously tackles the high computational complexity of long sequences and the concurrency limitations imposed by the HBM capacity wall. UCM GSA aims to develop a sparse framework compatible with mainstream inference engines, incorporating sparse representation algorithms, offloading and prefetching mechanisms, and collaborative XPU-CPU execution.
🎯 Key Innovations#
Representation-based Sparse Selection✅: To reduce the complexity of sparsity selection, we introduce a lightweight Sparsity Selector that pre-computes per-block representational scores during the Prefill phase and reuses them for zero-overhead top-k pruning in the Decode phase.
Cross-hardware Support✅: To ensure cross-platform portability of GSA across heterogeneous accelerators (e.g., NVIDIA GPUs and Huawei Ascend NPUs), we introduce a Top-K offloading engine that asynchronously offloads attention queries (Q) to CPU memory for decoupled sparse selection computations.
Efficient KV Transition⌛: We have designed a PrefetchEngine to orchestrate KV-cache offloading and prefetching, incorporating three key components: (1) sparse-block metadata management, (2) asynchronous prefetch worker threads, and (3) adaptive prefetch algorithms.
Request-level Sparse Strategy(Not yet supported ❎): We plan to design a sparse-policy module that, for every incoming request, perform a fast distribution estimation and then decides the optimal sparsification strategy.
P+D Multi-stage Sparsity(Not yet supported ❎): We plan to introduce layer-wise sparsification in the pre-fill stage to reduce TTFT for workloads with short decode lengths.
🔥 Key Results#
In both performance and accuracy evaluations, we deployed the DeepSeek-R1-Distill-Qwen-32B model on two H20 GPUs.
🏆 Performance Highlights#
End-to-End Performance with 80 % Prefix-Cache Hit Ratio#
Below are the end-to-end throughput results for inference scenarios without KVCache offloading. PC Baseline refers to the full attention method with an 80% prefix cache hit rate. The GSA method sparsifies each input request to 6K tokens, and in the experiments, each request generates 4K tokens of output.
End-to-End Performance with 80 % Prefix-Cache Hit Ratio (HBM-bound scenario)#
Below are the end-to-end results of boosting inference concurrency through KV-Cache off-loading and prefetching under HBM-bound workloads; please note that this feature is not yet fully supported in the current open-source release, and we will make it available as soon as possible.
📈 Accuracy Benchmarks#
Inference Accuracy with Various Tasks#
As shown in the table below, we evaluated full attention and the GSA algorithm across multiple datasets for single-document QA, multi-document QA, and summarization tasks. The GSA method employs a mean-based block representation along with q-offloaded CPU top-k computation. In this experiment, we select requests longer than 4k from the datasets and set the sparsification ratio to 30%.
| Dataset | NarrativeQA | MFQA_ZH | HotpotQA | DuReader_ZH | GovReport | VCSUM_ZH | Average |
|---|---|---|---|---|---|---|---|
| Full Attention | 23.01 | 54.97 | 39.8 | 24.86 | 24.45 | 15.13 | 30.37 |
| GSA(Mean) | 22.42 | 52.95 | 36.99 | 24.32 | 23.28 | 14.4 | 29.06 |
🚦 Quick Start#
Basic Usage#
Similar to UCM’s offline_inference_esa.py examples. We only need to specify ucm_sparse_method to be GSA as shown below.
...
ktc = KVTransferConfig(
kv_connector=name,
kv_connector_module_path="ucm.integration.vllm.ucm_connector",
kv_role="kv_both",
kv_connector_extra_config={
"ucm_connector_name": "UcmNfsStore",
"ucm_connector_config": {
"storage_backends": kv_store_path,
"transferStreamNumber":16
},
"ucm_sparse_config": {
"GSA": {}
}
}
)
...
Thus, an example command for launching the online LLM service is as follows:
export ENABLE_SPARSE=TRUE
vllm serve /home/models/DeepSeek-R1-Distill-Qwen-32B \
--served-model-name DeepSeek-R1-Distill-Qwen-32B \
--max-model-len 131000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
--trust-remote-code \
--port 8090 \
--block-size 128 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
"kv_connector": name,
"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"ucm_connector_name": "UcmNfsStore",
"ucm_connector_config": {
"storage_backends": kv_store_path,
"transferStreamNumber":16
},
"ucm_sparse_config": {
"GSA": {}
}
}
}'
📊 Supported Models#
Model |
Size |
Support |
|---|---|---|
Qwen3-14B |
14B |
✅ |
DeepSeek-R1-Distill-Qwen-14B |
14B |
✅ |
Qwen3-32B |
32B |
✅ |
QwQ-32B |
32B |
✅ |
DeepSeek-R1-Distill-Qwen-32B |
32B |
✅ |
🤝 Contributing#
We welcome contributions! Please see our Contributing Guide for details.