Does V100 support flash-attention? #13010
lingyezhixing asked this question in Q&A (unanswered)

My device environment is a 4060 laptop connected to a V100 via USB4. The community informed me that the V100 does not support flash-attention. However, when I include -fa in the parameters, the speed improves significantly and memory usage drops considerably, regardless of whether the model runs on the 4060, on the V100, or split across both GPUs. Moreover, the output is normal, with no repetition or interruptions. What is going on here?
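For context, here is a minimal sketch of the kind of run described above, written against the llama-cpp-python bindings rather than the llama.cpp command line (an assumption on my part). The flash_attn, n_gpu_layers, and tensor_split arguments mirror the -fa, -ngl, and -ts command-line options; the model path and split ratios are placeholders.

```python
# Minimal sketch (assumption: llama-cpp-python bindings; model path and split ratios are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",   # placeholder model file
    n_gpu_layers=-1,             # offload all layers to GPU (CLI: -ngl)
    flash_attn=True,             # enable flash attention (CLI: -fa)
    tensor_split=[0.4, 0.6],     # split tensors across the 4060 and the V100 (CLI: -ts)
)

out = llm("Write one sentence about GPUs.", max_tokens=32)
print(out["choices"][0]["text"])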
Replies: 1 comment

I implemented FlashAttention, as described in the algorithm, for CUDA hardware in llama.cpp/ggml. This implementation is independent of the FlashAttention (2) library. The ggml implementation supports all CUDA hardware.
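To make the distinction concrete, below is a minimal NumPy sketch of the tiled, online-softmax attention that the FlashAttention algorithm describes (an illustration only, not the ggml CUDA kernel; the function name and block size are my own). The algorithm is ordinary blocked arithmetic with a running max and denominator, so nothing in it requires a particular GPU generation, and it never materializes the full attention matrix, which is consistent with the lower memory usage reported in the question.

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Blocked attention with online softmax (single head, no masking)."""
    n, d = Q.shape
    out = np.empty_like(Q)
    for qs in range(0, n, block):
        q = Q[qs:qs + block] / np.sqrt(d)          # scaled query block
        m = np.full(q.shape[0], -np.inf)           # running row-wise max
        l = np.zeros(q.shape[0])                   # running softmax denominator
        acc = np.zeros((q.shape[0], d))            # unnormalized output accumulator
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T             # scores for this key block
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)              # rescale previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

# Sanity check against naive attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), ref)
```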