Does V100 support flash-attention? #13010
lingyezhixing asked this question in Q&A (unanswered)

My device environment is a 4060 laptop connected to a V100 via USB4. The community informed me that the V100 does not support flash-attention. However, when I include -fa in the parameters, the speed improves significantly and memory usage drops considerably, regardless of whether the model runs on the 4060, on the V100, or split across both GPUs. Moreover, the output is normal, with no repetition or interruptions. What is going on here?
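For context, here is a minimal sketch of the kind of run described above, written against the llama-cpp-python bindings rather than the llama.cpp command line (an assumption on my part). The flash_attn, n_gpu_layers, and tensor_split arguments mirror the -fa, -ngl, and -ts command-line options; the model path and split ratios are placeholders.

```python
# Minimal sketch (assumption: llama-cpp-python bindings; model path and split ratios are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",   # placeholder model file
    n_gpu_layers=-1,             # offload all layers to GPU (CLI: -ngl)
    flash_attn=True,             # enable flash attention (CLI: -fa)
    tensor_split=[0.4, 0.6],     # split tensors across the 4060 and the V100 (CLI: -ts)
)

out = llm("Write one sentence about GPUs.", max_tokens=32)
print(out["choices"][0]["text"])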
Replies: 1 comment

I implemented FlashAttention, as described in the algorithm, for CUDA hardware in llama.cpp/ggml. This implementation is independent of the FlashAttention (2) library. The ggml implementation supports all CUDA hardware.
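To make the distinction concrete, below is a minimal NumPy sketch of the tiled, online-softmax attention that the FlashAttention algorithm describes (an illustration only, not the ggml CUDA kernel; the function name and block size are my own). The algorithm is ordinary blocked arithmetic with a running max and denominator, so nothing in it requires a particular GPU generation, and it never materializes the full attention matrix, which is consistent with the lower memory usage reported in the question.

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Blocked attention with online softmax (single head, no masking)."""
    n, d = Q.shape
    out = np.empty_like(Q)
    for qs in range(0, n, block):
        q = Q[qs:qs + block] / np.sqrt(d)          # scaled query block
        m = np.full(q.shape[0], -np.inf)           # running row-wise max
        l = np.zeros(q.shape[0])                   # running softmax denominator
        acc = np.zeros((q.shape[0], d))            # unnormalized output accumulator
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T             # scores for this key block
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)              # rescale previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

# Sanity check against naive attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), ref)
```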