The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
📚A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc.
📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Core kernels🎉, HGEMM, FA2 via MMA and CuTe, reaching 98~100% of cuBLAS/FA2 TFLOPS.
FlashInfer: Kernel Library for LLM Serving
MoBA: Mixture of Block Attention for Long-Context LLMs
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
📚FFPA (Split-D): Yet another faster FlashAttention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA.
Triton implementation of FlashAttention2 that adds Custom Masks.
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed in pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference.
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
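As a point of reference for what such a decode-stage kernel computes, here is a minimal pure-PyTorch sketch of single-token grouped-query attention over a KV cache; the function name and tensor shapes are illustrative, not this repository's API (MHA and MQA are the special cases of one query head per KV head and one shared KV head, respectively):

```python
import torch

def gqa_decode_reference(q, k_cache, v_cache):
    """Single-token decode attention with grouped-query heads (illustrative sketch).

    q:       (batch, num_q_heads, head_dim)         -- query for the new token
    k_cache: (batch, num_kv_heads, seq_len, head_dim)
    v_cache: (batch, num_kv_heads, seq_len, head_dim)
    Returns: (batch, num_q_heads, head_dim)
    """
    b, hq, d = q.shape
    hkv = k_cache.shape[1]
    group = hq // hkv                                # query heads per KV head

    # Broadcast each KV head across its group of query heads.
    k = k_cache.repeat_interleave(group, dim=1)      # (b, hq, s, d)
    v = v_cache.repeat_interleave(group, dim=1)

    scores = torch.einsum("bhd,bhsd->bhs", q, k) / d**0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v)

# Example: 32 query heads sharing 8 KV heads (GQA), 1024 cached tokens.
q = torch.randn(1, 32, 128)
k_cache = torch.randn(1, 8, 1024, 128)
v_cache = torch.randn(1, 8, 1024, 128)
out = gqa_decode_reference(q, k_cache, v_cache)      # (1, 32, 128)
```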
Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.
Python package for rematerialization-aware gradient checkpointing
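For context, the standard (non-rematerialization-aware) form of gradient checkpointing in PyTorch looks like the sketch below; the module and shapes are made up for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(8, 512, requires_grad=True)

# Activations inside `block` are discarded during the forward pass and
# recomputed (rematerialized) during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```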
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Utilities for efficient fine-tuning, inference and evaluation of code generation models
A simple PyTorch implementation of flash multi-head attention.
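Since PyTorch 2.0, a flash-style fused kernel is also reachable through the built-in scaled_dot_product_attention; a minimal sketch with illustrative tensor shapes:

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 1024, 64                    # batch, heads, seq len, head dim
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# On CUDA with fp16/bf16 inputs this call can dispatch to a fused
# FlashAttention kernel; on CPU it falls back to the math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # (B, H, S, D)
```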
🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations