This repository contains the source code for testing large language models on a set of standard benchmark datasets, developed by the Network Information Center of Shanghai Jiao Tong University. The testing scripts are based on the official dataset code from GitHub, with modifications limited to the model-invocation code so that it works with the new large-model architecture.


Dataset-Performance-Test-for-LLMs

This project is the test code repository of the Shanghai Jiao Tong University 交我算 "Lightweight LLM Evaluation" project. It brings together the official code of mainstream evaluation datasets for multimodal and text large models, adapted by us to run against locally deployed models. The code covers:

Multimodal LLM evaluation benchmarks

  • MME: comprehensive benchmark for visual perception and cognition

  • MM-Vet: evaluation of integrated vision-language capabilities

  • MMMU: college-level, multi-discipline multimodal understanding and reasoning benchmark

  • MathVista: mathematical reasoning test set

  • POPE: object-hallucination evaluation tool

Text LLM evaluation benchmarks

  • MMLU: English multi-discipline knowledge benchmark

  • C-Eval: Chinese multi-discipline knowledge benchmark

  • MATH-500: competition-level math problem evaluation

  • HumanEval: code-generation capability test

  • GPQA-Diamond: PhD-level science question understanding and reasoning evaluation
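The adaptation pattern this repository describes — official benchmark code with only the model call swapped out for a locally deployed model — might look roughly like the sketch below. Every name here (LocalModel, run_benchmark, generate) is illustrative, not taken from this repository.

```python
# Rough sketch of the adaptation pattern: the benchmark loop stays as in
# the official dataset code, and only the model call is replaced so that
# it targets a locally deployed model.

class LocalModel:
    """Minimal stand-in for a locally deployed large model (hypothetical)."""

    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str) -> str:
        # A real adapter would forward `prompt` to the local inference
        # service (e.g. an NPU-backed endpoint) and return the decoded text.
        return f"[{self.name}] answer to: {prompt}"


def run_benchmark(model: LocalModel, questions: list[str]) -> list[str]:
    # The evaluation harness iterates over dataset questions unchanged;
    # changing models means changing only the `model` object passed in.
    return [model.generate(q) for q in questions]


if __name__ == "__main__":
    model = LocalModel("local-llm")
    print(run_benchmark(model, ["What is 1 + 1?"]))
```

Keeping the dataset loop untouched and isolating the model call this way is what lets the same harness run against different local deployments.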

Project directory structure


├── Text_Understanding_tests/    # Text-model test code
│   ├── MATH-500-test/
│   │   ├── main.py              # entry-point script
│   │   └── utils/               # original utility code
│   ├── C-Eval-test/
│   │   └── ...
│   ├── gpqa-test/
│   │   └── ...
│   ├── HumanEval-test/
│   │   └── ...
│   └── MMLU-test/
│       └── ...
└── Image_Understanding_tests/   # Image-understanding model test code
    └── ...

Test environment

ARM + Ascend NPU
OS: openEuler 22.03 LTS
CPU: Kunpeng 920
NPU: Ascend 910B

Using the evaluation toolkit

Image environment

Go to the Ascend Community (昇腾社区) Development Resources page and download the image package adapted to the target model, e.g. 1.0.0-800I-A2-py311-openeuler24.03-lts

Test example

Taking the MATH-500 test set as an example:

python ~/LLM-datasets/MATH-500-test/main.py
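Since each benchmark exposes a main.py entry point in its own subdirectory, a small driver can run several of them in one go. This is a hedged sketch, not part of the repository; the root path below is illustrative and should be adjusted to wherever the test directories actually live.

```python
import subprocess
from pathlib import Path


def find_entry_points(root: Path) -> list[Path]:
    """Collect each per-benchmark main.py directly under the root."""
    return sorted(root.glob("*/main.py"))


def run_all(root: Path) -> None:
    # Run each benchmark's entry point as its own process; check=True
    # stops at the first failing benchmark by raising CalledProcessError.
    for main_py in find_entry_points(root):
        subprocess.run(["python", str(main_py)], check=True)


if __name__ == "__main__":
    run_all(Path.home() / "LLM-datasets")
```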
