hellaswag: display estimated score confidence interval #12797

stduhpf · 2025-04-07T12:34:17Z

Display the margin of error for the HellaSwag benchmark score, estimated with the Wilson score interval approximation. (The 95% confidence level is hardcoded.)
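For readers unfamiliar with the Wilson score interval, here is a minimal sketch of how such an interval can be computed from the number of correctly answered tasks. This is not the code from the PR: the names `score_interval`, `wilson_interval`, and `print_score` are hypothetical, and the default `z = 1.96` is an assumed critical value for a two-sided 95% confidence level.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>

// Hypothetical result type: interval bounds as fractions in [0, 1].
struct score_interval {
    double center; // Wilson-adjusted center (not the raw accuracy)
    double lower;
    double upper;
};

// Wilson score interval for n_correct successes out of n_total trials.
// z = 1.96 is an assumption corresponding to ~95% two-sided confidence.
score_interval wilson_interval(int n_correct, int n_total, double z = 1.96) {
    assert(n_total > 0 && n_correct >= 0 && n_correct <= n_total);

    const double n     = n_total;
    const double p     = n_correct / n;   // observed accuracy
    const double z2    = z * z;
    const double denom = 1.0 + z2 / n;

    const double center = (p + z2 / (2.0 * n)) / denom;
    const double half   = (z / denom) * std::sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n));

    // Clamp to guard against tiny floating-point excursions outside [0, 1].
    return { center, std::max(0.0, center - half), std::min(1.0, center + half) };
}

// Hypothetical helper: print the accuracy with its interval, in percent.
void print_score(int n_correct, int n_total) {
    const score_interval ci = wilson_interval(n_correct, n_total);
    std::printf("%d/%d\t%.2f%%\t[%.2f%%, %.2f%%]\n",
                n_correct, n_total,
                100.0 * n_correct / n_total,
                100.0 * ci.lower, 100.0 * ci.upper);
}
```

Unlike the plain normal approximation, the Wilson interval is asymmetric around the observed accuracy and stays within [0, 1] even when the score is close to 0% or 100%, which is presumably why it is a reasonable choice for a benchmark run on a limited number of tasks.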

* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

  Conflicts:
    common/arg.cpp
    common/common.cpp
    common/common.h

hellaswag: display estimated score confidence interval

dc81348

github-actions bot added the examples label Apr 7, 2025

ggerganov approved these changes Apr 7, 2025

ggerganov merged commit 4ccea21 into ggml-org:master Apr 7, 2025
51 of 55 checks passed

stduhpf deleted the hs-interval branch April 7, 2025 19:35
