Benchmarks

LOOM-Eval supports 23 comprehensive benchmarks across 7 cognitive dimensions.

Note

Configuration files and additional resources for each benchmark can be found in the GitHub repository.

Overview

Dimension	Benchmarks
Faithfulness	L_CiteEval, LongCite
General	BAMBOO, LEval, LongBench, LooGLE, RULER
Generation	LongWriter
Reasoning	Ada_LEval, Counting_Stars, LVEval, LongBench_v2, LongIns, babilong
Retrieval	InfiniteBench, NIAH, NThread, NoLiMa
Reward	LongRewardBench
Specialization	CLongEval, LIBRA, LongHealth, LongSafety
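
When scripting runs, it can help to look up a benchmark's dimension by name. The sketch below transcribes the Overview table into a shell `case` statement. `dimension_of` is a hypothetical helper written for this page; it is not part of LOOM-Eval.

```shell
#!/bin/sh
# Lookup sketch transcribed from the Overview table.
# dimension_of is a hypothetical helper, not a LOOM-Eval command.
dimension_of() {
    case "$1" in
        L_CiteEval|LongCite) echo "Faithfulness" ;;
        BAMBOO|LEval|LongBench|LooGLE|RULER) echo "General" ;;
        LongWriter) echo "Generation" ;;
        Ada_LEval|Counting_Stars|LVEval|LongBench_v2|LongIns|babilong) echo "Reasoning" ;;
        InfiniteBench|NIAH|NThread|NoLiMa) echo "Retrieval" ;;
        LongRewardBench) echo "Reward" ;;
        CLongEval|LIBRA|LongHealth|LongSafety) echo "Specialization" ;;
        *) echo "unknown benchmark: $1" >&2; return 1 ;;
    esac
}

dimension_of NoLiMa   # prints: Retrieval
```

Names are matched case-sensitively, so `babilong` must be written in lowercase, as in the table.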

Running a Benchmark

Basic Command

loom-eval.run \
    --model_path <model_name> \
    --cfg_path ./benchmarks/<ability>/<benchmark>/configs/<benchmark>.yaml \
    --device 0 1 \
    --gp_num 2 \
    --eval \
    --save_tag <your_tag>

Replace <ability> with the benchmark's dimension from the Overview table (for example, Faithfulness) and <benchmark> with the benchmark name.

Example: L_CiteEval

loom-eval.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
    --device 0 1 \
    --gp_num 2 \
    --eval \
    --save_tag Llama3_1_8B_INSTRUCT_L_CiteEval
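
The `--cfg_path` value in the example follows the template from the basic command. As a minimal sketch, the expansion can be written as a small shell function. `cfg_path` is a hypothetical helper, and the sketch assumes every benchmark follows the same directory layout, which this page confirms only for L_CiteEval.

```shell
#!/bin/sh
# Hypothetical helper: expands the template
#   ./benchmarks/<ability>/<benchmark>/configs/<benchmark>.yaml
# Assumption: all benchmarks use this layout; check the repository if unsure.
cfg_path() {
    ability="$1"
    benchmark="$2"
    printf './benchmarks/%s/%s/configs/%s.yaml\n' "$ability" "$benchmark" "$benchmark"
}

# Resolve the config for L_CiteEval under the Faithfulness dimension.
cfg_path Faithfulness L_CiteEval
# prints: ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml
```

The result can be passed directly, for example `--cfg_path "$(cfg_path Faithfulness L_CiteEval)"`.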

LOOMBench (Lightweight Suite)

LOOMBench is a streamlined evaluation suite designed to complete a comprehensive evaluation within 6 hours.

Single Benchmark Evaluation

loom-eval.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
    --device 0 1 2 3 \
    --eval \
    --lightweight

Full LOOMBench Evaluation

1. Edit Tools/LOOMBench.sh
2. Add your model configuration
3. Run: bash Tools/LOOMBench.sh
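
For running only a few benchmarks in lightweight mode, a short loop over the single-benchmark command is one option. The sketch below is not the contents of Tools/LOOMBench.sh. `run_lightweight`, `MODEL_PATH`, and `DRY_RUN` are hypothetical names, and it assumes the listed benchmarks follow the config path template and accept `--lightweight`. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: NOT the contents of Tools/LOOMBench.sh.
# MODEL_PATH and DRY_RUN are hypothetical variables introduced for this example.
MODEL_PATH="${MODEL_PATH:-Meta-Llama/Meta-Llama-3.1-8B-Instruct}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute them

# Each argument is "<ability>/<benchmark>", e.g. Faithfulness/L_CiteEval.
run_lightweight() {
    for entry in "$@"; do
        ability="${entry%%/*}"      # text before the slash
        benchmark="${entry##*/}"    # text after the slash
        cfg="./benchmarks/${ability}/${benchmark}/configs/${benchmark}.yaml"
        if [ "$DRY_RUN" = "1" ]; then
            echo "loom-eval.run --model_path $MODEL_PATH --cfg_path $cfg --device 0 1 2 3 --eval --lightweight"
        else
            loom-eval.run \
                --model_path "$MODEL_PATH" \
                --cfg_path "$cfg" \
                --device 0 1 2 3 \
                --eval \
                --lightweight
        fi
    done
}

run_lightweight Faithfulness/L_CiteEval Faithfulness/LongCite
```

Set `DRY_RUN=0` once the printed commands look right. The loop runs benchmarks one after another on the same four devices.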

All Benchmarks

Faithfulness Benchmarks

L_CiteEval
LongCite

General Benchmarks

BAMBOO
LEval
LongBench
LooGLE
RULER

Generation Benchmarks

LongWriter

Reasoning Benchmarks

Ada_LEval
Counting_Stars
LVEval
LongBench_v2
LongIns
babilong

Retrieval Benchmarks

InfiniteBench
NIAH
NThread
NoLiMa

Reward Benchmarks

LongRewardBench