Benchmarks

LOOM-Eval supports 23 comprehensive benchmarks across 7 cognitive dimensions.

Note

Configuration files and additional resources for each benchmark can be found in the GitHub repository.

Overview

Dimension        Benchmarks
Faithfulness     L_CiteEval, LongCite
General          BAMBOO, LEval, LongBench, LooGLE, RULER
Generation       LongWriter
Reasoning        Ada_LEval, Counting_Stars, LVEval, LongBench_v2, LongIns, babilong
Retrieval        InfiniteBench, NIAH, NThread, NoLiMa
Reward           LongRewardBench
Specialization   CLongEval, LIBRA, LongHealth, LongSafety
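
When scripting runs, it can help to look up the dimension a benchmark belongs to. The sketch below encodes the overview table as a shell function. The name dimension_of is hypothetical and not part of LOOM-Eval, and it assumes the names are spelled exactly as in the table.

```shell
#!/bin/sh
# Hypothetical helper (not part of LOOM-Eval): maps a benchmark name
# to its dimension, using the overview table as the only source.
dimension_of() {
    case "$1" in
        L_CiteEval|LongCite) echo Faithfulness ;;
        BAMBOO|LEval|LongBench|LooGLE|RULER) echo General ;;
        LongWriter) echo Generation ;;
        Ada_LEval|Counting_Stars|LVEval|LongBench_v2|LongIns|babilong) echo Reasoning ;;
        InfiniteBench|NIAH|NThread|NoLiMa) echo Retrieval ;;
        LongRewardBench) echo Reward ;;
        CLongEval|LIBRA|LongHealth|LongSafety) echo Specialization ;;
        *) echo "unknown benchmark: $1" >&2; return 1 ;;
    esac
}

dimension_of RULER      # prints "General"
dimension_of babilong   # prints "Reasoning"
```

Matching is case-sensitive, so babilong must be written in lowercase as in the table. An unknown name returns a non-zero status, which lets a calling script stop early.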

Running a Benchmark

Basic Command

loom-eval.run \
    --model_path <model_name> \
    --cfg_path ./benchmarks/<ability>/<benchmark>/configs/<benchmark>.yaml \
    --device 0 1 \
    --gp_num 2 \
    --eval \
    --save_tag <your_tag>
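
To make the placeholders concrete, the sketch below expands <ability> and <benchmark> into a config path following the pattern in the basic command. The function name cfg_path is hypothetical and only illustrates the path layout. It does not check that the file exists.

```shell
#!/bin/sh
# Hypothetical helper: builds the --cfg_path value from an ability
# (dimension) and a benchmark name, following the documented pattern
# ./benchmarks/<ability>/<benchmark>/configs/<benchmark>.yaml
cfg_path() {
    printf './benchmarks/%s/%s/configs/%s.yaml\n' "$1" "$2" "$2"
}

cfg_path Faithfulness L_CiteEval
# prints ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml
```

The result can be passed directly, for example --cfg_path "$(cfg_path Faithfulness L_CiteEval)".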

Example: L_CiteEval

loom-eval.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
    --device 0 1 \
    --gp_num 2 \
    --eval \
    --save_tag Llama3_1_8B_INSTRUCT_L_CiteEval

LOOMBench (Lightweight Suite)

LOOMBench is a streamlined evaluation suite designed to complete a comprehensive evaluation within 6 hours.

Single Benchmark Evaluation

loom-eval.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
    --device 0 1 2 3 \
    --eval \
    --lightweight
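
To run several benchmarks in lightweight mode one after another, the command above can be generated in a loop. The sketch below is a dry run: it only prints the commands so they can be reviewed first. The function name lightweight_cmd and the benchmark list are illustrative assumptions. Whether a given benchmark supports --lightweight is not stated here, and the config paths assume every benchmark follows the pattern shown in the basic command.

```shell
#!/bin/sh
# Dry-run sketch: prints one lightweight command per benchmark.
# Only flags that appear in the commands above are used.
MODEL="Meta-Llama/Meta-Llama-3.1-8B-Instruct"

# Hypothetical helper: takes "<ability>/<benchmark>" and prints the command.
lightweight_cmd() {
    benchmark=${1##*/}   # strip everything up to the last slash
    printf 'loom-eval.run --model_path %s --cfg_path ./benchmarks/%s/configs/%s.yaml --device 0 1 2 3 --eval --lightweight\n' \
        "$MODEL" "$1" "$benchmark"
}

# Illustrative selection; adjust to the benchmarks you need.
for pair in Faithfulness/L_CiteEval General/RULER; do
    lightweight_cmd "$pair"
done
```

Once the printed commands look right, they can be piped to sh or pasted into a terminal. For the full suite, Tools/LOOMBench.sh remains the documented route.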

Full LOOMBench Evaluation

  1. Open Tools/LOOMBench.sh in an editor.

  2. Add your model configuration to the script.

  3. Run the script: bash Tools/LOOMBench.sh

All Benchmarks

Faithfulness Benchmarks

General Benchmarks

Generation Benchmarks

Reasoning Benchmarks

Retrieval Benchmarks

Reward Benchmarks

Specialization Benchmarks