Benchmarks
LOOM-Eval supports 23 comprehensive benchmarks across 7 cognitive dimensions.
Note
Configuration files and additional resources for each benchmark can be found in the GitHub repository.
Overview
| Dimension | Benchmarks |
|---|---|
| Faithfulness | |
| General | |
| Generation | |
| Reasoning | Ada_LEval, Counting_Stars, LVEval, LongBench_v2, LongIns, babilong |
| Retrieval | |
| Reward | |
| Specialization | |
Running a Benchmark
Basic Command
loom-eval.run \
--model_path <model_name> \
--cfg_path ./benchmarks/<ability>/<benchmark>/configs/<benchmark>.yaml \
--device 0 1 \
--gp_num 2 \
--eval \
--save_tag <your_tag>
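The placeholders in the basic command can be filled in mechanically. The following is a minimal sketch, assuming the `<ability>/<benchmark>` path pattern holds for the benchmark you pick; `cfg_path_for`, `build_cmd`, and the tag `my_run_L_CiteEval` are hypothetical names introduced here for illustration, not part of LOOM-Eval. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: assembles and prints the command; it does not launch an evaluation.

# Build the --cfg_path value from the documented pattern.
# $1 = ability (dimension), $2 = benchmark name
cfg_path_for() {
  printf './benchmarks/%s/%s/configs/%s.yaml' "$1" "$2" "$2"
}

# Assemble the full command line with the same flags as the basic command.
# $1 = model path, $2 = ability, $3 = benchmark, $4 = save tag
build_cmd() {
  printf 'loom-eval.run --model_path %s --cfg_path %s --device 0 1 --gp_num 2 --eval --save_tag %s\n' \
    "$1" "$(cfg_path_for "$2" "$3")" "$4"
}

# Illustrative values; the save tag is a made-up example.
build_cmd "Meta-Llama/Meta-Llama-3.1-8B-Instruct" "Faithfulness" "L_CiteEval" "my_run_L_CiteEval"
```

Once the printed command looks right, it can be pasted into a terminal or run with `eval "$(build_cmd ...)"`.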
Example: L_CiteEval
loom-eval.run \
--model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
--cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
--device 0 1 \
--gp_num 2 \
--eval \
--save_tag Llama3_1_8B_INSTRUCT_L_CiteEval
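The same invocation can be repeated for several benchmarks in one dimension. The sketch below prints one command for each Reasoning benchmark named in the overview table. It assumes each of them follows the `./benchmarks/<ability>/<benchmark>/configs/<benchmark>.yaml` pattern, which should be checked against the repository; the `demo_` tag prefix and the `print_commands` helper are hypothetical.

```shell
#!/bin/sh
# Sketch only: prints one command per benchmark instead of running it.

model_path="Meta-Llama/Meta-Llama-3.1-8B-Instruct"
ability="Reasoning"

# Benchmark names copied from the Reasoning row of the overview table.
benchmarks="Ada_LEval Counting_Stars LVEval LongBench_v2 LongIns babilong"

print_commands() {
  for benchmark in $benchmarks; do
    # Assumption: every benchmark follows the documented config path pattern.
    cfg="./benchmarks/${ability}/${benchmark}/configs/${benchmark}.yaml"
    # Hypothetical tag scheme: one tag per benchmark keeps results separate.
    echo "loom-eval.run --model_path ${model_path} --cfg_path ${cfg} --device 0 1 --gp_num 2 --eval --save_tag demo_${benchmark}"
  done
}

print_commands
```

Printing first makes it easy to spot a config path that does not exist before committing GPU time to the run.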
LOOMBench (Lightweight Suite)
LOOMBench is a streamlined evaluation suite designed to complete a comprehensive evaluation within 6 hours.
Single Benchmark Evaluation
loom-eval.run \
--model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
--cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
--device 0 1 2 3 \
--eval \
--lightweight
Full LOOMBench Evaluation
Edit Tools/LOOMBench.sh to add your model configuration, then run:
bash Tools/LOOMBench.sh