Benchmarks
==========

LOOM-Eval supports 23 comprehensive benchmarks across 7 cognitive dimensions.

.. note::

   Configuration files and additional resources for each benchmark can be
   found in the GitHub repository.

Overview
--------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Dimension
     - Benchmarks
   * - **Faithfulness**
     - :doc:`L_CiteEval`, :doc:`LongCite`
   * - **General**
     - :doc:`BAMBOO`, :doc:`LEval`, :doc:`LongBench`, :doc:`LooGLE`, :doc:`RULER`
   * - **Generation**
     - :doc:`LongWriter`
   * - **Reasoning**
     - :doc:`Ada_LEval`, :doc:`Counting_Stars`, :doc:`LVEval`, :doc:`LongBench_v2`, :doc:`LongIns`, :doc:`babilong`
   * - **Retrieval**
     - :doc:`InfiniteBench`, :doc:`NIAH`, :doc:`NThread`, :doc:`NoLiMa`
   * - **Reward**
     - :doc:`LongRewardBench`
   * - **Specialization**
     - :doc:`CLongEval`, :doc:`LIBRA`, :doc:`LongHealth`, :doc:`LongSafety`

Running a Benchmark
-------------------

Basic Command
~~~~~~~~~~~~~

.. code-block:: bash

   loom-eval.run \
       --model_path \
       --cfg_path ./benchmarks///configs/.yaml \
       --device 0 1 \
       --gp_num 2 \
       --eval \
       --save_tag

Example: L_CiteEval
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   loom-eval.run \
       --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
       --cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
       --device 0 1 \
       --gp_num 2 \
       --eval \
       --save_tag Llama3_1_8B_INSTRUCT_L_CiteEval

LOOMBench (Lightweight Suite)
-----------------------------

LOOMBench is a streamlined evaluation suite that allows comprehensive evaluation within 6 hours.

**Single Benchmark Evaluation**

.. code-block:: bash

   loom-eval.run \
       --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
       --cfg_path ./benchmarks/Faithfulness/L_CiteEval/configs/L_CiteEval.yaml \
       --device 0 1 2 3 \
       --eval \
       --lightweight

**Full LOOMBench Evaluation**

1. Edit ``Tools/LOOMBench.sh``
2. Add your model configuration
3. Run: ``bash Tools/LOOMBench.sh``

All Benchmarks
--------------

Faithfulness Benchmarks
~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   L_CiteEval
   LongCite

General Benchmarks
~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   BAMBOO
   LEval
   LongBench
   LooGLE
   RULER

Generation Benchmarks
~~~~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   LongWriter

Reasoning Benchmarks
~~~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   Ada_LEval
   Counting_Stars
   LVEval
   LongBench_v2
   LongIns
   babilong

Retrieval Benchmarks
~~~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   InfiniteBench
   NIAH
   NThread
   NoLiMa

Reward Benchmarks
~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   LongRewardBench

Specialization Benchmarks
~~~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   CLongEval
   LIBRA
   LongHealth
   LongSafety