JudgeLRM
This Space demonstrates the JudgeLRM model, designed to evaluate the quality of two AI assistant responses. JudgeLRM is a family of judgment-oriented LLMs trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned baselines and state-of-the-art reasoning models: notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, with the largest gains on judge tasks that require deep reasoning.
Enter an instruction and two responses, and the model will think, reason and score them on a scale of 1-10 (higher is better).
You can also select Hugging Face models to automatically generate responses for evaluation.
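For a programmatic alternative to the demo UI, the sketch below shows one way to prompt a JudgeLRM checkpoint with Hugging Face `transformers`. The Hub model ID, prompt layout, and generation settings are assumptions for illustration only; the Space and model card define the exact template the model was trained with.

```python
# Minimal sketch: judging two responses with a JudgeLRM checkpoint.
# The model ID and prompt format below are assumptions; verify them on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nuojohnchen/JudgeLRM-7B"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

instruction = "Explain why the sky appears blue."
response_1 = "Rayleigh scattering deflects shorter (blue) wavelengths more strongly."
response_2 = "The sky is blue because it reflects the ocean."

# Assumed judge prompt: present both responses and ask for reasoning plus 1-10 scores.
prompt = (
    f"[Instruction]\n{instruction}\n\n"
    f"[Response 1]\n{response_1}\n\n"
    f"[Response 2]\n{response_2}\n\n"
    "Think through the strengths and weaknesses of each response, "
    "then score each one from 1 to 10 (higher is better)."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens (the judge's reasoning and scores).
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```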
The demo interface (in both the model-generated and manually entered response modes) provides a judge model selector, Model 1 / Model 2 selectors or Response 1 / Response 2 text fields, generation sliders (ranges 0–1 and 128–4096), and example tables with an Instruction/Question column and two response columns.
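If the Space exposes its interface over the Gradio API, it can also be queried from Python with `gradio_client`. The Space ID, endpoint name, and argument order below are placeholders rather than the Space's actual API; the "Use via API" panel on the Space lists the real signature.

```python
# Hedged sketch of calling the Space programmatically via gradio_client.
from gradio_client import Client

client = Client("nuojohnchen/JudgeLRM")  # assumed Space ID
result = client.predict(
    "Explain why the sky appears blue.",                          # instruction
    "Rayleigh scattering deflects shorter wavelengths strongly.",  # response 1
    "The sky is blue because it reflects the ocean.",              # response 2
    api_name="/judge",  # hypothetical endpoint name; check the Space's API panel
)
print(result)
```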
@misc{nuo2025judgelrm,
      title={JudgeLRM: Large Reasoning Models as a Judge},
      author={Nuo Chen and Zhiyuan Hu and Qingyun Zou and Jiaying Wu and Qian Wang and Bryan Hooi and Bingsheng He},
      year={2025},
      eprint={2504.00050},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.00050},
}

@misc{wang2025assessingjudgingbias,
      title={Assessing Judging Bias in Large Reasoning Models: An Empirical Study},
      author={Qian Wang and Zhanzhi Lou and Zhenheng Tang and Nuo Chen and Xuandong Zhao and Wenxuan Zhang and Dawn Song and Bingsheng He},
      year={2025},
      eprint={2504.09946},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2504.09946},
}