Jiajun Shi*¹,², Chaoren Wei*¹,², Liqun Yang*², Zekun Moore Wang¹,², Chenghao Yang³,⁴, Ge Zhang¹,³, Stephen Huang¹,³, Tao Peng³, Jian Yang†², Zhoufutu Wen†¹,³

¹Multimodal Art Projection, ²Beihang University, ³ByteDance Inc., ⁴University of Science and Technology of China

*Equal Contribution
†Corresponding Authors


Introduction

CryptoX is an evaluation framework that, for the first time, combines existing benchmarks with cryptographic techniques to quantify the compositional reasoning (CR) capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a large gap between the two groups. We further conduct interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, which involves subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of studying compositional reasoning independently and emphasize the need to enhance the compositional reasoning abilities of LLMs.

📖Overview

Inspired by cryptographic techniques, CryptoX flexibly transforms existing benchmarks into CryptoBench using instruction encryption and instruction transformation. Instruction encryption randomly encodes part of each instruction in the benchmarks using a given codebook. Instruction transformation defines additional projection rules from the original answer to the CryptoX answer. For example, the original correct choice in MMLU requires an additional Numeric Transformation operation (A → 1, B → 2, ...) to be counted as correct in Crypto-MMLU. All the additional rules for instruction encryption and transformation are clearly stated in the given concatenated instructions. By incorporating instruction encryption and instruction transformation, CryptoBench benchmarks aim to assess LLMs' compositional reasoning (CR) capabilities in a flexible manner.
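The two operations can be illustrated with a minimal Python sketch. It is not the CryptoX implementation: the letter-shift codebook, the angle-bracket markers, the wording of the rules, and all function names (`encode_word`, `encrypt_instruction`, `numeric_transform`, `build_prompt`) are assumptions made only for illustration. The sketch encodes a chosen number of randomly selected words, maps a letter choice to its number, and prepends the rules to the question.

```python
import random
import string

# Hypothetical codebook (an assumption, not the one used by CryptoX):
# each lowercase letter maps to the letter three places later in the alphabet.
CODEBOOK = {
    c: string.ascii_lowercase[(i + 3) % 26]
    for i, c in enumerate(string.ascii_lowercase)
}


def encode_word(word, codebook=CODEBOOK):
    """Encode one word with the codebook (the word is lowercased first)."""
    return "".join(codebook.get(ch, ch) for ch in word.lower())


def encrypt_instruction(question, num_words, rng):
    """Encode up to `num_words` randomly chosen alphabetic words."""
    words = question.split()
    # Only purely alphabetic tokens are candidates for encoding.
    candidates = [i for i, w in enumerate(words) if w.isalpha()]
    chosen = rng.sample(candidates, min(num_words, len(candidates)))
    for i in chosen:
        # Angle brackets mark encoded words (an illustrative convention).
        words[i] = "<" + encode_word(words[i]) + ">"
    return " ".join(words)


def numeric_transform(choice):
    """Numeric Transformation of a choice letter: A -> 1, B -> 2, ..."""
    return ord(choice.upper()) - ord("A") + 1


def build_prompt(question, num_words, seed=0):
    """Concatenate the stated rules with the partially encoded question."""
    rng = random.Random(seed)
    rules = (
        "Words in angle brackets are encoded by shifting each letter "
        "forward by 3; decode them first. Answer with the number of the "
        "correct option (A -> 1, B -> 2, ...)."
    )
    return rules + "\n" + encrypt_instruction(question, num_words, rng)
```

In this sketch, a model's answer to a transformed multiple-choice item would be counted as correct only if it equals `numeric_transform` of the original gold letter, and `num_words` plays the role of the number of encoded words.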

🏅Leaderboard

We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a large gap between the two groups. Through the following experiments, we find that (1) most existing LLMs have weak CR abilities, and the proposed CryptoBench can measure the CR ability gap between different LLMs; and (2) the CR ability of a model is influenced by various factors, such as model size and architecture.


Model	AUC	Avg	Crypto-Math	Crypto-MBPP	Crypto-BBH	Crypto-MMLU	Crypto-MMLU-Num	Crypto-MMLU-Alpha	Crypto-Needle-30K

BibTeX


@misc{shi2025cryptoxcompositionalreasoning,
  title={CryptoX : Compositional Reasoning Evaluation of Large Language Models},
  author={Jiajun Shi and Chaoren Wei and Liqun Yang and Zekun Moore Wang and Chenghao Yang and Ge Zhang and Stephen Huang and Tao Peng and Jian Yang and Zhoufutu Wen},
  year={2025},
  eprint={2502.07813},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2502.07813},
}