
Jiajun Shi*1,2, Chaoren Wei*1,2, Liqun Yang*2, Zekun Moore Wang1,2, Chenghao Yang3,4, Ge Zhang1,3, Stephen Huang1,3, Tao Peng3, Jian Yang†2, Zhoufutu Wen†1,3

1Multimodal Art Projection, 2Beihang University, 3ByteDance Inc., 4University of Science and Technology of China

*Equal Contribution
†Corresponding Authors

Introduction

CryptoX is an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning (CR) capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which applies these principles to several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a substantial gap between the two. We further conduct thorough interpretability experiments to uncover the inner mechanism of LLMs' compositional reasoning, which involves decomposing a problem into subproblems, reasoning over each subproblem, and summarizing the subproblem conclusions. Based on this analysis, we highlight the value of studying compositional reasoning independently and emphasize the need to strengthen the compositional reasoning abilities of LLMs.

📖Overview


Inspired by cryptographic techniques, CryptoX flexibly transforms existing benchmarks into CryptoBench through two operations: instruction encryption and instruction transformation. Instruction encryption randomly encodes part of each instruction using a given codebook. Instruction transformation defines additional projection rules from the original answer to the CryptoX answer; for example, the original multiple-choice answers in MMLU require an additional Numeric Transformation (A → 1, B → 2, ...) to be judged correct in Crypto-MMLU. All the additional rules for instruction encryption and transformation are clearly stated in the concatenated instructions given to the model. By combining instruction encryption and instruction transformation, the CryptoBench benchmarks assess LLMs' CR capabilities in a flexible manner, as sketched below.
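To make the two operations concrete, here is a minimal Python sketch. The codebook contents, the word-selection policy, and the names `encrypt_instruction` and `transform_answer` are illustrative assumptions for exposition, not the authors' implementation.

```python
import random

CODEBOOK = {"answer": "xq1", "choose": "pel", "question": "vtn"}  # hypothetical codebook
ANSWER_MAP = {"A": "1", "B": "2", "C": "3", "D": "4"}             # Numeric Transformation

def encrypt_instruction(instruction: str, n_words: int = 2, seed: int = 0) -> str:
    """Instruction encryption: randomly encode up to n_words codebook words."""
    rng = random.Random(seed)
    words = instruction.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in CODEBOOK]
    for i in rng.sample(candidates, min(n_words, len(candidates))):
        words[i] = CODEBOOK[words[i].lower()]
    return " ".join(words)

def transform_answer(choice: str) -> str:
    """Instruction transformation: project a letter choice onto the Crypto-MMLU label space."""
    return ANSWER_MAP[choice]

# The model must decode the encrypted instruction, solve the task, and then
# emit the transformed label ("2" instead of "B") to be scored correct.
prompt = encrypt_instruction("choose the answer to this question")
assert transform_answer("B") == "2"
```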

🏅Leaderboard

We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a substantial gap between the two groups. From these experiments, we find that (1) most existing LLMs have weak CR abilities, and the proposed CryptoBench can measure the CR gap between different LLMs; and (2) a model's CR ability is influenced by several factors, such as model size and architecture.


The interactive leaderboard can be filtered by model type (open-source vs. proprietary) and by the number of encoded words (0 or 5). For each model it reports: AUC, Avg, Crypto-Math, Crypto-MBPP, Crypto-BBH, Crypto-MMLU, Crypto-MMLU-Num, Crypto-MMLU-Alpha, and Crypto-Needle-30K.
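The AUC column suggests that accuracy is measured at several encoded-word counts (cf. the 0- vs. 5-word views) and aggregated as the area under the accuracy curve. Below is a hedged sketch of one plausible normalized computation; the word counts, accuracy values, and exact aggregation formula are assumptions, not the benchmark's published definition.

```python
import numpy as np

# Hypothetical accuracy of one model at increasing numbers of encoded words.
words_encoded = np.array([0, 1, 2, 3, 4, 5])
accuracy = np.array([0.82, 0.74, 0.65, 0.55, 0.48, 0.41])

# Trapezoidal area under the accuracy curve, normalized by the x-range
# so the resulting score stays in [0, 1].
area = np.sum((accuracy[:-1] + accuracy[1:]) / 2 * np.diff(words_encoded))
auc = float(area) / (words_encoded[-1] - words_encoded[0])
print(f"normalized AUC = {auc:.3f}")
```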

BibTeX


@misc{shi2025cryptoxcompositionalreasoning,
  title={CryptoX: Compositional Reasoning Evaluation of Large Language Models},
  author={Jiajun Shi and Chaoren Wei and Liqun Yang and Zekun Moore Wang and Chenghao Yang and Ge Zhang and Stephen Huang and Tao Peng and Jian Yang and Zhoufutu Wen},
  year={2025},
  eprint={2502.07813},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2502.07813},
}