
Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees

MBZUAI    Cornell University
Figure: GG system overview.

GG (Guaranteed Guess) is the first assembly-to-assembly transpiler that combines LLM-based translation with rigorous software testing to transpile x86 binaries into efficient ARM and RISC-V equivalents with testing guarantees.

Abstract

The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms.

In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method uses an LLM to generate candidate translations from one ISA to another, and embeds these translations within a software-testing framework to build quantifiable confidence in them. We evaluate GG on two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showing 1.73× faster runtime, 1.47× better energy efficiency, and 2.41× lower memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks.

Key Contributions

  • First CISC-to-RISC Transpiler: We introduce GG, the first CISC-to-RISC transpiler, built on a custom-trained, architecture-aware language model that achieves a test accuracy of 99.39% on ARMv8 and 89.93% on RISC-V64.
  • Testing-Driven Validation: A methodology that measures and builds confidence in transpilation output via software-testing techniques ("guaranteeing" the guess), including a detailed analysis of correctness, errors, and hallucinations.
  • Hardware-Informed Design: An in-depth analysis of the inner workings of our transpiler, including the hardware-informed design decisions needed to train an accurate LLM for assembly transpilation.
  • Real-World Case Study: We evaluate our transpiler in a real-world setting by comparing it directly against Apple's Rosetta 2 x86-to-ARM binary translation engine. GG's generated assembly achieves a 1.73× runtime speedup while delivering 1.47× better energy efficiency and 2.41× better memory efficiency.
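The guess-and-test loop at the heart of GG can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: `run_tests` is a hypothetical callback that assembles one candidate translation, links it against the original program's unit tests, runs them, and reports whether they all pass.

```python
def guaranteed_guess(candidates, run_tests):
    """Return the first candidate translation that passes every unit test.

    `candidates` is an iterable of LLM-generated target-ISA assembly
    strings; `run_tests` (hypothetical) compiles and executes one
    candidate against the source program's unit tests and returns True
    only if all of them pass.
    """
    for asm in candidates:
        if run_tests(asm):
            return asm  # the guess is now "guaranteed" by the test suite
    return None  # no candidate survived testing; fall back or flag
```

Because acceptance is gated on the full test suite, a returned translation carries exactly the confidence the tests provide, which is why high code coverage matters.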

Training Data and Methodology

GG is trained on 1.32M samples derived from AnghaBench (1M programs) and The Stack v2 (306k programs), compiled to both x86 (CISC) and ARM/RISC-V (RISC) targets under optimization levels -O0 and -O2, exposing the models to both semantically transparent and performance-optimized binaries.

The training process uses DeepSeek-Coder and Qwen2.5-Coder as base models, with specialized tokenizer extensions for assembly opcodes and registers, RoPE extrapolation for longer context, and beam search decoding for improved accuracy.
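The beam-search decoding step can be illustrated in miniature. This is a toy sketch of the algorithm itself, not the GG decoder: `score_next` stands in for the language model's next-token log-probabilities.

```python
import heapq

def beam_search(score_next, start, width=8, steps=4, eos="<eos>"):
    """Keep the `width` highest-scoring partial sequences at each step.

    `score_next(seq)` returns a dict {token: log_prob} for the sequence
    so far (a stand-in for the LM's next-token distribution).
    """
    beams = [(0.0, [start])]
    for _ in range(steps):
        expanded = []
        for logp, seq in beams:
            if seq[-1] == eos:  # finished hypotheses are carried over as-is
                expanded.append((logp, seq))
                continue
            for tok, lp in score_next(seq).items():
                expanded.append((logp + lp, seq + [tok]))
        beams = heapq.nlargest(width, expanded, key=lambda b: b[0])
    return beams[0][1]  # highest-scoring sequence found
```

Widening the beam trades decoding time for a better chance of keeping the correct translation among the candidates, which matches the 8-beam setting in the ablation below.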

Evaluation Benchmarks

We evaluate GG on two complementary benchmarks: HumanEval-C, with 164 programming problems, and BringupBench, with 65 bare-metal programs (85–5,751 lines of code), providing coverage from isolated functions to full project structures with internal libraries.

Figure: Token counts by ISA and benchmark.

Results and Performance

Transpilation Accuracy

GG models significantly outperform all baseline models across different architectures and optimization levels. Most baseline models achieve 0% accuracy, highlighting the unique difficulty of low-level ISA translation.

| Model              | ARMv5 (-O0) | ARMv8 (-O0) | ARMv8 (-O2) | BringupBench (-O0) |
|--------------------|-------------|-------------|-------------|--------------------|
| GPT-4o             | 8.48%       | 10.3%       | 4.24%       | 1.54%              |
| Qwen2.5-Coder-1.5B | 0%          | 0%          | 0%          | 0%                 |
| StarCoder2-3B      | 0%          | 0%          | 0%          | 0%                 |
| GG-DeepSeek-1.3B   | 79.25%      | 75.15%      | 10.3%       | 3.08%              |
| GG-0.5B            | 90.85%      | 86.06%      | 25.45%      | 27.69%             |
| GG-1.5B            | 93.71%      | 99.39%      | 45.12%      | 49.23%             |

Real-World Performance vs Rosetta 2

We conducted a real-world study on Apple M2 Pro comparing GG against Rosetta 2 across execution time, CPU energy, and memory usage. GG achieves near-native performance while significantly outperforming Rosetta 2 across all metrics.

| Metric              | Rosetta 2 | GG (Ours) | Native | Improvement  |
|---------------------|-----------|-----------|--------|--------------|
| Execution Time (ms) | 13.94     | 8.03      | 7.39   | 1.73× faster |
| CPU Energy (J)      | 7.50      | 5.09      | 5.07   | 1.47× better |
| RAM Usage (MB)      | 2.49      | 1.03      | 1.03   | 2.41× better |
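The improvement column follows directly from the raw measurements. A quick sanity check with the numbers copied from the table (ratios agree with the reported factors up to rounding):

```python
# Measurements copied from the table above (Apple M2 Pro study).
rosetta = {"time_ms": 13.94, "energy_j": 7.50, "ram_mb": 2.49}
gg      = {"time_ms": 8.03,  "energy_j": 5.09, "ram_mb": 1.03}

# Improvement factor = Rosetta 2 cost / GG cost (higher is better for GG).
improvement = {k: rosetta[k] / gg[k] for k in rosetta}
```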

Ablation Study

Our ablation study shows incremental improvements from each component: data scaling provides the largest leap, and the remaining architectural choices compound the gains toward near-perfect accuracy.

| Component               | ARMv8 Accuracy | Impact (Δ) |
|-------------------------|----------------|------------|
| Qwen2.5-Coder baseline  | 0%             | –          |
| + 1M AnghaBench         | 93.94%         | +93.94%    |
| + 0.3M Stack v2         | 95.38%         | +1.44%     |
| + RoPE extrapolation    | 97.14%         | +1.76%     |
| + extended tokenizer    | 98.18%         | +1.04%     |
| + 8-beam search         | 99.39%         | +1.21%     |

ISA Similarity Analysis

We observe a direct correlation between ISA similarity and transpilation accuracy: ARMv8 exhibits the highest similarity to x86 (40.19%), followed by ARMv5 (25.09%) and RISC-V64 (21.41%), mirroring the models' relative accuracy across these architectures.
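The similarity numbers above are CHRF-style character n-gram scores between parallel assembly listings. A simplified stand-in for that metric (not the exact CHRF implementation, which averages over several n-gram orders) looks like:

```python
from collections import Counter

def char_ngram_f(hyp, ref, n=3):
    """Toy character n-gram F1 between two assembly strings, a simplified
    stand-in for the CHRF metric used in the ISA similarity analysis."""
    def grams(s):
        s = " ".join(s.split())  # normalize whitespace before extracting n-grams
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = grams(hyp), grams(ref)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())  # clipped n-gram matches
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Intuitively, ARMv8 listings share far more surface n-grams with x86 listings than RISC-V64 ones do, which is what the 40.19% vs. 21.41% gap captures.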

Additionally, we analyze how compiler optimization levels affect opcode usage patterns in ARMv8. At -O2 optimization, mov instructions become dominant (+14.8%), indicating more register reuse and reduced memory traffic, which makes the learning task more challenging for the model.
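The opcode-shift analysis reduces to comparing mnemonic histograms of the same programs compiled at -O0 and -O2. A simplified counter, assuming GNU-style AArch64 listings (label, directive, and `//` comment handling is approximate):

```python
from collections import Counter

def opcode_histogram(asm):
    """Count mnemonic frequencies in an assembly listing (toy sketch for
    reasoning about opcode shifts across optimization levels)."""
    ops = Counter()
    for line in asm.splitlines():
        line = line.split("//")[0].strip()  # drop AArch64-style comments
        if not line or line.endswith(":") or line.startswith("."):
            continue  # skip blank lines, labels, and assembler directives
        ops[line.split()[0]] += 1  # first token is the mnemonic
    return ops
```

Differencing two such histograms (per program, -O2 minus -O0) surfaces shifts like the +14.8% rise in mov instructions noted above.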

Figure: Opcode shift and CHRF similarity analysis.

BibTeX

@article{heakl2025guaranteed,
  title={Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees},
  author={Heakl, Ahmed and Hashmi, Sarim and Abi, Chaimaa and Lee, Celine and Mahmoud, Abdulrahman},
  journal={arXiv preprint},
  year={2025},
  note={MBZUAI, Cornell University}
}