FrontierOR

Benchmarking LLMs' Capacity for Efficient Algorithm Design
in Large-Scale Optimization

Research Team

Minwei Kong1,*, Chonghe Jiang1,2,*, Ao Qu1,2,†, Wenbin Ouyang2, Zhaoming Zeng3, Xiaotong Guo4,‡, Zhekai Li2, Junyi Li1, Yi Fan5, Xinshou Zheng6, Xi Jing6, Yikai Zhang6, Zhiwei Liang7, Seonghoo Kim8, Runqing Yang6, Zijian Zhou12, Sirui Li9,‡, Han Zheng2, Wangyang Ying10, Ou Zheng10, Chonghuan Wang11, Jinglong Zhao6, Hanzhang Qin12, Cathy Wu2, Paul Pu Liang1,2, Jinhua Zhao1,2,§, Hai Wang1,13,§

  1. Singapore-MIT Alliance for Research and Technology
  2. Massachusetts Institute of Technology
  3. Northeastern University
  4. Uber
  5. Shanghai Jiao Tong University
  6. Boston University
  7. Emory University
  8. Northwestern University
  9. Microsoft
  10. Zhiling Research
  11. University of Texas at Dallas
  12. National University of Singapore
  13. Singapore Management University
* Equal contribution  Project lead  Work done outside the authors' employment § Corresponding author

Latest News

Overview of FrontierOR: problem domains, formulation types, application fields, and instance scales.
Figure 1. FrontierOR spans diverse problem domains, formulation types, and application fields. Each optimization problem involves 102 to 107 decision variables and constraints, with Gurobi failing to reach optimality on 46% of large-scale instances within a one-hour time budget. We construct the benchmark by collecting problems from leading OR journals, and ensure data quality through multi-round expert review.

Introduction

LLMs are increasingly used for optimization modeling and solver-code generation, yet practical operations research problems require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale.

We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. It includes 180 tasks derived from methodologically diverse papers in top-tier OR venues, with instances spanning up to ~10⁷ decision variables and constraints, each task paired with a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models under one-shot and test-time evolution settings.

Evaluation Protocol

A FrontierOR task feeds the problem description and instructions into an LLM to generate an end-to-end optimization program. The program is first sanity-checked on a tiny instance; if it passes, it is evaluated on large instance sets for feasibility and compared with a Gurobi reference in terms of solution quality and computational efficiency.

Evaluation pipeline from problem description to LLM program to evaluated metrics.
Figure 2. Evaluation protocol of a FrontierOR benchmark task.

Experiments Results

Our experiments are organized around two research questions: (1) how well do current LLMs perform under one-shot algorithm generation, and (2) to what extent can test-time self-evolution improve the generated algorithms.

One-Shot Algorithm Generation

Three patterns emerge on Full (n=180) and Hard (n=50):

  1. 01
    Frontier models clearly outperform cost-effective ones on both Full and Hard.

    On Full, frontier feasibility clusters at 0.60–0.62 vs. 0.18–0.42 for cost-effective; on Hard, the frontier band shifts to 0.49–0.64 while cost-effective falls to 0.13–0.37. The two-tier gap persists at both scales.

  2. 02
    Execution is no longer the primary bottleneck for the strongest models.

    GPT-5.3-Codex reaches exec rate 0.98 and both Gemini 3.1 Pro and Claude Opus 4.6 hit 0.93 on Full — yet feasibility and quality stay substantially lower. The real difficulty is generating algorithms that remain valid at scale, not producing runnable code.

  3. 03
    Hard exposes sharp differences in algorithmic robustness.

    On Full the three frontier models are tightly bunched (QTE 0.25–0.31); on Hard the band widens and leaders re-separate. Claude Opus 4.6 retains the highest QTE on both subsets (0.31 / 0.32), while GPT-5.3-Codex's Hard feasibility, solution quality, and QTE all fall furthest among the frontier (0.49 / 0.30 / 0.18) — its programs run, but rarely combine validity with Gurobi-level quality and speed.

Algorithmic Patterns & Failure Modes

Looking inside the generated code reveals two complementary signals: which algorithm families each model prefers, and where each model's failures occur. Together they explain the QTE gaps observed in the leaderboard.

  • Model choices differ sharply. LLaMA-4-Maverick almost always emits monolithic solver calls (99%); Claude Opus 4.6 is balanced (37% monolithic / 27% local-search / 27% matheuristic).
  • Frontier ≈ non-monolithic. Gemini, Codex, and Opus more often pick decomposition, local-search, or matheuristic styles; weaker models default to direct solver calls (Qwen3 72%, LLaMA 99%).
  • Hybrid & search-based methods win. At the frontier, stronger performance is consistently associated with hybrid or local-search approaches — algorithm diversity explains much of the QTE gap (Opus's balanced mix ↔ best QTE on both subsets).
Algorithm-family bubble chart: per-model strategy preferences and quality–time efficiency against Gurobi.
Figure 3(a). Per-model strategy preferences and performance. Bubble size and percentage encode the share of the model's tasks in each family; bubble color encodes the quality–time efficiency compared with Gurobi.
  • Formulation flaws dominate. For most models, formulation-design flaw is the single largest failure category — failures happen before implementation.
  • Weaker models fail earlier. Grok, Qwen, LLaMA show large shares of constraint-spec and schema-violation errors — they struggle to translate the NL task into a valid optimization algorithm.
  • Stronger models shift the bottleneck. Opus is the clear exception: fewer formulation failures, more heuristic-search failures — interpretation is usually right; remaining errors come from insufficient refinement on large instances.
Failure-mode composition per LLM: stacked bars showing percentage of tasks per failure category plus correct solutions.
Figure 3(b). Failure-mode composition per LLM. Segments show the percentage of tasks per failure mode plus correct solutions.

Performance Improvement with Self-Evolution

Direct one-shot generation, even with the strongest frontier models, leaves many hard tasks unsolved or far below the Gurobi reference. Self-evolving agents can iteratively refine candidate programs using internal feedback — three frameworks (CORAL, OpenEvolve, EoH) initialized from the same one-shot seed, with GPT-5.3-Codex as the shared backbone, all significantly outperform one-shot on every metric.

Under a 30-attempt budget, CORAL attains the best solution quality and runtime efficiency, outperforming the runner-up OpenEvolve by 6% and 1% respectively. EoH shows limited improvement, suggesting that prompt-only crossover and mutation is unstable and prone to regressions on candidates that looked promising on dev sets.

Self-evolution trajectories on the hard tasks: best-so-far signed gap to Gurobi (left) and speed advantage (right) across iterations.
Figure 4. Self-evolution trajectories on the hard tasks. Left: best-so-far signed gap to Gurobi (lower is better). Right: best-so-far speed advantage g − tsolve)/τg (higher is better). Each curve averages across the selected tasks (±1 SEM).

BibTeX

@article{kong2026frontieror,
  title   = {FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization},
  author  = {Kong, Minwei and Jiang, Chonghe and Qu, Ao and Ouyang, Wenbin and Zeng, Zhaoming and Guo, Xiaotong and Li, Zhekai and Li, Junyi and Fan, Yi and Zheng, Xinshou and others},
  journal = {arXiv preprint arXiv:2605.25246},
  year    = {2026}
}