FrontierOR
Seven LLMs evaluated one-shot on FrontierOR Full (n=180) and Hard (n=50): the strongest model matches Gurobi on both quality and time in only 31% of cases.
Three self-evolving frameworks with GPT-5.3-Codex as backbone push this rate to 50% on selected Hard tasks.
Metric columns: Exec. rate, Feasibility, Sol. quality, QTE.
FrontierOR Full (n=180)
One-shot performance of different LLMs. Ranked by QTE.
Performance on Hard Set
FrontierOR Hard (n=50)
The Hard subset comprises 50 tasks whose problem class or instance structure is computationally demanding and on which Gurobi fails to reach optimality within a one-hour budget. It is the Gurobi-saturated tail of the full benchmark.
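The "Gurobi-saturated" part of that criterion can be illustrated with a minimal sketch. It assumes each task has a recorded baseline Gurobi run; the `BaselineRun` record, its field names, and the task identifiers are hypothetical and not taken from the benchmark. It covers only the time-budget condition, not the judgment about problem class or instance structure.

```python
from dataclasses import dataclass

TIME_BUDGET_S = 3600.0  # the one-hour Gurobi budget stated in the text


@dataclass
class BaselineRun:
    """Hypothetical record of one baseline Gurobi run (illustrative fields)."""
    task_id: str
    proved_optimal: bool  # whether the solver proved optimality
    wall_time_s: float    # wall-clock time the solver used, in seconds


def is_gurobi_saturated(run: BaselineRun, budget_s: float = TIME_BUDGET_S) -> bool:
    """True when optimality was not proved within the time budget."""
    return not (run.proved_optimal and run.wall_time_s <= budget_s)


def hard_candidates(runs: list[BaselineRun]) -> list[str]:
    """Return the ids of tasks whose baseline run is saturated, in input order."""
    return [r.task_id for r in runs if is_gurobi_saturated(r)]


# Made-up runs for illustration only; these are not benchmark results.
runs = [
    BaselineRun("task_a", True, 12.5),     # solved quickly: not a candidate
    BaselineRun("task_b", False, 3600.0),  # hit the budget without a proof
    BaselineRun("task_c", True, 4000.0),   # proved optimal, but past the budget
]
print(hard_candidates(runs))
```

Under these assumptions, a task counts as saturated either because the solver stopped without an optimality proof or because the proof arrived after the budget expired. How the benchmark itself records and applies the criterion may differ.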