FrontierOR

Seven LLMs are evaluated one-shot on FrontierOR Full (n=180) and Hard (n=50). The strongest model matches Gurobi on both quality and time in only 31% of cases. Three self-evolving frameworks, with GPT-5.3-Codex as the backbone, raise this rate to 50% on selected Hard tasks.



Metric columns: Exec. rate, Feasibility, Sol. quality, and QTE.

FrontierOR Full (n=180)

One-shot performance of different LLMs. Ranked by QTE.

Performance on Hard Set

FrontierOR Hard (n=50)

The Hard subset comprises 50 tasks whose problem class or instance structure is computationally demanding and on which Gurobi fails to reach optimality within a one-hour budget. It forms a Gurobi-saturated tail of the full benchmark.
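The Gurobi half of that criterion can be illustrated with a minimal sketch. It assumes each task has a record of one Gurobi run; the `GurobiRun` type, its field names, and the `gurobi_unsolved` helper are hypothetical and not part of FrontierOR. The sketch covers only the "fails to reach optimality within one hour" condition. Whether a task's problem class or instance structure is computationally demanding is a separate judgment that it does not model.

```python
from dataclasses import dataclass

ONE_HOUR_SECONDS = 3600.0  # the one-hour Gurobi budget named in the text


@dataclass(frozen=True)
class GurobiRun:
    """Hypothetical record of one Gurobi run on one task."""
    task_id: str            # assumed identifier, for illustration only
    proved_optimal: bool    # True if Gurobi proved optimality
    runtime_seconds: float  # wall-clock time Gurobi used


def gurobi_unsolved(runs, budget=ONE_HOUR_SECONDS):
    """Return IDs of tasks where Gurobi did not prove optimality within the budget."""
    return [
        run.task_id
        for run in runs
        if not (run.proved_optimal and run.runtime_seconds <= budget)
    ]


# Illustrative records only; these are not benchmark results.
example_runs = [
    GurobiRun("task-a", proved_optimal=True, runtime_seconds=12.5),
    GurobiRun("task-b", proved_optimal=False, runtime_seconds=3600.0),
    GurobiRun("task-c", proved_optimal=True, runtime_seconds=4100.0),
]
hard_candidates = gurobi_unsolved(example_runs)
```

Here `hard_candidates` holds the tasks that would pass the Gurobi filter under these assumptions. A run that proves optimality only after the budget has elapsed is treated as unsolved.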