I am also working in OR and I have had the complete opposite experience with MILP optimization (and the research actually agrees: a big survey paper published earlier this year showed LLMs were mostly correct on textbook problems but got more and more useless as complexity and novelty increased).
The results are boilerplate at best, and misleading and insidious at worst, especially when you get into detailed tasks. Ever try asking an LLM what a specific constraint does, or worse, to explain the mathematical model behind some proprietary CPLEX syntactic sugar? It hallucinates the math, the syntax, the explanation, everything.
Can you point me to that paper? What version of the model were they using?
Have you tried again with the latest LLMs? ChatGPT-4 actually (correctly) explains what each constraint does in English -- it doesn't just provide the constraint when you ask it for the formulation. Also, I'm not sure CPLEX should be involved at all -- I usually just ask for mathematical formulations, not CPLEX calling code (I don't use CPLEX). The OR literature consists primarily of math formulations, and that's where LLMs can best pattern-match to problem shape.
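To illustrate what I mean by "problem shape": ask for, say, uncapacitated facility location, and a good answer is the standard textbook form (shown here purely as an example, not any particular model's output):

    \min \sum_j f_j y_j + \sum_{i,j} c_{ij} x_{ij}
    \text{s.t.} \quad \sum_j x_{ij} = 1 \;\; \forall i, \qquad x_{ij} \le y_j \;\; \forall i,j, \qquad x_{ij}, y_j \in \{0,1\}

and the newer models will also walk through what each constraint means in English (every customer assigned to exactly one facility, assignments only to open facilities, and so on).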
I was referring to section 4 of A Survey for Solving Mixed Integer Programming via Machine Learning (2024): https://arxiv.org/pdf/2401.03244.
I've heard (but not so much observed) that there is a substantial difference between recent models, so it's possible they are better than when this was written.
Anyway, CPLEX has an associated modeling language whose syntactic sugar obscures the underlying MILP that it solves. I find LLMs essentially unable even to attempt recovering the MILP from that language.
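To give a simplified, illustrative example of the kind of sugar I mean: an indicator constraint such as

    x = 1 \;\Rightarrow\; a^\top y \le b

has no direct MILP form; one equivalent MILP form the solver can silently substitute is the big-M constraint

    a^\top y \le b + M(1 - x), \quad x \in \{0,1\}

where M must be a valid upper bound on a^T y - b. Asking an LLM to recover that hidden reformulation (let alone a safe M) from the sugared syntax is exactly where it falls over.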
PS: How is Xpress? Is there some reason to prefer it to Gurobi or Mosek?
Thanks for sharing that, I appreciate it. It looks like they used open-source Llama models, which are not great. I tested those models offline using Ollama, and outside of being character chatbots they weren't very good at much (the only models that give good answers are Sonnet 3.5 and ChatGPT-4). However, the paper's conclusion is essentially correct even for state-of-the-art models:
"Overall, while LLM made several errors, the provided formulations can serve as a starting point for OR experts to create mathematical models. However, OR experts should not rely on LLM to accurately create mathematical models, especially for less common or complex problems. Each output needs to be thoroughly verified and adjusted by the experts to ensure correctness and relevance."
I wouldn't recommend that anyone inexperienced use LLMs to create entire models from scratch, but rather as a search tool for specific formulations, which are then verified and plugged into a larger model. Used that way, it works really well. As a MIP modeler, I have an intuition for the shape of the answer, so even when ChatGPT makes mistakes I know how to extract the correct bits, and it still saves me a ton of time.
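As a sketch of that "verify, then plug in" workflow (hypothetical, using PuLP and 0-1 knapsack as stand-ins, not my actual setup): implement the formulation the LLM hands you on tiny instances and cross-check it against exhaustive enumeration before trusting it inside the larger model.

    import itertools
    import random

    import pulp

    def knapsack_mip(values, weights, capacity):
        # The (hypothetically LLM-suggested) 0-1 knapsack formulation.
        n = len(values)
        prob = pulp.LpProblem("knapsack", pulp.LpMaximize)
        x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
        prob += pulp.lpSum(values[i] * x[i] for i in range(n))
        prob += pulp.lpSum(weights[i] * x[i] for i in range(n)) <= capacity
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return pulp.value(prob.objective)

    def knapsack_brute(values, weights, capacity):
        # Ground truth by enumeration -- only viable for tiny n.
        best = 0
        for picks in itertools.product([0, 1], repeat=len(values)):
            if sum(w * p for w, p in zip(weights, picks)) <= capacity:
                best = max(best, sum(v * p for v, p in zip(values, picks)))
        return best

    # Cross-check the formulation on random tiny instances.
    for _ in range(100):
        n = random.randint(1, 8)
        v = [random.randint(1, 20) for _ in range(n)]
        w = [random.randint(1, 10) for _ in range(n)]
        c = random.randint(5, 30)
        assert abs(knapsack_mip(v, w, c) - knapsack_brute(v, w, c)) < 1e-6

If the formulation disagrees with brute force on small cases, it's wrong; if it agrees everywhere, you've at least ruled out the cheap hallucinations before it touches the real model.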
The CPLEX API doesn't have a lot of good examples out in the wild, so I don't expect the training data to be good. I've always used CPLEX through a modeling language like AMPL, and even AMPL code is rare, so I can't expect an LLM to decipher any of it. MIP formulations, on the other hand, abound in PDFs of journal publications.
In the vibes department, I feel Xpress is second to Gurobi and CPLEX, but it does the job just fine. It's been a while since I used CPLEX and Gurobi, though, so I have no recent points of comparison (corporate licensing is prohibitively expensive).
I had the same experience with computational geometry.
Very good at giving a textbook answer ("give a Python/NumPy function that returns the Voronoi diagram of a set of 2D points").
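For reference, the textbook answer it nails is essentially a one-liner around SciPy:

    import numpy as np
    from scipy.spatial import Voronoi

    def voronoi_2d(points):
        # Voronoi diagram of 2D points -- the kind of request an LLM gets right.
        return Voronoi(np.asarray(points, dtype=float))

    vor = voronoi_2d([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]])
    print(vor.vertices)        # coordinates of the Voronoi vertices
    print(vor.ridge_vertices)  # vertex index pairs delimiting each ridge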
Now I ask for the Laguerre diagram, a variation that is not mentioned in textbooks but very useful in practice. I can spend a lot of time spoon-feeding it the answer, and I still just get bullshitting-student answers.
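(For the record, the standard construction it never finds: lift each weighted point to 3D and take the lower convex hull. A rough sketch, assuming SciPy and ignoring degenerate inputs:)

    import numpy as np
    from scipy.spatial import ConvexHull

    def regular_triangulation(points, weights):
        # Weighted Delaunay triangulation (dual of the Laguerre diagram) via
        # the lifting map: (x, y) with weight w goes to (x, y, x^2 + y^2 - w);
        # the lower hull of the lifted points projects to the triangulation.
        pts = np.asarray(points, dtype=float)
        lifted = np.column_stack([pts, (pts ** 2).sum(axis=1) - np.asarray(weights)])
        hull = ConvexHull(lifted)
        # keep facets whose outward normal points downward (the lower hull)
        return hull.simplices[hull.equations[:, 2] < 0]

    def power_center(p, w):
        # Laguerre-cell vertex of one triangle: solve the 2x2 system
        # 2 (p_j - p_0) . c = (|p_j|^2 - w_j) - (|p_0|^2 - w_0), j = 1, 2.
        a = 2.0 * (p[1:] - p[0])
        b = (p[1:] ** 2).sum(axis=1) - w[1:] - ((p[0] ** 2).sum() - w[0])
        return np.linalg.solve(a, b)

    pts = np.random.rand(20, 2)
    wts = np.random.rand(20) * 0.05
    cells = [power_center(pts[t], wts[t]) for t in regular_triangulation(pts, wts)]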
I tried other problems, like numerical approximation and physics simulation: same experience.
I don't get the hype. Maybe it's good at giving variations of glue code, i.e. Stack Overflow meets autocomplete? As a search tool it's bad because it's so confidently incorrect that you may be fooled by bad answers.