2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
I feel like we need to move on from running the same tests on models. As time goes on, information about these specific tests ends up in the training data, and while I'm not saying that's happened here, there's nothing stopping model developers from adding extra data for these tests directly into the training set to make their models seem better than they are.
Honestly, I have mixed feelings about him appearing there. His blog posts are a nice way to be updated about what's going on, and he deserves the recognition, but he's now part of their marketing content. I hope that doesn't make him afraid of speaking his mind when talking about OpenAI's models. I still trust his opinions, though.
Exactly. Too many videos, too little real data / benchmarks on the page. Will wait for the vibe check from simonw and others.