How up to date are you on current open-weights models? After playing around with it for a few hours, I find it nowhere near as good as Qwen3-30B-A3B. The world knowledge is severely lacking in particular.
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something, but since they didn't bother to put it in the model card, I'm not going to bother fussing with it.
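If anyone wants to rule out the sampling settings, here is a minimal sketch of overriding the temperature yourself, assuming the model is served through an OpenAI-compatible local endpoint (the base_url, api_key, model name, and question below are placeholders, not anything from the model card):

    # Minimal sketch: query a locally served model with explicit sampling
    # settings via an OpenAI-compatible endpoint (e.g. llama.cpp's server or
    # LM Studio). base_url, api_key, and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # whatever name your local server exposes
        messages=[{"role": "user", "content": "your trivia question here"}],
        temperature=0.7,      # try a few values; lower tends to help factual recall
        top_p=0.9,
    )
    print(resp.choices[0].message.content)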
I think your example reflects well on oss-20b, not poorly. It may show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.
> gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
To be fair, this is not the type of question that benefits from reasoning: either the model has this info in its parametric memory or it doesn't. Reasoning won't help.
Not true:
During World War II the Imperial Japanese Navy referred to Midway Island in their communications as “Milano” (ミラノ). This was the official code word used when planning and executing operations against the island, including the Battle of Midway.
Right... knowledge is one of the things (the one thing?) that LLMs are really horrible at, and that goes double for models small enough to run on normal-ish consumer hardware.
Shouldn't we prefer to have LLMs just search and summarize more reliable sources?
Even large hosted models fail at that task regularly. It's a silly anecdotal example, but I asked the Gemini assistant on my Pixel whether [something] had seen a new release to match the release of [upstream thing].
It correctly chose to search, and pulled in the release page itself as well as a community page on reddit, and cited both to give me the incorrect answer that a release had been pushed 3 hours ago. Later on when I got around to it, I discovered that no release existed, no mention of a release existed on either cited source, and a new release wasn't made for several more days.
Reliable sources that are becoming polluted by output from knowledge-poor LLMs, or overwhelmed and taken offline by constant requests from LLMs doing web scraping …
This is a little misleading. The data they quote is based on their previous article[1], which just uses this analysis[2] provided by a VC company. Funnily enough, the same VC company put out a separate clickbaitish article just a year before that one, claiming the exact opposite findings (about startups ditching SV).
I would guess a lot of these annual trends are just random fluctuations in their dataset, though to be honest I wonder how they're even trying to estimate this kind of information.
The corporate politics at Meta is the result of Zuck's own decisions. Even in big tech, Meta is (along with Amazon) rather famous for its highly political and backstabby culture.
This is because these two companies have extremely performance-review-oriented cultures where results need to be proven every quarter or you become grounds for a layoff.
Labs known for being innovative all share the same trait of allowing researchers to go YEARS without high impact results. But both Meta and Scale are known for being grind shops.
Can't upvote this enough. From what I saw at Meta, the idea of a high-performance culture (which I generally don't have an issue with) found its ultimate form and became performance-review culture. Almost every decision made was filtered through "but how will this help me during the next review". If you ever wonder about some of the moves you see at Meta, perf review optimization was probably at the root of it.
I may or may not have worked there for 4 years and may or may not be able to confirm that Meta is one of the most poorly run companies I've ever seen.
They are, at best, 25-33% efficient at taking talent+money and turning it into something. Their PSC process creates the wrong incentives, they either ignore or punish the type of behavior you actually want, and talented people either leave (especially after their cliff) or are turned into mediocre performers by Meta's awful culture.
Beyond that, the leaders at Facebook are deeply unlikeable, well beyond the leaders at Google, which is not a low bar. I know more people who reflexively ignore Facebook recruiters than who ignore recruiters from any other company. With this announcement, they have found a way to make that problem even worse.
Interesting that "high-impact" on the one hand, and innovative/successful in the marketplace on the other, should be at odds at Meta. Makes one wonder how they measure impact.
It doesn't matter much how they measure, as long as it's empirical. Once they state the scoring system, all the work that scores well gets done, and the work that resists measurement does not get done.
The obvious example was writing eng docs. It was probably the single most helpful thing you could do with your time, but there was no way to get credit, because we couldn't say exactly how much time your docs might have saved others (the quantifiable impact from your work). That meant we only ever developed a greater and greater unfilled need for docs, while it only ever got riskier and riskier for your career to try to dive into that work.
People were split on how to handle this. Some said, "do the work that most needs doing and the perf review system will work it out long term." Others said, "just play the perf game to win."
I listened to the first group because I'm what you'd call a "believer." In a tech role I think my responsibility is primarily to users. I was let go (escorted off campus) after bottoming out a stack ranking during a half in which (I think) I did a lot of great work for the company, but utterly failed to get a good score by the rules of the perf game. Specifically, I missed the deadline to land a very large PR, so most of my work for the half failed the key criterion for perf review: it had to be *landed* impact.
I think I took it graciously, but also I will never think of these companies as a home or a family again.
The US has crashed its own stock market, tanked its own government's approval ratings, and had its own business leaders speak out against the government. This definitely does not increase leverage.
Was the plan to incentivise Korea, China and Japan to work together? You know you've screwed up when bitter rivals decide to cooperate.
Every government that has to do something undesirable to its citizens has been given something to blame: a beautiful excuse that will be milked relentlessly, to the disadvantage of the US.
Businesses get screwed by flaky management - and now we get to see the US get screwed by its unreliable self-centered Prez.
Even countries like Singapore, hit with a 10% tariff (marginal, and far lower than most countries), absolutely panicked.
Interestingly, I looked up the size of consumer markets - the US is twice the size of the next biggest market, the EU.
So I don’t blame countries for panicking.
And if you’re really conspiratorial, one could question whether the push for free trade was really an attempt to put the US into a situation where it had leverage against most of the globe.
Even with the bounce back, the Dow Jones lost 400 points. For comparison, the German DAX lost 900 points, but is only half the size. It is a reckless and probably counter-productive strategy, but I do think Trump will be able to extract short-term benefits from a number of countries. I just hope the EU will finally decrease reliance on the US going forward. Our extreme reliance on US tech and defense makes us easy marks.
I'm very skeptical of this; the paper they linked is not convincing. It says that GPT-4 is correct at predicting the direction of an experiment's outcome 69% of the time, versus 66% of the time for human forecasters. But this is a silly benchmark, because people don't trust human forecasters in the first place; that's the whole reason the experiment is run. Knowing that GPT-4 is slightly better at predicting experiments than some human guessing doesn't make it a useful substitute for the actual experiment.
Well, they looked at papers that weren't published as of the original model release. But GPT very likely had unannounced model updates. Is it not possible that many of the post-2021 papers were in the training data of the version of GPT they actually worked with?
Furthermore, there’s a replication crisis in social sciences. The last thing we need is to accumulate less data and let an LLM tell us the “right” answer.
Predicting the actual results of real unpublished experiments with a 0.9 correlation is a very non-trivial result. The comparison with human forecasts is not the central finding.
I totally agree. So many people are missing the point here.
Also important is that in Psychology/Sociology, it's the counter-intuitive results that get published. But these results disproportionately fail to replicate!
Nobody cares if you confirm something obvious, unless it touches on something divisive (e.g. sexual behavior, politics) or there is an agenda (dieting, etc.). So people can predict those results more easily than they could predict a randomly generated premise. The ones that made their way into the prediction set were the ones researchers expected to be counter-intuitive (and likely P-hacked a significant proportion of them to find that result). People know this (there are more positive, confirming papers than negative/fail-to-replicate ones).
This means the counter-intuitive, negatively forecast results are the ones that get published, i.e. the dataset behind the 66% human-forecaster figure is disproportionately built from studies that found counter-intuitive results, compared to the overall neutral pre-publication pool of studies, because scientists and grant winners are incentivised to publish counter-intuitive work. I would even suggest the selected studies are more tantalizing than average; in most of these studies they are key findings, rather than the minutiae of comments on methods or re-analysis.
By the way, the 66% result has not held up super well in other research; for example, only 58% could predict whether papers would replicate later on: https://www.bps.org.uk/research-digest/want-know-whether-psy... Results with random people show that they are better than chance for psychology, but on average below 66% and with massive variance. The figure doesn't differ for psychology professors, which should tell you the stat reflects the context of the field and its research apparatus more than any capability to predict research. What if we revisit this GPT-4 paper in 20 years, see which results have replicated, and ask people to predict that? Will GPT-4 still be higher if its data is frozen today? If it is kept up to date? Will people hit 66%, 58%, or 50%?
My point is, predicting the results now is not that useful because historically, up to "most" of the results have been wrong anyhow. Predicting which results will be true and remain true would be more useful. The article tries to dismiss the issue of the replication crisis by avoiding it and by using pre-registered studies, but such tools are only bandages. Studies still get cancelled, or never proposed after internal experimentation; we don't have a "replication reputation meter" to measure those (which affect and increase false-positive results), and we likely never will with this model of science for psychology/sociology statistics. If the authors read my comment and disagree, they should collect predictions from GPT-4 and humans for replications currently underway, wait a few years for the results, and then conduct the analysis.
Also, more to the point, as a grant-funded Psychology researcher once told me, the way to get a grant in Psychology is to:
1) Acquire a counter-intuitive result first. Quick'n'dirty research method such as students filling in forms, small sample size, not even published, whatever. Just make the story good for this one and get some preliminary numbers on some topic by casting a big web of many questions (a few will get P < 0.05 by chance in most topics anyway at this sample size; a quick simulation below illustrates this).
2) Find an angle whereby said result says something about culture or development (e.g. "The Marshmallow experiment shows that poverty is already determined by your response to tradeoffs at a young age", or better still, "The Marshmallow experiment is rubbish because it's actually entirely explained by SES as a third factor, and wealth disparity in the first place is ergo the cause"). Importantly, change the research method to something more "proper" and instead apply P-hacking, if possible, when you actually carry out the research. The biggest P-hack is so simple and obvious nobody cares: you drop results that contradict or are insignificant and just don't report them, carrying out alternate analyses, collecting slightly different data, switching from online to in-person experiments, whatever you can to get a result.
3) On the premise of further tantalizing results, propose several studies which can fund you over 5 years, and apply some of the buzzwords of the day. Instead of "Thematic Analysis", it's "AI Summative Assessment" for the word-frequency counts, etc. If you know the grant judges, avoid contradicting whatever they say, but be just outside the dogma enough (usually, culturally) to represent the movement/progress of "science".
This is how 99% of research works. The grant holder directs the other researchers. When directing them to carry out an alternate version of the experiment or to change what is being analyzed, you motivate them that it's for the good of the future, of society, of being at the cutting edge, and of supporting the overarching theory (which of course already has "hundreds" of supporting studies constructed in the same fashion).
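To make the "big web of many questions" point concrete, here is a quick simulation (mine, purely illustrative): with ~30 unrelated items, a small sample, and no real effects anywhere, most runs still produce at least one p < 0.05 "finding".

    # Simulate surveys with 30 unrelated questions, 20 respondents per group,
    # and no true effect anywhere; count how often chance alone still yields
    # at least one "significant" item at p < 0.05.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_runs, n_questions, n_per_group = 1000, 30, 20

    hits = 0
    for _ in range(n_runs):
        a = rng.normal(size=(n_questions, n_per_group))
        b = rng.normal(size=(n_questions, n_per_group))
        pvals = ttest_ind(a, b, axis=1).pvalue
        hits += (pvals < 0.05).any()

    print(f"runs with at least one 'significant' item: {hits / n_runs:.0%}")
    # roughly 1 - 0.95**30, i.e. close to 80% of runs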
As to sociology/psychology experiments -
Do social experiments represent language and culture more than people and groups? Randomly.
Do they represent what would be counter-intuitive or support developing and entrenching models and agendas? Yes.
90% of social science studies have insufficient data to say anything at the P < 0.01 level, which should realistically be our goal if we even want to do statistics with the current dogma for this field (said kindly, because some large datasets are genuine enough and are used for several studies to make up the numbers in the 10%). I strongly expect a revolution in psychology/sociology within the next 50 years to redefine a new basis.
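For a sense of scale (my numbers, assuming a smallish effect of Cohen's d = 0.3 and 80% power), a standard power calculation shows what P < 0.01 actually demands:

    # Required sample size for a two-group comparison at alpha = 0.01,
    # assuming a smallish effect (Cohen's d = 0.3) and 80% power.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.01, power=0.8)
    print(round(n_per_group))  # roughly 260 per group, i.e. 500+ participants total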
Even considering a historic bias for counter-intuitive results in social science, this has no bearing on the results of the paper being discussed. Most of the survey experiments that the researchers used in their analyses came from TESS, an NSF-funded program that collects well-powered, nationally representative samples for researchers. A key thing to note here is that not every study from TESS gets published. Of course, some do, but the researchers find that GPT4 can predict the results of both published and unpublished studies at a similar rate of accuracy (r = 0.85 for published studies and r = 0.90 for unpublished studies). Also, given that the majority of these studies 1) were pre-registered (even pre-registering sample size), 2) had their data collected through TESS (an independent survey vendor), and 3) were well-powered and nationally representative, it is extremely unlikely that they were p-hacked. Therefore, regardless of what the researchers hypothesized, TESS still collected the data, and the data is of the highest quality within social science.
Moreover, the researchers don't just look at psychology or sociology studies; there are studies from other fields like political science and social policy, for example, so your critiques about psychology don't apply to all the survey experiments.
Lastly, the study also includes a number of large-scale behavioral field experiments and finds that GPT4 can accurately predict the results of these field experiments, even when the dependent variable is a behavioral metric and not just a text-based response (e.g., figuring out which text messages encourage greater gym attendance). It's hard for me to see how your critique works in light of this fact also.
Yes, and I am sure you would have said the same about the research before 2011 and the replication crisis, when it was always claimed that scientists like Bem (premonition) and Baumeister (ego depletion) could not possibly be faking their findings: they contributed so much, their models have "theoretical validity", they had hundreds of studies and other researchers building on their work! They had big samples. Regardless of TESS/NSF, the studies it focuses on have been funded (as you mention) and they were simply not chosen randomly. People had to apply for grants. They had to bring in early, previous, or prototype results to convince people to fund them.
The points specific to psychology apply to most fields in the soft sciences with their typical research techniques.
The main point is that prior research shows absolutely no difference between field experts and random people in predicting the results of studies, whether pre-registered, replications, or others.
GPT-4 achieving approximately the same success rate as any person has nothing whatsoever to do with it simulating people. I suspect an 8-year-old could reliably predict psychology replications after 10 years with about the same accuracy. It's also key that in prior studies, like the one I linked, this same lack of difference occurred even when the people involved were provided additional recent resources from the field, albeit with higher prediction accuracy overall.
The meat of the issue is simple: show me a true positive study, make the predictions on whether it will replicate, and let's see in 10 years, when the replication efforts have been carried out, whether GPT-4 is any higher than a random 10-year-old who has no information on the study. The implied claim here is that since GPT-4 can supposedly simulate sociology experiments and thereby judge the results more accurately, we can iterate on it and eventually conduct science that way or speed up the scientific process. I am telling you that the simulation aspect has nothing to do with the success of the algorithm, which is not really outperforming humans, because, to put it simply, humans are bad at using any subject-specific or case knowledge to predict the replication/success of a specific study (there is no difference between lay people and experts), and the entire set of published work is naturally biased anyhow. In other words, this style may elicit higher test-score results simply by altering the prompt.
The description of GPT-4's role here as simulating is a human theoretical construction. We know that people with a knowledge advantage are not able to apply it to predicting outcomes any more accurately than lay people. That is because they are trying to predict a biased dataset. The field of sociology as a whole, like most fields that study humans (because they are vastly underfunded for large samples), struggles to replicate or to conduct science in a reliable, repeatable way, and until we resolve that, the claims about GPT-4 simulating people are spurious and unrelated at best, misleading at worst.
I'm not sure how to respond to your point about Bem and Baumeister's work since those cases are the most obvious culprits for being vulnerable to scientific weakness/malpractice (in particular, because they came before the time of open access science, pre-registration, and sample sizes calculated from power analyses).
I also don't get your point about TESS. It seems obvious that there are many benefits to choosing the repository of TESS studies from the authors' perspective. Namely, it conveniently allows for a consistent analytic approach since many important things are held constant between studies, such as 1) the studies have the exact same sample demographics (which prevents accidental heterogeneity in results due to differences in participant demographics) and 2) the way in which demographic variables are measured is standardized, so that the only difference between survey datasets is the specific experiment at hand (this is crucial because variation in how demographic variables are measured can affect the interpretation of results). This is apart from the more obvious benefits that the TESS studies cover a wide range of social science fields (like political science, sociology, psychology, communication, etc., allowing for the testing of robustness in GPT predictions across multiple fields) and all of the studies are well-powered, nationally representative probability samples.
Re: your point about experts being equal to random people in predicting results of studies, that's simply not true. The current evidence on this shows that, most of the time, experts are better than laypeople when it comes to predicting the results of experiments. For example, this thorough study (https://www.nber.org/system/files/working_papers/w22566/w225...) finds that the average of expert predictions outperforms the average of laypeople predictions. One thing I will concede here though is that, despite social scientists being superior at predicting the results of lab-based experiments, there seems to be growing evidence that social scientists are not particularly better than laypeople at predicting domain-relevant societal change in the real world (e.g., clinical psychologists predicting trends in loneliness) [https://www.cell.com/trends/cognitive-sciences/abstract/S136... ; full-text pdf here: https://www.researchgate.net/publication/374753713_When_expe...]. Nonetheless, your point about there being no difference in the predictive capabilities of experts vs. laypeople (which you raise multiple times) is just not supported by any evidence since, especially in the case of the GPT study we're discussing, most of the analyses focus on predicting survey experiments that are run by social science labs.
Also, based on what the paper is suggesting, the authors don't seem to be claiming that these are "replications" of the original work. Rather, GPT4 is able to simulate the results of these experiments like true participants. To fully replicate the work, you'd need to do a lot more (in particular, you'd want to do 'conceptual replications' wherein the underlying causal model is validated but now with different stimuli/questions).
Finally, to address the previous discussion about the authors finding that GPT4 seems to be comparable to human forecasters in predicting the results of social science experiments, let's dig deeper into this. In the paper, but specifically in the supplemental material, the authors note that they "designed the forecasting study with the goal of giving forecasters the best possible chance to make accurate predictions." The way they do this is by showing laypeople the various conditions of the experiment and having the participants predict where the average response for a given dependent variable would fall within each of those conditions. This is very different from how GPT4 predicts the results of experiments in the study. Specifically, they prompt GPT to be a respondent and do this iteratively (feeding it different demographic info each time). The result of this is essentially the same raw data that you would get from actually running the experiment. In light of this, it's clear that this is a very conservative way of testing how much better GPT is than humans at predicting results, and they still find comparable performance. All that said, what's so nice about GPT being able to predict social science results just as well as (or perhaps better than) humans? Well, it's much cheaper (and more efficient) to run thousands of GPT queries than it is to recruit thousands of human participants!
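For readers wondering what "prompt GPT to be a respondent and do this iteratively" might look like in practice, here is a rough sketch of the idea (my own illustration; the prompts, demographic fields, and scales below are invented, not the authors' actual materials):

    # Rough persona-style respondent simulation, in the spirit of what the
    # paper describes. All prompts and fields below are illustrative only.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    personas = [
        {"age": 34, "gender": "woman", "education": "high school diploma", "region": "Midwest"},
        {"age": 61, "gender": "man", "education": "college degree", "region": "South"},
        # ...one entry per simulated respondent, drawn to match the sample frame
    ]

    stimulus = "Read the message below, then rate your agreement from 1 to 7: ..."

    answers = []
    for p in personas:
        prompt = (
            f"You are a survey respondent: a {p['age']}-year-old {p['gender']} "
            f"from the {p['region']} with a {p['education']}. "
            f"Answer with a single number from 1 to 7.\n\n{stimulus}"
        )
        r = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sampling variation stands in for respondent variation
        )
        answers.append(r.choices[0].message.content.strip())

    # Per-condition means of these simulated answers are what get compared
    # against the real experimental results.
    print(answers)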
Fair enough, you might indeed have rejected those authors; however, vast swathes of the field (for Baumeister, the majority) did not at the time. It's almost certainly true now for existing authors we are yet to identify, or maybe never will.
I admit the point on TESS, I didn't research that enough. I'll look into that at a later point as I have an interest in learning more.
To address your studies regarding expert study forecasting: thank you for sharing some papers. I had time and knew some papers in the area, so I have formulated a response, because, as you allude to later regarding cultural predictions, there is debate about the usefulness of expert vs. non-expert forecasts (e.g. there is a wide base of research on recession/war predictions showing the error rate is essentially random beyond a certain number of years out). I have not fully comprehended the first paper, but I understand the gist of it.
Economics bridges sociology and the harder science of mathematics, and I do think it makes sense for it to be more predictable by experts than psychology studies (and note the studies being predicted were not survey-response, as most in psychology are), but even this one paper does not particularly support your point. Critically, among the conclusions of the paper you cite are that "Forecasters with higher vertical, horizontal, or contextual expertise do not make more accurate forecasts", "If forecasts are used just to rank treatments, non-experts, including even an easy-to-recruit online sample, do just as well as experts", and "Fourth, experts as a group do better than non-experts, but not if accuracy is defined as rank ordering treatments." Also: "The experts are indistinguishable with respect to absolute forecast error, as Column 7 of Table 4 also shows... Thus, various measures of expertise do not increase accuracy". Critically, at a glance, almost 40% of the selected statements are outperformed by non-experts anyhow in Table 2 (the last column). I also question the use of MTurk workers as lay people (because of historic influences of language and culture on IQ tests, the lay-person group would be better being at least geographically or WEIRD-ly similar to the expert groups), but that's a minor point.
Another point: further domain information, simulation, or other tactics do not address the root issue of the biased dataset of published papers. "Sixth, using these measures we identify 'superforecasters' among the non-experts who outperform the experts out of sample." Might we be in danger, with some of the claims being made 8 years later about LLMs, of the very thing the paper warns against: "As academics we know so little about the accuracy of expert forecasts that we appear to hold incorrect beliefs about expertise and are not well calibrated in our accuracy."?
I know what you are getting at: these are not replications, and it feels elementally exciting that GPT-4 could simulate a study taking place, rather than a replication as such, and determine the result more accurately than a human forecast. But what I am saying is that, historically, we have needed replication data to assess whether human forecasts (expert and non-expert) are correct long term anyhow, and we need those predictions to be for future or current replications, to avoid the training data including the results, before we can draw any conclusion about GPT-4's accuracy in forecasting results by any method, simulation or direct answer. The idea that it is cheaper to run GPT queries than recruit human participants makes me wonder if you are actively trolling, though; you can't be serious? These are fields in which awful statistics and research go on all the time, awaiting an evolution to a better basic method, and this is a result that is 3 percentage points more accurate than a group of experts, when we don't even know whether those studies will replicate in the long run (and yes, even innocently pre-registered research tends to proliferate false positives, because the proportion of pre-registered studies that get published is not close to 100%, and thus the effects of false-positive publishing still occur: https://www.youtube.com/watch?v=42QuXLucH3Q).
The problem is that until we have more stable fundamentals, small increments and large claims about behaviour repeat the mistake of anthropomorphizing biological and computational systems before we understand them to the level we need to in order to make those claims. I am saying the future is bright in this regard: we will likely understand these systems better and one day be able to make these claims, or counter-claims. And that is exciting.
Now this is a separate topic/argument, but here is why I really care about these non-substantial but newsworthy claims: let's not jump the gun for credence. I read a PhD AI paper in 2011. It was the very furthest from making bold claims; people were so low-mooded about AI. That is because AI was pretty much at its lowest in 2011, especially with cuts after the recession. It was a cold part of the "AI winter". Now that AI is raring along at full speed, people overclaim. This will cause a new, third AI winter. Trust me, it will; so many members of faculty I know started feeling this way even back in 2020. It's harmful not only to the field but to our understanding, really, to do this.
From experience in payments/spending forecasting, I've found that deep learning models generally underperform gradient-boosted tree models. Deep learning models tend to be good at learning seasonality but do not handle complex trends or shocks very well. Economic/financial data tends to have straightforward seasonality with complex trends, so deep learning tends to do quite poorly.
I do agree with this paper - all of the good deep learning time series architectures I've tried are simple extensions of MLPs or RNNs (e.g. DeepAR or N-BEATS). The transformer-based architectures I've used have been absolutely awful, especially the endless stream of transformer-based "foundational models" that are coming out these days.
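As a concrete illustration of the kind of setup I'm describing, here is a minimal sketch with lag and calendar features feeding a gradient-boosted tree model (the data is synthetic and the feature choices are illustrative, not from any real payments series):

    # Minimal sketch: seasonal synthetic series with a trend shock, simple
    # lag/day-of-week features, and a gradient-boosted tree regressor.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingRegressor

    rng = np.random.default_rng(0)
    t = np.arange(730)                                # two years of daily data
    y = (10
         + 3 * np.sin(2 * np.pi * t / 7)              # weekly seasonality
         + 0.02 * t                                   # slow trend
         + np.where(t > 500, 5.0, 0.0)                # a level shock
         + rng.normal(scale=0.5, size=t.size))

    df = pd.DataFrame({"y": y, "dow": t % 7})         # day-of-week feature
    for lag in (1, 7, 14):
        df[f"lag_{lag}"] = df["y"].shift(lag)
    df = df.dropna()

    train, test = df.iloc[:-90], df.iloc[-90:]
    features = [c for c in df.columns if c != "y"]

    model = HistGradientBoostingRegressor(max_iter=300)
    model.fit(train[features], train["y"])
    pred = model.predict(test[features])
    print("MAE:", float(np.abs(pred - test["y"].to_numpy()).mean()))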
Transformers are just MLPs with extra steps, so in theory they should be just as powerful. The problem with transformers is simultaneously their big advantage: they scale extremely well with larger networks and more training data, better than any other architecture out there. So if you had enormous datasets and an unlimited compute budget, you could probably do amazing things in this regard as well. But if you're just a mortal data scientist without extra funding, you'll be better off with more traditional approaches.
I think what you say is true when comparing transformers to CNNs/RNNs, but not to MLPs.
Transformers, RNNs, and CNNs are all techniques to reduce parameter count compared to a pure-MLP model. If you took a transformer model and replaced each self-attention layer with a linear layer+activation function, you'd have a pure MLP model that can model every relationship the transformer does, but can model more possible relationships as well (but at the cost of tons more parameters). MLPs are more powerful/scalable but transformers are more efficient.
Compared to MLPs, transformers save on parameter count by skimping on the number of parameters devoted to modeling the relationships between tokens. This works in language modeling, where relationships between tokens aren't that important: you can jumble up the words in this sentence and it still mostly makes sense. It doesn't work in time series, where the relationships between tokens (timesteps) are the most important thing of all. The LTSF paper linked in the OP paper also mentions this same problem: https://arxiv.org/pdf/2205.13504 (see section 1)
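A back-of-the-envelope sketch of the parameter gap being described here (my numbers; biases and the feed-forward block are ignored): self-attention mixes positions with weights shared across all positions, while a dense layer over the flattened sequence must pay for every position pair explicitly.

    # Parameter count of one self-attention layer vs. one dense layer applied
    # to the flattened sequence, for illustrative sizes.
    seq_len, d_model = 512, 256

    attn_params = 4 * d_model * d_model           # Q, K, V, and output projections
    flat_dense_params = (seq_len * d_model) ** 2  # dense layer over the flattened input

    print(f"self-attention layer : {attn_params:,}")        # 262,144
    print(f"flattened dense layer: {flat_dense_params:,}")  # 17,179,869,184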
Though I agree with the idea that MLPs are theoretically more "capable" than transformers, I think seeing them just as a parameter reduction technique is also excessively reductive.
Many have tried to build deep and large MLPs for a long time, but at some point adding more parameters wouldn't increase models' performance.
In contrast, transformers became so popular because their modelling power just kept scaling with more and more data and more and more parameters. It seems like the 'restriction' imposed on transformers (the attention structure) is a very good functional form for modelling language (and, increasingly, some tasks in vision and audio).
They did not become popular because they were modest with respect to the parameters used.
>Compared to MLPs, transformers save on parameter count by skimping on the number of parameters
That is only correct if you look at models with equal parameter counts from a purely theoretical perspective. In practice, it is possible to train transformers to orders-of-magnitude bigger scales than MLPs because they are so much more efficient. That's why I said a modern transformer will easily beat these puny modern MLPs, but only in cases where data and compute budgets allow it. That is not even a question. If you look at recent time series forecasting leaderboard entries, you'll almost always see transformers at or near the top: https://github.com/thuml/Time-Series-Library
Transformers reduce the number of relationships between tokens that must be learned, too. An MLP has to separately learn all possible relationships between token 1 and 2, and 2 and 3, and 3 and 4. A transformer can learn relationships between specific values regardless of position.
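A small PyTorch check of that weight-sharing point (my own toy example): without positional encodings or a mask, self-attention applies the same weights at every position, so permuting the tokens simply permutes the output; a dense layer over the flattened sequence ties each weight to a specific position pair and has no such property.

    # Self-attention (no positional encoding, no mask) is permutation
    # equivariant: shuffling the tokens just shuffles the output.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    attn = nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True)

    x = torch.randn(1, 5, 16)                 # batch of 1, sequence of 5 tokens
    perm = torch.tensor([3, 0, 4, 1, 2])      # shuffle the token order

    out_x, _ = attn(x, x, x)
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

    print(torch.allclose(out_x[:, perm], out_perm, atol=1e-5))  # True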
I get what this piece is trying to say, but it's ignoring the fact that schools are trying to maximize learning with pupils who often don't want or care about learning (unlike with athletes or musicians who are generally learning their craft by choice).
A significant part of teaching disinterested students (not just in a grade school but in general) is about making the subject interesting enough that students will want to spend time on learning and continue to delve further in their free time.
If you're trying to teach someone web development, would you have them churn through a stack of predetermined bootcamp-style projects, or would you let them try to build something they have a personal interest in? I bet the latter method would turn out much better for the student in the long run.
As the gymbros say, motivation doesn't get results. Discipline does. Few children are going to be interested in learning all that they need to learn. Esp. when it comes to math. So you need to instill the discipline in them to do it -- good work and study habits, drilled into them until they... actually become habit.
That's the problem with, for example, A Mathematician's Lament. Lockhart is looking at the problem from the perspective of a seasoned mathematician, not a primary schooler without the requisite skills. He only got where he was in the field by memorizing his times tables and practicing elementary proofs until he could do them in his sleep. Only then, after having done the boring stuff, could he even begin to perceive the beauty and art in mathematics.
Gymbros are wrong. The hyperfocus on effectiveness is what stands in the way of healthy lifelong exercising. If your whole exercise regime is based on discipline, it will fail entirely the moment you have other stressors in your life, because then it becomes an obstacle to what you need rather than something that helps you be happy.
Gymbros are gymbros because the gym is their number-one priority, over everything else. They are already motivated. You know who used to actually end up exercising regularly? Guys who would meet for a soccer game regularly so that they could see friends. And people who actually like the sport they do.
I think it depends on the type of motivation. When I think back to my time at high school (Australia in the '90s), there was a contrast between how English was taught and how Math was taught.
In my English class, the teacher would assign the class a book, or a poem, etc.: take this home and read to the end of chapter X before next class. At the start of the next class, the teacher would pick half a dozen random students and ask them questions in front of the class about what we had been assigned to read. These weren't the kind of questions you could bluff an answer to.
Believe me you were motivated to do the readings because no one wanted to get called up to the front of the classroom and look like an idiot by not being able to answer the question. You were motivated by fear.
In Math, on the other hand, we were given a textbook and told to go home and do exercises from the book to practice what we'd been taught. It ran entirely on the honor system; no one checked to make sure we did the exercises, and as a result I know a large portion of the class didn't bother. I wonder what would have happened if the math teacher had called up random students to the front of the classroom and made them solve a problem on the blackboard at the start of each lesson.