I see this so often with academics. They're not developers and have only basic coding ability, so they make all kinds of basic mistakes, like loading a bunch of images and converting them all to float (IR images, for example) and then wondering where their memory went. I took some spaghetti code that took a couple of days to process thousands of IR images in batches, and wrote a version that ran in under a minute, simply because they didn't manage their memory at all.
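For a sense of scale, here's a minimal numpy sketch (the frame size is hypothetical): a 16-bit frame promoted to float64 takes four times the memory, so converting thousands of frames up front quadruples the working set.

import numpy as np

# Hypothetical 640x512 16-bit IR frame, purely for illustration.
frame = np.zeros((512, 640), dtype=np.uint16)
print(frame.nbytes)                     # 655360 bytes, ~0.6 MB
print(frame.astype(np.float64).nbytes)  # 2621440 bytes, ~2.6 MB, 4x as much

# Keeping frames in their native dtype (or float32, only where needed)
# and converting one batch at a time keeps the working set manageable.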
Even something as simple as opening CSV files quickly can have a massive effect.
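For example (a sketch; the file name and columns are made up), letting pandas parse in its C engine with dtypes declared up front is typically far faster than looping over lines in Python:

import pandas as pd

# Declaring dtypes skips the type-inference pass and keeps columns compact;
# parse_dates pushes timestamp parsing into the reader itself.
df = pd.read_csv("readings.csv",
                 dtype={"sensor_id": "int32", "value": "float32"},
                 parse_dates=["timestamp"])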
Lots of universities switched from FORTRAN to Matlab in the last decade, but many of the researchers who learned FORTRAN try to write Matlab like FORTRAN, with messy nested for loops and no knowledge of vectorisation.
I'm one of those people: I'm not actually a scientist, but I studied mathematics and build my own agent-based macroeconomic models (amongst other things). I am also a largely self-taught programmer, meaning my code is an ungodly spaghettified mess that can be used to scare small children.
I am very much aware that my ‘programs’ are of the “big untidy spaghetti script” variety, and I simply consider them a kind of work-in-progress, notepad, or prototype for later professional implementation. I also have a couple of professional programmers (cough ahem, coders, they desire to be referred to as coders) who are entirely adept at taking my horrid final ‘thing’ and converting it into production-quality code that doesn't take down the enterprise (with the same tools I use, incidentally: Mathematica, Python (>3.3) including SymPy and NumPy, and ABAP/SQL/Java for interfacing with the SAP ERP system).
The important part of this is that I am never under any illusion that something I toss together and get to ‘run’ is definitive code that can be put into production or used as-is, and that under no circumstances may the coders have any bright ideas about fudging the underlying mathematics.
> The important part of this is that I am never under any illusion that something I toss together and get to ‘run’ is definitive code that can be put into production or used as-is, and that under no circumstances may the coders have any bright ideas about fudging the underlying mathematics.
But there should be some cross-pollination between you and your coders? You must have learned ways to make your spaghetti code less cumbersome for them, and they have probably learned something about the underlying economics, so that they know what it is you're doing and how to keep everything aligned with your needs. Similarly, you're not reinventing the wheel each time, so you must be using tools developed by them more and more as time goes on?
Most definitely. We communicate and collaborate and to a certain degree cross-pollinate (mainly I learn, there’s not so much extra macroeconomic depth to add)... but what I was emphasising was the clarity about the separation of rôles.
You're spot on with vectorisation. It needs to be emphasised in academia. Take numpy/pandas: even for basic stuff, the efficiency difference vs a for loop is vast:
import numpy as np

million_a = np.random.rand(1_000_000)
million_b = np.random.rand(1_000_000)

def product_sum(a, b):
    total = 0.0
    for x, y in zip(a, b):  # x, y rather than shadowing the arguments
        total += x * y
    return total
%timeit product_sum(million_a, million_b)
215 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (million_a * million_b).sum()
6.01 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
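(For this particular reduction, `million_a @ million_b`, i.e. np.dot, should be faster still, since it avoids materialising the intermediate product array; the general point is to push loops down into the library.)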
How do you suggest they begin to bridge the gap between their entry-level skills and those of someone with four years of study and many years (I assume) of work experience?
Or would it be better to simply have a go to person to help create and optimize these programs?
The ‘go to person’ is probably the best bet; most researchers are willing to learn new things, so hopefully this would lead to improvements. I saw this formalised in a group at KTH, where one person's job was to help researchers and students get the most out of their computational resources.
A bigger problem is the stop-start nature of software development in academic research. Students and postdocs come and go, and tools are frequently abandoned to the next inexperienced student or kept frozen in time on that one old machine that still works. Many researchers are simply afraid of coding.
We recently had a student from my former lab interview with us, and he presented all his Ph.D. results captured and processed in a tool I wrote several years earlier. It had seen zero development, even though the comments throughout listed possible improvements. I doubt anyone even looked at the source code, because it just worked, and he had no idea I had written it, even though my name was in the header comment.
I was the “go to person” for my university's natural language processing research group. I built a database and an accompanying REST API for bulk-loading audio and transcription data for one of their projects. I was quite pleased with it.
When it came time for the researchers to submit transcripts, I had the pleasure of reviewing probably the worst Python program I have ever seen.
1) The request JSON was built manually using strings and string substitution. One immediate bug I saw was that the researcher had forgotten to wrap one of his keys in quotes: `{key: "val"}` is not valid JSON, of course.
2) The Python program did not actually make the web requests. It generated curl commands as strings and then printed them to stdout.
3) The researcher then took all these generated curl commands and evaled them.
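For contrast, the whole thing collapses to a few lines with standard tooling; a minimal sketch (the endpoint URL and payload here are hypothetical):

import requests

# Passing a dict via `json=` serialises it properly (keys quoted,
# Content-Type set), so the quoting bug from (1) can't happen.
payload = {"key": "val"}
resp = requests.post("https://nlp.example.edu/api/transcripts",
                     json=payload, timeout=30)
resp.raise_for_status()  # fail loudly instead of eval-ing curl strings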
I disagree. While non-vectorized operations are slow in Matlab, this is the first thing you'll find with a quick Google search. Every language has some performance pitfalls (ok, Matlab has many). Knowing the basics of how to write performant programs in your language of choice is a user problem.
I've seen plenty of non-performant C code. And that's one of the most performant languages you can code in, if you know what you're doing.
It isn't just badly done math loops that can cripple performance. Years ago, some users were complaining that it was taking forever to load their data into their analysis program. It turned out they were reading thousands of structs, one element at a time, with the Unix read(2) system call! I taught them about buffering, and the read time went down by a factor of ten or more; I forget the exact numbers.
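The pitfall is easy to reproduce; here's a minimal Python sketch (the record layout and helper names are made up) of per-record reads versus reading the file once and parsing in memory:

import os
import struct

RECORD = struct.Struct("<dI")  # hypothetical record: one double, one uint

# Slow: one read(2) syscall per record (the users were even issuing
# one read per *field*, which is worse still).
def load_unbuffered(path):
    fd = os.open(path, os.O_RDONLY)
    records = []
    try:
        while chunk := os.read(fd, RECORD.size):
            records.append(RECORD.unpack(chunk))
    finally:
        os.close(fd)
    return records

# Fast: read the whole file in one call, then parse in memory.
def load_buffered(path):
    with open(path, "rb") as f:
        return list(RECORD.iter_unpack(f.read()))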