I don't have much to say about the R ecosystem, as I am not actively using R or doing anything with it.
> Matloff is also upset about a commercial company swooping in to steal their precious, a common academic complaint (academics swooping in to steal ideas from commercially developed software is, of course, perfectly respectable).
This reads like some kind of sarcastic statement. In fact it is indeed perfectly respectable that academics do this, because it lets our society advance and research progress outside of proprietary software gardens. In my opinion any such action is perfectly respectable if it shares knowledge with the world. The other way around it is not OK, because from proprietarizing ideas, there is no benefit for society. If a company is worth its salt, it will offer services on top of existing public knowledge, without creating walled gardens. If it cannot do so, then it might be time to realize that the company should not exist.
What is not OK is if academics take that knowledge and make it seem like they were the ones who invented it. One should always give credit where credit is due. Never forget academic honesty. Usually, however, a lot of foundational research is done by academics, only to be picked up years later, often by commercial entities dealing in proprietary software.
> If a company is worth its salt, it will offer services on top of existing public knowledge, without creating walled gardens.
I contend that every profitable company leveraging public knowledge does offer services (broadly construed) on top of that knowledge, since otherwise no one would pay them. Maybe the minimal example is that in which the only service provided is marketing, but the range of service levels provided in practice runs the gamut. RStudio clearly goes well beyond that minimum.
In some cases, the services offered are so valuable that the company can create a walled garden - which reduces the net value of those services to the customer but provides the company with monopolistic pricing power. (It’s not clear to me that RStudio’s products have significant lock-in, for what it’s worth. R has significant lock-in, like any programming language, and RStudio benefits from the lack of competitors creating good R tooling, but that’s not really a walled garden of RStudio’s making.)
Academics might have their own grand plans for society, but funding agencies generally want industry to take over at some point and run with it. I expect they'd be happy to move on and fund something new rather than trying to compete with products that industry can provide.
Personally I think RStudio is far less clunky than, say, Jupyter. I still think Matlab is by far the best tool for engineering. I think having a company supporting the technology like that has generally been very beneficial. It's certainly possible they would just cause it to stagnate and milk it for cash as the technology dies out, but their business would die with it. Hence these examples have never stopped advancing.
> Academics might have their own grand plans for society, but funding agencies generally want industry to take over at some point and run with it.
Sure, but people can have different views about where that point should be. Documenting industry-relevant know-how makes it even easier for folks to take the result and run with it (providing the 'impact' that most funders will ultimately care about), so I agree with GP that it's very much a proper field for academic research.
>because from proprietarizing ideas, there is no benefit for society.
Except for a profit motive for companies, which then benefits society by lowering prices of products. If it weren't for private companies, none of us could afford a computer today.
It's competition that keeps prices low - if private companies were to properly proprietarize core ideas around computing, we'd still be stuck with room-sized computers.
Maintaining a market that's beneficial to society is an attempt to carefully tune interplay between many public and private concerns. From the point of view of society, the market is just a roundabout way of putting people on a treadmill and hanging carrots in front of them. We need companies to produce goods, provide services, and innovate better solutions, so we allow a degree of privileges and exclusivity for things that otherwise shouldn't be owned - like ideas. But we cannot allow companies to actually get the carrot and stop running - because it's the production and innovation treadmill that's the important thing (from society's POV).
This model has been so successful that it blinded a generation or two of politicians - they let the donkeys get their carrots, and now we have fat donkeys saying to us, "if you want us to run the treadmill some more, give us the carrots".
OpenStack was definitely unnecessarily complex. And it led to many different distributions that were only manageable by buying a lot of consulting man-hours.
> The other way around it is not OK, because from proprietarizing ideas, there is no benefit for society
You say this as if proprietary rules (i.e. copyright and patents) weren't intentionally designed to incentivize knowledge creation & sharing by ensuring compensation for innovation. Creators can't pay bills with societal advancements and research progress. This is a really shallow "proprietary = bad" ideological take that I expect to see on the black & white pages of Stallman's website, not on an (at least partly) entrepreneurial forum.
wired: IP rules are "intentionally designed to incentivize knowledge creation & sharing by ensuring compensation for innovation"
inspired: above isn't true in practice; IP rules have been thoroughly gamed, and their primary job is protecting monopolistic behavior, even if it means keeping knowledge suppressed past the point it becomes irrelevant or useless
Copyright and patents, as practiced today, have mutated into a malignant form. That doesn't mean the idea is bad, just that it was taken too far - instead of creating an idea-generating pump, we've ended up with toolkits for extracting rent from society and creating wasteful artificial scarcities.
Hah I appreciate the meme and can appreciate arguments that IP rules can be improved but frankly, tearing down IP rules because of some malignant outcomes is throwing the baby out with the bathwater. At any rate, this is certainly more salient than "because from proprietarizing ideas, there is no benefit for society" which could be a quote from The Onion where the article picture is someone typing this on their iPhone.
I don't think it is as shallow as you make it seem at all.
Whatever technological / knowledge advancement any entity holds proprietary could probably be of more use when shared with the world. Our technology enables us to have this kind of knowledge society, where knowledge is accessible at all times almost anywhere. I don't think that stopping knowledge from spreading can be a net gain, just because it enables a few select individuals at some entity to profit from keeping it a secret. Such a thing should not be the basis of a business.
It seems that shallow to me, "rules on knowledge = bad", which you've simply repeated here. There's no thought given to how knowledge & innovation are created in the first place and how free use without compensation can damage innovation, resulting in a poorer society with less knowledge overall. There's no understanding that proprietary rules aren't a permanent block on knowledge but rather a limited embargo that often (e.g. in the case of patents) requires the sharing of knowledge as part of it.
Apparently all you have to say is "stopping knowledge from spreading = bad", yet in the same breath you chastise businesses for taking from academia without returning any form of compensation. I'm not in academia and I'm not sure if you are, but best of luck to them. Academia is shrinking/slowing as they beg for government scraps and flee to proprietary business.
No, you are misunderstanding the point I am making. I am actually thinking in depth about this, but I have a different view of it than you. I will try to describe it in more detail in the following:
The idea is that it is very unlikely that there can be a situation in which society as a whole gains more from only one actor having the knowledge than from the knowledge being available to everyone (and subsequently more people being able to work on progress using that knowledge). If only one actor has the knowledge, this will most likely delay progress, because this actor does not need to innovate. They can sit on their behind and profit from that knowledge, which they do not share. In comparison, if the knowledge were shared, then it would be about being the first to get to the next level, in order to be the first who can adapt to progress - the first one to offer a new service based on that knowledge or some technology resulting from it. If you are the one making the progress, you will have a head start on everyone else.
(If you want an up-to-date example: vaccines. If big pharma businesses had to share all knowledge during the whole research process, it might be that we could have had a better vaccine, and had it earlier, than we actually did.)
So yes, I see stopping knowledge from spreading as behavior that has a net negative impact when compared to a society where we do not stop knowledge from spreading. Most certainly a society that stops knowledge from spreading will fall behind.
Then another point is that I am not chastising businesses for taking from academia without returning any form of compensation, as you state. I am chastising them for taking and then making a profit off of the unshared knowledge that they gain. It is not about paying money or compensating academia in any way, but about sharing their gained knowledge with society, with the public.
I am thinking of a quite different business world here, where businesses actually and exclusively serve the purpose of advancing society, as so many of them claim to. This is brought about by having a strict knowledge-sharing policy and providing actual services, instead of holding people captive with proprietary tech and secret knowledge. If a competitor surpasses them, great! Someone found a better way! Society advanced! That is the kind of mindset I am thinking of.
Businesses in my opinion should not exist to make individuals rich based on profits that come from keeping knowledge away from others who could use it well, even if for a competing product. If a business makes anyone a rich person, then it should be because that business provides beneficial services for society.
It could happen that a business is created merely to provide a narrow service on top of some tech, and never makes any progress. It never gains any new insight or knowledge, perhaps because it does not have a research and development department. That is totally fine! As long as people find that service useful, such a business can exist. It might not grow infinitely. It might even remain a small specialized business. Perhaps this business has to evolve the service at some point, but it can still be built on top of existing public knowledge.
For example, a business could be set up to build IT infrastructure for individuals or other businesses. That is a lot of work, which not everyone wants to do themselves. And then to make it reliable for the specific use case ... Yet there is no new research being done by that company. It is the good quality of their work and happy customers that generate their income. They never return anything to academia, and I am not criticizing them for it.
I do not want to stick to what exists today and how things work today. I am imagining what a better situation could look like and how it could work. This is about discussing ideas, and not about how any particular business that exists today could be saved if the business world were restructured in this way from one day to the next. It is quite likely that many businesses would fail without keeping knowledge a secret. I say good riddance!
> The wave is subsiding and they now need to appear to have a viable business (so they can be sold to a bigger fish), which means there has to be a visible market they can sell into.
Might need a (2019) tag. As of 2020, RStudio is a Public Benefit Corporation and has a corporate structure that is designed for long-term investment in the scientific community, and isn't susceptible to buyout or IPO.
Absolutely. The B-corp by itself is indeed not a magic bullet, but the PBC transition (which is separate) does make a big difference. In a typical C corp the corporation is ultimately answerable to shareholders and it is shareholder pressure that causes buyouts and IPOs as shareholders need a return on their investment. A PBC is able to prioritize its mission (in our case, enabling access to scientific tools to everyone, regardless of means) over these demands.
The popularity of R amazes me. I took a one-week class in R and left with a vow to avoid it at all costs. I have never seen a more confusing, hard to understand, inconsistent software product in my 30 years as an IT professional. It's apparently targeted at scientists and sociologists who are non-programmers. I have no idea how they manage to use it.
Scientific programming is just different. Much of the culture of scientific programming is different and, for good reasons, not easy for outsiders to understand.
It's something like how baking cookies at home and running a cookie factory are very different. To a person doing each, the behaviors and priorities of the other seem strange, and it's easy for one to think "we are both just making cookies, why don't they do what I do, which is obviously superior?"
Scientific programmers are solving problems first, not writing programs. They are solving problems in a way that is useful only to them or peers who know a whole lot about the problem being solved. The problem-first tools look strange because they deemphasize the programming niceness in favor of problem niceness.
You find the same sort of confusion when programmers face business types and their Excel usage.
There are certainly times when a piece of code starts needing the programming touch, but the right tool for the job depends on the job.
As someone currently writing scientific programs for scientists, much of the culture of scientific programming is bad, and they are having rings run around them by kids who build websites for a living, and it's embarrassing.
You think web developers are just writing programs for the hell of it? They have operational constraints around their work as well, it's just that there's a rad open online culture of continuous process improvement that's leveling up the tooling and practices that makes what was once hard look easy.
Things scientists using computers can learn from people working in the software industry:
- use a version control oriented workflow
- write extensive tests (see the sketch below)
- data management, automated data processing/cleaning pipelines, backups
- build generic frameworks
- use continuous integration
- logging and error reporting for long running tasks
- use modern tooling for automating infrastructure and big jobs, rather than manually submitting Slurm jobs via SSH
#notallscientists but the average level of competence is not good and I think the idea that writing software for scientific research is somehow special is backwards and counterproductive.
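On the testing point above, here is a minimal sketch of what this looks like in R with the testthat package (normalize() is a made-up helper, just for illustration):

    # a hypothetical helper and a unit test for it, using testthat
    library(testthat)

    normalize <- function(x) (x - min(x)) / (max(x) - min(x))

    test_that("normalize rescales its input to [0, 1]", {
      expect_equal(range(normalize(c(2, 5, 11))), c(0, 1))
      expect_equal(normalize(c(0, 5, 10)), c(0, 0.5, 1))
    })

Tests like this are cheap to write and run automatically under CI, which is much of the point of the list above.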
None of those get you closer to tenure, I guess... There used to be research engineers in 'good' labs; a friend of mine was one. They'd take your software weekly, clean and slice it, help you write tests, connect you with other colleagues or orgs, show you how you could achieve 30x perf when needed, help you make your shit distributed, and get budget for CI and the like. That was a dream job for me, for some time, since I'd been mentored by a demi-god of non-destructive testing (radiography, ultrasound...) and radioprotection research who was also a C++-expert software engineer with the most beautiful and clean/clear/modular/malleable codebase I've ever seen. I saw the yuge potential of even passable software engineering, especially when I understood that his software was used in so many different operational contexts and advanced research projects without his input, while his colleagues' software was just toys, toys and more toys.
I don't see that many job postings like this anymore. And even then it's just not valued that much.
I think the problem is that these academics literally do not understand or appreciate the benefits of having, say, a robust test suite with CI. Software development and DevOps is an alien culture and to be fair a good proportion of software developers themselves don't really get it. It's like a professional lumberjack who regularly sharpens their saw vs a forestry researcher who doesn't realise how slowly they're chopping down trees for samples because their saw is always blunt.
No. You'd be using pre-existing high-level tools for this purpose (in Python land this would be pandas, matplotlib, jupyter etc).
Not all software development should involve writing your own framework, and it's a bit of a trope that junior SWEs try to write DSLs and frameworks where they should just write some throwaway code. That said, if there's a common modelling or simulation method in your field and there's no great open source generic framework for it, then you should take a hard look at what you are all spending your time on (endlessly reimplementing the wheel).
Yes. Rewriting this stuff is much easier when you've already automated it the first time, vs scratching together institutional folklore on how to run a job.
Being "difficult to understand" is a perspective of an individual observer, not some fundamental property of the thing itself. Japanese is "difficult to understand" if you only know English, but not if you grew up speaking Japanese.
The various conventions around "accepted" programming paradigms like OO, inheritance, scoping, etc are natural if you use them every day for years, but if you're more interested in optimizing some finances or solving a physics problem, something "primitive" or "messy" like a spreadsheet formula or fortran might actually be more understandable.
I couldn't agree more with OP. I got my graduate degree in Statistics and, after working for several years in such a role, made a similar vow to Never Again™ touch R (or SAS). This effectively forced a career change. (I'm now "officially" a software developer and couldn't be happier.) My distaste for Statistics stems directly from the commonly used tools.
I appreciate your point. In most contexts, such as your comparison between Japanese and English, I agree with it. However, paradigms like OO, inheritance, scoping, etc. are hard-won, intellectual accomplishments; they're not arbitrary. They're purposefully designed to solve specific problems. My experience with R showed it to be rife with problems that have been avoidable for decades. AFAICT, it boils down to the tragic view of "I'm a (statistician|engineer|mathematician) not a coder so I don't need to care". The unfortunate truth is that despite such a view, doing analysis with R makes the analyst a programmer by definition. And so the language, ecosystem, and, consequently, users suffer from half-baked, poorly designed workarounds to problems which have long been solved (and abundantly documented in the software literature). To reduce it to a matter of perspective feels to me like a tautology: it's easy once you get it.
Clearly, I'm triggered by this. I hope I've expressed myself respectfully. My point is, I don't feel it's arbitrary. The design and complexity of R has real consequences in terms of cost and reproducibility.
I use R regularly. I love it. Doing the maths is very easy and general. The programming makes me think differently. I like how it's high level, reasonably fast, rarely involves loops or inline functions directly, and, above all, is the Swiss army chainsaw of statistical analysis. The fact that the journal of statistical software exists is a good thing!
Inheritance was invented for performance reasons [1]. It was not conceived for pure code organization, so in a way it is as arbitrary as any other solution that could have brought performance gains for the original garbage collector. Inheritance is an "intellectual accomplishment" like other accomplishments that will run into issues if applied blindly, so not having it is not an issue per se. On the contrary, today's widely accepted view is to prefer composition over inheritance [2].
R does have inheritance, by the way, though not in the form you will frequently see in "general purpose" programming languages, which R is not.
I would say that "inheritance" was much developed in Smalltalk, and there it was not really a tool for improving performance. Rather it is a conceptual tool.
"Inheritance" is really the expression of Abstraction in code. Superclass is more general, and thus the abstraction of its subclasses.
Although Smalltalk did not have a specific keyword for "Abstract Class", it used the convention of calling "subclassResponsibility" in methods defined in abstract superclasses which had to be implemented in subclasses.
Abstraction, generalization, these are conceptual tools. I'm not sure how "composition" models abstraction if at all. Only OOP does.
And one could think that abstraction is a tool valued by scientists. No?
I don’t know why you are saying inheritance was invented for performance reasons? Is there any evidence behind that?
Everything I was taught when OOP became mainstream in the late 80s, was that it was a performance trade off to afford code organization insight, while preserving encapsulation.
This statement about “performance reasons” is quite baffling.
> Being "difficult to understand" is a perspective of an individual observer, not some fundamental property of the thing itself.
I disagree. There are inherent qualities associated with systems that happen to be easier to understand. If you or I were designing such a system, and wanted to ensure this system was imbued with similar qualities — I believe it would be possible to do so.
> Japanese is "difficult to understand" if you only know English, but not if you grew up speaking Japanese.
Needing a priori knowledge to be able to understand something is different.
> I disagree. There are inherent qualities associated with systems that happen to be easier to understand. If you or I were designing such a system, and wanted to ensure this system was imbued with similar qualities — I believe it would be possible to do so.
Can you formalize these inherent qualities? Until you can, it's hard to prove that a quality is or is not essential.
> Can you formalize these inherent qualities? Until you can, it's hard to prove that a quality is or is not essential.
There is an obvious one, though it's tricky to fully formalize and quantify: locality (also known as "coupling"). You ask yourself a question, "if I were to make a small change to a single piece of this program, how much of the program would I have to change/retest/keep in mind?". If the answer is, "typically, just the area around the change", it's a high locality (loosely coupled) program, a good design. If the answer is, "usually, most/all of it", then it's low locality (tight coupling), a bad design.
The reason is, of course, that human mental capacity is limited and that, in general, the number of interactions grows superlinearly with the number of interacting components. So this here is an objective measure of "easy/hard to understand": the more code you have to keep track of when investigating a random piece of the program, the more mental effort you have to exert - the more difficult to understand it is.
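To make that concrete, here is a toy sketch in R (the names are made up): the first version can only be understood together with every other piece of code that touches the global, while the second can be read, changed and tested in isolation.

    # low locality: reads and mutates a global, so a change here can ripple anywhere
    total <- 0
    add_to_total <- function(x) total <<- total + x

    # high locality: everything the function depends on is in its signature
    add <- function(total, x) total + x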
It has some comments about why good software engineering leads to good software, how good software engineering is hard right now because there are a lot of bad examples which cause people to write more bad code, and some suggestions on how things could be done.
It's no more difficult to understand for researchers coming at a programming language for the first time, and they need to learn significantly less about the language to get up & running with analysis. Two lines of code will get me data loaded and run a regression analysis & plot the results.
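Something like this, assuming a CSV with columns x and y (the file and column names are placeholders):

    d <- read.csv("data.csv")            # load the data
    fit <- lm(y ~ x, data = d)           # fit a linear regression
    summary(fit)                         # coefficients, standard errors, p-values, R-squared
    plot(y ~ x, data = d); abline(fit)   # scatter plot with the fitted line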
Especially consider that much of the audience for R comes from proprietary and extremely expensive languages like SAS or SPSS Syntax, neither of which is significantly easier to work with, and both of which are significantly more limited.
Something more intuitive would be nice, but I'm not aware of another option that has the depth of libraries for practically every type of data analysis and can also begin getting a researcher meaningful results in fewer lines of code than they have fingers on their hands. (Assuming there was no catastrophic accident resulting in the loss of fingers.)
I sometimes quip that the difference between cooking and process engineering is that unlike the former, the latter actually gives a shit about the quality of outcome.
The issue isn't just specialized knowledge - knowing the principles underlying the process, knowing that there are principles - but also access to tooling. I'm sure plenty of lay cooks would appreciate appliances that let them be more consistent, even without knowing any related formal theory - but these are not available. As it is, an oven that's not shit is a major capital investment. It only makes sense to get one if you're more likely to call your work output a "formulated product" than "cookies".
However, software development is a peculiar occupation - one in which the best tools are free. It's like being able to buy a production line for less than a consumer-grade stove. Access to quality tools is not a limiting factor. Knowledge and awareness is.
Of course, non-programmers that code have more interesting things to do than to study software engineering. Which is why it's doubly important to make sure tools that are being popularized aren't shit. On the contrary, tooling is the perfect place to encode good practices and principles, so that even most lay (programming-wise) users succeed by default.
> Scientific programmers are solving problems first, not writing programs.
And they don't value writing programs. Or Software Engineers.
That makes productizing some research interesting. That also makes trying to get the same result as something published by your own lab a year ago an uphill battle: "it worked on John's old laptop, the huge Alienware. Never worked on any of our machines".
My description of R: It makes hard things easy and easy things hard. Ergonomically it is the absolute worst "programming language" I've touched in my life. However somehow it managed to become the official language of statistics research and has packages to do any type of analysis you can dream of.
I think the reason you and I dislike R is because we just work differently than non-programmers. Non programmers think in purely imperative, straight-forward semantics. They write one-off unmaintainable code tying together libraries that solves their immediate problem. Programmers try and write R code as if it was a proper programming language and immediately run into walls. Non-programmers never see the walls because they don't even know there's another way.
Right. Julia is looking fairly promising as a "real" programming language that is still an excellent Matlab replacement, and possibly to a lesser extent an R replacement, and it does show that there is nothing about filling the exploratory math programming niche that requires it to have the warts that the incumbents have.
Fundamentally, R and Matlab are probably best comparable to Perl in terms of how they got popular. While they are rather ugly, they were the first to provide a simple solution to a few specific problems that a lot of people had, which snowballed them into popularity and network effect advantages
The fact that Julia requires compilation seems like such a huge hurdle to adoption in light of the above. When the alternatives are a REPL and the users aren't programmers, to use Julia they have to first learn the difference between code and compiled code.
No, there is a difference. My day-to-day use case is: start the REPL, start typing some code, wanna see a graph. I might literally only want to write a single command. If that takes a minute to compile, it's a problem. Julia is pushing hard to address this and I'm really interested to see where it goes - I tried it for the first time recently and saw a lot to like.
Right, compilation time went down a lot with 1.6 by avoiding method invalidations during compilation. The other thing they did is ensure that the package manager properly compiles all dependencies in topologically sorted order when you install a package, instead of doing a lazy compile on package load where it would often waste time recompiling the same method.
One isn't precluding the other. For example, SBCL is an implementation of Common Lisp that compiles everything ahead-of-time down to native code, and like all Common Lisp implementations, it offers a powerful REPL suited for interactive development. The compiler has low overhead, so you don't even notice that it compiles everything you type into your REPL.
Another common misconception is that AOT compilers can only be used to build libraries/executables, which then get executed. Again, SBCL is an example of an AOT compiler that works at function-level granularity[0], in accordance with Lisp heritage of image-based development, where you write your program by starting it, and then adding/removing/modifying code into a running instance. SBCL achieves this by having the compiler being a part of your application runtime.
I don't know how Julia is implemented internally, but since it's essentially a Lisp clothed in syntax that's more pleasant to the masses, it inherits a lot of Lisp family heritage. That could easily include a fast, to-native, in-process AOT compiler.
Julia's JIT is a just-ahead-of-time method JIT, just like SBCL's, and the language is designed around making idiomatic code as fast as possible with that approach.
So every function is a multimethod that gets compiled the first time any combination of types is first encountered, and at that point the type information is propagated to all function calls inside the method body, to eliminate dynamic dispatch on those & possibly inline the appropriate method. Julia has parametric polymorphism as well, so it's common for all types in the body to be inferable.
As a result, Julia code tends to make much more heavy use of polymorphism & multimethods than common lisp (where CLOS has a significant runtime cost instead of being a near-zero cost abstraction), since the language and the runtime were designed carefully together to make that fast.
Luckily, Python >> R > Matlab, and the fight between R and Python is core to data science. Last I did any real investigation, R still had the edge in highly-theoretical and new techniques, for whatever reason it was much more used by academics, but Python was eating R's lunch for non-academic data science.
I've given SciPy a try, but its performance is grossly lower than R and Matlab in my experience. SciPy also seems to be missing plenty of things. You get pretty far (basic filter design seems easier in SciPy than Matlab), but Matlab has so many different mathematical fields in it that Matlab vs Python are just incomparable in terms of mathematical usability. Matlab has way more features.
R has a statistical slant, so you have all sorts of distributions and statistical features.
Matlab is general mathematics (but especially matrices). So you can grab, say... Galois fields from 2^2 through 2^16 and just start working with them immediately.
Matlab itself is decently replicated by Octave (which is a great project), but it should be noted that Octave aims to be largely compatible with Matlab (so all of my complaints about how awful that programming language is still apply).
> somehow it managed to become the official language of statistics research
There's zero mystery to this. The intended audience is people that want to get stuff done. Professional software developers commenting on R are like this: "He's such a good salesman. He does everything the right way. He dresses right. He talks right. He has the best smile of any salesman I've ever seen. He fills out his reports on time. Granted, he doesn't make many sales, but why would you hire one of those other guys over someone that's perfect?"
Mmmm, more like they try to put the salesman in an operations role and wonder why the company has broken down.
If you tried to use R for an actual engineering project you'd wind up with a system that is always on fire. Whereas for these one-off analyses it's fine, because it's just a one-off.
We want to get stuff done too, but we need it to hold up for more than a few hours so that's the mode we think in.
It's all real programming and it has all the real consequences, but there is a distinction between something that needs to live for a day, a few months, or maybe years - and then maybe have multiple people work on it, have it communicate with other systems or not - and the level of investment and structure you need to put into a system to make sure that's possible.
Programming has a very low barrier to entry and that is incredibly powerful but there is a big gap between someone writing one off scripts to solve immediate problems and people writing highly connected systems that withstand scrutiny.
I've always personally been for accreditation for software engineers much like pretty much every other engineering discipline.
they used to split stuff like R off into the "scripting" category. But now most "real programmers" are apparently doing python or javascript so that little hierarchy quietly disappeared.
I have a colleague that uses R as a general-purpose language (I'm in science), and it's horrible. He runs into problems all the time and usually the answer is "more RAM".
For the things it's good at, it's great. For everything else I avoid it like the plague. More often I find it easier to use a quick Python script to generate a data table that I can then read into R to perform whatever stats or plots I need. It's almost always faster than if I had just run everything in R to begin with.
But I think your description is spot-on. Non programmers just want to get something done and if it works in R, then great. For those of us that think in terms of software engineering, good practices can be difficult in R.
The vanilla-R vs Tidyverse split has made this all worse too. These are two completely separate dialects of R that, while still the same language, read completely differently.
It's like the R folks took the "there's more than one way to do it" mentality from perl and said "challenge accepted!".
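For a flavor of the split, here is the same toy summary written both ways, using the built-in mtcars data set; neither version reads much like the other.

    # base R: mean mpg by cylinder count, for the heavier cars
    aggregate(mpg ~ cyl, data = subset(mtcars, wt > 2), FUN = mean)

    # tidyverse: the same thing as a dplyr pipeline
    library(dplyr)
    mtcars %>%
      filter(wt > 2) %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg))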
I agree overall. In my particular case, my employer won't let me have Python for fear I will do things with it. But it is perfectly OK for me to use RStudio with 2 million useful extensions that will do similar things. For better or worse, sometimes the tools we use are the tools we are left with.
And I am saying this as a recent RStudio convert. I love how easy it makes some otherwise hard things. I hate the sheer amount of hoops you have to jump through to make some stuff work.
Just like any other programming language, it DOES take a good programmer to make a decent library in R that people can use on their own problems.
R has had a very long evolution. It is a very different beast today in its most common usage, than it was the earlier days 10, 20+ years ago. Even the Tidyverse has some libraries that are very much crafted with a programmer mindset like purrr and tidyr. These tools are decidedly non-imperative and not straight-forward in their semantics.
What makes R difficult for experienced programmers, I think, is the inconsistency of paradigms that are the result of its long history. This complicates how one writes library code.
There is, however, a "sweet-spot" for R and that would be as a "notebook" based programming language much like Mathematica, Matlab, and Julia. Which one you like, I guess, depends on your taste, your own history, and the killer libraries you want to use.
Whenever I have to describe what R is all about to Excel jockeys at work, I just say it's "Excel on steroids". I think that's a fair (albeit reductive) description. To be honest, I probably would have never learned R if Julia had existed when I started picking up R. I think I would have preferred a more ahistorical language with less "baggage" than R. But it's always worked out for me, so I am sticking with it, at least for now.
> And it kind of explains why k8s is creating so many jobs.
Is it 'creating' jobs? I think it's merely making it easier to specialize.
A few problems are unique because of container usage, but by and large K8s is trying to do what's otherwise a difficult job. Try assembling your own distributed container system with scheduling and whatnot and see if you can build something easier to understand or that works better. Maybe you can, but there's inherent complexity.
The criticism of K8s should really be criticism of indiscriminate container usage and the attempts to ship a company's organization chart as microservices. Many applications should really be monoliths and would work better that way. Some should be split on different "services" (not _micro_) along obvious interface points. Just a minority should be architected as microservices from the get go. Distributed systems are _hard_
I imagine that you could get rid of Kubernetes in 90% of the projects it is used in. We have it at my work; it must have taken around six months to a year for one dev. Sure, we can autoscale now, but we never actually need to. It saves us a bit in server costs, but costs more in maintenance / dev time.
I was a mathematics major in college, and didn't have much training in programming when I graduated. R was the first language I learned when I started my career as an actuary, and it was a breeze. Things “just work”. Want to add 2 vectors of different dimensions together? R knows what you’re getting at, and makes it work. Comparatively, learning Python was harder.
Now that I’m used to both languages, I find it funny how much R is hated by “true” programmers.
This is the key. R shouldn't be seen as a general programming language, but a domain specific language that's still open-ended. I started with SAS in my job, which was fine for statistics and handling tables. But anything beyond that, even supposedly simple things like reusing code or listing all files in a folder, was not simple. With R, it was.
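For instance, listing every CSV file in a directory is a one-liner in base R (the path and pattern here are just an example):

    list.files("data/", pattern = "\\.csv$")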
R only had to be ergonomically better than the competition, and they weren't very good.
What are you "getting at" by adding two vectors of different dimensions? It's not obvious to me.
Off-by-one dimensionality errors are so common in programming. If the language does something like zero-extending instead of raising an error, it will lead to an "it runs but gives the wrong answer" bug. These are much more painful in numerical code than in logic-based code.
For instance, if you wanted to add 1 to every other element in the vector (1,2,3,4,5,6). To someone who has no programming experience, this may be a daunting task. But simply doing (1,0) + (1,2,3,4,5,6) works in R.
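What R is doing there is recycling the shorter vector to the length of the longer one, as a quick REPL session shows:

    c(1, 0) + c(1, 2, 3, 4, 5, 6)
    # [1] 2 2 4 4 6 6    (1,0) is recycled to (1,0,1,0,1,0) before the addition

    c(1, 0, 0) + c(1, 2, 3, 4, 5, 6, 7)
    # works too, but warns: longer object length is not a multiple of shorter object length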
Indeed, no mathematician or statistician would think of `+` as concatenation. List concatenation isn't a relevant problem to most mathematicians. This is where the above comments about prior knowledge and context come into play. Likewise, when I multiply two vectors together, or multiply a scalar with a vector, I have a definite idea what I "want" out of that multiplication. For many programmers, they think of this more as an exercise in data-types than an expression of linear algebra.
Statisticians might see it differently, but I still don’t think it’s obvious that (1,0) represents effectively a repeating pattern that’s extended to the size of the other vector.
For instance, for the statement (a,b,c,d,e) + (a,b,c,d,e,f), are we really all saying that it's clear, obvious and unambiguous that this means (a+a, b+b, c+c, d+d, e+e, f+a)?
Personally if you showed me that and asked me to describe what would happen before reading this thread, I would have said it would either throw an error, concatenate the vectors or only add the matching elements. I wouldn’t personally guess that the first vector would be treated as a repeating sequence, but different people might make different assumptions - I just don’t think it’s particularly clear, and I struggle to believe it’s clear, unambiguous and obvious to statisticians too.
> I would have said it would either throw an error, concatenate the vectors or only add the matching elements
Heh this itself is common evidence of an (experienced) programmer's way of thinking. Remember, when dealing with mathematics, there's no machine (or runtime or similar abstraction) there lurking in the background, enforcing conditions, or even lending a physical reality. Operations in mathematics are defined. Notation in mathematics is just an operation encoded as an "arbitrary" symbol. Which leads to
> I just don’t think it’s particularly clear, and I struggle to believe it’s clear, unambiguous and obvious to statisticians too.
This happens to be convention in both a subset of textbooks and many programming environments. It's mostly an artifact of the notation.
If you want to explore what math notation "feels like" a bit more without learning math (which I _wholeheartedly_ recommend as it's incredibly useful), try out the APL programming language a bit. It evokes a similar atmosphere of notation conveying the idea of well-defined operations.
I guess that's the difference between people who prefer R vs. people who use more traditional languages. '+' means addition in the scientific world, if I'm trying to figure out how to add numbers the first thing I try is '+'. To me, those vectors are just that - vectors of numbers, not instances of a class (even though that is kind of true under the hood). I have a problem I want to solve, and R does a good job of giving me what I want.
In what branch of maths or science is (1,0) + (1,2,3,4,5,6) even defined? That operation shouldn't make sense to anyone; it is adding vectors of different dimensions.
The result should probably be vector promotion, then: (2,2,3,4,5,6). (2,2,4,4,6,6) is not a good answer.
Sure, we can argue about the syntax and what '+' should do, but the point is that many people find the built-in behavior of R to be intuitive and easy to learn, myself included.
How can something be intuitive if it maps to a totally arbitrary abstract concept? Does that operation even have a name?
That is the plus symbol and it is in the context of two vectors. The intuitive thing to do is to fail with an error if it is a mathematics operation, concatenate if it is a programming context or treat the two vectors as the same length by appending 0s to the shorter one if the goal is to be unhelpfully helpful.
Repeating a vector until the dimensions match then element-wise adding them may be convenient for you, and you may like it. Maybe even lots of people like it. But it is a tough sell for me to believe it is intuitive.
numpy calls it 'broadcasting', although this particular operation doesn't work. For example, this is valid in numpy:
`np.array([1,2]) > np.array([[1],[2]])`
The idea is that an attempt is made to handle the operation even if the arrays / matrices are incompatible. Extending the operations of matrices and vectors in this way allows for extremely concise operations that would otherwise be a pain, like in the original R example.
>>> numpy.array([1,0]) + numpy.array([1, 2, 3, 4, 5, 6])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,) (6,)
i.e., numpy correctly recognises that totally incompatible dimensions can't be added.
Both languages allow for incompatible arrays to be operated on in this manner, though; the fact that it fails in numpy just means that numpy hasn't implemented that type of broadcasting. For instance, this operation:
np.array([1,2]) + np.array([[1],[2]])
is not broadly defined and understood in pure mathematics. Looking at it, it's not immediately obvious what it would do. You are adding a (1,2) matrix to a (2,1) matrix. It turns out, numpy extends the rules of matrix multiplication to matrix addition through the rule of broadcasting.
R is just doing the same thing, in a different way. These types of implementations are common in scientific computing because it creates easy ways to do these operations, programming robustness and rigidity be damned.
I don't do R much anymore (at the end of the day I personally have the freedom to start fresh so I choose that over dealing with technical debt in the R language itself), but the R Studio product, team, and R community I found fantastic.
R got a simplicity boost with the "tidyverse", RStudio and ggplot, all driven by Hadley Wickham. At its core, R is a very straightforward language. However, it never had a benevolent dictator who gave it consistency, elegance and style. Hadley is compensating for that a bit.
A lot can be said in favor of R, but "straightforward" is a debatable description. The semantics of R were never really designed, and were only recently "discovered", post hoc. See "Evaluating the Design of the R Language" (http://janvitek.org/pubs/ecoop12.pdf).
I'd argue that tidyverse is entirely inconsistent with the rest of R, though. At least base R packages all operate in an "R-way," so learning this syntax helps you with other packages that others try to write in an "R-way," while tidyverse only operates in a tidyverse way, so you can't take your syntax knowledge with you to other packages.
I'd say the learning curve for making a sexy plot is a lot shorter with tidyverse, but overall, relying on it handicaps you versus spending the half hour longer to do the same thing with base graphics (or a base-like package).
> I'd argue that tidyverse is entirely inconsistent with the rest of R
Well, yes, but that is part of the reason why it is so popular. R may be written by world-leading language design experts - but it doesn't show!
> ...relying on it handicaps you versus spending...
There is almost no reason to drop out of tidyverse if the problem domain is visualising data. Most people could go an entire career as an analyst using just ggplot + tidyverse.
If something needs to be plotted and ggplot won't work it probably makes sense to drop out of R and go straight to TikZ or OpenGL.
I think these tools hurt beginners a lot, personally. You have to learn R no matter what you use to plot, but if you want to use tidyverse you now have to learn that too, and the logic you learn there might not transfer over to the rest of R that you are going to have to learn no matter what, which can lead to confusion. It doesn't help that Stack Overflow answers use base R or tidyverse code interchangeably and it's up to the beginner to spend time figuring out what is what. I think it even adds to the notion I see, even here on HN in this thread, that R is some awkward, unlearnable, monstrous thing that has no place. It is that, when you start adding all these different packages and syntax conventions without thinking about whether your solution can be implemented trivially with base R. I find base R pretty similar to Python, with small caveats like not using loops and applying functions instead.
I completely agree, in opposition. This is all true, and makes strong case for deprecating base R and replacing big chunks of the language with tidyverse equivalents. I would certainly encourage the people maintaining the language to reflect on your points with that in mind as an option.
I think it's somewhat more complex than that. I think tidyverse-R, which is a quasiseparate language, is only simple with complete buy in, and involves a lot of magic, shorthand, and "These are the symbols I put into the machine to get X back out".
I am primarily a python programmer, but I sometimes use R.
You are either going to use a programming language (or library, etc.) made by a programmer pretending they know about statistics, or one made by a statistician pretending they know about programming. Oftentimes, as a programmer, the right choice is the former, but not uncommonly (because statistics is even less intuitive than programming), you really really need to know that the statistics have been done right. If someone has ported the relevant code from R to Python, great. If not, bite the bullet and use R; it's where the statisticians hang out.
You know, I bet statisticians don't think any more kindly of how programmers make stuff. Our use of the '=' sign, for example. We're just used to that kind of thing, so it doesn't look like a problem to us.
R programming is fundamentally different at a conceptual level. You are operating on datasets rather than individual values. Also, the GUI mechanism uses reactive programming if you are using R Shiny. R is awesome for what it is designed for.
Yeah, it's not so hard to grok. It's just a data-driven style all the way down. You get pros and cons. It's great at working on data sets.
That said, I feel like correctness should be given a higher priority in scientific computing and yet a dynamically typed, lazily evaluated language is used.
Other than using apply functions instead of loops, coding R is a lot like coding Python, only you get a lot more of the data science Python package functionality already baked into base R. The syntax differences are slight enough that it's pretty easy to move between the two (or find relevant Stack Overflow answers instantly for common annoyances). R generally inputs your data and outputs your statistical test results in less code, with less head-scratching, than doing the same in Python, in my experience. I prefer plotting in R as well.
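For instance, the idiomatic replacement for an explicit loop is usually one of the apply functions; a trivial example:

    # a loop...
    squares <- numeric(10)
    for (i in 1:10) squares[i] <- i^2

    # ...is more commonly written with sapply (or vapply)
    squares <- sapply(1:10, function(i) i^2)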
I've written some R, both for small interactive scripts and running in production; it wasn't the first language I learned. R gets some things done very well; it also has some idiosyncrasies, there is stuff that is clearly patched together and exists for backwards compatibility, and there are many ways to do the same thing in R. If you don't expect it to be perfect, it gets the job done - nothing to write home about and certainly not a language that you should avoid at all costs.
There's a difference, in my mind, between "Programmers" and "Invokers of Code".
R is a terrible programming language.
It's not a bad language for invoking code, because for many of those people, they're not taught the concepts behind any language, so it's all semi-arbitrary symbols.
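# fit a multiple linear regression, then print coefficients, standard errors and p-values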
model <- lm(outcome ~ variable1 + variable2 + variable3, data=data)
summary(model)
Isn't any more complex than anything else. And what R does have is a network effect - at this point, for almost any statistical task I've ever encountered, there's R code for it.
They can use it because they don't come with preconceived notions of how programming normally works. And it lets them do powerful analysis, with a library of packages for an enormous range of analytical methods and visualization, all without having to do much in the way of boilerplate coding.
Sure the syntax is going to seem alien to them, but so would any first encounter with a programming language.
The use case for R simply isn't the traditional programmer. That isn't the target user. Sure, if you need an application that might need significant scale you're not going to use R Shiny, but a lot of R work is one-off bespoke analysis projects. Models that do need to be deployed for use at scale in an application take their output parameters from R models, which are then simply implemented in the app. I do this myself: taking coefficients etc., implementing a function call in a database and then using the results on the front end.
I'm a professional programmer and I don't find R hard to understand. Have a look at R for Data Science[0], maybe you'll see why scientists and statisticians find it easy for their analysis and visualizations (and conversely find Pandas+Python very complex).
Those are my thoughts, mostly, but in the end libraries are the killer feature of successful programming languages. I have got used to it and I'm more proficient now at data analysis tasks using R than Python/pandas, thanks to tidyverse+ggplot.
it seems to me that the value is in hard-compressed optimized functions and to this crowd this is the most important factor
software engineering practices don't exist unless you have a gigantic program, programming language theory is either not interesting or too foreign for them
about how they manage.. it's easy, they get used to it
Its selling point is that it's not Matlab. Its downfall is that it's not Python. The Python scientific compute libraries have their problems, but people are using them and math people "get them". I think they have an awful design/API from a programming perspective, but most people don't care. Matlab is similarly a strange language, but all of its libraries/tools make what scientists do easy. "Click a button and your code can now run on a compute cluster".
R is very popular with stats people but new PhD candidates are beginning to write python implementations of R things (sort of how like DataFrames/pandas happened).
"R is very popular with stats people but new PhD candidates are beginning to write python implementations of R things (sort of how like DataFrames/pandas happened)."
People have been saying this since I was an undergraduate.
> People have been saying this since I was an undergraduate.
> I'm submitting my tenure packet this year.
That doesn't mean it isn't happening. If anything, the fact that the rate of adoption is slow over a long period of time would tell me that there's more staying power here. If suddenly 20% used Python, that would be a warning sign. If there was 0.5% year over year for decades... that's different.
> R dropped from 8th place in January 2018 to become the 20th most popular language... At its peak in January 2018, R had a popularity rating of about 2.6%. But today it’s down to 0.8%, according to the TIOBE index.
In mathematics we roughly have two ways of discovering new things: one is the theoretical approach (e.g. Galois and his theory, which proves the non-solvability by radicals of the general fifth-degree polynomial equation) and the other is the more ad-hoc, technique-driven approach (e.g. you can see the ingenuity of that in many of Erdos's proofs). Of course the categorization is not always so clear-cut; you often get a mix of both worlds.
I think it's very similar when it comes to writing code. The more theory-driven the approach, the more structures you've got at your disposal to reason about things (e.g. see Mochizuki's proof of the abc conjecture).
For example, should you use monads in a small project? Maybe not. But if you are dealing with thousands of lines of code every week you may find that if you reimplement some components as monads, then suddenly refactoring gets easier and you can extend things more easily without having to do a global search and modify all the occurrences of a certain thing every time you make a change to the type, etc.
So ultimately you are offloading cognitive efforts to structures, which are really just constructs to optimize cost-to-transform (though it may increase the cost-to-execute (both computationally as well as mental-visualization/simulation-wise))
So is it worth the effort to work with structures? It really depends on the project you are working on or what you see yourself building in the next 10 years
Depending on how you see your career it’s always good to be a bit more ambitious
No, Mochizuki’s “proof” is likely to be fatally flawed. He has almost entirely refused to engage with careful and targeted criticism by the experts who found what they claim is a serious flaw.
Mochizuki finally went ahead and published his work in a journal where he is the editor in chief. So much for peer review.
In general this idea may be true, but for the tidyverse it's just a load of horse.
Hadley Wickham is a very talented developer, and what he's particularly talented at is writing interfaces that are easy for people to use. dplyr, a key part of the tidyverse, is a great example of that. It breaks data tidying up into a few simple steps that can be chained together. It's the descendant of an earlier iteration (plyr) and Hadley learned from that and just kept polishing the interface.
There's a similar story with tidyr. Reshaping data from wide to long is a complex operation. Base R has a reshape() function and using it has given me permanent PTSD. It was impossible to get right until you read the documentation. After you read the docs - still impossible. Hadley wrote the reshape2 package, which improved things a bit. Then there was tidyr::gather() and spread(). Finally, we got tidyr::pivot_wider() and pivot_longer(), and at last I can be reasonably confident of getting the results I need without too many tries.
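For anyone who hasn't followed that evolution, here is a minimal sketch of the current interface (the data frame and column names are invented for the example):

    library(tidyr)

    # wide: one row per id, one column per year
    wide <- data.frame(id = 1:2, y2019 = c(10, 20), y2020 = c(11, 21))

    # long: one row per id/year combination
    long <- pivot_longer(wide, cols = starts_with("y"),
                         names_to = "year", values_to = "value")

    # and back again
    pivot_wider(long, names_from = "year", values_from = "value")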
The tidyverse is hugely popular for just this reason. Without it, I'd probably have abandoned R. I certainly wouldn't dream of teaching it. Ditto Rstudio. Calling these people parasites is absurd.
I agree with your assessment of the claims made in this post, but I want to note that for me the tidyverse culture became the reason to abandon R for all use cases where it has decent alternatives. R is not pretty, but it is also not super hard, and it has some internal logic. The tidyverse's logic is orthogonal to that of R; it is a different data-manipulation paradigm, which I now have to know in addition to knowing R itself (not that you can really skip that part). Also, as has already been mentioned here, this layering doesn't help with debugging.
ggplot2 did a similar thing with plotting. Together these two projects made R more accessible on the surface and, IMO, seriously damaged the culture around it.
The tidyverse has ruined Stack Overflow. So many questions could be answered with base R, and sometimes you need base R if you want your code to be widely compatible, but people insist on submitting some arcane ggplot2 or dplyr code instead, and nothing is learned about R.
Sure! It's brought lots of neophytes into the language, who don't really understand base R. That's a problem of success. If we were still stuck using interfaces like base::reshape, base::merge and (shudder) the lattice package, then there'd be fewer users and they'd all be more expert by necessity.
It's interesting, though, that those "clean" interfaces come with the horrors of non-standard evaluation and have caused a rather large divide between the tidyverse and base R.
I fully disagree, though, with the disparaging aspersions about motives that TFA leans towards. I can see how this view would arise out of a local minimum.
The "beauty" of R has always been its core DataFrame abstraction and the fact that a table is a language primitive — and that's where 90% of the consistency came from (from my outsiders' vantage point).
Hah, no I don't mean its _behaviors_ are beautiful by any means! The "beauty" is just that it had a complicated data structure built in as a primitive _looong_ ago that all packages standardized their APIs on.
Just to add an anecdatum: as a Python programmer who has started using R in production, I find dplyr to be less clear than base R. Not that either one is clear.
I think the author ignores the fact that creating simple, consistent systems is a very hard problem, and one that is rarely compensated (in the short term). So most apps, APIs, and platforms will eventually become complicated and messy until there is a significant push from competition to force people to do the dirty work of cleaning up the architecture.
So if you want a clean and simple R ecosystem, go support Julia or some alternative like that.
"Never ascribe to conspiracy, that which is adequately explained by incompetence."
Not that complexity means the programmer was "incompetent", per se, but not sufficiently competent to keep things simple, because as you point out, keeping things simple is really hard.
"Simple Made Easy" by Rich Hickey:
https://www.infoq.com/presentations/Simple-Made-Easy/
"We should aim for simplicity because simplicity is a prerequisite for reliability. Simple is often erroneously mistaken for easy. "Easy" means "to be at hand", "to be approachable". "Simple" is the opposite of "complex" which means "being intertwined", "being tied together". Simple != easy. ..."
> Once a collection of complicated packages exist, it is in RStudio’s interest to get as many other packages using them, as quickly as possible. Infect the host quickly, before anybody notices; all the while telling people how much the company is investing in the community that it cares about (making lots of money from).
More than RStudio's, it's in Tidyverse users' interest to extend and embrace; even without any plan or malice aforethought, it is simply natural that as Tidyverse users do their own thing, their packages and scripts will rely on the Tidyverse ever more, and their downstream users will thenceforth rely on it whether those users like it or not. Thus the situation where you install.packages() some harmless-looking package and now R is installing 20 or 40 packages from the Tidyverse.
I missed your article when it first came out, and I'm glad to have encountered it here; thank you, I think it helps explain some things here and more generally. I do think that this explains why the "Base R" people feel so threatened by the Tidyverse.
What I find interesting about the quoted excerpt, and the article more broadly, is how little time it spends discussing what RStudio's economic interest in propagating the Tidyverse actually is. Because, notably, RStudio doesn't charge for the Tidyverse packages. There's no Tidyverse Enterprise Edition, and there are no Tidyverse support plans. They mostly make money off their IDE and server products, and those products don't have a lot of meaningful synergies with the Tidyverse packages; there are few IDE features that integrate directly with the Tidyverse, and those that exist aren't very significant. Meanwhile, nothing about the Tidyverse is written to work better with RStudio than with VS Code using the R language server, or with Emacs, or what have you.
This isn't to say that RStudio is doing this for any reason other than economic interest, but the economic interest here is not in convincing R users to use Tidyverse packages; RStudio makes the same amount of money whether or not you do your plots in ggplot, and if you switch to ggplot there's nothing about it pushing you to change your IDE/editor.
So what is the business model for the Tidyverse, then? It's pretty straightforward: the goal of the Tidyverse from the perspective of RStudio is to drive R adoption, mostly at places who are using commercial closed-source alternatives to R, like Matlab, or SAS or SPSS. (You could argue that they're competing for mindshare with Python, too, but RStudio has moved to treating Python the way that post-Azure Microsoft treats Linux, as a part of its product portfolio. RStudio's flagship products, the IDE and the Connect server, both advertise first-class Python support. Whether or not they have achieved this is another question, but they certainly are trying.) Once you have converted people to using R, you can sell them IDEs and servers.
I suspect that the "Base R" people have a hard time grappling with this aspect of the Tidyverse business model, because it implies that Base R is _inherently less popular_ than the Tidyverse, which makes their losing less about the whims of a corporation that they can argue is acting in bad faith, and more about the preferences of the R community, which they have no power to gatekeep.
Because the point of the Tidyverse is driving adoption among people who were not previously R users, the Tidyverse is able to win over the majority of the community not by persuading holdouts but simply by growing the community with new Tidyverse supporters. I understand how that can feel threatening to someone who was an R user pre-Tidyverse, but you can understand why they don't want to focus their message around shrinking the R community. And their efforts to persuade these new people to switch from the Tidyverse to Base R are undermined by the fact that the best argument for Base R over the Tidyverse, the familiarity of Base R to someone who learned Base R to begin with, doesn't apply to them at all. They're fighting a losing war with bad weapons.
A very unusual claim we have here, on the link-through to Matloff: “The Tidyverse also borrows from other "purist" computer science (CS) philosophies, notably functional programming (FP). The latter is abstract and theoretical, difficult even for CS students, and thus it is clear Tidy is an unwise approach for nonprogrammer students of R.”
This seems to be the root of the argument, and it is a completely bizarre statement.
I agree with Matloff's overall point, though. "Tidy programming" (which came to mean non-standard evaluation) is very hard to understand, even for R professionals. It relies on directly handling symbols, and encourages using new notations to do it. Debugging is even more complex with NSE code, and people learning the language will be doing a lot of debugging. I can't imagine a good way to introduce functions to newer users when using NSE. You'd have to first mention environments, scoping, and symbols.
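To make the NSE point concrete, here's a small sketch (the data frame and helper function are made up); note how wrapping the tidyverse form in your own function immediately drags in tidy-eval machinery:

```r
library(dplyr)

df <- data.frame(x = 1:10, grp = rep(c("a", "b"), 5))

# non-standard evaluation: `x` is resolved inside `df`, not in the caller's scope
filter(df, x > 5)

# the base R equivalent is explicit about where `x` lives
df[df$x > 5, , drop = FALSE]

# wrapping the NSE form in your own function needs the {{ }} (curly-curly) operator,
# which is where newcomers run into environments, quoting and symbols
keep_above <- function(data, col, threshold) {
  filter(data, {{ col }} > threshold)
}
keep_above(df, x, 5)
```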
My rant aside, I did find this next quote from Matloff a better argument for using dplyr:
> mean(Nile[80:100])
"printing the mean Nile River flow during a certain range of years. Incredibly, not only would this NOT be in a first lesson with Tidy, the students in a Tidy course may actually never learn how to do this. Typical Tidiers don't consider vectors very important for learners, let alone vector subscipts."
With dplyr, you'd subset based on a `filter()`, likely specifying the years to keep. It encourages self-explanatory code. Matloff's vector subscript tells me nothing about why certain elements are kept.
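For comparison, roughly what the dplyr version might look like (the built-in Nile series runs 1871-1970, so indices 80:100 correspond to the years 1950-1970):

```r
library(dplyr)

# base R, as in Matloff's example
mean(Nile[80:100])

# put the time series into a data frame with an explicit year column
nile <- data.frame(year = as.integer(time(Nile)), flow = as.numeric(Nile))

# dplyr: the year range is stated explicitly rather than via positional subscripts
nile %>%
  filter(year >= 1950, year <= 1970) %>%
  summarise(mean_flow = mean(flow))
```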
I've accepted that my open source project, htmx, will likely never drive any significant income. It's too simple. Whenever someone starts talking about consulting, I tell them to read the docs and maybe jump on the discord, but that they'll probably figure it out.
In addition to the revenue angle, a lot of developers don't like admitting "that seems too complicated" because it can be interpreted as "I'm not smart enough". Arguing for simplicity often requires a lot of backbone (or disagreeableness, according to taste.)
Complexity is a source of income everywhere, not just open source. It’s even worse in the Microsoft world. Just because you can’t see the complexity doesn’t mean it isn’t there.
The blog post is projecting a lot on the situation. It’s a blog. Take it for what it is.
When choosing between Office 365 and Google Workspace, you may find the familiarity of Office is an expensive choice with all the IT people you need to hire to tame it.
This article strikes me as odd. The premise is based on a "pattern of behavior", but to me it feels like the author is attributing malice where there likely is none.
> For the last nine-months I have noticed that the term Tidyverse is being used more regularly to describe what had been the Hadley-verse.

And???
Hadley never liked the name 'Hadleyverse' and the first appearance of the `tidyverse` package on CRAN is from 2016.
> RStudio management rode in on the data science wave, raising money from VCs.
> their IDE is clunky and will be wiped out if a top of the range product, such as Jetbrains, adds support for R
I strongly disagree. I've used both and don't find RStudio clunky. And Jetbrains would need to do more than simply add support for another language to compete. RStudio is more than just an IDE. For one, there's Shiny, which makes it trivial to deploy slick web-based data products to end users. Then there's the fact that RStudio has multiple methods of compiling code and analysis into formal document formats suitable for publishing or presentation.
RStudio isn't just an IDE, it's an entire ecosystem for coding, analysis, app development, and publication. It's not like Jetbrains couldn't make something like that, but it would essentially be a completely different product, not just one more supported language.
I can empathize with the argument that complexity is a source of income. But I don't think that it's intentional. Rather, the constant change of developers introduces little inconsistencies that accumulate over time and generate complexity. Recently, I even looked at some open source projects[1] to measure their complexity by counting import/include statements.
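As a rough sketch of that kind of measurement applied to an R codebase (the path below is hypothetical, and counting calls like this is only a crude proxy):

```r
# count library()/require() calls per file as a crude coupling/complexity proxy;
# "my_project/R" is a made-up path, not a real package
files <- list.files("my_project/R", pattern = "\\.R$", full.names = TRUE)

import_counts <- vapply(files, function(f) {
  src <- readLines(f, warn = FALSE)
  sum(grepl("^\\s*(library|require)\\(", src))
}, integer(1))

sort(import_counts, decreasing = TRUE)
```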
Consultancies made billions with unnecessarily complex JEE/Enterprise-patterns systems that needed hundreds of developers to maintain. So I'm not surprised all that pseudo agile/DRY/SOLID stuff was pushed as "good practice" in the 2000s and the 2010s.
"No you can't just use if statements, you need to use a chain of responsibility with many classes cause it's SOLID".
It's no mystery that one of the authors of a popular front-end framework was a JEE consultant, and that framework now feels like a solution looking for a problem...
What he's touching on is the Complexity Fallacy (i.e. we "over-complexify" things). Why? "We often find it easier to face a complex problem than a simple one."[1]
Humans also possess a Simplicity Bias (i.e. we "over-simplify" things). "Simplicity bias is a cognitive bias towards holding views which can be explained by a simple narrative."[2]
Ironically, this very article may exemplify the Simplicity Bias to explain a Complexity Fallacy!!
It seems the ultimate answer is the perennial, "know thyself." Practice self-awareness; and know which Bias you're leaning on, and more important, why you're leaning on it (in light of the goal).
This is kind of what happens with Postgres. Postgres in itself is great and simple to use. However, managing replication, maintenance, sharding... is often described as very complex at the conferences themselves. And that's precisely what most of the companies around Postgres live on.
There's plenty of effort going into improving those aspects, including by employees of companies making money on PG support. But there are plenty of hard problems (there are in other areas too, but work on those often started much earlier). These days the complexity around replication pushes users to RDS et al., which reduces income for the various PG companies; it doesn't increase it.
This is why I dislike commentary like the one you are responding to.
Replace "postresql" with literally any tool and the sentence holds true: redis, mysql, mongodb, cassandra, mssql, HDFS, Kubernetes, React, Vue...
The fact that cottage industries pop up around powerful tooling is not a bad thing. Some things are just complicated, and there shouldn't be such an aversion to those things.
The thing is that SQL was originally built to be an easily accessible DSL for simple database queries, and was not intended to be used to build highly complex stuff. Datalog-like query languages would do the latter much better, since they actually let you write reusable predicates and libraries for common use cases, while SQL easily grows into a monolithic monster.
Sadly a lot of the "NoSQL" hype later on was led by people who had no clue about database theory, and proceeded to throw the baby out with the bathwater and build whatever abomination MongoDB is, rather than taking SQL's theoretical basis and building a language on top of it that is actually designed for complex usecases. There were only a few exceptions like say Datomic in Clojure world that did NoSQL right.
IBM did not spend the money it did to replace its premier database product, IMS, with another DBMS, DB2, just to enable "simple database queries". Yes, SQL was influenced by other "4th generation languages" of the era. But IBM judged -- correctly, on the evidence -- that most programmers would not take the time to master first order predicate logic or set theory.
So they came up with something that implemented the Relational Model in a simple way, one that would attract new converts. SQL has plenty of warts and defects, but its success at extinguishing every other database query language can't be denied.
Datalog isn't especially difficult to learn though, and is intuitive enough that you can fully learn it in 20 minutes.
The relational model with unions of conjunctive queries at its core is excellent. SQL on the other hand feels like some weird COBOL dialect that gives you an incredibly verbose implementation of it.
With that said, there are a few enterprisey NoSQL databases which are relational yet provide a nicer language to work with than SQL. I mentioned datomic (which also has persistence as its key feature), but there's also TypeDB which lets you use the first class support for recursive queries to support inheritance more easily. And of course there's google's Logica language which compiles to SQL and lets you use datalog syntax and the ability to write reusable queries, though it doesn't support recursion due to the nature of its target.
This is true, and a corollary is the observation that the value proposition for many SaaS companies is just simplicity layered over open-source underpinnings.
For example, anyone who's tried to implement SAML SSO by hand or self-hosting a shibboleth server will appreciate the relatively simpler services provided by OneLogin / Okta.
I'm not sure, however, that this is a problem. The freedom of open source to develop without constraints has to come first, so that someone can later take existing code and impose guardrails to simplify.
In other words, freedom itself is at odds with simplicity.
Economics, not freedom, is at odds with simplicity. Nothing is actually free, so you have to have a paywall somewhere. For for-profit open source the paywall is usually at the UI/UX layer, where it's typically SaaS-based, or, for enterprise software, available in the form of consultants to hold your hand and train your team.
Make a free, open, easy to use app and you'll go bankrupt.
I frequently run across codebases that appear to have purposely “broken” tiny things that need to be fixed before the code can be used. This seems to be a type of lock/key mechanism to prevent just anyone from using it even though it’s open source
I meant freedom as in liberty, not free as in 'free beer.'
In other words -- if you are free to write code any which way, complexity will result. Or conversely, in order to keep code consistent, someone would need to enforce rules.
I 100% agree, but the problem is that most people care a lot more about free as in free beer than free as in liberty.
I have for quite some time seen the two as being at odds. Free as in free beer undermines free as in liberty by destroying the economic basis for anything other than surveillance or closed business models.
My first job was working for Sendmail, Inc. Their entire business model was based on helping people configure open source Sendmail, via consulting and/or a commercial web UI for configuring the software. And of course selling support for the open source software.
And it wasn't even intentional. The software existed for almost two decades before the commercial company. The commercial company sprang up because the open source authors needed paychecks for all the support they were doing. And also a bunch of companies were selling Sendmail consulting. So they figured people would rather buy consulting from the people who actually make the software.
Sadly, it didn't work out. The company was acquired for a loss to the investors.
These days it seems like people make a library, and they never get around to documenting it properly until someone offers them a book deal. And they can't just put all of that knowledge straight into the docs, because otherwise who would still buy the book?
From outside this looks like they're holding out on us.
Some tools require google stalking the maintainer in order to synthesize explanations from six different places into a Theory of the System of how the thing is supposed to work. Apache Ant, for a memorable one.
As with many things, you don't know your own opinion of a subject until you have to try to teach it to someone else.
They're doing the fun part (coding) for free. It's well-known that developers don't like writing documentation (esp "properly" which is ill-defined and for some definitions can be a much larger task than writing the code in the first place); in fact, does anyone enjoy writing docs as much as an engineer enjoys writing code?
So this makes total sense to me. As a user of free software, you can choose: a) pay for a commercial package which comes with docs; b) use free software, and pay for the docs in the form of a book (perhaps waiting a few years until lots of people want to do the same); c) use free software and figure it out from breadcrumbs the developer left in whatever forums they use; d) do (c), but then step up and write some docs.
Too many people think that free/libre software is supposed to be like commercial software, only better, but it's not the case. It's definitely better in the "libre" way, and of course cheaper in the "price" way, but it's usually not packaged for easy consumption, nor is the developer going to do a bunch of work they don't enjoy for other people who don't want to pay them for it.
> Too many people think that free/libre software is supposed to be like commercial software, only better, but it's not the case.
"It's free so shut up" only works if being free has no consequences. Software is all social, and we pretend that it isn't. Cutting the legs out from underneath commercial software has consequences. Crowding a headspace with entrants has consequences for the next person considering doing the same thing. Implying you have something worth other people's time to look at has consequences.
You figure out how to solve all of those problems and you can be above reproach. Until then, if you can't do it moderately right, then don't do it at all (or keep it to yourself).
True, but if your business model depends on providing support, you may be less incentivized to minimize the growth of complexity as new functionality is added (especially if your software has little competition).
I have found that claim to be mostly projection from closed source rivals, in my experience. Namely because those are the only places where I have found support knowledge outright cash-gated.
I heard the claim and it seemed like a line from Microsoft back when they were hostile to open source; a minute after it left their mouths, management promptly set up a meeting to ask "why aren't we doing that?", shortsightedly, to the detriment of their own long-term health.
The open source ethos has been more "I don't want to deal with that crap", with a general tendency against providing support beyond bugfixes. That goes back to when the community was more the stereotypical grumpy greybeard telling you to boost your expertise by RTFM than evangelists with a sizable blog and/or social media presence (depending on the timeframe), or even novices grinding out tutorial websites as a portfolio item, an extension of "extracurriculars to get into a good college" side activities. I am not calling that cultural change a bad thing, far from it. I am just noting that it takes far more energy to maintain a communal effort like that than some barely accessible man pages whose gap between last update and today would be old enough to rent a car if it were a person.
Perhaps there is a perverse incentive to not worry about simplifying now, but I think complexity will eventually piss people off enough to either leave the community or develop an alternative.
Atlassian's Jira is a good example of what the author's gunning for. A complex, unintuitive hulk of a product, with seemingly nonsensical UI updates and archives of out-of-date documentation.
Amazing for people providing support and training. As a potential user, I've been annoyed by badly set-up instances so many times I'd gladly turn my back on it.
Another example is Moodle. It's horrible spaghetti code (PHP as used in 2000, including addons relying on specific database schemata), and if you google for it you'll find a nice forum thread where a lot of LMS "consultants" are bashing a guy for asking whether there are plans for a rewrite.
And yes, as a Moodle user (teacher) you notice quickly that these decisions seriously impact usage!
Based on the title, the first thing that popped into my mind was https://www.aseprite.org/. Their source code is freely available to anyone at https://github.com/aseprite/aseprite/ and, if you search some more, you can find how to install it at https://github.com/aseprite/aseprite/blob/master/INSTALL.md. But none of these links are easy to find, and the installation is not that simple either. Is this way of doing things good or bad? It's partly better than not having the option available at all, but I just don't know how to feel about it.
I have no issue with this basic premise. It's annoying if you want to pay nothing for a mature, nuanced service, but in my mind it's the most equitable way to compensate open source development teams. Dual licensing, "enterprise" features, and such seem to be popular, but are really frustrating to work with and buy. No knock on software salespeople, but I've candidly never had a pleasant experience around any sales or marketing interaction with an OSS team. I would far rather pay for expertise than suffer endless email marketing, inside-sales lead qualification, waiting for the AE to ask an internal expert, and getting spun around on whether a needed feature does or will exist. That entire apparatus seems to exist to extract money from corporate bank accounts while development is a cost center rather than the main focus of the team.
More fundamentally, complexity has more to do with the problem space than the business model.
The central conceit of the paper (https://arxiv.org/abs/1806.06850) is the "NN<->PR correspondence" from section 6, which argues that since nice nonlinear activation functions like tanh can be arbitrarily well-approximated by polynomials, then the whole stack of convolutional layers can be arbitrarily well-approximated by polynomials, and that means the whole network is just a polynomial.
And sure, "technically" that's true but there are good reasons that numerics aren't just all polynomials all the time (check out Lloyd Trefethen's 'Approximation Theory and Approximation Practice', for example).
tanh is nice and smooth but its Taylor series doesn't converge outside a small region, and the whole reason you'd use tanh in the first place is because you'll have things outside that region. And tanh is probably the best case for Matloff's argument! ReLU has that corner, and I don't even want to think about approximating softmax with a polynomial.
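A quick illustration of the Taylor-series point (this only addresses Taylor expansion; better polynomial approximations exist on bounded intervals, but the divergence outside a small region is easy to see):

```r
# first four nonzero terms of the Taylor series of tanh around 0;
# the series converges only for |x| < pi/2
taylor_tanh <- function(x) x - x^3/3 + 2*x^5/15 - 17*x^7/315

x <- c(0.5, 1, 2, 4)
cbind(x, tanh = tanh(x), taylor = taylor_tanh(x))
# near zero the two agree; at x = 2 or 4 the polynomial has already blown up
# while tanh saturates near 1
```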
The blog has comments by, among others, Radford Neal and Kenghagho which I find compelling.
IMHO "open" systems that are too complex are not actually open.
What is closed source anyway? It just means you get binaries. Binaries are not actually obfuscated. You can disassemble them, and there are even some decompilers on the market that will create functionally equivalent C code with auto-generated function and variable names. Nothing stops you from decompiling and reading a closed-source binary... except the complexity and difficulty of understanding it.
Open source systems where the source and installation/admin procedure are absurdly complicated are not actually open... they're just using a different obfuscation technique. If it takes days to install or the source is write-only code, it's no different from closed source software.
This is asked as if there is no correct answer, but there is.
Open Source software is software where the source is made available under an OSI-approved Open Source licence. Closed source software is software which is not Open Source software.
Sometimes even the binaries are carefully obscured. I imagine it's an uphill battle to dig the binary out of a PS5 game, for instance.
> Binaries are not actually obfuscated.
Some are.
> Nothing stops you from decompiling and reading a closed-source binary... except the complexity and difficulty of understanding it.
And possibly the licence, and possibly laws like the DMCA. Also, even if you succeed in reverse-engineering the code, you may be legally constrained in what you can do. If you improve the software, you probably cannot legally distribute the improved version, although you might get away with distributing a delta.
> Open source systems where the source and installation/admin procedure are absurdly complicated are not actually open... they're just using a different obfuscation technique.
I agree artificial complexity (whether in the code or in the software's usability) could be a way for someone to, say, deliberately subvert copyleft licences. An artificially unmanageable codebase would increase the cost of development for the company, though.
As for tricky installation processes, I believe the only way to get the official OpenBSD builds is by buying their DVDs. I don't think most OpenBSD users particularly mind compiling their own system, but I don't know for sure.
> If it takes days to install or the source is write-only code, it's no different from closed source software.
It is different, because the licence allows others to improve the software.
> This is asked as if there is no correct answer, but there is.
> Open Source software is software where the source is made available under an OSI-approved Open Source licence. Closed source software is software which is not Open Source software.
I believe you're missing the point. Parent is making a moral point and you're talking about copyright law and licences stamped by the OSI. For one, I'm pretty sure parent would trust GNU much more than the OSI. Perhaps parent's point would have been clearer if they had said 'non-free software' instead of 'closed source'.
And GNU licences define 'source code' as the 'preferred form for making modifications'. The reason for this is exactly parent's point. As an example, that's one (of multiple) gripes I have with ML: some 100 lines of Python defining a neural network with millions of parameters doesn't constitute the source code, and neither do the trained parameters. The source in this case is non-textual: it's the training data and the state-dump of the IDE used to manage the training and inspect the model.
> > If it takes days to install or the source is write-only code, it's no different from closed source software.
> It is different, because the licence allows others to improve the software.
But that is pure legalese, since it's intractable to modify it. It completely subverts the spirit of the licence; as such it is practically, and in spirit, closed source.
Extending the argument a bit more: what we want isn't just free software, it's software that's also interoperable, where it's easy to reuse just a part, well engineered (minimally complex), with well-defined scope and powerful/precise interfaces, etc. That software is golden; the rest will just disappear in 20 years.
Thanks for this response. Hadn't meant to straw-man.
I agree it's important that our understanding of something like Free Software remain relevant and practical. At the same time though I think quality and freedom can generally be pretty well separated. If you release poor quality code under a Free and Open Source licence, we generally agree it still counts as Free and Open Source, despite that it's hard to work with. I think that's as it should be, although deliberate subversion might be an exception.
Are there any examples of this actually happening, though?
I imagine the best way to create obfuscated code is to use an obfuscator program, but the result of doing so is not considered to be source code, for our purposes (the preferred form of modification, as you rightly noted). If someone wants to deliver obfuscated source code and pass it off as Free and Open Source Software, they would need to program it manually, which would be severely punishing to their development work.
> what we want isn't just free software, it's software that's also interoperable, where it's easy to reuse just a part, well engineered (minimaly complex), with well defined scope and powerful/precise interfaces, etc.
That strikes me as clearly going too far. Technical excellence is not among the four conditions that define Free Software, neither should it be.
It also has absurd consequences:
• If you write a program in a language which then falls out of favour, do we say that it was Free Software at the time but no longer? (Software freedom is now defined in such a way that it is a function of which programming languages are currently in vogue.)
• If you are an outstanding programmer, you are better able to cope with poor quality code and with code in obscure languages, broadening what software you are able to work with. (Software freedom is now defined in such a way that it is a function of the individual programmer's skill.)
• If you develop a new algorithm and make the source code of your implementation available under a Free and Open Source licence, before you publish a paper explaining your new algorithm, the implementation is considered non-Free prior to the paper's publication, but Free afterward. (Software freedom is now defined in such a way that it is a function of available documentation and literature.)
Similar points might be made on the grounds of what hardware is currently available (does the non-availability of an emulator impact whether software counts as Free?), and what institutional resources are currently available (should laser-cutting software be considered non-Free by most, but Free by the relevant industries?). As an example, NASA's old launch programs [0] are at the intersection of these two concerns.
This is why a market-based economy is not good at solving problems. It's great at managing problems, alright, but solving a problem would mean cutting off a source of potential revenue.
It's not good at solving problems in the best possible way according to everyone. Centralisation is great at solving problems in the best possible way according to a single person or circle of people.
It's good at solving problems so that the majority of people willing to pledge resources are happy.
2) python and numpy (and the scientific stack built atop it)
In neither case was the primary incentive, at first anyway, a desire to make money by complexity lock-in. I think there is a real cost to the complexity of having what is, essentially, a whole second language (with its own syntax and ways of doing things) built on top of the original language. But, I think this kind of thing can and does happen regardless of the financial incentives involved.
Wouldn't be the only such project if that was the case.
I often joke that Angular (or, more specifically, any app written in it) is a jobs program for front-end developers.
The architecture of this framework is so byzantine that you can spend _years_ working with it and still keep discovering new, often insufficiently documented features.
> A complex package ecosystem was probably not part of RStudio's product vision, at least for many years. But sooner or later, RStudio management will have realised that simplicity and ease of use is not in their interest.
Or it just grew that way. Which is the most likely explanation.
Complexity is a source of income in every business ecosystem. For example, Google is super profitable because information on the internet has a high degree of complexity, and the more complex an ecosystem is, the more opportunity there is for innovation and revenue.
Well isn't that the whole pay for support business model? You know linux vendors could have made this easier, but there's never been any rational self-interest for them to do so. There's money in bringing order to chaos.
> RStudio, the company, need to sell their services (their IDE is clunky and will be wiped out if a top of the range product, such as Jetbrains, adds support for R). If R were simple to use, companies would have less need to hire external experts.
They are actually fighting for survival I would say.
Most of the people using R and RStudio are the people who only know R and RStudio.
RStudio's interface feels like it is from the early 90s, and RStudio is impractical and messy to use. Frankly, it is _bad_.
R, as a language, does only a few things well. People whose use cases are limited to those things use it.
Data Science wave riders dominantly learn Python (IMO, as they should). Think >90%.
There are plenty of Data Science Misguiders (influencers) who, from shallow knowledge, tell gullible learners to learn R OR Python. And still almost everybody chooses the latter.
Some people I know who use R and RStudio use them as a simple dashboard or dataviz app, and those who use R for other reasons write it using Jupyter, VS Code, or good ol' vim.
R thrives on the backs of less-skilled developers and domain experts using R for some reason.
Some argue that the tooling of R projects is better than Python's, but it also does less. A penny is of course more reliable than a train.
I'm one of those noobs. Python... IDK, I don't really like pandas, but I can do anything with Python, the community feels pretty welcoming even when you're asking stupid things (which I do), there's plenty of information for everything, and I can use any database I want...
I tried R but it felt weird, and the community felt more elitist.
I discovered that PyCharm had some nice R support a few weeks back. Not sure why PyCharm and not other IDEs, but what was there was simple but decent. Even embedded the graphics output in its panels and integrated really nicely. I can see myself using this when I have problems which call for R.