Can anyone knowledgeable here speak about statistics and offer some advice? I'm about to get into a project where, for the first time, I'll need to do some statistics processing and visualization. I haven't started on that component of it yet, and I'm free to choose whatever tool I want. Most of the rest of my project is in Haskell, but for the processing/visualization of statistics part, I was thinking of choosing R. Does anyone know how well Mathematica 8, or other commercial packages, stack up?
I've been using Mathematica (MMA) for my dissertation's data analysis, and for the most part it's been great. As an environment to manipulate data in, it's by far the best that I've used- once you get the hang of it, the pattern/transformation-rule language is incredibly useful for reformatting, recoding, mixing, slicing, dicing, etc. one's data. If you're coming from Haskell, you'll probably pick this part up way faster than I did at first.
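To give a flavor of the kind of recode/reshape work I mean, here's a rough analogue in plain Python (the records and field names here are made up for illustration):

```python
# Hypothetical survey records: recode, then slice.
records = [
    {"subject": 1, "group": "ctrl", "score": "7"},
    {"subject": 2, "group": "treat", "score": "9"},
    {"subject": 3, "group": "treat", "score": "5"},
]

# Recode: coerce score strings to ints, map group labels to codes.
group_codes = {"ctrl": 0, "treat": 1}
recoded = [
    {**r, "score": int(r["score"]), "group": group_codes[r["group"]]}
    for r in records
]

# Slice: pull out just the treatment-group scores.
treat_scores = [r["score"] for r in recoded if r["group"] == 1]
print(treat_scores)  # [9, 5]
```

In MMA you'd express the same transformations as pattern-matching rewrite rules applied to nested lists, which scales nicely once the reshaping gets hairy.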
If your stats needs are relatively simple- linear models, GLMs, logit models, ANOVAs, simple tests of hypotheses, etc.- MMA is more than adequate. The new version looks like it adds some non-parametric stats functions, as well as paired t-tests, both of which would be quite useful to me.
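For concreteness, here's roughly what a paired t-test computes- not MMA code, just a pure-Python sketch on made-up before/after numbers:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)), df = n - 1."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Made-up measurements for five subjects, before and after treatment.
before = [12.0, 15.0, 11.0, 14.0, 13.0]
after  = [10.0, 14.0, 11.0, 12.0, 11.0]
t, df = paired_t(before, after)
print(round(t, 3), df)  # t = 3.5 with 4 degrees of freedom
```

Any of the packages in this thread will do the p-value lookup for you on top of this; the point is just that the "simple tests" tier really is simple.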
Also, the visualization tools in MMA are fabulous, and don't make me want to tear my beard out every time I have to go off the beaten path (as opposed to those found in certain other one-letter-long stats environments I could name). 'Nuff said. Another thing I really appreciate about MMA is how consistent the syntax and functions are- once you've figured out one function, the odds are good that your knowledge will be useful on the next function you try to figure out. This, again, stands in stark contrast to other packages (R, SAS, I'm looking at you guys).
I have found myself turning to R for certain specific things, though. Mixed-effects models, repeated-measure ANOVA, Fisher's Exact Test, etc. Really, the two work together well- it's easy to use MMA to get your data in exactly the right form for R, export it, and then do whatever you need from there.
I think Mathematica is more useful for symbolic processing. For crunching large matrices of numbers and making some plots, R or Matlab (or the free Octave) is probably best. It depends mostly on which programming paradigm you are more comfortable with.
I cannot speak for Mathematica; I have barely used it. For stats, R is hard to beat: it has a lot of cutting-edge packages through CRAN (R's equivalent of CPAN), and it's what most leading academics in statistics use.
Now, its heritage shows quite a bit, and the language is not always nice to play with. But it has great plotting facilities, e.g. ggplot (http://had.co.nz/ggplot), which takes a principled and very interesting approach to data visualization.
There are also quite a few things available in scipy if Python is your thing. If you just want to do stats and are not familiar with Python, nor want to deal with a general programming language, R is better I think. If you want to make full-fledged applications with a web frontend, R will not be pleasant :) (usual disclaimer: I am a numpy/scipy contributor).
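For example, scipy.stats ships Fisher's exact test; here's a rough pure-Python sketch of what the two-sided version computes, on a made-up 2x2 table:

```python
from math import comb

def fisher_exact_2sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def p(x):  # P(top-left cell == x) with all margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs + 1e-12)

# Made-up contingency table [[8, 2], [1, 5]].
print(round(fisher_exact_2sided(8, 2, 1, 5), 4))  # 0.035
```

scipy.stats.fisher_exact does the same thing (and also reports the odds ratio) in one call.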
Alternatives like Matlab, Maple, Stata all have basically the same 'look' to their default graphing packages.
Even though Mathematica would not be the right choice for statistical processing, the graphs it produces are a step above the rest.
So it depends what your use case is... any of the above would look good enough for an academic paper. But if you're going to be publishing these in a magazine, they probably won't cut it.
Especially when you have such pretty defaults in things like Matlab and Mathematica.
Matplotlib (http://matplotlib.sourceforge.net/) is actually the best open-source library for creating great-looking graphs that I have come across, and is comparable to Matlab and Mathematica.
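A minimal matplotlib sketch (the data and labels here are made up)- the defaults already look reasonable, and every element is tweakable:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

xs = list(range(10))
ys = [x ** 2 for x in xs]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(xs, ys, marker="o", label="x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
fig.savefig("quadratic.png", dpi=150)
```

The figure/axes object model also makes subplot layout far less painful than the Matlab experience described above.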
I guess taste comes into play - I find producing good (as in academic-publication good) figures in Matlab an exercise in pain. The subplot mechanism is awful (at least it was 5 years ago), and it is hard to control the layout. I heard Mathematica is much better in that regard, but I've never used it myself.
Just use R -- the stats support is IMO the best in the world and it has very high adoption amongst stats grad departments and practitioners of statistics. The visualization tools work well -- I've written about basic plotting tools, but if you're just starting in R, skip the built-in plotting tools and just use ggplot2. It allows you to build astounding graphics.
Of course, any of this is a time investment, but I'd say the only alternative is Matlab - S-Plus is stupidly expensive and no better than R, Stata is a pain for any sort of automated processing, SAS is overpriced by an order of magnitude with a hideous learning curve for functionality that lags 10 years behind R, and Mathematica's stats support is brand new to the market. Let someone else work out the kinks.
SPSS is quite good, for certain things. A lot of researchers use it because little-to-no programming is required, and you can interact with it in an entirely GUI-way- if you can use Excel, you can use SPSS. It makes it easy to set up certain analyses, and gives lots of output... and that's where my concerns about it come up. It's easy to fall into a false sense of security with it, and to end up with statistics that you don't know how to interpret properly (I call this the "Huh. Now what do I do?" problem). The documentation is often pretty useless on this front as well- lots of pages follow this general pattern: "Jones Test of Gronkularity: If checked, SPSS will calculate the Jones Test of Gronkularity statistic, which tests the null hypothesis that the data are gronkular", as opposed to useful information about why you might care whether the data are gronkular or not, why the Jones test was included in another test's output, etc. For a product aimed at people with relatively limited technical capabilities, I feel like SPSS should have better docs.
One important thing to know about SPSS- that a lot of people don't- is that it is really a programming language, for which the GUI is simply a code generator. I find that it's almost always easier for me to interact directly with the under-the-hood guts of SPSS than with the GUI, although sometimes when setting up a new analysis for the first time I'll use the GUI to do most of the work and then tweak its results.