Statisticians and data analysts are in a kerfuffle about the recent remarks of AnnMaria De Mars, Ph.D. (President of The Julia Group and a SAS Global Forum attendee) in her blog that the open source statistical analysis tool R is an “epic fail,” or to put it in Twitterese, #epicfail:
I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail.
And oh, how the hashtags and comments and teeth-gnashing began!
Nathan Yau’s excellent FlowingData blog recaps the kerfuffle nicely, and both his post and Dr. De Mars’ have accumulated thoughtful comment threads. I added my thoughts to both, and I expand on them here:
To make my prejudices clear, I’ve spent several decades in commercial statistical software development (working in a variety of R&D roles at SYSTAT, StatView, JMP, SAS, and Predictum), and I now do custom JMP scripting, etc., for Global Pragmatica LLC.
I can say with hard-won authority that:
– good statistical software development is difficult and expensive
– good quality assurance is more difficult and expensive
– designing a good graphical user interface is difficult and expensive
– a good GUI is worthwhile, because the easier it is to try more things, the more things you will try, and
– creative insight is worth a lot more than programming skill
Even commercial software tends to be under-supported, and I’ll be the first to admit that my own programming is as buggy as anybody else’s, but if I’m making life-and-death or world-changing decisions, I want to be sure that I’m not the only one who’s looked at my code, tested border cases, considered the implications of missing values, controlled for underflow and overflow errors, done smart things with floating point fuzziness, and generally thought about any given problem in a few more directions than I have. I want to know that when serious bugs are discovered, the knowledge will be disseminated and somebody’s job is on the line to fix them.
For all these reasons, I temper my sincere enthusiasm about the wide open frontiers of open source products like R with a conservative appreciation for software that has a big company’s reputation and future riding on its accuracy, and preferably a big company that has been in the business long enough to develop the paranoia that drives a fierce QA program.
R is great for what it is, as long as you bear in mind what it isn’t. Your own R code, or R code that you find sitting around, is only as good as your commitment to testing it and your understanding of thorny computational gotchas.
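For readers who haven’t been bitten yet, here is a small illustration of the kind of gotcha meant above. It is a sketch in Python rather than R, the data are made up, and the function names are hypothetical; it is not taken from any particular package. The sample variance of these four numbers is exactly 30, but the textbook one-pass formula loses it to floating point cancellation, while a two-pass version gets it right.

```python
import math

# Made-up data: small deviations riding on a large offset.
# The deviations from the mean are -6, -3, 3, 6, so the sample variance is 30.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]


def naive_variance(xs):
    """Textbook one-pass formula: (sum(x^2) - n * mean^2) / (n - 1).

    Subtracts two huge, nearly equal numbers, so rounding error dominates.
    """
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    return (total_sq - n * mean * mean) / (n - 1)


def two_pass_variance(xs):
    """Compute the mean first, then sum squared deviations from it."""
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.fsum((x - mean) ** 2 for x in xs) / (n - 1)


naive = naive_variance(data)
careful = two_pass_variance(data)

# Floating point "fuzziness": exact equality is the wrong test for doubles.
exactly_equal = (0.1 + 0.2 == 0.3)            # False
close_enough = math.isclose(0.1 + 0.2, 0.3)   # True

print("naive:", naive, "two-pass:", careful)
```

With ordinary IEEE 754 doubles the naive result lands nowhere near 30 and can even come out negative, which is exactly the sort of thing a fierce QA program exists to catch.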
I share the apparently-common opinion that R’s interface leaves a lot to be desired. Confidentiality agreements prevent me from confirming or denying the rumors about JMP 9 interfacing with R, but I will say that if they turn out to be true, both products would benefit from it. JMP, like any commercial product, improves when it faces stiff competition and attends to it, and R, like most open source products, could use a better front end.
And now let me make my case for R being an epic success.
I like open source software. I use a bunch of it, and I do what I can for the cause (which isn’t much more than evangelism, unfortunately). For me, the biggest win with open source software is that it makes tools available to me, and others, who don’t need them enough to justify much of a price, but who can benefit from them when they’re affordable or free. When an open source tool gets something done for me, or eases some pain at least, I’m not that picky about its interface, and I’m willing to do my own validation (where applicable).
I can’t say that I love using Linux, but as a long-time UNIX geek and Mac OS X bigot, I am glad Linux is available, I use it for certain things, and I think it’s a whole lot better than Windows and other OSes, especially when Ubuntu builds work out. (I’ve had trouble getting JMP for Linux installed on Ubuntu, but that’s probably due to my own incompetence.) OpenOffice is kind of a pain, but it’s better than paying Microsoft for the privilege of enduring the epic fail that is Office, and it has much better support than Office for import/export of other formats. I love it that any number of open source projects are developing such fabulous tools as bzr version control, which I use daily, and that the FINK project is porting a whole bunch of great open source UNIX widgets to Mac OS X.
I think it’s wonderful that some of the world’s greatest analytical minds are using R to create publicly available routines for power-analysts. I love it that students and people who can’t afford commercial stats software, or who won’t use it enough to justify buying a license, have a high-quality open source option, if they’re willing to work at it a bit. I think it’s great that people who think Excel is good enough can’t make a price objection to upgrading to R.
I believe that democratizing innovation and proliferating analytical competence are good for us all. I count on projects like R and Linux to push commercial developers to make better products, and to force pricing and licensing of those products to remain reasonable. Monopolies are good for nobody, including monopolists.
Long live the proponents of R!
What do you think? Do you trust open source stats code? Do you think R’s interface is good enough? Is JMP’s any better? How heavily do you factor quality of documentation into decisions about software?