Monthly Archives: January 2011

A gentle introduction to R

Whenever a post on this blog requires some data analysis and perhaps a chart or two, my tool of choice is the versatile statistical programming package R. Developed as an open-source implementation of an engine for the S programming language, R is free. Since commercial mathematical packages can cost thousands of dollars, this alone makes R worth investigating. But what makes R particularly powerful is the large and growing array of specialised packages. For any statistical problem you come across, the chances are that someone has written a package that will make the problem much easier to get to grips with.

If it was not already clear, I am something of an R evangelist and I am not the only one. The growing membership of the Sydney Users of R Forum (SURF) suggests that we are getting some traction and there are a lot of people interested in learning more about R.

Sooner or later, every R beginner will come across An Introduction to R, which appears as the first link under Manuals on the R website. If you work your way through this introduction, you will get a good grounding in the essentials for using R. Unfortunately, it is very dry and it can be a challenge to get through. I certainly never managed to read it from start to finish in one sitting, but having used R for more than 10 years, I regularly return to read bits and pieces, so by now I have read and re-read it all many times. So, useful though this introduction is, it is not always a great place to start for R beginners.

There are many books available about R, including books focusing on the language itself, books on graphics in R, books on implementing particular statistical techniques in R and more than one introduction to R. A few weeks ago I was offered an electronic review copy of Statistical Analysis With R, a new beginner’s introduction to R by John M. Quick. Curious to see whether it could offer a good springboard into R, I decided to take up the offer.

At around 300 pages and covering a little less ground, it certainly takes a more leisurely pace than An Introduction to R. It also attempts a more engaging style by building a narrative around the premise that you have become a strategist for the Shu army in 3rd century China. The worked examples are all built around the challenge of looking at past battle statistics to determine the best strategy for a campaign against the rival Wei kingdom. Given how hard it can be to make an introduction to a statistical programming language exciting, it is certainly worth trying a novel approach. Still, some readers may find the Shu theme a little corny.

The book begins with instructions for downloading and installing R, goes on to explore the basics of importing and manipulating data, statistical exploration of the data (means, standard deviations and correlations) and linear regression, and finishes with a couple of chapters on producing and customising charts. This is a good selection of topics: mastery of these will give beginners a good grounding in the core capabilities of R. Readers with limited experience of statistics may be reassured that no assumptions are made about mathematical knowledge. The exploration of the battle data is used to provide a simple explanation of what linear regression is as well as the techniques available in R to perform the computations. While this approach certainly makes the book accessible to a broader audience, it is not without risks. Statistical tools are notorious for being abused by people who do not understand them properly. As a friend of mine likes to say, “drive-by regressions” can do a lot of damage!

Each chapter adopts the same structure: a brief introduction advancing the Shu story; a list of the topics covered in the chapter; a series of worked examples with sample commands to be entered into the R console followed by an explanatory “What just happened?” section and a “Pop quiz”; suggestions for further tasks for the readers to try; and finally a chapter summary. At times this approach feels a little repetitive (and the recurring heading “Have a go hero” for the suggested further tasks section may sound a little sarcastic to Australian readers at least), but it is thorough.

If I were to write my own introduction to R (one day perhaps?), I would do some things a little differently. I would try to explain a bit more about the semantics of the language, particularly the differences between the various data types (vectors, lists, data frames and so on). But perhaps that would just end up being as dry as An Introduction to R. Also, though I certainly agree with Quick that commenting your code is a very important discipline (even if no-one else ever reads it, you might have to read it again yourself!), I do think that he takes this principle too far in expecting readers to type all of the comments in the worked examples into the console!

Statistical Analysis With R is a very gentle introduction to R. If you have no prior experience of R, reading this book will certainly get you started. On the other hand, if you have already started experimenting with R, the pace may just be a little too slow.

Holiday reading

My now traditional annual pilgrimage to the South coast of New South Wales saw the rainiest weather I can remember. While it was nothing on the scale seen in Queensland and Victoria over recent weeks, it did take its toll on some of the wildlife: we saw dozens of dead porcupine puffers washed up on the beach, apparently the victims of an algal bloom triggered by the rains. On the plus side, the lack of sunshine did help me to catch up on a bit of overdue reading, including a review copy of a Beginner’s Guide to R which you can expect to hear more about when I manage to finish writing the review.

I also read two books about climate change, which were very different in style and content.

Merchants of Doubt

The first was Erik Conway and Naomi Oreskes’ Merchants of Doubt (How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Global Warming). The book is not really about climate change per se, but rather the modus operandi of a number of key climate skeptics. In the process it sheds some interesting light on a question I considered here on the blog about a year ago: why does belief or disbelief in the reality of climate change tend to be polarised along political lines? Most of the protagonists in the Merchants of Doubt are scientists, many of whom were physicists involved in the original US nuclear weapons program. The thesis that Conway and Oreskes build is that these scientists were committed anti-Communists and as the Cold War began to thaw, they saw threats to freedom and capitalism in other places, particularly in the environmental movement. That, at least, is the explanation given as to why the same names appear in defence of Ronald Reagan’s “Star Wars” missile defence scheme, in defence of the tobacco industry (first arguing against claims about the health risks of smoking, later about the health risks of second-hand smoke), dismissing the idea of acid rain and finally casting doubt on claims of human-induced climate change.

While I would not expect the book to sway any climate change skeptic, it should at least encourage people to think a bit harder about messengers as well as the message. It certainly prompted me to do just that. When reading the chapter on the second-hand smoke controversy, I immediately thought of an episode of Penn and Teller’s very entertaining pseudo-science debunking TV series Bullshit*. The episode in question, as I remembered it, did a convincing job of portraying the risks of second-hand smoke (SHS) as dubious at best. Watching it again was eye-opening. Looking past the scathing treatment of the anti-SHS activist, I focused instead on the credentials of the talking heads who were arguing that the science was not settled. The two main experts were Bob Levy from the Cato Institute, a libertarian think-tank, and Dr Elizabeth Whelan, the president of the American Council on Science and Health.

Levy’s voice immediately suggests he is a smoker, which does not, of course, disqualify him from questioning the science of SHS. More intriguing is the fact that the Cato Institute regularly appears as an organisation of interest in the Merchants of Doubt. Conway and Oreskes draw a number of links between the Cato Institute and both the defence of the tobacco industry and skepticism of global warming, particularly in the person of Steven Milloy who, before joining Cato, worked for a firm whose main claim to fame was to provide lobbying and public-relations support for tobacco giant Philip Morris.

As for the American Council on Science and Health, it sounds at first like some kind of association of health professionals (which is presumably why Whelan chose the name). It is in fact an industry-funded lobby group…sorry, I mean an independent, nonprofit, tax-exempt organisation. Exactly how much of their funding comes from where is now shrouded in mystery, but here are the details as of 1991.

Of course, scrutinising the backgrounds of Levy and Whelan does not prove that their claims are wrong. It does, however, raise the question of why Penn and Teller did not interview anyone more independent, perhaps even a scientist, who expressed the same doubts.

What’s the Worst That Could Happen?

The second book on climate change that the rain helped me to read was Greg Craven’s What’s the Worst That Could Happen? I bought this after watching Craven’s amusing, if flawed, video “The Most Terrifying Video You Will Ever See”. Craven, a high-school science teacher in Oregon, has clearly workshopped the issue of climate change extensively with his students and the insight he wants to share in his videos and his book is essentially that the whole problem can be viewed from a game-theoretic perspective. Rather than trying to decide what is true or not (are the skeptics right or are the warmers right?), the important question is whether or not we should be acting.

Craven’s Global Warming Decision Grid

In his video, Craven uses an action versus outcome “decision grid” to argue that the consequences of not acting in the event that global warming turns out to be true are worse than the consequences of acting (i.e. economic costs) if it turns out to be false. The argument is entertaining, but unfortunately flawed. The problem is that it can be applied to any risk, however remote. As he writes in the book:

Simply insert any wildly speculative and really dangerous-sounding threat into the grid in place of global warming, and you’ll see the grid comes to the same conclusion–that we should do everything possible to stop the threat. Even if it’s something like giant mutant space hamsters (GMSHs).
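The flaw can be made concrete with a minimal Python sketch. This is not Craven's own formulation: the cost scores, the probability values and the function names below are all hypothetical, chosen only to illustrate the point. The sketch assumes the grid is read as a worst-case comparison, and then contrasts that with one possible alternative, weighting each outcome by an assumed probability.

```python
# Illustrative sketch only: all cost scores and probabilities are hypothetical,
# not taken from Craven's video or book. Higher cost = worse outcome.

def make_grid(cost_of_action, cost_of_catastrophe):
    """Build a 2x2 decision grid: action (rows) versus whether the threat is real."""
    return {
        "act": {"threat_false": cost_of_action, "threat_true": cost_of_action},
        "do_nothing": {"threat_false": 0, "threat_true": cost_of_catastrophe},
    }

def worst_case_choice(grid):
    """Pick the action whose worst outcome is least bad, ignoring probabilities."""
    return min(grid, key=lambda action: max(grid[action].values()))

def expected_cost_choice(grid, p_true):
    """Pick the action with the lowest expected cost, given an assumed probability."""
    def expected(action):
        row = grid[action]
        return (1 - p_true) * row["threat_false"] + p_true * row["threat_true"]
    return min(grid, key=expected)

# Same hypothetical scores for both threats: acting is costly, catastrophe far worse.
warming = make_grid(cost_of_action=10, cost_of_catastrophe=1000)
hamsters = make_grid(cost_of_action=10, cost_of_catastrophe=1000)

print(worst_case_choice(warming))             # act
print(worst_case_choice(hamsters))            # act: the grid cannot tell the threats apart
print(expected_cost_choice(hamsters, 1e-9))   # do_nothing: a tiny probability changes the answer
print(expected_cost_choice(warming, 0.5))     # act
```

Read this way, the grid recommends action for any sufficiently scary threat, because the likelihood of the threat never enters the comparison. Some estimate of how credible the threat is has to come from outside the grid.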

The book is an attempt to rescue his idea by developing a series of tools to help sift through the arguments for and against climate change without having to actually understand the science. Along the way, he includes an extensive discussion of confirmation bias which I enjoyed as I am fascinated by cognitive biases. Ultimately though, his conclusions rest on an argument from authority. While he makes an excellent case for the important role that authority plays in science, this approach will not win over the skeptics I know: I can already hear their riposte in the form of the establishment’s rejection of Alfred Wegener’s theory of continental drift.

Skeptics aside, What’s the Worst That Could Happen? is an extremely accessible book (perhaps even too folksy in its style for some) and is probably best read by those who are not already entrenched in one camp or another and are just sick of the whole shouting match.

* Long-time readers may remember that Bullshit has been mentioned on the blog before in this post about bottled water.

Hans Rosling: data visualisation guru

It is no secret that I am very interested in data visualisation, and yet I have never mentioned the work of Hans Rosling here on the blog. It is an omission I should finally correct, not least to acknowledge those readers who regularly email me links to Rosling’s videos.

Rosling is a doctor with a particular interest in global health and welfare trends. In an effort to broaden awareness of these trends, he founded the non-profit organisation Gapminder, which is described as:

a modern “museum” on the Internet – promoting sustainable global development and achievement of the United Nations Millennium Development Goals

Gapminder provides a rich repository of statistics from a wide range of sources and it was at Gapminder that Rosling’s famous animated bubble charting tool Trendalyzer was developed. I first saw Trendalyzer in action a number of years ago in a presentation Rosling gave at a TED conference. Rosling continued to update his presentation and there are now seven TED videos available. But the video that Mule readers most often send me is the one below, taken from the BBC documentary “The Joy of Stats”.

If the four minutes of video here have whetted your appetite, the entire hour-long documentary is available on the Gapminder website. You can also take a closer look at Trendalyzer in action at Gapminder World.

A way with words

Sometimes the things that are unsaid are far more telling than the things said.

I had cause to reflect on this when I stumbled across a book on my shelves that I have not opened for many years. The book, entitled “Deutsche Bank: Dates, facts and figures 1870-1993”, is an English translation of the year-by-year history of the bank compiled by Manfred Pohl and Angelika Raab-Rebentisch. In keeping with the title, the style is more bullet points than narrative. Nevertheless, I continue to find the pages spanning World War II strangely fascinating.

In 1938, with the connivance of the French and British, Germany annexed Sudetenland in Western Czechoslovakia. For Deutsche Bank, this meant more branches.

Deutsche Bank 1938

The following year, Deutsche Bank was fortunate enough to be able to continue its branch expansion, this time into Poland. At least this time, there is a mention of the events outside the bank that may have been relevant.

Deutsche Bank 1939

Another year, and some more expansion for the bank including a few branches in France. No need to mention the invasion of France here, of course.

Deutsche Bank 1940

From 1942, outside events start to interfere with the bank: the “impact of war” forces rather inconvenient branch closures.

DB War End

To see these extracts in the full context, here are the pages spanning 1934 to 1940 and 1940 to 1946.