Monthly Archives: November 2009

Deceptive Charts #2

Last month I wrote about the dangers of secondary axes, but even charts with a single axis can be deceiving. I have been reflecting on this after reading Jon Peltier’s critique of Microsoft’s “professional” charting tutorials earlier this week. One of the charts Peltier takes issue with is a column chart which has the value axis starting at 100 rather than zero. He writes:

This is a major chart fail. The value axis on a column or bar chart should always include zero. Always. If you want to expand the scale to help resolve the values, then a column chart is not the right chart type.

Bar Chart - Bad
Median Income of Readers – Silicon Alley Version

This chart may do a good job of highlighting the Wall Street Journal’s leading position, thereby supporting Silicon Alley’s headline “The Journal Has The Richest Readership Among Print Pubs”. But it also gives a distorted impression of just how solid the Journal’s lead is. Starting the income axis at zero, shown in the chart below, gives a rather different impression. The Wall Street Journal still sits at the top, but the variation across the titles is much less significant than the original chart suggested.

Bar Chart - Good(ish)

Median Income of Readers – Zero-based Version
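The size of the distortion can be put in numbers. The sketch below compares the ratio of two bar lengths when the axis starts at zero with the ratio when it starts at 100. The function name and the two income figures are made up for illustration; they are not the values behind the charts above.

```python
def bar_length_ratio(larger, smaller, baseline=0.0):
    """Ratio of the drawn lengths of two bars for a given axis baseline."""
    if smaller <= baseline:
        raise ValueError("baseline must lie below both values")
    return (larger - baseline) / (smaller - baseline)


# Hypothetical median incomes in thousands of dollars (illustrative only).
top, runner_up = 190.0, 160.0

true_ratio = bar_length_ratio(top, runner_up)          # axis starts at zero: 1.1875
drawn_ratio = bar_length_ratio(top, runner_up, 100.0)  # axis starts at 100: 1.5

# How much the truncated axis inflates the apparent gap between the two bars.
exaggeration = drawn_ratio / true_ratio
print(f"true {true_ratio:.2f}, drawn {drawn_ratio:.2f}, inflated x{exaggeration:.2f}")
```

With these invented figures the leader is about 19% ahead, but on the truncated axis its bar is drawn 50% longer.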

Nevertheless, precisely because it displays less variation in the data, the zero-based chart does seem less useful and it is harder to read the values. Commenting on Peltier’s post and musing on my posterous Extras blog, I wondered whether starting axes with zero should be considered an inviolable rule of charting. One of the gurus of data visualisation is William S. Cleveland. In his book “The Elements of Graphing Data” he gives this advice: “Do not insist that zero always be included on a scale showing magnitude”. He goes on to make this argument:

For graphical communication in science and technology assume the viewer will look at the tick mark labels and understand them. Were we not able to make this assumption, graphical communication would be far less useful. If zero can be included on a scale without wasting undue space, then it is reasonable to include it, but never at the expense of resolution.

At first glance this would seem to get the Silicon Alley Insider out of chart jail. But the story does not end there. Cleveland’s book focuses on scientific charts, particularly line and scatterplots (also known as X-Y plots) and there is scarcely a bar or column chart to be found. Furthermore, in his paper “Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging” (The American Statistician, November 1984), he makes the following observations:

The bar of a bar chart has two aspects that can be used to visually decode quantitative information—size (length and area) and the relative position of the end of the bar along the common scale. The changing sizes of the bars is an important and imposing visual factor; thus it is important that size encode something meaningful. The sizes of bars encode the magnitudes of deviations from the baseline. If the deviations have no important interpretation, the changing sizes are wasted energy and even have the potential to mislead (Schmid 1983).

Cleveland’s solution to showing data variation without having bar lengths deceive was to invent a new type of chart: the “dot plot”. Dot plots, which I have used here on the Stubborn Mule to illustrate statistics on asylum-seekers and universities, use position alone to encode the data. This means that it is much safer to drop zero from the axis. Although rather tricky to produce using Microsoft Excel (I use the R package), they are a good substitute for bar and column charts. This BeyeNetwork article goes into more detail about dot plots, including the use of multi-panel plots, which I will look at in a future post.

So, here is a dot plot version of the newspaper and magazine rankings by reader income.
Papers - dot plot
Median Income of Readers – Dot Plot Version
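To show the idea of encoding by position alone, here is a rough text-only dot plot in Python. It is a sketch, not the R code used for the chart above: the function name, the titles and the values are all hypothetical. Each guide line runs the full width of the plot, so its length carries no information and the axis is free to start at 100.

```python
def text_dot_plot(data, axis_min, axis_max, width=41):
    """Return one text row per label, with a single dot placed by value."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must exceed axis_min")
    label_width = max(len(label) for label in data)
    rows = []
    # Largest value first, as in a ranked dot plot.
    for label, value in sorted(data.items(), key=lambda kv: kv[1], reverse=True):
        if not axis_min <= value <= axis_max:
            raise ValueError(f"{label} is outside the axis range")
        # Column of the dot: position is the only thing that encodes the value.
        col = round((value - axis_min) / (axis_max - axis_min) * (width - 1))
        line = ["."] * width  # full-width guide line, same length for every row
        line[col] = "o"
        rows.append(f"{label:<{label_width}} |{''.join(line)}|")
    return rows


# Hypothetical titles and median incomes (thousands of dollars).
incomes = {"Title A": 190, "Title B": 160, "Title C": 120}
for row in text_dot_plot(incomes, axis_min=100, axis_max=200):
    print(row)
```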

Now you know to be vigilant against the deceptive use of axis scales.

Mahalo 3.0: my new mechanical Turk

Almost a year ago I posted about using twitter as my very own mechanical Turk. Here’s part of what I wrote back then:

The original mechanical Turk was an 18th century machine that purported to be able to play chess. It was, however, a hoax as a human hidden inside the machine was actually doing the thinking. The term has had a new lease of life online to refer to the practice of crowdsourcing, which involves harnessing the power of large numbers of networked humans. Now that I have over 850 followers (a very modest count by twitter standards) I have begun to sense the crowdsourcing power of twitter. If I post a question to my followers (aka my “tweeps”), the responses are impressive.

Since I wrote that, twitter has evolved. An enormous range of applications has emerged for accessing twitter, and twitter itself has been adding new features. One consequence is that many people use “lists” or “groups” to view only a subset of their twitter followers. So, even if you have a large number of followers, not as many people are likely to see your tweets anymore. As a result it is becoming harder to use twitter to answer arbitrary questions, unless you are something of a celebrity (whether in real life, or just on twitter).

Mahalo Logo

This is not a criticism of twitter: as it evolves, it is becoming a better, richer communication tool. It just means I have to look elsewhere for my mechanical Turk services and I may just have found the answer in the latest incarnation of Mahalo. The creation of iconoclastic serial entrepreneur Jason Calacanis, Mahalo began life in May 2007 as a “human-powered search engine”. Aiming to offer an alternative to algorithmic search engines such as Google, Mahalo used people to assemble information on a wide range of popular search topics.

Then in December 2008, Mahalo Answers was launched. This service closely resembles Yahoo Answers and the short-lived Google Answers and allows users to post questions online in the hope that other users will provide useful answers. With an eye to the power of financial incentives, Mahalo Answers allows you to pay a “tip” for the best answer to a question. All payments are made in “Mahalo dollars”, which can be bought via the online payments site PayPal for one US dollar and redeemed at an exchange rate of $0.75 (the $0.25 difference representing one avenue for Mahalo to monetise the business). Over time, posing and answering questions earns you points and martial arts-style “belts” which provide greater access to Mahalo features.

While I have tinkered with Mahalo in the past, the recent launch of the revamped “Mahalo 3.0” prompted me to come back for a closer look. Mahalo Answers now has top billing, prompting users to “ask any question, any time”. The emphasis on “human-powered search” has shifted. The content is still there, but under headings such as Mahalo “How Tos”.

To test Mahalo Answers, I posed a question about gold prices. For some time now I have been meaning to follow up a comment on my property prices post, which suggested looking at house prices relative to the price of gold. To do this I need a decent amount of historical gold price data. I was very impressed to have a response within 24 hours pointing to the Deutsche Bundesbank, which has monthly gold prices going back to the 1950s. Now I have no excuse not to do the house price analysis.

So while twitter remains my social networking tool of choice, Mahalo Answers is looking like a very promising source of information when Google searches draw a blank. I will continue to experiment with it and as I do you can keep track of the questions I answer.

Hot and Dry Days Ahead for Australia

Earlier this month, the Australian Bureau of Meteorology released the October figure for the Southern Oscillation Index (SOI). It showed a precipitous plunge of almost 20 points down to -14.6. Just how significant a drop this is can be seen in the chart below, which shows the distribution of monthly changes in the SOI going back to 1876 (-14.6 is at the lower 5% quantile, which means that a fall as big as this, or bigger, has only occurred 5% of the time).

SOI histogram

Distribution of SOI changes (Jan 1876-Oct 2009)
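The quantile reading can be reproduced from any series of monthly index values: take the month-to-month differences, then count the share of them at or below a given fall. The sketch below uses a short made-up series rather than the Bureau’s data, and both function names are hypothetical.

```python
def monthly_changes(series):
    """Differences between consecutive monthly values."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def share_at_or_below(changes, threshold):
    """Fraction of changes that are as low as the threshold, or lower."""
    return sum(1 for change in changes if change <= threshold) / len(changes)


# Made-up monthly index values (not real SOI data).
index = [3.0, -2.0, 5.0, 4.0, -14.0, -10.0]

changes = monthly_changes(index)          # [-5.0, 7.0, -1.0, -18.0, 4.0]
share = share_at_or_below(changes, -18.0)  # 0.2: one change in five is this low
print(changes, share)
```

Applied to the full monthly history, a share of about 0.05 for a given fall would correspond to that fall sitting at the lower 5% quantile.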

But what exactly is the SOI and what is the significance of this decline in the index? The index is the standardised anomaly of the monthly average difference in sea-level air pressure between Tahiti and Darwin. “Standardised anomaly” means that the index measures the deviation of this pressure difference from its long-term average, divided by the standard deviation of the pressure difference and then multiplied by 10. The significance of the index lies in its relationship to the El Niño weather phenomenon. According to the Bureau of Meteorology:

Sustained negative values of the SOI often indicate El Niño episodes. These negative values are usually accompanied by sustained warming of the central and eastern tropical Pacific Ocean, a decrease in the strength of the Pacific Trade Winds, and a reduction in rainfall over eastern and northern Australia. The most recent strong El Niño was in 1997/98, although its effect on Australia was rather limited. Severe droughts resulted from the weak to moderate El Niño events of 2002/03 and 2006/07.
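The “standardised anomaly” calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Bureau’s procedure: the reference pressure differences are invented, the function name is hypothetical, and the choice of the population standard deviation (`pstdev`) and of the reference period are assumptions.

```python
from statistics import mean, pstdev


def soi(pressure_diff, reference_diffs):
    """Standardised anomaly of a Tahiti-minus-Darwin pressure difference, times 10."""
    long_term_average = mean(reference_diffs)
    spread = pstdev(reference_diffs)  # assumption: population standard deviation
    if spread == 0:
        raise ValueError("reference differences must not all be equal")
    return 10 * (pressure_diff - long_term_average) / spread


# Invented reference pressure differences in hPa (illustrative only).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]

print(soi(3.0, reference))  # a month at the long-term average gives an index of zero
print(soi(1.0, reference))  # a month with a below-average difference gives a negative index
```

Because the result is scaled by the standard deviation and multiplied by 10, an index of -10 corresponds to a pressure difference one standard deviation below the long-term average.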

The chart below gives a historical perspective of the SOI over the last ten years. To get a better sense of the trends in the index, I have overlaid two different types of curve smoothing: a lowess (“locally-weighted scatterplot smoothing”) curve and a spline curve. The two give very similar results and make the 2002/03 and 2006/07 SOI downturns clearly visible. The timing of these downturns suggests that the corresponding droughts follow with something of a lag.

SOI 10 year history

Southern Oscillation Index (Jan 2000-Oct 2009)
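For readers curious about what a lowess curve does, here is a simplified pure-Python sketch. It is not the R code behind the chart. It fits a weighted straight line around each point using tricube weights, and it leaves out the robustness iterations that the full lowess procedure uses to down-weight outliers. The function name, the `frac` parameter and the sample values are assumptions for illustration, and the spline curve is not covered.

```python
def lowess(xs, ys, frac=0.5):
    """Single-pass locally weighted linear fit with tricube weights."""
    n = len(xs)
    k = max(2, min(n, round(frac * n)))  # number of points in each local fit
    smoothed = []
    for x0 in xs:
        # Indices of the k points nearest to x0.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        # Bandwidth: distance to the farthest of those neighbours.
        h = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube weights: 1 at x0, falling to 0 at distance h.
        w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
        # Weighted least-squares line through the neighbourhood, evaluated at x0.
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Made-up monthly index values (not real SOI data).
months = list(range(12))
index = [5.0, 8.0, 2.0, -1.0, 3.0, -6.0, -9.0, -4.0, -12.0, -8.0, -15.0, -11.0]
trend = lowess(months, index, frac=0.5)
print([round(value, 1) for value in trend])
```

A larger `frac` uses more neighbours in each local fit and gives a smoother curve; a smaller one follows the monthly values more closely.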

Over the last couple of years, the SOI has been solidly in positive territory and, again with a lag, there has followed an improvement in drought conditions. Indeed, New South Wales recently replaced the tight water restrictions which had been in place for a number of years with the less onerous “Water Wise” rules. Unfortunately, this change may turn out to have been premature. If the downward trend in the index seen over the last few months persists, Australia may face a return to severe drought conditions.

For anyone who is interested in how these charts were created, here is the R code. It is also available from the Stubborn Mule files section.

UPDATE: at the request of singingfish, here is a chart showing the full recorded history of the SOI back to 1876. The blue line is a spline smoothed curve.

SOI - Full History

Southern Oscillation Index (1876-2009)

Melbourne Cup by Numbers

I don’t know anything about horses. Ever since I was bitten by one at the Easter Show as a small child, they have ranked very low on my animal preference list: only just above geese. Still, at this time of year almost everyone in Australia gets caught up in some way with the Melbourne Cup, the race that stops the nation. As usual, I expect that my involvement will only stretch as far as participating in the $2 sweep at the office, thereby avoiding the need to actually make any kind of ignorance-based horse selection.

But, with a mule as this blog’s mascot, you would think that I could do better than that, so I have thrown my charting skills at the history of previous winners in as slap-dash a manner as I can to see where playing the historical odds would get me.

I thought I would start with colour because, even as a racing-form novice, I suspect that should have no bearing on the result and so it seems like a safe place to start. Based on the winners going back to 1861, here is the distribution of winning colours.

Cup Colour Histogram (II)

Cup Winners by Colour (1861-2008)

This suggests that bay-coloured horses have an edge (or maybe it is just a common colour for a horse). What about the “sex” of horse (please do not ask me what all of these terms mean, but apparently “horse” is a sex)?
