Thinking about data visualization

Earlier this week we received the final assignment for the MOOC on infographics and data visualization. Alberto does not spare his students, writing: “This time, I am giving you the freedom to do whatever you want.” My first reaction was slight jubilation: everything is possible. Then we got 7 steps, starting with making a headline and gathering the data, and ending with getting the results back. While commuting to work I saw a small headline in one of the free newspapers: “Well-off people live longer in good health”.

Statistics Netherlands (in Dutch: CBS) collects and processes data “in order to publish statistics to be used in practice”. Their website has a nice series of interactive infographics, and years ago they were among the first to introduce a web-mapping interface to their StatLine website. I have used their data for many of my GIS classes, so it is a very useful source of wonderful data. But let me return to the assignment and its first step: getting a headline.

life expectancy based on data from cbs.nl

A Simple Headline … but Tease the Information

This week someone in my tweet-lists posted a tweet on a webinar: How to Write Headlines for the Web. After watching the webinar I understood that my headline could make or break my great story. A good story without a catchy headline will be read less on the web. The headline should contain big numbers, and they should be easily digestible. Wow… as a non-journalist I find this quite a challenge. And according to Alberto Cairo I need to “Try to find a focus, a headline.” The webinar uses one of my favorite techniques: free association, with the main question in the back of my mind: what is the story about? If I am not working with just a blank sheet of paper, I often use a mind-mapping tool (in my case the fabulous Freemind) for this process.

The general scope given by Statistics Netherlands is: “Men and women from high-income households on average live about 8 and 7 years longer respectively than their counterparts in low-income households.” Far too long for a headline. In Dutch we have a proverb “Riches alone make no man happy” (or “Money isn’t everything”). This is what my associations led to, leaving me with a number of keywords: riches, happiness, long life, income, and 7 and 8 years. What about “How to earn an 8 years longer life?” Is it simple enough? There is another great tool that I love to browse: the Urban Dictionary. Wealthy has many hits: moneybags, ballers. So… “Baller gets 8 years extra”?

Gather the data … Combinations and Context

The next step is to think about the story I want to tell. In this case I will focus on the Dutch data first. My mind wanders on: it would be nice to get data for another country. There must also be some historical data on this subject. The Dutch Economic-Historical Archive (NEHA) has this kind of data, and Statistics Netherlands has data going back to 1899 in its historical series. While we were talking about the subject over dinner, my son came up with a fact from his history class: the average life expectancy of a worker in Manchester during the industrial revolution was very low (an average of 17).

Another idea may be to link the data to lifestyle. There is a lot of data on that subject too. On the other hand it is more difficult, and the context may be a lot harder to give. Statistics Netherlands also mentions good health and good mental health. These may be subjects to include, but not as a main subject for the assignment for now.

So the plan…

What is the story I want to tell? There is enough historical data available. I want to tell the story of the rich, the poor, and the middle class at different moments in time: the turn of the 20th century, the crisis of the 1930s, the post-war period, the late 1970s when many patterns changed, and now the 21st century. This approach will tell many stories. It will tell about prosperity, the working class, history, and the many, many social elements that make a culture.

My story will be about culture and people, based on historical statistics. Now the next step is to think about the form.

This week’s assignment: data manipulation

This week’s assignment in the MOOC on infographics and data visualization by Alberto Cairo is about maps. From his forthcoming book we have to read the fourth chapter on Cartography for Journalists, or as the chapter title reads: Thematic Maps, Statistics and Cartography Meet. Like his earlier book, The Functional Art, this chapter is a well-written piece with many great visuals.1 Alberto Cairo describes thematic maps as “the purest and most successful form of information graphics”, and I certainly do not disagree. And the assignment? That is to use data from the US Bureau of Labor Statistics to show unemployment in the US, in the way The Guardian’s Data Blog did in its story about unemployment in the US: http://www.guardian.co.uk/news/datablog/interactive/2011/sep/08/us-unemployment-obama-jobs-speech-state-map. But with more functionality and depth.

With Tableau, TileMill, or even with ESRI ArcGIS online this task is not that big. The data is easily accessible and well-organized. But this is exactly where standard mapping differs from making infographics. I am quite sure that our teacher does not want the standard map. In the first assignment I created a map and Alberto commented: “However, it doesn’t improve the original as much as it could. The reason is that you are forcing me to click on each country to get the data, rather than giving me the opportunity to explore the data in different ways, such as creating rankings, comparisons between countries, etc.”. So no standard map this time, but something that will focus on the exploration.

I decided to focus on two of the questions that are raised in the assignment:

  • What kind of graphs or maps would you need to tell a compelling story based on this?
  • How would you give context to the data?

If we are talking about unemployment during Obama’s first term, there are already some nice infographics on this subject from the run-up to the elections.

My approach: Geo Tagging

The latest data available is from September 2012, and looking at that data you immediately see the big differences between states. Take Montana, Wyoming, North and South Dakota, and Nebraska. From my time there I know the views of large (or even more than large) plains, the emptiness. The number of people per square kilometer must be very low. On the other hand there is the giant peak in California, and the higher (but not peak) values in the dataset for states along the east coast, for states with big cities like Chicago and Detroit, and for states like Texas and Florida. My first impression of the dataset is that it is not normalized by population.

In order to map the data, and to show the relation between population and unemployment, the data must be geo-tagged. Since the data is ordered by state and the state codes are given, this is not a difficult task. MaxMind offers a nice table of states with their latitude and longitude. By combining this table with the given data set I now at least have point data that is geo-referenced, so each state can be mapped as a point at its centroid.
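The join described above can be sketched in a few lines of Python. This is only a minimal illustration, not the workflow I actually used: the rates are placeholder values rather than Bureau of Labor Statistics figures, the coordinates are rough approximations rather than the MaxMind table, and the function name `geotag` is made up for this example.

```python
# Placeholder rates per state code (NOT real BLS figures).
unemployment = {"CA": 10.0, "ND": 3.0, "RI": 10.5}

# Rough, rounded centroid coordinates (latitude, longitude); illustrative only.
centroids = {
    "CA": (37.0, -120.0),
    "ND": (47.5, -100.5),
    "RI": (41.7, -71.5),
}

def geotag(rates, centroid_table):
    """Join rates to centroids on the state code.

    States without a centroid entry are skipped, so every returned
    record is a mappable point.
    """
    points = []
    for code, rate in sorted(rates.items()):
        if code in centroid_table:
            lat, lon = centroid_table[code]
            points.append({"state": code, "lat": lat, "lon": lon, "rate": rate})
    return points

points = geotag(unemployment, centroids)
for point in points:
    print(point)
```

Each resulting record carries a state code, a coordinate pair, and a rate, which is the shape a point layer in a mapping tool typically expects.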

Then I started to play with Tableau Public. Within the map option of the software there are several settings and preloaded datasets: population, population by race, occupations.

What strikes me is that population density and racial mix both seem to have an effect on the figures. States with many big cities seem to have a higher unemployment rate. So a first step would be to map the data against the population density.

The Census Bureau Data

The United States Census Bureau has a nice dataset for the census of April 2010. Although the census data covers the full population, including people who are too young to work or who are retired, the figures already change with the first quick layout. Rhode Island, which at first was a small dot, now suddenly becomes one of the largest. On the other hand California, which seemed to be the state with an incredibly high unemployment rate, has now become an average player. All in all we see that the differences become smaller when we look at the percentage of the total population that is unemployed.
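The effect of dividing by population can be shown with a small Python sketch. The two states and all figures below are invented purely for illustration (they are not Census or BLS data), and `share_unemployed` is a hypothetical helper name.

```python
# Invented figures for two fictional states (NOT Census or BLS data).
states = [
    {"state": "A", "unemployed": 2_000_000, "population": 37_000_000},
    {"state": "B", "unemployed": 60_000, "population": 1_000_000},
]

def share_unemployed(row):
    """Unemployed people as a percentage of the total population."""
    return 100.0 * row["unemployed"] / row["population"]

# Ranking by raw count puts the big state first ...
by_count = sorted(states, key=lambda row: row["unemployed"], reverse=True)
# ... while ranking by share of the population can reverse the order.
by_share = sorted(states, key=share_unemployed, reverse=True)

print([row["state"] for row in by_count])
print([row["state"] for row in by_share])
```

The same data gives two different rankings depending on whether the raw count or the share of the population is mapped, which is the shift I saw with Rhode Island and California.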

Conclusion and the final result

So “to tell a compelling story” I made an interactive infographic where the map is a main element on the page. From this map you can click on a state and see the unemployment figures for 10 moments in time: at the start of Obama’s first term, and then each year until just before his re-election. The context is that data is open to multiple interpretations, even though everyone knows the facts. By leaving out specific details, data manipulation becomes a term with a double meaning.

ps. Did I tell you that subscription to the second course, starting in January is open? You can subscribe here: Knight Center for Journalism in the Americas’s Distance Learning program.

1 In the first version this paragraph read: “From his book we have to read the fourth chapter on Cartography for Journalists, or as the chapter title reads: Thematic Maps, Statistics and Cartography Meet. The book is well written with many great visuals and the same is true in this chapter.”

Happy GISDay

A week ago (on Thursday) someone tweeted “happy postgisday”, and yes, it was the day after GISDay. This yearly event is, as stated on the GISDay website: “The annual salute to geospatial technology and its power to transform and better our lives”. Looking at the event map published on the website I was amazed by two things: first, the wide spread of events all over the globe; second, that there was no event planned in The Netherlands, although it has a good and well-organized GIS community. And I did not organize any event either.

My thought was: What could I have organized to bring GIS to a wider audience? The following themes come to mind:

The application of GIS in areas where you do not expect it.

Not so long ago, about a decade or so, GIS mainly took place in the drawing room. Networks were no longer designed and maintained on large drawing boards with pen and paper. In the GIS era these drawing boards were replaced by digitizer boards and large monitors, and the blueprints were replaced by bits and bytes. With all this development, answering questions about the assets became easier. Examples of questions you can answer in this context are: What is the current state of our network? What type of asset was most prone to faults over the last period? Which customers should be informed about the upcoming repair work?

As said in an earlier blog post on this subject, the main shift came when navigation systems became more and more of a commodity. Nowadays GIS is no longer limited to the drawing room. We see GIS in many different contexts and industries, in places where you would not expect it. Telling this story may be my first presentation.

Your safety monitored with GIS

The second story is about geo-data and boundaries. In the European context Inspire is becoming more and more mature. Inspire is the initiative that should create an infrastructure to make geo-information and spatial data more accessible. When you cross a border (and in this case I am not even talking about country borders), it may well be that the data you find on the other side of the border is not directly usable. This can cause problems, for example when a river gets polluted and we want to take steps to prevent the pollution from getting into the drinking water supply chain. It is best to have data that can be easily exchanged between different organizations.

Different local governments store their data in different ways, for example because of the GIS software they use. The main result is that if we want a full overview of the available data, we should first create a common language. Not only should we store the spatial data in a common way, it must also be findable across the different borders. So the labels on the data and the datasets, the metadata, must be standardized too. In recent years we have seen a fast growth of so-called geo-portals; in the future these will be the entrance to the European data. They are a wonderful way to show a larger audience how spatial data, and the systems storing and analyzing this data, work together on monitoring safety.

The past analyzed with GIS

A growing theme in historical studies is the application of GIS to study spatio-temporal processes: mapping differences between two or more time periods, and showing where changes occurred. In the last decades I have published a number of these studies, for example on detecting changes in the urban landscape (how a city developed). But there is so much more that can be done on this subject. In the book “Past Time, Past Place” Anne Knowles collected a number of very good examples of how GIS can be applied in history. This book was published in 2002 and since then there has been a lot of new development. For example, GIS has become more accessible and more of a commodity in the historical sciences.

If we apply GIS to history we also come to the subject of storytelling. With the historical datasets that we have available we can tell a story that may have been hidden before. This story can make the past more interactive, however odd this may sound. We can show the development of a town, starting from a little village on a sand ridge, and how, based on the written deeds we find in the archives, the village grew over time. For example we can show the map, and how more and more streets and houses appear. In addition we can attach the deeds on which we base our findings to the different plots on this map.
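As a minimal sketch of this idea in Python: if every plot is linked to the year of the earliest deed that mentions it, a map for any given year is simply a filter on that year. The plot ids, years, and the function name `plots_known_by` are all hypothetical and only serve the illustration.

```python
# Hypothetical plots, each linked to the year of the earliest known deed.
plots = [
    {"plot": "P1", "first_deed": 1420},
    {"plot": "P2", "first_deed": 1485},
    {"plot": "P3", "first_deed": 1560},
]

def plots_known_by(year, plot_table):
    """Return the ids of all plots documented in or before the given year."""
    return [p["plot"] for p in plot_table if p["first_deed"] <= year]

# One "frame" of the growing town per chosen year.
snapshots = {year: plots_known_by(year, plots) for year in (1450, 1500, 1600)}
for year, known in snapshots.items():
    print(year, known)
```

Drawing one snapshot after the other gives the effect of streets and houses appearing over time, with the deed as the source behind each plot.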

Next year… GISDay

Next year on GISDay (Wednesday, November 20, 2013) I would like to show small projects on these three examples, mainly to introduce GIS to a wider audience. In the meantime I will post examples here.

The Day After and the Biblia Pauperum, some light reading

One of the most interesting infographics to make would be one that shows how many new data visualizations have been created over the last few months. In the run-up to the American as well as the Dutch elections these last months, I saw images of the complex data that comes with these elections appear more and more frequently. Is this a good sign? Yes, I believe so!

In medieval times we had the Biblia Pauperum, a blockbook that explained the holy book in the manner of a cartoon. There are some really beautiful early examples of these “designed bibles” in Italy, sometimes in the form of a drapery that would hang over the edge of the pulpit, showing in clear (and big) images what was told from that same pulpit. Later on, mainly in the 15th and 16th centuries, we find some beautiful woodcuts in the Low Countries, probably closer to the newspapers that now publish the infographics. In the last months I was reminded of this quite often when confronted with the schematics on electoral votes, swing states, and statistical discussions on forecasts.

While going through my bookmarks of the last week (in order to label them and get them into the right folder) I have collected a number of political and non-political links that would apply to the label Biblia Pauperum, plus some other things I found worth reading.

Data, Information, Knowledge … and Wisdom

Last week the MOOC on Infographics and Data Visualization at the Knight Center for Journalism in the Americas started, and I am one of the 2000 lucky students. About 8 years ago my former employer dropped a book on my desk. It was mouth-watering, and I finished it in one go. The book dealt with data processing and information visualization in a way that I, as a computer scientist and art historian, could understand so well. The book by Edward Tufte has been a source of inspiration for many lectures and thoughts while working with, in my case, the presentation of geographical data.

One of the main reasons for me to start the course is that I see the importance of data visualization. I am neither a journalist nor a professional designer, yet I want to visualize my data analysis. For example the data about assets in a geo-information system, in the way Stephen Few describes it: meaningfully decoded data where the nature of the data as well as the relationships between the different objects is clear. I want to be able to present this data in a way that is understandable for people outside geo-informatics.

Data information knowledge model

Last week, in his first lecture on information visualization, Alberto Cairo introduced us to the concepts of the data information knowledge model and to how to analyze the ongoing stream of infographics that are produced. One of the references he makes is to a chapter on data visualization by Stephen Few. Few writes: “The goal is to translate abstract information into visual representations that can be easily, efficiently, accurately, and meaningfully decoded.” In the information processing for Geographical Information Systems we often meet the same goals.

One of the future directions that Few mentions in his text is: “The integration of geo-spatial and network displays (such as node and link diagrams) with other forms of display for seamless interaction and simultaneous use.” And that is exactly why 10 years ago that book landed on my desk. I believe that the integration as mentioned above is a very obvious one, but we should be careful. Geo data looks very “sexy”, and we see that many designers of infographics tend to use maps as a background, or when a location is given, map the data to that location. That brings me to one of the questions of Stephen Few: “Is it obvious how people should use the information”.

In last week’s discussion of a map provided by Alberto Cairo, the instructor of the course, the map as a background played an important role in many people’s responses. Information on Internet use for several countries had to be presented. Some of the responses argued that everyone knows where specific countries are on the map, so why try to map a chart to a location? On the level of countries or continents I can understand that argument, but in many cases we work with data for much smaller areas. When it comes to statistics on your assets the map is an excellent carrier of information.

Location data

Probably more than 90% of the data in a geo-database has nothing to do with the map in a direct way. It does not contain X, Y, or Z coordinates by itself, but the data is linked to other data tables that do have the location connection. For example we can have a postal code that will link customer data to a specific area. In this way we can enrich the data. In the last few years a number of companies have shown us wonderful examples of how to do this. The result is that much of the data that is available in information systems can now be linked in one way or another to a specific location.

If you have a shopping card from the local supermarket, data is collected on the products you buy. Different queries can be run on this raw data, for example on price ranges or on the type of products. All this data has no location component: it is products, prices, and quantities. And we can, based on this data, run wonderful statistics. We can add extra value to this data set when we combine these statistics with the postal code of the consumer. Suddenly we start to see patterns, for example when it comes to age categories in a certain area of town.
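A small Python sketch of this enrichment step, under stated assumptions: the purchase records, card ids, postal codes, and the function name `spend_per_postal_code` are all invented for the example, and prices are kept in cents so the totals stay exact.

```python
from collections import defaultdict

# Invented shopping-card records: products, prices, quantities, no location.
purchases = [
    {"card": "c1", "product": "pasta", "price_cents": 120, "quantity": 2},
    {"card": "c2", "product": "olive oil", "price_cents": 500, "quantity": 1},
    {"card": "c1", "product": "flour", "price_cents": 80, "quantity": 1},
]

# Invented lookup table: this is where the location link comes from.
card_postal_code = {"c1": "1011", "c2": "3511"}

def spend_per_postal_code(rows, lookup):
    """Sum the spending per postal code; rows with an unknown card are skipped."""
    totals = defaultdict(int)
    for row in rows:
        code = lookup.get(row["card"])
        if code is None:
            continue
        totals[code] += row["price_cents"] * row["quantity"]
    return dict(totals)

totals = spend_per_postal_code(purchases, card_postal_code)
print(totals)
```

The purchase table itself never changes; the location only enters through the lookup table, which is exactly the kind of indirect link described above.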

The next step, and here I refer back to the future direction, is to change the information that we get from the different queries into knowledge. Visualizations based on the above example may have added value. But this added value can only be achieved when the data is easily and efficiently available, plus easy to read and interpret.

This is exactly where we can learn from the designers that work in the newspaper offices. This is, besides the fun, a reason for me to take the course on infographics and data visualization.

Gadgets and the mattarello

Very recently the “mattarello” was brought to my attention in a blog. Such a fantastic Italian word, and such a great instrument when making your own pasta. I admire people who can handle that stick properly, and when I see it being used I always stay to watch for a while. This YouTube video shows one of these magicians.

The automatic rolling pin

I should admit that although I have tried a couple of times to roll and cut my own pasta using a rolling pin, I have gone for the machine. A couple of years ago I treated myself to a pasta machine from Trebs. Since then I make my own pasta even more regularly. The advertisement says you can make pasta within 15 minutes, and that it only has to cook for 4 minutes. With me it always takes a bit longer to make, but I save this extra time by eating it more “al dente”.

Trebs pasta machine

The recipe that I always use is very simple:

  • 100 g unbleached flour or semolina flour
  • 1 egg
  • 1 tablespoon olive oil
  • a little salt

For the taste semolina flour is better, but since it has a coarser grain it is more difficult to work with. To have just a taste of the semolina in your pasta you can mix the two types of flour. In that case use 1 part of semolina to 4 parts of unbleached flour; for the 100 g in the recipe above that means 20 g of semolina and 80 g of unbleached flour.

Tonight I made tagliatelle with a special pesto. But that recipe I save for next time.