Deutsche Welle Global Media Forum 2018

Data mining for journalism: a hands-on approach

By Caroline Paul Kanjookaran

Working with data can be a daunting task for even the most experienced journalists, mostly because they face a pressing question: where will I get the data? Orange has prepared a hands-on kit, with input from Sophie Rotgeri and Kira Schacht of Journocode UG. It is meant to be a handy guide for anyone in the field of data journalism and includes tips, tricks, and more.

Sources of data

1. Leaks. Leaks are a major source of information, but they are often unreliable. Journalists should therefore take them with a pinch of salt and treat them like any other information, researching the data they receive thoroughly and meticulously.

2. Scraping. Simply put, scraping means extracting relevant data from websites: one downloads the source code of the page being scraped and pulls out the information needed. This is ideal for beginners, since small, simple scrapes are easy to carry out. However, websites can be problematic to scrape, and any technical precautions a site has put in place should not be circumvented; they are there for a reason. The data may also be protected by copyright, so it is important to know who owns it and whether the ownership is public or private.

3. Freedom of Information Act. Under the FoIA, journalists can access data from public bodies for use in the public interest. Several laws grant access to data, and it is worthwhile to know them. However, this route can be long and expensive, and sometimes it is virtually impossible to get the data. One way around the problem is to reach out to the people who actually compile the data, such as scientists and researchers, since they might be happy to help.

4. Open data. This is one of the best options for data mining. Governmental bodies commonly publish data online where it is easily available, and it is worth reaching out to statistical offices when looking for specific data. While this is generally an easy way to acquire data, the data often needs to be cleaned before it can be analyzed or visualized, which can make the process tedious and time-consuming. Sometimes the data is also not published in a machine-readable form, leading to further complications.

5. Seeing the obvious. One of the easiest, yet often overlooked, ways of acquiring data is to ask a simple question: ‘Who would know that?’ For example, scientists, researchers or even NGOs working in the relevant field might be good sources of data.
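
To make the scraping step in point 2 more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not a method from the workshop: the page, the robots.txt rules, the URLs and the names TableScraper and allowed_to_scrape are all made up for this example. The sketch first checks whether a site's robots.txt permits fetching a page, then pulls the cells out of an HTML table that has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell texts
        self._row = None        # the row currently being read, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def allowed_to_scrape(robots_txt, url, agent="*"):
    """Check a robots.txt text to see whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Made-up inputs standing in for a downloaded page and its robots.txt.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
SAMPLE_PAGE = """
<table>
  <tr><th>Town</th><th>Residents</th></tr>
  <tr><td>Town A</td><td>1200</td></tr>
</table>
"""

public_ok = allowed_to_scrape(SAMPLE_ROBOTS, "https://example.org/stats.html")
private_ok = allowed_to_scrape(SAMPLE_ROBOTS, "https://example.org/private/data.html")

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
print(public_ok, private_ok, scraper.rows)
```

In a real scrape the page text would come from a download rather than a string, and the robots.txt check is only one of the precautions a site may have in place; copyright and ownership still need to be checked by hand.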

Working with data

Once you have found data using one or more of the methods above, it is important to analyze it. Asking the questions below will help you avoid pitfalls:

1. Who collected the data? Is it possible that there is a conflict of interest? This question can come in handy when analyzing data received from NGOs or organizations that have a vested interest in a particular cause.

2. How was the data collected? Is it possible that there are gaps or biases? For example, crime statistics only show crimes that were actually reported, and it is likely that not all crimes were reported.

3. Which questions can or cannot be answered through this data? It needs to be clear how far the data can be used and where its limits lie.

4. Is there a second source? It is important to double-check the data acquired whenever possible to ensure credibility and accountability.
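
The second-source check in point 4 can be partly automated once both datasets are in hand. The sketch below is one possible approach rather than a prescribed one: the function name compare_sources, the 5 percent tolerance and the sample figures are assumptions chosen for illustration. It flags entries that appear in only one source or whose values differ by more than the tolerance.

```python
def compare_sources(primary, secondary, tolerance=0.05):
    """Flag keys that are missing from one source or whose values
    differ by more than `tolerance` (relative to the larger value)."""
    flagged = {}
    for key in sorted(set(primary) | set(secondary)):
        a, b = primary.get(key), secondary.get(key)
        if a is None or b is None:
            flagged[key] = "missing in one source"
        elif abs(a - b) > tolerance * max(abs(a), abs(b)):
            flagged[key] = f"differs: {a} vs {b}"
    return flagged


# Made-up figures for three regions from two hypothetical sources.
source_one = {"Region A": 100, "Region B": 200, "Region C": 50}
source_two = {"Region A": 102, "Region B": 260}

flagged = compare_sources(source_one, source_two)
for region, problem in flagged.items():
    print(region, "->", problem)
```

A flagged entry is not proof that either source is wrong; it only marks where a follow-up question to the people who collected the data is worthwhile.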

