Chapter 09: R for Visualization, Statistics & Data Literacy by Brian Walsh

Why R?

Can’t I do the same things using Tableau? In theory, yes. But practically speaking, Tableau is designed to do most of the work for you once your data is cleaned: basic statistical calculations, visualizations, and presentation. While the convenience is nice, it comes at a cost: Tableau Creator, the only Tableau package that comes close to the power of R, costs $70/month.

How much is R? It’s free, cross-platform, and open source. The best textbooks for introductory R programming are also free. And one more thing: if you can harness the power of R for data science, you will be able to work with Tableau in your sleep. It’s the difference between using software and programming: one limits our creativity, while the other requires it.

Also, learning software is not as useful a long-term approach as learning programming. While Tableau is enjoying popularity now, it will at some point be replaced with a new software package, which you’ll then need to learn from scratch.

Applications come and go – remember Flash? What about Director? – but even if a programming language evolves or becomes obsolete, you can still apply all of the programming concepts you’ve learned to other languages.

Speaking of other languages, what about Python? While Python can do nearly anything that R can, it is not a language solely dedicated to data science (you can make websites, applications, and all kinds of non-data-based content with Python). Most data scientists using Python rely heavily on Pandas, a Python package for data manipulation. While Python is extremely powerful and a great data science tool, I’d argue that starting with R is easier, since all it really does is work with data. Once a person knows R, picking up Python is much easier.

Still not convinced? R is cross-platform, and its most popular IDE, RStudio, is also free, cross-platform, and lightweight. Since R is package-based, the size and functionality of your copy of the R language are customizable. Input and output in R are extremely easy. RStudio can also publish and host content, code, and visualizations, and create interactive apps based on your code, all still free. It’s a one-stop shop. Download R at r-project.org, then install RStudio from rstudio.com, and you’re up and running.

There is a wide variety of free, publicly accessible online databases that cover a very broad range of topics, allowing an instructor to introduce varying forms of data (numerical, character, logical, temporal, etc.) or choose a more specific focus. There is also an active, curated repository of R packages called CRAN (Comprehensive R Archive Network) from which you can extend R and make it more powerful (also free).

The most established R developers have not only written free textbooks¹ on how to use their ideas, but they also make packages that replace some of the original, or ‘base’ R programming techniques with cleaner, easier to understand approaches (the Tidyverse, as it’s called). Put another way, the leaders of modern R proselytize for it, too – and learning R now is easier than it ever was.

How to Teach Basic Data Literacy Using R

R can act as an introduction to both core programming concepts and basic data literacy. There is no prerequisite knowledge needed to use the language, and RStudio presents a familiar user interface, which is much less intimidating than a bare code terminal. By starting with data visualization, we can ‘see’ the errors in basic data manipulation and methodology, allowing us to learn the language while also learning best practices and common errors with data. This approach works if you follow five core tenets for teaching R in the classroom:

Teach the Tidyverse Instead of Base R

The ‘Tidyverse,’ a collection of newer R packages that can replace much of the base functionality in R, is much easier to learn than base R. R frequently has conflicts between the different packages loaded into a session, but the Tidyverse packages all work in harmony with each other, making the process of importing, cleaning, filtering, and visualizing data much easier. It sounds like heresy to those already well-versed in R, but instead of teaching base R and then the Tidyverse, or teaching them side by side, start with the Tidyverse. Once it all makes sense, you can expand your students’ view of R’s capabilities (including base R functions not covered by the Tidyverse). Purists may question such an approach, but not only does it allow you to cover more content in a semester, it also allows you to cover more advanced topics. Once a class is versed in the Tidyverse and ‘tidy data,’ for instance, they can use the ‘sf’ package to create geospatial visualizations. Such a topic is far too advanced to cover without relying on the Tidyverse, which grows daily and will probably one day overtake base R anyway.
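For instructors weighing the two styles, a minimal sketch of the same small task written both ways may help. The task (average fuel economy by cylinder count for the heavier cars in the built-in mtcars data) is an illustrative assumption, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Base R: subset the rows, then aggregate mpg by number of cylinders
heavy <- mtcars[mtcars$wt > 3, ]
base_result <- aggregate(mpg ~ cyl, data = heavy, FUN = mean)

# Tidyverse (dplyr): the same steps, read top to bottom as a pipeline
tidy_result <- mtcars %>%
  filter(wt > 3) %>%          # keep cars over 3,000 lbs (wt is in 1000 lbs)
  group_by(cyl) %>%           # one group per cylinder count
  summarise(mpg = mean(mpg))  # average fuel economy per group

print(base_result)
print(tidy_result)
```

Both versions produce the same numbers; the pipeline simply names each step, which tends to be easier for beginners to read aloud.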

Start With Visualization

While R is easy to start using, it does require that you understand the underlying structure of the data you’re working with, especially when creating visualizations. So by starting the class with visualization, you force your class to try to comprehend what is possible in any given data set (and what is not). You also create an environment where students are not afraid to ‘play around’: to flip their axes, or try a different type of visualization. It can help overcome coding trepidation, while simultaneously requiring full engagement in the content.

R comes with a number of easy to understand datasets already built-in for the purpose of practicing. A sample dataset, such as mtcars, can be loaded and visualized in only a handful of lines of code. Here’s an example, with comments added (starting with ‘#’) that indicate what each line of code is doing:

Coding Pedagogy Edited by Jeremy Sarachan

We’re comparing the weight and the fuel efficiency of the cars in the data set, with the transmission type (auto vs. manual) indicated with color.
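A minimal sketch of that comparison, assuming the ggplot2 package is installed, might look like the following. The variable names, labels, and title are illustrative choices, not a fixed recipe.

```r
library(ggplot2)

# mtcars ships with R; its am column is coded 0 = automatic, 1 = manual
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

# Map weight to x, fuel efficiency to y, and transmission type to color
p <- ggplot(cars, aes(x = wt, y = mpg, color = transmission)) +
  geom_point(size = 3) +                 # one point per car
  labs(x = "Weight (1000 lbs)",          # readable axis and legend labels
       y = "Miles per gallon",
       color = "Transmission",
       title = "Weight vs. fuel efficiency in mtcars")

print(p)  # draw the chart
```

Swapping the x and y mappings, or replacing geom_point() with another geom, is the kind of low-stakes experiment students can try right away.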

This approach gives students reluctant to learn code something to latch on to: everyone can understand a basic chart. And if you don’t like cars, there are built-in datasets on flowers, airplane arrivals, earthquakes, and air quality, to name a few. While R is not a purely ‘visual’ language, it has visual components that help make it more accessible to students. By starting with both a basic data set and a basic visualization, we can ‘make’ something on our first day of class.

The Tidyverse and its popularity are largely the work of one man, Hadley Wickham, whose (free) textbook is the definitive source for learning the Tidyverse: http://r4ds.had.co.nz/ (the book starts off by using ggplot2 for visualization). Julia Silge and David Robinson recently published a book on a more targeted use of the Tidyverse for text mining: https://www.tidytextmining.com/.

Use Simple Data to Introduce Complicated Concepts

An example introductory data set is called ‘babynames,’ which is also the work of Hadley Wickham. While the data set – a collection of all names recorded by the Social Security Administration, going back to 1880 – is very easy for students to both comprehend and visualize, it also allows the instructor to introduce more advanced concepts (like variance, since the total number of baby names is increasing over time). Students gain confidence by understanding the basic data and observations, allowing you to add in more complicated concepts selectively.

The simplicity of the babynames dataset also allows for students to make basic methodology errors. A student named ‘Beth’ may be surprised at the relative infrequency of the name in the database, without taking into account the name ‘Elizabeth’ (which would be the given name on a birth certificate for most Americans named ‘Beth’). The relative frequency of ‘Brian’ is distinct from that of ‘Bryan’ or ‘Ryan.’ What methodological approach could best be used when attempting a comparative analysis of name frequency? Consider the code below.
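One possible approach, sketched here under the assumption that the babynames, dplyr, and ggplot2 packages are installed, is to treat spelling variants as a single name and to compare each year’s share of births (the prop column) instead of raw counts, since the number of recorded births changes over time. Which variants belong together is a judgment call, and the grouping below is only an assumption made for illustration.

```r
library(babynames)
library(dplyr)
library(ggplot2)

# Spelling variants to treat as one name; this grouping is an assumption
variants <- c("Brian", "Bryan")

# Each variant on its own, boys only
by_variant <- babynames %>%
  filter(name %in% variants, sex == "M")

# All variants combined: add up each year's share of male births
combined <- by_variant %>%
  group_by(year) %>%
  summarise(prop = sum(prop)) %>%
  mutate(name = "Brian + Bryan")

# Plot the separate spellings and the combined line together
p <- bind_rows(select(by_variant, year, name, prop), combined) %>%
  ggplot(aes(x = year, y = prop, color = name)) +
  geom_line() +
  labs(x = "Year", y = "Share of male births", color = "Name")

print(p)
```

Asking students to defend why ‘Ryan’ is or is not in the variants vector is a quick way to surface the methodological question.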

Of course, the worst methodological errors pop up in the hypotheses of the projects themselves. Every time I start a class off with an assignment using the babynames dataset, at least one student decides to measure the popularity of certain female names in relation to famous Disney characters (e.g., did usage of the name ‘Ariel’ increase in the years after The Little Mermaid was released?).

While this seems a solid project idea on the surface, the Disney universe is too broad to consider analyzing every name. So how do you pick names? Only protagonists? Then what about ‘beauty,’ from Beauty & the Beast – do animals count? Can’t I just analyze the movies I like? Not if you want an objective approach and reproducible results. Names – even just the names of Disney characters – are a methodology pitfall for most students, and hence a great way to get them thinking critically about generating a report that has reproducible results and lacks bias.

Use the Messiness of Public Data to your Advantage

To correct these common errors, it helps to show the class examples of poor methodology and ask them why it’s wrong. A great example is performing geography-based analysis: many U.S. counties have publicly available data and APIs. Users can see things like emergency service calls, household median income, traffic data, and overdose rates, and compare the data on a county-by-county basis. Unfortunately, the data is usually incomplete, poorly documented, filled with strange shorthand codes instead of traditional values, and limited to a short period of time, among other issues.

A common student mistake involves drafting a hypothesis that ignores the incompleteness of the data or overlooks a limited time range. But by far the most frequent methodological error is simply not controlling for population; comparing overdose rates for the most- and least-populous counties in a state will not have convincing results unless you do.

The abundance of messy publicly available data can easily lead to such methodological errors. By requiring reports to be written using the scientific method, students have to explain the rationale for what they include in a project. That’s a great way to get them thinking critically about methodology:

Why does County A have four times the overdose rate of County B?
Did you control for population?
Why does County C record zero overdoses?
Is it really zero, or did they not submit data?
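These questions can be made concrete with a small sketch. The county figures below are invented purely for illustration, and the sketch assumes the dplyr package is installed; it converts raw counts to a rate per 100,000 residents and keeps missing reports distinct from true zeros.

```r
library(dplyr)

# Hypothetical county figures, invented for illustration only
counties <- data.frame(
  county     = c("A", "B", "C"),
  population = c(800000, 200000, 50000),
  overdoses  = c(400, 100, NA)   # NA = no data submitted, which is not the same as zero
)

rates <- counties %>%
  mutate(
    reported      = !is.na(overdoses),                 # did the county submit data?
    rate_per_100k = overdoses / population * 100000    # control for population
  )

print(rates)
```

In this made-up data, County A records four times as many overdoses as County B, yet both work out to 50 per 100,000 residents, and County C shows up as missing instead of as a misleading zero.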

Use the Familiarity of Social Media to Introduce APIs: Use (Hopefully Familiar) Great Works of Literature to Introduce Text and Sentiment Analysis

R packages that allow for the collection of social media data are a great way to introduce APIs, as well as data cleaning principles. Combined with publicly available literature text from gutenberg.org, this is a great way to introduce text analysis and sentiment analysis.

While Facebook allows its users to download most of their content, that content is of mixed format (images, text, video), and that complexity would require a complex project to parse. Twitter is much easier to use, as it’s primarily text, including links and hashtags. It also allows users to download their data, which arrives in a very convenient .csv file for budding data scientists. Analyzing your own use of Twitter is an easy way to understand what the data is saying, how to visualize it, how to correct for errors (such as time zone discrepancies in your posts), and what data fields reveal the most interesting insights.
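The time zone correction can be sketched in a few lines of base R. The column name and timestamps below are hypothetical, and the sketch assumes the export records times in UTC while the author posts from U.S. Eastern time; check both assumptions against your own file.

```r
# Hypothetical timestamps in the style of a downloaded archive (assumed to be UTC)
tweets <- data.frame(
  timestamp = c("2018-03-01 02:15:00", "2018-03-01 13:40:00", "2018-03-02 01:05:00"),
  stringsAsFactors = FALSE
)

# Parse the text as UTC date-times
utc_time <- as.POSIXct(tweets$timestamp, tz = "UTC")

# Hour of day as recorded, and the same instants shown in the author's local zone
tweets$utc_hour   <- as.integer(format(utc_time, format = "%H"))
tweets$local_hour <- as.integer(format(utc_time, tz = "America/New_York", format = "%H"))

print(tweets)
```

Without the conversion, a post written at 9 p.m. local time looks like a 2 a.m. post, which would badly skew any chart of posting habits by hour.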

Twitter also has a very easy-to-use API, with a well-documented R package to access it (rtweet). Having already analyzed their own Twitter feeds, students are familiar with the data and become more engaged with using the API. Unfortunately, Twitter limits API search results to (roughly) the last week of activity, so projects must be limited to trending topics (news stories such as natural disasters are good places to start).

Individual users’ timelines can be downloaded using rtweet, though, and they go back much further in time. So comparing individual users’ use of Twitter over a selected time frame is a good project idea. By combining the assignment with sentiment analysis, students can compare the word frequency and sentiment of competing online voices (political commentators, Fox News Twitter accounts vs. MSNBC Twitter accounts, etc.).
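A minimal sketch of such a comparison follows, assuming the dplyr and tidytext packages are installed. The posts and account names are invented stand-ins for downloaded timelines, and the Bing lexicon bundled with tidytext is only one of several possible sentiment lexicons.

```r
library(dplyr)
library(tidytext)

# Hypothetical posts from two invented accounts, standing in for real timelines
posts <- data.frame(
  account = c("account_a", "account_a", "account_b", "account_b"),
  text = c("What a wonderful win for the team",
           "Great work and a happy crowd",
           "A terrible disaster and an awful response",
           "Bad news and a sad ugly mess"),
  stringsAsFactors = FALSE
)

sentiment_by_account <- posts %>%
  unnest_tokens(word, text) %>%                        # one row per lowercase word
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words found in the lexicon
  count(account, sentiment)                            # tally positive and negative words

print(sentiment_by_account)
```

Words missing from the lexicon are silently dropped by the join, which is itself a methodological limitation worth raising with students.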

While gutenberg.org does not include every book in the public domain, it does allow for access to a huge body of literature. Its limitations (or the limitations of copyright law, depending on how you look at it) present a challenge to inventing a flawless methodology. For instance, while you could download and analyze every single science fiction book on Gutenberg, you of course would not be analyzing all the science fiction ever written. Your methodology in this case is ‘curated’ by the availability of the titles you wish to analyze, and thus your hypothesis is not objectively reproducible. Similar to the Twitter API, the best solution to this limitation is to focus on one author and compare the data from each of their books, or compare a handful of authors to each other. As long as gutenberg.org has all of their primary works, that approach should hold water. Here’s an example of downloading and visualizing a sample data set:
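The sketch below assumes the dplyr, tidytext, and ggplot2 packages are installed. The download itself would typically go through the gutenbergr package and needs a network connection, so it appears only as a comment; a tiny hypothetical table with the same two columns stands in for it so the word-count and plotting steps can be run anywhere.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# With a network connection, the gutenbergr package can fetch books by ID, e.g.
#   books <- gutenbergr::gutenberg_download(c(35, 36))  # two H. G. Wells novels
# The table below is a hypothetical stand-in with the same column names.
books <- data.frame(
  gutenberg_id = c(1, 1, 2),
  text = c("The machine moved through time",
           "Time passed and the machine stopped",
           "The war came from another world"),
  stringsAsFactors = FALSE
)

word_counts <- books %>%
  unnest_tokens(word, text) %>%            # split each line into lowercase words
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the" and "and"
  count(gutenberg_id, word, sort = TRUE)   # word frequency within each book

# One bar chart per book, showing how often each remaining word appears
p <- ggplot(word_counts, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  facet_wrap(~ gutenberg_id, scales = "free_y") +
  labs(x = "Occurrences", y = NULL)

print(p)
```

With real downloads, the same pipeline applies unchanged; only the books table is replaced.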

Conclusions

In case it is not already clear, the primary issue that arises in introductory data science is that of flawed methodology. While students feel that their efforts are going entirely into learning a programming language, they’re really learning how to make a holistic argument that can be applied beyond their limited data – and the primary obstacle to that goal is flawed methodology.

My emphasis on methodology is informed partly by the fact that many people coming to data science are not in degree programs that specialize in the topic – they are astronomers or linguists, health practitioners or government analysts, and they learn and use R out of a necessity to be able to crunch their own data. Unless you’re already a scientist, you’ll almost certainly make some methodological errors.

While some may argue that teaching methodology and data literacy along with learning R is problematic, I would counter that learning methodology outside of performing data analysis may not be retained by students, as it is purely theoretical. It should also be mentioned that this approach to data literacy incorporates financial data, geographic data, continuous (time-based) data, social media data, and text and literature analysis. Students are encouraged to find their own datasets or APIs for projects, as well (the Spotify and Genius APIs are very popular for this). While this may seem disorganized, true data literacy requires comfort with all forms of data – and limiting the data types that students see does them a disservice.

To cite this article:

MLA: Walsh, Brian. “R for Visualization, Statistics & Data Literacy.” Coding Pedagogy, edited by Jeremy Sarachan, 2019, ch. 9, http://codingpedagogy.net. Accessed 1 Apr. 2020. [update access date]

APA: Walsh, B. (2019). R for Visualization, Statistics & Data Literacy. In J. Sarachan (Ed.), Coding Pedagogy, ch. 9. Retrieved from http://codingpedagogy.net.

Chicago: Walsh, Brian. “R for Visualization, Statistics & Data Literacy.” In Coding Pedagogy, ed. Jeremy Sarachan, ch. 9. Coding Pedagogy, 2019. http://codingpedagogy.net.
