Data Tidying · Data Science with R (2024)

“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham

Data science, at its heart, is a computer programming exercise. Data scientists use computers to store, transform, visualize, and model their data. As a result, every data science project begins with the same task: you must prepare your data to use it with a computer. In the wild, data sets come in many different formats, but each computer program expects your data to be organized in a predetermined way, which may vary from program to program.

In this book, we will use R to do data science. R is an excellent language for data science because with R you can do everything from collecting your data (from the web or a database) to transforming it, visualizing it, exploring it, modeling it, and running statistical tests on it. You can also use R to report your results when you are finished, and you can run R interactively, as if you were operating a calculator and not writing computer code. Best of all, R is free.

In this chapter, you will learn the best way to organize your data for R, a task that I call data tidying. This may seem like an odd place to start, but tidying data is the most fruitful skill you can learn as a data scientist. It will save you hours of time and make your data much easier to visualize, manipulate, and model with R.

Note that this chapter explains how to change the format, or layout, of tabular data. To learn how to use different file formats with R, see Appendix B: Data Sources.

Outline

In Section 2.1, you will learn how the features of R determine the best way to lay out your data. This section introduces “tidy data,” a way to organize your data that works particularly well with R.

Section 2.2 teaches the basic method for making untidy data tidy. In this section, you will learn how to reorganize the values in your data set with the spread() and gather() functions of the tidyr package.

Section 2.3 explains how to split apart and combine values in your data set to make them easier to access with R.

Section 2.4 concludes the chapter, combining everything you’ve learned about tidyr to tidy a real data set on tuberculosis epidemiology collected by the World Health Organization.

Prerequisites

You will need to have R installed on your computer to run the code in this chapter, as well as the RStudio IDE, a free program that makes it easier to use R. You can learn how to install both in Appendix A: Getting Started.

You will also need to install the tidyr, devtools, and DSR packages. To install tidyr and devtools, open RStudio and run the command

install.packages(c("tidyr", "devtools"))

DSR is a collection of data sets that I have assembled for this book and saved online as a GitHub repository (github.com/garrettgman/DSR). To install DSR, run the command

devtools::install_github("garrettgman/DSR")

2.1 Tidy data

You can organize tabular data in many ways. For example, the data sets below show the same data organized in four different ways. Each data set shows the same values of four variables, country, year, population, and cases, but each data set organizes the values into a different layout. You can access the data sets in the DSR package.

library(DSR)

# Data set one
table1
## Source: local data frame [6 x 4]
##
##       country year  cases population
## 1 Afghanistan 1999    745   19987071
## 2 Afghanistan 2000   2666   20595360
## 3      Brazil 1999  37737  172006362
## 4      Brazil 2000  80488  174504898
## 5       China 1999 212258 1272915272
## 6       China 2000 213766 1280428583

# Data set two
table2
## Source: local data frame [12 x 4]
##
##        country year        key      value
## 1  Afghanistan 1999      cases        745
## 2  Afghanistan 1999 population   19987071
## 3  Afghanistan 2000      cases       2666
## 4  Afghanistan 2000 population   20595360
## 5       Brazil 1999      cases      37737
## 6       Brazil 1999 population  172006362
## 7       Brazil 2000      cases      80488
## 8       Brazil 2000 population  174504898
## 9        China 1999      cases     212258
## 10       China 1999 population 1272915272
## 11       China 2000      cases     213766
## 12       China 2000 population 1280428583

# Data set three
table3
## Source: local data frame [6 x 3]
##
##       country year              rate
## 1 Afghanistan 1999      745/19987071
## 2 Afghanistan 2000     2666/20595360
## 3      Brazil 1999   37737/172006362
## 4      Brazil 2000   80488/174504898
## 5       China 1999 212258/1272915272
## 6       China 2000 213766/1280428583

The last data set is a collection of two tables.

# Data set four
table4  # cases
## Source: local data frame [3 x 3]
##
##       country   1999   2000
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766

table5  # population
## Source: local data frame [3 x 3]
##
##       country       1999       2000
## 1 Afghanistan   19987071   20595360
## 2      Brazil  172006362  174504898
## 3       China 1272915272 1280428583

You might think that these data sets are interchangeable since they display the same information, but one data set will be much easier to work with in R than the others.

Why should that be?

R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules:

Each variable in the data set is placed in its own column
Each observation is placed in its own row
Each value is placed in its own cell

Data that satisfies these rules is known as tidy data. Notice that table1 is tidy data.

In table1, each variable is placed in its own column, each observation in its own row, and each value in its own cell.

Tidy data builds on a premise of data science that data sets contain both values and relationships. Tidy data displays the relationships in a data set as consistently as it displays the values in a data set.

At this point, you might think that tidy data is so obvious that it is trivial. Surely, most data sets come in a tidy format, right? Wrong. In practice, raw data is rarely tidy and is much harder to work with as a result. Section 2.4 provides a realistic example of data collected in the wild.

Tidy data works well with R because R is a vectorized programming language. Data structures in R are built from vectors and R’s operations are optimized to work with vectors. Tidy data takes advantage of both of these traits.

R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list.

A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.
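You can check the "list of vectors" claim for yourself. The sketch below builds a small data frame by hand with the same values as table1 (the name tb is my own, chosen so the example runs without the DSR package) and inspects its structure with base R.

```r
# A hand-built stand-in for table1; `tb` is a hypothetical name.
tb <- data.frame(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year       = c(1999, 2000, 1999, 2000, 1999, 2000),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583),
  stringsAsFactors = FALSE
)

is.list(tb)          # TRUE: a data frame is a list
length(tb)           # 4: one list element per column
is.atomic(tb$cases)  # TRUE: each element is an atomic vector

# `$` and `[[` both pull out the same column vector.
identical(tb$cases, tb[["cases"]])
```

Because each column is an ordinary vector, anything you can do to a vector in R you can do to a variable in a tidy data frame.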

Tidy data arranges values so that the relationships in the data parallel the structure of the data frame. Recall that each data set is a collection of values associated with a variable and an observation. In tidy data, each variable is assigned to its own column, i.e., its own vector in the data frame. As a result, you can easily extract the values of a variable in a tidy data set with R’s list syntax,

table1$cases
## [1]    745   2666  37737  80488 212258 213766

R will return the values as an atomic vector, one of the most versatile data structures in R. Many functions in R are written to take atomic vectors as input, as are R’s mathematical operators. This adds up to an easy user experience; you can extract and manipulate the values of variables in tidy data with concise, simple code, e.g.,

mean(table1$cases)
## [1] 91276.67
table1$cases / table1$population * 10000
## [1] 0.372741 1.294466 2.193930 4.612363 1.667495 1.669488

Tidy data also takes advantage of R’s vectorized operations. In R, it is common to supply one or more vectors of values to a function or mathematical operator as input, and to receive a vector of values as output. To create the output, R applies the function in element-wise fashion: R first applies the function (or operation) to the first elements of each vector involved. Then R applies the function (or operation) to the second elements of each vector involved, and so on until R reaches the end of the vectors. If one vector is shorter than the others, R will recycle its values as needed (according to a set of recycling rules).
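Here is a minimal sketch of element-wise execution and recycling in base R. The vectors hold the first three cases and population values from table1; the last line uses made-up numbers purely to make the recycling pattern visible.

```r
cases      <- c(745, 2666, 37737)
population <- c(19987071, 20595360, 172006362)

# Element-wise division: first with first, second with second, and so on.
rate <- cases / population

# The single value 10000 is recycled to match the length of `rate`.
rate_per_10k <- rate * 10000

# Recycling a length-2 vector across a length-4 vector: it is reused as 1, 2, 1, 2.
recycled <- c(10, 20, 30, 40) * c(1, 2)
recycled
## [1] 10 40 30 80
```

Recycling is convenient when the shorter vector has length one, as with 10000 above. With longer vectors it can hide mistakes, so it is worth checking that your vectors have the lengths you expect.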

Data set one

Since table1 is organized in a tidy fashion, you can calculate the rate like this:

# Data set one
table1$cases / table1$population * 10000

Data set two

Data set two intermingles the values of population and cases in the same columns. As a result, you will need to untangle the values whenever you want to work with each variable separately.

You’ll need to perform an extra step to calculate the rate.

# Data set two
# table2 has 12 rows: cases in the odd rows, population in the even rows
case_rows <- c(1, 3, 5, 7, 9, 11)
pop_rows <- c(2, 4, 6, 8, 10, 12)
table2$value[case_rows] / table2$value[pop_rows] * 10000
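Hard-coded row numbers break as soon as the data grows or is reordered. One alternative, sketched below, is to select rows by the contents of the key column. The data frame table2_demo is a hand-built stand-in for the first eight rows of table2, and the approach still assumes that the cases and population rows appear in the same order.

```r
# Hypothetical stand-in for part of table2 (Afghanistan and Brazil only).
table2_demo <- data.frame(
  country = rep(c("Afghanistan", "Brazil"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 2),
  key     = rep(c("cases", "population"), times = 4),
  value   = c(745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504898),
  stringsAsFactors = FALSE
)

# Logical subsetting: keep the values whose key matches.
cases      <- table2_demo$value[table2_demo$key == "cases"]
population <- table2_demo$value[table2_demo$key == "population"]

rate <- cases / population * 10000
rate
```

This is more robust than counting rows by hand, but it is still an extra step that tidy data would make unnecessary.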

Data set three

Data set three combines the values of cases and population into the same cells. It may seem that this would help you calculate the rate, but that is not so. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with “basic” R syntax.

# Data set three
# No basic solution
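For the curious, here is one way to do it that goes beyond the basic syntax. This is only a sketch using base R's strsplit() and sapply(); table3_demo is a hand-built stand-in for two rows of table3, and Section 2.3 shows a tidier route with separate().

```r
# Hypothetical stand-in for two rows of table3.
table3_demo <- data.frame(
  country = c("Afghanistan", "Brazil"),
  year    = c(1999, 1999),
  rate    = c("745/19987071", "37737/172006362"),
  stringsAsFactors = FALSE
)

# Split each string at the slash; the result is a list of character vectors.
parts <- strsplit(table3_demo$rate, "/", fixed = TRUE)

# Take the first and second piece of every element and convert to numbers.
cases      <- as.numeric(sapply(parts, `[`, 1))
population <- as.numeric(sapply(parts, `[`, 2))

rate <- cases / population * 10000
rate
```

Three helper steps are needed before any arithmetic can happen, which is exactly the overhead the untidy layout imposes.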

Data set four

Data set four stores each variable in a different format: as a column, a set of column names, or a field of cells. As a result, you will need to work with each variable differently. This makes code written for data set four hard to generalize. The code that extracts the values of year, names(table4)[-1], cannot be generalized to extract the values of population, c(table5$`1999`, table5$`2000`). Compare this to data set one. With table1, you can use the same code to extract the values of year, table1$year, that you use to extract the values of population. To do so, you only need to change the name of the variable that you will access: table1$population.

The organization of data set four is inefficient in a second way as well. Data set four separates the values of some variables across two separate tables. This is inconvenient because you will need to extract information from two different places whenever you want to work with the data.

After you collect your input, you can calculate the rate.

# Data set four
# Column names that begin with a digit must be wrapped in backticks
cases <- c(table4$`1999`, table4$`2000`)
population <- c(table5$`1999`, table5$`2000`)
cases / population * 10000

Data set one is much easier to work with than data sets two, three, or four. To work with data sets two, three, and four, you need to take extra steps, which makes your code harder to write, harder to understand, and harder to debug.

Keep in mind that this is a trivial calculation with a trivial data set. The energy you must expend to manage a poor layout will increase with the size of your data. Extra steps will accumulate over the course of an analysis and allow errors to creep into your work. You can avoid these difficulties by converting your data into a tidy format at the start of your analysis.

The next sections will show you how to transform untidy data sets into tidy data sets.

Tidy data was popularized by Hadley Wickham, and it serves as the basis for many R packages and functions. You can learn more about tidy data by reading Tidy Data, a paper written by Hadley Wickham and published in the Journal of Statistical Software. Tidy Data is available online at www.jstatsoft.org/v59/i10/paper.

2.2 `spread()` and `gather()`

The tidyr package by Hadley Wickham is designed to help you tidy your data. It contains four functions that alter the layout of tabular data sets, while preserving the values and relationships contained in the data sets.

The two most important functions in tidyr are gather() and spread(). Each relies on the idea of a key value pair.

2.2.1 key value pairs

A key value pair is a simple way to record information. A pair contains two parts: a key that explains what the information describes, and a value that contains the actual information. So for example,

Password: 0123456789

would be a key value pair. 0123456789 is the value, and it is associated with the key Password.

Data values form natural key value pairs. The value is the value of the pair and the variable that the value describes is the key. So for example, you could decompose table1 into a group of key value pairs, but it would cease to be a useful data set because you no longer know which values belong to the same observation.

Country: Afghanistan
Country: Brazil
Country: China
Year: 1999
Year: 2000
Population: 19987071
Population: 20595360
Population: 172006362
Population: 174504898
Population: 1272915272
Population: 1280428583
Cases: 745
Cases: 2666
Cases: 37737
Cases: 80488
Cases: 212258
Cases: 213766

Every cell in a table of data contains one half of a key value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn’t need to be the case for untidy data. Consider table2.

table2
## Source: local data frame [12 x 4]
##
##        country year        key      value
## 1  Afghanistan 1999      cases        745
## 2  Afghanistan 1999 population   19987071
## 3  Afghanistan 2000      cases       2666
## 4  Afghanistan 2000 population   20595360
## 5       Brazil 1999      cases      37737
## 6       Brazil 1999 population  172006362
## 7       Brazil 2000      cases      80488
## 8       Brazil 2000 population  174504898
## 9        China 1999      cases     212258
## 10       China 1999 population 1272915272
## 11       China 2000      cases     213766
## 12       China 2000 population 1280428583

In table2, the key column contains only keys (and not just because the column is labelled key). Conveniently, the value column contains the values associated with those keys.

You can use the spread() function to tidy this layout.

2.2.2 `spread()`

spread() turns a pair of key:value columns into a set of tidy columns. To use spread(), pass it the name of a data frame, then the name of the key column in the data frame, and then the name of the value column. Pass the column names as they are; do not use quotes.

To tidy table2, you would pass spread() the key column and then the value column.

table2
## Source: local data frame [12 x 4]
##
##        country year        key      value
## 1  Afghanistan 1999      cases        745
## 2  Afghanistan 1999 population   19987071
## 3  Afghanistan 2000      cases       2666
## 4  Afghanistan 2000 population   20595360
## 5       Brazil 1999      cases      37737
## 6       Brazil 1999 population  172006362
## 7       Brazil 2000      cases      80488
## 8       Brazil 2000 population  174504898
## 9        China 1999      cases     212258
## 10       China 1999 population 1272915272
## 11       China 2000      cases     213766
## 12       China 2000 population 1280428583

library(tidyr)
spread(table2, key, value)
## Source: local data frame [6 x 4]
##
##       country year  cases population
## 1 Afghanistan 1999    745   19987071
## 2 Afghanistan 2000   2666   20595360
## 3      Brazil 1999  37737  172006362
## 4      Brazil 2000  80488  174504898
## 5       China 1999 212258 1272915272
## 6       China 2000 213766 1280428583

spread() returns a copy of your data set that has had the key and value columns removed. In their place, spread() adds a new column for each unique value of the key column. These unique values will form the column names of the new columns. spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

spread() distributes a pair of key:value columns into a field of cells. The unique values of the key column become the column names of the field of cells.

You can see that spread() maintains each of the relationships expressed in the original data set. The output contains the four original variables, country, year, population, and cases.

And the values of these variables are grouped according to the original observations, but now the layout of these relationships is tidy.

spread() takes three optional arguments in addition to data, key, and value:

fill - If the tidy structure creates combinations of variables that do not exist in the original data set, spread() will place an NA in the resulting cells. NA is R’s missing value symbol. You can change this behaviour by passing fill an alternative value to use.
convert - If a value column contains multiple types of data, its elements will be saved as a single type, usually character strings. As a result, the new columns created by spread() will also contain character strings. If you set convert = TRUE, spread() will run type.convert() on each new column, which will convert strings to doubles (numerics), integers, logicals, complexes, or factors.
drop - The drop argument controls how spread() handles factors in the key column. If you set drop = FALSE, spread() will keep factor levels that do not appear in the key column, filling in the missing combinations with the value of fill.
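To see fill at work, consider a small sketch in which one key is missing for one observation. The data frame long_demo is hand-built for illustration (Brazil has no population row), and 0 is only a placeholder; whether any substitute for NA is sensible depends on your data.

```r
library(tidyr)

# Hypothetical key:value data with one combination missing.
long_demo <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil"),
  year    = c(1999, 1999, 1999),
  key     = c("cases", "population", "cases"),
  value   = c(745, 19987071, 37737),
  stringsAsFactors = FALSE
)

# Default behaviour: the missing combination becomes NA.
wide_na <- spread(long_demo, key, value)

# With fill, the missing combination takes the value you supply.
wide_zero <- spread(long_demo, key, value, fill = 0)

wide_na
wide_zero
```

In most analyses the default NA is the safer choice, because it keeps a missing measurement from being mistaken for a real one.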

2.2.3 `gather()`

gather() does the reverse of spread(). gather() collects a set of column names and places them into a single “key” column. It also collects the cells of those columns and places them into a single value column. You can use gather() to tidy table4.

table4  # cases
## Source: local data frame [3 x 3]
##
##       country   1999   2000
## 1 Afghanistan    745   2666
## 2      Brazil  37737  80488
## 3       China 212258 213766

To use gather(), pass it the name of a data frame to reshape. Then pass gather() a character string to use for the name of the “key” column that it will make, as well as a character string to use as the name of the value column that it will make. Finally, specify which columns gather() should collapse into the key value pair (here with integer notation).

gather(table4, "year", "cases", 2:3)
## Source: local data frame [6 x 3]
##
##       country year  cases
## 1 Afghanistan 1999    745
## 2      Brazil 1999  37737
## 3       China 1999 212258
## 4 Afghanistan 2000   2666
## 5      Brazil 2000  80488
## 6       China 2000 213766

gather() returns a copy of the data frame with the specified columns removed. To this data frame, gather() has added two new columns: a “key” column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. gather() repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original data set. gather() uses the first string that you supplied as the name of the new “key” column, and it uses the second string as the name of the new value column.

I’ve placed “key” in quotation marks because you will usually use gather() to create tidy data. In this case, the “key” column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.

Just like spread(), gather() maintains each of the relationships in the original data set. This time table4 only contained three variables, country, year, and cases. Each of these appears in the output of gather() in a tidy fashion.

gather() also maintains each of the observations in the original data set, organizing them in a tidy fashion.

We can use gather() to tidy table5 in a similar fashion.

table5  # population
## Source: local data frame [3 x 3]
##
##       country       1999       2000
## 1 Afghanistan   19987071   20595360
## 2      Brazil  172006362  174504898
## 3       China 1272915272 1280428583

gather(table5, "year", "population", 2:3)
## Source: local data frame [6 x 3]
##
##       country year population
## 1 Afghanistan 1999   19987071
## 2      Brazil 1999  172006362
## 3       China 1999 1272915272
## 4 Afghanistan 2000   20595360
## 5      Brazil 2000  174504898
## 6       China 2000 1280428583

In this code, I identified the columns to collapse with a series of integers. 2:3 describes the second and third columns of the data frame. You can identify the same columns with each of the commands below.

gather(table5, "year", "population", c(2, 3))
gather(table5, "year", "population", -1)

You can also identify columns by name with the notation introduced by the select function in dplyr; see Section 3.1.
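As a preview, the sketch below selects the same columns three ways: by position, by name, and by excluding the one column to keep. The data frame table4_demo is a hand-built copy of table4 so that the example runs without DSR. Note that column names beginning with a digit need backticks.

```r
library(tidyr)

# Hand-built copy of table4; check.names = FALSE keeps the names "1999" and "2000".
table4_demo <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

by_position  <- gather(table4_demo, "year", "cases", 2:3)
by_name      <- gather(table4_demo, "year", "cases", `1999`, `2000`)
by_exclusion <- gather(table4_demo, "year", "cases", -country)

# All three calls describe the same set of columns.
all.equal(by_position, by_name)
all.equal(by_position, by_exclusion)
```

Selecting by name or by exclusion tends to survive changes in column order better than selecting by position.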

In Section 3.6, you will learn how to combine two data frames, like the tidy versions of table4 and table5, into a single data frame.
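Until then, here is a rough sketch of the idea with base R's merge(). This is my own stand-in and not necessarily the method that Section 3.6 teaches. The two data frames are hand-built excerpts of what gather() returns for table4 and table5.

```r
# Hypothetical excerpts of the tidy versions of table4 and table5.
tidy_cases <- data.frame(
  country = c("Afghanistan", "Brazil", "Afghanistan", "Brazil"),
  year    = c("1999", "1999", "2000", "2000"),
  cases   = c(745, 37737, 2666, 80488),
  stringsAsFactors = FALSE
)
tidy_population <- data.frame(
  country    = c("Afghanistan", "Brazil", "Afghanistan", "Brazil"),
  year       = c("1999", "1999", "2000", "2000"),
  population = c(19987071, 172006362, 20595360, 174504898),
  stringsAsFactors = FALSE
)

# Match rows that share the same country and year.
combined <- merge(tidy_cases, tidy_population, by = c("country", "year"))

# With everything in one tidy data frame, the rate is a one-liner again.
combined$cases / combined$population * 10000
```

The result has the same layout as table1: one column per variable and one row per country and year.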

2.3 `separate()` and `unite()`

spread() and gather() help you reshape the layout of your data to place variables in columns and observations in rows. separate() and unite() help you split and combine cells to place a single, complete value in each cell.

2.3.1 `separate()`

separate() turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears.

[SEPARATE DIAGRAM]

So, for example, we can use separate() to tidy table3, which combines values of cases and population in the same column.

separate(table3, rate, into = c("cases", "population"))
## Source: local data frame [6 x 4]
##
##       country year  cases population
## 1 Afghanistan 1999    745   19987071
## 2 Afghanistan 2000   2666   20595360
## 3      Brazil 1999  37737  172006362
## 4      Brazil 2000  80488  174504898
## 5       China 1999 212258 1272915272
## 6       China 2000 213766 1280428583

To use separate(), pass it the name of a data frame to reshape and the name of a column to separate. Also give separate() an into argument, which should be a vector of character strings to use as new column names. separate() will return a copy of the data frame with the column removed. The previous values of the column will be split across several columns, one for each name in into.

By default, separate() will split values wherever a non-alphanumeric character appears. Non-alphanumeric characters are characters that are neither a number nor a letter. For example, in the code above, separate() split the values of rate at the forward slash characters.

If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate(). For example, we could rewrite the code above as

separate(table3, rate, into = c("cases", "population"), sep = "/")

separate() interprets a character string passed to sep as a regular expression. See Appendix E to learn more about regular expressions in R.

You can also pass an integer or vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 at the far-left of the strings; negative values start at -1 at the far-right of the strings. The length of sep should be one less than the number of names in into. You can use this arrangement to separate the last two digits of each year.

separate(table3, year, into = c("century", "year"), sep = 2)
## Source: local data frame [6 x 4]
##
##       country century year              rate
## 1 Afghanistan      19   99      745/19987071
## 2 Afghanistan      20   00     2666/20595360
## 3      Brazil      19   99   37737/172006362
## 4      Brazil      20   00   80488/174504898
## 5       China      19   99 212258/1272915272
## 6       China      20   00 213766/1280428583

You can further customize separate() with the remove, convert, and extra arguments:

remove - Set remove = FALSE to retain the column of values that were separated in the final data frame.
convert - By default, separate() will return new columns as character columns. Set convert = TRUE to convert new columns to double (numeric), integer, logical, complex, and factor columns with type.convert().
extra - extra controls what happens when the number of new values in a cell does not match the number of new columns in into. If extra = "error" (the default), separate() will return an error. If extra = "drop", separate() will drop new values and supply NAs as necessary to fill the new columns. If extra = "merge", separate() will split at most length(into) times.
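The sketch below illustrates remove and convert on a hand-built stand-in for two rows of table3 (table3_demo is my own name). I have left extra out because its accepted values may differ between tidyr versions.

```r
library(tidyr)

# Hypothetical stand-in for two rows of table3.
table3_demo <- data.frame(
  country = c("Afghanistan", "Brazil"),
  year    = c(1999, 2000),
  rate    = c("745/19987071", "80488/174504898"),
  stringsAsFactors = FALSE
)

# Default: the new columns are character strings.
as_text <- separate(table3_demo, rate, into = c("cases", "population"), sep = "/")

# convert = TRUE: the new columns become numbers you can do math with.
as_numbers <- separate(table3_demo, rate, into = c("cases", "population"),
                       sep = "/", convert = TRUE)

# remove = FALSE: the original rate column is kept alongside the new ones.
kept <- separate(table3_demo, rate, into = c("cases", "population"),
                 sep = "/", remove = FALSE)

class(as_text$cases)
as_numbers$cases / as_numbers$population * 10000
names(kept)
```

Setting convert = TRUE is usually what you want when the separated pieces are numbers, as they are here.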

2.3.2 `unite()`

unite() does the opposite of separate(): it combines multiple columns into a single column.

[UNITE DESCRIPTION]

We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved in the DSR package as table6.

table6
## Source: local data frame [6 x 4]
##
##       country century year              rate
## 1 Afghanistan      19   99      745/19987071
## 2 Afghanistan      20   00     2666/20595360
## 3      Brazil      19   99   37737/172006362
## 4      Brazil      20   00   80488/174504898
## 5       China      19   99 212258/1272915272
## 6       China      20   00 213766/1280428583

unite(table6, "new", century, year, sep = "")
## Source: local data frame [6 x 3]
##
##       country  new              rate
## 1 Afghanistan 1999      745/19987071
## 2 Afghanistan 2000     2666/20595360
## 3      Brazil 1999   37737/172006362
## 4      Brazil 2000   80488/174504898
## 5       China 1999 212258/1272915272
## 6       China 2000 213766/1280428583

Give unite() the name of the data frame to reshape, the name of the new column to create (as a character string), and the names of the columns to unite. By default, unite() will place an underscore (_) between values from separate columns. If you would like to use a different separator, or no separator at all, pass the separator as a character string to sep.

unite() returns a copy of the data frame that includes the new column, but not the columns used to build the new column. If you would like to retain these columns, add the argument remove = FALSE.

You can also use integers or the syntax of dplyr::select to specify columns to unite in a more concise way. We’ll learn about select in Section 3.1.
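The following sketch compares the default separator, an empty separator, and remove = FALSE. The data frame table6_demo is a hand-built stand-in for two rows of table6 without the rate column.

```r
library(tidyr)

# Hypothetical stand-in for part of table6.
table6_demo <- data.frame(
  country = c("Afghanistan", "Brazil"),
  century = c("19", "20"),
  year    = c("99", "00"),
  stringsAsFactors = FALSE
)

# Default separator: an underscore between the united values.
with_underscore <- unite(table6_demo, "new", century, year)

# sep = "" joins the values with nothing in between.
no_separator <- unite(table6_demo, "new", century, year, sep = "")

# remove = FALSE keeps century and year next to the new column.
kept <- unite(table6_demo, "new", century, year, sep = "", remove = FALSE)

with_underscore$new
## [1] "19_99" "20_00"
no_separator$new
## [1] "1999" "2000"
names(kept)
```

Note that the united column is a character column, so convert it (for example with as.numeric()) if you need to treat the years as numbers.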

2.4 Case Study

The who data set in the DSR package contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available for download at www.who.int/tb/country/data/download/en/. The data provides a wealth of epidemiological information, but it would be difficult to work with the data as it is.

To see the data in its raw form, load DSR with library(DSR), then run

View(who)

A subset of the who data frame displayed with View().

who provides a realistic example of tabular data in the wild. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy.

TIP

The View() function opens a data viewer in the RStudio IDE. Here you can examine the data set, search for values, and filter the display based on logical conditions. Notice that the View() function begins with a capital V.

The most distinctive feature of who is its coding system. Columns five through sixty encode four separate pieces of information in their column names:

The first three letters of each column name denote whether the column contains new or old cases of TB. In this data set, each column contains new cases.
The next two or three letters describe the type of case being counted. We will treat each of these as a separate variable.
- rel stands for cases of relapse
- ep stands for cases of extrapulmonary TB
- sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The next letter describes the sex of TB patients. The data set groups cases by males (m) and females (f).
The remaining numbers describe the age group of TB patients. The data set groups cases into seven age groups:
- 014 stands for patients that are 0 to 14 years old
- 1524 stands for patients that are 15 to 24 years old
- 2534 stands for patients that are 25 to 34 years old
- 3544 stands for patients that are 35 to 44 years old
- 4554 stands for patients that are 45 to 54 years old
- 5564 stands for patients that are 55 to 64 years old
- 65 stands for patients that are 65 years old or older

Notice that the who data set is untidy in multiple ways. First, the data appears to contain values in its column names. We can move the values into their own column with gather(). This will make it easy to separate the values combined in each code.

who <- gather(who, "code", "value", 5:60)

We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.

who <- separate(who, code, c("new", "var", "sexage"))

The second pass will split sexage after the first character to create a sex column and an age column.

who <- separate(who, sexage, c("sex", "age"), sep = 1)

Finally, we can move the rel, ep, sn, and sp keys into their own column names with spread().

who <- spread(who, var, value)

The who data set is now tidy. It is far from sparkling (for example, it contains several redundant columns), but it will now be much easier to work with in R. We will continue to work with this tidy version of who in Section 3.7, where we will remove the redundant columns and calculate new variables.
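If you would like to trace the four steps without loading the full data set, the sketch below runs them on a tiny stand-in. The data frame who_demo and its counts are invented for illustration; only the column-name pattern (new, then the case type, then sex and age group, separated by underscores) follows the description above.

```r
library(tidyr)

# Hypothetical miniature of who: two id columns and four coded columns.
# The counts are made up.
who_demo <- data.frame(
  country     = c("Afghanistan", "Brazil"),
  year        = c(1999, 1999),
  new_sp_m014 = c(8, 301),
  new_sp_f014 = c(5, 372),
  new_ep_m014 = c(NA, 120),
  new_ep_f014 = c(NA, 100),
  stringsAsFactors = FALSE
)

# Step 1: move the codes out of the column names and into a column.
long <- gather(who_demo, "code", "value", 3:6)

# Step 2: split each code at the underscores.
split_code <- separate(long, code, c("new", "var", "sexage"))

# Step 3: split sexage after its first character.
split_sexage <- separate(split_code, sexage, c("sex", "age"), sep = 1)

# Step 4: give each case type its own column.
tidy_demo <- spread(split_sexage, var, value)
tidy_demo
```

The result has one row per country, year, sex, and age group, with the ep and sp counts side by side. The missing values in the original columns carry through as NA.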