Skip to Tutorial Content

Overview

In this second workbook for the module on R for Data Management, we are going to be taking a closer look at how you can reshape your data.

We often describe a dataset as being a rectangle of data BUT there are 2 ways we can orientate this rectangle.

In wide format, every row is a single observation and every column is a variable.

Plot num Height Flowers Pests
1 12cm 3 12
2 15cm 5 7
3 7cm 1 3
4 14cm 4 10

Wide will be the data format you are most familiar with, it is essentially the default shape for analysis. It is easier for us to visually interpret as people, and your variables are nicely separated for analysis.

In long format however, each row is a measure within an observation. So you can have multiple rows for the same plot/household etc.

Plot num Measure Flowers
1 Height 12
1 Flowers 3
1 Pests 12
2 Height 15
2 Flowers 5
2 Pests 7
3 Height 7
3 Flowers 1
3 Pests 3
4 Height 14
4 Flowers 4
4 Pests 10

The use of this type of data may not be immediately clear and is seemingly against those 'tidy data' rules we mentioned in the last workbook as 1 row is NOT 1 observation. However this sort of data can be useful in many scenarios including longitudinal / repeated measures data. Whereby each row would be a specific time for each unit (person A at time 1, person A at time 2, person B at time 1 etc.).

Person Time Height
A 1 43
A 2 48
A 3 52
B 1 45
B 2 48
B 3 56

This could be preferable for analysis compared to wide data as that would have the measurement variable split across multiple columns for each time point.

Person Height.1 Height.2 Height.3
A 43 48 52
B 45 48 56

Long data can also be useful for hierarchical data. Let’s say you have multiple plants within a plot, you could have your data at plot level; therefore, each column is a measure for a particular plant. But if i wanted to compare heights across ALL plants, this becomes a bit difficult because i have all their heights across multiple columns rather than just 1.

Plot Fertiliser Height.1 Height.2
A None 12 15
B Manure 14 18

However, we could reshape this into long format, so your data is now at plant level. The plot level data is retained, it is just repeated so these variables are still available to us for any comparative analysis.

Plot Fertiliser Plant Height
A None 1 12
A None 2 15
B Manure 1 14
B Manure 2 18

It is important to think about what shape your data needs to be in prior to an analysis as different methods will lend themselves to different shapes. Arguably most basic analyses will suit a wide dataset but others such as comparisons across groups in some cases or even plotting may need a long shaped dataset. It's also likely that as you work through your analysis you realise you need to constantly swap between shapes and create new datasets for specific comparisons of even particular plots.

In this workbook, we will look at how you can transform your data between these two formats with R

Packages

The only package that will be used in this workbook is Tidyr. This package was demonstrated in the previous workbook and contains many useful functions to wrangle your data. The focus is keeping your data "tidy".

More on this package can be found here

For a quick introduction on the required functions please watch this short video

Dataset

For the purposes of this demonstration we are going to use two examples of the following wide data set.

In this first example we have recorded the height of a plant in each of our 10 plots at three different time points.

In this extended example, we have recorded both height and the number of pests observed at three different time points.

During this tutorial we will be looking at a few different ways to first convert these wide datasets to long, and then we will convert them convert them back from long to wide.

Wide to Long

Let's start with our first dataset and try transforming this into long dataset rather than a wide data set.

Tidyr supplies a very useful function for this called pivot_longer.

There are 3 key arguments that you will need to supply to this function.

  • cols =, you use this argument to specify which columns you actually want to turn into long format. In our example we have selected the columns between Height_1 and Height_3. Note the use of the colon means everything from X (Height_1) to Y (Height_3), including these columns. So this will select Height_1, Height_2 and Height_3
  • names_to =, this argument is used to specify the name of the column into which those column names will be input. In other words, Height_1, Height_2 and Height_3 will become the values in a new column when you pivot this into a longer format. So this argument is used to give that column a name.
  • values_to =, this is the same as the above but is used for the column in which the values of Height_1, Height_2 and Height_3 will be input.

In this example we have also specified one of the many optional arguments this function can take. You can look at the help page to inspect some of these optional arguments by typing ?pivot_longer into your R console or through this link

Here we have used the argument names_prefix, this tells R if there is some sort of text at the start of the column name that you don't want to be included in the values for that new column. So in our case we only really need the number from the column name and not the "Height_" part. So we can use this argument to remove that from the text, leaving just the number.

Before doing any sort of reshaping of your data it is always important to think about how many rows and how many columns should the resulting transformation contain. So in our case we had 10 rows and 4 columns, 3 of which are being pivoted into longer format. So column wise we should still have our Plot_ID column, and the 2 columns resulting from the pivot. One with the time of measurement and one with the value of that measurement. And we had 3 measurements for 10 observations (rows), with that set to become one measurement per row, that means we should have 30 rows (3*10). So we should have 30 rows and 3 columns when our data is in long format.

It is important to think this through as checking the numbers of rows and columns is a good step in checking your transformation has worked as intended

data_long1 <- data_wide1 %>%
  pivot_longer(cols = Height_1:Height_3,
               names_to = "Height_time",
               values_to = "Value",
               names_prefix = "Height_")

data_long1

Success, we have a dataset now of 30 rows and 3 columns. One row per time point for each plot.

Let's move onto our more complicated example where we have 2 different measures recorded at 3 separate time points.

Let’s do the same thing again but of course extend our column selection to Pests_3. You can see that what we end up with is certainly in long format but the length is defined by time point and measure. Therefore 2 measures * 3 time points * 10 observations means we have 60 rows. You can see how this could get quite extensive if we were to do this for more and more observations/timepoints/measures. We could get some extraordinarily long datasets.

data_long2 <- data_wide2 %>%
  pivot_longer(cols = Height_1:Pests_3,
               names_to = "Time_Measure",
               values_to = "Value")

data_long2

More often than not, this is probably not what you are looking for in your long data. This is a bit of a quirk of long data in that you can often choose different options for how to lengthen your data. In this example we could choose to define our length based on the measurement, as we have done in this first example, but as our data is longitudinal we could also choose to define this on time point. In other words we have a sort of quasi-multilevel approach as we have measures within timepoints within plots. In cases such as these, it is likely that you would prefer to define your data based on time points while keeping your measurements separate.

So the question is, how can we go from the wide data to long based on time point rather than on measurement within timepoints therefore keeping height and pests to their own columns?

Basically rather than 60 rows of 3 columns or 10 rows of 7 columns, what we want is 30 rows of 4 columns. 10 observations * 3 timepoints, with one column for Plot_ID, another for timepoint, one for height and one for pests.

So we sort of have a mid point point between fully wide and fully long.

Thankfully we can create new columns for each type of measure using pivot_longer. Though it means ensuring that you have a consistent naming system.This is because we need to specify some sort of separator or naming pattern so that R can distinguish cleanly between two pieces of information in our column headers.

Now a very important point to reiterate regarding this function is you absolutely need a consistent naming pattern to your columns. In our example we have one as all of our columns containing time sensitive measurements take the format "Measurement_Timepoint". There is a clear separation of the 2 pieces of information.

My recommendation is that when you are dealing with naming columns for wide longitudinal/multilevel data, you make a consistent and clear indication of the variables for the different time points/sub-categories. So if it is a time point make sure every time-sensitive variable does begin with something like "W1_" or end with "_W1". Make a clear distinction between the variable and the time point.

We have done this as we have all our columns looking like "Measurement_Timepont"

Now in order to make sure we get separate columns for height and pests we actually need to start supplying our names_to and values_to arguments.

We keep cols the same, but in names_to we need to do something a little special. You need to include the ".value" string as one of your arguments. In our case we specify this first as in our column names the measure comes first in the naming system. ".value" tells R that part of the column name specifies a measurement and that these should take their own new columns rather than being spliced into one long column. In other words it will see that we have "Height_" and "Pests_" and recognise that these should be separate columns in your output.

We also have to tell R what it is that separates these pieces of information in the column names, in our case it is an underscore. So we supply that in quotations to the names_sep argument.

Note that we do not need to specify the values_to argument as R will name these columns based upon the existing column names. Though we can add it in if we want them to be called something different.

data_long3 <- data_wide2 %>%
  pivot_longer(cols = Height_1:Pests_3,
               names_to = c(".value", "Time"),
               names_sep = "_")

data_long3

The resulting is a dataset of 30 rows where time point has defined our dataset length while we have separate columns for our different measures.

Long to wide

Now that we have reshaped our data from wide to long. Let's do the opposite and get it back to wide again.

As you might expect, the opposite of pivot_longer is pivot_wider.

This has two mandatory arguments which are similarly the opposite of what you find for pivot_longer.

  • names_from =, use this argument to tell R the column, or columns, from which to generate new column names
  • values_from =, we use this argument to tell R which column, or columns, it should be getting the new cell values from.

We also have a couple optional arguments

  • id_cols =, This argument is used to specify the columns that uniquely identify each observation. By default this will use any column in the data which is not specified in names_from or values_from =. So in the previous example from the last page, when we pivoted to wide it would have taken Plot_ID and Time to be the identifying variables.
  • names_prefix = we use this once again, but this time rather than telling R to remove a prefix existing already in our column names. We are using it to tell R to add this prefix on in front of our values that will be generating our new column names. If we had left his argument out we would have got column names that just say 1, 2 and 3.

There are additional optional arguments you may wish to research through the help page here

data_wide1_new <- data_long1 %>%
  pivot_wider(id_cols = Plot_ID,
              names_from = Height_time,
              values_from = Value,
              names_prefix = "Height_")

data_wide1_new

Similarly we can reshape our really long data, the one with 60 rows just using the basic arguments.

data_wide2_new <- data_long2 %>%
  pivot_wider(id_cols = Plot_ID,
              names_from = Time_Measure,
              values_from = Value)

data_wide2_new

Now with our other piece of long data where we had rows defined by time point rather than measure, we need to be a little bit more clever. We can actually pass multiple columns into the values_from argument by wrapping them in c(). For those not familiar with c() this is used often to combine a list of items into a vector. So we create a list of columns by writing in c(Height,pass). This is a very common function in R.

Note that when we do this, the new column names will always be "the value column name _ the label from the names from column". In this case it will generate those original column names of Height_1, Height_2 etc.

data_wide3_new <- data_long3 %>%
  pivot_wider(id_cols = Plot_ID,
              names_from = Time,
              values_from = c(Height,Pests))

data_wide3_new

For more detailed and complicated examples of using the pivot functions, please follow this link

Long vs Wide Data