Chapter 6 Importing Data

6.1 Placeholder

Importing data in R can be really easy. To import data in R, you usually use a function with the general form read.___() and that’s it! To make the dataset accessible in R you have to assign it a name e.g. 

NAME <- read.___()

The object ‘NAME’ is a type of data.frame. As mentioned in earlier, it is essentially a large table, so R expects a “complete” table as the input. If the input is not a “complete” table, R will do it’s best to fill in the blanks - maybe not in the way you expect!

By “complete” table, we mean;

  • Each column is of the same length
  • Each column is has the same type of value in each object.
  • Usually the first row is are the column names
  • Contain no blank rows in the middle of the dataset
  • have no special formatting

Depending on your data, one of the best tools to check and edit your data is Excel. Another thing to note is an Excel file should only have one rectangle of data per sheet.

Here is an example of a poorly formatted excel sheet

and here is the correted version

6.2 Introduction to UKIreland dataset

The UKIreland data contains data from 2000 to 2016 with the gdp, population, and unemployment rate of the UK and Ireland. (FAOSTAT 2018)

6.3 Reading in data from CSV

CSV format is the most common formats to save data. See notes in the slides about good practice for data structures before reading anything in.

If there are problems in your data before you read it into R, there will be problems with your analysis in R. Make sure you check your data when it comes in - solving the problems in your raw data files is usually easier than solving the problems using R, and will also prevent others using the data from running into the same problems in the future.

UKIreland<-read.csv("path/to/file")

6.4 Reading in from Excel

Reading from excel is also easy; but needs an extra package to be installed. There are a lot of different options to do this! My preferred choice is to use the openxlsx library.

library(openxlsx)

UKIreland2<-read.xlsx("path/to/UKIreland.xlsx")

UKIreland2

Note that the excel version of the file is different; there is one sheet for each country.

The read.xlsx function automatically only reads in the first sheet. So this line above has only read in the data for Ireland, from the first tab in the Excel file. However, if we have named sheets in our Excel file then reading in the sheets we want is easy:

UK<-read.xlsx("path/to/UKIreland.xlsx","UK")
Ireland<-read.xlsx("path/to/UKIreland.xlsx","Ireland")

We can use filter() to produce subsets of our data - e.g. to find the maximum unemployment rate:

filter(UKIreland,unemployment==max(unemployment))
##   country date      gdp population unemployment capital
## 1 Ireland 2012 49177.44    4586897       14.725  Dublin

Remember the double == when making logical statements

Using pipes and group_by() lets us combine multiple tasks together easily - e.g. finding the maximum unemployment rate within each country

UKIreland %>%
  group_by(country) %>%
    filter(unemployment==max(unemployment))
## # A tibble: 2 x 6
## # Groups:   country [2]
##   country         date    gdp population unemployment capital
##   <fct>          <int>  <dbl>      <int>        <dbl> <fct>  
## 1 Ireland         2012 49177.    4586897        14.7  Dublin 
## 2 United Kingdom  2011 41412.   63258918         8.09 London

Or we could use summarise() to calculate average unemployment rates per country:

UKIreland %>%
  group_by(country) %>%
    summarise(average_unemployment=mean(unemployment))
## # A tibble: 2 x 2
##   country        average_unemployment
##   <fct>                         <dbl>
## 1 Ireland                        8.18
## 2 United Kingdom                 6.02

Note this time - only a single = sign in summarise because we are assigning a value (set average_unemployment to be equal to mean of unemployment). Earlier we wanted R to check if two things were equal (when is unemployment equal to maximum unemployment), so we used ==.

mutate() is a nice function to help us create new variables. E.g. the total number of unemployed people

mutate(UKIreland,TotalUnemployed=unemployment*population/100)
country date gdp population unemployment capital TotalUnemployed
Ireland 2000 26241.51 3805174 4.266667 Dublin 162354.1
Ireland 2001 28227.28 3866243 3.925000 Dublin 151750.0
Ireland 2002 32539.95 3931947 4.491667 Dublin 176610.0
Ireland 2003 41107.03 3996521 4.616667 Dublin 184506.1
Ireland 2004 47630.93 4070262 4.491667 Dublin 182822.6
Ireland 2005 50878.64 4159914 4.400000 Dublin 183036.2
Ireland 2006 54306.91 4273591 4.525000 Dublin 193380.0
Ireland 2007 61359.64 4398942 4.691667 Dublin 206383.7
Ireland 2008 61257.90 4489544 6.433333 Dublin 288827.3
Ireland 2009 52104.04 4535375 12.050000 Dublin 546512.7

Note that this is obviously a stupid number from a methodological point of view since unemployment rates are based on “eligible workforce population” rather than total population. But I’m not an economist so let’s go with that for now.

References

FAOSTAT. 2018. “Live Stock.” http://www.fao.org/faostat/en/#data/QA.