Chapter 6 Importing Data
6.1 Placeholder
Importing data in R can be really easy. To import data in R, you usually use a function with the general form read.___()
and that’s it! To make the dataset accessible in R you have to assign it a name e.g.
NAME <- read.___()
The object ‘NAME’ is a type of data.frame
. As mentioned in earlier, it is essentially a large table, so R expects a “complete” table as the input. If the input is not a “complete” table, R will do it’s best to fill in the blanks - maybe not in the way you expect!
By “complete” table, we mean;
- Each column is of the same length
- Each column is has the same type of value in each object.
- Usually the first row is are the column names
- Contain no blank rows in the middle of the dataset
- have no special formatting
Depending on your data, one of the best tools to check and edit your data is Excel. Another thing to note is an Excel file should only have one rectangle of data per sheet.
Here is an example of a poorly formatted excel sheet
and here is the correted version
6.2 Introduction to UKIreland dataset
The UKIreland data contains data from 2000 to 2016 with the gdp, population, and unemployment rate of the UK and Ireland. (FAOSTAT 2018)
6.3 Reading in data from CSV
CSV format is the most common formats to save data. See notes in the slides about good practice for data structures before reading anything in.
If there are problems in your data before you read it into R, there will be problems with your analysis in R. Make sure you check your data when it comes in - solving the problems in your raw data files is usually easier than solving the problems using R, and will also prevent others using the data from running into the same problems in the future.
UKIreland<-read.csv("path/to/file")
6.4 Reading in from Excel
Reading from excel is also easy; but needs an extra package to be installed. There are a lot of different options to do this! My preferred choice is to use the openxlsx library.
library(openxlsx)
UKIreland2<-read.xlsx("path/to/UKIreland.xlsx")
UKIreland2
Note that the excel version of the file is different; there is one sheet for each country.
The read.xlsx function automatically only reads in the first sheet. So this line above has only read in the data for Ireland, from the first tab in the Excel file. However, if we have named sheets in our Excel file then reading in the sheets we want is easy:
UK<-read.xlsx("path/to/UKIreland.xlsx","UK")
Ireland<-read.xlsx("path/to/UKIreland.xlsx","Ireland")
We can use filter()
to produce subsets of our data - e.g. to find the maximum unemployment rate:
filter(UKIreland,unemployment==max(unemployment))
## country date gdp population unemployment capital
## 1 Ireland 2012 49177.44 4586897 14.725 Dublin
Remember the double == when making logical statements
Using pipes and group_by() lets us combine multiple tasks together easily - e.g. finding the maximum unemployment rate within each country
UKIreland %>%
group_by(country) %>%
filter(unemployment==max(unemployment))
## # A tibble: 2 x 6
## # Groups: country [2]
## country date gdp population unemployment capital
## <fct> <int> <dbl> <int> <dbl> <fct>
## 1 Ireland 2012 49177. 4586897 14.7 Dublin
## 2 United Kingdom 2011 41412. 63258918 8.09 London
Or we could use summarise() to calculate average unemployment rates per country:
UKIreland %>%
group_by(country) %>%
summarise(average_unemployment=mean(unemployment))
## # A tibble: 2 x 2
## country average_unemployment
## <fct> <dbl>
## 1 Ireland 8.18
## 2 United Kingdom 6.02
Note this time - only a single = sign in summarise because we are assigning a value (set average_unemployment to be equal to mean of unemployment). Earlier we wanted R to check if two things were equal (when is unemployment equal to maximum unemployment), so we used ==
.
mutate() is a nice function to help us create new variables. E.g. the total number of unemployed people
mutate(UKIreland,TotalUnemployed=unemployment*population/100)
country | date | gdp | population | unemployment | capital | TotalUnemployed |
---|---|---|---|---|---|---|
Ireland | 2000 | 26241.51 | 3805174 | 4.266667 | Dublin | 162354.1 |
Ireland | 2001 | 28227.28 | 3866243 | 3.925000 | Dublin | 151750.0 |
Ireland | 2002 | 32539.95 | 3931947 | 4.491667 | Dublin | 176610.0 |
Ireland | 2003 | 41107.03 | 3996521 | 4.616667 | Dublin | 184506.1 |
Ireland | 2004 | 47630.93 | 4070262 | 4.491667 | Dublin | 182822.6 |
Ireland | 2005 | 50878.64 | 4159914 | 4.400000 | Dublin | 183036.2 |
Ireland | 2006 | 54306.91 | 4273591 | 4.525000 | Dublin | 193380.0 |
Ireland | 2007 | 61359.64 | 4398942 | 4.691667 | Dublin | 206383.7 |
Ireland | 2008 | 61257.90 | 4489544 | 6.433333 | Dublin | 288827.3 |
Ireland | 2009 | 52104.04 | 4535375 | 12.050000 | Dublin | 546512.7 |
Note that this is obviously a stupid number from a methodological point of view since unemployment rates are based on “eligible workforce population” rather than total population. But I’m not an economist so let’s go with that for now.
References
FAOSTAT. 2018. “Live Stock.” http://www.fao.org/faostat/en/#data/QA.