Chapter 5 Data Manipulation with dplyr
The previous chapter introduced ggplot2
which expanded our Base-R vocabulary to help us visualise our data easier.
Now, what if our dataset isn’t organised quite the way we want it to create a plot? Maybe it’s missing a column we need? Maybe missing values are encoded stragangely? Perhaps we need to standarise the units in a column?
We can use another package to solve these problems - dplyr
! (Wickham, François, et al. 2019)
Like ggplot2
, it adds more words to our R vocabulary and is focused on organising your data. dplyr
is short for “data frame plier”. Data frames are an object type in R
but we don’t think it is useful to discuss object types in this tutorial. data frames
are essentially tables that stores your data in R.
5.1 Set Up for Session
Similarly to the previous chapter, it is assumed you have R Studio running and have installed ggplot2
and dplyr
. If not, run the following code;
install.packages(c("ggplot2","dplyr"))
We begin by loading dplyr
and ggplot2
.
library(dplyr)
library(ggplot2)
When you load dplyr
you will get a warning message;
Don’t worry!.
What it is saying is that dplyr
has a few functions with the same name as built-in functions, so it will now use the ones from dplyr
by default when these are called.
QUESTION: Can you match the interpretation of the error with what is written in the error message? Why do you think they haved used the phrase ‘masked’?
library(dplyr)
library(ggplot2)
Let’s start by loading the miniIMDB dataset. This process is that same as described in the previous chapter but loading 04-miniImdb.RData
instead of 03 Pulse.RData
.
5.2 Description of Datasets Being Used
The “imdb” dataset has been built from the subsets of the Internet Movie Database made available for non-commercial purposes by the IMDb team: IMDB (2018)
It contains the following informations for all the entries having more than 500 votes, that are not of type “tvEpisodes” and for which information about year of release, running time and director(s) was available at the time of extraction (24/10/2018):
Variable | Meaning |
---|---|
title | the popular title of the entry |
type | type of the entry: movie short, tvMiniSeries, tvMovie, tvSeries, tvShort, tvSpecial, video or videoGame |
year | the year of release or start of release for series |
length | the duration of the running time in minutes |
director | the director (or director appearing first in the list of directors) |
birthYear | year of birth of director |
NumVotes | number of votes for the entry |
averageRating | IMDb’s weighted average rating for the entry |
As well as a list of logical (TRUE/FALSE) columns qualifying the genre of the entry:
Genres |
---|
animation |
action |
adventure |
comedy |
documentary |
fantasy |
romance |
sci_fi |
thriller |
The “miniImdb” dataset is a small subset of “imdb”. It contains the following 5 first variables of “imdb” for the titles that have received more than 1 million votes:
miniImdb |
---|
title |
type |
year |
length |
numVotes |
We will start by using the miniImdb dataset but we will later move to the full imdb dataset when we get more familiar with data manipulation.
5.3 Slicing, Filter and Select
We can use square brackets as a ‘quick’ way to get subsets of datasets (also called slicing) based on position;
DATASETNAME[ROW NUMBERS,COLUMN NUMBERS]
miniImdb[1,]
## # A tibble: 1 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 The Godfather movie 1972 175 1374861
miniImdb[,1]
## # A tibble: 29 x 1
## title
## <chr>
## 1 The Godfather
## 2 Star Wars: Episode IV - A New Hope
## 3 Star Wars: Episode V - The Empire Strikes Back
## 4 The Silence of the Lambs
## 5 Schindler's List
## 6 Forrest Gump
## 7 Pulp Fiction
## 8 The Shawshank Redemption
## 9 Se7en
## 10 The Lord of the Rings: The Fellowship of the Ring
## # ... with 19 more rows
miniImdb[1:5,1:2]
## # A tibble: 5 x 2
## title type
## <chr> <fct>
## 1 The Godfather movie
## 2 Star Wars: Episode IV - A New Hope movie
## 3 Star Wars: Episode V - The Empire Strikes Back movie
## 4 The Silence of the Lambs movie
## 5 Schindler's List movie
But to get more useful subsets we would probably need to base these on characteristics or names rather than positions. We can use the filter()
and select()
functions for that from dplyr
.
filter() helps you get the rows you are interested in
filter(miniImdb, type=="tvSeries")
filter(miniImdb, year>2005)
## # A tibble: 2 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 Breaking Bad tvSeries 2008 49 1124817
## 2 Game of Thrones tvSeries 2011 57 1365039
## # A tibble: 12 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 Inglourious Basterds movie 2009 153 1069563
## 2 The Departed movie 2006 151 1031834
## 3 The Dark Knight movie 2008 152 1975810
## 4 The Prestige movie 2006 130 1020039
## 5 Avatar movie 2009 162 1005456
## 6 Interstellar movie 2014 169 1218930
## 7 The Avengers movie 2012 143 1129041
## 8 Breaking Bad tvSeries 2008 49 1124817
## 9 Game of Thrones tvSeries 2011 57 1365039
## 10 The Dark Knight Rises movie 2012 164 1335842
## 11 Inception movie 2010 148 1755897
## 12 Django Unchained movie 2012 165 1156987
Whereas select helps you keep only the columns that you care about, by listing their names
select(miniImdb,title,year, numVotes)
## # A tibble: 29 x 3
## title year numVotes
## <chr> <dbl> <int>
## 1 The Godfather 1972 1374861
## 2 Star Wars: Episode IV - A New Hope 1977 1078317
## 3 Star Wars: Episode V - The Empire Strikes Back 1980 1008039
## 4 The Silence of the Lambs 1991 1075970
## 5 Schindler's List 1993 1035642
## 6 Forrest Gump 1994 1527656
## 7 Pulp Fiction 1994 1566510
## 8 The Shawshank Redemption 1994 2006753
## 9 Se7en 1995 1226169
## 10 The Lord of the Rings: The Fellowship of the Ring 2001 1445423
## # ... with 19 more rows
or by going in sequence - from title to year for example
select(miniImdb,title:year)
## # A tibble: 29 x 3
## title type year
## <chr> <fct> <dbl>
## 1 The Godfather movie 1972
## 2 Star Wars: Episode IV - A New Hope movie 1977
## 3 Star Wars: Episode V - The Empire Strikes Back movie 1980
## 4 The Silence of the Lambs movie 1991
## 5 Schindler's List movie 1993
## 6 Forrest Gump movie 1994
## 7 Pulp Fiction movie 1994
## 8 The Shawshank Redemption movie 1994
## 9 Se7en movie 1995
## 10 The Lord of the Rings: The Fellowship of the Ring movie 2001
## # ... with 19 more rows
So in summary,
filter()
works by subsetting rows from a dataset
select()
works for subset of columns
QUESTION: Write the code that would give you all the entries that have been released before 2000
filter(miniImdb, ????)
QUESTION: What would you do to only keep the columns title, length and numVotes from the miniImdb dataset
select(miniImdb,????)
5.4 Creating new dataframe
All of these subsets so far have only produced temporary results printed into the console window. We usually subset data so that we can then do something with it later. One option is to assign the subset to a new dataframe with the arrow: “<-”. We will learn another way later.
recentEntries<-filter(miniImdb, year>2005)
recentEntries
## # A tibble: 12 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 Inglourious Basterds movie 2009 153 1069563
## 2 The Departed movie 2006 151 1031834
## 3 The Dark Knight movie 2008 152 1975810
## 4 The Prestige movie 2006 130 1020039
## 5 Avatar movie 2009 162 1005456
## 6 Interstellar movie 2014 169 1218930
## 7 The Avengers movie 2012 143 1129041
## 8 Breaking Bad tvSeries 2008 49 1124817
## 9 Game of Thrones tvSeries 2011 57 1365039
## 10 The Dark Knight Rises movie 2012 164 1335842
## 11 Inception movie 2010 148 1755897
## 12 Django Unchained movie 2012 165 1156987
tvSeriesData<-filter(miniImdb, type=="tvSeries")
tvSeriesData
## # A tibble: 2 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 Breaking Bad tvSeries 2008 49 1124817
## 2 Game of Thrones tvSeries 2011 57 1365039
Note the == sign to mean IS EQUAL TO
A single equals sign is an assignment statement: x=y “set x to be equal y”
A double equals sign is a question x==y “is x equal to y?”
within the filter() function the single = sign will return an error. Not all functions may be so friendly!
filter(miniImdb, type="tvSeries")
## `type` (`type = "tvSeries"`) must not be named, do you need `==`?
Remember R is case sensitive, and cannot do any association of meaning on its own
filter(miniImdb, type=="tv Series")
## # A tibble: 0 x 5
## # ... with 5 variables: title <chr>, type <fct>, year <dbl>, length <dbl>,
## # numVotes <int>
nothing
filter(miniImdb, type=="tvseries")
## # A tibble: 0 x 5
## # ... with 5 variables: title <chr>, type <fct>, year <dbl>, length <dbl>,
## # numVotes <int>
also nothing
filter(miniImdb, Type=="tvSeries")
## Error: object 'Type' not found
still nothing
We can also produce a subset of a subset:
titleVotesRecent<-select(recentEntries, title, numVotes)
titleVotesRecent
## # A tibble: 12 x 2
## title numVotes
## <chr> <int>
## 1 Inglourious Basterds 1069563
## 2 The Departed 1031834
## 3 The Dark Knight 1975810
## 4 The Prestige 1020039
## 5 Avatar 1005456
## 6 Interstellar 1218930
## 7 The Avengers 1129041
## 8 Breaking Bad 1124817
## 9 Game of Thrones 1365039
## 10 The Dark Knight Rises 1335842
## 11 Inception 1755897
## 12 Django Unchained 1156987
QUESTION: Produce a subset of the data for the type “movie” and assign it to an object called movies
??? <- filter(miniImdb, ????==????)
???
QUESTION: Is there any movie who received more than 2 million votes?
filter(????, ????)
5.5 Column Transformations
We can create new columns with the function mutate()
For example let’s try to convert the length of the entries in hour rather than in minutes
mutate(miniImdb, lengthInHour = length/60)
## # A tibble: 29 x 6
## title type year length numVotes lengthInHour
## <chr> <fct> <dbl> <dbl> <int> <dbl>
## 1 The Godfather movie 1972 175 1374861 2.92
## 2 Star Wars: Episode IV - A New ~ movie 1977 121 1078317 2.02
## 3 Star Wars: Episode V - The Emp~ movie 1980 124 1008039 2.07
## 4 The Silence of the Lambs movie 1991 118 1075970 1.97
## 5 Schindler's List movie 1993 195 1035642 3.25
## 6 Forrest Gump movie 1994 142 1527656 2.37
## 7 Pulp Fiction movie 1994 154 1566510 2.57
## 8 The Shawshank Redemption movie 1994 142 2006753 2.37
## 9 Se7en movie 1995 127 1226169 2.12
## 10 The Lord of the Rings: The Fel~ movie 2001 178 1445423 2.97
## # ... with 19 more rows
The usual signs that R uses to make calculations are:
Symbol | Meaning |
---|---|
* |
multiply |
/ |
divide |
+ |
add |
- |
substract |
** |
raise to the power |
QUESTION: Create a new dataframe adding a column giving the number of votes in million to miniImdb and show only the title and this newly created column
miniImdbMillion <- mutate(miniImdb, ???? = ????)
select(????, ????)
5.6 More ways to filter
When we look at the full imdb dataset, it will be useful to know that we can use multiple conditions and additional fonctions to filter rows:
Here are the possible logical symbols to use when doing conditions in R:
Code | Meaning |
---|---|
== | EQUALS |
& | AND |
! | NOT |
< | less than |
> | more than |
And we can combine these together. On the “recentEntries” data let’s get all the entries with a length between 1h30 (90min) and 2h30 (150min)
filter(recentEntries,length>90 & length<150)
## # A tibble: 3 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 The Prestige movie 2006 130 1020039
## 2 The Avengers movie 2012 143 1129041
## 3 Inception movie 2010 148 1755897
We can also use functions like max()
or min()
to help us with the filtering. For example, maybe we want to know which entrie’s length is the largest
filter(recentEntries,length==max(length))
## # A tibble: 1 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 Interstellar movie 2014 169 1218930
Finally, going back to our very first examples, note that the function slice()
is an alternative way to get specific rows by positions in a dataset. We will use it on the full imdb dataset
slice(recentEntries, 1)
## # A tibble: 1 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 Inglourious Basterds movie 2009 153 1069563
slice(miniImdb, 1:5)
## # A tibble: 5 x 5
## title type year length numVotes
## <chr> <fct> <dbl> <dbl> <int>
## 1 The Godfather movie 1972 175 1374861
## 2 Star Wars: Episode IV - A New Hope movie 1977 121 1078317
## 3 Star Wars: Episode V - The Empire Strikes Ba~ movie 1980 124 1008039
## 4 The Silence of the Lambs movie 1991 118 1075970
## 5 Schindler's List movie 1993 195 1035642
QUESTION: Which movie has the oldest year of release among the movies of the miniImdb dataset?
filter(miniImdb,????)
QUESTION: Which entry has the oldest year of release in the full imdb dataset?
filter(????,????)
QUESTION: I’m trying to find out which movie has the oldest year of release in the full imdb dataset. Can you guess why this code doesn’t work? How should you modify it?
filter(imdb, type=="movie" & year==min(year))
## # A tibble: 0 x 17
## # ... with 17 variables: title <chr>, type <fct>, year <dbl>,
## # length <dbl>, numVotes <int>, averageRating <dbl>, director <chr>,
## # birthYear <dbl>, animation <lgl>, action <lgl>, adventure <lgl>,
## # comedy <lgl>, documentary <lgl>, fantasy <lgl>, romance <lgl>,
## # sci_fi <lgl>, thriller <lgl>
5.7 Aggregation/Grouping
The imdb dataset contains entries of various types, but we only saw entries of type movie and tvSeries so far, because the other types don’t have enough votes to be in the miniImdb dataset. So we will now use the full imdb dataset.
One thing that would be interesting is to get the entries with highest number of votes for each type of entry. You know how to do it for each type separately, by filtering on the specific type first:
short<-filter(imdb,type=="short")
filter(short, numVotes==max(numVotes))
## # A tibble: 1 x 17
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 Kung~ short 2015 31 50331 8 David S~ NA
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
But wouldn’t it be nice to do it for all types in a single move? We can do so using the function group_by() instead of filter() for the first step
imdb_type<-group_by(imdb,type)
filter(imdb_type, numVotes==max(numVotes))
## # A tibble: 9 x 17
## # Groups: type [9]
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 Eddi~ tvSp~ 1983 69 13995 8.2 Bruce G~ NA
## 2 The ~ movie 1994 142 2006753 9.3 Frank D~ 1959
## 3 Band~ tvMi~ 2001 594 297551 9.5 David F~ 1959
## 4 The ~ video 2003 100 66405 7.4 Shin'ic~ NA
## 5 High~ tvMo~ 2006 98 70030 5.3 Kenny O~ 1950
## 6 Shre~ tvSh~ 2007 21 11025 6.5 Gary Tr~ 1960
## 7 Game~ tvSe~ 2011 57 1365039 9.5 Matt Sh~ 1975
## 8 Halo~ vide~ 2009 34 3077 7.6 Rich Wi~ NA
## 9 Kung~ short 2015 31 50331 8 David S~ NA
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
QUESTION: Can you find out which title has the oldest release year for each type of entry?
imdb_type<-group_by(imdb,????)
filter(????, ????==????)
5.8 Pipes
I’m a big sci-fi fan, and I see that none of the most voted entries are of this genre (see above). Let’s use filter to only keep the sci-fi entries:
imdbSciFi<-filter(imdb,sci_fi=="TRUE")
imdbSciFi_type<-group_by(imdbSciFi,type)
filter(imdbSciFi_type, numVotes==max(numVotes))
## # A tibble: 7 x 17
## # Groups: type [7]
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 A Tr~ short 1902 13 36791 8.2 Georges~ 1861
## 2 Batt~ tvMo~ 2007 101 18920 7.7 Félix ~ 1951
## 3 Robo~ tvSh~ 2007 30 7134 8.2 Seth Gr~ 1974
## 4 Ince~ movie 2010 148 1755897 8.8 Christo~ 1970
## 5 Deat~ video 2010 100 27769 5.6 Roel Re~ 1969
## 6 The ~ tvSe~ 2010 44 755188 8.4 Lesli L~ NA
## 7 11.2~ tvMi~ 2016 60 56331 8.2 James S~ NA
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
To get from the original data to this output we had to create a number of intermediate steps
imdb -> filter on sci-fi -> group_by type -> filter to max
We either need to explicitly save each of the steps as a dataeframe OR we can be clever and use pipes `%>%
imdb %>%
filter(sci_fi=="TRUE") %>%
group_by(type) %>%
filter(numVotes==max(numVotes))
## # A tibble: 7 x 17
## # Groups: type [7]
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 A Tr~ short 1902 13 36791 8.2 Georges~ 1861
## 2 Batt~ tvMo~ 2007 101 18920 7.7 Félix ~ 1951
## 3 Robo~ tvSh~ 2007 30 7134 8.2 Seth Gr~ 1974
## 4 Ince~ movie 2010 148 1755897 8.8 Christo~ 1970
## 5 Deat~ video 2010 100 27769 5.6 Roel Re~ 1969
## 6 The ~ tvSe~ 2010 44 755188 8.4 Lesli L~ NA
## 7 11.2~ tvMi~ 2016 60 56331 8.2 James S~ NA
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
With the piping system we start with the name of the data (imdb). This then gets carried through in each step - the data from the end of line 1 gets automatically used in line 2 and so on.
We do not need to call the dataframe name as the first option in the functions that we use
And we can easily add on extra functions. Select() might be useful to have a quicker look at all the information we are interested in.
imdb %>%
filter(sci_fi=="TRUE") %>%
group_by(type) %>%
filter(numVotes==max(numVotes)) %>%
select(title,type,averageRating,numVotes)
## # A tibble: 7 x 4
## # Groups: type [7]
## title type averageRating numVotes
## <chr> <fct> <dbl> <int>
## 1 A Trip to the Moon short 8.2 36791
## 2 Battlestar Galactica: Razor tvMovie 7.7 18920
## 3 Robot Chicken: Star Wars tvShort 8.2 7134
## 4 Inception movie 8.8 1755897
## 5 Death Race 2 video 5.6 27769
## 6 The Walking Dead tvSeries 8.4 755188
## 7 11.22.63 tvMiniSeries 8.2 56331
QUESTION: Can you find out for each type of entry, which is the best rated title using pipes?
imdb %>%
group_by(????) %>%
filter(????==????)
5.9 The summarise() function
To know the average rating of all the movies, we could use the function summarize:
imdb %>%
filter(type=="movie") %>%
summarize(mean=mean(averageRating))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 6.32
But we can get lot’s of other summary information, like the number of entries, using the function n() the standard deviation of the ratings using the function sd() or the average number of votes
imdb %>%
filter(type=="movie") %>%
summarize(n=n(), meanRating=mean(averageRating), sdRating=sd(averageRating), meanVotes=mean(numVotes))
## # A tibble: 1 x 4
## n meanRating sdRating meanVotes
## <int> <dbl> <dbl> <dbl>
## 1 39221 6.32 1.20 17775.
And we can combine it with group_by to have these information for both the action and the non action movies
imdb %>%
filter(type=="movie") %>%
group_by(action) %>%
summarize(n=n(), meanRating=mean(averageRating), sdRating=sd(averageRating), meanVotes=mean(numVotes))
## # A tibble: 2 x 5
## action n meanRating sdRating meanVotes
## <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE 32811 6.39 1.17 14898.
## 2 TRUE 6410 5.97 1.31 32499.
by the way, you can group by more than one variable
imdb %>%
filter(type=="movie") %>%
group_by(action, adventure, comedy, sci_fi) %>%
summarize(n = n(), meanRating=mean(averageRating), sdRating=sd(averageRating), meanVotes=mean(numVotes))
## # A tibble: 15 x 8
## # Groups: action, adventure, comedy [8]
## action adventure comedy sci_fi n meanRating sdRating meanVotes
## <lgl> <lgl> <lgl> <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE FALSE FALSE FALSE 18324 6.53 1.15 13353.
## 2 FALSE FALSE FALSE TRUE 1020 5.35 1.41 24846.
## 3 FALSE FALSE TRUE FALSE 10964 6.27 1.11 12320.
## 4 FALSE FALSE TRUE TRUE 256 5.64 1.25 14883.
## 5 FALSE TRUE FALSE FALSE 1264 6.49 1.10 27982.
## 6 FALSE TRUE FALSE TRUE 107 5.55 1.60 62676.
## 7 FALSE TRUE TRUE FALSE 850 6.27 1.16 41371.
## 8 FALSE TRUE TRUE TRUE 26 6.13 1.39 102517.
## 9 TRUE FALSE FALSE FALSE 3210 6.09 1.24 19461.
## 10 TRUE FALSE FALSE TRUE 430 5.11 1.43 44512.
## 11 TRUE FALSE TRUE FALSE 1058 6.03 1.15 17225.
## 12 TRUE FALSE TRUE TRUE 34 4.97 1.34 27008.
## 13 TRUE TRUE FALSE FALSE 1184 6.03 1.37 54860.
## 14 TRUE TRUE FALSE TRUE 172 5.86 1.48 180236.
## 15 TRUE TRUE TRUE FALSE 322 5.71 1.44 36063.
``` ARRANGE, UNGROUP, SLICE INTRODUCED LATER Let’s see what’s the lowest rated combination of genres.
QUESTION: Can you find the lowest rated combination of genres from above? HINT: This can be done with arrange(), ungroup() and slice()
imdb %>%
filter(type=="movie") %>%
group_by(action, adventure, comedy, sci_fi) %>%
summarize(n = n(), meanRating=mean(averageRating), sdRating=sd(averageRating), meanVotes=mean(numVotes)) %>%
arrange(meanRating) %>%
ungroup() %>%
slice(1)
## # A tibble: 1 x 8
## action adventure comedy sci_fi n meanRating sdRating meanVotes
## <lgl> <lgl> <lgl> <lgl> <int> <dbl> <dbl> <dbl>
## 1 TRUE FALSE TRUE TRUE 34 4.97 1.34 27008.
Oh boy, you probably don’t want to watch an action/comedy/sc-fi movie that wouldn’t be of genre adventure!
QUESTION: Can you calculate for each year, the average length and rating of the released movies? Save the result in a dataframe
???? <- imdb %>%
filter(????) %>%
group_by(????) %>%
summarize(????)
```
QUESTION: Would you manage to plot the average length of movies (as y) against the year of release (as x)?
ggplot(data = ????, aes(y=????, x=????)) +
geom_????
5.10 Arranging
Ok what we probably want at this point is think about the movie we are going to watch tonight.
The best rated sci-fi movie is:
imdb %>%
filter(type=="movie" & sci_fi=="TRUE") %>%
filter(averageRating==max(averageRating))
## # A tibble: 1 x 17
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 Ince~ movie 2010 148 1755897 8.8 Christo~ 1970
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
But I already watched it
Maybe, I should check the first 10 sci-fi movies in the dataset using slice()
:
imdb %>%
filter(type=="movie" & sci_fi=="TRUE") %>%
slice(1:10)
## # A tibble: 10 x 17
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 20,0~ movie 1916 105 1342 7 Stuart ~ 1883
## 2 Dr. ~ movie 1920 49 4148 7 John S.~ 1878
## 3 The ~ movie 1925 106 4015 7.1 Harry O~ 1885
## 4 Metr~ movie 1927 153 136815 8.3 Fritz L~ 1890
## 5 Fran~ movie 1931 70 56442 7.9 James W~ 1889
## 6 Dr. ~ movie 1931 98 10969 7.7 Rouben ~ 1897
## 7 The ~ movie 1932 68 2694 6.4 Charles~ 1900
## 8 The ~ movie 1933 71 24943 7.7 James W~ 1889
## 9 Isla~ movie 1932 70 6863 7.5 Erle C.~ 1896
## 10 King~ movie 1933 100 71884 7.9 Ernest ~ 1893
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
We only get very old movies, because the dataset seems to be somewhat arranged by year of release. So before using slice(), we want to arrange the movies by rating using the function arrange()
:
imdb %>%
filter(type=="movie" & sci_fi=="TRUE") %>%
arrange(averageRating) %>%
slice(1:10)
## # A tibble: 10 x 17
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 Brow~ movie 2015 98 712 1.2 Francis~ NA
## 2 Purge movie 2010 80 1131 1.4 David K~ NA
## 3 Nukie movie 1987 95 1021 1.5 Michael~ 1950
## 4 Star~ movie 2009 81 511 1.5 Jon Bon~ NA
## 5 Ultr~ movie 1990 81 676 1.6 Kevin T~ NA
## 6 Alie~ movie 2011 80 1523 1.6 Lewis S~ NA
## 7 Evil~ movie 2006 90 578 1.7 Jim Car~ NA
## 8 Univ~ movie 2007 85 1383 1.7 Griff F~ 1981
## 9 Atla~ movie 2018 86 503 1.7 Jared C~ NA
## 10 Turk~ movie 2006 110 14830 1.9 Kartal ~ 1938
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
Look at the ratings! These are probably the worst movies ever. arrange()
by default sorts in ascending order, so we need to arrange by descending order of rating:
imdb %>%
filter(type=="movie" & sci_fi=="TRUE") %>%
arrange(desc(averageRating)) %>%
slice(1:10)
## # A tibble: 10 x 17
## title type year length numVotes averageRating director birthYear
## <chr> <fct> <dbl> <dbl> <int> <dbl> <chr> <dbl>
## 1 Ince~ movie 2010 148 1755897 8.8 Christo~ 1970
## 2 The ~ movie 1999 136 1439664 8.7 Lana Wa~ 1965
## 3 Inte~ movie 2014 169 1218930 8.6 Christo~ 1970
## 4 Alien movie 1979 116 684793 8.5 Ridley ~ 1937
## 5 Back~ movie 1985 116 888217 8.5 Robert ~ 1951
## 6 Term~ movie 1991 137 870521 8.5 James C~ 1954
## 7 The ~ movie 2006 130 1020039 8.5 Christo~ 1970
## 8 Alie~ movie 1986 137 579233 8.4 James C~ 1954
## 9 Lucia movie 2013 135 10022 8.4 Pawan K~ NA
## 10 Metr~ movie 1927 153 136815 8.3 Fritz L~ 1890
## # ... with 9 more variables: animation <lgl>, action <lgl>,
## # adventure <lgl>, comedy <lgl>, documentary <lgl>, fantasy <lgl>,
## # romance <lgl>, sci_fi <lgl>, thriller <lgl>
That’s much better.
QUESTION: What are the 5 best rated thriller tvSeries having more than 10000 votes
imdb %>%
filter(????) %>%
arrange(????) %>%
slice(????)
5.11 Summary
Key Functions learned:
filter()
select()
mutate()
group_by()
summarise()
arrange()
SUGGESTION: Write down in your own words what each of these functions does
Other concepts covered:
logical operators: ! | & ==
numerical operators: + - / *
pipes%>%
other useful functions: slice() max() min() n() mean() sd()
Lets combine all of these concepts together into a single piece of R code!:
imdb %>%
group_by(director) %>%
summarize(n=n(), meanRating=mean(averageRating), sumVotes=sum(numVotes), medianLength=median(length)) %>%
mutate(meanVotes=sumVotes/n) %>%
filter(n>10 & meanVotes>10000) %>%
arrange(desc(meanRating)) %>%
select(director, n, meanRating, medianLength)%>%
slice(1:10)
## # A tibble: 10 x 4
## director n meanRating medianLength
## <chr> <int> <dbl> <dbl>
## 1 Christopher Nolan 12 8.09 124
## 2 Hayao Miyazaki 14 8 100.
## 3 Spike Jonze 11 7.8 28
## 4 Quentin Tarantino 12 7.79 145
## 5 David Fincher 12 7.72 128
## 6 Anurag Kashyap 14 7.67 136.
## 7 Akira Kurosawa 30 7.67 114.
## 8 S.S. Rajamouli 11 7.62 158
## 9 William Wyler 28 7.62 106
## 10 Krzysztof Kieslowski 16 7.61 95
QUESTION: Write down what this code does
QUESTION: Which line could we remove by making a little modification inside the function summarize()?
QUESTION: Did you notice the use of the function median()? Why would one prefer to use the median rather than the mean?
BONUS CHALLENGE: Could you find among the movie entries, the 10 directors having directed their 5th movie at the youngest age? Who is the 9th one in the list? (hint: you may need to use a function that we haven’t seen yet: ungroup())
imdb %>%
filter(????) %>%
mutate(ageAtRelease=????) %>%
group_by(????) %>%
arrange(????) %>%
slice(????) %>%
ungroup() %>%
arrange(????) %>%
select(????) %>%
slice(????)
References
IMDB. 2018. “IMDB.” https://www.imdb.com/interfaces/.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.