Integrating dplyr with Remote databases

A recent RViews article covers the use of the dplyr package to interact with SQL databases

All the code can be written in R, which dplyr then translates into SQL queries to harness the power of a database You will probably want to read the article if interested in extending the process to your own data but here is a taster from some of mine

Install and load packages

The database accessibility feature is still being developed, so currently we will need to use the development versions of dbplyr and dplyr.

# this takes  a while as needs compiling
# devtools::install_github("tidyverse/dplyr") #0.6.0
# devtools::install_github("tidyverse/dbplyr") #
# devtools::install_github("rstats-db/odbc") #
# install.packages("DBI") # was 0.6-1 


Make Connection and investigate tables available

I maintain an Microsoft SQL Server which includes some tables related to the English Premier League. I have hidden the code which includes my id/password

First make the connection

con <- dbConnect(odbc::odbc(),
                   Driver    = "SQL Server Native Client 11.0", 
                   Server    = "",
                   Database  = "epldb",
                   UID       = [My User ID],
                   PWD       = [My Password],
                   Port      = 1433)

Then retrieve and examine the tables. I happen to know that ‘tblPlayers’ and ‘tblPlayerClub’ are a couple of relevant ones

# There are around 500 tables but thankfully the datatables are listed first
dbListTables(con) %>% head()
## [1] "Cities"                  "collegeFootballRankings"
## [3] "Countries"               "didCastaway"            
## [5] "didTracks"               "didUrl"
# For these two tables the common field is PLAYERID, which will be used later
dbListFields(con, "tblPlayers")
##  [6] "BIRTHDATE"  "POSITION"   "HEIGHT"     "WEIGHT"     "STICKER"   
## [11] "ORDER"      "PHOTO"      "SoccerBase" "MatchFacts" "Who'sWho"  
## [16] "ETPOS"      "OPTAcode"   "ETValue"    "Memo"       "Dead"
dbListFields(con, "tblPlayerClub")
##  [1] "PLAYERID"            "TEAMID"              "JOINED"             
##  [4] "USJOINED"            "LEFT"                "FAKELEFT"           
##  [7] "PERMANENT"           "FEE"                 "FEEOUT"             
## [10] "MISC"                "PLAYER_TEAM"         "testdate"           
## [13] "OldTEAMID"           "SquadNo"             "FirstGame"          
## [16] "FirstGoal"           "FirstCAution"        "LastGame"           
## [19] "LastGoal"            "LastCaution"         "SpellWithClub"      
## [22] "Memo"                "Out"                 "OutReason"          
## [25] "RecordSigning"       "RecordSigningClub"   "RecordAgeYoungOrder"
## [28] "RecordAgeYoungYears" "RecordAgeYoungDays"  "RecordAgeOldOrder"  
## [31] "RecordAgeOldYears"   "RecordAgeOldDays"
## create some data.frames for easy manipulation, restricting to pertinent fields
players <- tbl(con, "tblPlayers") %>% 

playerClub <- tbl(con, "tblPlayerClub") %>%

The ugly field names are a relic of setting up in MS Access many moons ago. Apologies

Create some example output

Let’s plot the most common first names of players

players %>%
  group_by(FIRSTNAME) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head() %>% 
  plot_ly(x=~n, y=~FIRSTNAME)

and a table of which players have appeared on the books of the most teams

players %>% 
left_join(playerClub) %>% # 
  tally() %>% 
  filter(PLAYERID!="OWNGOAL") %>% 
  ungroup() %>% 
  select(name=LASTNAME,clubs=n) %>% 
  arrange(desc(clubs)) %>% 
  filter(clubs>6) %>% 
name clubs
Bent 9
Bellamy 8
Ben Haim 7
Cole 7
Keane 7
Nash 7
Routledge 7
Unsworth 7

Ideally, this would show full name (the leader is, in fact, Marcus Bent) but attempting to combine FIRSTNAME and LASTNAME is currently causing an error

This is very much a token usage but should give some idea of the speed of interacting with a remote database and power these extra features provide

Share Comments
comments powered by Disqus