https://github.com/scw/r-devsummit-2016-talk
What's a data scientist?
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
— Josh Wills
We geographic folks also rely on knowledge from multiple domains. We know that spatial is more than just an x and y column in a table, and we know how to get value out of this data.
Languages commonly used in data science:
R — Python — Matlab — Julia
We're a big Python shop, so why R?
CRAN: 6400 packages for solving problems
Data types you're used to seeing...
- Numeric
- Integer
- Character
- Logical
- timestamp
... but others you probably aren't:
- vector
- matrix
- data.frame
- factor
Vector:
a.vector <- c(4, 3, 8, 7, 1, 5)
Matrix:
A <- matrix(
  c(4, 3, 8, 7, 1, 5),  # same data as above
  nrow=2, ncol=3,       # what's the shape of the data?
  byrow=TRUE)           # what order are the values in?
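To see what byrow changes, here is a small sketch that builds the matrix both ways from the same vector. The names values, by.row and by.col are illustrative only.

```r
values <- c(4, 3, 8, 7, 1, 5)

# fill row by row: the first three values form row 1
by.row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)

# default (byrow = FALSE): fill column by column instead
by.col <- matrix(values, nrow = 2, ncol = 3)

print(by.row)
print(by.col)

# both have the same shape, but the values land in different cells
print(dim(by.row))
```

Same data, same shape, different layout: a common source of confusion when moving tabular values into a matrix.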
Data Frames:
# Create a data frame out of an existing tabular source
df.from.csv <- read.csv("data/growth.csv", header=TRUE)
# Create a data frame from scratch
quarter <- c(2, 3, 1)
person <- c("Goodchild", "Tobler", "Krige")
met.quota <- c(TRUE, FALSE, TRUE)
df <- data.frame(person, met.quota, quarter)
R> df
     person met.quota quarter
1 Goodchild      TRUE       2
2    Tobler     FALSE       3
3     Krige      TRUE       1
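Factors are listed among the less familiar types but not shown, so here is a minimal sketch. The land.use values are hypothetical and only there to show how categories become levels.

```r
# a character vector with repeated categories (hypothetical values)
land.use <- c("urban", "forest", "urban", "water", "forest", "urban")

# a factor stores each distinct category once, as a level
land.use.f <- factor(land.use)

print(levels(land.use.f))      # the distinct categories
print(table(land.use.f))       # counts per category
print(as.integer(land.use.f))  # the underlying integer codes
```

Many modeling functions in R treat factors as categorical variables automatically, which is the main reason to prefer them over plain character vectors for this kind of data.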
sp Types:
- SpatialPoints
- SpatialLines
- SpatialPolygons
Entity + Attribute model
ggplot2, scales, dplyr, devtools, many others
fit.results <- lm(pollution ~ elevation + rainfall + ppm.nox + urban.density)
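The formula interface can be tried without real data. The sketch below fits a two-predictor version of the model on synthetic values; the sites data frame, its ranges, and the coefficients 2, 0.5 and -1.5 are all made up for illustration.

```r
set.seed(42)  # make the synthetic data reproducible
n <- 200

# synthetic predictors; units and ranges are made up for illustration
sites <- data.frame(
  elevation = runif(n, 0, 10),
  rainfall  = runif(n, 0, 5)
)

# response generated from known coefficients plus a little noise
sites$pollution <- 2 + 0.5 * sites$elevation - 1.5 * sites$rainfall +
  rnorm(n, sd = 0.1)

# passing data = keeps the formula readable and the variables in one place
fit <- lm(pollution ~ elevation + rainfall, data = sites)

# estimates should land close to the coefficients used to generate the data
print(coef(fit))
```

Calling summary(fit) on the result gives standard errors and fit statistics for the same model.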
caret for model specification consistency
“I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature.”
— Donald Knuth, “Literate Programming”
RMarkdown, Roxygen2
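dplyr Package
library(dplyr)
library(Lahman)  # provides the Batting table
# %>% replaces the older %.% operator, which dplyr no longer provides
Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)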
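For comparison, the same group, summarise, sort, and take-the-top steps can be sketched in base R. This is not the dplyr approach from the slide, just an equivalent on a tiny hypothetical games table, so it runs without any packages installed.

```r
# tiny hypothetical stand-in for the Batting table
games <- data.frame(
  playerID = c("a01", "b02", "a01", "c03", "b02", "a01"),
  G = c(10, 20, 30, 5, 25, 15),
  stringsAsFactors = FALSE
)

# group by playerID and sum games played
totals <- aggregate(G ~ playerID, data = games, FUN = sum)
names(totals)[2] <- "total"

# sort descending by total and keep at most the top five rows
top <- head(totals[order(-totals$total), ], 5)
print(top)
```

The pipeline version reads top to bottom in the order the steps happen, which is much of dplyr's appeal.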
shiny, but R is first and foremost a language that expects fluency from its users
Store your data in ArcGIS, access it quickly in R, return R objects back to ArcGIS native data types (e.g. geodatabase feature classes).
Knows how to convert spatial data to sp objects.
ArcGIS | R | Example Value |
---|---|---|
Address Locator | Character | Address Locators\\MGRS |
Any | Character | |
Boolean | Logical | |
Coordinate System | Character | "PROJCS[\"WGS_1984_UTM_Zone_19N\"... |
Dataset | Character | "C:\\workspace\\projects\\results.shp" |
Date | Character | "5/6/2015 2:21:12 AM" |
Double | Numeric | 22.87918 |
Extent | Vector (xmin, ymin, xmax, ymax) | c(0, -591.561, 1000, 992) |
Field | Character | |
Folder | Character | full path, use with e.g. file.info() |
Long | Long | 19827398L |
String | Character | |
Text File | Character | full path |
Workspace | Character | full path |
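As one example of working with these mapped values, an Extent arrives as a plain numeric vector. The sketch below uses the example value from the table; naming the elements is a convenience added here, not something the bridge is known to do.

```r
# the example extent from the table, with names added for readability
extent <- c(xmin = 0, ymin = -591.561, xmax = 1000, ymax = 992)

# derive the width and height of the bounding box
width  <- extent[["xmax"]] - extent[["xmin"]]
height <- extent[["ymax"]] - extent[["ymin"]]

print(c(width = width, height = height))
```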
Start by loading the library and initializing the connection to ArcGIS:
# load the ArcGIS-R bridge library
library(arcgisbinding)
# initialize the connection to ArcGIS. Only needed when running directly from R.
arc.check_product()
Opening data has two stages, like data cursors:
- arc.open
- arc.select
Similar to using arcpy.da cursors
First, select a data source (can be a feature class, a layer, or a table):
input.fc <- arc.open('data.gdb/features')
Then, filter the data to the set you want to work with (creates in-memory data frame):
filtered.df <- arc.select(input.fc,
                          fields=c('fid', 'mean'),
                          where_clause="mean < 100")
This creates an ArcGIS data frame: it looks like a data frame, but retains references back to the geometry data.
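The effect of fields and where_clause can be pictured with an ordinary data frame. The sketch below does not use the bridge at all; features is a hypothetical attribute table, and subset() stands in for the row and column filtering.

```r
# hypothetical attribute table standing in for a feature class
features <- data.frame(
  fid  = 1:5,
  mean = c(42.0, 150.5, 99.9, 100.0, 7.3),
  name = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# keep rows where mean < 100, and only the fid and mean columns
filtered <- subset(features, mean < 100, select = c(fid, mean))
print(filtered)
```

Filtering at selection time means only the rows and columns you need are brought into memory.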
Now, if we want to do analysis in R with this spatial data, we need it to be represented as sp objects. arc.data2sp does the conversion for us:
df.as.sp <- arc.data2sp(filtered.df)
arc.sp2data inverts this process, taking sp objects and generating ArcGIS compatible data frames.
Once we've finished our work in R, we want to get the data back to ArcGIS. We write our results to a new feature class with arc.write:
arc.write('data.gdb/new_features', results.df)
WKT to proj.4 conversion:
arc.fromP4ToWkt, arc.fromWktToP4
Interacting directly with geometries:
arc.shapeinfo, arc.shape2sp
Geoprocessing session specific:
arc.progress_pos, arc.progress_label, arc.env (read only)
tool_exec <- function(in_params, out_params) {
  # the first input parameter, as a character vector
  input.features <- in_params[[1]]
  # alternatively, can access by the parameter name:
  input.features <- in_params$input_features
  print(input.features)
  # ... next, do analysis steps (producing results.dataset)
  # this will be returned as the "Output Graphs" parameter.
  out_params[[1]] <- plot(results.dataset)
  return(out_params)
}
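Because in_params and out_params behave like named lists, a tool function can be exercised from a plain R session before wiring it into a toolbox. The sketch below is a stripped-down stand-in, not the tool above; the parameter names input_features and output_features are hypothetical.

```r
# a stripped-down tool function; parameter names are hypothetical
tool_exec <- function(in_params, out_params) {
  input.features <- in_params$input_features
  # stand-in for real analysis: derive an output name from the input
  out_params$output_features <- paste0(input.features, "_result")
  return(out_params)
}

# call it the way a geoprocessing tool would, but with plain lists
result <- tool_exec(
  in_params  = list(input_features = "data.gdb/features"),
  out_params = list()
)
print(result$output_features)
```

Testing this way keeps the edit and run loop inside R and leaves only the parameter wiring to check in ArcGIS.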
Looking for a package to solve a problem? Use the CRAN Task Views.
Tons of good books and resources on R are available. Check out the RSeek search engine to find them, since resources for the language can be difficult to locate because of its name.
An Introduction to Statistical Learning (PDF, website): a free and accessible version of the classic in the field, Elements of Statistical Learning.
Courses:
Books:
Clustering demo covers mclust and sp.
iOS, Android: Feedback from within the app
Windows Phone, or no smartphone? Cuneiform tablets accepted.