https://github.com/scw/r-columbia-2016-talk
What's a data scientist?
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
— Josh Wills
We geographic folks also rely on knowledge from multiple domains. We know that spatial is more than just an x and y column in a table, and we know how to get value out of this data.
Languages commonly used in scientific and statistical problem solving:
R
Python
Matlab
Julia
Ju PyteR = Jupyter
We're a big Python shop, so why R?
"Why can't everyone just use Python?"
≈"Why can't everyone just speak English?"
More like dialects. We speak with our Canadian friends, right?
Complementary in many workflows. People use both to get real work done.
CRAN: 8000 packages for solving problems
Includes domain-specific languages for statistics. E.g.:
fit.results <- lm(pollution ~ elevation + rain + ppm.nox + elevation:rain)
Similar properties in other parts of the language
Data types you're used to seeing...
Numeric - Integer - Character - Logical - Timestamp
... but others you probably aren't:
vector - matrix - data.frame - factor
Vector:
a.vector <- c(4, 3, 8, 7, 1, 5)
Matrix:
A <- matrix(
  c(4, 3, 8, 7, 1, 5),  # same data as above
  nrow=2, ncol=3,       # what's the shape of the data?
  byrow=TRUE)           # what order are the values in?
Data Frames:
# Create a data frame out of an existing tabular source
df.from.csv <- read.csv("data/growth.csv", header=TRUE)
# Create a data frame from scratch
quarter <- c(2, 3, 1)
person <- c("Goodchild", "Tobler", "Krige")
met.quota <- c(TRUE, FALSE, TRUE)
df <- data.frame(person, met.quota, quarter)
R> df
     person met.quota quarter
1 Goodchild      TRUE       2
2    Tobler     FALSE       3
3     Krige      TRUE       1
sp Types
SpatialPoints
SpatialLines
SpatialPolygons
Entity + Attribute model
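A minimal sketch of that entity + attribute pairing in sp, using made-up point coordinates and attributes (the names and values here are purely illustrative):

library(sp)

# entities: three made-up point locations (longitude, latitude)
coords <- matrix(c(-73.96, 40.81,
                   -73.99, 40.73,
                   -74.01, 40.71), ncol=2, byrow=TRUE)

# attributes: a plain data frame, one row per point
attrs <- data.frame(name=c("a", "b", "c"), value=c(10, 20, 30))

# pair them into a SpatialPointsDataFrame
pts <- SpatialPointsDataFrame(coords, attrs,
                              proj4string=CRS("+proj=longlat +datum=WGS84"))
summary(pts)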
ggplot2, scales, dplyr, devtools, and many others
fit.results <- lm(pollution ~ elevation + rain + ppm.nox + elevation:rain)
caret for model specification consistency
“I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature.”
— Donald Knuth, “Literate Programming”
RMarkdown, Roxygen2
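A small sketch of what that looks like, with a made-up function: Roxygen2 turns specially formatted comments (#') into package documentation, and the same prose-plus-code idea drives RMarkdown documents.

#' Convert degrees to radians.
#'
#' @param deg A numeric vector of angles in degrees.
#' @return The angles in radians.
#' @examples
#' deg2rad(180)
deg2rad <- function(deg) {
  deg * pi / 180
}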
née IPython
dplyr Package
library(dplyr)
library(Lahman)  # provides the Batting dataset
Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)
shiny, but R is first and foremost a language that expects fluency from its users.
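For completeness, a minimal shiny sketch (a hypothetical histogram app, just to show the shape of an interactive app):

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Sample size", min=10, max=500, value=100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)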
Store your data in ArcGIS, access it quickly in R, return R objects back to ArcGIS native data types (e.g. geodatabase feature classes).
The bridge knows how to convert spatial data to sp objects.
| ArcGIS | R | Example Value |
|---|---|---|
| Address Locator | Character | Address Locators\\MGRS |
| Any | Character | |
| Boolean | Logical | |
| Coordinate System | Character | "PROJCS[\"WGS_1984_UTM_Zone_19N\"... |
| Dataset | Character | "C:\\workspace\\projects\\results.shp" |
| Date | Character | "5/6/2015 2:21:12 AM" |
| Double | Numeric | 22.87918 |
| Extent | Vector (xmin, ymin, xmax, ymax) | c(0, -591.561, 1000, 992) |
| Field | Character | |
| Folder | Character | full path, use with e.g. file.info() |
| Long | Integer | 19827398L |
| String | Character | |
| Text File | Character | full path |
| Workspace | Character | full path |
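In practice this means most parameters arrive as plain R values. A quick sketch using the example values from the table above:

# Extent arrives as a numeric vector: xmin, ymin, xmax, ymax
ext <- c(0, -591.561, 1000, 992)
ext[3] - ext[1]   # width of the extent

# Folder and Workspace arrive as character paths, usable with base R
ws <- "C:\\workspace\\projects"
file.info(ws)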
Start by loading the library and initializing the connection to ArcGIS:
# load the ArcGIS-R bridge library
library(arcgisbinding)
# initialize the connection to ArcGIS. Only needed when running directly from R.
arc.check_product()
Opening data has two stages, like data cursors:
arc.open
arc.select
Similar to using arcpy.da cursors
First, select a data source (can be a feature class, a layer, or a table):
input.fc <- arc.open('data.gdb/features')
Then, filter the data to the set you want to work with (creates an in-memory data frame):
filtered.df <- arc.select(input.fc,
fields=c('fid', 'mean'),
where_clause="mean < 100")This creates an ArcGIS data frame -- looks like a data frame, but retains references back to the geometry data.
Now, if we want to do analysis in R with this spatial data, we need it to be represented as sp objects. arc.data2sp does the conversion for us:
df.as.sp <- arc.data2sp(filtered.df)
arc.sp2data inverts this process, taking sp objects and generating ArcGIS-compatible data frames.
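A sketch of the round trip: summarize or plot the sp object with standard sp tools, then hand it back to the bridge (spplot and summary are standard sp functions; 'mean' is the field selected above):

library(sp)
summary(df.as.sp)                 # bbox, projection, and attribute summary
spplot(df.as.sp, "mean")          # quick lattice plot of the 'mean' field
df.back <- arc.sp2data(df.as.sp)  # back to an ArcGIS-compatible data frame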
Once we've finished our work in R, we want to get the data back to ArcGIS. Write the results back to a new feature class with arc.write:
arc.write('data.gdb/new_features', results.df)
WKT to proj.4 conversion:
arc.fromP4ToWkt, arc.fromWktToP4
Interacting directly with geometries:
arc.shapeinfo, arc.shape2sp
Geoprocessing session specific:
arc.progress_pos, arc.progress_label, arc.env (read only)
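A short sketch of a few of these helpers (assuming the bridge has already been initialized with arc.check_product(); the progress calls only have a visible effect when run inside a geoprocessing tool):

# convert between proj.4 strings and WKT
wkt <- arc.fromP4ToWkt("+proj=longlat +datum=WGS84")
p4  <- arc.fromWktToP4(wkt)

# update the geoprocessing progressor from within a script tool
arc.progress_label("Fitting model...")
arc.progress_pos(50)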
tool_exec <- function(in_params, out_params) {
  # the first input parameter, as a character vector
  input.features <- in_params[[1]]
  # alternatively, can access by the parameter name:
  input.input <- in_params$input_features
  print(input.features)

  # ... next, do analysis steps

  # this will be returned as the "Output Graphs" parameter.
  out_params[[1]] <- plot(results.dataset)
  return(out_params)
}
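When the script tool runs in ArcGIS, the bridge calls tool_exec() with in_params and out_params as lists whose order matches the tool's parameter definitions; whatever the function assigns into out_params and returns is handed back to the geoprocessing framework.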
Looking for a package to solve a problem? Use the CRAN Task Views.
Tons of good books and resources on R are available; check out the RSeek search engine to find resources for the language, which can otherwise be difficult to locate because of its one-letter name.
An Introduction to Statistical Learning (PDF, website): a free and accessible version of the classic in the field, The Elements of Statistical Learning.
Courses: