What’s a data scientist?
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
— Josh Wills
We geographic folks also rely on knowledge from multiple domains. We know that spatial is more than just an x and y column in a table, and we know how to get value out of this data.
Python (SciPy stack, Jupyter, scikit-learn, …)
R (ML task view)

Industry standard for package management in the data science context, built by Continuum Analytics
Started with Python, but as shown in the R segment of the plenary, it can be used to support R, and hybrid workflows which connect multiple languages.
CRAN: 6400 packages for solving problems
Data types you’re used to seeing…
Numeric - Integer - Character - Logical - timestamp
… but others you probably aren’t:
vector - matrix - data.frame - factor
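For readers new to these types, the following is a minimal sketch in base R. The variable names and values are made up for illustration and are not part of the original material.

```r
# A vector holds values of a single type.
elevations <- c(120.5, 98.2, 143.0)

# A matrix is a vector with two dimensions (filled column by column).
m <- matrix(1:6, nrow = 2, ncol = 3)

# A factor stores categorical data as integer codes plus labels.
land.use <- factor(c("urban", "rural", "urban"))

# A data.frame is a table whose columns can have different types.
sites <- data.frame(elevation = elevations, land.use = land.use)

class(elevations)   # type of the vector
dim(m)              # rows and columns of the matrix
levels(land.use)    # distinct categories of the factor
str(sites)          # compact summary of the data frame
```

Factors are the type most likely to surprise newcomers: they look like text but behave like categories, which is what most modeling functions expect.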
# Create a data frame out of an existing source
df.from.csv <- read.csv(
  "data/growth.csv",
  header=TRUE)
# Create a data frame from scratch
quarter <- c(2, 3, 1)
person <- c("Goodchild",
            "Tobler",
            "Krige")
met.quota <- c(TRUE, FALSE, TRUE)
df <- data.frame(person,
                 met.quota,
                 quarter)
R> df
     person met.quota quarter
1 Goodchild      TRUE       2
2    Tobler     FALSE       3
3     Krige      TRUE       1
sp Types
SpatialPoints
SpatialLines
SpatialPolygons
Entity + Attribute model
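The entity + attribute idea can be sketched with the sp package: geometry (the entity) is held separately from a data frame of attributes, and the two are combined into one object. The coordinates below are hypothetical placeholders, and the coordinate reference system is left unset to keep the example short.

```r
library(sp)

# Entity: point locations (hypothetical coordinates, for illustration only).
coords <- cbind(x = c(0, 1, 2),
                y = c(0, 1, 4))

# Attribute: one row of attributes per point.
attrs <- data.frame(person = c("Goodchild", "Tobler", "Krige"),
                    met.quota = c(TRUE, FALSE, TRUE))

# Geometry only.
pts <- SpatialPoints(coords)

# Geometry plus attributes.
pts.df <- SpatialPointsDataFrame(pts, data = attrs)

# Subset like a data frame; the matching geometry comes along.
met <- pts.df[pts.df$met.quota, ]
coordinates(met)
```

Because the attributes stay attached to their geometry, ordinary data frame operations such as subsetting keep the spatial data consistent.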
Store your data in ArcGIS, access it quickly in R, return R objects back to ArcGIS native data types (e.g. geodatabase feature classes).
Knows how to convert spatial data to sp objects.
| ArcGIS | R | Example Value |
|---|---|---|
| Address Locator | Character | Address Locators\\MGRS |
| Any | Character | |
| Boolean | Logical | |
| Coordinate System | Character | "PROJCS[\"WGS_1984_UTM_Zone_19N\"... |
| Dataset | Character | "C:\\workspace\\projects\\results.shp" |
| Date | Character | "5/6/2015 2:21:12 AM" |
| Double | Numeric | 22.87918 |
| Extent | Vector (xmin, ymin, xmax, ymax) | c(0, -591.561, 1000, 992) |
| Field | Character | |
| Folder | Character | full path, use with e.g. file.info() |
| Long | Integer | 19827398L |
| String | Character | |
| Text File | Character | full path |
| Workspace | Character | full path |
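Since several ArcGIS types arrive in R as plain character or numeric values, a little post-processing is often useful. The sketch below uses only base R and the example values from the table; the month/day/year ordering of the date, the UTC time zone, and the variable names are assumptions made for illustration.

```r
# Example values copied from the table above.
date.chr <- "5/6/2015 2:21:12 AM"
extent   <- c(0, -591.561, 1000, 992)
dataset  <- "C:\\workspace\\projects\\results.shp"

# Dates arrive as character; parse them before doing date arithmetic.
# Assumes month/day/year order, UTC, and a locale that understands AM/PM.
date.parsed <- as.POSIXct(date.chr,
                          format = "%m/%d/%Y %I:%M:%S %p",
                          tz = "UTC")

# Name the extent elements so they are harder to mix up.
names(extent) <- c("xmin", "ymin", "xmax", "ymax")
width  <- extent[["xmax"]] - extent[["xmin"]]
height <- extent[["ymax"]] - extent[["ymin"]]

# Paths are plain character strings, so string helpers apply.
extension <- tools::file_ext(dataset)
```

Naming the extent vector is a small design choice that makes later code read as `extent[["xmax"]]` rather than `extent[3]`.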
Start by loading the library and initializing the connection to ArcGIS:
# load the ArcGIS-R bridge library
library(arcgisbinding)
# initialize the connection to ArcGIS. Only needed when running directly from R.
arc.check_product()
First, select a data source (can be a feature class, a layer, or a table):
input.fc <- arc.open('data.gdb/features')
Then, filter the data to the set you want to work with (creates in-memory data frame):
filtered.df <- arc.select(input.fc,
                          fields=c('fid', 'mean'),
                          where_clause="mean < 100")
This creates an ArcGIS data frame – looks like a data frame, but retains references back to the geometry data.
Now, if we want to do analysis in R with this spatial data, we need it to be represented as sp objects. arc.data2sp does the conversion for us:
df.as.sp <- arc.data2sp(filtered.df)
arc.sp2data inverts this process, taking sp objects and generating ArcGIS compatible data frames.
Once we are finished with our work in R, we want to get the data back into ArcGIS. Write the results to a new feature class with arc.write:
arc.write('data.gdb/new_features', results.df)
WKT to proj.4 conversion:
arc.fromP4ToWkt, arc.fromWktToP4
Interacting directly with geometries:
arc.shapeinfo, arc.shape2sp
Geoprocessing session specific:
arc.progress_pos, arc.progress_label, arc.env (read only)
ggplot2, scales, dplyr, devtools, many others
fit.results <- lm(pollution ~ elevation + rainfall + ppm.nox + urban.density)
caret for model specification consistency
I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature.
— Donald Knuth, “Literate Programming”
RMarkdown, Roxygen2
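née IPython
dplyr Package
library(dplyr)
library(Lahman)  # assumption: Batting is the table shipped with the Lahman package
# %>% replaces the older %.% chain operator, which has been removed from dplyr
Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)
shiny, but R is first and foremost a language that expects fluency from its users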
Looking for a package to solve a problem? Use the CRAN Task Views.
Tons of good books and resources on R are available. Check out the RSeek search engine to find them, since a language with a one-letter name is hard to search for.
An Introduction to Statistical Learning (PDF, website): a free and accessible version of the classic in the field, Elements of Statistical Learning.
Courses:
Books:
Clustering demo covers mclust and sp.
iOS, Android: Feedback from within the app
Windows Phone, or no smartphone? Cuneiform tablets accepted.