Highlights of EARL 2018 - Adnan Fiaz Julian Ferry Hannah Frick Dragoș Moldovan-Grünfeld - LondonR
EARL London 2018 • 5th EARL London Conference • 3 Keynote speakers • 5 Workshops • 3 Streams • 56 Presentations – lightning talks for the first time • 1 Panel Discussion • 2 Evening Networking Events
The Workshops • R in 6 Hours • Shiny – Beyond the Basics • Deep Learning with Keras in R • A Crash Course in Python for R Users • Functional Programming with purrr
Data Driven Decision-Making Adnan Fiaz
Data Driven Decision-making Keynotes: • Winning in a data-driven world, Edwina Dunn • Building a Data Driven Company, Rich Pugh Talks: • Decision Lead Data Science, Steven Wilkins • A brief history of Data at Autotrader, Paul Owens • R – The tool for Screwfix, Gavin Jackson
Have a (data) strategy “Focus on the data you need rather than the data you have” – Edwina Dunn “Know how to build the ‘engine’, now it needs to drive the car” – Rich Pugh
Not madmen but math (wo)men “A key differentiator for businesses…is a culture of continuous learning” – Edwina Dunn “The key role of data scientists in the coming years is one of educator” – Rich Pugh
Special mention Finding out what Parliament thinks, Sam Tazzyman (Ministry of Justice) • Explaining complex topics simply • Show your code in action (and link to it) • Why so serious?
Machine Learning Julian Ferry Hannah Frick
Balancing model complexity and interpretability In defence of complexity: • The power of machine learning in segmenting CRM databases, Jeremy Horne • The making of a real-world Moneyball – finding undervalued players with h2o, Jo-Fai Chow In defence of interpretability: • Understanding your model, Kasia Kulma • Measuring Marketing Performance, Wojtek Kostelecki
Complex models in CRM segmentation - Jeremy Horne • How do we identify the customers on a CRM database who are most likely to make a purchase this month? • Most databases are dominated by lower value segments
Separating low value segments • Tools used: – kernlab package – Boosting to focus on outliers – outcomes that are not ‘normal’ Key takeaway: Machine learning models can help us differentiate between customers within the same group, where decision-tree type rules fail.
In defence of interpretability – Kasia Kulma
LIME – Local Interpretable Model-Agnostic Explanations
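The idea behind a local, model-agnostic explanation can be illustrated with a hand-rolled sketch in base R: perturb the inputs around one observation, query the opaque model, weight the samples by proximity, and fit a simple linear model to that neighbourhood. This is a simplified illustration of the concept only, not the lime package's API and not code from the talk; the black-box function, sample size, and kernel width are assumptions chosen for the example.

```r
# A hand-rolled local surrogate in base R. This is NOT the lime package's
# API; the black-box function, kernel width, and sample size are assumptions.
set.seed(1)

# Stand-in for an opaque model: nonlinear in both features.
black_box <- function(x1, x2) sin(x1) + x2^2

# The single observation whose prediction we want to explain.
x0 <- c(x1 = 0, x2 = 1)

# 1. Perturb: sample points in the neighbourhood of x0.
n <- 2000
samples <- data.frame(
  x1 = rnorm(n, mean = x0[["x1"]], sd = 0.5),
  x2 = rnorm(n, mean = x0[["x2"]], sd = 0.5)
)

# 2. Query the black box at every perturbed point.
samples$y <- black_box(samples$x1, samples$x2)

# 3. Weight samples by proximity to x0 (Gaussian kernel).
dist2 <- (samples$x1 - x0[["x1"]])^2 + (samples$x2 - x0[["x2"]])^2
kernel_width <- 0.3
samples$w <- exp(-dist2 / kernel_width^2)

# 4. Fit an interpretable (linear) model to the weighted neighbourhood.
surrogate <- lm(y ~ x1 + x2, data = samples, weights = w)

# The slopes approximate the black box's local sensitivities at x0.
local_effects <- coef(surrogate)[c("x1", "x2")]
round(local_effects, 2)
```

Because the surrogate is only fitted near x0, its slopes describe how the prediction responds to small changes around that one observation, not the global behaviour of the model.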
Predicting baseball player performance with h2o, Jo-Fai Chow • Problem: Finding undervalued baseball players in Major League Baseball (MLB)
End result – Shiny + LIME
The beauty of linear models, Wojtek Kostelecki • Modelling contributions to mileage
The beauty of linear models, Wojtek Kostelecki Using a linear model we can extract the individual contribution of each variable to sales
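A minimal sketch of that decomposition in base R, on synthetic data. The variable names (tv, search, price), the coefficients, and the data are invented for illustration and do not come from the talk.

```r
# Synthetic weekly data: all names and numbers are illustrative assumptions.
set.seed(42)
n <- 104
marketing <- data.frame(
  tv     = runif(n, 0, 100),   # hypothetical TV spend
  search = runif(n, 0, 50),    # hypothetical paid-search spend
  price  = runif(n, 8, 12)     # hypothetical average price
)
marketing$sales <- 200 + 1.5 * marketing$tv + 3 * marketing$search -
  10 * marketing$price + rnorm(n, sd = 5)

fit <- lm(sales ~ tv + search + price, data = marketing)

# In a linear model, each variable's contribution to a prediction is its
# coefficient multiplied by its observed value.
betas <- coef(fit)
predictors <- c("tv", "search", "price")
contributions <- sweep(as.matrix(marketing[, predictors]), 2,
                       betas[predictors], `*`)

# Baseline (intercept) plus the contributions reproduces the fitted values.
reconstructed <- unname(betas["(Intercept)"]) + rowSums(contributions)

# Share of total modelled sales attributed to the baseline and each variable.
totals <- c(baseline = unname(betas["(Intercept)"]) * n,
            colSums(contributions))
round(totals / sum(totals), 3)
```

The additive structure is what makes the attribution exact: the baseline and the per-variable contributions sum to the fitted value for every row, which is not generally true of more complex models.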
David Smith – Not Hotdog • Not Hotdog: Image recognition with R and the Custom Vision API
R code: https://github.com/revodavid/nothotdog
Lars Kjeldgaard - modelgrid • A ‘caret’-based Framework for Training Multiple Tax Fraud Detection Models • Framework for creating, managing and training multiple caret models • Pipe-friendly
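Lars Kjeldgaard - modelgrid
    library(modelgrid)
    library(dplyr)   # pull(), select(), %>%
    library(caret)   # GermanCredit data, trainControl(), resamples(), bwplot()
    data(GermanCredit)

    # resampling settings (not shown on the slide; an assumed 5-fold CV setup
    # that supports metric = "ROC")
    tr_control <- trainControl(method = "cv", number = 5,
                               summaryFunction = twoClassSummary,
                               classProbs = TRUE)

    # create model grid object with settings shared by all models
    credit_default_models <- model_grid() %>%
      share_settings(
        y = GermanCredit %>% pull(Class),
        x = GermanCredit %>% select(-Class),
        metric = "ROC",
        trControl = tr_control
      )

    # add a random forest model
    credit_default_models <- credit_default_models %>%
      add_model(model_name = "Funky Forest", method = "rf",
                tuneGrid = data.frame(mtry = c(10, 20)))

    # add an eXtreme gradient boosting model
    credit_default_models <- credit_default_models %>%
      add_model(model_name = "Big Boost", method = "xgbTree", nthread = 8)

    # train models and evaluate
    credit_default_models <- credit_default_models %>% train(.)
    credit_default_models$model_fits %>% resamples(.) %>% bwplot(.)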
Reproducibility and R in Production Dragoș Moldovan-Grünfeld
Reproducibility & R in Production • Keynote: – RMarkdown: The Bigger Picture, Garrett Grolemund, RStudio • Talks: – Beyond Prototypes. A Journey to The Production Land, Omayma Said, Freelance – Bridging the gap between Data Scientists and Engineers; using R in production, Leanne Fitzpatrick, HelloSoda
Garrett Grolemund (RStudio) • Reproducibility crisis: – “We created a cargo cult by confusing math with science. Now we must undo it.” – “Create maps, not proofs” – “Reproducibility is an opportunity”
Leanne Fitzpatrick (HelloSoda) • “Bridging the gap between Data Scientists and Engineers; using R in production” • Barriers to entry (R in production) – Engineering – Infrastructure – Data science – Cultural
Overcoming barriers • Deployment: – central to the data science process – Solution: Docker • Plumbing/integration – Solution: code as a service with Plumber • Package and dependency management – Solution: pacman
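As a rough illustration of "code as a service", the sketch below shows what a small Plumber file might look like. The model, the file name plumber.R, the /predict path, and the wt parameter are all assumptions made for the example and are not taken from the talk.

```r
# plumber.R: a minimal sketch of exposing R code as an HTTP service.
# The model, endpoint path, and parameter name are illustrative assumptions.

# Stand-in "model": a linear fit on the built-in mtcars data set.
fit <- lm(mpg ~ wt, data = mtcars)

predict_mpg <- function(wt) {
  # Untyped query-string parameters may arrive as character strings.
  wt <- suppressWarnings(as.numeric(wt))
  if (is.na(wt)) stop("wt must be numeric")
  unname(predict(fit, newdata = data.frame(wt = wt)))
}

#* Predict fuel economy from car weight (in 1000 lbs)
#* @param wt Car weight
#* @get /predict
function(wt) {
  list(mpg = predict_mpg(wt))
}

# To serve the API (requires the plumber package), run from another session:
#   plumber::plumb("plumber.R")$run(port = 8000)
# and then request http://localhost:8000/predict?wt=3
```

Keeping the prediction logic in an ordinary named function (here the hypothetical predict_mpg) means it can be unit-tested without starting a web server; the annotated function only handles the HTTP wrapping.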
Overcoming barriers (cont’d) • Reproducible framework – Solution: ProjectTemplate http://projecttemplate.net • Stability & error handling – Solution: testing & CI – testthat and usethis • Scaling – Solution: Docker • Culture – Solution: collaboration
Omayma Said (Freelance) • “Beyond Prototypes. A Journey to The Production Land” • Challenges: reproducibility, portability, and accessibility • Docker • Use/Modify available Dockerfiles • Use helper packages
Helper packages • containerit – Package an R workspace and all dependencies as a Docker container • liftr – Containerize R Markdown documents for continuous reproducibility • rize – A robust method to automagically dockerize your Shiny application
Special mention Using R and Shiny to improve hospital operations, Christian Moroy and Jonathan Bruce (Edge Health) • Predict how long operations take using R • Recommend free slots that should be filled via Shiny • Disseminate daily reports via markdown + email (from R) • Saved a predicted £4m in 2017/18
Next?
EARL US Roadshow 7 November 2018, Seattle, WA Julia Silge Data Scientist @ Stack Overflow Co-author Text Mining with R with David Robinson Co-author tidytext package
EARL US Roadshow 9 November 2018, Houston, TX Robert Gentleman Vice President of Computational Biology @ 23andMe One of the designers of the R programming language Hadley Wickham Chief Scientist @ RStudio Author of numerous books on R Prolific R package author
EARL US Roadshow 13 November 2018, Boston, MA Bob Rudis (@hrbrmstr) Chief Security Data Scientist @ Rapid7 Prolific tweeter, package author and blogger
The End