2018 Predictive Analytics Symposium - Session 33: Commercializing a Data Science Model as Application Programming Interface (API) or Batch Service
Commercializing a Data Science Model as API or Batch Service
Jeffrey Heaton, Ph.D. and Ed Deuser
September 2018
Presenters
Jeffrey Heaton, Ph.D. – Lead Data Scientist - RGA
Jeff Heaton is a lead data scientist at Reinsurance Group of America (RGA), an adjunct instructor for the Sever Institute at Washington University, and the author of several books about artificial intelligence. Jeff holds a Master of Information Management (MIM) from Washington University and a Ph.D. in computer science from Nova Southeastern University. Over twenty years of experience in all aspects of software development allows Jeff to bridge the gap between complex data science problems and proven software development. Working primarily with the Python, R, Java/C#, and JavaScript programming languages, he leverages frameworks such as TensorFlow, Scikit-Learn, NumPy, and Theano to implement deep learning, random forests, gradient boosting machines, support vector machines, t-SNE, and generalized linear models (GLM). Jeff holds numerous certifications and credentials, such as the Johns Hopkins Data Science certification, Fellow of the Life Management Institute (FLMI), ACM Upsilon Pi Epsilon (UPE), and a senior membership with IEEE. He has published his research through peer-reviewed papers with the Journal of Machine Learning Research and IEEE.
Ed Deuser – Technical Architect and Developer - RGA
Ed Deuser is a Technical Architect with RGA Reinsurance Company. In this role, Ed is responsible for technical solutions that support RGA’s global business units, including Valuation, Financial Solutions, Underwriting, and Global Research, Development and Analytics. He also served as the technical lead for B3i, the Blockchain Insurance Industry Initiative, and guides other digital objectives for RGA. In addition to his experience in the insurance sector, Ed has worked in financial services, government and law enforcement. Accomplished in the emerging field of distributed ledger technology, Ed has participated in RGA-sponsored hackathons as a coach and was part of the winning team at the Office of the National Coordinator (ONC) for Health Information Technology’s first-ever hackathon. Ed received his Bachelor of Science in Information Systems from the University of Missouri–St. Louis. His article “From R Studio to Real-Time Operations,” which he co-authored with RGA Lead Data Scientist Jeff Heaton, was published in the December 2017 issue of the Society of Actuaries’ Predictive Analytics and Futurism Section newsletter.
RGA Reinsurance Company. The security of experience. The power of innovation. www.rgare.com
Operational Readiness
Readiness occurs throughout the project, most importantly when it starts.
• End User Journey – Contract and Service Level Agreement (SLA)
• Security is the first and last thing we think of.
• Agreed-on patterns of use:
o Batch
o Real Time
o Web
Diagram labels: Project Inception, Project Execution, Workload Reality, Project at Risk, Project Failure
Contract Management
End User's Journey to a Delivered Service Level Agreement (SLA)
• Clear expectation management in contractual terms
• End user journey and expectations
• Standard service level agreement as the basis
Security in Depth
Should be the first and last thing we think of
• Threat Modeling
o How could it be compromised?
o How to protect compromised sections?
• Logging, Monitoring and Alerting
o Forensic logging of the item to be protected and where it is housed.
o Monitor and alert on suspicious activities and logs.
• Pen Testing
o Contract with someone to ensure the item is protected.
“According to Microsoft, the potential cost of cyber-crime to the global community is a mind-boggling $500 billion, and a data breach will cost the average company about $3.8 million.”
API in English Please
What is an API?
API stands for Application Programming Interface.
Diagram: Cohort – 100 (id, gender, conditions) → API → Compute Score → API → Cohort – 100 (id, gender, conditions, score)
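As a minimal sketch of the diagram above, the API can be thought of as a function that accepts a cohort and returns the same cohort with a score attached to each member. The field names, the condition weights, and the scoring rule here are hypothetical placeholders, not the presenters' model.

```python
import json

# Hypothetical weights, for illustration only; a real model would be fitted from data.
CONDITION_WEIGHTS = {"diabetes": 0.30, "hypertension": 0.20}


def compute_score(member: dict) -> float:
    """Placeholder rule: sum the weights of known conditions, capped at 1.0."""
    total = sum(CONDITION_WEIGHTS.get(c, 0.0) for c in member.get("conditions", []))
    return round(min(total, 1.0), 2)


def score_cohort(request_body: str) -> str:
    """Accept a JSON cohort and return the same cohort with a score per member."""
    cohort = json.loads(request_body)
    scored = [{**member, "score": compute_score(member)} for member in cohort]
    return json.dumps(scored)


if __name__ == "__main__":
    request = json.dumps([
        {"id": 1, "gender": "F", "conditions": ["diabetes"]},
        {"id": 2, "gender": "M", "conditions": []},
    ])
    print(score_cohort(request))
```

The caller only needs to know the input and output shapes; how the score is computed stays behind the interface.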
Model Development Methodology
Model Development Methodology
• Model Scoping and Business Understanding
• Data Understanding
• Data Discovery and Enrichment
• Model Fitting / Validation
• Model Deployment
Input Format for Model
For an API, data input must be very standardized
Clients tend to vary the format of input data during model development.
• Columns provided might change.
• Column names might change.
• Date formats may not be consistent.
For an automated API, this format must become consistent.
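One way to make the format consistent is to normalize each incoming record before it reaches the model. The sketch below is illustrative only: the column aliases, the canonical names, and the accepted date formats are assumptions, and a real service would take them from the agreed input specification.

```python
from datetime import datetime

# Hypothetical mapping from client column-name variants to canonical names.
COLUMN_ALIASES = {"dob": "birth_date", "date_of_birth": "birth_date", "sex": "gender"}

# Date formats this sketch accepts. "%m/%d/%Y" assumes US month-first ordering.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def canonical_name(column: str) -> str:
    """Lower-case, replace spaces with underscores, then apply known aliases."""
    key = column.strip().lower().replace(" ", "_")
    return COLUMN_ALIASES.get(key, key)


def parse_date(text: str) -> str:
    """Return the date in ISO 8601 form, or raise if no accepted format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def normalize_record(record: dict) -> dict:
    """Rename columns to canonical names and standardize the birth date."""
    clean = {canonical_name(key): value for key, value in record.items()}
    if "birth_date" in clean:
        clean["birth_date"] = parse_date(clean["birth_date"])
    return clean
```

Rejecting an unrecognized date, rather than guessing, surfaces format changes to the client early.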
Use Excel as a Tool, Not a Format
For tabular data, we prefer CSV (UTF-8)
Excel is a powerful data exploration tool for rapid analysis. However, Excel can be a problematic data exchange format.
• Inability to specify export encoding (UTF-8, Unicode, etc.).
• Excel often mangles input by inferring data format, such as treating SNOMED codes as numbers.
• Different tools generate Excel files differently.
• There are many more ways to confuse automated imports with Excel than with CSV.
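As a small sketch of why CSV is easier to import safely, Python's standard `csv` module returns every field as text, so long numeric-looking codes are not reinterpreted as numbers. The column names and the sample file layout are assumptions made for illustration.

```python
import csv
import io


def read_rows(csv_text: str) -> list:
    """Parse CSV text; every field is returned as a string (no type inference)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def read_file(path: str) -> list:
    """Read a UTF-8 CSV file; utf-8-sig also tolerates a leading byte-order mark."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical two-column extract with a long code value.
    sample = "id,code\n1,900000000000207008\n"
    for row in read_rows(sample):
        print(row["id"], row["code"])
```

Keeping codes as strings end to end avoids the precision loss that occurs when an 18-digit identifier is converted to a floating-point number.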
Input Format for Model
JSON, CSV, or XML?
Input from the client is usually in JSON, XML, or CSV format.
For real-time APIs we prefer JSON/XML
• JSON and XML provide a hierarchical view of data.
• JSON and XML do not always easily fit into Excel.
For batch, we generally prefer CSV (sometimes Excel)
• CSV and Excel both store data in tabular format.
The XML Format
Verbose and Hierarchical
The JSON Format
Concise and JavaScript-like
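To illustrate the two formats side by side, the following sketch serializes one hypothetical member record both ways using only the Python standard library. The record and the element names are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical member record used only for this comparison.
record = {"id": 1, "gender": "F", "conditions": ["diabetes", "hypertension"]}


def to_json(member: dict) -> str:
    """Serialize the record as JSON; the list nests directly."""
    return json.dumps(member)


def to_xml(member: dict) -> str:
    """Serialize the same record as XML; each value needs its own element."""
    root = ET.Element("member")
    ET.SubElement(root, "id").text = str(member["id"])
    ET.SubElement(root, "gender").text = member["gender"]
    conditions = ET.SubElement(root, "conditions")
    for name in member["conditions"]:
        ET.SubElement(conditions, "condition").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_json(record))
    print(to_xml(record))
```

Both outputs carry the same hierarchy; the XML form repeats each name in an opening and a closing tag, which is where the extra length comes from.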
Data Discovery and Enrichment
Augmenting the input data with additional data sources
Client input data usually will not contain all necessary information for a model.
• If the identity of the individual is known (PII), we might augment with:
o 3rd party marketing data on the individual.
o 3rd party credit data on the individual.
• If the identity of the individual is unknown (PII-less), we might augment with:
o RGA severity scores for drugs or medical diagnoses.
o RGA mortality tables.
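A minimal sketch of PII-less enrichment, assuming a hypothetical drug severity lookup: each member record gains a derived feature from a reference table before scoring. The table contents, the `drugs` field, and the `max_drug_severity` feature name are invented for illustration and do not reflect any actual RGA data.

```python
# Hypothetical severity lookup, standing in for a proprietary reference table.
SEVERITY_BY_DRUG = {"metformin": 2, "insulin": 4}

# Assumption: drugs missing from the table contribute no severity.
DEFAULT_SEVERITY = 0


def enrich(member: dict) -> dict:
    """Return a copy of the member with a max_drug_severity feature added."""
    severities = [
        SEVERITY_BY_DRUG.get(drug.lower(), DEFAULT_SEVERITY)
        for drug in member.get("drugs", [])
    ]
    return {**member, "max_drug_severity": max(severities, default=DEFAULT_SEVERITY)}
```

Returning a copy leaves the client's original record untouched, which makes it easier to audit what was supplied versus what was added.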
Model Fitting
Teaching a model from data
Model fitting is where a data scientist trains a model based on data. Fitting is usually a very manual process that can go on for days, weeks, or months. The final output from fitting is a model that can be deployed for client use.
Model Deployment
Making your model available to clients
How will your model be used?
• Will the model be used directly by individual human users?
• Will the model be integrated into a system developed by the client’s IT?
• Will the model be used as part of a client’s mobile application?
• Will users or the client upload files to be scored?
Manual steps from fitting must be automated. Input data must be checked for errors.
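Checking input for errors can be sketched as a validation step that runs before scoring and separates usable records from rejected ones. The required fields and allowed gender codes below are hypothetical; a production service would derive them from the agreed input specification.

```python
# Hypothetical schema for illustration.
REQUIRED_FIELDS = ("id", "gender", "conditions")
ALLOWED_GENDERS = {"F", "M", "U"}


def validate(member: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record is usable."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in member:
            errors.append(f"missing field: {field}")
    if "gender" in member and member["gender"] not in ALLOWED_GENDERS:
        errors.append(f"invalid gender: {member['gender']!r}")
    if "conditions" in member and not isinstance(member["conditions"], list):
        errors.append("conditions must be a list")
    return errors


def partition(cohort: list) -> tuple:
    """Split a cohort into records safe to score and records rejected with reasons."""
    accepted, rejected = [], []
    for member in cohort:
        problems = validate(member)
        if problems:
            rejected.append({"record": member, "errors": problems})
        else:
            accepted.append(member)
    return accepted, rejected
```

Reporting every problem per record, rather than stopping at the first, lets a client fix a batch file in one pass.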
Personally Identifiable Information (PII) and Data Retention
What data should we retain? (and where)
• Some input data contains PII; other input data does not.
• Some clients request us to retain no data.
• We prefer to keep some data.
• We usually do not store PII data on the model side.
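One simple way to keep PII off the model side is to drop identifying fields before a record is retained. This is only a sketch: the list of PII field names is an assumption, and what counts as PII depends on the data, the client contract, and applicable regulation.

```python
# Hypothetical list of identifying fields; the real list depends on the data and contract.
PII_FIELDS = {"name", "ssn", "address", "birth_date"}


def strip_pii(record: dict) -> dict:
    """Return a copy of the record without the fields listed in PII_FIELDS."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}
```

For example, `strip_pii({"id": 7, "name": "Pat", "score": 0.2})` keeps only `id` and `score`.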
Ongoing Model Validation
Keeping the model relevant
• Client data distributions can change over time.
• Baseline truth can change.
• Models must be evaluated over time to ensure they remain relevant.
• Calibration is an ongoing process.
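The slide does not name a specific monitoring technique. As one common option, the sketch below computes a Population Stability Index (PSI) that compares the distribution of a feature or score at fitting time with its distribution in recent client data. The bin edges and the floor value are assumptions, and the threshold at which drift triggers recalibration would need to be chosen per model.

```python
import math


def bin_shares(values, edges):
    """Share of values in each bin; edges are ascending cut points."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for value in values:
        index = sum(1 for edge in edges if value >= edge)
        counts[index] += 1
    return [count / len(values) for count in counts]


def population_stability_index(baseline, current, edges, floor=1e-6):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline)."""
    total = 0.0
    for base, cur in zip(bin_shares(baseline, edges), bin_shares(current, edges)):
        # Floor the shares so an empty bin does not produce log(0) or division by zero.
        base, cur = max(base, floor), max(cur, floor)
        total += (cur - base) * math.log(cur / base)
    return total
```

Identical distributions give a PSI of zero, and the value grows as the recent data moves away from the baseline.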
Partnerships
Know Your Strengths
Partnerships in place to ensure success
Questions to ask:
• Do you have data scientists in your organization?
• Are you experienced in cloud deployments?
• Can you sustain the DevOps practice?
• Do you understand where your attack vectors are?
Types of partnerships:
• Internal: partnering with different parts of your organization.
• External: e.g., staff augmentation or a client partner (e.g., RGA).
Example Commercialization
Commercialization Example
Example models
• SwaggerHub – Create an API first: what's on the menu.
• Upload the API to API Gateway on AWS.
• Pre-templated NodeJS Lambda to compute the score on a cohort.
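The slide describes a NodeJS Lambda behind API Gateway. For illustration only, the sketch below shows the same shape in Python: a handler that reads a JSON cohort from the request body of an API Gateway proxy event and returns a proxy-style response. It is not the presenters' template, and the scoring rule is a hypothetical placeholder.

```python
import json

JSON_HEADERS = {"Content-Type": "application/json"}


def compute_score(member: dict) -> float:
    """Placeholder rule (hypothetical): 0.1 per listed condition, capped at 1.0."""
    return round(min(0.1 * len(member.get("conditions", [])), 1.0), 2)


def handler(event, context):
    """Handle an API Gateway proxy event whose body is a JSON array of members."""
    try:
        cohort = json.loads(event.get("body") or "[]")
        if not isinstance(cohort, list):
            raise ValueError("request body must be a JSON array")
        scored = [{**member, "score": compute_score(member)} for member in cohort]
    except (ValueError, TypeError, AttributeError) as exc:
        # Malformed JSON or non-object members are reported as a client error.
        return {
            "statusCode": 400,
            "headers": JSON_HEADERS,
            "body": json.dumps({"error": str(exc)}),
        }
    return {"statusCode": 200, "headers": JSON_HEADERS, "body": json.dumps(scored)}
```

Defining the request and response shapes first, as the API-first bullet suggests, means the handler body can change without clients noticing.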
Questions
Appendix
Resources to use for creating your own API
Disclaimer: The resources provided are intended for educational purposes only and do not replace independent professional judgment. Statements of fact and opinions expressed are those of the participants individually and, unless expressly stated to the contrary, are not the opinion or position of Reinsurance Group of America, its cosponsors, or its committees. Reinsurance Group of America does not endorse or approve, and assumes no responsibility for, the content, accuracy or completeness of the information presented. The above resources do not provide all recommended security measures; because appropriate security measures are not provided, use them freely at your own risk.
https://github.com/eddeuser2017/commercialize_api