Introduction to predictive modeling and data mining
Rebecca C. Steorts
Predictive Modeling and Data Mining: STA 521
August 25, 2015
Today's Menu
1. Brief history of data science (from slides of Bin Yu)
2. Motivation for this course
3. What is predictive modeling and data science?
4. The boring bits (but they're important)
Data science [Bin Yu, IMS Presidential Address, 2014]
More about data science and how it's relevant today...
Data science, today [Credit: Jenny Bryan]
So what is the class all about?
What is data mining?
Data mining is the science of discovering structure and making predictions in (large) data sets.
- Unsupervised learning: discovering structure. E.g., given measurements X_1, ..., X_n, learn some underlying group structure based on similarity.
- Supervised learning: making predictions. I.e., given measurements (X_1, Y_1), ..., (X_n, Y_n), learn a model to predict Y_i from X_i.
Note: Hidden underneath is the idea of prediction (which we will get into). As we have talked about, terms like "data science" and "data mining" are just the sexier names!
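To make the distinction concrete, here is a minimal R sketch on simulated data (the data and variable names are purely illustrative, not from the course):

    set.seed(1)

    ## Unsupervised: only measurements X; look for group structure.
    ## Simulate 100 two-dimensional points from two loose clusters.
    X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))
    km <- kmeans(X, centers = 2)     # k-means groups points by similarity
    table(km$cluster)                # sizes of the discovered groups

    ## Supervised: pairs (X, Y); learn a model to predict Y from X.
    x <- runif(100)
    y <- 2 * x + rnorm(100, sd = 0.3)
    fit <- lm(y ~ x)                                # fit a linear model
    predict(fit, newdata = data.frame(x = 0.5))     # predict Y at a new X

Here k-means recovers the two simulated groups without ever seeing labels, while lm uses the observed pairs to predict the response at a new input.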
Google: Search, Ads, Gmail, Chrome
Facebook: People You May Know
Netflix: the $1M prize!
eHarmony: falling in love with statistics
FICO: an algorithm that could cause a lot of grief
FlightCaster: apparently it's even used by airlines themselves
IBM's Watson: a combination of many things, including data mining
Handwritten postal codes (from ESL, p. 404): we could have robot mailmen someday
Subtypes of breast cancer: subtypes of breast cancer based on wound response
Predicting Alzheimer's disease (from Raji et al. (2009), "Age, Alzheimer's disease, and brain structure"): can we predict Alzheimer's disease years in advance?
Kaggle 2015 Challenge: a competition to find interesting information in Census data and maps.
What to expect
Expect to be able to deal with messy data and to write code that is reproducible.
Why? Real applied problems are messy, and for others to understand how you attacked something, it's important that your method, process, and code be accessible, well documented, and reproducible.
- You can't always open up R, download a package, and get a reasonable answer.
- Real data is messy and always presents new complications.
- Understanding why and how things work is a necessary precursor to figuring out what to do.
Recurring themes
- Exact approach versus approximation: often when we can't do something exactly, we'll settle for an approximation, which can perform well and scales well computationally to large problems.
- Bias-variance tradeoff: nearly every modeling decision is a tradeoff between bias and variance; higher model complexity means lower bias and higher variance (a small illustration follows below).
- Interpretability versus predictive performance: there is also usually a tradeoff between a model that is interpretable and one that predicts well under general circumstances.
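A rough R sketch of the bias-variance tradeoff on simulated data (illustrative only, not from the slides): a low-complexity fit and an over-flexible one are compared on the training data and on fresh data from the same process.

    set.seed(1)
    n <- 30
    x <- seq(0, 1, length.out = n)
    y <- 2 * x + rnorm(n, sd = 0.3)        # truth is (roughly) linear plus noise

    fit_lo <- lm(y ~ x)                    # low complexity: more bias, less variance
    fit_hi <- lm(y ~ poly(x, 20))          # high complexity: less bias, more variance

    ## Training error can only go down as complexity grows...
    mean(resid(fit_lo)^2)
    mean(resid(fit_hi)^2)

    ## ...but on fresh data from the same process the flexible fit
    ## typically pays for its extra variance.
    x_new <- runif(500)
    y_new <- 2 * x_new + rnorm(500, sd = 0.3)
    mean((y_new - predict(fit_lo, data.frame(x = x_new)))^2)
    mean((y_new - predict(fit_hi, data.frame(x = x_new)))^2)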
There's not a universal recipe book
Unfortunately, there's no universal recipe book for when and in what situations you should apply certain data mining methods. Statistics doesn't work like that. Sometimes there's a clear approach; sometimes there is a good amount of uncertainty in what route should be taken. That's what makes it so hard, and so fun.
This is true even at the expert level (and there are even larger philosophical disagreements spanning whole classes of problems).
The best you can do is try to understand the problem, understand the proposed methods and the assumptions they make, and find some way to evaluate their performance.
Hopefully you're still awake
What do I need to know about the course?
Course staff:
- Instructor: Rebecca Steorts (you can call me "Beka" or "Professor Steorts", please not "Professor")
- TAs: Abbas Zaidi and Yikun (Joey) Zhou
Why are you here?
- Because you love the subject, because it's required, because you eventually want to make $$$ ...
- No matter the reason, everyone can get something out of the course.
- Work hard and have fun!
Culture of the class
- Teaching you to fish (versus giving you one).
- It's amazing what a determined individual can learn from documentation, small learning examples, and ... "gasp" ... Googling. And also Stack Overflow.
- Rewarding engagement, intellectual generosity, and curiosity.
- Speaking up, sharing success OR failure, and showing some interest in something will earn marks.
- Zero tolerance of plagiarism!
- Generating your own ideas, your own code, and finding your own way is a big reason you're here. The process is much more important than simply getting to the end point or product.
Logistics/Grading
- Two lectures a week: concepts, methods, examples.
- Lab to try stuff out and get fast feedback (10%).
- Participation, creativity in new ideas, sharing in your successes and failures, etc. (5%).
- Weekly HW to do longer and more complex things (35%).
- Midterm exam (25%).
- Final project in groups of 2-3; it will be fun! (25%)
Prerequisites:
- Assuming you know basic probability and statistics, linear algebra, and R programming (see the syllabus for a topics list).
Textbooks:
- Course textbook: Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. Get it online at http://www-bcf.usc.edu/~gareth/ISL.
- Bayesian Essentials with R by Marin and Robert. Please order this one (we won't need it for about two weeks).
- More advanced textbook: Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Also available online at http://www-stat.stanford.edu/ElemStatLearn.
Markdown, RStudio, and LaTeX
You must type all assignments and take-home exams in Markdown (we will talk about submissions in class). All code must be written in RStudio.
https://guides.github.com/features/mastering-markdown/
https://rstudio-pubs-static.s3.amazonaws.com/18858_0c289c260a574ea08c0f10b944abc883.html
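For those new to R Markdown, a homework write-up created in RStudio might look roughly like the skeleton below (an illustrative sketch, not a required template); knitting it interleaves your prose, code, and output in one reproducible document.

    ---
    title: "Homework 1"
    author: "Your Name"
    output: html_document
    ---

    Describe your approach in plain Markdown text here.

    ```{r}
    # R code chunks are run when the document is knit,
    # so your results are reproducible from the source file.
    x <- rnorm(100)
    summary(x)
    ```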
Turning in Your Own Work
- You may work together on homework. In fact, you should. You'll learn a great deal.
- But... all code, write-ups, etc. must be your own work and not shared.
- All write-ups must be your own work and not shared or copied in any manner.
- All take-home exams or projects that are not collaborative are to be your own work. You may not work together on these.
- If I find out that any assignment is not your own, you will receive a 0 on the assignment and you will be reported per the university's policy on cheating and plagiarism.
- You will submit all work online, which Abbas will explain.
Setting up for success
1. R and RStudio
2. Intro to RStudio and Markdown: first lab
3. git and Bitbucket
Quick intro to git
Install git and set up a Bitbucket account.
1. git init
2. git add file.txt
3. git log
4. git status
5. git commit -a -m "here are my changes."
6. git push
You will understand more complicated git commands in your lab (branching, merging, uploading a file, etc.). Each command is annotated below.
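Roughly what each of these commands does (an annotated sketch of the same workflow; it assumes you have already created a Bitbucket repository and connected it as a remote, which the lab will walk through):

    git init                                  # create a new local repository
    git add file.txt                          # stage file.txt for the next commit
    git log                                   # view the commit history (once you have commits)
    git status                                # see what is staged, modified, or untracked
    git commit -a -m "here are my changes."   # record the changes with a message
    git push                                  # upload your commits to the remote (e.g., Bitbucket)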
Who am I?
- Assistant professor and affiliated faculty at SSRI and iiD.
- I specialize in record linkage and dimension reduction methods and algorithms, with applications to human rights conflicts, official statistics, social networks, medical databases, and many others.
- My work focuses on Bayesian methods, machine learning, and scalable algorithms (intensive computing).
- PhD in 2012 from the University of Florida; finished a Visiting Assistant Professorship at CMU in 2015.
- This is my first semester at Duke and my second time teaching the course, but I'm changing many things! Very excited to be here.
Next time: RStudio, Markdown, and git.