CSC 311: Introduction to Machine Learning
CSC 311: Introduction to Machine Learning
Lecture 1 - Introduction
Amir-massoud Farahmand & Emad A.M. Andrews
University of Toronto
Intro ML (UofT) CSC311-Lec1 1 / 54
This course
Broad introduction to machine learning
- First half: algorithms and principles for supervised learning
  - nearest neighbors, decision trees, ensembles, linear regression, logistic regression, SVMs
- Unsupervised learning: PCA, K-means, mixture models
- Basics of reinforcement learning
Coursework is aimed at advanced undergrads. We will use multivariate calculus, probability, and linear algebra.
Intro ML (UofT) CSC311-Lec1 2 / 54
Course Information
Course Website: https://amfarahmand.github.io/csc311
The main source of information is the course webpage; check it regularly!
We will also use Quercus for announcements & grades. Did you receive the announcement today?
We will use Piazza for discussions. Sign up: https://piazza.com/utoronto.ca/winter2020/csc311/home
Your grade does not depend on your participation on Piazza. It is just a good way to ask questions and to discuss with your instructor, TAs, and your peers. We will only allow questions that are related to the course materials/assignments/exams.
Intro ML (UofT) CSC311-Lec1 3 / 54
Course Information
While cell phones and other electronics are not prohibited in lecture, talking, recording, or taking pictures in class is strictly prohibited without the consent of your instructor. Please ask before doing so!
http://www.illnessverification.utoronto.ca is the only acceptable form of direct medical documentation.
For accessibility services: if you require additional academic accommodations, please contact UofT Accessibility Services as soon as possible, studentlife.utoronto.ca/as.
Intro ML (UofT) CSC311-Lec1 4 / 54
Course Information
Recommended readings will be given for each lecture. But the following will be useful throughout the course:
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, Second Edition, 2009.
- Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006.
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, 2018.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, 2016.
- Kevin Murphy, Machine Learning: A Probabilistic Perspective, 2012.
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, 2017.
- Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, 2014.
- David MacKay, Information Theory, Inference, and Learning Algorithms, 2003.
There are lots of freely available, high-quality ML resources.
Intro ML (UofT) CSC311-Lec1 5 / 54
Requirements and Marking
Four (4) assignments
- Combination of pen & paper derivations and programming exercises
- Weights to be decided (expect 10% each), for a total of 40%
Read some seminal papers
- Worth 10%
- Short 1-paragraph summary and two questions on how the method(s) can be used or extended
Midterm
- February 26, 6–7pm (tentative)
- Worth 20% of course mark
Final Exam
- Two (or three) hours
- Date and time TBA
- Worth 30% of course mark
Intro ML (UofT) CSC311-Lec1 6 / 54
Final Exam Everybody must take the final exam! No exceptions. Time/location will be announced later. You have to get at least 30% in the final exam in order to pass the course. Intro ML (UofT) CSC311-Lec1 7 / 54
More on Assignments
Collaboration on the assignments is not allowed. Each student is responsible for their own work. Discussion of assignments should be limited to clarification of the handout itself, and should not involve any sharing of pseudocode, code, or simulation results. Violation of this policy is grounds for a semester grade of F, in accordance with university regulations.
The schedule of assignments will be posted on the course webpage.
Assignments should be handed in by the deadline; a late penalty of 10% per day will be assessed thereafter (up to 3 days, then submission is blocked).
Extensions will be granted only in special situations, and you will need a Student Medical Certificate or a written request approved by the course coordinator at least one week before the due date.
Intro ML (UofT) CSC311-Lec1 8 / 54
Related Courses and a Note on the Content
More advanced ML courses such as CSC421 (Neural Networks and Deep Learning) and CSC412 (Probabilistic Learning and Reasoning) both build upon the material in this course.
If you've already taken an applied statistics course, there will be some overlap.
This is the first academic year that this course is listed only as an undergrad course. Previously it was CSC411, with a bit more content and a heavier course load. CSC411 has a long history at the University of Toronto; over the years, different instructors have added, removed, or revised its content. We liberally borrow from the previous editions.
Warning: This is not a very difficult course, but it has a lot of content. To be successful, you need to work hard. For example, don't start working on your homework assignments just two days before the deadline.
Intro ML (UofT) CSC311-Lec1 9 / 54
What is learning?
"The activity or process of gaining knowledge or skill by studying, practicing, being taught, or experiencing something." (Merriam-Webster dictionary)
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)
Intro ML (UofT) CSC311-Lec1 10 / 54
What is machine learning?
For many problems, it is difficult to program the correct behaviour by hand:
- recognizing people and objects
- understanding human speech
Machine learning approach: program an algorithm to automatically learn from data, or from experience.
Why might you want to use a learning algorithm?
- hard to code up a solution by hand (e.g. vision, speech)
- system needs to adapt to a changing environment (e.g. spam detection)
- want the system to perform better than the human programmers
- privacy/fairness (e.g. ranking search results)
Intro ML (UofT) CSC311-Lec1 11 / 54
What is machine learning?
It is similar to statistics...
- Both fields try to uncover patterns in data
- Both fields draw heavily on calculus, probability, and linear algebra, and share many of the same core algorithms
But it is not statistics!
- Stats is more concerned with helping scientists and policymakers draw good conclusions; ML is more concerned with building autonomous agents
- Stats puts more emphasis on interpretability and mathematical rigor; ML puts more emphasis on predictive performance, scalability, and autonomy
Intro ML (UofT) CSC311-Lec1 12 / 54
Relations to AI
Nowadays, "machine learning" is often brought up together with "artificial intelligence" (AI).
AI does not always imply a learning-based system; examples include:
- Symbolic reasoning
- Rule-based systems
- Tree search
- etc.
A learning-based system is learned from data, which gives it more flexibility and makes it good at solving pattern recognition problems.
Intro ML (UofT) CSC311-Lec1 13 / 54
Relations to human learning
Human learning is:
- Very data efficient
- An entire multitasking system (vision, language, motor control, etc.)
- Takes at least a few years :)
For serving specific purposes, machine learning doesn't have to look like human learning in the end. It may borrow ideas from biological systems, e.g., neural networks. It may perform better or worse than humans.
Intro ML (UofT) CSC311-Lec1 14 / 54
What is machine learning?
Types of machine learning:
- Supervised learning: access to labeled examples of the correct behaviour
- Reinforcement learning: the learning system (agent) interacts with the world and learns to maximize a reward signal
- Unsupervised learning: no labeled examples; instead, looking for "interesting" patterns in the data
Intro ML (UofT) CSC311-Lec1 15 / 54
History of machine learning
1957 — Perceptron algorithm (implemented as a circuit!)
1959 — Arthur Samuel wrote a learning-based checkers program that could defeat him
1969 — Minsky and Papert's book Perceptrons (limitations of linear models)
1980s — Some foundational ideas
- Connectionist psychologists explored neural models of cognition
- 1984 — Leslie Valiant formalized the problem of learning as PAC learning
- 1986 — Backpropagation (re-)discovered by Geoffrey Hinton and colleagues
- 1988 — Judea Pearl's book Probabilistic Reasoning in Intelligent Systems introduced Bayesian networks
Intro ML (UofT) CSC311-Lec1 16 / 54
History of machine learning
1990s — the "AI Winter", a time of pessimism and low funding. But looking back, the '90s were also sort of a golden age for ML research:
- Markov chain Monte Carlo
- variational inference
- kernels and support vector machines
- boosting
- convolutional networks
2000s — applied AI fields (vision, NLP, etc.) adopted ML
2010s — deep learning
- 2010–2012 — neural nets smashed previous records in speech-to-text and object recognition
- increasing adoption by the tech industry
- 2016 — AlphaGo defeated the human Go champion
Intro ML (UofT) CSC311-Lec1 17 / 54
History of machine learning ML conferences selling out like Beyonce tickets. Source: medium.com/syncedreview/ Intro ML (UofT) CSC311-Lec1 18 / 54
Computer vision: Object detection, semantic segmentation, pose estimation, and almost every other task is done with ML. Instance segmentation - Link Intro ML (UofT) CSC311-Lec1 19 / 54
Speech: Speech to text, personal assistants, speaker identification... Intro ML (UofT) CSC311-Lec1 20 / 54
NLP: Machine translation, sentiment analysis, topic modeling, spam filtering. Intro ML (UofT) CSC311-Lec1 21 / 54
Playing Games DOTA2 - Link Intro ML (UofT) CSC311-Lec1 22 / 54
E-commerce & Recommender Systems: Amazon, Netflix, ... Intro ML (UofT) CSC311-Lec1 23 / 54
Why this class?
Why not jump straight to CSC412/421, and learn neural nets first?
The principles you learn in this course will be essential to really understand neural nets.
The techniques in this course are still the first things to try for a new ML problem.
- For example, you should try applying logistic regression before building a deep neural net!
There is a whole world of probabilistic graphical models.
Intro ML (UofT) CSC311-Lec1 24 / 54
Why this class? 2017 Kaggle survey of data science and ML practitioners: what data science methods do you use at work? Intro ML (UofT) CSC311-Lec1 25 / 54
ML Workflow
ML workflow sketch:
1. Should I use ML on this problem?
   - Is there a pattern to detect?
   - Can I solve it analytically?
   - Do I have data?
2. Gather and organize data.
   - Preprocessing, cleaning, visualizing.
3. Establish a baseline.
4. Choose a model, loss, regularization, ...
5. Optimization (could be simple, could be a PhD!).
6. Hyperparameter search.
7. Analyze performance and mistakes, and iterate back to step 4 (or 2).
Intro ML (UofT) CSC311-Lec1 26 / 54
Implementing machine learning systems
You will often need to derive an algorithm (with pencil and paper), and then translate the math into code.
Array processing (NumPy):
- vectorize computations (express them in terms of matrix/vector operations) to exploit hardware efficiency
- This also makes your code cleaner and more readable!
Intro ML (UofT) CSC311-Lec1 27 / 54
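As a purely illustrative sketch of what vectorization means (the data and sizes here are made up for the example), the following NumPy snippet computes the same quantity twice, once with an explicit loop and once as a single matrix-vector product:

```python
import numpy as np

X = np.random.randn(1000, 50)   # 1000 examples, 50 features (arbitrary sizes)
w = np.random.randn(50)         # a weight vector

# Loop version: one dot product per example (slow in pure Python)
y_loop = np.array([np.dot(X[i], w) for i in range(X.shape[0])])

# Vectorized version: a single matrix-vector product
y_vec = X @ w

assert np.allclose(y_loop, y_vec)   # same result, but the vectorized form is far faster
```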
Implementing machine learning systems
Neural net frameworks: PyTorch, TensorFlow, Theano, etc.
- automatic differentiation
- compiling computation graphs
- libraries of algorithms and network primitives
- support for graphics processing units (GPUs)
Why take this class if these frameworks do so much for you?
- So you know what to do if something goes wrong!
- Debugging learning algorithms requires sophisticated detective work, which requires understanding what goes on beneath the hood.
- That is why we derive things by hand in this class!
Intro ML (UofT) CSC311-Lec1 28 / 54
Preliminaries and Nearest Neighbour Methods
Intro ML (UofT) CSC311-Lec1 29 / 54
Introduction
Today (and for the next 5-6 lectures) we focus on supervised learning. This means we are given a training set consisting of inputs and corresponding labels, e.g.:
Task                     | Inputs          | Labels
object recognition       | image           | object category
image captioning         | image           | caption
document classification  | text            | document category
speech-to-text           | audio waveform  | text
...                      | ...             | ...
Intro ML (UofT) CSC311-Lec1 30 / 54
Input Vectors What an image looks like to the computer: [Image credit: Andrej Karpathy] Intro ML (UofT) CSC311-Lec1 31 / 54
Input Vectors
Machine learning algorithms need to handle lots of types of data: images, text, audio waveforms, credit card transactions, etc.
Common strategy: represent the input as an input vector in $\mathbb{R}^d$.
- Representation = mapping to another space that is easy to manipulate
- Vectors are a great representation since we can do linear algebra
Intro ML (UofT) CSC311-Lec1 32 / 54
Input Vectors Can use raw pixels: Can do much better if you compute a vector of meaningful features. Intro ML (UofT) CSC311-Lec1 33 / 54
Input Vectors
Mathematically, our training set consists of a collection of pairs of an input vector $x \in \mathbb{R}^d$ and its corresponding target, or label, $t$:
- Regression: $t$ is a real number (e.g. stock price)
- Classification: $t$ is an element of a discrete set $\{1, \ldots, C\}$
- These days, $t$ is often a highly structured object (e.g. image)
Denote the training set $\{(x^{(1)}, t^{(1)}), \ldots, (x^{(N)}, t^{(N)})\}$
- Note: these superscripts have nothing to do with exponentiation!
Intro ML (UofT) CSC311-Lec1 34 / 54
Nearest Neighbors
Suppose we're given a novel input vector $x$ we'd like to classify. The idea: find the nearest input vector to $x$ in the training set and copy its label.
Can formalize "nearest" in terms of Euclidean distance:
$\|x^{(a)} - x^{(b)}\|_2 = \sqrt{\sum_{j=1}^{d} \big(x_j^{(a)} - x_j^{(b)}\big)^2}$
Algorithm:
1. Find the example $(x^*, t^*)$ (from the stored training set) closest to $x$. That is: $x^* = \operatorname{argmin}_{x^{(i)} \in \text{training set}} \operatorname{distance}(x^{(i)}, x)$
2. Output $y = t^*$
Note: we do not need to compute the square root. Why?
Intro ML (UofT) CSC311-Lec1 35 / 54
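A minimal NumPy sketch of this 1-nearest-neighbour rule (the function and variable names are ours, not from the course code). It also answers the question above: the square root is monotonic, so comparing squared distances gives the same argmin.

```python
import numpy as np

def nearest_neighbor_predict(X_train, t_train, x):
    # X_train: (N, d) training inputs, t_train: (N,) training labels, x: (d,) query.
    # Squared Euclidean distances to all training points; the square root is
    # omitted because it does not change which point is closest.
    dists = np.sum((X_train - x) ** 2, axis=1)
    return t_train[np.argmin(dists)]
```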
Nearest Neighbors: Decision Boundaries We can visualize the behaviour in the classification setting using a Voronoi diagram. Intro ML (UofT) CSC311-Lec1 36 / 54
Nearest Neighbors: Decision Boundaries Decision boundary: the boundary between regions of input space assigned to different categories. Intro ML (UofT) CSC311-Lec1 37 / 54
Nearest Neighbors: Decision Boundaries Example: 2D decision boundary Intro ML (UofT) CSC311-Lec1 38 / 54
k-Nearest Neighbors
[Pic by Olga Veksler]
Nearest neighbors is sensitive to noise or mis-labeled data ("class noise"). Solution? Smooth by having the k nearest neighbors vote.
Algorithm (kNN):
1. Find the $k$ examples $\{(x^{(i)}, t^{(i)})\}$ closest to the test instance $x$
2. Classification output is the majority class: $y = \operatorname{argmax}_{t^{(z)}} \sum_{i=1}^{k} \mathbb{I}\{t^{(z)} = t^{(i)}\}$
$\mathbb{I}\{\text{statement}\}$ is the indicator function: it equals one whenever the statement is true, and zero otherwise. We could also write this as $\delta(t^{(z)}, t^{(i)})$, with $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise.
Intro ML (UofT) CSC311-Lec1 39 / 54
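A corresponding sketch of the kNN rule with a majority vote (again, illustrative code rather than an official course implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x, k=5):
    # Squared distances from the query x to every training point
    dists = np.sum((X_train - x) ** 2, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote: return the label that occurs most often among the k neighbours
    return Counter(t_train[nearest].tolist()).most_common(1)[0][0]
```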
K-Nearest neighbors k=1 [Image credit: ”The Elements of Statistical Learning”] Intro ML (UofT) CSC311-Lec1 40 / 54
K-Nearest neighbors k=15 [Image credit: ”The Elements of Statistical Learning”] Intro ML (UofT) CSC311-Lec1 41 / 54
k-Nearest Neighbors
Tradeoffs in choosing k?
Small k:
- Good at capturing fine-grained patterns
- May overfit, i.e. be sensitive to random idiosyncrasies in the training data
Large k:
- Makes stable predictions by averaging over lots of examples
- May underfit, i.e. fail to capture important regularities
Balancing k:
- The optimal choice of k depends on the number of data points n.
- Nice theoretical properties if $k \to \infty$ and $\frac{k}{n} \to 0$.
- Rule of thumb: choose $k = n^{\frac{2}{2+d}}$ (illustrated numerically below).
- We will explain an easier way to choose k using data.
Intro ML (UofT) CSC311-Lec1 42 / 54
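For concreteness, here is what the $n^{2/(2+d)}$ rule of thumb evaluates to for a few hypothetical dataset sizes and dimensions (the specific numbers are only illustrative):

```python
# Rule of thumb k = n^(2 / (2 + d)), rounded to the nearest integer.
for n, d in [(1000, 2), (1000, 20), (100000, 2)]:
    k = round(n ** (2 / (2 + d)))
    print(f"n = {n}, d = {d}  ->  k ~ {k}")
# Prints roughly k ~ 32, k ~ 2, and k ~ 316 respectively:
# higher dimension pushes the rule toward smaller k, more data toward larger k.
```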
K-Nearest neighbors We would like our algorithm to generalize to data it hasn’t seen before. We can measure the generalization error (error rate on new examples) using a test set. [Image credit: ”The Elements of Statistical Learning”] Intro ML (UofT) CSC311-Lec1 43 / 54
Validation and Test Sets
k is an example of a hyperparameter, something we can't fit as part of the learning algorithm itself.
We can tune hyperparameters using a validation set. The test set is used only at the very end, to measure the generalization performance of the final configuration.
Intro ML (UofT) CSC311-Lec1 44 / 54
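A sketch of this tuning procedure, reusing the knn_predict function from the earlier sketch; the synthetic data, split sizes, and candidate values of k are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))              # synthetic inputs
t = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic binary labels

# Split into training, validation, and test sets.
X_train, t_train = X[:200], t[:200]
X_val,   t_val   = X[200:250], t[200:250]
X_test,  t_test  = X[250:], t[250:]

def accuracy(X_ref, t_ref, X_eval, t_eval, k):
    preds = np.array([knn_predict(X_ref, t_ref, x, k) for x in X_eval])
    return np.mean(preds == t_eval)

# Tune the hyperparameter k on the validation set only...
best_k = max([1, 3, 5, 15, 51],
             key=lambda k: accuracy(X_train, t_train, X_val, t_val, k))

# ...and touch the test set exactly once, at the very end.
test_accuracy = accuracy(X_train, t_train, X_test, t_test, best_k)
```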
Pitfalls: The Curse of Dimensionality
Low-dimensional visualizations are misleading! In high dimensions, "most" points are far apart.
If we want the nearest neighbor to be closer than $\varepsilon$, how many points do we need to guarantee it?
The volume of a single ball of radius $\varepsilon$ is $O(\varepsilon^d)$. The total volume of $[0, 1]^d$ is 1. Therefore $O\big((1/\varepsilon)^d\big)$ balls are needed to cover the volume.
[Image credit: "The Elements of Statistical Learning"]
Intro ML (UofT) CSC311-Lec1 45 / 54
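A quick back-of-the-envelope computation of that $O((1/\varepsilon)^d)$ count, with $\varepsilon = 0.1$ chosen arbitrarily, shows how fast it blows up with the dimension:

```python
# Number of radius-eps balls needed to cover the unit cube [0, 1]^d grows like (1/eps)^d.
eps = 0.1
for d in [1, 2, 10, 100]:
    print(f"d = {d}: about {(1 / eps) ** d:.0e} balls")
# d = 1: ~1e1, d = 2: ~1e2, d = 10: ~1e10, d = 100: ~1e100
```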
Pitfalls: The Curse of Dimensionality
In high dimensions, "most" points are approximately the same distance from each other. (Homework question coming up...)
Saving grace: some datasets (e.g. images) may have low intrinsic dimension, i.e. lie on or near a low-dimensional manifold. So nearest neighbors sometimes still works in high dimensions.
Intro ML (UofT) CSC311-Lec1 46 / 54
Pitfalls: Normalization
Nearest neighbors can be sensitive to the ranges of different features. Often, the units are arbitrary.
Simple fix: normalize each dimension to be zero mean and unit variance. I.e., compute the mean $\mu_j$ and standard deviation $\sigma_j$, and take
$\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$
Caution: depending on the problem, the scale might be important!
Intro ML (UofT) CSC311-Lec1 47 / 54
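A short NumPy sketch of this standardization, assuming X_train and X_test are (N, d) arrays as in the earlier sketches; the statistics are computed on the training set and then reused for new data:

```python
import numpy as np

mu = X_train.mean(axis=0)        # per-feature means, computed on the training set
sigma = X_train.std(axis=0)      # per-feature standard deviations
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # apply the *training* statistics to new data
```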
Pitfalls: Computational Cost
Number of computations at training time: 0
Number of computations at test time, per query (naïve algorithm):
- Calculate D-dimensional Euclidean distances with N data points: O(ND)
- Sort the distances: O(N log N)
This must be done for each query, which is very expensive by the standards of a learning algorithm!
Need to store the entire dataset in memory!
Tons of work has gone into algorithms and data structures for efficient nearest neighbors with high dimensions and/or large datasets.
Intro ML (UofT) CSC311-Lec1 48 / 54
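The O(ND) distance computation is usually vectorized rather than looped. A sketch for a batch of M queries at once, using the expansion $\|a - b\|^2 = \|a\|^2 - 2\, a \cdot b + \|b\|^2$ (our own helper, not a library routine):

```python
import numpy as np

def pairwise_sq_dists(X_query, X_train):
    # X_query: (M, d), X_train: (N, d)  ->  (M, N) matrix of squared distances.
    q_sq = np.sum(X_query ** 2, axis=1)[:, None]    # (M, 1)
    t_sq = np.sum(X_train ** 2, axis=1)[None, :]    # (1, N)
    return q_sq - 2.0 * (X_query @ X_train.T) + t_sq
```

In practice one can also avoid the full O(N log N) sort by selecting only the k smallest distances (a partial sort); approximate data structures (e.g. k-d trees, locality-sensitive hashing) reduce the per-query cost further, which is part of the work referred to above.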
Example: Digit Classification
Decent performance when there is lots of data.
Intro ML (UofT) CSC311-Lec1 49 / 54
Example: Digit Classification KNN can perform a lot better with a good similarity measure. Example: shape contexts for object recognition. In order to achieve invariance to image transformations, they tried to warp one image to match the other image. I Distance measure: average distance between corresponding points on warped images Achieved 0.63% error on MNIST, compared with 3% for Euclidean KNN. Competitive with conv nets at the time, but required careful engineering. [Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape contexts.] Intro ML (UofT) CSC311-Lec1 50 / 54
Example: 80 Million Tiny Images 80 Million Tiny Images was the first extremely large image dataset. It consisted of color images scaled down to 32 × 32. With a large dataset, you can find much better semantic matches, and KNN can do some surprising things. Note: this required a carefully chosen similarity metric. [Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.] Intro ML (UofT) CSC311-Lec1 51 / 54
Example: 80 Million Tiny Images [Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.] Intro ML (UofT) CSC311-Lec1 52 / 54
Conclusions
Simple algorithm that does all its work at test time — in a sense, no learning!
Can be used for regression too, which we will encounter later.
Can control the complexity by varying k.
Suffers from the Curse of Dimensionality.
Next time: decision trees, another approach to regression and classification.
Intro ML (UofT) CSC311-Lec1 53 / 54
Questions?
Intro ML (UofT) CSC311-Lec1 54 / 54