Advanced Course in Statistics: an overview
Antonio Cuevas
Departamento de Matemáticas, Universidad Autónoma de Madrid
January 2020
Prerequisites
- I will assume that course attendants have taken at least an introductory course in mathematical statistics and a basic course in probability. In any case, I will do my best to make the course as self-contained as possible. If you need to review some basic notions of mathematical statistics, please have a look at the slides of my undergraduate courses Statistics I and Statistics II. Many other resources are freely available on the internet.
- Some familiarity with basic notions of measure theory, functional analysis (Banach and Hilbert spaces, operator theory, $L^p$ spaces, ...) and stochastic processes is desirable.
- We will use the statistical software R for illustration and practical examples. Some of the proposed exercises will require the use of R. While some familiarity with this software is highly recommended, it is not strictly necessary in order to follow the course. Please see the course web page for additional information on R. A very basic introduction to R can also be found in the slides of Statistics I.
The data
In general terms, the aim of statistics is to obtain information from a data set (or sample) $x_1, \dots, x_n$. These data come from the repeated observation of a phenomenon of interest. The sample space $\mathcal{X}$ is defined as the set of all possible values of the magnitude $x$. In classical statistics $\mathcal{X} = \mathbb{R}$. In so-called multivariate analysis $\mathcal{X} = \mathbb{R}^d$.
Descriptive statistics / Statistical inference
- Descriptive statistics (Exploratory Data Analysis): the aim is to summarize (e.g., via the mean, median and mode) and visualize a data set.
- Statistical inference: the data $X_1, \dots, X_n$ are independent, identically distributed observations drawn from a random variable $X$, $X : (\Omega, \mathcal{A}, \mathbb{P}) \to (\mathcal{X}, \mathcal{B})$. We will sometimes say that $X$ represents the underlying population. The distribution of $X$ (defined by $P(B) = \mathbb{P}(X \in B)$ for $B \in \mathcal{B}$) is often assumed to depend on an unknown parameter $\theta$ taking values in a known parameter space $\Theta$. We will sometimes write $P = P_\theta$. The general purpose is to use the random sample $X_1, \dots, X_n$ to make inference (hypothesis testing, point estimation, confidence intervals, ...) about the (unknown) "true" value of $\theta \in \Theta$.
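The evolution of statistical theory

| Statistical theory | $\mathcal{X}$ | $\Theta$ | Time |
|---|---|---|---|
| Classical inference | $\mathbb{R}$ | $\Theta \subset \mathbb{R}$ | 1920's |
| Multivariate analysis | $\mathbb{R}^d$ ($n \gg d$) | $\Theta \subset \mathbb{R}^k$ ($n \gg k$) | 1940's |
| Nonparametrics | $\mathbb{R}^d$ ($n \gg d$) | A function space | 1960's |
| High dimensional problems | $\mathbb{R}^d$ ($n < d$) | $\Theta \subset \mathbb{R}^k$ | 2000's |
| Functional Data Analysis | A function space | $\mathbb{R}^k$ or a funct. space | 1990's |
| Object Oriented D. Analysis | A space of images | $\mathbb{R}^k$, or space of images | 2000's |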
General structure of the course
Two parts:
- Statistics with functional data: the sample data are real functions $x_i = x_i(t)$ defined on a compact interval.
- Nonparametric functional estimation: the data are real numbers (or vectors in $\mathbb{R}^d$) but the target of the estimation is a function, for example a density or a regression function.
Statistics with functional data
It is sometimes called Functional Data Analysis (FDA). The data
$$x_1 = x_1(t), \dots, x_n = x_n(t), \quad t \in [0, 1],$$
are functions defined on some compact interval (say $[0, 1]$). The argument $t$ often (but not necessarily) corresponds to the time instant at which the magnitude $x(t)$ is measured. The functional data can be considered as random observations drawn from a stochastic process. The distribution of a stochastic process is a probability measure on the space of trajectories. So, we will need to use some probability theory on function spaces. In informal terms,
$$\frac{\text{Random variables}}{\text{Classical statistics}} = \frac{\text{Stochastic processes}}{\text{FDA}}$$
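To make the idea of "functional data as trajectories of a stochastic process" concrete, the following minimal sketch simulates discretized trajectories of a standard Brownian motion on $[0, 1]$. The choice of Brownian motion, the grid size, the function name `simulate_brownian_paths` and the use of Python (rather than R, the software used in the course) are assumptions made only for illustration.

```python
import numpy as np

def simulate_brownian_paths(n_curves, n_points, seed=0):
    """Return (grid, paths): n_curves discretized Brownian paths on [0, 1]."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, n_points)
    dt = grid[1] - grid[0]
    # Independent Gaussian increments with variance dt; every path starts at 0.
    increments = rng.normal(0.0, np.sqrt(dt), size=(n_curves, n_points - 1))
    start = np.zeros((n_curves, 1))
    paths = np.concatenate([start, np.cumsum(increments, axis=1)], axis=1)
    return grid, paths

# Each row of `paths` plays the role of one functional datum x_i(t).
grid, paths = simulate_brownian_paths(n_curves=50, n_points=101)
mean_curve = paths.mean(axis=0)  # pointwise sample mean function
```

In practice the curves are only observed on a grid, as here, so each functional datum is stored as a long vector; the functional viewpoint matters for how such vectors are then modelled and compared.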
Functional data: an example in cardiology
Figure (ECG data; legend: control group, patients group): 2026 electrocardiograms. 1506 correspond to the control group (in blue) and 520 correspond to ischemia patients (in red).
A possible application here would be as follows: given the ECG curve of a newly arrived patient (not yet diagnosed with respect to ischemia), could we obtain, in view of that ECG curve, a quick, preliminary diagnosis for the patient?
Functional data: an example in climate studies (I)
Functional data: an example in climate studies (II)
In the figure above, the blue line corresponds to the average of 38 curves; each curve is obtained (via linear interpolation) from the maximum daily temperatures (365 values per year) recorded at the Barcelona Airport (El Prat) during the period 1944-1981. The red line is the analogous average obtained from the 38 curves corresponding to the period 1982-2019. The February 29 data (corresponding to leap years) have been omitted. The missing values have been imputed by linear interpolation.
Some interesting questions:
- Assume that the temperatures in the first (resp. second) period are a sample of a process $X(t)$ (resp. $Y(t)$), and denote the respective mean functions by $m_1(t) = \mathbb{E}(X(t))$ and $m_2(t) = \mathbb{E}(Y(t))$. Is there enough statistical evidence (in view of the previous data) to conclude that $m_1 \neq m_2$? In other words, we would like to test the null hypothesis $H_0 : m_1 = m_2$ versus the alternative $H_1 : m_1 \neq m_2$.
- Is there some useful information in the derivatives of the curves?
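One possible way to approach a testing problem of the type $H_0 : m_1 = m_2$ is a permutation test based on the $L^2$ distance between the two sample mean curves. The sketch below is only an illustration of that idea, not the method used in the course: it runs on synthetic curves (not the Barcelona data), all names and parameter values are hypothetical, and the permutation argument assumes that, under the null hypothesis, the curves of both groups are exchangeable (a stronger condition than equality of means).

```python
import numpy as np

def l2_distance_between_means(x, y, dt):
    """L2 distance (Riemann sum) between the mean curves of two samples."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    return np.sqrt(np.sum(diff ** 2) * dt)

def permutation_test_equal_means(x, y, dt, n_perm=999, seed=0):
    """Permutation p-value for H0: m1 = m2, assuming exchangeability under H0."""
    rng = np.random.default_rng(seed)
    observed = l2_distance_between_means(x, y, dt)
    pooled = np.vstack([x, y])
    n_x = x.shape[0]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        stat = l2_distance_between_means(pooled[perm[:n_x]], pooled[perm[n_x:]], dt)
        count += int(stat >= observed)
    return observed, (count + 1) / (n_perm + 1)

# Synthetic stand-in for two periods of 38 yearly curves with 365 values each.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 365)
seasonal = 20.0 + 8.0 * np.sin(2.0 * np.pi * (t - 0.3))
x = seasonal + rng.normal(0.0, 2.0, size=(38, 365))
y = seasonal + 1.0 + rng.normal(0.0, 2.0, size=(38, 365))  # mean shifted by 1 degree
observed, p_value = permutation_test_equal_means(x, y, dt=t[1] - t[0])
```

A small p-value suggests that the observed distance between the mean curves is larger than what random relabelling of the curves would typically produce.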
Functional data: an example in climate studies (III) The graph below (courtesy of J.E. Chacón) corresponds to temperatures recorded at Pittsburgh.
The troubles with FDA (I)
To some extent, the progress of statistics has consisted in conquering more sophisticated sample and parameter spaces: from subsets of $\mathbb{R}$ or $\mathbb{R}^d$ to function or shape spaces. This increase in generality entails some problems:
- Lack of a natural order in the sample space: no distribution function is available to characterize the distributions.
- Multiplicity of choices for the distance $d(x_1, x_2)$ between two elements (or $\|x_1 - x_2\|$ in the case of normed spaces).
- How to define the "population mean" $\mu$ so that it properly corresponds to the notion of "average" and satisfies $\mathbb{E}\|X - \mu\|^2 = \min_a \mathbb{E}\|X - a\|^2$?
- How to define some basic notions such as median, mode, quantiles and outliers so that they properly generalize the analogous notions for numerical data?
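In the particular case where the sample space is a separable Hilbert space with inner product $\langle \cdot, \cdot \rangle$, $\mathbb{E}\|X\|^2 < \infty$ and $\mu = \mathbb{E}X$ is the (Bochner) expectation, the minimizing property of the mean mentioned above can be checked directly. The following derivation is a sketch under those assumptions; the cross term vanishes because $\mathbb{E}\langle X - \mu, h \rangle = \langle \mathbb{E}X - \mu, h \rangle = 0$ for every fixed $h$.

```latex
\begin{align*}
\mathbb{E}\|X - a\|^2
  &= \mathbb{E}\|X - \mu\|^2
     + 2\,\mathbb{E}\langle X - \mu,\ \mu - a \rangle
     + \|\mu - a\|^2 \\
  &= \mathbb{E}\|X - \mu\|^2 + \|\mu - a\|^2 \\
  &\ge \mathbb{E}\|X - \mu\|^2 ,
\end{align*}
```

with equality if and only if $a = \mu$. The argument relies on the inner product, so it does not carry over as it stands to general normed or metric sample spaces.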
The troubles with FDA (II)
- Closed, bounded sets are not in general compact in infinite-dimensional spaces. This entails some serious theoretical and practical consequences.
- It is difficult to define simple, easy-to-handle regression models for the case of general data.
- The need for pre-processing data, which usually come in a discretized fashion.
- Lack of a natural translation-invariant measure (analogous to the Lebesgue measure). So, no general notion of density function (similar to that of the finite-dimensional cases) is available. Some other important tools in classical statistics, such as characteristic functions, can only be partially used, under severe limitations.
- In FDA, the non-invertibility of the covariance operators leads to some essential differences with the classical treatment of regression and classification models.
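The failure of compactness for closed, bounded sets can be seen with a standard example, sketched here for an infinite-dimensional Hilbert space (such as $L^2[0, 1]$) with an orthonormal sequence $(e_k)_{k \ge 1}$. All the $e_k$ lie in the closed unit ball, yet for $j \neq k$:

```latex
\[
\|e_j - e_k\|^2
  = \|e_j\|^2 + \|e_k\|^2 - 2\,\langle e_j, e_k \rangle
  = 1 + 1 - 0
  = 2 .
\]
```

Hence any two distinct terms are at distance $\sqrt{2}$, no subsequence of $(e_k)$ is a Cauchy sequence, and the closed unit ball (a closed and bounded set) is not compact.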
Some typical tools in FDA
- The Karhunen-Loève expansion: $X(t)$ can be expressed in the form
$$X(t) = \sum_{k=1}^{\infty} Z_k e_k(t),$$
where the $e_k$ are the eigenfunctions of the covariance operator of the process $X(t)$.
- Depth measures.
- Dimension reduction procedures.
- Bochner integral.
- Exponential inequalities.
- Regularization and smoothing methods.
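An empirical counterpart of the Karhunen-Loève expansion can be sketched by discretizing the curves on a grid and computing the eigen-decomposition of the sample covariance. The code below is a minimal illustration under several assumptions: the curves are observed on a common equispaced grid, the sample mean is subtracted first (so the expansion is applied to the centered curves), integrals are replaced by Riemann sums, and the function name and the simulated Brownian data are hypothetical.

```python
import numpy as np

def empirical_karhunen_loeve(curves, dt, n_components):
    """Leading eigenvalues, eigenfunctions and scores of discretized curves."""
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    # Discretized sample covariance kernel K(s, t) on the grid.
    cov = centered.T @ centered / curves.shape[0]
    # The integral operator f -> int K(s, t) f(t) dt is approximated by cov * dt.
    eigvals, eigvecs = np.linalg.eigh(cov * dt)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals = eigvals[order]
    # Rescale so that sum(e_k(t)^2) * dt = 1 (unit norm in the discretized L2 sense).
    eigfuns = eigvecs[:, order].T / np.sqrt(dt)
    # Scores Z_k = int (X(t) - mean(t)) e_k(t) dt, approximated by a Riemann sum.
    scores = centered @ eigfuns.T * dt
    return mean_curve, eigvals, eigfuns, scores

# Simulated Brownian paths on [0, 1] as a stand-in for real functional data.
rng = np.random.default_rng(0)
n_curves, n_points = 200, 101
dt = 1.0 / (n_points - 1)
increments = rng.normal(0.0, np.sqrt(dt), size=(n_curves, n_points - 1))
curves = np.concatenate([np.zeros((n_curves, 1)), np.cumsum(increments, axis=1)], axis=1)

mean_curve, eigvals, eigfuns, scores = empirical_karhunen_loeve(curves, dt, n_components=3)
# Truncated reconstruction of the curves from the first three terms.
approx = mean_curve + scores @ eigfuns
```

For standard Brownian motion the theoretical eigenvalues are $\lambda_k = 1 / ((k - 1/2)^2 \pi^2)$, so the entries of `eigvals` can be compared informally with these values. Keeping only a few terms of the expansion is also the basis of a common dimension reduction procedure (functional principal components).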
Some problems in FDA we will consider
- Some probability background: probability theory in infinite-dimensional spaces.
- Definition of measures of central tendency (mean, median and mode) and estimation of these values.
- Practical use of functional data.
- Dimension reduction methods.
- Depth notions.
- Supervised classification (or discrimination) and unsupervised classification (clustering) based on functional data.
- Functional regression models.
- ANOVA models for functional data.
Nonparametric functional statistics (I)
The general aim is to estimate, from a sample $X_1, \dots, X_n$, some function of interest which depends on the distribution of the $X_i$'s. The term "nonparametric" comes from the fact that we do not assume that the target function belongs to any family indexed by a finite-dimensional parameter.
Nonparametric functional statistics (II)
Some typical examples:
- Estimation of the distribution function: Let $X_1, X_2, \dots, X_n, \dots$ be iid observations drawn from a real random variable $X$ with distribution function $F$. The empirical distribution function
$$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, t]}(X_i)$$
is a natural estimator of $F$. The study of the properties of this estimator is a major topic in statistics.
- Density estimation: Here the aim is to estimate the common density $f$ of the $X_i$'s. This topic has received a lot of attention since the 1960's. Most density estimators are constructed from "smoothed" versions of $F_n$. They can be seen as "sophisticated versions" of the familiar histograms.
- Estimation of the regression function $\mathbb{E}(Y | X = x)$: In this case we must have a sample of type $(X_i, Y_i)$.
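The three estimation problems above can be illustrated with a short sketch: the empirical distribution function $F_n$, a kernel density estimator and a kernel (Nadaraya-Watson) regression estimator. The choice of a Gaussian kernel, the bandwidth values, the function names and the synthetic data are all assumptions made for illustration; the course itself relies on R packages for these tasks.

```python
import numpy as np

def ecdf(sample, t):
    """Empirical distribution function F_n evaluated at the points t."""
    sample = np.asarray(sample, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return (sample[None, :] <= t[:, None]).mean(axis=1)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_density(sample, t, h):
    """Kernel density estimate at the points t with bandwidth h."""
    u = (t[:, None] - sample[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

def nadaraya_watson(x, y, t, h):
    """Kernel-weighted local average of y as an estimate of E(Y | X = t)."""
    w = gaussian_kernel((t[:, None] - x[None, :]) / h)
    # Assumes each point of t has some observations nearby, so that w.sum > 0.
    return (w @ y) / w.sum(axis=1)

# Distribution function and density of a tiny sample.
sample = np.array([1.0, 2.0, 3.0, 4.0])
grid = np.linspace(-10.0, 15.0, 2001)
density = kernel_density(sample, grid, h=0.5)

# Regression on synthetic data (x_i, y_i).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=200)
t_eval = np.linspace(0.1, 0.9, 9)
fit = nadaraya_watson(x, y, t_eval, h=0.05)
```

The bandwidth `h` controls the amount of smoothing in both kernel estimators; how to choose it is a central question in this part of the theory.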
An example in Medicine
Figure: In fetal cardiology studies, the analysis of the so-called "(average) short term variability" is often of interest. This variable is called ASTV in the data set cardio included in the R package ks; see also the website of the book by Chacón and Duong and the UCI Machine Learning Repository for further analysis and details on these data. The above graph shows a nonparametric density estimator of the ASTV based on a sample of 2126 foetuses. The shape of this estimated density reveals some features of the distribution (e.g., related to multimodality) which would necessarily be hidden if we fitted a usual parametric model (e.g., based on the normal model).
An example in Food Science
Figure: The above graph shows the estimated regression function $m(x) = \mathbb{E}(Y | X = x)$ for the variables $X$ = weight (in kg) of a fish, $Y$ = concentration of mercury in the meat of the fish. The regression curve has been obtained from a sample of 171 fish captured in the rivers Lumber and Waccamaw (North Carolina, USA). Again, nonparametric procedures provide more flexibility when compared with the standard parametric methods (based on linear or polynomial regression).
Nonparametric functional statistics (III)
- Set estimation: The aim is to estimate the (compact) support of a random variable $X$, with values in $\mathbb{R}^d$, from a sample $X_1, \dots, X_n$. In other cases, the target of the estimation is a level set of type $\{x : f(x) \geq c\}$, where $f$ is the underlying density of the $X_i$'s.
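A simple support estimator, shown here only as one possible illustration of set estimation, is the union of closed balls of radius $\varepsilon$ centered at the sample points (often called the Devroye-Wise estimator). In the sketch below, the radius `eps`, the function name and the synthetic sample (uniform on the unit disk) are assumptions chosen for the example.

```python
import numpy as np

def in_union_of_balls(points, sample, eps):
    """True for each query point lying within distance eps of some sample point."""
    dists = np.linalg.norm(points[:, None, :] - sample[None, :, :], axis=2)
    return dists.min(axis=1) <= eps

# Synthetic sample: 500 points uniformly distributed on the unit disk in R^2.
rng = np.random.default_rng(0)
radius = np.sqrt(rng.uniform(0.0, 1.0, size=500))
angle = rng.uniform(0.0, 2.0 * np.pi, size=500)
sample = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

# One query point inside the true support and one far away from it.
queries = np.array([[0.0, 0.0], [3.0, 3.0]])
inside = in_union_of_balls(queries, sample, eps=0.2)
```

The radius plays a role analogous to a bandwidth: if it is too small the estimated support breaks into isolated pieces, and if it is too large the estimate overshoots the boundary.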
The R packages we will use
The R software is free and easy to install. It is available for the usual platforms (Windows, Linux, Mac) from http://www.r-project.org/. This software consists of a "basic version" plus many additional packages which can be downloaded when needed. In particular, the package fda.usc is very useful for functional data analysis. In this link you can find fairly complete information on the R packages currently available for FDA. The packages KernSmooth and ks include some standard procedures in nonparametric statistics. More details on software can be found on the course web page.