Basic concepts - part 1 - SOS 2021 18-29 January, Online School - Julien Donini - UCA/LPC Clermont-Ferrand - Indico de l'IN2P3

Page created by Paula Bowman

Science

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Basic concepts - part 1 - SOS 2021 18-29 January, Online School - Julien Donini - UCA/LPC Clermont-Ferrand - Indico de l'IN2P3

Basic concepts – part 1

 SOS 2021 18-29 January, Online School

Julien Donini - UCA/LPC Clermont-Ferrand

Outline
 Basics Introductory books (non exhaustive)
 • Sample measurements Excellent book of reference
 • Error propagation • G. Cowan, Statistical Data Analysis
 (Oxford Science Publication)
 • Probabilities, Bayes Theorem
 • Probability density function Introduction to Bayesian analysis
 • D. Sivia, Data Analysis: A Bayesian
 Parameter estimation Tutorial (Oxford Science Publication)
 • Maximum likelihood method
 • Linear regression Classic textbook
 • Least square fit • Louis Lyons, Statistics for Nuclear and
 Particle Physicists (Cambridge
 Model testings University Press)
 • p-value and test statistics
 • Chi2 and KS tests En Français
 • Hypothesis testing • B. Clement, Analyse de données en
 sciences expérimentales (Dunod)

SOS 2021 2

Samples: basic basics
 Population
 • Let’s consider a sample of values (e.g. experimental measurements)
 N measurement of a random variable X: {xi} = {x1, x2, …, xN}
 • There are several quantities that can be determined to characterize this
 population without any knowledge of the underlying model/theory

 Measure of position

Arithmetic mean: Median: value that separates sample in half

Quartiles (Q1,Q2,Q3): values that separates sample in four equal-size sample

 x1 x8 x15
 x
 Q1 Q2 Q3
 median

SOS 2021 3

Samples: basic basics
 Measure of dispersion
 Variance: if truth sample mean μ is known

 But μ is in general not know and sample mean is used instead

 • Sample variance (biased):

 • Estimated variance (unbiased):

 → Bias is below α if N ≥ 1/α – 1 (ex for 1% bias, N≥101)

 Standard deviation (is of same unit as x):

SOS 2021 4

Standard deviation and error
In many situations repeating an experiment a large amount of time produces
a spread of results whose distribution is approximately Gaussian.
This is a consequence of the Central Limit Theorem.

 Gaussian (a.k.a normal) distribution
 → 68.3%
 → 95.5%

 Interval μ±σ contains 68.3% of distribution

A measurement = outcome of the sum of a large number of effects.
In general the distribution of this variable will be gaussian.
The standard deviation of the sample is associated to the standard
deviation of the normal distribution.
The standard deviation is then interpreted as an interval that could contain
the true value with a 68.3% confidence level.
SOS 2021 5

CLT at work
Simple illustration of CLT
• let’s consider x: a random variable uniformly distributed in [0,1]
• and the distribution of the sum of N values x:

 N=1 N=3
 Uniform (N=1)

 Irwin-Hall
 (see here)

 N=10 N=50

 Gauss
 (N>40)

SOS 2021 6

Multidimensional samples
 Case where N measurements are performed of M different variables
 → The sample then consists of N vectors of M measurements

 with
 (…)

 Mean and variance can be calculated for each variable xi(k) but to quantify
 how of one variable behaves w.r.t another one uses the covariance:

 For two variables x and y:

 Correlation factor is defined as:

 ρxy = 1(-1) → x and y are fully (anti)correlated
 ρxy = 0 → x and y are uncorrelated (≠ independent !)
SOS 2021 7

Covariance matrix
Covariance matrix (aka error matrix) of sample
• Real, symmetric, N×N matrix of the form:

Correlation matrix:

Example of usage of covariance matrix:
• Transformation of input variables
• Error propagation
• Combination of correlated measurements
• …

SOS 2021 8

Decorrelation
Decorrelation: choose a basis where C becomes diagonal.
→ transformation matrix A such that new covariance matrix U is diagonal

 ,

 (A is orthogonal A-1=AT)

Diagonalization of C: find orthonormal eigenvectors ej such that

 T
 and

 λi = eigenvalues of C = σ’j2 = variance of yi
SOS 2021 9

Decorrelation
 2D example: variables x1 and x2 with correlation factor ρ

Decorrelation: use cases
• Data pre-processing (for ML): remove correlation from input variables
• Reduce dimensionality of a problem: Principal Component Analysis (PCA)
 Consider only the M

Error propagation
Function f of several variables x={x1,…,xN}
• Each variable xi of mean μi and variance σi2
• Perform 1st order Taylor expansion of f around mean value

Variance of f(x):

 Validity: up to 2nd order, linear case, small errors
SOS 2021 11

Error propagation
2D Example:
x and y with correlation factor ρ

 →

 →

For a set of m function
• C is the covariance of variables x={xi}
• We can build the covariance matrix of {fi(x)}: U

This can be expressed as where (Jacobian
 matrix)

SOS 2021 12

Interlude

 You are given a coin, you toss it and obtain “tail”.
 What is the probability that both sides are “tail” ?

SOS 2021 13

Interlude

 It depends on the prior that the coin is unfair
 (and on the person that gave you the coin)

 Who is more likely to give a fair coin ?

SOS 2021 14

Probabilities
 Ω
 Sample space: Ω
 A
 • Set of all possible results of an experiment C
 B
 • Populated by events

 Probability
 • Frequentist: related to frequency of occurrence

 • Subjectivist (Bayesian): degree of belief that A is true
 Introduces concepts of prior and posterior probability

 Knowledge on A increases using data

SOS 2021 15

Axioms and rules
 Mathematical formalization (Kolmogorov)
 Ω
 B

 A

 Conditional probability:

SOS 2021 16

Bayes theorem
 An Essay towards solving a Problem in the Doctrine of Chances.
 By the late Rev. Mr. Bayes, communicated by Mr. Price (1763)
 “If there be two subsequent events, the probability of the
 second b/N and the probability of both together P/N, and it
 being first discovered that the second event has also
 happened, from hence I guess that the first event has also
 happened, the probability I am right is P/b.”
 http://www.stat.ucla.edu/history/essay.pdf
 Thomas Bayes (?)
 c. 1701 –1761 B
 Ω A2
 A1
 If the sample space Ω can be divided in disjoint subsets Ai A3
 A4
 A5 A6

SOS 2021 17

Bayes Theorem in everyday life
Example: 10 coins, one of which is unfair (two-sided tail): You flip a random
coin and obtain tail. What is the probability that this is the unfair coin ?

 A: event where the coin is unfair, B: event where the result is tail

 You want P(A|B):

 where:

 In Bayesian language: P(A) is the prior probability and P(A|B) the posterior
SOS 2021 18

Consequences of not knowing Bayes Th.
 Simple tools for understanding risks: from innumeracy to insight (2003)
 G. Gigerenzer, A. Edwards, BMJ 327, 2003 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC200816/

 Conditional probabilities
 The probability that a woman has breast cancer is
 0.8%. If she has breast cancer, the probability that a
 mammogram will show a positive result is 90%. If a
 woman does not have breast cancer the probability of
 a positive result is 7%. Take, for example, a woman
 who has a positive result. What is the probability
 that she actually has breast cancer?

 Natural frequencies
 Eight out of every 1000 women have breast cancer. Of
 these eight women with breast cancer seven will have
 a positive result on mammography. Of the 992 women
 who do not have breast cancer some 70 will still have a
 “Bad presentation of medical statistics
 positive mammogram. Take, for example, a sample of
 women who have positive mammograms. How many such as the risks associated with a
 of these women actually have breast cancer? particular intervention can lead to
 patients making poor decisions on
 treatment”

SOS 2021 19

Bayes Theorem and statistical inference
 Statistical inference
 Estimate true parameters of a theory or a model using data
 • Frequentist: perform measurement (or set limits)
 • Bayesian: Improve prior knowledge using data

 Going Bayesian
 Likelihood of observing
 these data given a theory
 Posterior
 knowledge on theory Prior knowledge
 on theory

 Usually just a
 normalisation factor
SOS 2021 20

Probability distribution

SOS 2021 21

Probability distribution
 Random variable X
 Discrete random variable: result (realizations) with probability

 → P is the probability distribution and

 For continuous variable: probability of observing x in infinitesimal interval
 → Given by the probability density function (p.d.f) f(x)

 Probability of x in

 with:
 F(x)

 1
 → Cumulative distribution F(x):

 hence:
 0 x
SOS 2021 22

Quantiles
 f(x)

 Probability density function: f(x)

 Cumulative distribution: F(x)=y

 x
 X1/2
 Inverse cumulative distribution: x=F-1(y)
 F(x)

 1
 Median: x such that F(x)=1/2 → x1/2 = F-1(1/2)
1/2

 Quantile of order α: xα = F-1(α)
 0 x
 X1/2 ●
 Ex: quartile, percentile, ...
 x=F-1(y)

X1/2

 y
 0 1/2 1

SOS 2021 23

Expectation value
 Expectation value of a random variable X:
 For a function of x, a(x), the expectation value is:

 - mean of X:

 - nth order moment:

 - Characteristic function Φ(t):

 - Variance:

 - Standard deviation:
SOS 2021 24

Some common distributions
 Binomial law: efficiency, trigger rates, …

 Poisson distribution: counting experiments, hypothesis testing

 Gauss distribution (aka normal): many use-case (asymptotic convergence)

 Cauchy distribution (aka Breit-Wigner): particle decay width, ....

 and not defined (divergent integral)

SOS 2021 25

Cumulative distribution and p-value

 F(x) p-value

 1 1

 0 0 x
 x xsel
 xsel

 One can choose any xsel to compute F(x) or p-value, that is xsel does not have a
 preferred value: it follows the uniform distribution
  The distributions of F(xsel) and p-value are also uniform [proof next page]
  Important for MC sample generation and hypothesis testing

SOS 2021 26

Cumulative distribution and p-value

 F(x) p-value

 1 1

 0 0 x
 x xsel
 xsel

[proof] Given any random continuous variable X, define
 Then:

 is just the cumulative distribution function of a uniform U(0,1) variable.
 → Thus, Y has a uniform distribution on the interval [0,1]
SOS 2021 27

(Silly) use case
Grading copies: Try Cauchy distribution

• 100 copies, grades: 0-20
• Peaked distribution at 10

 Cauchy x0=10 Cumulative Inverse Cumulative
 dist. f(x) γ=1 distribution distribution F-1 (x)
 F(x)

SOS 2021 28

(Silly) use case
Grading copies: Try Cauchy distribution

• 100 copies, grades: 0-20
• Peaked distribution at 10

SOS 2021 29

Data uniformization
(Inverse) cumulative distribution is naturally useful to uniformize data distributions

 For Machine Learning: data preprocessing is usually the 1st step
 → Uniformization of all input variables can sometime be a good idea.

 To know more about data transformation see for example:
 https://scikit-learn.org/stable/modules/preprocessing.html
SOS 2021

χ2 distribution
Pearson’s χ2 test: estimate global compatibility between data and a model
• The data is regrouped in an histogram of N bins
• A goodness-of-fit test K2 is computed as follows

A variant of this test statistics is the Neyman’s χ2

 Easier to code (in particular for fits)
 Asymptotically equivalent to Pearson’s χ2
 Follows χ2 with N-1 degrees of freedom

SOS 2021 31

χ2 distribution
Probability density function
k degrees of freedom, x>0

 K2

 Cumulative distribution

 The p-value of a χ2 test is obtained by integrating
 the χ2 distribution above the measured K2 value.

 Mean = k, variance = 2k
SOS 2021 32

Example
Procedure
- Generate events following a
 Gaussian distribution
- Calculate (Neyman’s) K2
- Repeat 10k time and plot the
 distribution of K2
- Compare to χ2 distribution

 Note:
 K2 is calculated only with non-empty bins NDF=12
 NDF is the number of non-empty bins - 1

SOS 2021 33

Multi-dimensional p.d.f
 An experiment can perform a set of measurement
 → Vector of N measurements

 Probability of observing in infinitesimal interval given by joint p.d.f

 Ex: for a measurement of 2 values x and y

SOS 2021 34

Marginal and conditional p.d.f
 Marginal distribution: p.d.f of one variable regardless of the others

 Conditional distribution: p.d.f of one variable given a constant other

 y
 x Note: k and g are both functions of x and y
SOS 2021 35

Marginal and conditional p.d.f
 Bayes theorem for continuous variables

 →

 Marginal p.d.f can also be expressed with conditional probabilities:

 Note: this is a generalization of the relation
 to continuous variables

 Independent variables: if x and y are independent

 Ex: 2D Gaussian function with uncorrelated variables

SOS 2021 36

Interlude: counting experiment

 What is the meaning of error bars on observed data ?

SOS 2021 37

You can also read

EASY HRT PRESCRIBING GUIDE - BY: DR LOUISE NEWSON BSC(HONS) MBCHB(HONS) MRCP FRCGP - PRIMARY CARE WOMEN'S HEALTH ...

Which of the major stockmarkets are 'cheap' going into 2018? - VEB

2021/22 The London Borough of ...

Appraisal for ground water quality analysis of Ahmedabad City of Gujarat, India.

CBI Trade Statistics: Jewellery

When Should You Adjust Standard Errors for Clustering? - Alberto Abadie, Susan Athey, Guido Imbens, & Jeffrey Wooldridge Session in Honor of Gary ...

National Ballot: Conservative 34.9, Liberal 33.4, NDP 18.9 Conservatives and Liberals gripped in a tight race for national ballot support.

SOURCE EMISSIONS MONITORING - LION CO TOOHEYS - MARCH 2021 R_0 DATE OF RELEASE: 6/04/2021

National Ballot: Liberal 34.1, Conservative 32.0, NDP 20.9 Liberals gain support and Conservatives decline in wake of TVA French debate.

2019 DNA Workshop Requirements

Finance What's Up? Spring 2019

Using data to create unique experiences for each shopper - STORIES

Merck Announces HIV Collaboration with Gilead - March 15, 2021

YSC2229: Introductory Data Structures - and Algorithms

Norwest Market Update - 2nd Half 2020

Enabling a distributed workforce with SD-WAN - Real-world examples of how Citrix is helping businesses deliver secure, uninterrupted connectivity ...

England & Wales mortality monitor - April 2020 - Institute and ...

ASPEKT Method: Supplemental Material - Analysis of Swallowing Physiology: Events, Kinematics & Timing (ASPEKT) - Swallowing Rehabilitation ...

Hong Kong Underpins its Commitment to Improve CRS Tax Transparency - Dion Global ...

THE EFFECT OF THE IMPLEMENTATION OF GOOGLE CLASSROOM DIGITAL MEDIA FOR THE EASY OF TEACHERS IN ASSESSING LEARNING OUTCOMES

The 'New Normal' For Selling To Small Business

Scrap Metal Manager SMM Implementations

A STUDY ON FINANCIAL ANALYSIS OF AUTOMOBILE INDUSTRIES

Vote Intention: Conservatives 34.4, Liberals 33.6, NDP 18.9. Conservatives and Liberals tangled in the lead for national ballot support - #ELXN44 ...

Work Programme FY 2021/22 - Bank for ...

Melbourne Institute Nowcast of Australian GDP & Dating the Business Cycle - February 2021

Private Continual Release of Real-Valued Data Streams - Date: February 26, 2019 Victor Perrier, Hassan Jameel Asghar, Dali Kaafar Macquarie ...

Threat modelling for creating technical information protection system at automated machine building plants - IOPscience

WHAT DOES MCDONALD'S SAY ABOUT YOUR CITY?

Renting Homes (Fees etc.) (Wales) Act 2019

OECD Skills Outlook 2021 Italy

UC Berkeley Recent Work - eScholarship

ASTRONAUT PHOTOGRAPHY OF THE EARTH: A LONG- TERM DATASET FOR EARTH SYSTEM RESEARCH

Special Reconnaissance (SR) Virtual Assessment Event Series

Courier Routing Study - OhioLINK

How are Passive Acoustic Data Used to Inform the Decision-Making Process? - Elizabeth Henderson NIWC Pacific 22 October 2020

Securing Microsoft 365: Advanced Security and the Fortinet Security Fabric

NYDFS Fines First Unum and Paul Revere Insurance Companies $1.8 Million for Violations Arising Out of Data Breaches

2020 Office of Tax and Revenue

Multiscale bi-Harmonic Analysis of Digital Data Bases and Earth moving distances - R. Coifman , Department of Mathematics, program of Applied ...

2020 PRESIDENTIAL ELECTION - The Inside Track - Ipsos

2020 Census Dr. Allison Jackson, Partnership Specialist Hunterdon & Somerset Counties NJ New York Regional Office - Hunterdon County, NJ

Varieties of Single-Factor Experimental Designs

Operational NWP System - ICON - DWD

Advanced Topics: Usage of bwHPC and ForHLR clusters - Samuel Braun, SCC, KIT - www.bwhpc.de - Indico@KIT (Indico)

SIGKDD Honors Career Achievements in Knowledge Discovery and Data Mining

CS1 - Actuarial Statistics - Syllabus for the 2021 exams - Institute and Faculty of Actuaries

EUROFLEETS+ Project INTAROS Research Infrastructures Dialog meeting Aodhán Fitzgerald, Project Coordinator - Integrated Arctic ...

School science practical work in a COVID-19 world: Are demonstrations, videos and textbooks effective replacements for hands-on practical ...

Clare Valley Wine Region - SA Winegrape Crush Survey 2020 Regional Summary Report - Vinehealth Australia