The breakdown point is the smallest fraction of the sample data that has to be replaced with arbitrarily extreme values in order to make the statistic take an arbitrarily large (or small) value, i.e. a measure of how resistant a statistic is to data contamination
A quartile is each of the four equally sized groups that the data sample can be divided into according to the distribution of a particular variable
Q1 denotes the split-point of the first quartile from the rest (the median of the lower half of the data)
Q3 denotes the split-point of the fourth quartile from the rest (the median of the upper half of the data)
The interquartile range is then defined as the difference IQR = Q3 − Q1. It measures the spread of the central half of the data and is much less affected by outliers than the standard deviation.
Hence, the IQR is more robust than the standard deviation
IQR-based “fences” (e.g. Q1 − 1.5·IQR and Q3 + 1.5·IQR) are robust - they depend on medians, not on extreme values
For symmetric, light-tailed (roughly normal) data, IQR ≈ 1.35·s
For heavy-tailed or contaminated data, the IQR stays stable, while the standard deviation s can explode
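As a sketch of the robustness claim, the following pure-Python snippet computes the IQR with the median-of-halves quartile rule described above (the helper names are my own) and contaminates one observation:

```python
# How the IQR resists contamination while the standard deviation does not.
import statistics

def quartiles(data):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + (n % 2):]  # skip the middle value when n is odd
    return statistics.median(lower), statistics.median(upper)

def iqr(data):
    q1, q3 = quartiles(data)
    return q3 - q1

clean = list(range(1, 21))            # 1..20
contaminated = clean[:-1] + [10_000]  # one value replaced by an extreme one

# The standard deviation explodes, the IQR does not move at all here.
print(statistics.stdev(clean), statistics.stdev(contaminated))
print(iqr(clean), iqr(contaminated))
```

A single contaminated point out of twenty inflates the standard deviation by a factor of several hundred, while both samples keep IQR = 10.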
Visualizing data
Histograms and box plots to visually represent the data
When plotting histograms, it is important to keep the number of bins balanced, so as not to hide the overall pattern
Common choices for bin width (and consequently the number of bins)
Sturges: number of bins k = 1 + log₂(n)
Scott: bin width h = 3.49 · s · n^(−1/3)
Freedman-Diaconis: bin width h = 2 · IQR · n^(−1/3)
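The three rules can be sketched in a few lines (the function names are mine; `statistics.quantiles` supplies the quartiles for the Freedman-Diaconis rule):

```python
# The standard bin-count / bin-width rules for histograms.
import math
import statistics

def sturges_bins(data):
    # Sturges: number of bins k = 1 + log2(n)
    return math.ceil(1 + math.log2(len(data)))

def scott_width(data):
    # Scott: bin width h = 3.49 * s * n^(-1/3)
    return 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)

def fd_width(data):
    # Freedman-Diaconis: bin width h = 2 * IQR * n^(-1/3)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

data = list(range(100))
print(sturges_bins(data))  # -> 8 bins for n = 100
print(scott_width(data), fd_width(data))
```

Note that Sturges gives a bin count directly, while Scott and Freedman-Diaconis give a bin width, from which the count follows as (range of the data) / h.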
Empirical cumulative distribution function (ECDF) definition: F_n(x) = (number of observations x_i ≤ x) / n
Properties
F_n(x) is a non-decreasing, right-continuous step function that jumps by 1/n at each observation
A horizontal shift between two ECDFs indicates a difference in location - the dataset whose ECDF is to the right tends to have larger values
A difference in steepness (a slope of steps) indicates a difference in spread - the steeper ECDF corresponds to a more concentrated distribution
If one ECDF lies consistently below the other, the sample whose ECDF is lower tends to produce larger observations (stochastic dominance)
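A minimal sketch of the definition (function names are illustrative):

```python
# ECDF: F_n(x) = (number of observations <= x) / n.
def ecdf(data):
    s = sorted(data)
    n = len(s)
    def f(x):
        # right-continuous step function: fraction of points <= x
        return sum(1 for v in s if v <= x) / n
    return f

f = ecdf([1, 2, 2, 3, 5])
print(f(0), f(2), f(5))  # 0.0 0.6 1.0
```

The function is 0 below the smallest observation, 1 at and above the largest, and steps upward in between, matching the properties listed above.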
Relationships between two variables
Positive - both grow
Negative - one grows, the other falls
None - no visible pattern
Spurious - both depend on some hidden factor
To show these relations visually, we can use a scatterplot
Scatterplot
Horizontal axis shows one variable (sometimes called an explanatory variable)
Vertical axis shows the other variable (sometimes called a response variable)
The overall pattern reveals direction, form, and strength of the relationship
An example of a positive relationship on a scatterplot: the points rise from the lower left to the upper right
Note that scatterplots only show association, not causation!
Sometimes we wish to quantify the relationship with a single number instead.
Sample correlation coefficient
r = Σ(x_i − x̄)(y_i − ȳ) / ((n − 1) · s_x · s_y)
The numerator measures how x and y move together
The denominator rescales it so that −1 ≤ r ≤ 1
Note that non-linear relationships cannot be captured by the sample correlation coefficient!
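A sketch of the formula r = Σ(x_i − x̄)(y_i − ȳ) / ((n − 1)·s_x·s_y), plus an example of a strong nonlinear relationship that r misses entirely (the function name is mine):

```python
# Sample correlation coefficient, computed from its definition.
import statistics

def corr(x, y):
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

x = [1, 2, 3, 4, 5]
print(corr(x, [2 * v + 1 for v in x]))  # close to 1.0: perfect linear trend

# A symmetric parabola: strong association, yet r = 0
x2 = [-2, -1, 0, 1, 2]
print(corr(x2, [v * v for v in x2]))  # 0.0: no *linear* trend to measure
```

The parabola example is the standard cautionary case: the relationship is deterministic, but the positive and negative co-movements cancel in the numerator.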
From association to prediction
Having an observable trend, we can try to predict the value of a response variable given the value of an explanatory variable
Let us discuss linear trends and, correspondingly, linear regression.
We can make predictions using the fitted line ŷ = b0 + b1·x
Given a linear trend, a fitted line is a line that summarizes this trend, i.e. captures its central tendency, not the details.
The fitted line provides the predicted value ŷ_i for each x_i
The difference between the actual value and the predicted one is called the residual, e_i = y_i − ŷ_i
Residual measures how far the actual data points are from the fitted line
To obtain the best fitted line we use the so-called least-squares fit; the goal of this regression is to minimize the sum of squared residuals, Σ e_i².
The coefficient of determination is defined as R² = 1 − Σ e_i² / Σ(y_i − ȳ)²
R² is the proportion of the total variation that is explained by the fitted line
For simple regression, R² = r²
The higher the coefficient of determination, the better the predictive ability.
r measures the linear association, R² measures how well the fitted line fits.
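The least-squares fit and R² can be sketched directly from the definitions (helper names are mine; b0 is the intercept and b1 the slope, with b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²):

```python
# Least-squares line and coefficient of determination, from scratch.
import statistics

def least_squares(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx          # the line passes through (x-bar, y-bar)
    return b0, b1

def r_squared(x, y, b0, b1):
    my = statistics.mean(y)
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x, with noise
b0, b1 = least_squares(x, y)
print(b0, b1, r_squared(x, y, b0, b1))
```

For this nearly-linear toy data the slope comes out close to 2 and R² is above 0.99, i.e. the line explains almost all of the variation.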
Verifying the residuals
Using a scatterplot of residuals e_i against fitted values ŷ_i, we can analyze the pattern they form and verify whether a fitted line is appropriate for the data.
If residuals are random, scattered around 0, then the line fits well
A pattern (curve, funnel, etc.) indicates problems - nonlinearity or heteroscedasticity
If a pattern is observed, it is possible that applying a variable transformation would help, e.g. log y, √y, or 1/y
If not, a nonlinear model is probably necessary
But what if the pattern is not really a curve, funnel, etc., but a changing spread, meaning there is a relation between the spread of the residuals e_i and the fitted values ŷ_i?
In this case, a variable transformation or a weighted least-squares fit might help
If not, we can still fall back to a nonlinear model
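A numeric sketch of the residual check (names are illustrative): residuals of an appropriate fit scatter randomly around 0, while forcing a line through curved data leaves a systematic pattern.

```python
# Residuals of a straight line fitted to clearly nonlinear data.
import statistics

def residuals(x, y, b0, b1):
    """Residuals e_i = y_i - (b0 + b1 * x_i) of a fitted line."""
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]     # a parabola
# For symmetric x the least-squares line is flat: slope 0, intercept mean(y)
res = residuals(x, y, statistics.mean(y), 0.0)
print(res)  # [2.0, -1.0, -2.0, -1.0, 2.0] - a curve, not random scatter
```

The sign pattern +, −, −, −, + is exactly the kind of systematic curvature a residual plot reveals, signaling that a linear model is inappropriate here.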
Effect of outliers on least-squares fit
The least-squares fit uses squared deviations; this means that one large error has a huge impact on the total
Outliers can distort both the fitted line and the correlation r
Outliers might also inflate or deflate R²
The most important thing here is to perform a data cleanup - fixing or removing data errors
If the outlier is in fact an actual value and not a data error - use robust regression models
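As a sketch of the squared-deviation effect (plain least squares, function name mine): replacing a single y value with an outlier swings the fitted slope dramatically.

```python
# One outlier vs. the least-squares slope.
import statistics

def slope(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

x = list(range(10))
y = [2 * v for v in x]     # perfect line, slope 2
y_bad = y[:-1] + [100]     # last point replaced with an outlier

print(slope(x, y), slope(x, y_bad))  # slope jumps from 2.0 to about 6.5
```

One contaminated point out of ten more than triples the slope, which is why cleanup or a robust regression method matters before trusting a least-squares fit.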