Variable definition

A variable is a property measured on the units of observation.

Dataset definition

A finite sequence of observations representing the measured values of some variable

Types of variables

Discrete (quantitative)

Numerical, takes separate (isolated) values, often counts

Continuous (quantitative)

Numerical, can take any real value within an interval

Nominal (categorical)

Category-like values with no natural order, e.g. Cat, Dog, Horse

Ordinal (categorical)

Ordered categories, in other words “named-numerical” values, e.g. grades: Unsatisfactory, Satisfactory, Good, Excellent


Describing quantitative data

When describing some quantitative data, we usually want to describe two things:

  • The central tendency - where the data is concentrated
  • The dispersion (spread) - how far the extremes reach

In probability theory, these correspond to the expected value (mean) and the variance.

Summary statistic definition

A single numerical value computed from a sample. This statistic summarizes some aspect of the data, e.g. center, spread, or association.

Measures of central tendency

Sample mean (average) definition

The sample mean of observations x_1, …, x_n is x̄ = (x_1 + … + x_n) / n.

Sample median definition

The sample median is the middle value of the ordered sample: the middle ordered observation when n is odd, and the average of the two middle ordered observations when n is even.

A summary statistic is called robust if small changes or a few extreme observations have little influence on its value.
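
A small sketch, on a hypothetical sample, of why the median is robust while the mean is not:

```python
# Hypothetical sample: compare the mean and the median
# before and after one observation becomes an extreme value.
from statistics import mean, median

data = [2, 3, 3, 4, 5, 5, 6]
contaminated = data[:-1] + [600]   # replace one observation with an outlier

print(mean(data), median(data))                  # both describe the center
print(mean(contaminated), median(contaminated))  # mean jumps, median barely moves
```

A single extreme value drags the mean far from the bulk of the data, while the median stays put.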

Breakdown point definition

The smallest fraction of the sample data that has to be replaced with arbitrary extreme values in order to make the statistic take an arbitrarily large (or small) value, i.e. a measure of how resistant the statistic is to data contamination. The mean has breakdown point 0 (a single observation is enough to move it arbitrarily), while the median has breakdown point 1/2.

Trimmed mean definition

The α-trimmed mean is the sample mean computed after discarding the fraction α of the smallest and the fraction α of the largest ordered observations. It is more robust than the plain mean, with breakdown point α.
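
A minimal sketch of a trimmed mean, assuming the convention of dropping ⌊α·n⌋ observations from each tail (other conventions exist):

```python
# Trimmed mean: drop the lowest and highest floor(alpha*n) observations,
# then average what remains.
def trimmed_mean(xs, alpha=0.1):
    xs = sorted(xs)
    k = int(alpha * len(xs))              # number trimmed from each tail
    trimmed = xs[k:len(xs) - k] if k > 0 else xs
    return sum(trimmed) / len(trimmed)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # one extreme value
print(trimmed_mean(sample, alpha=0.1))       # outlier is discarded
```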


Measures of spread (dispersion)

Sample variance and standard deviation definition

The sample variance is s² = (1 / (n − 1)) · Σ (x_i − x̄)², and the standard deviation is s = √s².

Sample variance measures the typical squared distance from the mean. Standard deviation expresses this deviation in the same units as the data.

Why not use absolute values instead of squares? The answer is that we usually prefer smooth (differentiable) functions over non-smooth ones.

It is important to note that the standard deviation is not robust: it is easily inflated by a single extreme value.

Why do we square the deviations?
  • Squaring avoids cancellation between positive and negative differences
  • Squaring emphasizes data points that are further from the mean
Why do we divide by n − 1 instead of n?
  • Dividing by n − 1 gives an unbiased estimate of the true population variance
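
A minimal sketch of the sample variance and standard deviation with the n − 1 divisor, written out directly on a hypothetical sample:

```python
# Sample variance with the n-1 (Bessel) correction.
def sample_variance(xs):
    n = len(xs)
    m = sum(xs) / n                                   # sample mean
    return sum((x - m) ** 2 for x in xs) / (n - 1)    # divide by n-1, not n

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = sample_variance(data)
sd = var ** 0.5          # standard deviation, in the same units as the data
print(var, sd)
```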

Quartile and Interquartile range definition

A quartile is each of the four equally-sized groups that the data sample can be divided into according to the distribution of a particular variable.

Q1 denotes the split point of the first quartile from the rest (the median of the lower half of the data). Q3 denotes the split point of the fourth quartile from the rest (the median of the upper half of the data).

The interquartile range is then defined as the difference IQR = Q3 − Q1. It measures the spread of the central half of the data and is much less affected by outliers than the standard deviation. Hence, the IQR is more robust than the standard deviation.
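
A sketch of computing Q1, Q3 and the IQR using the median-of-halves convention assumed above (other conventions exist, e.g. NumPy's default interpolation gives slightly different values):

```python
# Q1 and Q3 as medians of the lower and upper halves of the sorted data.
from statistics import median

def quartiles(xs):
    xs = sorted(xs)
    n = len(xs)
    lower = xs[: n // 2]           # lower half (overall median excluded if n is odd)
    upper = xs[(n + 1) // 2 :]     # upper half
    return median(lower), median(upper)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)
```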

The 1.5·IQR rule definition

The 1.5·IQR rule states the following: an observation is flagged as a potential outlier if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.

IQR-based “fences” are robust - they depend on medians, not on extreme values.

  • For symmetric, light-tailed (approximately normal) data, IQR ≈ 1.35·s
  • For heavy-tailed or contaminated data, the IQR stays stable, while s can explode
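
A sketch of flagging outliers with the 1.5·IQR fences on a hypothetical sample (quartiles via the median-of-halves convention):

```python
# Compute the 1.5*IQR fences and flag observations outside them.
from statistics import median

def iqr_fences(xs):
    xs = sorted(xs)
    n = len(xs)
    q1 = median(xs[: n // 2])          # median of the lower half
    q3 = median(xs[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 12, 13, 14, 15, 15, 16, 48]
lo, hi = iqr_fences(data)
outliers = [x for x in data if x < lo or x > hi]
print(lo, hi, outliers)
```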

Visualizing data

Histograms and box plots are used to visually represent the data.

When plotting histograms, it is important to keep the number of bins balanced, so as not to hide the overall pattern. Common choices for the bin width h (and consequently the number of bins k):

  • Sturges: k = ⌈log₂ n⌉ + 1
  • Scott: h = 3.49 · s · n^(−1/3)
  • Freedman-Diaconis: h = 2 · IQR · n^(−1/3)
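
The three rules can be written out directly; a sketch on a toy sample, using the sample standard deviation and the median-of-halves IQR:

```python
# Sturges gives a bin count; Scott and Freedman-Diaconis give bin widths.
import math
from statistics import median, stdev

def sturges_bins(xs):
    return math.ceil(math.log2(len(xs))) + 1

def scott_width(xs):
    return 3.49 * stdev(xs) * len(xs) ** (-1 / 3)

def freedman_diaconis_width(xs):
    xs = sorted(xs)
    n = len(xs)
    iqr = median(xs[(n + 1) // 2 :]) - median(xs[: n // 2])
    return 2 * iqr * n ** (-1 / 3)

data = list(range(100))   # toy uniform sample
print(sturges_bins(data), scott_width(data), freedman_diaconis_width(data))
```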

Empirical cumulative distribution function (ECDF) definition

The ECDF of a sample x_1, …, x_n is F̂_n(x) = (number of observations ≤ x) / n.

Properties
  • F̂_n is a non-decreasing, right-continuous step function
  • Each ordered observation increases F̂_n by 1/n

At any point x, F̂_n(x) estimates the probability P(X ≤ x)
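
A minimal sketch of the ECDF as the fraction of observations at or below a point:

```python
# Build the ECDF of a sample as a callable function.
def ecdf(xs):
    xs = sorted(xs)
    n = len(xs)
    def F(x):
        # fraction of observations less than or equal to x
        return sum(1 for v in xs if v <= x) / n
    return F

F = ecdf([1, 2, 2, 3, 5])
print(F(0), F(2), F(5))   # 0.0, 0.6, 1.0
```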

p-quantile from ECDF definition

The p-quantile is the smallest value x such that F̂_n(x) ≥ p

  • For p = 0.5 the quantile is the median
  • For p = 0.25 and p = 0.75 the quantiles are Q1 and Q3 respectively
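
A sketch of reading the p-quantile off the ECDF: the smallest ordered observation at which the ECDF reaches p.

```python
# p-quantile: smallest ordered observation x with F(x) >= p.
def quantile(xs, p):
    xs = sorted(xs)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= p:        # the ECDF jumps to i/n at the i-th ordered value
            return x
    return xs[-1]

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(quantile(data, 0.5), quantile(data, 0.25), quantile(data, 0.75))
```

Note that this ECDF-based quantile need not match the median-of-halves quartiles exactly; different conventions interpolate differently.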
Comparing two ECDFs
  • A horizontal shift between two ECDFs indicates a difference in location - the dataset whose ECDF is to the right tends to have larger values
  • A difference in steepness (a slope of steps) indicates a difference in spread - the steeper ECDF corresponds to a more concentrated distribution
  • If one ECDF is consistently below the other, the corresponding sample tends to produce larger observations (stochastic dominance)

Relationships between two variables

  • Positive - both variables grow together
  • Negative - one grows while the other falls
  • None - no visible pattern
  • Spurious - both depend on some hidden factor

To show these relations visually, we can use a scatterplot.

Scatterplot

  • The horizontal axis shows one variable (sometimes called the explanatory variable)
  • The vertical axis shows the other variable (sometimes called the response variable)
  • The overall pattern reveals the direction, form, and strength of the relationship

Note that scatterplots only show association, not causation! Sometimes we wish to quantify the relationship with a single number instead.

Sample correlation coefficient definition

r = Σ (x_i − x̄)(y_i − ȳ) / ( √(Σ (x_i − x̄)²) · √(Σ (y_i − ȳ)²) )

  • The numerator measures how x and y move together
  • The denominator rescales it so that −1 ≤ r ≤ 1

Note that non-linear trends cannot be observed using the sample correlation coefficient!
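
The definition can be computed directly; a sketch on a toy sample with a perfect linear relationship:

```python
# Sample correlation coefficient r, computed from its definition.
def correlation(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # numerator
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)                                   # rescaled to [-1, 1]

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # perfectly linear, so r = 1
print(correlation(xs, ys))
```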

From association to prediction

Having an observable trend, we can try to predict the value of the response variable given the value of the explanatory variable. Let us discuss linear trends and, correspondingly, linear regression. We can make predictions using the fitted line.

Fitted line definition

Given a linear trend, a fitted line ŷ = b_0 + b_1·x is a line that summarizes this trend, i.e. captures its central tendency, not the details. The fitted line provides the prediction ŷ_i for each x_i. The difference between the actual value and the predicted one, e_i = y_i − ŷ_i, is called the residual. Residuals measure how far the actual data points are from the fitted line. To obtain the best fitted line we use the so-called least-squares fit; the goal of this regression is to minimize the sum of squared residuals Σ e_i².
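
A sketch of the least-squares fit using the closed-form solution for simple linear regression, on a toy sample that lies exactly on a line:

```python
# Least-squares fit of y = b0 + b1*x:
# b1 = sum((x - mx)(y - my)) / sum((x - mx)^2),  b0 = my - b1*mx.
def least_squares(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]              # exactly y = 1 + 2x
b0, b1 = least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, residuals)       # residuals vanish for a perfect linear trend
```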

Assessing the result

The coefficient of determination definition

The coefficient of determination is defined as R² = 1 − SS_res / SS_tot, where SS_res = Σ (y_i − ŷ_i)² is the residual sum of squares and SS_tot = Σ (y_i − ȳ)² is the total sum of squares.

  • R² is the proportion of the total variation that is explained by the fitted line
  • For simple linear regression, R² = r²

The higher the coefficient of determination, the better the predictive ability. r measures the linear association, while R² measures how well the fitted line fits.
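
A sketch of R² from its definition, on hypothetical actual and predicted values:

```python
# R^2 = 1 - SS_res / SS_tot: the fraction of total variation
# explained by the fitted line.
def r_squared(ys, preds):
    my = sum(ys) / len(ys)
    ss_res = sum((y, p) and (y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [2, 4, 5, 8]
preds = [2.1, 3.9, 5.5, 7.5]   # hypothetical predictions from some fitted line
print(r_squared(ys, preds))    # close to 1: the line explains most variation
```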
Verifying the residuals

Using a scatterplot of the residuals e_i against the fitted values ŷ_i, we can analyze the pattern they form and verify whether a fitted line is appropriate for the data.

  • If the residuals are randomly scattered around 0, the line fits well
  • A pattern (curve, funnel, etc.) indicates problems - nonlinearity or heteroscedasticity

If a pattern is observed, applying a variable transformation (e.g. log y or √y) may help. If not, a nonlinear model is probably necessary.

But what if the pattern is not really a curve, funnel, etc., but a changing spread, meaning there is a relation between the spread of the residuals and the fitted values? In this case, a transformation or a weighted least-squares fit might help. If not, we can still fall back to a nonlinear model.

Effect of outliers on least-squares fit

The least-squares fit uses squared deviations; this means that one large error has a huge impact on the total.

  • Outliers can distort both the fitted line and the correlation
  • Outliers might also inflate or deflate R²

The most important thing here is to perform a data cleanup - fixing or removing data errors. If the outlier is in fact an actual value and not a data error, use robust regression models.