class: inverse, left, bottom
background-image: url("./figs/logo.png")
background-position: 7% 2%
background-size: 120px 70px

<img src="slides_files/figure-html/title_fig-1.png" width="80%" style="display: block; margin: auto;" />

### Statistics with R
#### Lesson 01: Introduction to statistics

---

## Today

<br>

.huge[
* Introduction to statistics
]

--

.huge[
* Measures of central tendency and dispersion
]

---
class: inverse, middle, center

# Introduction to statistics

---

## Population and sample

<img src="slides_files/figure-html/p1-1.png" width="80%" style="display: block; margin: auto;" />

---

## Population and sample

<img src="slides_files/figure-html/p2-1.png" width="80%" style="display: block; margin: auto;" />

---

## Population and sample

<img src="slides_files/figure-html/p3-1.png" width="80%" style="display: block; margin: auto;" />

---

## Population and sample

.center[![](./figs/sample_model_population.png)]

---

## Frequency distributions

<img src="slides_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" />

---

## Frequency distributions

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

.center[Positively skewed]
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />

.center[Negatively skewed]
]

---
class: inverse, middle, center

# Measures of central tendency and dispersion

---
class: middle, center

# Central tendency

.large[Indicates the typical or central value in a distribution]

---

## Mode

* The most frequent value

--

<img src="slides_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" />

---

## Mode

* Is the most frequent value

* A dataset can have two (bimodal) or more (multimodal) modes

<img src="slides_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />

---

## Mode

* Is the most
frequent value

* A dataset can have two (bimodal) or more (multimodal) modes

* Hardly ever used

---

## Mean

* Is the average score

--

`$$\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i$$`

--

* Example: let's take a vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

--

`$$\sum_{i=1}^nX_i = 4 + 9 + 7 + 12 + 5 + 3 + 6 + 2 = 48$$`

--

`$$\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i = \frac{1}{8} \cdot 48 = 6$$`

--

```r
mean(x)
```

```
## [1] 6
```

---

## Mean

* Is influenced by extreme scores

--

```r
x <- c(x, 57)
x
```

```
## [1]  4  9  7 12  5  3  6  2 57
```

--

```r
mean(x)
```

```
## [1] 11.66667
```

---

## Median

* Is the middle score when the values are ranked in order of magnitude

--

* Example: let's take a vector `y`

```r
y <- c(3, 9, 12, 4, 1, 8, 6)
```

--

```r
sort(y)
```

```
## [1]  1  3  4  6  8  9 12
```

--

```r
median(y)
```

```
## [1] 6
```

---

## Median

* What if the vector has an even number of values?

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
sort(x)
```

```
## [1]  2  3  4  5  6  7  9 12
```

--

* In this case, the median is the average of the two "middle" values

```r
(5 + 6) / 2
```

```
## [1] 5.5
```

```r
median(x)
```

```
## [1] 5.5
```

---

## Median

* Is much less influenced by extreme scores

--

```r
x <- c(x, 57)
sort(x)
```

```
## [1]  2  3  4  5  6  7  9 12 57
```

--

```r
median(x)
```

```
## [1] 6
```

---

## Mean vs. Median

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: middle, center

# Dispersion

.large[Indicates the spread of a distribution]

---

## Range

* The difference between the highest and lowest values in a vector

--

* Example: let's take our old friend again, the vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

--

```r
range(x)
```

```
## [1]  2 12
```

--

```r
diff(range(x))
```

```
## [1] 10
```

--

* Is influenced by extreme scores

--

```r
x <- c(x, 57)
diff(range(x))
```

```
## [1] 55
```

---

## Interquartile range

* Divide the sorted vector into 4 parts of equal size

--

* Cut off the top and bottom 25% of values

--

* Calculate the range of the middle 50%

--

* Can be represented in a boxplot

<img src="./figs/boxplot.png" width="70%" style="display: block; margin: auto;" />

---

## Interquartile range

* Divide the sorted vector into 4 parts of equal size

* Cut off the top and bottom 25% of values

* Calculate the range of the middle 50%

* Can be represented in a boxplot

<img src="./figs/boxplot_annotated.png" width="70%" style="display: block; margin: auto;" />

---

## Interquartile range

* To calculate it in R:

--

```r
IQR(x)
```

```
## [1] 5
```

--

* It returns Q3 - Q1

--

* If we want the values of the quartiles:

```r
quantile(x)
```

```
##   0%  25%  50%  75% 100% 
##    2    4    6    9   57 
```

---

## Mean error

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```
]

---

## Mean error

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line:
`\(\bar{X}\)`
]

---

## Mean error

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `\(\bar{X}\)`

* Red dashed line: `\(e_i = X_i - \bar{X}\)`
]

---

## Mean error

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-35-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `\(\bar{X}\)`

* Red dashed line: `\(e_i = X_i - \bar{X}\)`

```r
e <- x - mean(x)
e
```

```
## [1] -2  3  1  6 -1 -3  0 -4
```
]

---

## Mean error

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `\(\bar{X}\)`

* Red dashed line: `\(e_i = X_i - \bar{X}\)`

```r
e <- x - mean(x)
e
```

```
## [1] -2  3  1  6 -1 -3  0 -4
```

* We can then take the mean of the errors
]

---

## Mean error

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-41-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `\(\bar{X}\)`

* Red dashed line: `\(e_i = X_i - \bar{X}\)`

```r
e <- x - mean(x)
e
```

```
## [1] -2  3  1  6 -1 -3  0 -4
```

* We can then take the mean of the errors

```r
mean(e)
```

```
## [1] 0
```
]

---

## Mean error

<br>
<br>

.center[
.large[An error with mean 0...]

.huge[🤔]

.large[Does that mean 0 dispersion?]
]

--

.center[
.large[Clearly not!]
]

--

.center[
.large[So, we must have an alternative...]
]

---

## Variance

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-45-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* We can square the errors, which has two consequences:
]

---

## Variance

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-46-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* We can square the errors, which has two consequences:

`1.` The error terms all become non-negative

```r
e^2
```

```
## [1]  4  9  1 36  1  9  0 16
```
]

---

## Variance

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-48-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* We can square the errors, which has two consequences:

`1.` The error terms all become non-negative

```r
e^2
```

```
## [1]  4  9  1 36  1  9  0 16
```

`2.` Larger errors are penalized more
]

---

## Variance

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* We can square the errors, which has two consequences:

`1.` The error terms all become non-negative

```r
e^2
```

```
## [1]  4  9  1 36  1  9  0 16
```

`2.` Larger errors are penalized more
]

---

## Variance

* OK, now that we have squared the errors, how can we calculate the variance?

--

`$$s^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2$$`

---

## Variance

* OK, now that we have squared the errors, how can we calculate the variance?

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$`

--

* Why `\(n-1\)`?
--

* To use the error in the sample to estimate the error in the population

--

<br>

.center[.large[Degrees of freedom]]

---

## Degrees of freedom

* The number of observations that are free to vary

<br>

--

<img src="slides_files/figure-html/unnamed-chunk-52-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

<br>
<br>

<img src="slides_files/figure-html/unnamed-chunk-53-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

<br>

<img src="slides_files/figure-html/unnamed-chunk-54-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

* *e.g.,* `\(\bar{X} = 10\)` - with this parameter fixed, can all scores vary?

<img src="slides_files/figure-html/unnamed-chunk-55-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

* *e.g.,* `\(\bar{X} = 10\)` - with this parameter fixed, can all scores vary?

<img src="slides_files/figure-html/unnamed-chunk-56-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

* *e.g.,* `\(\bar{X} = 10\)` - with this parameter fixed, can all scores vary?

<img src="slides_files/figure-html/unnamed-chunk-57-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

* *e.g.,* `\(\bar{X} = 10\)` - with this parameter fixed, can all scores vary?
<img src="slides_files/figure-html/unnamed-chunk-58-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

* *e.g.,* `\(\bar{X} = 10\)` - with this parameter fixed, can all scores vary?

<img src="slides_files/figure-html/unnamed-chunk-59-1.png" width="60%" style="display: block; margin: auto;" />

---

## Degrees of freedom

* The number of observations that are free to vary

* Assume `\(\bar{X}_{sample} = \bar{X}_{population}\)`

* *e.g.,* `\(\bar{X} = 10\)` - with this parameter fixed, can all scores vary?

<img src="slides_files/figure-html/unnamed-chunk-60-1.png" width="60%" style="display: block; margin: auto;" />

---

## Variance

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$`

* How do we calculate it in R?

--

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

---

## Variance

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$`

* How do we calculate it in R?

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
e <- x - mean(x)
```

---

## Variance

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$`

* How do we calculate it in R?

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
e <- x - mean(x)
sum(e^2) / (length(x) - 1)
```

```
## [1] 10.85714
```

---

## Variance

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$`

* How do we calculate it in R?

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
e <- x - mean(x)
sum(e^2) / (length(x) - 1)
```

```
## [1] 10.85714
```

--

* OR

```r
var(x)
```

```
## [1] 10.85714
```

---

## Variance

.center[.middle2[.large[There is a limitation: it gives a measure in squared units]]]

---

## Standard deviation

`$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2}$$`

--

* How do we calculate it in R?
```r
sd(x)
```

```
## [1] 3.295018
```

--

* *e.g.,* Height:

--

  * Variance: `\(1.73m \pm 0.0144m^2\)`

--

  * Standard deviation: `\(1.73m \pm 0.12m\)`

---

## Dispersion measures

* Range, interquartile range, variance and standard deviation

--

* These are measures of variation in the SAMPLE

--

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-67-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-68-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Dispersion measures

.center[.middle2[.large[What about the POPULATION?]]]

---

## Standard error of the mean

.pull-left[
.center[Population mean]

`$$\mu = 10$$`
]

<img src="slides_files/figure-html/unnamed-chunk-69-1.png" width="70%" style="display: block; margin: auto;" />

---

## Standard error of the mean

.pull-left[
.center[Population mean]

`$$\mu = 10$$`
]

.pull-right[
.center[Sample mean]

`$$\bar{X} = 8$$`
]

<img src="slides_files/figure-html/unnamed-chunk-70-1.png" width="70%" style="display: block; margin: auto;" />

---

## Standard error of the mean

.pull-left[
.center[Population mean]

`$$\mu = 10$$`
]

.pull-right[
.center[Sample mean]

`$$\bar{X} = 13$$`
]

<img src="slides_files/figure-html/unnamed-chunk-71-1.png" width="70%" style="display: block; margin: auto;" />

---

## Standard error of the mean

.pull-left[
.center[Population mean]

`$$\mu = 10$$`
]

.pull-right[
.center[Sample mean]

`$$\bar{X} = 7$$`
]

<img src="slides_files/figure-html/unnamed-chunk-72-1.png" width="70%" style="display: block; margin: auto;" />

---

## Standard error of the mean

.pull-left[
.center[Population mean]

`$$\mu = 10$$`
]

.pull-right[
.center[Sample mean]

`$$\bar{X} = 9$$`
]

<img src="slides_files/figure-html/unnamed-chunk-73-1.png" width="70%" style="display: block; margin: auto;" />

---

## Standard error of the mean

<br>

<img src="slides_files/figure-html/unnamed-chunk-74-1.png" width="60%" style="display: block; margin: auto;" />

---

## Standard error of the mean

--

.pull-left[
* `\(\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i\)`
]

--

.pull-right[
* `\(Var(aX) = a^2Var(X)\)`
]

--

`$$\begin{align} Var(\frac{1}{n}\sum_{i=1}^nX_i) &= \frac{1}{n^2}\sum_{i=1}^nVar(X_i) \\ \end{align}$$`

---

## Standard error of the mean

.pull-left[
* `\(\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i\)`
]

.pull-right[
* `\(Var(aX) = a^2Var(X)\)`
]

`$$\begin{align} Var(\frac{1}{n}\sum_{i=1}^nX_i) &= \frac{1}{n^2}\sum_{i=1}^nVar(X_i) \\ &= \frac{1}{n^2}\sum_{i=1}^ns^2 \\ \end{align}$$`

---

## Standard error of the mean

.pull-left[
* `\(\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i\)`
]

.pull-right[
* `\(Var(aX) = a^2Var(X)\)`
]

`$$\begin{align} Var(\frac{1}{n}\sum_{i=1}^nX_i) &= \frac{1}{n^2}\sum_{i=1}^nVar(X_i) \\ &= \frac{1}{n^2}\sum_{i=1}^ns^2 \\ &= \frac{n}{n^2}s^2 \\ \end{align}$$`

---

## Standard error of the mean

.pull-left[
* `\(\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i\)`
]

.pull-right[
* `\(Var(aX) = a^2Var(X)\)`
]

`$$\begin{align} Var(\frac{1}{n}\sum_{i=1}^nX_i) &= \frac{1}{n^2}\sum_{i=1}^nVar(X_i) \\ &= \frac{1}{n^2}\sum_{i=1}^ns^2 \\ &= \frac{n}{n^2}s^2 \\ &= \frac{s^2}{n} \end{align}$$`

---

## Standard error of the mean

.pull-left[
* `\(\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i\)`
]

.pull-right[
* `\(Var(aX) = a^2Var(X)\)`
]

`$$\begin{align} Var(\frac{1}{n}\sum_{i=1}^nX_i) &= \frac{1}{n^2}\sum_{i=1}^nVar(X_i) \\ &= \frac{1}{n^2}\sum_{i=1}^ns^2 \\ &= \frac{n}{n^2}s^2 \\ &= \frac{s^2}{n} \end{align}$$`

<br>

`$$\sigma_{\bar{X}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$$`

---

## Standard error of the mean

* How can we compute it in R?
--

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
sd(x) / sqrt(length(x))
```

```
## [1] 1.164965
```

--

* R does NOT have a built-in function to compute `\(\sigma_{\bar{X}}\)`

--

<br>

But we can use the code above to define one

```r
se <- function(x) sd(x) / sqrt(length(x))
```

---

## Standard error of the mean

`$$\sigma_{\bar{X}} = \frac{s}{\sqrt{n}}$$`

* Larger samples have lower `\(\sigma_{\bar{X}}\)`

--

```r
x1 <- rnorm(10)
x2 <- rnorm(100)
x3 <- rnorm(1000)
```

--

```r
se(x1)
```

```
## [1] 0.2851783
```

```r
se(x2)
```

```
## [1] 0.09098424
```

```r
se(x3)
```

```
## [1] 0.03010591
```

---

## A different approach

<br>

* `\(\sigma_{\bar{X}}\)` is the standard deviation of the sampling distribution of the mean

--

* But we can also establish limits within which we believe the true value of the mean lies

--

<img src="slides_files/figure-html/unnamed-chunk-79-1.png" width="40%" style="display: block; margin: auto;" />

--

.center[.large[Confidence interval]]

---

## Confidence interval

<br>
<br>
<br>
<br>

.center[.large[Typical value: 95% confidence interval]]

--

.center[.large[How do we calculate these limits?]]

---

## Confidence interval

.center[Normal distribution: *z*-scores]

`$$\mu = 0; \sigma = 1$$`

<img src="slides_files/figure-html/unnamed-chunk-80-1.png" width="50%" style="display: block; margin: auto;" />

---

## Confidence interval

.center[Normal distribution: *z*-scores]

`$$\mu = 0; \sigma = 1$$`

<img src="slides_files/figure-html/unnamed-chunk-81-1.png" width="50%" style="display: block; margin: auto;" />

---

## Confidence interval

.center[Normal distribution: *z*-scores]

`$$\mu = 0; \sigma = 1$$`

<img src="slides_files/figure-html/unnamed-chunk-82-1.png" width="50%" style="display: block; margin: auto;" />

---

## Confidence interval

.center[Normal distribution: *z*-scores]

`$$\mu = 0; \sigma = 1$$`

<img src="slides_files/figure-html/unnamed-chunk-83-1.png" width="50%" style="display: block; margin: auto;" />

---

## Confidence interval

* Our mean and standard deviation
almost certainly won't be 0 and 1...

--

* But:

`$$z = \frac{X - \bar{X}}{s}$$`

--

`$$1.96 = \frac{X - \bar{X}}{s}$$`

--

`$$X - \bar{X} = 1.96 \cdot s$$`

--

`$$X = \bar{X} + 1.96 \cdot s$$`

--

`$$X = \bar{X} - 1.96 \cdot s$$`

--

* So, the confidence interval:

`$$CI_{95\%} = \bar{X} \pm 1.96 \cdot s$$`

---

## Confidence interval

* Our mean and standard deviation almost certainly won't be 0 and 1...

* But:

`$$z = \frac{X - \bar{X}}{s}$$`

`$$1.96 = \frac{X - \bar{X}}{s}$$`

`$$X - \bar{X} = 1.96 \cdot s$$`

`$$X = \bar{X} + 1.96 \cdot s$$`

`$$X = \bar{X} - 1.96 \cdot s$$`

* So, the confidence interval:

`$$CI_{95\%} = \bar{X} \pm 1.96 \cdot \sigma_{\bar{X}}$$`

---

## Confidence interval

`$$CI_{95\%} = \bar{X} \pm 1.96 \cdot \sigma_{\bar{X}}$$`

* How do we calculate it in R?

--

```r
x <- rnorm(100, mean = 10, sd = 2)
```

---

## Confidence interval

`$$CI_{95\%} = \bar{X} \pm 1.96 \cdot \sigma_{\bar{X}}$$`

* How do we calculate it in R?

```r
x <- rnorm(100, mean = 10, sd = 2)
lower_limit <- mean(x) - 1.96 * (sd(x) / sqrt(length(x)))
```

---

## Confidence interval

`$$CI_{95\%} = \bar{X} \pm 1.96 \cdot \sigma_{\bar{X}}$$`

* How do we calculate it in R?

```r
x <- rnorm(100, mean = 10, sd = 2)
lower_limit <- mean(x) - 1.96 * (sd(x) / sqrt(length(x)))
upper_limit <- mean(x) + 1.96 * (sd(x) / sqrt(length(x)))
```

--

```r
lower_limit
```

```
## [1] 9.632801
```

```r
upper_limit
```

```
## [1] 10.44809
```

---

## Confidence interval

`$$CI_{95\%} = \bar{X} \pm 1.96 \cdot \sigma_{\bar{X}}$$`

* How do we calculate it in R?

* We can put it in a function

--

```r
ci <- function(x) {
  lower_limit <- mean(x) - 1.96 * (sd(x) / sqrt(length(x)))
  upper_limit <- mean(x) + 1.96 * (sd(x) / sqrt(length(x)))
  data.frame(mean = mean(x), lower_limit, upper_limit)
}
```

--

```r
ci(x)
```

```
##       mean lower_limit upper_limit
## 1 10.04045    9.632801    10.44809
```

---

## Other confidence intervals

<br>
<br>
<br>

.center[.large[95% confidence interval ✅]]

--

.center[.large[And if we want 90%?
Or 99%?]]

---

## Other confidence intervals

.pull-left[
<br>

<img src="slides_files/figure-html/unnamed-chunk-90-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Other confidence intervals

.pull-left[
<br>

<img src="slides_files/figure-html/unnamed-chunk-91-1.png" width="100%" style="display: block; margin: auto;" />
]

--

`$$z_{1 - \frac{1 - p}{2}}$$`

--

<br>

.center[For 95%:]

`$$1 - \frac{1 - 0.95}{2}$$`

--

`$$1 - \frac{0.05}{2}$$`

--

`$$1 - 0.025 = 0.975 = 97.5\%$$`

--

<br>

.center[For 99%:]

`$$1 - \frac{1 - 0.99}{2} = 0.995 = 99.5\%$$`

---

## Other confidence intervals

* How can we calculate this in R?

--

```r
qnorm(0.975)
```

```
## [1] 1.959964
```

--

```r
ci <- function(x, p = 0.95) {
  lower_limit <- mean(x) -
    qnorm(1 - ((1 - p) / 2)) * (sd(x) / sqrt(length(x)))
  upper_limit <- mean(x) +
    qnorm(1 - ((1 - p) / 2)) * (sd(x) / sqrt(length(x)))
  data.frame(mean = mean(x), lower_limit, upper_limit)
}
```

--

```r
ci(x)
```

```
##       mean lower_limit upper_limit
## 1 10.04045    9.632809    10.44808
```

---

## Other confidence intervals

* How can we calculate this in R?

```r
qnorm(0.975)
```

```
## [1] 1.959964
```

```r
ci <- function(x, p = 0.95) {
  lower_limit <- mean(x) -
    qnorm(1 - ((1 - p) / 2)) * (sd(x) / sqrt(length(x)))
  upper_limit <- mean(x) +
    qnorm(1 - ((1 - p) / 2)) * (sd(x) / sqrt(length(x)))
  data.frame(mean = mean(x), lower_limit, upper_limit)
}
```

```r
ci(x, 0.99)
```

```
##       mean lower_limit upper_limit
## 1 10.04045     9.50472    10.57617
```

---

## Confidence intervals in small samples

* Until now we have been using quantiles of the normal distribution

--

* In large samples, the sampling distribution of the mean tends to be normal

--

* But what if we have a small sample?
--

<br>

.center[.large[*t*-distribution]]

--

* Changes shape with sample size

--

* In very large samples, it approaches the shape of the normal distribution

---

## Confidence intervals in small samples

<img src="slides_files/figure-html/unnamed-chunk-98-1.png" width="60%" style="display: block; margin: auto;" />

---

## Confidence intervals in small samples

<img src="slides_files/figure-html/unnamed-chunk-99-1.png" width="60%" style="display: block; margin: auto;" />

---

## Confidence intervals in small samples

<img src="slides_files/figure-html/unnamed-chunk-100-1.png" width="60%" style="display: block; margin: auto;" />

---

## Confidence intervals in small samples

* And how do we calculate it in R?

```r
qt(p, df)
```

--

* Now, let's add this to our function

```r
ci <- function(x, p = 0.95) {
  lower_limit <- mean(x) -
    qt(1 - ((1 - p) / 2), length(x) - 1) * (sd(x) / sqrt(length(x)))
  upper_limit <- mean(x) +
    qt(1 - ((1 - p) / 2), length(x) - 1) * (sd(x) / sqrt(length(x)))
  data.frame(mean = mean(x), lower_limit, upper_limit)
}
```

--

```r
ci(x)
```

```
##       mean lower_limit upper_limit
## 1 10.04045    9.627765    10.45313
```

---

## Interpreting confidence intervals

<br>
<br>

`$$\bar{X} \pm t_{df} \cdot \sigma_{\bar{X}}$$`

---

## Interpreting confidence intervals

<br>
<br>

`$$\bar{X} \pm t_{df} \cdot \frac{s}{\sqrt{n}}$$`

--

* Depends on:

--

  * Dispersion within our sample

  * Sample size

---

## Interpreting confidence intervals

* Example 1: same sample size, different dispersion

--

```r
rnorm_fix <- function(n, mean, sd) mean + sd * scale(rnorm(n))
a <- rnorm_fix(20, mean = 10, sd = 2)
b <- rnorm_fix(20, mean = 10, sd = 4)
```

--

```r
ci(a)
```

```
##   mean lower_limit upper_limit
## 1   10    9.063971    10.93603
```

```r
ci(b)
```

```
##   mean lower_limit upper_limit
## 1   10    8.127942    11.87206
```

---

## Interpreting confidence intervals

* Example 2: different sample size, same dispersion

--

```r
c <- rnorm_fix(40, mean = 10, sd = 2)
d <- rnorm_fix(20, mean = 10, sd = 2)
```

--

```r
ci(c)
```

```
##   mean lower_limit upper_limit
## 1   10    9.360369    10.63963
```

```r
ci(d)
```

```
##   mean lower_limit upper_limit
## 1   10    9.063971    10.93603
```

---

## Visualizing confidence intervals

<img src="slides_files/figure-html/unnamed-chunk-108-1.png" width="60%" style="display: block; margin: auto;" />

---

## Visualizing confidence intervals

<img src="slides_files/figure-html/unnamed-chunk-109-1.png" width="60%" style="display: block; margin: auto;" />

---

## Visualizing confidence intervals

<img src="slides_files/figure-html/unnamed-chunk-110-1.png" width="60%" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Summary

---

## Central tendency

```r
mean(x)
median(x)
```

## Dispersion

```r
range(x)       # The min and max values
diff(range(x)) # The difference between max and min values
IQR(x)         # The difference between Q3 and Q1
quantile(x)    # The values of all quartiles
var(x)         # The sample variance
sd(x)          # The sample standard deviation
```

---

## Dispersion

```r
se <- function(x) sd(x) / sqrt(length(x))
```

```r
ci <- function(x, p = 0.95) {
  lower_limit <- mean(x) -
    qt(1 - ((1 - p) / 2), length(x) - 1) * (sd(x) / sqrt(length(x)))
  upper_limit <- mean(x) +
    qt(1 - ((1 - p) / 2), length(x) - 1) * (sd(x) / sqrt(length(x)))
  data.frame(mean = mean(x), lower_limit, upper_limit)
}
```

---
class: inverse, middle, center

# Exercises

---

## Exercises

* For these exercises, let's use the `starwars` dataset from the `dplyr` package.
--

```r
library(dplyr)
starwars
```

```
## # A tibble: 87 × 14
##    name   height  mass hair_color skin_color eye_color birth_year
##    <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl>
##  1 Luke …    172    77 blond      fair       blue            19  
##  2 C-3PO     167    75 <NA>       gold       yellow         112  
##  3 R2-D2      96    32 <NA>       white, bl… red             33  
##  4 Darth…    202   136 none       white      yellow          41.9
##  5 Leia …    150    49 brown      light      brown           19  
##  6 Owen …    178   120 brown, gr… light      blue            52  
##  7 Beru …    165    75 brown      light      blue            47  
##  8 R5-D4      97    32 <NA>       white, red red             NA  
##  9 Biggs…    183    84 black      light      brown           24  
## 10 Obi-W…    182    77 auburn, w… fair       blue-gray       57  
## # … with 77 more rows, and 7 more variables: sex <chr>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```

---

## Exercises

`1`. Calculate the mean height in this dataset

--

* For this, we need to access the `height` column in the `starwars` dataset.

```r
starwars$height
```

--

* And then use it to calculate the mean

```r
mean(starwars$height)
```

--

```
## [1] NA
```

--

<br>

.center[.large[What happened?]]

---

## Exercises

`1`. Calculate the mean height in this dataset

```r
starwars$height
```

```
##  [1] 172 167  96 202 150 178 165  97 183 182 188 180 228 180 173 175
## [17] 170 180  66 170 183 200 190 177 175 180 150  NA  88 160 193 191
## [33] 170 196 224 206 183 137 112 183 163 175 180 178  94 122 163 188
## [49] 198 196 171 184 188 264 188 196 185 157 183 183 170 166 165 193
## [65] 191 183 168 198 229 213 167  79  96 193 191 178 216 234 188 178
## [81] 206  NA  NA  NA  NA  NA 165
```

--

```r
mean(starwars$height, na.rm = TRUE)
```

```
## [1] 174.358
```

---

## Exercises

`2`. Calculate summary statistics for the `mass`

--

* Mean or Median?

--

* Let's plot a histogram

--

.pull-left[
```r
library(ggplot2)
ggplot(data = starwars)
```
]

--

.pull-right[
<img src="slides_files/figure-html/hist1-1.png" width="2800" />
]

---

## Exercises

`2`. Calculate summary statistics for the `mass`

* Mean or Median?
* Let's plot a histogram

.pull-left[
```r
library(ggplot2)
ggplot(data = starwars) +
  geom_histogram(aes(x = mass))
```
]

.pull-right[
<img src="slides_files/figure-html/hist1-1.png" width="2800" />
]

---

## Exercises

`2`. Calculate summary statistics for the `mass`

* Mean or Median?

* Let's plot a histogram

.pull-left[
```r
library(ggplot2)
ggplot(data = starwars) +
  geom_histogram(aes(x = mass))
```
]

.pull-right[
<img src="slides_files/figure-html/hist2-1.png" width="2800" />
]

---

## Exercises

`2`. Calculate summary statistics for the `mass`

```r
median(starwars$mass, na.rm = TRUE)
```

```
## [1] 79
```

--

```r
quantile(starwars$mass, na.rm = TRUE)
```

```
##     0%    25%    50%    75%   100% 
##   15.0   55.6   79.0   84.5 1358.0 
```

---

## Exercises

`2`. Calculate summary statistics for the `mass`

* We can format it to better present the values:

--

.center[Median (Q1-Q3)]

--

```r
paste0(
  median(starwars$mass, na.rm = TRUE),
  " (",
  quantile(starwars$mass, na.rm = TRUE)[2],
  "-",
  quantile(starwars$mass, na.rm = TRUE)[4],
  ")"
)
```

```
## [1] "79 (55.6-84.5)"
```

---

## Exercises

`3`. Calculate summary statistics for the `mass`

---
class: inverse, middle, center
background-image: url("./figs/logo.png")
background-position: 7% 2%
background-size: 120px 70px

.enormous[Thank you!]
<br>

Slides at:

<img src="./figs/qrcode.png" width="20%" style="display: block; margin: auto;" />

GitHub: @verasls | Twitter: @verasls | Email: lucasdsveras@gmail.com
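
---

## Appendix: checking the 95% coverage

The confidence interval slides say that the limits are where we believe the true mean lies. One way to make that concrete is a small simulation: draw many samples from a population whose mean we know, build a 95% interval for each, and count how often the interval contains the true mean. This is only a sketch under stated assumptions: the seed, `mu`, `n_sim`, and the sample size of 20 are arbitrary illustrative choices, and `ci()` is rewritten here in an equivalent form so the chunk runs on its own.

```r
# t-based confidence interval, equivalent to the ci() from the Summary slides
ci <- function(x, p = 0.95) {
  se <- sd(x) / sqrt(length(x))                    # standard error of the mean
  t_crit <- qt(1 - ((1 - p) / 2), length(x) - 1)   # t quantile with n - 1 df
  data.frame(
    mean = mean(x),
    lower_limit = mean(x) - t_crit * se,
    upper_limit = mean(x) + t_crit * se
  )
}

set.seed(42)   # arbitrary seed, only for reproducibility
mu <- 10       # assumed "true" population mean
n_sim <- 1000  # number of simulated samples (illustrative choice)

# For each simulated sample, check whether the interval contains mu
covered <- replicate(n_sim, {
  x <- rnorm(20, mean = mu, sd = 2)
  limits <- ci(x)
  limits$lower_limit <= mu && mu <= limits$upper_limit
})

# Proportion of intervals that contain the true mean
coverage <- mean(covered)
coverage
```

If the intervals behave as intended, `coverage` should land close to 0.95. Changing `p` in the call to `ci()` should move it accordingly.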