Introduction

class: inverse, left, bottom
background-image: url("./figs/logo.png")
background-position: 7% 2%
background-size: 120px 70px

### Statistics with R

#### Lesson 01: Introduction to statistics

---

## Today

---

# Introduction to statistics

---

## Population and sample

---

## Population and sample

---

## Population and sample

---

## Population and sample

---

## Frequency distributions

---

## Frequency distributions


---

# Measures of central tendency and dispersion

---

# Central tendency

---

## Mode

* Is the most frequent value

---

## Mode

* Is the most frequent value

* A dataset can have two (bimodal) or more (multimodal) modes

---

## Mode

* Is the most frequent value

* A dataset can have two (bimodal) or more (multimodal) modes

* Hardly ever used
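
* Base R has no function for the statistical mode (`mode()` returns the storage type of an object), so here is a minimal sketch using `table()`. The helper name `stat_mode` and the vector `z` are made up for illustration.

```r
# Hypothetical helper: returns every value tied for the highest frequency
stat_mode <- function(v) {
  counts <- table(v)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])  # keep the most frequent value(s)
}

z <- c(2, 3, 3, 5, 7, 7, 9)  # illustrative data: 3 and 7 both appear twice
stat_mode(z)                 # returns 3 and 7, so z is bimodal
```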

---

## Mean

* Is the average score

`$$\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i$$`

* Example: let's take a vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```
--

`$$\sum_{i=1}^nX_i = 4 + 9 + 7 + 12 + 5 + 3 + 6 + 2 = 48$$`

`$$\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i = \frac{1}{8} \times 48 = 6$$`

```r
mean(x)
```

```
## [1] 6
```
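
* The formula can also be written out by hand; a small sketch that should agree with `mean(x)`:

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)

n <- length(x)  # number of scores
sum(x) / n      # sum of the scores divided by n, the same value as mean(x)
```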

---

## Mean

* Is influenced by extreme scores

```r
x <- c(x, 57)
x
```

```
## [1]  4  9  7 12  5  3  6  2 57
```

```r
mean(x)
```

```
## [1] 11.66667
```

---

## Median

* Is the middle score when the values are ranked in order of magnitude

* Example: let's take a vector `y`

```r
y <- c(3, 9, 12, 4, 1, 8, 6)
```

```r
sort(y)
```

```
## [1]  1  3  4  6  8  9 12
```

```r
median(y)
```

```
## [1] 6
```

---

## Median

* What if the vector has an even number of values?

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
sort(x)
```

```
## [1]  2  3  4  5  6  7  9 12
```

* In this case, the median is the average of the two "middle" values

```r
(5 + 6) / 2
```

```
## [1] 5.5
```

```r
median(x)
```

```
## [1] 5.5
```
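
* Both cases (odd and even length) can be sketched in one function. The name `manual_median` is made up for illustration; in practice `median()` is all you need.

```r
# Hypothetical helper that mirrors the two rules shown on the slides
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])  # even n: average of the two middle values
  }
}

y <- c(3, 9, 12, 4, 1, 8, 6)
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
manual_median(y)  # odd length, agrees with median(y)
manual_median(x)  # even length, agrees with median(x)
```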

---

## Median

* Is barely influenced by extreme scores (here it only moves from 5.5 to 6)

```r
x <- c(x, 57)
sort(x)
```

```
## [1]  2  3  4  5  6  7  9 12 57
```

```r
median(x)
```

```
## [1] 6
```

---

## Mean vs. Median


---

# Dispersion

---

## Range

* The difference between the highest and lowest values in a vector

* Example: let's take our old friend again, the vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

```r
range(x)
```

```
## [1]  2 12
```

```r
diff(range(x))
```

```
## [1] 10
```

* Is influenced by extreme scores

```r
x <- c(x, 57)
diff(range(x))
```

```
## [1] 55
```
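
* `diff(range(x))` is just the maximum minus the minimum. A small sketch, using a separate name (`x_out`, chosen here for illustration) so that `x` itself stays untouched:

```r
x_out <- c(4, 9, 7, 12, 5, 3, 6, 2, 57)  # the vector including the extreme score

max(x_out) - min(x_out)  # same value as diff(range(x_out))
```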

---

## Interquartile range

* Divide the sorted vector into 4 parts of equal size

* Cut off the top and bottom 25% of values

* Calculate the range of the middle 50%

* Can be represented in a boxplot

---

## Interquartile range

* Divide the sorted vector into 4 parts of equal size

* Cut off the top and bottom 25% of values

* Calculate the range of the middle 50%

* Can be represented in a boxplot

---

## Interquartile range

* To calculate it in R:

```r
IQR(x)
```

```
## [1] 5
```

* It returns Q3 - Q1

* If we want the values of the quartiles:

```r
quantile(x)
```

```
##   0%  25%  50%  75% 100% 
##    2    4    6    9   57
```

---

## Mean error

* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

---

## Mean error

* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `$\bar{X}$`

---

## Mean error

* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `$\bar{X}$`

* Red dashed line: `$e_i = X_i - \bar{X}$`

---

## Mean error

* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `$\bar{X}$`

* Red dashed line: `$e_i = X_i - \bar{X}$`

```r
e <- x - mean(x)
e
```

```
## [1] -2  3  1  6 -1 -3  0 -4
```

---

## Mean error

* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `$\bar{X}$`

* Red dashed line: `$e_i = X_i - \bar{X}$`

```r
e <- x - mean(x)
e
```

```
## [1] -2  3  1  6 -1 -3  0 -4
```

* We can then take the mean of the errors

---

## Mean error

* Plot of our good old vector `x`

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
```

* Horizontal line: `$\bar{X}$`

* Red dashed line: `$e_i = X_i - \bar{X}$`

```r
e <- x - mean(x)
e
```

```
## [1] -2  3  1  6 -1 -3  0 -4
```

* We can then take the mean of the errors

```r
mean(e)
```

```
## [1] 0
```
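
* The zero is no coincidence: positive and negative errors around the mean always cancel out. A small sketch that splits them apart:

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
e <- x - mean(x)

sum(e[e > 0])  # total of the errors above the mean
sum(e[e < 0])  # total of the errors below the mean, same size with opposite sign
sum(e)         # so the sum, and therefore the mean, of the errors is 0
```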

---

## Mean error

---

## Variance

* We can square the errors, which has two consequences:

---

## Variance

* We can square the errors, which has two consequences:

`1.` All error terms become non-negative

```r
e^2
```

```
## [1]  4  9  1 36  1  9  0 16
```

---

## Variance

* We can square the errors, which has two consequences:

`1.` All error terms become non-negative

```r
e^2
```

```
## [1]  4  9  1 36  1  9  0 16
```

`2.` Larger errors are penalized more heavily

---

## Variance

* We can square the errors, which has two consequences:

`1.` All error terms become non-negative

```r
e^2
```

```
## [1]  4  9  1 36  1  9  0 16
```

`2.` Larger errors are penalized more heavily
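
* One way to see the second consequence is to compare how much the largest error (6) weighs in the total before and after squaring. This is only an illustrative sketch:

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
e <- x - mean(x)

max(abs(e)) / sum(abs(e))  # share of the largest error among absolute errors: 6 / 20
max(e^2) / sum(e^2)        # share of the largest error among squared errors: 36 / 76
```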

---

## Variance

* Ok, now that we have squared the errors, how can we calculate the variance?

`$$s^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2$$`

---

## Variance

* Ok, now that we have squared the errors, how can we calculate the variance?

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$$`

* Why `$n-1$`?

* Dividing by `$n-1$` corrects the sample's tendency to underestimate the error in the population
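
* A sketch comparing both denominators on our vector `x`. R's built-in `var()` uses `n - 1`.

```r
x <- c(4, 9, 7, 12, 5, 3, 6, 2)
e <- x - mean(x)
n <- length(x)

sum(e^2)            # sum of squared errors
sum(e^2) / n        # dividing by n
sum(e^2) / (n - 1)  # dividing by n - 1
var(x)              # matches the n - 1 version
```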

---

## Degrees of freedom