---
title: "BIO708 Probability distributions in R"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output:
  html_document:
    toc: yes
    number_sections: yes
    keep_md: yes
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 3)
```
# Some basic functionality for probability in R
Working with probability distributions and density functions in R: the basics. At the bottom of this script is a list of most of the common probability distributions available in R (and which library they are in, if they are not part of base R).
## Some basic mathematical functions that come up when working with probability distributions (i.e. for "counting")
```{r, eval = FALSE}
factorial(x) # factorial, i.e. x*(x-1)*(x-2)*...*2*1
gamma(x) # gamma function, a generalization of the factorial. NOT THE GAMMA PROBABILITY DISTRIBUTION
choose(n, k) # n choose k, the number of ways to choose k items from n
beta(a, b) # beta function. NOT THE BETA PROBABILITY DISTRIBUTION
```
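A quick numeric check (the values 5 and 2 are arbitrary illustrative choices) showing how these counting functions relate to each other:

```{r}
# choose(n, k) equals n! / (k! * (n - k)!); verify for n = 5, k = 2
factorial(5) / (factorial(2) * factorial(3)) # 10
choose(5, 2)                                 # 10
gamma(6)                                     # gamma(x) = (x - 1)! for integer x, so 120
```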
## Standard distributions and their associated functions
For most of the standard distributions (normal, binomial, gamma, F, etc.) there are 4 functions (illustrated here with the Gaussian/normal):
`dnorm`, `pnorm`, `qnorm`, `rnorm`
### dnorm - the density function
If you want the height of the density function (or, for discrete distributions, the probability) at a particular value, given a particular set of parameter values (e.g. mean = 0, sd = 1), you would use dnorm.
What is the "height" of the density function (proportional to the probability on a small interval around that value) at zero, for a normal distribution with mean 0 and sd 1?
```{r}
dnorm(0, mean = 0, sd = 1) # or alternatively....
dnorm(0, 0, 1)
dnorm(2, 0, 1) # not surprisingly, the "relative" probability is smaller for a value of 2, for the same distribution as above
```
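For the normal, dnorm is just evaluating the Gaussian density formula $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / 2\sigma^2}$, which you can verify by hand:

```{r}
# For mu = 0, sigma = 1: f(0) = 1/sqrt(2*pi), f(2) = exp(-2)/sqrt(2*pi)
1 / sqrt(2 * pi)       # same as dnorm(0, 0, 1)
exp(-2) / sqrt(2 * pi) # same as dnorm(2, 0, 1)
```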
## We can look more thoroughly at the density function
```{r}
par(mfrow = c(2, 1))
curve(dnorm(x, mean = 0, sd = 1), from = -5, to = 10) # f(x)~ N(0,1)
curve(dnorm(x, mean = 5, sd = 1.5), from=-5, to=10) # f(x)~ N(5,1.5)
par(mfrow = c(1, 1))
```
## cumulative probability
Instead of a point (or interval) probability, what if we want to know the cumulative probability up to a particular value, with respect to a particular distribution? In other words, the cumulative distribution function $\Phi$. By default it gives the probability that a normally distributed random number (based on the distributional parameters you provide) will be less than a focal number (the q argument of the function).
What is the probability that the value is less than 2 for $f(x) \sim N(\mu = 0, \sigma = 1)$
```{r}
pnorm(2, mean = 0,sd = 1, lower.tail = T)
```
Thus the probability that an observation sampled from a distribution $f(x) \sim N(\mu = 0, \sigma = 1)$ is 2 or less is ~98%.
Or from the other side, the probability that an observation sampled from a distribution $f(x) \sim N(\mu = 0, \sigma = 1)$ is 2 or greater is:
```{r}
pnorm(2, 0, 1, lower.tail = F)
```
about 2.3%.
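Note that the two tails are complementary, so `lower.tail = FALSE` is equivalent to subtracting the lower-tail probability from 1:

```{r}
# The two tails sum to 1
pnorm(2, 0, 1, lower.tail = TRUE) + pnorm(2, 0, 1, lower.tail = FALSE) # 1
1 - pnorm(2, 0, 1) # same as pnorm(2, 0, 1, lower.tail = FALSE)
```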
## You use this kind of probability all the time, whenever you calculate a p-value
```{r}
sct_data <- read.csv("http://beaconcourse.pbworks.com/f/dll.csv",
h = T, stringsAsFactors = TRUE)
```
```{r}
model.1 <- lm(SCT ~ genotype, data = sct_data)
anova(model.1)
```
Instead, let's generate this p-value directly from the F distribution
```{r}
pf(q = 60.2, df1 = 1, df2 = 1946, lower.tail = F) # lower.tail=F since we want to know the probability of getting this value (60.2), or something more extreme.
```
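You can check this relationship yourself. Here is a self-contained sketch using simulated data (not the sct_data above; the variable names are illustrative): the p-value that anova() reports is exactly pf() applied to the observed F statistic.

```{r}
# Simulate a two-group dataset, fit a linear model, and recompute the p-value by hand
set.seed(2)
dat <- data.frame(g = rep(c("a", "b"), each = 50), y = rnorm(100))
tab <- anova(lm(y ~ g, data = dat))
p_anova  <- tab[1, "Pr(>F)"]
p_byhand <- pf(tab[1, "F value"], df1 = tab[1, "Df"], df2 = tab[2, "Df"],
               lower.tail = FALSE)
all.equal(p_anova, p_byhand) # TRUE
```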
## calculating probabilities in an interval
You can also use the "p"dist class of functions to calculate the probability of an interval for a continuous probability distribution. For instance, if you want to calculate the probability of observing a value between 10 and 15 from a normal distribution with mean = 10 and sd = 3.3:
```{r}
pnorm(15, 10, 3.3) - pnorm(10, 10, 3.3) # the difference between two lower-tail cumulative probabilities
```
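Since the interval probability is the integral of the density over that interval, numerically integrating dnorm gives the same answer:

```{r}
# Numerical integration of the density over [10, 15] agrees with the pnorm difference
integrate(dnorm, lower = 10, upper = 15, mean = 10, sd = 3.3)$value
pnorm(15, 10, 3.3) - pnorm(10, 10, 3.3)
```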
This gives you a sense of the cumulative probabilities.
```{r}
par(mfrow=c(1,1))
curve(dnorm(x), from = -5, to = 5, ylim = c(0, 1)) # f(x)~ N(0,1)
curve(pnorm(x), from=-5, to=5, add = T, col="red")
par(mfrow=c(1,1))
```
The inverse of pnorm, $\Phi^{-1}$, will provide the value of "x" corresponding to a given cumulative probability (the argument p) you input. i.e. at the 95th percentile of the distribution, what would the value of x be for a particular f(x)? This is called the *quantile* function (which is why it has "q" at the front).
So for $\sim N(\mu = 0, \sigma = 1)$, the value corresponding to the 95% quantile is
```{r}
qnorm(0.95, mean = 0, sd = 1, lower.tail = T) # 1.644854
```
Not surprisingly, if you put this value back into pnorm you will get back the probability (0.95)
```{r}
pnorm(1.644854, mean = 0, sd = 1)
```
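More generally, since qnorm and pnorm are inverse functions, a round trip recovers the original values:

```{r}
# qnorm(pnorm(x)) returns x for any x
x_vals <- c(-2, 0, 1.5)
qnorm(pnorm(x_vals)) # -2, 0, 1.5
```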
## Let's think about this from a "p-value" centric point of view (boo!)
If we knew we wanted to have an $\alpha = 0.05$ for an F distribution with df1 = 1 (numerator) and df2 = 49 (denominator), what would be the F value "threshold" for this alpha?
```{r}
qf(p=0.95, df1 = 1, df2 = 49, lower.tail=T)
```
Thus, if we are living in a *NHST world*, an F > 4.04 would lead us to reject the null hypothesis, since the probability of observing data at least this extreme given that the null hypothesis is true, $p(D|H_0)$, would be below 0.05.
The p"put your dist here" and q"put your dist here" functions are particularly useful for comparing observed values of test statistics against their theoretical distributions, for instance when checking whether a value is extreme relative to the null distribution.
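A related identity worth knowing: when df1 = 1, an F statistic is a squared t statistic, so the F threshold above is the square of the two-sided t threshold:

```{r}
# F(1, df) is the distribution of t(df)^2, so the thresholds match
qt(0.975, df = 49)^2        # two-sided t threshold, squared
qf(0.95, df1 = 1, df2 = 49) # same value
```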
## Practice
Compose examples for qf, pf, and df for the F distribution.
## generating random values from this distribution
Perhaps the most generally useful function, in my mind, is the random number generator for a particular distribution (good for simulations, power analyses, etc.).
```{r}
par(mfrow=c(1,1))
rran <- rnorm(500,0,1) # generate 500 random samples from the f(x)~N(0,1)
hist(rran, freq = F, xlim = c(-3, 3), ylim = c(0, 0.5)) # histogram of the 500 randomly generated samples
curve(dnorm(x,0,1), from = -3, to = 3, col = "red", add = T) # just to add the theoretical distribution onto the random samples!!!!
```
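As a quick sanity check on such random draws (the seed here is an arbitrary choice), the sample mean and sd should be close to the parameters you generated from:

```{r}
# Sample moments of 500 draws from N(0, 1) should be near 0 and 1
set.seed(708)
draws <- rnorm(500, mean = 0, sd = 1)
mean(draws) # close to 0
sd(draws)   # close to 1
```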
## assorted and sundry
### what does a bimodal normal look like
```{r}
par(mfrow=c(2,1))
curve(dnorm(x, 5, 1), 0, 15)
curve(dnorm(x, 6, 1), 0, 15, add = T, col = "red")
```
VS
```{r}
par(mfrow=c(2,1))
curve(dnorm(x, 5, 1), 0, 15)
curve(dnorm(x, 8, 1), 0, 15, add=T, col="red")
curve((dnorm(x, 8, 1) + dnorm(x, 5, 1)), 0, 15)
par(mfrow=c(1,1))
```
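As a side note, the raw sum of two densities plotted above integrates to 2, not 1; an equal-weight mixture (0.5 on each component) is a proper bimodal density:

```{r}
# An equal-weight mixture of the two normals has total area 1
mix <- function(x) 0.5 * dnorm(x, 5, 1) + 0.5 * dnorm(x, 8, 1)
integrate(mix, -Inf, Inf)$value # 1
curve(mix(x), 0, 15)
```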
### what happens to a t distribution as sample size increases? How does it compare to a normal distribution?
```{r}
par(mfrow = c(1,1))
curve(dt(x, 3), -5, 5, ylim=c(0, 0.5), lty = 2)
curve(dt(x, 20), -5, 5, col = "red", lty = 3, add = T)
curve(dt(x, 2100), -5, 5, col = "blue", lty = 2, lwd = 2, add = T)
curve(dnorm(x, 0, 1), -5, 5, col = "purple", lty = 3, lwd = 2, add = T)
```
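You can see the convergence numerically as well: as the degrees of freedom grow, the t density approaches the standard normal density.

```{r}
# The t density at 0 approaches dnorm(0) as df increases
dt(0, df = 3)
dt(0, df = 2100)
dnorm(0, 0, 1)
```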
## A couple of notes on using the r"dist" functions (i.e. rnorm)
There are often several equivalent ways of specifying a vector of random numbers.
For instance, given a vector x (we will use a common seed for the random numbers; set.seed is not necessary, I am just using it so that we pull the same set of random numbers each time):
```{r}
x <- seq(1:100)
set.seed(1)
```
Say we want to add some random variation to x (for a simulated regression problem, for instance). You could do this:
```{r}
y1 <- x + rnorm(length(x), 0, 3)
```
Here, for each value of x, we are adding a single random number from a normal distribution with mean 0 and sd = 3 to create the vector y1. If R were not a vectorized language, you would use a (for/while) loop to generate such a vector.
The other way to get an equivalent result (in this case the exact same result since I am setting the seed for the random number generator):
```{r}
set.seed(1)
y2 <- rnorm(length(x), x, 3)
all.equal(y1, y2)
```
While the results are equivalent, what we are doing is a little bit different. Here we are creating a vector y2 with length equal to that of x (length(x)), with a different mean for each element (the first value of x through the last), and all values of y2 have a standard deviation of 3. This is the vectorized version.
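For comparison, here is a sketch of the loop-based version alluded to above (the name y3 is illustrative); because rnorm draws from the same random number stream, it reproduces y1 exactly when the seed matches:

```{r}
# Non-vectorized version: one random draw per element of x
x <- 1:100
set.seed(1)
y3 <- numeric(length(x))
for (i in seq_along(x)) {
  y3[i] <- x[i] + rnorm(1, mean = 0, sd = 3)
}
set.seed(1)
all.equal(y3, x + rnorm(length(x), 0, 3)) # TRUE
```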
## Looking at distributions for discrete data
If you try something like
```{r, warning = FALSE}
curve(dpois(x,lambda = 11), 0, 20)
```
It just looks funky. Not surprising, since it is computing the value of the function at non-integer values, which make no sense for a Poisson.
**this is better**
```{r}
par(mfrow = c(2, 1))
x.discrete <- seq(0, 25, by = 1)
plot(dpois(x.discrete, lambda = 11),
type="h", lwd = 5,
ylab = "probability", xlab = "x values") # a bit better, but not awesome
barplot(dpois(x.discrete, lambda = 11),
ylab = "probability",
xlab = "x values") # My personal preference for this
par(mfrow = c(1, 1))
```
You can check that the probabilities sum to approximately 1:
```{r}
sum(dpois(x.discrete, lambda = 11))
```
### Some questions raised from previous lectures on probability
Why can the probabilities sum to greater than 1 for a probability density function, if the area under the curve = 1? i.e. why are the following not the same?
```{r}
a1 <- seq(1, 10, 0.1)
sum(dnorm(a1, 5, 1))
a2 <- seq(1, 10, 1)
sum(dnorm(a2, 5, 1))
```
Same distribution, sampled at different intervals. To turn these relative densities into probabilities that sum to 1, divide each density by the total we calculated above.
```{r}
dnorm(a2, 5, 1)/sum(dnorm(a2, 5, 1))
```
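Alternatively, multiplying each density by its grid spacing gives a Riemann-sum approximation of the area under the curve, which is close to 1 for either grid:

```{r}
# Density times interval width approximates area (probability)
a1 <- seq(1, 10, 0.1)
a2 <- seq(1, 10, 1)
sum(dnorm(a1, 5, 1)) * 0.1 # ~1
sum(dnorm(a2, 5, 1)) * 1   # ~1
```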
```{r}
x <- rnorm(100000, 5, 1) # random variables from N(5,1)
plot(density(x),
xlim = c(0,10),
ylim = c(0,0.5),
col = "blue", type ="p", lwd = 0.5)
hist(x, freq = F, add = T)
curve(dnorm(x, 5, 1), 0, 10,
add = T, col = "red", lwd = 2)
```
but wait.....
```{r}
z1 <- seq(0, 10, 1)
plot(dnorm(z1, 5, 1) ~ z1, col = "blue", lwd = 2)
curve(dnorm(x, 5, 1), 0, 10, add = T, col = "red", lwd = 2)
z2 <- seq(0, 10, 0.1)
plot(dnorm(z2, 5, 1) ~ z2, col = "blue", lwd = 2)
curve(dnorm(x, 5, 1), 0, 10, add = T, col = "red", lwd = 2)
# but
sum(dnorm(z1,5,1))
sum(dnorm(z2,5,1))
```
**Can you explain what R is doing?**
```{r}
z1.table <- cbind(z1, dnorm(z1,5,1))
z2.table <- cbind(z2, dnorm(z2,5,1))
```
take a look at the tables, can you see what R is doing.....
dnorm simply evaluates the density at each point, with no notion of the spacing between points. That is why summing over 10X the number of points gives ~10X the total; to approximate an area (a probability) you must multiply each density by the interval width.
### What histograms show in R
```{r}
rm(x)
x <-rnorm(10000, 5, 1)
crap <- hist(x, plot = F)
crap$counts
crap$counts/sum(crap$counts)
par(mfrow=c(3, 1))
hist(x, main= "The raw counts")
hist(x, freq=F, main="Scaled so that area under 'curve' = 1")
barplot(crap$counts/sum(crap$counts), names.arg = crap$mids, main="proportions = counts/N")
par(mfrow=c(1, 1))
```
### Below is more of a reference sheet to find distributions, and for conjugacy.
#### distributions (add your own d,p,q,r)
- norm - Normal/Gaussian (i.e. dnorm, rnorm)
- binom - Binomial (i.e dbinom, rbinom)
- pois - Poisson
- exp - Exponential (Special case of the gamma with alpha=1)
- gamma - Gamma
- nbinom - Negative binomial
- beta - Beta
- lnorm - log normal
- f - F distribution
- chisq - chi square distribution (special case of gamma with alpha=df/2 & beta=0.5 (rate in R))
- t - T (student) distributions
- weibull - Weibull
- cauchy - Cauchy
- geom - Geometric (Geometric is a special case of a Negative Binomial with alpha=1)
- hyper - Hypergeometric
- multinom - Multinomial
- logis - Logistic
- signrank - Distribution of the Wilcoxon Signed rank statistic
- wilcox - Wilcoxon Rank Sum Statistic
- unif - uniform distribution (A uniform on [0,1] is a special case of a beta(alpha=beta=1))
- mvrnorm - Multivariate Normal Distribution (MASS library)
- dirichlet - dirichlet distribution (gtools library)
- wish - Wishart distribution (MCMCpack library)
- iwish - Inverse Wishart (MCMCpack library). Also see IW in MCMCglmm
- invgamma - Inverse Gamma (MCMCpack). Also see igamma in pscl library.
- invchisq - (Scaled) Inverse Chi Squared distribution. You can also use iwish()
#### quick reference for conjugacy
As a reminder, with a conjugate prior, the prior and the posterior take the same form. So if we use the Poisson to model the likelihood, and use the gamma for the prior, the posterior will be gamma.
Likewise, if we use the binomial to model the likelihood $p(Data|model)$, and use the beta for a prior, then the posterior $p(model|data)$ will also be beta.
#### discrete likelihoods
Binomial likelihood - the beta distribution is the conjugate prior
Poisson likelihood - the gamma distribution is the conjugate prior
multinomial likelihood - the dirichlet distribution is the conjugate prior (dirichlet is the multivariate extension of the beta)
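A small numerical check of the beta-binomial case (the prior parameters a, b and the data k, n here are arbitrary illustrative choices): with a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a + k, b + n - k), which we can confirm against a grid approximation.

```{r}
# Compare the closed-form posterior to a grid (prior x likelihood) posterior
a <- 2; b <- 2; n <- 20; k <- 14
p_grid <- seq(0.001, 0.999, length.out = 999)
post_grid  <- dbeta(p_grid, a, b) * dbinom(k, n, p_grid)
post_grid  <- post_grid / sum(post_grid)          # normalize over the grid
post_exact <- dbeta(p_grid, a + k, b + n - k)
post_exact <- post_exact / sum(post_exact)
max(abs(post_grid - post_exact))                  # essentially 0
```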
#### Continuous likelihoods
Normal is the conjugate prior for the mean for the likelihood of a normal distribution with fixed variance.
Gamma is the conjugate prior for the INVERSE of the variance for a normal distribution.
Thus the inverse gamma is the conjugate prior of the variance for a normal distribution.
Gamma is the conjugate prior for the mean parameter for the Poisson.
#### Multivariate (continuous and discrete)
multinomial likelihood - the dirichlet distribution is the conjugate prior (dirichlet is the multivariate extension of the beta)
Wishart is the conjugate prior for the INVERSE covariance matrix (Wishart is the multivariate extension of the gamma)
Inverse Wishart is the conjugate prior for the covariance matrix.