2022-09-22
Discrete Random Variables and Probability Mass Functions
[PSDR] Definition 3.3
A discrete random variable is a random variable that takes integer values. A discrete random variable is characterized by its probability mass function (PMF). The PMF \(p\) of a random variable \(X\) is given by \[p(x) = P (X=x).\]
[PSDR] Theorem 3.1
Let \(p\) be the probability mass function of \(X\).
The Bernoulli Probability Mass Function
Let \(X\) be a Bernoulli random variable with PMF given as
\[P(X=x; p) = p^x (1-p)^{1-x}.\]
[PSDR] Definition 3.15
A Bernoulli trial is an experiment that can result in two outcomes, which we will denote as “success” and “failure.” The probability of a success will be denoted \(p\), and the probability of failure is therefore \(1−p\).
The Binomial Probability Mass Function
[PSDR] Definition 3.17
A random variable \(X\) is said to be a binomial random variable with parameters \(n\) and \(p\) if
\[P(X = x; n, p) = \binom{n}{x} p^x (1-p)^{(n-x)} \hspace{5px} \text{ for } x = 0,1,2,\cdots, n\]
where \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\). We will sometimes write \(X \sim Binom(n,p)\).
[PSDR] Theorem 3.3
If \(X\) counts the number of successes in \(n\) independent and identically distributed (iid) Bernoulli trials, each with probability of success \(p\), then \(X \sim Binom(n,p)\).
The following example scenarios CAN be modeled using a Binomial PMF because the sampling method is independent and the there are only two outcomes.
A player rolls a pair of fair dice 10 times. The number \(X\) of 7s rolled is recorded.
In a class of 30 students, 55% are female. The instructor randomly selects 4 students The number \(X\) of females selected is recorded.
Let \(X\) be the certain number of credit card transactions that are fraudulent.
Testing a newly developed drug on whether it is effective or not. Let \(X\) be the number of patients who are cured by the drug.
Let \(X\) be the number of spam emails. We want to know the probability of getting a spam email or non-spam email in a given day.
What are the chances that you encounter bots on Twitter?
Twitter bots are complicated but let’s simplify. Say it is known that 5% of all Tweets are spam bots or tweets from a fake account.
If you encounter 40 tweets in a given day, we can compute the probability that a certain number of those tweets are bots using a Binomial PMF. Here, we assume that each tweet encounter is independent and identically distributed Bernoulli trial.
Let \(X\) be a binomial random variable with probability of success \(p = 0.05\) and number of independent trials \(n = 40\).
\[\begin{align*} P(X = 3) = & \binom{40}{3} (0.05)^3 (1-0.05)^{(40-3)} \\ = & 0.1851 \\ \end{align*}\]
That is 18.51% chance of encountering three bot tweets out of 40 tweets.
You can use the dbinom
function in R to compute the
Binomial PMF.
## [1] 0.1851145
\[\begin{align*} P(X \le 3) = & \sum_{x = 0}^3 \binom{40}{x} (0.05)^x (1-0.05)^{(40-x)} \\ & - \binom{40}{0} (0.05)^0 (1-0.05)^{(40-0)} \\ & = 0.7333 \end{align*}\]
The summation term above is what’s known as the Cumulative Density Function (CDF).
That is 73.33% chance of encountering at most three bot tweets out of 40 tweets.
You can use the pbinom
function in R to compute the
Binomial CDF.
## [1] 0.7333381
[PSDR] Definition 4.4
The cumulative distribution function (cdf) associated with \(X\) is the function
\[F(x)= P (X ≤ x).\]
For a discrete random variable \(X\) with PMF \(P(X = x)\), the the above is written as
\[F(x) = \sum_{n=-\infty}^x P(X = n).\]
Let \(X \sim Bernoulli(p)\) where \(p\) is the probability of success.
\[P(X=x; p) = p^x (1-p)^{1-x}.\]
\[\begin{align*} F(x) = & P(X \le x; p) \\ = & \sum_{n=0}^x p^x (1-p)^{1-x} \\ \end{align*}\]
\[ F(x) = \begin{cases} 0 & \text{ if } x < 0 \\ 1-p & \text{ if } 0 \le x < 1 \\ 1 & \text{ if } x \ge 1 \end{cases} \]
Let \(X \sim Binomial(np)\) where \(p\) is the probability of success and \(n\) be the number of trials \(n\).
\[P(X = x; n, p) = \binom{n}{x} p^x (1-p)^{(n-x)} \hspace{5px} \text{ for } x = 0,1,2,\cdots, n\]
Use dbinom
in R.
\[\begin{align*} F(x) = & P(X \le x; p) \\ = & \sum_{k=0}^x \binom{n}{k} p^k (1-p)^{(n-k)} \\ \end{align*}\]
Use pbinom
in R.
Now, let’s ask a slightly different question from the questions in Examples 1 and 2.
Consider a scenario where you look at multiple tweets. You don’t know how many tweets you want to look at. What is the probability of observing a 1st bot on the 3rd tweet that you see?
Again, it is known that 5% of all Tweets are spam bots or tweets from a fake account.
\[\begin{align*} P(\text{1 bot tweet until 1st tweet}) & = 0.05 = \frac{1}{20} \\ P(\text{2 bot tweets until 1st tweet}) & = \frac{19}{20}\frac{1}{20} \\ P(\text{3 bot tweets until 1st tweet}) & = \frac{19}{20}\frac{19}{20}\frac{1}{20} = \frac{19^2}{20^3} \end{align*}\]
The answer is \(P(\text{3 bot tweets until 1st tweet}) = 0.045\). There is a 4.50% probability of observing a bot tweet in the 3rd tweet that you see. In other words, the probability of observing failures in the 1st and 2nd tweet that you see, and a success in the 3rd tweet.
What is the probability of observing a 1st bot tweet on \((x+1)\)-th tweet that you see?
\[\begin{align*} P(\text{1 bot tweet until 1st tweet}) & = 0.05 = \frac{1}{20} \\ P(\text{2 bot tweets until 1st tweet}) & = \frac{19}{20}\frac{1}{20} \\ P(\text{3 bot tweets until 1st tweet}) & = \frac{19}{20}\frac{19}{20}\frac{1}{20} = \frac{19^2}{20^3} \cdots & \\ P(\text{x bot tweets until 1st tweet}) & = \left(\frac{19}{20}\right)^{x-1} \frac{1}{20} \end{align*}\]
In this example, the Geometric PMF is given as
\[P(X = x) = \left(\frac{19}{20}\right)^{x-1} \left(\frac{1}{20}\right)\]
where \(X\) is a Geometric random variable, \(x\) is the number of failures because the \((x+1)\)-th iteration is the first success with probability \(p = \frac{1}{20}\).
The geometric PMF models the probability of the number of failures can you have before getting a 1st success. In this example, observing a bot tweet is a “success”.
[PSDR] Definition 3.22
A random variable \(X\) is said to be a geometric random variable with parameter \(p\) if
\[P(X = x) = (1-p)^{x-1} p, \hspace{20px} x = 1, 2, 3, \cdots\]
[PSDR] Theorem 3.5
Let \(X\) be the random variable that counts the number of failures before the first success in a Bernoulli process with probability of success \(p\). Then \(X\) is a geometric random variable.
Notice that the sample space is now infinite!
Let \(X \sim Geometric(p)\) where \(p\) is the probability of success.
\[P(X = x; p) = (1-p)^{x-1} p, \hspace{20px} x = 1, 2, 3, \cdots\]
Use dgeom
in R. Note that the input for
dgeom
is \((x-1)\). So if
\(x=3\), then you need
dgeom(3-1)
.
\[\begin{align*} F(x) = & P(X \le x; p) \\ = & \sum_{k=0}^x (1-p)^k p \\ \end{align*}\]
\[ F(x) = \begin{cases} 1-(1-p)^{x} & \text{ if } x \ge 1 \\ 0 & \text{ if } x < 1 \\ \end{cases} \]
Use pgeom
in R.
In your mini-assignment today, you are going to verify that the Geometric PMF is a valid PMF and the form of the Geometric CDF closed form is \(F(x)\), shown above.