Statistics Question Problems and Solutions

Total Probability and Bayes' Rule

Problem Statement:

Let's say you're playing a game where someone rolls a six-sided die and then reports the number rolled. The opponent may lie, and if you can guess when he lies, you win. Since this game is for money, you've done your due diligence and collected the following data:

  • The other player lies 20% of the time
  • The die was cooked in an oven and is weighted to land with a 6 up, giving you the following probabilities:
    • $P(R=1)=.05$
    • $P(R=2)=.20$
    • $P(R=3)=.20$
    • $P(R=4)=.15$
    • $P(R=5)=.15$
    • $P(R=6)=.25$

Alright, it's game day, you're sitting at the table, and sweat is running down your brow. You've placed 25,000 dollars on this bet because you suffer from a gambling addiction. The opponent rolls the die, looks up, and says, "3." What is the probability that the roll is really 3?

Solution:

Be careful - it may be tempting to assume that the answer is 80%, because the opponent tells the truth 80% of the time. But that is not correct - we must also consider the probability that a 3 would be rolled at all. (Pro tip for students: if information is given on a test, you probably need to use it.) It is valuable to first draw a diagram to understand the probability distribution:

The probabilities in the diagram follow from the probability of a lie. If a 3 was rolled, he reports a 3 only when he tells the truth, which happens with probability 0.8. You have no information on how he lies, so assume that he picks among the other five values with equal probability. The 20% is therefore divided evenly across the five false answers, giving 0.04 each. The diagram will help us answer the question, which can be stated mathematically as:

$$P(R_P=3|R_R=3)$$

Here $R_P$ is the potential (actual) roll and $R_R$ is the reported roll. To get some guidance on what to do next, let's put this in Bayes' theorem form:

$$P(R_P=3|R_R=3) = \frac{P(R_R=3|R_P=3)P(R_P=3)}{P(R_R=3)}$$

The first term, $P(R_R=3|R_P=3)$, can be found from our little diagram. Read it as, "Given the potential roll was a 3, what is the probability the reported roll will be a 3?" Just look at the link between $R_P=3$ and $R_R=3$ to see that the probability is 0.8.

The second term, $P(R_P=3)$, can be found from the problem statement. This is just the probability that a 3 would be rolled. Therefore this term is 0.2.

The bottom term is a little vexing when you think about it. What is the probability that the reported value, $R_R$, is 3? This is where the law of total probability comes in: we sum, over every possible roll $R_P$, the probability of that roll multiplied by the probability that it gets reported as a 3 (its connection in the diagram).

$$P(R_R=3)=P(R_R=3|R_P=1)*P(R_P=1)+P(R_R=3|R_P=2)*P(R_P=2)+P(R_R=3|R_P=3)*P(R_P=3)+P(R_R=3|R_P=4)*P(R_P=4)+P(R_R=3|R_P=5)*P(R_P=5)+P(R_R=3|R_P=6)*P(R_P=6)$$

Substituting values we get:

$$P(R_R=3)=0.04*0.05+0.04*0.20+0.80*0.20+0.04*0.15+0.04*0.15+0.04*0.25=0.192$$

There we go. Now we just need to plug the values into Bayes' theorem:

$$P(R_P=3|R_R=3) = \frac{0.8*0.2}{0.192}=0.8333$$
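The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the names `prior`, `likelihood`, and `posterior_truth` are made up for illustration, and it bakes in the same assumption as the diagram, namely that a lie is spread evenly over the five wrong faces.

```python
# Prior probabilities of each face, taken from the problem statement.
prior = {1: 0.05, 2: 0.20, 3: 0.20, 4: 0.15, 5: 0.15, 6: 0.25}

P_LIE = 0.20             # the opponent lies 20% of the time
P_EACH_LIE = P_LIE / 5   # assumption: a lie is equally likely to name any of the 5 wrong faces


def likelihood(reported, actual):
    """P(R_R = reported | R_P = actual)."""
    return 1 - P_LIE if reported == actual else P_EACH_LIE


def posterior_truth(reported):
    """P(R_P = reported | R_R = reported), via Bayes' rule."""
    # Law of total probability gives the denominator P(R_R = reported).
    evidence = sum(likelihood(reported, actual) * p for actual, p in prior.items())
    return likelihood(reported, reported) * prior[reported] / evidence


print(round(posterior_truth(3), 4))  # 0.8333
```

Calling `posterior_truth` with a different reported number shows how strongly the answer depends on the die weights, not just on how often he lies.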

So what does this number mean? In our equation, we were asking, "What is the probability that the opponent reported exactly what was rolled?" There is an 83% chance that he is telling the truth. So this guy's a moron! If he had lied more often while you were stalking him, you would have had much less certainty. Or perhaps your data is flawed because your opponent knew you were watching, so he acted differently. Bottom line: don't gamble, kids - invest in a sturdy mutual fund.

All joking aside, you can do a quick sanity check before letting this be your final answer. If the ONLY information you had was that he lies 20% of the time, you would say he speaks the truth 80% of the time. Conditioning on the die weights moves that number: it goes up when the reported value is more likely than it would be on a fair die (here $P(R=3)=0.20>1/6$) and down when it is less likely. Since a 3 is one of the more likely rolls, the answer should come out larger than 0.8; if it were smaller, we would have a problem.

Discrete Random Variables

Problem Statement:

You won your previous bet thanks to your know-how of statistics and you're feeling pretty confident. So naturally you go to the blackjack tables. On the first hand you're dealt a 10 and then a 5. If you hit (receive another card), what is the probability that you will bust (surpass 21)? For those not familiar with blackjack, you'll need to know the following:

  • Face cards are all worth 10
  • Aces can be 1 or 11. But selecting an 11 in this case would result in a bust
  • Assume you have no knowledge of the cards left in the deck (counting cards is a good way to get the crap beat out of you in a back alley by casino staff)

Solution:

This is a fairly simple computation problem. All that must be done is add up the probabilities of each card that will put you past 21: namely 7, 8, 9, 10, and the face cards (J, Q, K). There are 13 ranks in total (A, 2-10, J, Q, K), each equally likely if you know nothing about the deck. So 7 out of 13 ranks will bust you, or about 53.8%. So there you go - if you ever have a hand of 15 in blackjack, you walk a very fine line between a high hand and busting, with the odds tilted slightly toward the bust.
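Counting the ranks by hand is easy to get wrong, so here is a small Python sketch that enumerates them instead. The names `RANK_VALUES` and `bust_probability` are illustrative, and the code assumes every rank is equally likely (no card counting) and that the ace is played as 1.

```python
from fractions import Fraction

# Value of each of the 13 ranks; the ace counts as 1 here because 11 would bust a 15.
RANK_VALUES = {"A": 1, "J": 10, "Q": 10, "K": 10}
RANK_VALUES.update({str(n): n for n in range(2, 11)})


def bust_probability(hand_total):
    """Probability that one more card pushes hand_total past 21."""
    busting = [rank for rank, value in RANK_VALUES.items() if hand_total + value > 21]
    return Fraction(len(busting), len(RANK_VALUES))


p = bust_probability(15)
print(p, round(float(p), 3))  # 7/13 0.538
```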

Continuous Random Variables

Problem Statement 1:

Let's say you just got done cooking some food to bring to a pot luck. You had to cook the food in a pot at 200C. You turn the pot off and run out the door. You'll be gone for exactly 20 minutes. And you know that your cat always jumps up on the counter and walks around once while you're gone. What is the probability that your cat will burn her cute little kitty paws? Some things you should know:

  • The temperature follows a decaying exponential, where $T$ is the temperature (in °C) and $t$ is time (in minutes): $$T=(200-27)e^{-0.25t}+27$$
  • Burns can be caused at 44C or higher.

Solution 1:

The time at which the cat jumps on the stove is a uniform random variable ranging from $t=0$ to $t=20$. And here's the point where I realize this problem is too simple. All I have to do is find the amount of time during which the cat could get burned and divide it by the total amount of time. Back-calculate from the given temperature equation to find the time $t$ at which $T=44$. Using Sage Math:

f(t)=(200-27)*e^(-.25*t)+27
solve(f(t)==44,t)[0].rhs().n()
plot(f(t),(t,0,50),ymin=0)

We find that the stove becomes safe to walk on at 9.28 minutes. That means the probability is simply $9.28/20$, which equates to a 46.4% chance. Oh no, that's far too unsafe! Don't leave your cat unattended in these conditions - the potluck can wait. As a quick sanity check, analyze the plot of the function.
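For readers without Sage, the same back-calculation can be done in closed form with plain Python. This sketch assumes the decay constant of 0.25 per minute used in the Sage snippet; the constant and function names are made up for illustration.

```python
import math

T_START, T_ROOM, T_BURN = 200.0, 27.0, 44.0  # degrees C, from the problem statement
RATE = 0.25    # assumed decay constant per minute, as in the Sage snippet
AWAY = 20.0    # minutes you are gone


def temperature(t):
    """Pot temperature in degrees C after t minutes."""
    return (T_START - T_ROOM) * math.exp(-RATE * t) + T_ROOM


# Solve temperature(t) == T_BURN for t by taking logarithms.
t_safe = math.log((T_START - T_ROOM) / (T_BURN - T_ROOM)) / RATE

# The jump time is uniform on [0, AWAY], so the probability is a ratio of lengths.
p_burn = min(t_safe, AWAY) / AWAY

print(round(t_safe, 2), round(p_burn, 3))  # 9.28 0.464
```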

That said, this problem was written on the fly, and was kind of boring. So let's come up with one akin to this with a bit more complexity.

Problem Statement 2:

You've set up a hidden camera and have observed that your cat only jumps up on the counter near when she expects you to arrive. You've found the pdf to be:

$$P_J(t)=k(e^{0.1t}-1)$$

$k$ is a constant that will depend on the amount of time you are gone. Given the scenario from the previous problem, what is the probability your cat might burn her little paws?

Solution 2:

First let us define $P_J(t)$ for this specific problem. Because a PDF must integrate to 1 over its range, we can solve for $k$ using Sage Math:

var('k')
PJ(t)=k*(e^(.1*t)-1)
Knew=solve(integral(PJ(t),(t,0,20))==1,k)[0].rhs().n()

$k$ is found to be 0.0228 for this problem. As a quick sanity check, validate that the PDF, with the value of $k$ substituted in, integrates to 1 over the range.

integrate(PJ(t,k=Knew),(t,0,20))

Now that we have the PDF, we can find the probability that the cat may get burned. To do this, integrate the PDF from $t=0$ to $t=9.28$ (this value for $t$ was found in the previous problem):

integrate(PJ(t,k=Knew),(t,0,9.28))

The probability is found to be 13.7%. Phew, that is a lot more comfortable than the previous result.
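Since the PDF has a simple antiderivative, the numbers can also be checked by hand or in plain Python. The sketch below is not the Sage workflow above, just a cross-check; `area` is an illustrative helper holding the integral of $e^{0.1s}-1$ from 0 to $t$.

```python
import math

A = 0.1        # growth rate in the exponent of the PDF
AWAY = 20.0    # minutes you are gone
T_SAFE = 9.28  # minutes until the pot is safe, from the previous problem


def area(t):
    """Integral of (e^(A*s) - 1) ds from 0 to t, without the constant k."""
    return (math.exp(A * t) - 1) / A - t


k = 1 / area(AWAY)          # normalise so the PDF integrates to 1 over [0, AWAY]
p_burn = k * area(T_SAFE)   # probability the jump happens while the pot is still hot

print(round(k, 4), round(p_burn, 3))  # 0.0228 0.137
```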

Problem Statement 3:

Your wife still doesn't like those odds and wants to train the kitty to stop jumping on the counter entirely. So let's say she's going to hide in the bathroom while you leave (once again for 20 minutes). Your wife only has one shot to catch your cat in the act (Otherwise your cat will obviously know she's there). When is the best time for your wife to try and catch your cat in the act?

Solution 3:

To answer this, we calculate the expected value of the time at which she will jump, $E[j]$. The expected value of a continuous random variable is given by the following equation:

$$E[x]=\int^{\infty}_{-\infty}xf_X(x)dx$$

So for our application:

$$E[j]=\int^{20}_{0}tP_J(t)dt$$

Using some of the Sage Math work from before, we can find this value:

integrate(t*PJ(t,k=Knew),(t,0,20)).n()

And this value is 14.56. So at $t=14.56$ your wife should check on the kitty. When you really think about it, however, this may not be the best answer to the question as originally asked. Should she check at the expected value, or at $t=20$, where the PDF is at its highest?
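To put numbers on that question, here is a rough Python sketch that recomputes the expected value numerically and compares the PDF at the mean against the PDF at $t=20$. The midpoint-rule helper `integrate` is a hand-rolled stand-in for Sage's integrator, not a library call.

```python
import math

K = 1 / (10 * (math.exp(2) - 1) - 20)  # normalising constant for a 20-minute absence


def pdf(t):
    """P_J(t) = K * (e^(0.1 t) - 1) on [0, 20]."""
    return K * (math.exp(0.1 * t) - 1)


def integrate(f, a, b, n=100_000):
    """Composite midpoint rule with n equal slices."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


mean_jump = integrate(lambda t: t * pdf(t), 0, 20)

print(round(mean_jump, 2))        # 14.56
print(pdf(mean_jump) < pdf(20))   # True: the density keeps rising until t = 20
```

The density at $t=20$ is higher than at the mean, which is exactly why the expected value and the most likely moment give different answers here.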

Gaussian Random Variables

Problem Statement:

Suppose you have an electrode generating an arc used for welding. The arc will contact the target substrate on the x axis according to a Gaussian random variable with a mean of 0 (the position is relative to the electrode). You have found a clever way to control the standard deviation of the arc by introducing a solenoid to "guide" the arc. Adjusting the current through the solenoid will proportionally decrease the standard deviation according to the simple equation:

$$\sigma=10-0.1I$$

A few things to note about this equation: if there is no current through the solenoid, the standard deviation is 10mm. If the current is 100A, then the standard deviation is 0 (which is not possible). So this model is slightly broken, but it will still suit this question.

You want 95% of the arc pulses to fall within a range of -0.5mm to 0.5mm, and want to use as little current as possible (to avoid paying for a high-current supply, and to minimize the heat generated by the solenoid). How much current do you need?

Solution:

Let us jump right to the CDF of a zero-mean Gaussian (found from Wikipedia):

$$F_X(x)=\frac{1}{2}\left[1+\operatorname{erf}\left(\frac{x}{\sigma\sqrt{2}}\right)\right]$$

Now we can set up an equation to solve for $\sigma$

$$0.95=F_X(0.5)-F_X(-0.5)$$

This can be solved in Sage Math

var('s',latex_name='\\sigma')
FX(x)=0.5*(1+erf(x/(s*sqrt(2))))
solve(0.95==FX(.5)-FX(-.5),s)

And the solution is:

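$$\sigma=\frac{\sqrt{2}}{4\operatorname{erf}^{-1}\left(\frac{19}{20}\right)}$$

By default, Sage does not have an inverse error function, so use the following script to find a numerical approximation. We find that our standard deviation is 0.255mm, and rearranging $\sigma=10-0.1I$ gives a current of $I=(10-0.255)/0.1\approx 97.4$A: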

import scipy.special as st
n(sqrt(2)/(4*st.erfinv(19/20)))

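Now how can we be sure that this is really correct? Let's run a quick simulation in GNU Octave. We'll run 10,000,000 trials, and then find what fraction of the values fall between -0.5 and 0.5.

X = normrnd(0, 0.255106728462327, 1, 1E7);  % 1E7 zero-mean samples (normrnd comes from the statistics package)
mean(-0.5 <= X & X <= 0.5)                  % fraction inside the band; no semicolon, so the value prints
hist(X, 30)                                 % 30-bin histogram of the simulated arc positions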

We get a value of 0.95003. Running the test a few more times, it stays pretty close to 0.95, so we know we didn't just get lucky. This script also generates a plot:

It's pretty easy to convince yourself with this plot that 95% of the samples are between the limits of -0.5 and 0.5.
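As one more cross-check that needs neither Sage nor Octave, the whole chain from tolerance to current can be sketched with Python's standard library. The variable names are illustrative, and the last step simply rearranges the (admittedly slightly broken) model $\sigma=10-0.1I$ to solve for $I$.

```python
from statistics import NormalDist

HALF_WIDTH = 0.5   # mm, half of the -0.5mm to 0.5mm tolerance band
COVERAGE = 0.95    # fraction of arc pulses that must land inside the band

# For a zero-mean Gaussian, 95% coverage of [-w, w] means the CDF at w is 0.975.
z = NormalDist().inv_cdf(0.5 + COVERAGE / 2)
sigma = HALF_WIDTH / z

# Rearranged from sigma = 10 - 0.1 * I.
current = (10 - sigma) / 0.1

# Confirm the coverage with the exact CDF rather than a simulation.
arc = NormalDist(mu=0, sigma=sigma)
achieved = arc.cdf(HALF_WIDTH) - arc.cdf(-HALF_WIDTH)

print(round(sigma, 4), round(current, 1), round(achieved, 4))  # 0.2551 97.4 0.95
```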