🔮 Day 4 - Spell Solutions

🔮 Day 4 - Spell Solutions#

# Load our magical libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(infer)  # For rep_sample_n function
set.seed(123)  # For reproducible results

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Attaching package: ‘gridExtra’

The following object is masked from ‘package:dplyr’:

    combine

🔮 Day 4 - Spell 1: Distribution Detective Mission#

🎯 Mission Briefing#

Discover the magical secrets of mean, median, and mode through different creature power distributions!

💡 What are Distributions?#

Think of distributions like different ways data can be shaped - some are balanced (symmetrical), some lean to one side (skewed), and some have unusual outliers!

# Load our magical creature dataset
creatures <- read.csv("../datasets/magical_creatures_stats.csv")

# Let's peek at our data first
head(creatures)

# Separate each creature type for easier analysis
dragons <- creatures[creatures$creature_type == "dragon", ]
unicorns <- creatures[creatures$creature_type == "unicorn", ]
phoenixes <- creatures[creatures$creature_type == "phoenix", ]

cat("We have", nrow(dragons), "dragons,", nrow(unicorns), "unicorns, and", nrow(phoenixes), "phoenixes!")

A data.frame: 6 × 3
	creature_id	creature_type	power_level
	<int>	<chr>	<int>
1	1	dragon	71
2	2	dragon	42
3	3	dragon	55
4	4	dragon	59
5	5	dragon	56
6	6	dragon	48

We have 1000 dragons, 1000 unicorns, and 1000 phoenixes!

🐉 Investigation 1: Dragon Power (Normal Distribution)#

Dragons have very balanced power - most dragons are medium strength, with fewer very weak or very strong dragons.

# Create a histogram of dragon power
ggplot(dragons, aes(x = power_level)) +
  geom_histogram(fill ="#a4ce89", color = "white", alpha = 0.7, bins = 25) +
  labs(title = "Dragon Power Distribution (Normal/Symmetrical)", 
       x = "Dragon Power Level", 
       y = "Number of Dragons") +
  theme_minimal()

# Calculate statistics for dragon power
dragon_stats <- dragons %>%
  summarise(
    mean_power = mean(power_level),
    median_power = median(power_level),
    min_power = min(power_level),
    max_power = max(power_level)
  )

print("Dragon Power Statistics:")
print(dragon_stats)

[1] "Dragon Power Statistics:"

  mean_power median_power min_power max_power
1     49.656           50         5        98

../../../_images/e86c7f7e67883a8f3f02e0745310c4dfed53045ae558d6d2f8179902bbbd21dd.png

💡 Key Learning#

In symmetrical distributions, mean and median are very close!

🦄 Investigation 2: Unicorn Power (Left-Skewed Distribution)#

Most unicorns have high power, but a few have very low power - this creates a “tail” on the left.

# Create histogram of unicorn magic
ggplot(unicorns, aes(x = power_level)) +
  geom_histogram(fill = "pink", color = "white", alpha = 0.7, bins = 25) +
  labs(title = "Unicorn Power Distribution (Left-Skewed)", 
       x = "Unicorn Power Level", 
       y = "Number of Unicorns") +
  theme_minimal()

# Calculate statistics for unicorn power
unicorn_stats <- unicorns %>%
  summarise(
    mean_power = mean(power_level),
    median_power = median(power_level),
    min_power = min(power_level),
    max_power = max(power_level)
  )

print("Unicorn Power Statistics:")
print(unicorn_stats)

[1] "Unicorn Power Statistics:"

  mean_power median_power min_power max_power
1     80.331           82        22       100

../../../_images/f0cafc40a7babd8267daf61468eac13dd52a07a5d4744ba9bf0de0c00ef5dd6e.png

💡 Key Learning#

In left-skewed distributions, the mean is PULLED DOWN by the low values!

🔥 Investigation 3: Phoenix Power (Right-Skewed Distribution)#

Most phoenixes have low power, but a few have extremely high power - this creates a “tail” on the right.

# Create histogram of phoenix energy
ggplot(phoenixes, aes(x = power_level)) +
  geom_histogram(fill = "#f1c979", color = "white", alpha = 0.7, bins = 25) +
  labs(title = "Phoenix Power Distribution (Right-Skewed)", 
       x = "Phoenix Power Level", 
       y = "Number of Phoenixes") +
  theme_minimal()

# Calculate statistics for phoenix power
phoenix_stats <- phoenixes %>%
  summarise(
    mean_power = mean(power_level),
    median_power = median(power_level),
    min_power = min(power_level),
    max_power = max(power_level)
  )

print("🔥 Phoenix Power Statistics:")
print(phoenix_stats)

[1] "🔥 Phoenix Power Statistics:"

  mean_power median_power min_power max_power
1     19.493           17         0        73

../../../_images/b64b7b2619243624a32f95219c920a39fb6e6e721490527922a9b3a9e8ff31a1.png

💡 Key Learning#

In right-skewed distributions, the mean is PULLED UP by the high values!

🎨 Stacked Distribution Gallery#

Let’s see all three distributions stacked vertically in one beautiful view!

# Create individual plots
p1 <- ggplot(dragons, aes(x = power_level)) +
  geom_histogram(fill ="#a4ce89", color = "white", alpha = 0.7, bins = 25)  +
  labs(title = "Dragon Power (Normal Distribution)", 
       x = "Power Level", 
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

p2 <- ggplot(unicorns, aes(x = power_level)) +
  geom_histogram(fill = "pink", color = "white", alpha = 0.7, bins = 25) +
  labs(title = "Unicorn Power (Left-Skewed Distribution)", 
       x = "Power Level", 
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

p3 <- ggplot(phoenixes, aes(x = power_level)) +
  geom_histogram(fill = "#f1c979", color = "white", alpha = 0.7, bins = 25)  +
  labs(title = "Phoenix Power (Right-Skewed Distribution)", 
       x = "Power Level", 
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

# Stack them in a 3x1 grid (3 rows, 1 column)
grid.arrange(p1, p2, p3, nrow = 3, ncol = 1)

../../../_images/72995e43ec5b36d236ad29f91b8e8647be54584ff51f7216fade07e91d3524f4.png

💡 What Do You Notice?#

Look at how different each shape is! Can you see the “tails” and where most of the data clusters?

🔍 The Great Comparison Challenge#

Let’s compare all three distributions with numbers!

# Combine all the statistics
all_stats <- data.frame(
  Creature = c("Dragon (Normal)", "Unicorn (Left-Skewed)", "Phoenix (Right-Skewed)"),
  Mean = c(dragon_stats$mean_power, unicorn_stats$mean_power, phoenix_stats$mean_power),
  Median = c(dragon_stats$median_power, unicorn_stats$median_power, phoenix_stats$median_power),
  Difference = c(
    abs(dragon_stats$mean_power - dragon_stats$median_power),
    abs(unicorn_stats$mean_power - unicorn_stats$median_power),
    abs(phoenix_stats$mean_power - phoenix_stats$median_power)
  )
)

print("📊 The Great Statistics Comparison:")
print(all_stats)

[1] "📊 The Great Statistics Comparison:"

                Creature   Mean Median Difference
      Dragon (Normal) 49.656     50      0.344
Unicorn (Left-Skewed) 80.331     82      1.669
Phoenix (Right-Skewed) 19.493     17      2.493

💡 Mystery Question#

Which creature type has the biggest difference between mean and median? Why?

💡 Take-Home Messages from Spell 1#

Symmetrical distributions: Mean ≈ Median (they’re close friends!)
Left-skewed distributions: Mean < Median (mean gets pulled down)
Right-skewed distributions: Mean > Median (mean gets pulled up)
When data is skewed, median is often a better measure of “typical” value

🎈 Your Turn!#

Try running each code chunk and observe how the different distribution shapes affect the relationship between mean, median, and mode. Can you explain why the mean “moves” toward the tail in skewed distributions?

🔮 Day 4 - Spell 2: The Great Creature Sampling Adventure#

💡 What is Sampling?#

Imagine you want to know how many dragons live in the entire Enchanted Forest, but you can’t count them all! So you explore small areas, count the dragons there, and use that to guess about the whole forest. That’s sampling!

🌲 Population = All the creatures in the entire Enchanted Forest (everything we want to know about)
🔍 Sample = The creatures we find in one small area we explore (the small group we actually study)
📊 Statistic = Something we calculate using our sample (like “30% of creatures in our sample are dragons”)
🎯 Inference = Using our sample to make a conclusion about the whole population

🐉 Our Magical Population: The Enchanted Forest#

Let’s create our population of 100000 magical creatures where exactly 30% are dragons!

# Create our population of 100000 creatures
# 30000 dragons (30%) and 70000 other creatures (70%)
magical_creatures <- tibble(
  creature_id = 1:100000,
  creature_type = factor(c(rep("dragon", 30000), rep("other", 70000)))
)

# Let's verify our true population proportion
true_dragon_proportion <- sum(magical_creatures$creature_type == "dragon") / nrow(magical_creatures)
print(paste("🎯 True proportion of dragons in population:", true_dragon_proportion))

# Visualize our population
population_summary <- magical_creatures %>%
  count(creature_type) %>%
  mutate(proportion = n / sum(n))

ggplot(population_summary, aes(x = creature_type, y = proportion, fill = creature_type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("dragon" = "orange", "other" = "gray")) +
  labs(title = "True Population: Enchanted Forest Creatures",
       x = "Creature Type", 
       y = "Proportion",
       subtitle = "Population Size: 100000 creatures, 30% dragons") +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19),
        legend.position = "none")

[1] "🎯 True proportion of dragons in population: 0.3"

../../../_images/2af0650d2e063e914ce6c9c66f0b2fd0d350251ceb0d1673d81a0c70352ce335.png

🎣 Taking Our First Sample (Size 40)#

Let’s explore one area of the forest and see what creatures we find!

# Take one random sample of 40 creatures
creature_sample_1 <- rep_sample_n(magical_creatures, size = 40)
head(creature_sample_1)

# What's the proportion of dragons in our sample?
dragon_estimate_1 <- creature_sample_1 %>%
  count(creature_type) %>%
  mutate(proportion = n / sum(n))

print("🔍 Our first sample results:")
head(dragon_estimate_1)

ggplot(dragon_estimate_1, aes(x = creature_type, y = proportion, fill = creature_type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("dragon" = "orange", "other" = "gray")) +
  labs(title = "First Sample: Enchanted Forest Creatures",
       x = "Creature Type", 
       y = "Proportion",
       subtitle = "Sample Size: 40 creatures") +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19),
        legend.position = "none")

A grouped_df: 6 × 3
replicate	creature_id	creature_type
<int>	<int>	<fct>
1	51663	other
1	57870	other
1	2986	dragon
1	29925	dragon
1	95246	other
1	68293	other

[1] "🔍 Our first sample results:"

A grouped_df: 2 × 4
replicate	creature_type	n	proportion
<int>	<fct>	<int>	<dbl>
1	dragon	15	0.375
1	other	25	0.625

../../../_images/104c1d2f3214eab0e14c3d4b3e95255c6e52694fe6cfd1626f7fbbabd9244df6.png

🎣 Taking Another Sample#

What happens if we explore a different area of the forest? Will we get the same result?

# Take another random sample of 40 creatures
dragon_estimate_2 <- rep_sample_n(magical_creatures, size = 40) %>%
  count(creature_type) %>%
  mutate(proportion = n / sum(n))

print("🔍 Our second sample results:")
head(dragon_estimate_2)

ggplot(dragon_estimate_2, aes(x = creature_type, y = proportion, fill = creature_type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("dragon" = "orange", "other" = "gray")) +
  labs(title = "Second Sample: Enchanted Forest Creatures",
       x = "Creature Type", 
       y = "Proportion",
       subtitle = "Sample Size: 40 creatures") +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19),
        legend.position = "none")

[1] "🔍 Our second sample results:"

A grouped_df: 2 × 4
replicate	creature_type	n	proportion
<int>	<fct>	<int>	<dbl>
1	dragon	13	0.325
1	other	27	0.675

../../../_images/026ad3674ff95847b89602d9c8519ae2a6f66971efd074940afb7de4d38a5251.png

🎣 And One More Sample!#

Let’s try once more to see the variability between samples:

# Take a third random sample of 40 creatures
dragon_estimate_3 <- rep_sample_n(magical_creatures, size = 40) %>%
    count(creature_type) %>%
  mutate(proportion = n / sum(n))

print("🔍 Our third sample results:")
head(dragon_estimate_3)

ggplot(dragon_estimate_3, aes(x = creature_type, y = proportion, fill = creature_type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("dragon" = "orange", "other" = "gray")) +
  labs(title = "Third Sample: Enchanted Forest Creatures",
       x = "Creature Type", 
       y = "Proportion",
       subtitle = "Sample Size: 40 creatures") +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19),
        legend.position = "none")

[1] "🔍 Our third sample results:"

A grouped_df: 2 × 4
replicate	creature_type	n	proportion
<int>	<fct>	<int>	<dbl>
1	dragon	8	0.2
1	other	32	0.8

../../../_images/9a57245fc6a920798361de2a3af24820f454cec135982676b3b2fda6b0e37160.png

📊 The Sampling Distribution: Taking LOTS of Samples!#

What if we explore 1000 different areas of the forest? Let’s see what the distribution of our estimates looks like!

# Take 1000 samples of size 40 and calculate proportions for each
sampling_results <- rep_sample_n(magical_creatures, size = 40, reps = 300) %>%
  group_by(replicate) %>%
  summarize(proportion = sum(creature_type == "dragon") / 40)

mean_of_sampling <- mean(sampling_results$proportion)
sd_of_sampling <- sd(sampling_results$proportion)

# Plot the sampling distribution
ggplot(sampling_results, aes(x = proportion)) +
  geom_histogram(binwidth = 0.05, fill = "#91d4eb", color = "white", alpha = 0.7) +
  geom_vline(xintercept = true_dragon_proportion, color = "#9090ea", linetype = "solid", linewidth = 1) +
  geom_vline(xintercept = mean_of_sampling, color = "#e1b086", linetype = "dashed", linewidth = 1) +
  labs(title = "Sampling Distribution: 1000 Samples of Size 40",
       x = "Sample Proportion of Dragons", 
       y = "Count",
       subtitle = "Purple solid line shows true population proportion (0.30), Orange dashed line shows mean of sampling distribution") +
  theme_minimal() +
  theme(text = element_text(size = 12),
        plot.title = element_text(size = 16),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 12),
        legend.position = "none")


# Calculate statistics about our sampling distribution
print(paste("📊 Average of all sample proportions:", round(mean_of_sampling, 3)))
print(paste("📏 Spread (standard deviation) of sample proportions:", round(sd_of_sampling, 3)))
print(paste("🎯 True population proportion:", true_dragon_proportion))

[1] "📊 Average of all sample proportions: 0.307"

[1] "📏 Spread (standard deviation) of sample proportions: 0.071"

[1] "🎯 True population proportion: 0.3"

../../../_images/00c93a72530ce9f2db9ed707f71a26bfbbe0b4eec391e3277a4a64c19d1862c7.png

📊 The Magic of Sample Size: What If We Change Sample Size?#

Let’s see what happens when we use different sample sizes!

# Compare different sample sizes: 20, 40, and 200
sample_sizes <- c(20, 40, 200)
all_results <- tibble()

for (size in sample_sizes) {
  results <- rep_sample_n(magical_creatures, size = size, reps = 300) %>%
    group_by(replicate) %>%
    summarize(proportion = sum(creature_type == "dragon") / size,
              sample_size = paste("Size", size))
  
  all_results <- bind_rows(all_results, results)
}
  
  # Order facets small → large and compute mean per sample size for dashed lines
  all_results <- all_results %>%
    mutate(sample_size = factor(sample_size, levels = c("Size 20", "Size 40", "Size 200")))

  sample_means <- all_results %>%
    group_by(sample_size) %>%
    summarize(mean_proportion = mean(proportion), .groups = "drop")

  # Create comparison plot (3x1 facets, styled like the chart above)
  ggplot(all_results, aes(x = proportion, fill = sample_size)) +
    geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
    geom_vline(xintercept = true_dragon_proportion, color = "#9090ea", linetype = "solid", linewidth = 1) +
    geom_vline(data = sample_means, aes(xintercept = mean_proportion), color = "#e1b086", linetype = "dashed", linewidth = 1) +
    facet_wrap(~sample_size, ncol = 1) +
    scale_fill_manual(values = c(
      "Size 20" = "#cfe8ff",  # light blue
      "Size 40" = "#91d4eb",  # medium blue (matches above)
      "Size 200" = "#4da3d9"  # darker blue
    )) +
    labs(title = "The Magic of Sample Size: All Distributions Compared",
         x = "Sample Proportion of Dragons", 
         y = "Count",
         subtitle = "Purple line = true proportion (0.30). Orange dashed line = mean of each sampling distribution.") +
    theme_minimal() +
    theme(
      legend.position = "none",
      text = element_text(size = 12),
      plot.title = element_text(size = 16),
      axis.title = element_text(size = 15),
      axis.text = element_text(size = 12)
    )

# Calculate summary statistics
summary_stats <- all_results %>%
  group_by(sample_size) %>%
  summarize(
    Average_Proportion = mean(proportion),
    Spread_SD = sd(proportion),
    .groups = "drop"
  ) %>%
  mutate(True_Proportion = true_dragon_proportion,
         Difference = abs(Average_Proportion - True_Proportion))

print("🏆 Summary of All Experiments:")
print(summary_stats)

[1] "🏆 Summary of All Experiments:"

# A tibble: 3 × 5
  sample_size Average_Proportion Spread_SD True_Proportion Difference
  <fct>                    <dbl>     <dbl>           <dbl>      <dbl>
1 Size 20                  0.306    0.104              0.3   0.00583 
2 Size 40                  0.303    0.0720             0.3   0.00333 
3 Size 200                 0.299    0.0309             0.3   0.000567

../../../_images/deae790c4ca1b4b1144b2540db635ab5c44765640e49411661ee3462ecc84380.png

🎯 Key Magical Discoveries from Spell 2!#

🔍 Discovery 1: The Law of Large Numbers

All sample sizes give us averages close to the true population proportion (0.30)
Larger samples don’t change the average, but they make it more reliable!

📏 Discovery 2: Sample Size and Spread

Small samples (10): High variability - samples can be quite different from each other
Medium samples (50): Medium variability - samples are more consistent
Large samples (100): Low variability - samples are very consistent and close to truth

🎪 Discovery 3: The Sampling Distribution Shape

All sampling distributions are roughly bell-shaped (normal)
They all center around the true population value
Larger sample sizes create narrower, more precise distributions

💡 Real-World Magic Applications#

🗳️ Political Polls: When pollsters survey 1000 people instead of 100, their predictions are much more accurate!

🎮 Game Testing: Game companies test with large groups to get reliable feedback about what players really think.

🍕 Restaurant Reviews: A restaurant with 500 reviews gives you a better idea of quality than one with 5 reviews!

⚕️ Medical Research: Doctors test medicines on large groups to be confident about the results.

🔮 Day 4 - Spell 3: Bootstrap Bootcamp - The Great Dragon Fire Sheep Rescue#

📖 The Story: Dragon Fire Sheep Rescue! 🐑🔥#

URGENT QUEST ALERT! The ancient Dragon of Mount Statistics has accidentally breathed fire across the Enchanted Meadows, where Farmer Luna (white sheep) and Farmer Obsidian (black sheep) graze their magical flocks. The fire didn’t harm the sheep, but created a massive cloud of magical smoke that mixed all the flocks together!

The two farmers are worried and need to know how many of their sheep are mixed together in the smoky field. But here’s the problem: the smoke is so thick that shepherds can only see a small group of sheep at a time!

⚡ Your Challenge: Use the ancient art of Bootstrap Magic to estimate what percentage of the total flock is black sheep!

🐑 Creating Our Magical Sheep Population#

Let’s create our mixed flock where 60% are black sheep and 40% are white sheep!

# Create our population of 10000 magical sheep
# 6000 black sheep (60%) and 4000 white sheep (40%)
magical_sheep <- tibble(
  sheep_id = 1:10000,
  sheep_color = factor(c(rep("black", 6000), rep("white", 4000)))
)

# Let's verify our true population proportion
true_black_proportion <- sum(magical_sheep$sheep_color == "black") / nrow(magical_sheep)
print(paste("🎯 True proportion of black sheep in population:", true_black_proportion))

# Visualize our population
sheep_summary <- magical_sheep %>%
  count(sheep_color) %>%
  mutate(proportion = n / sum(n))

ggplot(sheep_summary, aes(x = sheep_color, y = proportion, fill = sheep_color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("black" = "black", "white" = "lightgray")) +
  labs(title = "True Population: Mixed Sheep Flock",
       x = "Sheep Color", 
       y = "Proportion",
       subtitle = "Population Size: 1000 sheep, 60% black sheep") +
  theme_minimal() +
  theme(legend.position = "none")

[1] "🎯 True proportion of black sheep in population: 0.6"

../../../_images/672df71becd803ccc1b3921d90aec45762e46caf93cd5d8073944ba7dceee3e3.png

💡 What is Bootstrapping?#

🐑 The Sheep Magic Trick! You can only see 20 sheep through the smoke, but you want to know what the whole flock looks like. Bootstrapping is like having a magic hat:

🎩 The Magic Hat Steps:

Put your 20 sheep in the magic hat (your original sample through the smoke)
Pull one sheep out, write down its color (black or white)
PUT IT BACK in the hat! (this is the magic part - WITH REPLACEMENT)
Repeat 20 times to create a “pretend new sample”
Do this magic trick hundreds of times to see all possible pretend samples!

By doing this magic trick hundreds of times, you can imagine what it would be like if you could take hundreds of real samples from the smoky field!

🔍 Phase 1: Your Original Vision (Taking Our First Sample)#

You can only see 20 sheep through the smoke. Let’s take our first sample!

# Take one original sample of 20 sheep that we can see through the smoke
original_sample <- rep_sample_n(magical_sheep, size = 20)

# Count black vs white sheep in our original sample
original_count <- original_sample %>%
  count(sheep_color) %>%
  mutate(proportion = n / sum(n))

print("🔍 Our original sample through the smoke:")
print(original_count)

ggplot(original_count, aes(x = sheep_color, y = proportion, fill = sheep_color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("black" = "black", "white" = "lightgray")) +
  labs(title = "First sample through the smoke",
       x = "Sheep Color", 
       y = "Proportion",
       subtitle = "Sample Size: 20 sheep") +
  theme_minimal() +
  theme(legend.position = "none")

# Calculate our initial estimate
original_black_prop <- sum(original_sample$sheep_color == "black") / 20
print(paste("📊 Our estimate of black sheep percentage:", round(original_black_prop * 100, 1), "%"))
print(paste("🎯 True black sheep percentage:", round(true_black_proportion * 100, 1), "%"))

[1] "🔍 Our original sample through the smoke:"

# A tibble: 2 × 4
# Groups:   replicate [1]
  replicate sheep_color     n proportion
      <int> <fct>       <int>      <dbl>
1         1 black          14        0.7
2         1 white           6        0.3

[1] "📊 Our estimate of black sheep percentage: 70 %"

[1] "🎯 True black sheep percentage: 60 %"

../../../_images/e1c064075d5607f700728e110691940f08790d917ea16bdeeedfec7aadcb07cb.png

🔄 Phase 2: The Bootstrap Magic Ritual#

Now we’ll use bootstrap magic! We’ll resample from our original sample WITH REPLACEMENT to create many bootstrap samples.

# Let's try one bootstrap resample to see how it works
# We sample WITH REPLACEMENT from our original sample
bootstrap_sample_1 <- original_sample %>%
  ungroup() %>%
  select(sheep_color) %>%
  rep_sample_n(size = 20, replace = TRUE)

bootstrap_count <- bootstrap_sample_1 %>%
  count(sheep_color) %>%
  mutate(proportion = n / sum(n))

ggplot(bootstrap_count, aes(x = sheep_color, y = proportion, fill = sheep_color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("black" = "black", "white" = "lightgray")) +
  labs(title = "First bootstrap sample",
       x = "Sheep Color", 
       y = "Proportion",
       subtitle = "BootstrapSample Size: 20 sheep (with replacement)") +
  theme_minimal() +
  theme(legend.position = "none")

bootstrap_prop_1 <- sum(bootstrap_sample_1$sheep_color == "black") / 20
print(paste("🎲 Bootstrap sample 1 - black sheep proportion:", round(bootstrap_prop_1, 3)))

[1] "🎲 Bootstrap sample 1 - black sheep proportion: 0.55"

../../../_images/0b6fc34aac5dfc9e0b7ccfb86a8b81e49a74040c16ee52b72bb13c42f386aaed.png

# Let's try another bootstrap sample
bootstrap_sample_2 <- original_sample %>%
  ungroup() %>%
  select(sheep_color) %>%
  rep_sample_n(size = 20, replace = TRUE)

bootstrap_count <- bootstrap_sample_2 %>%
  count(sheep_color) %>%
  mutate(proportion = n / sum(n))

ggplot(bootstrap_count, aes(x = sheep_color, y = proportion, fill = sheep_color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("black" = "black", "white" = "lightgray")) +
  labs(title = "🐑 Second bootstrap sample",
       x = "Sheep Color", 
       y = "Proportion",
       subtitle = "BootstrapSample Size: 20 sheep (with replacement)") +
  theme_minimal() +
  theme(legend.position = "none")

bootstrap_prop_2 <- sum(bootstrap_sample_2$sheep_color == "black") / 20
print(paste("🎲 Bootstrap sample 2 - black sheep proportion:", round(bootstrap_prop_2, 3)))

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <f0>”

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <9f>”

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <90>”

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <91>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <f0>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <9f>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <90>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🐑 Second bootstrap sample' in 'mbcsToSbcs': dot substituted for <91>”

[1] "🎲 Bootstrap sample 2 - black sheep proportion: 0.65"

../../../_images/413cef02d0d9c79aa140b77feaee4f6c9a0d9649eb073f493e8b8e34c2d36717.png

# And one more to see the variability
bootstrap_sample_3 <- original_sample %>%
  ungroup() %>%
  select(sheep_color) %>%
  rep_sample_n(size = 20, replace = TRUE)

bootstrap_count <- bootstrap_sample_3 %>%
  count(sheep_color) %>%
  mutate(proportion = n / sum(n))

ggplot(bootstrap_count, aes(x = sheep_color, y = proportion, fill = sheep_color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("black" = "black", "white" = "lightgray")) +
  labs(title = "Third bootstrap sample",
       x = "Sheep Color", 
       y = "Proportion",
       subtitle = "BootstrapSample Size: 20 sheep (with replacement)") +
  theme_minimal() +
  theme(legend.position = "none")

bootstrap_prop_3 <- sum(bootstrap_sample_3$sheep_color == "black") / 20
print(paste("🎲 Bootstrap sample 3 - black sheep proportion:", round(bootstrap_prop_3, 3)))

[1] "🎲 Bootstrap sample 3 - black sheep proportion: 0.75"

../../../_images/415008390164000c23bce75de196bc3a596f7d637737adc2d53d3d9262b87d26.png

🌟 The Full Bootstrap Distribution: 1000 Magic Resamples!#

Now let’s create many bootstrap samples to see the full distribution of possible estimates!

# Create 1000 bootstrap samples from our original sample
original_sample_clean <- original_sample %>%
  ungroup() %>%
  select(sheep_color)

bootstrap_results <- original_sample_clean %>%
  rep_sample_n(size = 20, replace = TRUE, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(black_proportion = sum(sheep_color == "black") / 20,
            .groups = "drop")

# Calculate statistics about our bootstrap distribution
bootstrap_mean <- mean(bootstrap_results$black_proportion)
bootstrap_sd <- sd(bootstrap_results$black_proportion)

# Plot the bootstrap distribution
ggplot(bootstrap_results, aes(x = black_proportion)) +
  geom_histogram(binwidth = 0.05, fill = "#89e5ed", color = "white", alpha = 0.7) +
  geom_vline(xintercept = original_black_prop, color = "#ce8ff0", linetype = "dashed", linewidth = 2) +
  geom_vline(xintercept = true_black_proportion, color = "#ea9d69", linetype = "solid", linewidth = 2) +
  geom_vline(xintercept = bootstrap_mean, color = "#62aa55", linetype = "longdash", linewidth = 2) +
  labs(title = "Bootstrap Distribution: 1000 Magic Resamples",
       x = "Bootstrap Sample Proportion of Black Sheep", 
       y = "Count",
       subtitle = "Purple line: Our original sample estimate, \n Orange line: True proportion, \n Green line: Mean of bootstrap distribution") +
  theme_minimal()


print(paste("📊 Average of bootstrap proportions:", round(bootstrap_mean, 3)))
print(paste("📏 Spread (standard deviation) of bootstrap proportions:", round(bootstrap_sd, 3)))
print(paste("🔍 Our original sample proportion:", round(original_black_prop, 3)))
print(paste("🎯 True population proportion:", round(true_black_proportion, 3)))

[1] "📊 Average of bootstrap proportions: 0.702"

[1] "📏 Spread (standard deviation) of bootstrap proportions: 0.104"

[1] "🔍 Our original sample proportion: 0.7"

[1] "🎯 True population proportion: 0.6"

../../../_images/0c203a62764ae25713c2470ba0b01de0e2f2deb7a83e888fc1bce1cc7e6ae7b0.png

🏆 Phase 3: Creating Our Confidence Prophecy (Confidence Interval)#

🎯 Your Confidence Prophecy! After doing our magic hat trick many times, we noticed that most of our “pretend samples” gave us similar results. Now we can tell the farmers: “Based on all our magic hat tricks, I’m 90% confident that the true percentage of black sheep in your mixed flock is somewhere in this range!”

It’s like saying “I’m pretty sure the answer is somewhere between these two numbers, but I can’t be 100% certain because I could only see a small part of the flock through the smoke.”

# Calculate 90% confidence interval
# This means we're 90% confident the true proportion is in this range
confidence_level <- 0.90
alpha <- 1 - confidence_level
lower_percentile <- (alpha / 2) * 100
upper_percentile <- (1 - alpha / 2) * 100

confidence_interval <- quantile(bootstrap_results$black_proportion, 
                               probs = c(alpha/2, 1 - alpha/2))

print("Our 90% Confidence Prophecy (Confidence Interval):")
print(paste("We are 90% confident that the true proportion of black sheep is between", 
            round(confidence_interval[1], 3), "and", round(confidence_interval[2], 3)))

# Did our confidence interval capture the true value?
captured_truth <- (true_black_proportion >= confidence_interval[1]) & 
                  (true_black_proportion <= confidence_interval[2])

if (captured_truth) {
  print("🎉 SUCCESS! Our confidence interval captured the true proportion!")
} else {
  print("😅 Oops! Our confidence interval missed the true proportion this time.")
}

# Visualize our confidence interval

ggplot(bootstrap_results, aes(x = black_proportion)) +
  geom_histogram(binwidth = 0.05, fill = "#89e5ed", color = "white", alpha = 0.7) +
  geom_vline(xintercept = original_black_prop, color = "#ce8ff0", linetype = "dashed", linewidth = 2) +
  geom_vline(xintercept = true_black_proportion, color = "#ea9d69", linetype = "solid", linewidth = 2) +
  geom_vline(xintercept = bootstrap_mean, color = "#62aa55", linetype = "longdash", linewidth = 2) +
  geom_vline(xintercept = confidence_interval[1], color = "#9ecb97", linetype = "dotted", linewidth = 1) +
  geom_vline(xintercept = confidence_interval[2], color = "#9ecb97", linetype = "dotted", linewidth = 1) +
  labs(title = "🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy",
       x = "Bootstrap Sample Proportion of Black Sheep", 
       y = "Count",
       subtitle = "Purple line: Our original sample estimate, \n Orange line: True proportion, \n Green line: Mean of bootstrap distribution \n Green dotted lines: 90% Confidence Prophecy") +
  theme_minimal()

[1] "Our 90% Confidence Prophecy (Confidence Interval):"

[1] "We are 90% confident that the true proportion of black sheep is between 0.55 and 0.85"

[1] "🎉 SUCCESS! Our confidence interval captured the true proportion!"

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <f0>”

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <9f>”

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <94>”

Warning message in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <ae>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <f0>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <9f>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <94>”

Warning message in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
“conversion failure on '🔮 Bootstrap Distribution: 1000 Magic Resamples with 90% Confidence Prophecy' in 'mbcsToSbcs': dot substituted for <ae>”

../../../_images/c91fd97a7c937a6837744dd1aaff29ed8c8c8a3020c77ff8dcbc0d4809558852.png

🎲 Experiment: What If We Had Different Sample Sizes?#

Let’s see how our confidence gets better with larger samples!

# Function to create bootstrap CI for different sample sizes
create_bootstrap_ci <- function(sample_size, n_bootstrap = 500) {
  # Take original sample of given size
  sample_data <- rep_sample_n(magical_sheep, size = sample_size)
  sample_clean <- sample_data %>% ungroup() %>% select(sheep_color)
  
  # Create bootstrap distribution
  bootstrap_dist <- sample_clean %>%
    rep_sample_n(size = sample_size, replace = TRUE, reps = n_bootstrap) %>%
    group_by(replicate) %>%
    summarize(black_proportion = sum(sheep_color == "black") / sample_size,
              .groups = "drop")
  
  # Calculate 90% CI
  ci <- quantile(bootstrap_dist$black_proportion, probs = c(0.05, 0.95))
  
  return(list(
    sample_size = sample_size,
    ci_lower = ci[1],
    ci_upper = ci[2],
    ci_width = ci[2] - ci[1],
    bootstrap_dist = bootstrap_dist
  ))
}

# Compare different sample sizes
sample_sizes <- c(10, 20, 50, 200, 500)
ci_results <- tibble()

for (size in sample_sizes) {
  result <- create_bootstrap_ci(size)
  new_row <- tibble(
    sample_size = result$sample_size,
    ci_lower = result$ci_lower,
    ci_upper = result$ci_upper,
    ci_width = result$ci_width
  )
  ci_results <- bind_rows(ci_results, new_row)
}

print("🔬 How Sample Size Affects Confidence Interval Width:")
print(ci_results)

# Visualize the effect of sample size
ggplot(ci_results, aes(x = factor(sample_size))) +
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper), linewidth = 3, color = "#9ecb97") +
  geom_hline(yintercept = true_black_proportion, color = "#ea9d69", linetype = "dashed") +
  labs(title = "Magic of Sample Size: Confidence Intervals \nGet Narrower!",
       x = "Sample Size",
       y = "90% Confidence Interval",
       subtitle = "Orange line: True proportion. Green lines: Confidence intervals") +
  theme_minimal()

[1] "🔬 How Sample Size Affects Confidence Interval Width:"

# A tibble: 5 × 4
  sample_size ci_lower ci_upper ci_width
        <dbl>    <dbl>    <dbl>    <dbl>
1          10    0.3      0.8     0.5   
2          20    0.45     0.8     0.35  
3          50    0.44     0.68    0.24  
4         200    0.565    0.68    0.115 
5         500    0.554    0.626   0.0720

../../../_images/9cd8b1eac44b6cd43cde4a77d47580ea659ec22cac34d2330fa2e2ec6c3bc95a.png

🔮 Day 4 - Ice Breaker Activity: Human Histogram & Statistics Discovery 📏#

This example shows you how to:

Load messy height data
Clean and fix common data problems
Calculate min, max, median, and mean
Handle missing or weird values

# 📁 Step 1: Load the data
########################################################

# Load the height data from our CSV file
height_data <- read.csv("../datasets/height.csv")

# 🧹 Step 2: Clean the data
########################################################
# only keep the second and third columns
height_data <- height_data[, c(2:3)] 

# The height column has a very long name! Let's rename it
colnames(height_data)[1] <- "wizard_name"
colnames(height_data)[2] <- "height_cm"

# Let's see the updated column names
print("New column names:")
print(colnames(height_data))

# Convert height to numeric (in case it's stored as text)
height_data$height_cm <- as.numeric(height_data$height_cm)

# Check for any weird values or missing data
print("Summary of height data:")
summary(height_data$height_cm)

[1] "New column names:"

[1] "wizard_name" "height_cm"

Warning message:
“NAs introduced by coercion”

[1] "Summary of height data:"

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
1.00e+00 1.35e+02 1.58e+02 4.00e+22 1.72e+02 1.00e+24        2 

# 🔍 Step 3: Data quality checks
########################################################

# Check for missing values (NA)
missing_count <- sum(is.na(height_data$height_cm))
print(paste("Number of missing heights:", missing_count))

# Check for unrealistic heights (too small or too large)
# Kids are usually between 100-200 cm tall
weird_heights <- height_data$height_cm[height_data$height_cm < 30 | height_data$height_cm > 190]
if(length(weird_heights) > 0) {
  print("Warning: Found some unusual heights:")
  print(weird_heights)
}

[1] "Number of missing heights: 2"

[1] "Warning: Found some unusual heights:"
[1] 5e+00    NA 6e+00 1e+00 3e+00 1e+24    NA 2e+02

# 📊 Step 4: Calculate basic statistics
########################################################

# Remove any missing values for our calculations
clean_heights <- height_data$height_cm[!is.na(height_data$height_cm)]

# Calculate the basic statistics
min_height <- min(clean_heights)
max_height <- max(clean_heights)
mean_height <- mean(clean_heights)
median_height <- median(clean_heights)

# Print our results in a nice format
print("🎉 HEIGHT STATISTICS 🎉")
print("========================")
print(paste("📏 Minimum height:", min_height, "cm"))
print(paste("📏 Maximum height:", max_height, "cm"))
print(paste("📏 Average height:", round(mean_height, 2), "cm"))
print(paste("📏 Median height:", median_height, "cm"))
print(paste("👥 Total people measured:", length(clean_heights)))

[1] "🎉 HEIGHT STATISTICS 🎉"

[1] "========================"

[1] "📏 Minimum height: 1 cm"

[1] "📏 Maximum height: 1e+24 cm"

[1] "📏 Average height: 4e+22 cm"

[1] "📏 Median height: 158 cm"

[1] "👥 Total people measured: 25"

🔮 Day 4 - Game 1: The Enchanted Halloween Neighborhood Mystery🎃#

The Mgaic Council of Halloween has received reports that the Shadow Monster has been cursing neighborhoods, turning precious candies into worthless shadow stones! As young Statistical Wizards, your mission is to determine if the Enchanted Forest Neighborhood is safe for trick-or-treating.

The neighborhood has a magical protection spell - if it contains 60% or more real candies (not shadow stones), it’s safe to visit. But beware! You only have 3 minutes of magical sight to examine your sample before the spell fades.

Round 1#

## 🧹 Step 1: Set Parameters and Load Round 1 Data
library(dplyr)
library(ggplot2)

# 🎯 Set the maximum number of candies possible in each bag
MAX_CANDIES_PER_BAG <- 10  # Change this if needed!

# 🏰 Set the safety threshold (60% or more means SAFE)
SAFETY_THRESHOLD <- 60

cat("🎃 Maximum candies per bag:", MAX_CANDIES_PER_BAG, "\n")
cat("🛡️ Safety threshold:", SAFETY_THRESHOLD, "% (>= this percentage means SAFE)\n")

🎃 Maximum candies per bag: 10

🛡️ Safety threshold: 60 % (>= this percentage means SAFE)

# 💡 Load Round 1 candy count data
# Make sure this CSV file is in your datasets folder!

# Round 1 data (replace with your actual file name)
round1_data <- read.csv("../datasets/candy_round1.csv")

# 🧹 Clean Round 1 data - use only columns 2, 3, 4 and clean data types

# Function to clean candy data
clean_candy_data <- function(data, round_name, candy_max = MAX_CANDIES_PER_BAG) {
  # Select only columns 2, 3, 4 and rename them
  cleaned <- data %>%
    select(
      wizard_name = 2,      # Column 2: wizard name
      candy_count_raw = 3,  # Column 3: candy count
      decision = 4          # Column 4: decision
    ) %>%
    # Convert candy count to numeric and clean invalid entries
    mutate(
      candy_count = as.numeric(candy_count_raw),
      round = round_name
    ) %>%
    # Remove rows with invalid candy counts (NA, negative, or above maximum)
    filter(
      !is.na(candy_count), 
      candy_count >= 0, 
      candy_count <= candy_max
    ) %>%
    # Calculate candy percentage
    mutate(
      candy_percentage = candy_count / candy_max * 100
    ) %>%
    # Keep only the columns we need
    select(wizard_name, candy_count, candy_percentage, decision, round)
  
  return(cleaned)
}

# Clean Round 1 dataset
round1_clean <- clean_candy_data(round1_data, "Round 1")

cat("✨ Round 1 Cleaned Data Summary:\n")
cat("Total wizards:", nrow(round1_clean), "\n")
cat("Valid candy counts range:", min(round1_clean$candy_count), "to", max(round1_clean$candy_count), "\n")

✨ Round 1 Cleaned Data Summary:

Total wizards: 21

Valid candy counts range: 1 to 10

Sampling Distribution for Round 1#

# 📊 Create a histogram showing everyone's candy percentages from Round 1
round1_avg <- mean(round1_clean$candy_percentage)

hist_round1 <-ggplot(round1_clean, aes(x = candy_percentage)) +
  geom_histogram(fill = "pink", color = "white", alpha = 0.7) +
  geom_vline(xintercept = round1_avg, color = "orange", linewidth = 2) +
  labs(
    title = "Round 1: Sampling Distribution",
    subtitle = paste("Orange line = Sample average (", round(round1_avg, 1), "%)"),
    x = "Candy Percentage (%)",
    y = "Number of Wizards"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19))

print(hist_round1)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

../../../_images/8d5241512119b9d11c58cd8711893f7045616ad893bf14b0d07416bd135f30f7.png

Calculate Round 1 Statistics#

# 📊 Calculate important statistics for Round 1
round1_std <- sd(round1_clean$candy_percentage)

cat("Round 1 Results:\n")
cat("Sample average:", round(round1_avg, 2), "%\n")
cat("Sample standard deviation:", round(round1_std, 2), "\n")
cat("Number of wizards:", nrow(round1_clean), "\n")
cat("Range of candy counts:", min(round1_clean$candy_count), "to", max(round1_clean$candy_count), "candies\n")

Round 1 Results:

Sample average: 54.29 %

Sample standard deviation: 21.81

Number of wizards: 21

Range of candy counts: 1 to 10 candies

Round 1 Predictions Analysis#

# 📊 Analyze the predictions made by wizards in Round 1
round1_with_predictions <- round1_clean %>%
  mutate(
    predicted_answer = case_when(
      grepl("SAFE", decision, ignore.case = TRUE) ~ "SAFE",
      grepl("CURSED", decision, ignore.case = TRUE) ~ "CURSED",
      TRUE ~ "UNKNOWN"
    )
  )

# Count SAFE vs CURSED predictions
round1_summary <- round1_with_predictions %>%
  count(predicted_answer) %>%
  filter(predicted_answer %in% c("SAFE", "CURSED"))

cat("📊 Round 1 Prediction Summary:\n")
for(i in seq_len(nrow(round1_summary))) {
  cat(round1_summary$predicted_answer[i], ":", round1_summary$n[i], "wizards\n")
}

# Show individual results
#cat("\n🧙‍♀️ Individual Wizard Results (Round 1):\n")
#for(i in seq_len(nrow(round1_with_predictions))) {
#  cat("⭐", round1_with_predictions$wizard_name[i], "- Found", 
#      round1_with_predictions$candy_count[i], "candies (", 
#      round(round1_with_predictions$candy_percentage[i], 1), "%) - Predicted:", 
#      round1_with_predictions$predicted_answer[i], "\n")
#}

📊 Round 1 Prediction Summary:

CURSED : 10 wizards
SAFE : 11 wizards

Round 2#

Load Round 2 Data#

# 💡 Load Round 2 candy count data
round2_data <- read.csv("../datasets/candy_round2.csv")

MAX_CANDIES_PER_BAG <- 20

# Clean Round 2 dataset using the same function
round2_clean <- clean_candy_data(round2_data, "Round 2", candy_max = MAX_CANDIES_PER_BAG)
# Remove wizards with very low candy counts for better analysis
round2_clean <- round2_clean[which(round2_clean$candy_count > 1),]

cat("\n✨ Round 2 Cleaned Data Summary:\n")
cat("Total wizards:", nrow(round2_clean), "\n")
cat("Valid candy counts range:", min(round2_clean$candy_count), "to", max(round2_clean$candy_count), "\n")

✨ Round 2 Cleaned Data Summary:

Total wizards: 22

Valid candy counts range: 4 to 14

Sampling Distribution for Round 2#

# 📊 Create a histogram showing everyone's candy percentages from Round 2
round2_avg <- mean(round2_clean$candy_percentage)

hist_round2 <- ggplot(round2_clean, aes(x = candy_percentage)) +
  geom_histogram(fill = "#6cb2e0", color = "white", alpha = 0.7) +
  geom_vline(xintercept = round2_avg, color = "purple", linewidth = 2) +
  labs(
    title = "Round 2: Sampling Distribution",
    subtitle = paste("Purple line = Sample average (", round(round2_avg, 1), "%)"),
    x = "Candy Percentage (%)",
    y = "Number of Wizards"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19))

print(hist_round2)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

../../../_images/775489e184a6dbd984f1be71b09399008adea06d2ecdb4da84bfda2aa49389bf.png

Calculate Round 2 Statistics#

# 📊 Calculate important statistics for Round 2
round2_std <- sd(round2_clean$candy_percentage)

cat("Round 2 Results:\n")
cat("Sample average:", round(round2_avg, 2), "%\n")
cat("Sample standard deviation:", round(round2_std, 2), "\n")
cat("Number of wizards:", nrow(round2_clean), "\n")
cat("Range of candy counts:", min(round2_clean$candy_count), "to", max(round2_clean$candy_count), "candies\n")

Round 2 Results:

Sample average: 46.99 %

Sample standard deviation: 12.08

Number of wizards: 22

Range of candy counts: 4 to 14 candies

Round 2 Predictions Analysis#

# 📊 Analyze the predictions made by wizards in Round 2
round2_with_predictions <- round2_clean %>%
  mutate(
    predicted_answer = case_when(
      grepl("SAFE", decision, ignore.case = TRUE) ~ "SAFE",
      grepl("CURSED", decision, ignore.case = TRUE) ~ "CURSED",
      TRUE ~ "UNKNOWN"
    )
  )

# Count SAFE vs CURSED predictions
round2_summary <- round2_with_predictions %>%
  count(predicted_answer) %>%
  filter(predicted_answer %in% c("SAFE", "CURSED"))

cat("📊 Round 2 Prediction Summary:\n")
for(i in seq_len(nrow(round2_summary))) {
  cat(round2_summary$predicted_answer[i], ":", round2_summary$n[i], "wizards\n")
}

# Show individual results
#cat("\n🧙‍♀️ Individual Wizard Results (Round 2):\n")
#for(i in seq_len(nrow(round2_with_predictions))) {
#  cat("⭐", round2_with_predictions$wizard_name[i], "- Found", 
#      round2_with_predictions$candy_count[i], "candies (", 
#      round(round2_with_predictions$candy_percentage[i], 1), "%) - Predicted:", 
#      round2_with_predictions$predicted_answer[i], "\n")
#}

📊 Round 2 Prediction Summary:

CURSED : 15 wizards
SAFE : 7 wizards

Overlay the Distributions#

# 🎨 Combine both datasets for comparison
combined_data <- bind_rows(round1_clean, round2_clean)

# Create overlapping histograms
hist_combined <- ggplot(combined_data, aes(x = candy_percentage, fill = round)) +
  geom_histogram(alpha = 0.6, position = "identity", color = "white") +
  geom_vline(xintercept = round1_avg, color = "orange", linewidth = 3, linetype = "dotted") +
  geom_vline(xintercept = round2_avg, color = "purple", linewidth = 3, linetype = "dotted") +
  scale_fill_manual(values = c("Round 1" = "pink", "Round 2" = "#6cb2e0")) +
  labs(
    title = "Comparison: Round 1 vs Round 2",
    subtitle = "Orange dotted = Round 1 avg, Purple dotted = Round 2 avg",
    x = "Candy Percentage (%)",
    y = "Number of Wizards",
    fill = "Round"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19))

print(hist_combined)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

../../../_images/62b0c478b376b02fe55efbed20359ad6c01f834668ea135ff1ed24204f0da9e7.png

Basic Statistical Comparison#

cat("📊 COMPARISON OF BOTH ROUNDS:\n")
cat(paste(rep("=", 40), collapse = ""), "\n")

cat("📈 Sample Averages:\n")
cat("Round 1 average:", round(round1_avg, 2), "%\n")
cat("Round 2 average:", round(round2_avg, 2), "%\n")

cat("\n📏 Sample Spreads (Standard Deviation):\n")
cat("Round 1 spread:", round(round1_std, 2), "%\n")
cat("Round 2 spread:", round(round2_std, 2), "%\n")

cat("\n🎯 Sample Sizes:\n")
cat("Round 1:", nrow(round1_clean), "wizards\n")
cat("Round 2:", nrow(round2_clean), "wizards\n")

# Basic comparison without true value
avg_difference <- abs(round1_avg - round2_avg)
cat("\n🔍 Difference between round averages:", round(avg_difference, 2), "%\n")

if(round1_std < round2_std) {
  cat("📊 Round 1 had less spread (more consistent results)\n")
} else if(round2_std < round1_std) {
  cat("📊 Round 2 had less spread (more consistent results)\n")
} else {
  cat("📊 Both rounds had similar spread\n")
}

📊 COMPARISON OF BOTH ROUNDS:

========================================

📈 Sample Averages:

Round 1 average: 54.29 %

Round 2 average: 46.99 %

📏 Sample Spreads (Standard Deviation):

Round 1 spread: 21.81 %

Round 2 spread: 12.08 %

🎯 Sample Sizes:

Round 1: 21 wizards

Round 2: 22 wizards

🔍 Difference between round averages: 7.29 %

📊 Round 2 had less spread (more consistent results)

The Big Reveal - True Population Parameter!#

# 🎯 NOW we reveal the TRUE candy percentage for the Enchanted Forest Neighborhood
TRUE_CANDY_PERCENTAGE <- 50  # Change this if needed!

# 🏰 Determine the correct answer based on the 60% safety threshold
CORRECT_ANSWER <- ifelse(TRUE_CANDY_PERCENTAGE >= 60, "SAFE", "CURSED")

cat("🔮 THE BIG REVEAL!\n")
cat(paste(rep("=", 50), collapse = ""), "\n")
cat("🏰 TRUE candy percentage in the neighborhood:", TRUE_CANDY_PERCENTAGE, "%\n")
cat("🏆 CORRECT answer should be:", CORRECT_ANSWER, "\n")
cat("🛡️ (Remember: >= ", SAFETY_THRESHOLD, "% means SAFE, < ", SAFETY_THRESHOLD, "% means CURSED)\n")

🔮 THE BIG REVEAL!

==================================================

🏰 TRUE candy percentage in the neighborhood: 50 %

🏆 CORRECT answer should be: CURSED

🛡️ (Remember: >=  60 % means SAFE, <  60 % means CURSED)

Updated Comparison with Truth Revealed#

# Create the overlay plot WITH the true value line
hist_true <- ggplot(combined_data, aes(x = candy_percentage, fill = round)) +
  geom_histogram(alpha = 0.6, position = "identity", color = "white") +
  geom_vline(xintercept = TRUE_CANDY_PERCENTAGE, color = "#dc4d4d", linewidth = 3, linetype = "dashed") +
  geom_vline(xintercept = round1_avg, color = "orange", linewidth = 2, linetype = "dotted") +
  geom_vline(xintercept = round2_avg, color = "purple", linewidth = 2, linetype = "dotted") +
  scale_fill_manual(values = c("Round 1" = "pink", "Round 2" = "#6cb2e0")) +
  labs(
    title = "Comparison: Round 1 vs Round 2 vs TRUTH",
    subtitle = paste("Red = True value (", TRUE_CANDY_PERCENTAGE, "%), Orange dotted = Round 1 avg, Purple solid = Round 2 avg"),
    x = "Candy Percentage (%)",
    y = "Number of Wizards",
    fill = "Round"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 16),
        plot.title = element_text(size = 20),
        axis.title = element_text(size = 18),
        axis.text = element_text(size = 19))

print(hist_true)


# Calculate how close each round was to the truth
round1_error <- abs(round1_avg - TRUE_CANDY_PERCENTAGE)
round2_error <- abs(round2_avg - TRUE_CANDY_PERCENTAGE)

cat("\n🔮 ACCURACY ANALYSIS:\n")
cat("Round 1 average:", round(round1_avg, 2), "% (difference from truth:", round(round1_error, 2), "%)\n")
cat("Round 2 average:", round(round2_avg, 2), "% (difference from truth:", round(round2_error, 2), "%)\n")

if(round1_error < round2_error) {
  winner_round <- "Round 1"
  better_error <- round1_error
} else if(round2_error < round1_error) {
  winner_round <- "Round 2"  
  better_error <- round2_error
} else {
  winner_round <- "TIE"
  better_error <- round1_error
}

cat("\n🏆 WINNER: ", winner_round, " was closer to the true population parameter!\n")
if(winner_round != "TIE") {
  cat("📈 The winning round was only", round(better_error, 2), "% away from the truth!\n")
}

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

🔮 ACCURACY ANALYSIS:

Round 1 average: 54.29 % (difference from truth: 4.29 %)

Round 2 average: 46.99 % (difference from truth: 3.01 %)

🏆 WINNER:  Round 2  was closer to the true population parameter!

📈 The winning round was only 3.01 % away from the truth!

../../../_images/9f6d8e407a4b365f05519ed8f9dee5afa8d8f2fa978fbc8866e2289f4591b73c.png

Who Actually Got It Right?#

# Now we can determine who actually got the predictions right!

# Round 1 winners
round1_final <- round1_with_predictions %>%
  mutate(got_it_right = predicted_answer == CORRECT_ANSWER)

round1_winners <- round1_final %>%
  filter(got_it_right == TRUE)

#cat("🏆 ROUND 1 WINNERS (Wizards who got it right):\n")
#if(nrow(round1_winners) > 0) {
#  for(i in seq_len(nrow(round1_winners))) {
#    cat("⭐", round1_winners$wizard_name[i], "- Found", round1_winners$candy_count[i], "candies - Predicted:", round1_winners$predicted_answer[i], "\n")
#  }
#} else {
#  cat("No wizards got it exactly right in Round 1!\n")
#}

# Round 2 winners  
round2_final <- round2_with_predictions %>%
  mutate(got_it_right = predicted_answer == CORRECT_ANSWER)

round2_winners <- round2_final %>%
  filter(got_it_right == TRUE)

#cat("\n🏆 ROUND 2 WINNERS (Wizards who got it right):\n")
#if(nrow(round2_winners) > 0) {
#  for(i in seq_len(nrow(round2_winners))) {
#    cat("⭐", round2_winners$wizard_name[i], "- Found", round2_winners$candy_count[i], "candies - Predicted:", round2_winners$predicted_answer[i], "\n")
#  }
#} else {
#  cat("No wizards got it exactly right in Round 2!\n")
#}

#total_winners <- nrow(round1_winners) + nrow(round2_winners)
#cat("\n🎉 TOTAL PREDICTION WINNERS ACROSS BOTH ROUNDS:", total_winners, "wizards!\n")

🔮 Day 4 - Spell Solutions

Contents

🔮 Day 4 - Spell Solutions#

🔮 Day 4 - Spell 1: Distribution Detective Mission#

🎯 Mission Briefing#

💡 What are Distributions?#

🐉 Investigation 1: Dragon Power (Normal Distribution)#

💡 Key Learning#

🦄 Investigation 2: Unicorn Power (Left-Skewed Distribution)#

💡 Key Learning#

🔥 Investigation 3: Phoenix Power (Right-Skewed Distribution)#

💡 Key Learning#

🎨 Stacked Distribution Gallery#

💡 What Do You Notice?#

🔍 The Great Comparison Challenge#

💡 Mystery Question#

💡 Take-Home Messages from Spell 1#

🎈 Your Turn!#

🔮 Day 4 - Spell 2: The Great Creature Sampling Adventure#

💡 What is Sampling?#

🐉 Our Magical Population: The Enchanted Forest#

🎣 Taking Our First Sample (Size 40)#

🎣 Taking Another Sample#

🎣 And One More Sample!#

📊 The Sampling Distribution: Taking LOTS of Samples!#

📊 The Magic of Sample Size: What If We Change Sample Size?#

🎯 Key Magical Discoveries from Spell 2!#

💡 Real-World Magic Applications#

🔮 Day 4 - Spell 3: Bootstrap Bootcamp - The Great Dragon Fire Sheep Rescue#

📖 The Story: Dragon Fire Sheep Rescue! 🐑🔥#

🐑 Creating Our Magical Sheep Population#

💡 What is Bootstrapping?#

🔍 Phase 1: Your Original Vision (Taking Our First Sample)#

🔄 Phase 2: The Bootstrap Magic Ritual#

🌟 The Full Bootstrap Distribution: 1000 Magic Resamples!#

🏆 Phase 3: Creating Our Confidence Prophecy (Confidence Interval)#

🎲 Experiment: What If We Had Different Sample Sizes?#

🔮 Day 4 - Ice Breaker Activity: Human Histogram & Statistics Discovery 📏#

🔮 Day 4 - Game 1: The Enchanted Halloween Neighborhood Mystery🎃#

Round 1#

Sampling Distribution for Round 1#

Calculate Round 1 Statistics#

Round 1 Predictions Analysis#

Round 2#

Load Round 2 Data#

Sampling Distribution for Round 2#

Calculate Round 2 Statistics#

Round 2 Predictions Analysis#

Overlay the Distributions#

Basic Statistical Comparison#

The Big Reveal - True Population Parameter!#

Updated Comparison with Truth Revealed#

Who Actually Got It Right?#