🔮 Day 5 - Spell Solutions#


knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
'en_US.UTF-8'

🎯 Learning Goals#

In this spell, you'll learn to:

  • 🔍 Explore magical creature data with visualizations

  • 🤖 Build your first KNN (K-Nearest Neighbors) classification model

  • 📊 Evaluate model performance with accuracy and confusion matrices

  • ⚡ Find the optimal K value for best predictions

  • 🔮 Make predictions for new magical creatures


📚 Load Our Magical Libraries#

First, let's load the libraries we need for our machine learning magic!

# Load our magical libraries
library(tidymodels)  # For machine learning magic
library(dplyr)       # For data manipulation
library(kknn)        # For KNN classification
── Attaching packages ──────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.9     ✔ recipes      1.3.1
✔ dials        1.4.1     ✔ rsample      1.3.1
✔ dplyr        1.1.4     ✔ tibble       3.3.0
✔ ggplot2      3.5.2     ✔ tidyr        1.3.1
✔ infer        1.0.9     ✔ tune         1.3.0
✔ modeldata    1.5.0     ✔ workflows    1.2.0
✔ parsnip      1.3.2     ✔ workflowsets 1.1.1
✔ purrr        1.1.0     ✔ yardstick    1.3.2
── Conflicts ────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()

📖 Load the Magical Creature Dataset#

Oda has been collecting data about various magical creatures she's encountered. Let's see what she's discovered!

# Load the magical creature dataset
# This dataset contains information about various magical creatures Oda has encountered
creatures <- read.csv("../datasets/magical_creatures.csv")

# Take a look at our magical friends
head(creatures)
A data.frame: 6 × 6

  creature_id              name size magic_power friendliness_score    behavior
1           1   Sparkle Unicorn  7.2         8.5                9.1    friendly
2           2       Shadow Wolf  6.8         7.2                3.4 mischievous
3           3 Rainbow Butterfly  1.5         6.8                8.9    friendly
4           4    Thunder Dragon  9.5         9.8                2.1 mischievous
5           5     Healing Pixie  2.1         7.9                9.5    friendly
6           6       Storm Raven  4.2         8.1                3.8 mischievous

๐Ÿ” Part 1: Exploring the Magical Creature Data#

Letโ€™s get to know our magical creatures better!

# Let's see what variables we have to work with
# Each creature has: size, magic_power, friendliness_score, and behavior (friendly/mischievous)
glimpse(creatures)
Rows: 100
Columns: 6
$ creature_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ name               <chr> "Sparkle Unicorn", "Shadow Wolf", "Rainbow Butterfl…
$ size               <dbl> 7.2, 6.8, 1.5, 9.5, 2.1, 4.2, 5.8, 7.9, 1.8, 6.5, 3…
$ magic_power        <dbl> 8.5, 7.2, 6.8, 9.8, 7.9, 8.1, 6.5, 8.7, 5.9, 7.8, 7…
$ friendliness_score <dbl> 9.1, 3.4, 8.9, 2.1, 9.5, 3.8, 8.7, 2.9, 9.2, 3.1, 8…
$ behavior           <chr> "friendly", "mischievous", "friendly", "mischievous…
# How many creatures of each type do we have?
creatures %>%
  count(behavior)
A data.frame: 2 × 2

     behavior  n
1    friendly 51
2 mischievous 49

🎨 Visualizing Our Creatures#

Let's create a beautiful plot to see if we can spot any patterns!

# Let's visualize our creatures!
# Plot size vs magic_power, colored by behavior
ggplot(creatures, aes(x = size, y = magic_power, color = behavior)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Oda's Magical Creature Collection",
       x = "Size", 
       y = "Magic Power",
       color = "Behavior") +
  theme_minimal()

🤔 Think About It!#

💡 Question: Can you see any patterns? Do friendly creatures tend to cluster together?
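
If you'd rather answer with numbers than by eye, here's a quick group summary (not part of the original spell, just a sanity check) comparing the average measurements of friendly and mischievous creatures:

# Quick numeric check: average measurements per behavior group
creatures %>%
  group_by(behavior) %>%
  summarise(
    mean_size         = mean(size),
    mean_magic_power  = mean(magic_power),
    mean_friendliness = mean(friendliness_score)
  )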


โš™๏ธ Part 2: Setting Up Our KNN Model#

Now letโ€™s prepare our data for machine learning magic!

📊 Split the Data#

We need to split our data so we can train our model and then validate how well it works on "new" creatures.

# Split our data into training and validation sets
# We'll use 75% for training, 25% for validation (to choose best K)
set.seed(123)  # For reproducible results

creature_split <- initial_split(creatures, prop = 0.75, strata = behavior)
creature_train <- training(creature_split)
creature_validation <- testing(creature_split)
# Let's check our split worked well
cat("Training data:\n")
creature_train %>% count(behavior)

cat("\nValidation data:\n")
creature_validation %>% count(behavior)
Training data:
A data.frame: 2 × 2

     behavior  n
1    friendly 38
2 mischievous 36

Validation data:
A data.frame: 2 × 2

     behavior  n
1    friendly 13
2 mischievous 13
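
From the counts above, the training set holds 38 + 36 = 74 creatures and the validation set 13 + 13 = 26, which is as close to the requested 75/25 split as stratified sampling allows. You can confirm the proportions directly:

# Roughly 75% of the 100 creatures should land in the training set
nrow(creature_train) / nrow(creatures)        # 74 / 100 = 0.74
nrow(creature_validation) / nrow(creatures)   # 26 / 100 = 0.26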

๐Ÿณ Create a Recipe for Data Preprocessing#

In machine learning, we often need to โ€œcookโ€ our data before feeding it to the model. Hereโ€™s our recipe!

# Create a recipe for preprocessing our data
# We'll standardize size and magic_power to make sure they're on the same scale
creature_recipe <- recipe(behavior ~ size + magic_power, data = creature_train) %>%
  step_scale(all_predictors()) %>%     # Scale to standard deviation 1
  step_center(all_predictors())        # Center around 0

print("Our magical recipe is ready!")
[1] "Our magical recipe is ready!"

🤖 Create Our KNN Model#

Now let's create our KNN model. We'll start with K=5 neighbors; a small sketch after the code shows what those 5 neighbors actually do.

# Create our KNN model specification
# We'll start with K=5 neighbors
knn_model <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

print("KNN model created with K=5 neighbors!")
[1] "KNN model created with K=5 neighbors!"

🎓 Part 3: Training Our KNN Classifier#

Time to train our magical creature classifier!

# Create a workflow that combines our recipe and model
creature_workflow <- workflow() %>%
  add_recipe(creature_recipe) %>%
  add_model(knn_model)

print("Workflow assembled - ready for training!")
[1] "Workflow assembled - ready for training!"
# Train our model!
creature_fit <- creature_workflow %>%
  fit(data = creature_train)

print("๐ŸŽ‰ Model training complete!")
[1] "๐ŸŽ‰ Model training complete!"
# Let's make predictions on our validation data
creature_predictions <- predict(creature_fit, creature_validation) %>%
  bind_cols(creature_validation)

# Look at our predictions
head(creature_predictions)
A tibble: 6 × 7

  .pred_class creature_id              name size magic_power friendliness_score    behavior
1 mischievous           1   Sparkle Unicorn  7.2         8.5                9.1    friendly
2    friendly           2       Shadow Wolf  6.8         7.2                3.4 mischievous
3 mischievous           3 Rainbow Butterfly  1.5         6.8                8.9    friendly
4 mischievous           6       Storm Raven  4.2         8.1                3.8 mischievous
5 mischievous          11       Crystal Owl  3.2         7.1                8.4    friendly
6 mischievous          20          Cave Bat  2.9         7.3                3.5 mischievous

📈 Part 4: Evaluating Our Magical Predictions#

How well did our model do? Let's find out!

🎯 Calculate Accuracy#

# Calculate the accuracy of our predictions
# Convert to factors for proper classification metrics
creature_predictions_factor <- creature_predictions %>%
  mutate(
    behavior = as.factor(behavior),
    .pred_class = as.factor(.pred_class)
  )

accuracy_result <- creature_predictions_factor %>%
  accuracy(truth = behavior, estimate = .pred_class)

print(paste("Our KNN model accuracy:", round(accuracy_result$.estimate, 3)))
[1] "Our KNN model accuracy: 0.731"

📊 Confusion Matrix#

A confusion matrix shows us exactly where our model made mistakes.

# Create a confusion matrix to see where we made mistakes
creature_predictions_factor %>%
  conf_mat(truth = behavior, estimate = .pred_class)
             Truth
Prediction    friendly mischievous
  friendly          10           4
  mischievous        3           9
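
As a sanity check, the accuracy reported above can be recovered straight from this matrix: correct predictions sit on the diagonal.

# 10 + 9 = 19 correct predictions out of 26 validation creatures
(10 + 9) / (10 + 4 + 3 + 9)   # = 19 / 26 ≈ 0.731, matching accuracy() above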

🎨 Visualize Predictions vs Reality#

# Visualize our predictions vs actual behavior
ggplot(creature_predictions, aes(x = size, y = magic_power)) +
  geom_point(aes(color = behavior, shape = .pred_class), size = 3, alpha = 0.8) +
  labs(title = "Actual vs Predicted Creature Behavior",
       subtitle = "Color = Actual, Shape = Predicted",
       x = "Size", 
       y = "Magic Power",
       color = "Actual Behavior",
       shape = "Predicted Behavior") +
  theme_minimal()

🔧 Part 5: Testing Different K Values#

Different K values can give us different results. Let's find the best one! (A cross-validation alternative is sketched at the end of this part.)

# Let's try different K values to see which works best!
test_k_values <- function(k_val) {
  knn_model_k <- nearest_neighbor(neighbors = k_val) %>%
    set_engine("kknn") %>%
    set_mode("classification")
  
  workflow_k <- workflow() %>%
    add_recipe(creature_recipe) %>%
    add_model(knn_model_k)
  
  fit_k <- workflow_k %>% fit(data = creature_train)
  
  predictions_k <- predict(fit_k, creature_validation) %>%
    bind_cols(creature_validation)
  
  accuracy_k <- predictions_k %>%
    mutate(
      behavior = as.factor(behavior),
      .pred_class = as.factor(.pred_class)
    ) %>%
    accuracy(truth = behavior, estimate = .pred_class) %>%
    pull(.estimate)
  
  return(accuracy_k)
}
# Test different K values
k_values <- c(1, 3, 5, 7, 10)
accuracies <- map_dbl(k_values, test_k_values)

# Create a data frame with results
k_results <- tibble(K = k_values, Accuracy = accuracies)

print("Results for different K values:")
print(k_results)
[1] "Results for different K values:"
# A tibble: 5 × 2
      K Accuracy
  <dbl>    <dbl>
1     1    0.654
2     3    0.654
3     5    0.731
4     7    0.731
5    10    0.769
# Plot K vs Accuracy
ggplot(k_results, aes(x = K, y = Accuracy)) +
  geom_line(color = "orange", linewidth = 1) +
  geom_point(color = "#73bbda", size = 3) +
  labs(title = "KNN Performance: How K Affects Accuracy",
       x = "Number of Neighbors (K)",
       y = "Accuracy") +
  theme_minimal()
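
Note that we picked K using a single validation split, so these accuracies are a bit noisy. A more robust (and more idiomatic tidymodels) alternative is to tune K with cross-validation; here's a sketch, assuming the same recipe and training data as above:

# Sketch: tune the number of neighbors with 5-fold cross-validation
knn_tune_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

tune_workflow <- workflow() %>%
  add_recipe(creature_recipe) %>%
  add_model(knn_tune_spec)

creature_folds <- vfold_cv(creature_train, v = 5, strata = behavior)

knn_tune_results <- tune_grid(
  tune_workflow,
  resamples = creature_folds,
  grid = tibble(neighbors = c(1, 3, 5, 7, 10))
)

# Average accuracy across folds for each candidate K
collect_metrics(knn_tune_results) %>%
  filter(.metric == "accuracy")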

🔮 Part 6: Predict New Magical Creatures!#

Oda has discovered some new creatures! Can you predict their behavior?

# Oda has discovered some new creatures! Can you predict their behavior?
new_creatures <- tibble(
  name = c("Sparkle Dragon", "Tiny Pixie", "Giant Troll"),
  size = c(8.5, 1.2, 9.8),
  magic_power = c(9.1, 7.8, 3.2)
)

print("New creatures discovered:")
print(new_creatures)
[1] "New creatures discovered:"
# A tibble: 3 × 3
  name            size magic_power
  <chr>          <dbl>       <dbl>
1 Sparkle Dragon   8.5         9.1
2 Tiny Pixie       1.2         7.8
3 Giant Troll      9.8         3.2
# Make predictions for the new creatures
# Use the best K value from your analysis above
best_k <- k_results %>%
  filter(Accuracy == max(Accuracy)) %>%
  pull(K) %>%
  first()

print(paste("Best K value:", best_k))

# Retrain model with best K
best_knn_model <- nearest_neighbor(neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("classification")

best_workflow <- workflow() %>%
  add_recipe(creature_recipe) %>%
  add_model(best_knn_model)

best_fit <- best_workflow %>% fit(data = creature_train)

# Predict behavior for new creatures
new_predictions <- predict(best_fit, new_creatures) %>%
  bind_cols(new_creatures)

print("Predictions for new magical creatures:")
print(new_predictions)
[1] "Best K value: 10"
[1] "Predictions for new magical creatures:"
# A tibble: 3 × 4
  .pred_class name            size magic_power
  <fct>       <chr>          <dbl>       <dbl>
1 mischievous Sparkle Dragon   8.5         9.1
2 mischievous Tiny Pixie       1.2         7.8
3 friendly    Giant Troll      9.8         3.2

🧪 Part 7: Final Model Testing with Independent Test Set#

Now let's test our final model on completely unseen data to get an unbiased performance estimate!

# Load the independent test dataset
# This data was never used for training or choosing K
test_creatures <- read.csv("../datasets/magical_creatures_test.csv")

print("Independent test dataset loaded:")
print(paste("Number of test creatures:", nrow(test_creatures)))
test_creatures %>% count(behavior)
[1] "Independent test dataset loaded:"
[1] "Number of test creatures: 36"
A data.frame: 2 × 2

     behavior  n
1    friendly 18
2 mischievous 18
# Use our best model (with optimal K) to make predictions on test set
final_test_predictions <- predict(best_fit, test_creatures) %>%
  bind_cols(test_creatures)

# Calculate final test accuracy
final_test_accuracy <- final_test_predictions %>%
  mutate(
    behavior = as.factor(behavior),
    .pred_class = as.factor(.pred_class)
  ) %>%
  accuracy(truth = behavior, estimate = .pred_class)

print(paste("๐ŸŽฏ Final Test Accuracy (K =", best_k, "):", round(final_test_accuracy$.estimate, 3)))
[1] "๐ŸŽฏ Final Test Accuracy (K = 10 ): 0.889"
# Final confusion matrix on test set
print("๐Ÿ” Final Confusion Matrix on Test Set:")
final_test_predictions %>%
  mutate(
    behavior = as.factor(behavior),
    .pred_class = as.factor(.pred_class)
  ) %>%
  conf_mat(truth = behavior, estimate = .pred_class)
[1] "๐Ÿ” Final Confusion Matrix on Test Set:"
             Truth
Prediction    friendly mischievous
  friendly          15           1
  mischievous        3          17
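
Once again, the test accuracy follows directly from the diagonal of this matrix:

# 15 + 17 = 32 correct predictions out of 36 test creatures
(15 + 17) / (15 + 1 + 3 + 17)   # = 32 / 36 ≈ 0.889, matching the accuracy above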
# Visualize final test results
ggplot(final_test_predictions, aes(x = size, y = magic_power)) +
  geom_point(aes(color = behavior, shape = .pred_class), size = 3, alpha = 0.8) +
  labs(title = "Final Model Performance on Independent Test Set",
       subtitle = paste("Best K =", best_k, "| Test Accuracy =", round(final_test_accuracy$.estimate, 3)),
       x = "Size", 
       y = "Magic Power",
       color = "True Behavior",
       shape = "Predicted Behavior") +
  theme_minimal()