DSCI 522 Lecture 7

Data Analysis Pipeline and GNU Make

Sky Sheng

🐛 Recap: bug of last week

services:
  # run jupyter notebook inside a container
  jupyter-notebook:
    image: skysheng7/dsci522:681bccb
    ports:
      - "8888:8888"
    volumes:
      - .:/home/jovyan
    deploy:
      resources:
        limits:
          memory: 5G
    platform: linux/amd64

Data Analysis Pipeline

Image generated by OpenAI GPT-5

What is a data analysis pipeline?

  • A pipeline is a sequence of data processing steps
  • Each step takes input and produces output
  • Steps are connected: output of one step becomes input to the next
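The idea can be sketched with nothing but standard shell tools (the file names here are made up for illustration): step 1's output file is step 2's input file.

```shell
# Step 1: turn raw text into per-word counts (output: counts.dat)
echo "the quick brown fox jumps over the lazy dog" > input.txt
tr ' ' '\n' < input.txt | sort | uniq -c | sort -rn > counts.dat

# Step 2: read step 1's output and keep only the most frequent word
head -n 1 counts.dat > top_word.dat
cat top_word.dat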

Shell script demo

  1. Go to this repository and click the green “Use this template” button to get your own copy of the data_analysis_pipeline_practice repository.

  2. Clone the repository and cd into the folder.

git clone <repository_url>
cd <repository_name>
  1. Install and ativate the conda environment (linux-64, osx-arm64, osx-64, win-64 OS)
conda-lock install --name da-pipeline-make conda-lock.yml
conda activate da-pipeline-make

Alternative way

  1. Create a docker-compose.yml file locally in your repository.

  2. Copy and paste the following code into the docker-compose.yml file.

services:
  # run jupyter notebook inside jupyter 
  jupyter-notebook:
    image: skysheng7/dsci522:681bccb
    ports:
      - "8888:8888"
    volumes:
      - .:/home/jovyan
    deploy:
      resources:
        limits:
          memory: 5G
    platform: linux/amd64
  1. Run the command in your terminal:
docker compose up

Command line non-interactive scripts 📗

  1. Open your terminal and run this python script, it reads in a text file, counts the words in this text file, and outputs a data file
python scripts/wordcount.py \
    --input_file=data/isles.txt \
    --output_file=results/isles.dat
  1. Check out the first few rows of the output data file:
head results/isles.dat
  1. Now we have another script that reads in a data file and save a plot of the 10 most frequently occurring words:
python scripts/plotcount.py \
    --input_file=results/isles.dat \
    --output_file=results/figure/isles.png

Pipeline

scripts/wordcount.py:

  1. Read a data file.
  2. Perform an analysis on this data file.
  3. Write the analysis results to a new file.

scripts/plotcount.py:

  1. Plot a graph of the analysis results

Create a shell script to run the pipeline

  1. Create a shell script called run_pip.sh at root directory
# perform wordcout on novels
python scripts/wordcount.py \
    --input_file=data/isles.txt \
    --output_file=results/isles.dat

# create plots
python scripts/plotcount.py \
    --input_file=results/isles.dat \
    --output_file=results/figure/isles.png
  1. In your terminal, run:
bash run_pip.sh

Complete shell script to run all 4 books

# run_all.sh
# Tiffany Timbers, Nov 2018

# This driver script completes the textual analysis of
# 3 novels and creates figures on the 10 most frequently
# occuring words from each of the 3 novels. This script
# takes no arguments.

# example usage:
# bash run_all.sh

# count the words
python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat
python scripts/wordcount.py --input_file=data/abyss.txt --output_file=results/abyss.dat
python scripts/wordcount.py --input_file=data/last.txt --output_file=results/last.dat
python scripts/wordcount.py --input_file=data/sierra.txt --output_file=results/sierra.dat

# create the plots
python scripts/plotcount.py --input_file=results/isles.dat --output_file=results/figure/isles.png
python scripts/plotcount.py --input_file=results/abyss.dat --output_file=results/figure/abyss.png
python scripts/plotcount.py --input_file=results/last.dat --output_file=results/figure/last.png
python scripts/plotcount.py --input_file=results/sierra.dat --output_file=results/figure/sierra.png

# write the report
quarto render report/count_report.qmd

Complete pipeline with a report:

scripts/wordcount.py:

  1. Read a data file.
  2. Perform an analysis on this data file.
  3. Write the analysis results to a new file.

scripts/plotcount.py:

  1. Plot a graph of the analysis results

quarto render

  1. Use the plots we generated to create a report

Your milestone 3: Tiffany’s example 🎯

# download the data
python scripts/download_data.py \
    --url="https://archive.ics.uci.edu/static/public/15/breast+cancer+wisconsin+original.zip" \
    --write-to=data/raw

# split and preprocess the data
python scripts/split_n_preprocess.py \
    --raw-data=data/raw/wdbc.data \
    --data-to=data/processed \
    --preprocessor-to=results/models \
    --seed=522

# perform exploratory data analysis
python scripts/eda.py \
    --processed-training-data=data/processed/scaled_cancer_train.csv \
    --plot-to=results/figures

# fit the breast cancer classifier
python scripts/fit_breast_cancer_classifier.py \
    --training-data=data/processed/cancer_train.csv \
    --preprocessor=results/models/cancer_preprocessor.pickle \
    --columns-to-drop=data/processed/columns_to_drop.csv \
    --pipeline-to=results/models \
    --plot-to=results/figures \
    --seed=523

# evaluate the breast cancer classifier
python scripts/evaluate_breast_cancer_predictor.py \
    --scaled-test-data=data/processed/cancer_test.csv \
    --pipeline-from=results/models/cancer_pipeline.pickle \
    --results-to=results/tables \
    --seed=524

# render the report
quarto render report/breast_cancer_predictor_report.qmd --to html
quarto render report/breast_cancer_predictor_report.qmd --to pdf

Bash script VS run python script directly from terminal 🤔

What are some limitations of bash shell script? 😵

  • Manually deleting generated files is time-consuming
  • Runs all steps every time, even when only small parts changed

Makefile saves the day! 🤩

When do you NOT want to make clean to delete all files?

  • When the data you generated can not be re-created easilly, e.g., the data was generated interviewing 1000 people.
  • When the data you generated cost money 💰
  • Be careful with rm -rf command!!
    • 🤖 Once upon a time, a student was talking to their favorite AI agent on auto-run mode, and said:
    • “Can you delete this for me please ~” 😊

Never do this: rm -rf ~

Image generated by OpenAI GPT-5

🤨 Are you tired of running…

conda export --from-history > environment.yml
python update_enviroment_yml.py --root_dir="." --env_name="ai_env" --yml_name="environment.yml"
conda-lock -k explicit --file environment.yml -p linux-64

ME TOO! 🤡

Makefile to the rescue!

👉 So I automated it using Makefile & GitHub Actions for my own work:

Docker, your best friend in the cloud!

🐳 Docker is actually very helpful in real-world data science projects, especially when using cloud computing resources!

Instructions

  1. Clone and cd into this repository
git clone https://github.com/skysheng7/awp-arbutus-login.git
cd awp-arbutus-login
  1. Install the dependencies using environment.yml file
conda env create -n arbutus
  1. Activate the environment
conda activate arbutus

Instructions

  1. Install a new package
conda install click
  1. Run the makefile
make env