Data Analysis Pipeline and GNU Make
Image generated by OpenAI GPT-5
1. Go to this repository and click the green "Use this template" button to get your own copy of the `data_analysis_pipeline_practice` repository.
2. Clone the repository and `cd` into the folder.
3. Create a `docker-compose.yml` file at the root of your repository.
4. Copy and paste the following code into the `docker-compose.yml` file:
```yaml
services:
  # run Jupyter Notebook inside a container
  jupyter-notebook:
    image: skysheng7/dsci522:681bccb
    platform: linux/amd64
    ports:
      - "8888:8888"
    volumes:
      - .:/home/jovyan
    deploy:
      resources:
        limits:
          memory: 5G
```

The pipeline relies on two scripts: `scripts/wordcount.py` and `scripts/plotcount.py`.
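With the file saved, the container can be started from the repository root (assuming Docker Desktop or another Docker engine is running; the port mapping comes from the compose file above):

```shell
# pull the image (if needed) and start the Jupyter service
docker compose up

# then open http://localhost:8888 in a browser,
# using the token printed in the container logs
```

When you are done, press Ctrl+C and run `docker compose down` to stop and remove the container.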
`run_pip.sh` at the root directory:

```shell
# perform wordcount on the novels
python scripts/wordcount.py \
    --input_file=data/isles.txt \
    --output_file=results/isles.dat

# create the plots
python scripts/plotcount.py \
    --input_file=results/isles.dat \
    --output_file=results/figure/isles.png
```

`run_all.sh`:
```shell
# run_all.sh
# Tiffany Timbers, Nov 2018
# This driver script completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.
# example usage:
# bash run_all.sh

# count the words
python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat
python scripts/wordcount.py --input_file=data/abyss.txt --output_file=results/abyss.dat
python scripts/wordcount.py --input_file=data/last.txt --output_file=results/last.dat
python scripts/wordcount.py --input_file=data/sierra.txt --output_file=results/sierra.dat

# create the plots
python scripts/plotcount.py --input_file=results/isles.dat --output_file=results/figure/isles.png
python scripts/plotcount.py --input_file=results/abyss.dat --output_file=results/figure/abyss.png
python scripts/plotcount.py --input_file=results/last.dat --output_file=results/figure/last.png
python scripts/plotcount.py --input_file=results/sierra.dat --output_file=results/figure/sierra.png

# write the report
quarto render report/count_report.qmd
```
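Every run of `run_all.sh` redoes all the work, even when only one input changed. This is exactly what GNU Make is for: each rule names its output, its inputs, and the command connecting them, so `make` rebuilds only what is out of date. A minimal sketch for the word-count pipeline above (file paths come from the driver script; the `all`/`clean` targets and the pattern rules are my own arrangement, not necessarily the author's Makefile):

```make
# Makefile for the word-count pipeline (sketch)
all: report/count_report.html

# pattern rule: one .dat file per novel
results/%.dat: data/%.txt scripts/wordcount.py
	python scripts/wordcount.py --input_file=$< --output_file=$@

# pattern rule: one figure per .dat file
results/figure/%.png: results/%.dat scripts/plotcount.py
	python scripts/plotcount.py --input_file=$< --output_file=$@

# render the report once all figures exist
report/count_report.html: report/count_report.qmd \
  results/figure/isles.png results/figure/abyss.png \
  results/figure/last.png results/figure/sierra.png
	quarto render report/count_report.qmd

clean:
	rm -f results/*.dat results/figure/*.png report/count_report.html
```

After this, `make all` rebuilds only the out-of-date targets, and `make clean` removes only the files the pipeline generated, which is much safer than reaching for a broad `rm -rf`.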
The same pattern scales up to a real project. Here is the driver script from a breast cancer predictor analysis, which downloads the data, preprocesses it, fits and evaluates a classifier, and renders the report:
```shell
# download the data
python scripts/download_data.py \
    --url="https://archive.ics.uci.edu/static/public/15/breast+cancer+wisconsin+original.zip" \
    --write-to=data/raw

# split and preprocess the data
python scripts/split_n_preprocess.py \
    --raw-data=data/raw/wdbc.data \
    --data-to=data/processed \
    --preprocessor-to=results/models \
    --seed=522

# perform exploratory data analysis
python scripts/eda.py \
    --processed-training-data=data/processed/scaled_cancer_train.csv \
    --plot-to=results/figures

# fit the breast cancer classifier
python scripts/fit_breast_cancer_classifier.py \
    --training-data=data/processed/cancer_train.csv \
    --preprocessor=results/models/cancer_preprocessor.pickle \
    --columns-to-drop=data/processed/columns_to_drop.csv \
    --pipeline-to=results/models \
    --plot-to=results/figures \
    --seed=523

# evaluate the breast cancer classifier
python scripts/evaluate_breast_cancer_predictor.py \
    --scaled-test-data=data/processed/cancer_test.csv \
    --pipeline-from=results/models/cancer_pipeline.pickle \
    --results-to=results/tables \
    --seed=524

# render the report
quarto render report/breast_cancer_predictor_report.qmd --to html
quarto render report/breast_cancer_predictor_report.qmd --to pdf
```

What about the `environment.yml` file? What if you want `make clean` to delete all generated files? Be very careful with the `rm -rf` command!!
“Can you delete this for me please ~” 😊
`rm -rf ~`
ME TOO! 🤡
👉 So I automated it using Makefile & GitHub Actions for my own work:
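As an illustration only (the workflow name, trigger, and job layout here are my assumptions, not the author's actual setup), a GitHub Actions workflow that reruns the Make pipeline on every push could look roughly like this:

```yaml
# .github/workflows/pipeline.yml (hypothetical sketch)
name: run-pipeline
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    container:
      image: skysheng7/dsci522:681bccb   # same image as in docker-compose.yml
    steps:
      - uses: actions/checkout@v4
      - name: Rebuild out-of-date targets
        run: make all
```

Running the job inside the same container image used locally keeps the CI environment consistent with the development environment.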
Docker, your best friend in the cloud!
🐳 Docker is actually very helpful in real-world data science projects, especially when using cloud computing resources!
A Docker image ships the entire computational environment, going beyond what an `environment.yml` file alone can pin down.
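Putting the pieces together, one way (my sketch, not the author's exact commands) to run a Make-based pipeline inside the container defined in `docker-compose.yml`:

```shell
# run `make all` in a one-off container for the jupyter-notebook service,
# removing the container when the command finishes
docker compose run --rm jupyter-notebook make all
```

Because the compose file mounts the repository at `/home/jovyan`, the targets that `make` builds land directly in your local `results/` folder.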