DSCI 522 Lecture 4

Data Validation with Pandera

Sky Sheng

Recap: Docker image on Docker Hub, then what?? 🤷‍♀️

Next steps after Docker image on Docker Hub

  1. To access the Docker image manually:
  • docker pull username/image:tag
  • Example: docker pull ttimbers/breast-cancer-predictor:3f8675c
  • Run the Docker image using the following command:
docker run \
--rm \
-it \
-p 8887:8888 \
ttimbers/breast-cancer-predictor:3f8675c

Next steps after Docker image on Docker Hub

  2. To access the Docker image using Docker Compose:
  • Create a docker-compose.yml file
  • docker compose up to start the container
  • docker compose rm to remove the stopped container
  3. Update your documentation (README, CONTRIBUTING, etc.) to explain how others should use your Docker image.

🀩 Let’s organize our complete workflow!

🔧 Local Development

  1. conda create & conda install
  2. Create environment.yml
  3. Add version numbers
  4. Generate conda-lock file
  5. Write Dockerfile
  6. Create docker-compose.yml
  7. Push to GitHub

☁️ CI/CD & Deployment

  8. Add GitHub Actions workflow yml file
  9. Add Docker Hub Personal Access Token (PAT)
  10. Configure GitHub Secrets
  11. Workflow runs (auto/manual)
  12. Check Docker Hub for new image
  13. Locally, docker pull new image
  14. Locally, command line docker run

🐳 New GitHub Actions Workflow yml file

  • Check out this new GitHub Actions workflow yml file Tiffany created
  • Now you can replace steps 13 + 14 (the local docker pull and docker run at the end of the workflow) with:
  • Run git pull locally to get the updated docker-compose.yml file
  • Run docker compose up locally to start the container
  • Press Ctrl+C to stop the container, then run docker compose rm locally to remove it

Let’s break down the GitHub Actions workflow!

🙋‍♀️ What is && \ in a Dockerfile?

# build on top of template of minimal notebook
FROM quay.io/jupyter/minimal-notebook:afe30f0c9ad8

# copy all conda environment dependencies
COPY conda-linux-64.lock /tmp/conda-linux-64.lock

# copy my local python package files to pip install in docker
COPY pyproject.toml /tmp/pyproject.toml
COPY src /tmp/src
COPY README.md /tmp/README.md

# conda install all the other packages
RUN mamba update --quiet --file /tmp/conda-linux-64.lock \
    && mamba clean --all -y -f \
    && fix-permissions "${CONDA_DIR}" \
    && fix-permissions "/home/${NB_USER}"

# install openai using pip because the openai package installed from conda has a bug
# also install my local AI_representation_bias_in_farming as a python package
# 2025-06-22: added gpt-image-1 to the list of models
RUN pip install openai==1.57.0 \
    && python -m pip install -e /tmp
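
A short answer to the question above: inside a RUN instruction, && chains shell commands so that each one runs only if the previous one succeeded, and a trailing \ continues the instruction onto the next line, so the whole chain is built as a single image layer. The sketch below is not part of the Dockerfile. It only reproduces the same shell behaviour from Python, assuming a POSIX shell is available; the helper name run_shell is hypothetical.

```python
import subprocess


def run_shell(command: str) -> tuple[int, str]:
    """Run a command string through the system shell; return (exit code, stdout)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode, result.stdout


# The backslash at the end of the first line joins the two lines into one
# command, and `&&` runs `echo second` only because `echo first` succeeded.
ok_code, ok_out = run_shell("echo first \\\n    && echo second")

# `false` exits with a non-zero status, so `echo second` is skipped and the
# whole chain reports failure.
fail_code, fail_out = run_shell("false \\\n    && echo second")

print(ok_code, repr(ok_out))
print(fail_code, repr(fail_out))
```

In the Dockerfile above this means that if mamba update fails, the clean-up and fix-permissions commands in the same RUN are skipped and the build stops at that instruction.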

Today: Data Validation!

Image generated by OpenAI GPT-5

Remember this roast my repo? 🔥

https://shorturl.at/QSv7M

Data validation in action

Check out this data cleaning script I created for the Moo4Feed R package

✅ Data validation checklist

  • Group Milestone 2: Make sure you check all the boxes!

💻 Let’s work with Pandera!

Image generated by OpenAI GPT-5

💬 Discuss with your group: How will you conduct data validation for your data to check all the boxes?