DSCI 522 Lecture 1

Introduction to Reproducible and Trustworthy Data Science Workflow

Sky Sheng

Welcome Back! 🎉

Hope you had a nice reading week!

Image generated by OpenAI GPT-5

👩‍🍳 Who loves cooking?

💥 Consequences of Untested Systems

Untested system for radiation therapy: Therac-25

Source: Cleveland Clinic

💥 Consequences of Untrustworthy and Non-reproducible Papers

Source: Timbers, T. A., Ostblom, J., D’Andrea, F., Lourenzutti, R., & Chen, D. Reproducible and Trustworthy Workflows for Data Science

👩‍💻 Data Science is…

Source: Timbers, T. A., Ostblom, J., D’Andrea, F., Lourenzutti, R., & Chen, D. Reproducible and Trustworthy Workflows for Data Science

💡 Reproducible analysis:

Source: Timbers, T. A., Ostblom, J., D’Andrea, F., Lourenzutti, R., & Chen, D. Reproducible and Trustworthy Workflows for Data Science

🧐 Auditable/transparent analysis:

Source: Timbers, T. A., Ostblom, J., D’Andrea, F., Lourenzutti, R., & Chen, D. Reproducible and Trustworthy Workflows for Data Science

🔥 Roast my repo!

https://shorturl.at/QSv7M

😟

  • It does not run anymore! 😱
  • It was saved only on my computer, no version control!
  • Different code chunks run on different computers with different operating systems
  • Different code chunks take in different data sources (stored in various locations on my computer)
  • No modular scripts/functions, hard to reuse
  • Not tested at all
  • Heavy cognitive load for new users (my poor biology colleagues)
  • The file name is long & not intuitive

😊

  • Well documented
  • Organized into different “chapters”
  • “It ran on my computer at that time”

What is version control?

💭 Share your story

  • Think of a non-reproducible or non-auditable workflow you have used before (at work, on a personal project, or in coursework) that negatively impacted your work, and write it down. Make sure your story says what the impact was.

  • Share your story in the Google Doc

  • Activity in textbook

✨ Clean, well documented and distributable code

Practice                   Benefit
Well-documented            Easily understandable
Modular code               Easy to reuse
Well-tested code           Reliable
Portable across machines   Easily accessible

Source: Ma, E. J. (2024, October 25). The Human Dimension to Clean, Distributable, and Documented Data Science Code.
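
As a small illustration of the first three practices in the table, here is a minimal sketch in Python. The function `celsius_to_fahrenheit` and its test are hypothetical examples made up for this slide, not code from the course repository: one documented, single-purpose function plus a test that checks it.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit.

    Hypothetical example function: small, documented, and reusable,
    so that other scripts can import it instead of copy-pasting code.
    """
    return celsius * 9 / 5 + 32


def test_celsius_to_fahrenheit():
    """Check the conversion against three well-known reference points."""
    assert celsius_to_fahrenheit(0) == 32      # freezing point of water
    assert celsius_to_fahrenheit(100) == 212   # boiling point of water
    assert celsius_to_fahrenheit(-40) == -40   # the two scales cross here


if __name__ == "__main__":
    test_celsius_to_fahrenheit()
    print("all checks passed")
```

The docstring covers "well-documented", the standalone function covers "modular", and the test function covers "well-tested". Portability is what the virtual environments later in the deck are for.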

Why do we need virtual environments?

🦷 Toothbrushes for different “projects”

Which toothbrush should I use for my oral hygiene??

📦 Organize tools into separate boxes

Organize programming languages & packages into separate virtual environments

🔧 Different virtual environment management tools

For R:

  • packrat
  • renv
  • conda

For Python:

  • venv
  • virtualenv
  • conda
  • mamba
  • poetry
  • pipenv

🤩 Let’s create our conda virtual environment!
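
One common way to describe a conda environment is an `environment.yml` file, which can be shared and version controlled with the project. The sketch below builds the text of such a file in Python. The helper `render_environment_yml`, the environment name `dsci522-demo`, and the listed packages are placeholders chosen for illustration, not the course's actual environment.

```python
def render_environment_yml(name, dependencies, channels=("conda-forge",)):
    """Return the text of a minimal conda environment.yml file.

    Hypothetical helper: name, channels, and dependencies are all
    supplied by the caller; nothing here is specific to this course.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder environment name and packages, for illustration only
    text = render_environment_yml("dsci522-demo", ["python=3.11", "pandas"])
    print(text)
```

After saving the printed text as `environment.yml`, the environment can typically be created with `conda env create --file environment.yml` and activated with `conda activate dsci522-demo`.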