Posts written by Mobilize Ops

How IDS Builds the Foundation for College Success

What if coding could solve your math homework? That’s not the kind of question you’d expect in a high school classroom, but it’s exactly what Introduction to Data Science (IDS) taught me to ask. My high school first offered IDS during my sophomore year. Although it was marketed as an alternative to calculus for upperclassmen, I eagerly enrolled despite being younger. Having already completed every computer science course at my school, I was thrilled by the chance to further develop my coding skills. What I discovered was a course unlike any other: one that inspired my UC admissions essay, shaped my academic confidence, and laid the groundwork for my success as a UCLA computer science student.

Contrasting IDS with traditional math and computer science courses makes its value to students clear. I took IDS alongside AP Statistics and AP Calculus BC, and I can confidently say IDS uniquely prepared me for college. Traditional math courses like calculus focus on textbook-driven learning: digesting information and applying formulas. While this approach is valuable and well supported in higher education, it contrasts sharply with IDS, which emphasizes real-world applications over rote memorization. Both methods are important, but IDS fills a prominent gap by equipping students with practical, transferable skills often overlooked in traditional curricula.

During my first quarter at UCLA, a professor asked our class to download a CSV file and import it into a Colab notebook containing pre-written code. Many students felt overwhelmed, whether by their first encounter with a CSV file or by the sight of code in the notebook. Thanks to IDS, I approached the task with confidence: I recognized the file format, understood the purpose of each code snippet, and appreciated the comments explaining the logic. Even basic knowledge, like the syntax of a comment, gave me an edge. The code was well documented, and I had the skills to navigate it. Learning these fundamentals in a low-pressure high school environment was a significant advantage over grappling with them for the first time in a crowded college lecture hall.

Unlike AP Computer Science, which often attracts students already interested in programming, IDS invites a broader audience by presenting coding as an accessible problem-solving tool. This distinction is crucial. Traditional computer science courses can inadvertently reinforce the misconception that programming is reserved for the technically inclined. IDS dismantles this notion by embedding coding within a math curriculum, reframing it as approachable and universally relevant. 

The course’s real-world applications make it particularly impactful. For high school students, “real-world” might mean something as relatable as analyzing Spotify Wrapped data. By grounding programming in familiar and engaging contexts, IDS demystifies it, fostering confidence and curiosity. It empowers students to view coding not as an intimidating abstraction but as a practical, essential skill they can use to achieve tangible outcomes, regardless of prior experience or career aspirations.

This accessibility has far-reaching implications. Coding is increasingly a baseline competency across disciplines, from life sciences to humanities. Yet many students struggle with programming requirements because they haven’t been exposed to it in a meaningful way. For example, my roommate, a neuroscience major, was tasked with using Python to run math modules in her introductory life sciences course. She didn’t even attempt the assignment before asking for my help, assuming it would be too challenging simply because it involved coding. IDS addresses this gap by normalizing coding as a universal skill. By integrating programming into familiar problem-solving contexts, it prepares students to tackle interdisciplinary challenges with confidence. 

Beyond technical skills, IDS imparts critical lessons about data itself. In today’s world, data often carries an aura of unquestioned authority, treated as synonymous with evidence. IDS challenges this perception by teaching students to critically evaluate data sources, question methodologies, and recognize biases embedded in datasets. These skills are essential across fields, empowering students to navigate the complexities of data-driven decision-making with awareness and discernment. 

For me, IDS served as a bridge between high school and college, turning abstract concepts into practical skills. It demystified data and coding, empowering me to approach both with confidence and curiosity. This foundation has been invaluable in my journey as a computer science major, where data literacy and programming skills are indispensable. Introduction to Data Science is more than just a class—it’s a transformative experience. By blending mathematics and computer science, it fills a crucial niche in education, equipping students with the tools to succeed in college and beyond. IDS demonstrates that data literacy isn’t a luxury but a necessity, empowering individuals to actively engage with the data they consume daily. It’s a class every student can benefit from, regardless of their career path.

About the Author

Amy Lloyd is a computer science student at UCLA with a strong passion for making data science education accessible to everyone. In high school, she served as president of the Computer Science Honors Society, where she launched outreach programs to teach middle school students about computer science. Amy is dedicated to integrating the student perspective into data science curricula, ensuring it is both engaging and relevant for all learners.

 

Discovering Patterns in Transactional Data

As a data scientist, one of the most common tasks you’ll encounter is finding patterns and relationships within large datasets. One powerful technique for doing so is Apriori analysis, which is particularly useful for market basket (shopping cart) analysis and for identifying frequent itemsets in transactional data.

Supermarket Market Basket Analysis
Let’s consider an example from a supermarket setting. Suppose we have a dataset of customer transactions in which each transaction represents a customer’s shopping basket containing various items. The goal of Apriori analysis in this context is to identify sets of items that are frequently purchased together. This information can be valuable for product placement, cross-selling strategies, and promotional campaigns. For instance, the Apriori algorithm might reveal that customers who purchase bread and butter also frequently purchase milk. From the frequent itemset {bread, butter, milk}, the algorithm can derive the association rule {bread, butter} → {milk}. By setting appropriate minimum support and confidence thresholds, the algorithm identifies such frequent itemsets and generates association rules like:
{bread, butter} → {milk} (support = 0.3, confidence = 0.8)

This rule suggests that 30% of transactions contain bread, butter, and milk, and 80% of customers who bought bread and butter also bought milk. Armed with these insights, the supermarket can strategically place milk near the bread and butter sections, run promotions bundling these items together, or recommend milk to customers who have bread and butter in their baskets.
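The arithmetic behind support and confidence is easy to check by hand. Here is a minimal sketch using a hypothetical five-transaction dataset (the item names match the example above, but the resulting numbers differ from the 0.3/0.8 figures, which were illustrative):

```python
# Hypothetical toy dataset: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"bread", "butter"}
consequent = {"milk"}

# Support of the rule: fraction of baskets with bread, butter, AND milk.
rule_support = support(antecedent | consequent)
# Confidence: of the baskets with bread and butter, what fraction have milk?
confidence = rule_support / support(antecedent)

print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
# → support = 0.60, confidence = 1.00
```

In this toy dataset, every one of the three baskets containing bread and butter also contains milk, so the confidence of {bread, butter} → {milk} is 1.0.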

The Apriori Algorithm
Apriori analysis is a data mining technique used to uncover interesting relationships or associations between variables in a dataset. It operates on the principle of frequent itemset mining, which involves identifying sets of items that frequently appear together in a given dataset. The name “Apriori” comes from the fact that the algorithm uses prior knowledge of frequent itemset properties to guide the search for larger itemsets. In other words, it leverages the fact that if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori algorithm operates in two main steps:

1. Frequent Itemset Generation: In this step, the algorithm identifies all itemsets that satisfy a minimum support threshold. Support is a measure of how frequently an itemset appears in the dataset.

2. Rule Generation: After identifying the frequent itemsets, the algorithm generates association rules that satisfy a minimum confidence threshold. Confidence is a measure of how likely the consequent is to occur given the antecedent.

The algorithm iteratively generates candidate itemsets of increasing length, prunes infrequent itemsets, and calculates their support until no more frequent itemsets can be found; confidence comes into play afterward, during rule generation.
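The two steps above can be sketched in plain Python. This is a minimal, unoptimized illustration (the transactions, item names, and thresholds are made up for the example), not a production implementation:

```python
from itertools import combinations

# Hypothetical transactions; item names and thresholds are illustrative.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent itemset generation, growing candidates one item at a time.
items = sorted({item for t in transactions for item in t})
frequent = {}  # frozenset -> support
level = [frozenset([i]) for i in items if support({i}) >= MIN_SUPPORT]
while level:
    frequent.update({c: support(c) for c in level})
    # Join frequent k-itemsets into (k+1)-candidates, keeping only those
    # whose every k-subset is frequent (the Apriori pruning step).
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates
             if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))
             and support(c) >= MIN_SUPPORT]

# Step 2: rule generation from each frequent itemset of size >= 2.
rules = []
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / frequent[antecedent]
            if confidence >= MIN_CONFIDENCE:
                rules.append((set(antecedent), set(itemset - antecedent),
                              sup, confidence))

for a, c, s, conf in sorted(rules, key=lambda rule: -rule[3]):
    print(f"{a} -> {c} (support={s:.2f}, confidence={conf:.2f})")
```

With these toy transactions, the itemset {bread, butter, milk} is frequent (support 0.4), and the rule {bread, butter} → {milk} appears among the output rules.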

Applications of Apriori Analysis
Apriori analysis has a wide range of applications, particularly in the following domains:
  • Market Basket Analysis: Identifying products that are frequently purchased together, which can inform product placement, cross-selling strategies, and promotional campaigns.
  • Web Usage Mining: Analyzing patterns in website clickstreams to understand user behavior and optimize website design and content.
  • Bioinformatics: Identifying co-occurring genes, proteins, or other biological entities that may be related or involved in similar processes.
  • Intrusion Detection: Identifying patterns of system calls or network traffic that may indicate malicious activity or security breaches.

Getting Started with Apriori Analysis
To get started with Apriori analysis, you’ll need a dataset containing transactional data or itemsets. Many programming languages and data mining libraries, such as R’s arules package or Python’s mlxtend, provide implementations of the Apriori algorithm. Once you have your dataset and library set up, you can specify the minimum support and confidence thresholds, run the algorithm, and analyze the resulting frequent itemsets and association rules.

Apriori analysis is a powerful tool for uncovering hidden patterns and relationships in data, and it’s a valuable addition to any data scientist’s toolkit. With its wide range of applications and relatively straightforward implementation, it is an excellent technique to explore and master.

Why students need to understand and work with data

  • Develops critical thinking and analytical skills – Analyzing data requires students to ask questions, identify patterns, draw conclusions, and make informed decisions based on evidence.
  • Promotes data literacy – As data becomes increasingly prevalent in our data-driven world, students need to be able to interpret and communicate data effectively. Data literacy empowers students to make sense of information and use it to support arguments or solve real-world problems.
  • Prepares students for a data-driven world – Data has permeated every industry and aspect of our lives. From healthcare and finance to marketing and education, data plays a pivotal role in driving decisions, which makes understanding and working with data increasingly important.

Try it!
If you would like to try doing this analysis, you can download the Online Retail dataset here: https://archive.ics.uci.edu/dataset/352/online+retail. The code reference is linked below.

Code and Concept Reference:
https://www.datacamp.com/tutorial/market-basket-analysis-r

About the author:
Kunal Sonalkar is a data scientist at Nordstrom, the fashion retail company. He leverages machine learning techniques to improve the search retrieval experience and provide personalized product recommendations to online customers. He holds a master’s degree in computer science and engineering from the University of Florida.

Data Science Pipelines

A topic that comes up fairly regularly amongst data science professionals is the idea of pipelines. And I can imagine that all of the casual talk about pipes and pipelines probably makes it seem like data scientists are something more akin to plumbers than anything else… which wouldn’t be the worst characterization of the job I’ve ever heard.

Oftentimes, the goal of a data scientist is to build pipelines which might, for example:

  • Format raw data into datasets which can be quickly combined together and used for modeling purposes.
  • Build, train, and test a variety of models to identify which ones are most promising.
  • Deploy, monitor, and continually update models which are used for decision-making purposes.
  • Combine the three pipelines above into a seamless source of information that is readily available for decision-making.
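To make the first three bullets concrete, here is a deliberately tiny sketch of a pipeline with made-up data and a trivial stand-in “model” (every name and number here is hypothetical; real pipelines would use proper tooling and real training):

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from an upstream system.
raw = [
    {"sqft": "850",  "price": "190000"},
    {"sqft": "1200", "price": "260000"},
    {"sqft": "1500", "price": "310000"},
    {"sqft": "2000", "price": "420000"},
]

def format_stage(records):
    """Stage 1: coerce raw strings into a clean, typed dataset."""
    return [(float(r["sqft"]), float(r["price"])) for r in records]

def train_stage(dataset):
    """Stage 2: fit a trivial price-per-square-foot 'model'
    (a stand-in for real model training)."""
    rate = mean(price / sqft for sqft, price in dataset)
    return lambda sqft: rate * sqft

def evaluate_stage(model, dataset):
    """Stage 3: score the model so it can be monitored over time."""
    return mean(abs(model(sqft) - price) for sqft, price in dataset)

# Chain the stages together, mirroring the bullets above.
dataset = format_stage(raw)
model = train_stage(dataset)
error = evaluate_stage(model, dataset)
print(f"mean absolute error: {error:,.0f}")
```

The point is less the particular stages than the shape: each stage takes the previous stage’s output, so the whole chain can be re-run, monitored, and updated as new data arrives.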

What’s common throughout this (definitely non-exhaustive) list of pipelines is that data science is a field about building and evaluating processes, and I think one obvious question this brings up is, “How do we prepare and train students or young/new data scientists to build these types of processes?” We prepare them by teaching them to be critical thinkers and consumers of data first.

Teaching critical data thinkers

Training students and/or new data scientists is a pipeline problem in and of itself. Ideally, we’d get students from diverse backgrounds and perspectives interested in the field, give them a sample of the field to drive their interest, and once we’ve “hooked” them, motivate them to acquire specialized training at universities or through online coding programs. Why then is learning to be a critical thinker with data so important? Because it’s the thought process which underlies all successful data science pipelines.

People who are trained to think critically about data will spend more time thinking about how data has been sourced, who might be represented in such data, as well as who might not be represented. These considerations can then guide how we assess the value of new data sources, decide how to format them, and choose how to represent that information in a dataset.

Teaching students to be critical with data also teaches them how to represent or summarize information so that it’s honestly and easily interpreted, skills that are entirely necessary when it comes to evaluating competing models, monitoring model performance over time, or even just justifying business decisions to non-data experts.

The IDS to DS pipeline

One of the things I have always loved about the Introduction to Data Science high school math curriculum is that critical thinking about data has always come first. Students get experience and exposure to data topics that are as relevant to data scientists today as they were when the curriculum was initially written. Then they get to experience, in an authentic and meaningful way, how data scientists apply these critical thinking skills by writing code.

Will students interested in a career in data science, at some point, need to learn lots of math, statistics, probability, and calculus? Without a doubt, just as students need more than one high school biology class if they want to become doctors. So, is getting a PhD in statistics a necessary step before we can start getting people interested in data science? Absolutely not. In fact, I would argue that giving students a glimpse of what lies at the end of a mathematics pipeline guides more students into the field than trying to piece together an existing pipeline which is already leaking.

Data science is a great career which has already benefited immensely from data scientists coming into the field with diverse backgrounds such as economics, physics, computer science, mathematics, and more. My hope is that, with courses like IDS, we’ll continue to bring in new data scientists with different views and perspectives as we continue to grow the field into the future.

About the author:

Dr. James Molyneux is a data scientist for Swyfft, LLC, where he specializes in evaluating/developing new data sources and building risk/underwriting models and workflows. He is also courtesy faculty in the Department of Mathematics at Oregon State University.