Mistakes that data science students make

This post is an informal summary of our SIGCSE'24 paper, "Investigating Student Mistakes in Introductory Data Science Programming". See the paper for more details. Thanks to Anjali Singh for leading this collaboration between Microsoft and University of Michigan!

Student mistakes have been studied many times over in the context of traditional CS1 and CS2 courses, but not in introductory data science (DS1) courses. To address this gap, we analyzed students' code submissions in a data science course at the University of Michigan. We randomly sampled from 542 notebook submissions by 151 students and qualitatively analyzed the sample to identify mistakes and inappropriate strategies.

A logical error occurs when a program runs without throwing an error but behaves differently than expected. One cause is misunderstanding the data, such as using the wrong columns or values. For example, one student used the column PDAT instead of P_NUMVRC, contradicting their own code comment. In another, a student incorrectly filtered on male and female instead of 1 and 2.
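The male/female filtering mistake is easy to reproduce. The sketch below uses plain Python and made-up rows: the column name SEX, the sample values, and the assumption that 1 stands for male are all hypothetical and are not taken from the course dataset. It only illustrates why this kind of mistake goes unnoticed: the wrong filter raises no error, it just matches nothing.

```python
# Hypothetical survey rows. The column name SEX and the 1/2 coding are
# assumptions for illustration, not the actual course data.
rows = [
    {"SEX": 1, "AGE": 3},
    {"SEX": 2, "AGE": 5},
    {"SEX": 1, "AGE": 4},
    {"SEX": 2, "AGE": 2},
]

# Mistake: the column stores numeric codes, so comparing against a text
# label matches no rows. Nothing fails; the result is silently empty.
wrong = [r for r in rows if r["SEX"] == "male"]

# Fix: filter on the code that the data dictionary assigns to the category.
MALE_CODE = 1  # assumed coding
right = [r for r in rows if r["SEX"] == MALE_CODE]

# A cheap sanity check: look at the values the column actually contains.
observed_values = sorted({r["SEX"] for r in rows})

print(len(wrong), len(right), observed_values)
```

Checking the distinct values in a column before filtering on it is usually enough to catch this class of error.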

Another cause is misunderstanding the problem statement: rounding values when the exact value is expected, dividing by the wrong value, or incorrectly ordering dictionary keys. Students also made mistakes when handling missing values. For example, when computing a correlation between two columns, one student replaced all missing values in both columns with 0, which adds artificial data points that can distort the correlation.
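To see why filling missing values with 0 is risky before a correlation, here is a small sketch with made-up numbers. The data, the helper name pearson, and the choice to drop incomplete pairs as the alternative are all assumptions for illustration; the right way to handle missing values depends on the assignment and the data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up columns; None marks a missing value.
x = [10.0, 12.0, None, 16.0, 18.0, None]
y = [20.0, None, 28.0, 32.0, 36.0, 24.0]

# Mistake: replace every missing value with 0 in both columns.
x_filled = [0.0 if v is None else v for v in x]
y_filled = [0.0 if v is None else v for v in y]
r_filled = pearson(x_filled, y_filled)

# One alternative: keep only the rows where both values are present.
pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
r_complete = pearson([a for a, _ in pairs], [b for _, b in pairs])

print(f"zero-filled: {r_filled:.2f}, complete pairs: {r_complete:.2f}")
```

In this toy data the complete pairs lie exactly on a line, yet the zero-filled version reports a much weaker correlation, because each inserted 0 acts as a real observation far from the rest of the data.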
