Hacking the Data Science Radar with Data Science

04 Jul 16

This post was first published by its original author, Duncan Garmonsway, and is reproduced with his kind permission.

This post reverse-engineers the Mango Solutions Data Science Radar using:

Programming (R)

Why hack? Because getting at the innards also reveals:

What a good score is in each category

Which statements are most important

Whether scores are comparable across people

Whether you should strongly agree with the statement “On average, I spend at least 25% of my time manipulating data into analysis-ready formats”

The radar

Based on Likert-style responses to 24 provocative statements, the Data Science Radar visualises your skills along six axes, the “core attributes of a contemporary ‘Data Scientist’.” It looks like this.

[Figure: Mango Solutions Data Science Radar]

First attempt: Multivariate multiple regression

How can we score better? Hacking the URL would be cheating, so instead, let’s use science: hypothesise -> test -> improve. Here are some initial guesses.

Each of the 24 statements relates to exactly one attribute, i.e. four statements per attribute.

The Likert values (strongly agree, agree, somewhat agree, etc.) are coded from 1 to 7 (since there are seven points on each axis).

There is a linear relationship between the coded agreement with the statements, and the attributes.

So something like

$$\text{score}_{\text{attribute}} = \frac{1}{4} \sum_{i = 1}^{4} \text{answer}_i$$

where $\text{answer}_i \in \{1, 2, \dots, 7\}$, encoding “Strongly disagree” as 1 up to “Strongly agree” as 7, and the sum includes only the four answers relevant to each attribute. The best-possible set of answers would score 7 on every axis, and the worst set would score 1.
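As a minimal sketch of this hypothesised scoring rule, the R snippet below encodes Likert labels as 1 to 7 and averages four of them. The intermediate labels (for example "Neutral") and the helper names `encode_likert` and `attribute_score` are assumptions made for illustration; they are not taken from the radar itself.

```r
# Assumed seven-point scale; only the end points and a few labels are
# named in the post, the rest are guesses for illustration.
likert_levels <- c("Strongly disagree", "Disagree", "Somewhat disagree",
                   "Neutral", "Somewhat agree", "Agree", "Strongly agree")

# Encode a vector of Likert labels as integers 1..7.
encode_likert <- function(labels) {
  as.integer(factor(labels, levels = likert_levels))
}

# Hypothesised score for one attribute: the mean of its four coded answers.
attribute_score <- function(labels) {
  stopifnot(length(labels) == 4)
  mean(encode_likert(labels))
}

best  <- attribute_score(rep("Strongly agree", 4))     # 7
worst <- attribute_score(rep("Strongly disagree", 4))  # 1
# (6 + 4 + 5 + 2) / 4 = 4.25
mixed <- attribute_score(c("Agree", "Neutral", "Somewhat agree", "Disagree"))
```

Under these assumptions the extremes reproduce the best and worst scores of 7 and 1 described above.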

If the hypotheses are correct, then all we need to do to prove them is to record 24 sets of random answers, the resulting scores, and fit a multivariate linear model. We’d expect each score (outcome variable) to have four non-zero coefficients (out of the 24 input variables). Let’s try it.
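To see what a confirming result would look like, here is a small simulation sketch. It generates random coded answers, computes scores under the hypothesised model, and checks that a multivariate linear model recovers four non-zero coefficients per attribute. The block mapping of questions to attributes (questions 1 to 4 feed the first attribute, and so on) is purely an assumption, and the sketch draws 40 sets rather than 24 only so that the simulated fit has spare degrees of freedom for the intercept and all 24 coefficients.

```r
set.seed(1)

n_sets <- 40        # simulated sets of answers (an arbitrary choice)
n_questions <- 24
attributes <- c("Communicator", "Data Wrangler", "Modeller",
                "Programmer", "Technologist", "Visualiser")

# Random coded answers (1..7), one row per set of answers.
answers <- matrix(sample(1:7, n_sets * n_questions, replace = TRUE),
                  nrow = n_sets,
                  dimnames = list(NULL, sprintf("q%02d", 1:n_questions)))

# Assumed mapping: each attribute is the mean of one block of four questions.
weights <- matrix(0, nrow = n_questions, ncol = length(attributes),
                  dimnames = list(colnames(answers), attributes))
for (j in seq_along(attributes)) {
  weights[(4 * j - 3):(4 * j), j] <- 1 / 4
}

# Scores under the hypothesised model, with no noise.
scores <- answers %*% weights

# A matrix response makes lm() fit one regression per attribute.
fit <- lm(scores ~ answers)
coefs <- coef(fit)[-1, ]  # drop the intercept row

# Number of coefficients that are not numerically zero, per attribute.
nonzero <- colSums(abs(coefs) > 1e-8)
nonzero
```

If the real radar followed the hypotheses, its fitted coefficients would show the same pattern: four values near 0.25 per attribute and the rest near zero.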

    # The first two aren’t random, but they’re still linearly independent of the
    # others, which is what matters.
    library(readr)  # provides read_csv()
    random_data <- read_csv("./data/radar-random.csv")
    # Multivariate regression: six attribute scores on all 24 answers
    lm1 <- lm(cbind(Communicator, `Data Wrangler`, Modeller, Programmer,
                    Technologist, Visualiser) ~ .,
              data = random_data)
    lm1
    ##
    ## Call:
    ## lm(formula = cbind(Communicator, `Data Wrangler`, Modeller, Programmer,
    ##     Technologist, Visualiser) ~ ., data = random_data)
    ##
    ## Coefficients:
    ##              Communicator  Data Wrangler  Modeller    Programmer  Technologist  Visualiser
    ## (Intercept)   2.060e+00     2.422e+00      3.247e+00   6.658e-01  -1.331e+00     1.456e+00
    ## q01           1.997e-01    -2.507e-02      2.602e-01  -1.103e-01  -5.866e-02    -7.103e-02
    ## q02          -2.571e-01     2.729e-02     -4.514e-01   2.090e-01   1.554e-01     1.281e-01
    ## q03           3.087e-01     1.744e-02     -3.471e-01  -1.303e-03   5.611e-02     1.978e-01
    ## q04           4.356e-01     8.534e-04     -8.676e-03  -2.346e-02  -7.130e-02    -4.193e-02
    ## q05          -2.524e-01     2.267e-01      8.732e-01  -1.559e-01  -1.907e-01    -3.885e-01
    ## q06          -1.948e-01     1.545e-01      7.016e-01  -7.626e-02  -1.271e-01    -3.897e-01
    ## q07          -7.925e-03     2.075e-01      4.423e-01  -1.089e-01  -2.015e-01    -2.247e-01
    ## q08           8.902e-02    -4.810e-01     -1.246e-02   8.111e-02  -5.556e-02    -4.572e-02
    ## q09           1.901e-01     5.174e-02     -5.260e-01  -9.428e-02   5.506e-02     2.620e-01
    ## q10           9.750e-02    -1.248e-02     -2.365e-01   3.181e-02   1.557e-01     3.267e-01
    ## q11          -2.099e-01    -5.220e-02      2.943e-01   2.032e-01   6.801e-02    -1.775e-01
    ## q12          -1.000e-01     1.813e-15      7.000e-01  -1.333e-01   9.653e-16    -1.000e-01
    ## q13           5.164e-02     2.647e-02     -3.386e-01   2.881e-01  -4.010e-03     1.428e-01
    ## q14           1.211e-01    -8.162e-02     -3.835e-02  -2.508e-01  -4.963e-02     7.972e-02
    ## q15           4.971e-03     5.