Hacking the Data Science Radar with Data Science

04 Jul 16
This post was first published by its original author, Duncan Garmonsway, and is reproduced with his kind permission.
This post reverse-engineers the Mango Solutions Data Science Radar using R.
Why hack? Because getting at the innards also reveals:
What a good score is in each category
Which statements are most important
Whether scores are comparable across people
Whether you should strongly agree with the statement “On average, I spend at least 25% of my time manipulating data into analysis-ready formats”
The radar
Based on Likert-style responses to 24 provocative statements, the Data Science Radar visualises your skills along six axes, the “core attributes of a contemporary ‘Data Scientist’.” It looks like this.
Mango Solutions Data Science Radar
First attempt: Multivariate multiple regression
How can we score better? Hacking the URL would be cheating, so instead let's use science: hypothesise -> test -> improve. Here are some initial guesses.
Each of the 24 statements relates to exactly one attribute, i.e. four statements per attribute.
The Likert values (strongly agree, agree, somewhat agree, etc.) are coded from 1 to 7 (since there are seven points on each axis).
There is a linear relationship between the coded agreement with the statements, and the attributes.
So something like
$$\text{score}_{\text{attribute}} = \frac{1}{4} \sum_{i = 1}^{4} \text{answer}_i$$
where answer_i = 1, 2, …, 7, encoding "Strongly disagree" as 1 up to "Strongly agree" as 7, and including only the four relevant answers per attribute. The best-possible set of answers would score 7 on every axis, and the worst set would score 1.
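This hypothesised scoring rule is easy to sketch in R. Note that the Likert labels and the question-to-attribute grouping below are my guesses for illustration, not Mango's actual implementation:

```r
# Hypothesised scoring: code the 7-point Likert scale as 1..7 and
# average the four answers assumed to map to each attribute.
likert_levels <- c("Strongly disagree", "Disagree", "Somewhat disagree",
                   "Neither agree nor disagree", "Somewhat agree",
                   "Agree", "Strongly agree")

code_answer <- function(answer) match(answer, likert_levels)

# Hypothetical: these four answers all load on a single attribute.
attribute_score <- function(answers) mean(code_answer(answers))

attribute_score(c("Strongly agree", "Agree", "Somewhat agree", "Strongly agree"))
# (7 + 6 + 5 + 7) / 4 = 6.25
```

Answering "Strongly agree" to all four statements for an attribute would give the maximum score of 7 on that axis; "Strongly disagree" to all four would give the minimum of 1.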
If the hypotheses are correct, then all we need to do to prove them is to record 24 sets of random answers, the resulting scores, and fit a multivariate linear model. We’d expect each score (outcome variable) to have four non-zero coefficients (out of the 24 input variables). Let’s try it.
# The first two aren't random, but they're still linearly independent of the
# others, which is what matters.
random_data <- read_csv("./data/radar-random.csv")
lm1 <- lm(cbind(Communicator, `Data Wrangler`, Modeller, Programmer,
                Technologist, Visualiser) ~ ., data = random_data)
lm1

##
## Call:
## lm(formula = cbind(Communicator, `Data Wrangler`, Modeller, Programmer,
##     Technologist, Visualiser) ~ ., data = random_data)
##
## Coefficients:
##              Communicator  Data Wrangler  Modeller    Programmer  Technologist  Visualiser
## (Intercept)   2.060e+00     2.422e+00     3.247e+00   6.658e-01  -1.331e+00     1.456e+00
## q01           1.997e-01    -2.507e-02     2.602e-01  -1.103e-01  -5.866e-02    -7.103e-02
## q02          -2.571e-01     2.729e-02    -4.514e-01   2.090e-01   1.554e-01     1.281e-01
## q03           3.087e-01     1.744e-02    -3.471e-01  -1.303e-03   5.611e-02     1.978e-01
## q04           4.356e-01     8.534e-04    -8.676e-03  -2.346e-02  -7.130e-02    -4.193e-02
## q05          -2.524e-01     2.267e-01     8.732e-01  -1.559e-01  -1.907e-01    -3.885e-01
## q06          -1.948e-01     1.545e-01     7.016e-01  -7.626e-02  -1.271e-01    -3.897e-01
## q07          -7.925e-03     2.075e-01     4.423e-01  -1.089e-01  -2.015e-01    -2.247e-01
## q08           8.902e-02    -4.810e-01    -1.246e-02   8.111e-02  -5.556e-02    -4.572e-02
## q09           1.901e-01     5.174e-02    -5.260e-01  -9.428e-02   5.506e-02     2.620e-01
## q10           9.750e-02    -1.248e-02    -2.365e-01   3.181e-02   1.557e-01     3.267e-01
## q11          -2.099e-01    -5.220e-02     2.943e-01   2.032e-01   6.801e-02    -1.775e-01
## q12          -1.000e-01     1.813e-15     7.000e-01  -1.333e-01   9.653e-16    -1.000e-01
## q13           5.164e-02     2.647e-02    -3.386e-01   2.881e-01  -4.010e-03     1.428e-01
## q14           1.211e-01    -8.162e-02    -3.835e-02  -2.508e-01  -4.963e-02     7.972e-02
## q15           4.971e-03     5.
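If the hypothesis were right, the four statements for each attribute could be read straight off the fitted coefficients: four large coefficients per column, the rest near zero. The recovery idea can be checked on simulated data (this is my own self-contained sketch, not part of the original analysis; the two attributes and eight questions here are invented):

```r
# Self-contained check of the recovery idea: generate random answers,
# score them with a known four-question-per-attribute rule, then see
# whether lm() recovers which questions drive each attribute.
set.seed(1)
answers <- as.data.frame(matrix(sample(1:7, 30 * 8, replace = TRUE),
                                nrow = 30,
                                dimnames = list(NULL, paste0("q", 1:8))))

# True (simulated) rule: attribute A averages q1-q4, B averages q5-q8.
answers$A <- rowMeans(answers[, 1:4])
answers$B <- rowMeans(answers[, 5:8])

fit <- lm(cbind(A, B) ~ ., data = answers)
coefs <- coef(fit)[-1, ]  # drop the intercept row

# The four largest absolute coefficients per attribute name its questions.
top4 <- apply(abs(coefs), 2, function(x) names(sort(x, decreasing = TRUE))[1:4])
top4
```

With an exact linear scoring rule, the fit recovers coefficients of 0.25 on each attribute's own four questions and essentially zero elsewhere, so `top4` names q1-q4 for A and q5-q8 for B. The noisier pattern in the real output above is the first hint that the radar is not this simple.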
