Evaluating Data Science Projects
- by 7wData
I’ve written two blog posts on evaluation, the broccoli of Machine Learning. There are actually two closely related concerns under the rubric of evaluation: measuring how well the model performs technically, and measuring how well it serves the business need.
Both types are important not only to data scientists but also to managers and executives, who must evaluate project proposals and results. To managers I would say: It’s not necessary to understand the inner workings of a machine learning project, but you should understand whether the right things have been measured and whether the results are suited to the business problem. You need to know whether to believe what data scientists are telling you.
To this end, here I’ll evaluate a machine learning project report. I found this work described as a customer success story on a popular machine learning blog. The write-up was posted in early 2017, along with a related video presenting the results. Some aspects are confusing, as you’ll see, but I haven’t sought clarification from the authors because I wanted to critique it just as reported. This makes for a realistic case study: you often have to evaluate projects with missing or confusing details.
As you’ll see, we’ll uncover some common application mistakes that even professional data scientists can make.
The problem is presented as follows: A large insurance company wants to predict especially large insurance claims. Specifically, they divide their population into drivers who report an accident (7–10%), drivers who have no accidents (90–93%), and so-called large-loss drivers who report an accident involving damages of $10,000 or more (about 1% of their population). It is only this last group, the drivers involved in large, expensive claims, that they want to detect. They are facing a two-class problem, whose classes they call Large Loss and Non-Large Loss.
Readers acquainted with my prior posts may recall I’ve talked about how common unbalanced classes are in real-world machine learning problems. Indeed, here we see a 99:1 skew where the positive (Large-Loss) instances are outnumbered by the uninteresting negative instances by about two orders of magnitude. (By the way, this would be considered very skewed by ML research standards, though it’s on the lighter side by real-world standards.) Because of this skew, we have to be careful in evaluation.
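To see why the skew demands care, consider a sketch (with made-up numbers matching the roughly 99:1 ratio above) of a classifier that simply predicts Non-Large Loss for every driver:

```python
import numpy as np

# Hypothetical illustration of the ~99:1 skew described above:
# a "classifier" that predicts Non-Large Loss for every driver.
rng = np.random.default_rng(0)
y_true = rng.random(100_000) < 0.01   # ~1% positives (Large Loss drivers)
y_pred = np.zeros_like(y_true)        # always predict the majority class

accuracy = (y_pred == y_true).mean()  # ~0.99, which looks impressive
recall = y_pred[y_true].mean()        # 0.0: catches no Large Loss drivers
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

A do-nothing model scores about 99% accuracy while detecting none of the drivers the company actually cares about, which is why plain accuracy is the wrong yardstick here.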
Their approach was fairly straightforward. They had a historical sample of previous drivers’ records on which to train and test. They represented each driver’s record using 70 features, both categorical and numerical, although only a few of these are shown.
They state that their client had previously used a Random Forest to solve this problem. A Random Forest is a well-known and popular technique that builds an ensemble of decision trees to classify instances. They hope to do better using a deep learning neural network. Their network design looks like this:
The model is a fully connected neural network with three hidden layers, with ReLU as the activation function. They state that the model, implemented in TensorFlow, was trained on Google Compute Engine, and that Cloud Machine Learning Engine’s HyperTune feature was used to tune hyperparameters.
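For concreteness, here is a minimal NumPy sketch of the reported architecture: 70 input features, three fully connected hidden layers with ReLU activations, and the two-unit output the write-up describes. The hidden-layer widths (128, 64, 32) are my own assumptions for illustration; the post does not state them.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Assumed layer widths; only the 70 input features and the two-unit
# output come from the write-up itself.
rng = np.random.default_rng(0)
sizes = [70, 128, 64, 32, 2]
Ws = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):  # three hidden ReLU layers
        h = relu(h @ W + b)
    return relu(h @ Ws[-1] + bs[-1])    # two ReLU output units, as reported

out = forward(rng.normal(size=(5, 70)))  # a batch of 5 driver records
print(out.shape)
```

This is only the forward pass with random weights, not a trained model, but it makes the shape of the design concrete: one unbounded non-negative output per class.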
I have no reason to doubt their representation choices or network design, but one thing looks odd. Their output is two ReLU (rectifier) units, each emitting the network’s accuracy (technically: recall) on one of the two classes. I would have chosen a single sigmoid (logistic) unit emitting the probability that a driver is Large Loss, from which I could get a ROC or precision-recall curve. I could then threshold the output to get any achievable performance on the curve. (I explain the advantages of scoring over hard classification in this post.)
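The advantage of a single probability output can be sketched with synthetic scores (all numbers below are made up for illustration, not taken from the project):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: ~1% positives, with model scores that rank positives
# somewhat higher than negatives on average.
y = rng.random(10_000) < 0.01
scores = np.where(y, rng.beta(5, 2, y.shape), rng.beta(2, 5, y.shape))

def precision_recall_at(threshold):
    pred = scores >= threshold
    tp = (pred & y).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / y.sum()
    return precision, recall

# Sweeping the threshold traces out a precision-recall curve;
# a hard classifier commits to just one point on it.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With a scoring model, the business can pick the operating point (say, high recall for expensive claims at the cost of more false alarms) after training; a network that emits hard class decisions forfeits that choice.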