Which Spark machine learning API should you use?
- by 7wData
You’re not a data scientist. Supposedly according to the tech and business press, machine learning will stop global warming, except that’s apparently fake news created by China. Maybe machine learning can find fake news (a classification problem)? In fact, maybe it can.
But what can machine learning do for you? And how will you find out? There’s a good place to start close to home, if you’re already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you’re probably already using, Spark provides MLLib, which is, among other things, a library of machine learning and statistical algorithms in API form.
Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.
Mainly you’ll use these APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same then the two things are roughly equivalent. That isn’t necessarily true. Consider if a car manufacturer replaces the seat in a car and surveys customers on how comfortable it is. At one end the shorter customers may say the seat is much more comfortable. At the other end, taller customers will say it is really uncomfortable to the point that they wouldn’t buy the car and the people in the middle balance out the difference. On average the new seat might be slightly more comfortable but if no one over 6 feet tall buys the car anymore, we’ve failed somehow. Spark’s hypothesis testing allows you to do a Pearson chi-squared or a Kolmogorov–Smirnov test to see how well something “fits” or whether the distribution of values is “normal.” This can be used most anywhere we have two series of data. That “fit” might be “did you like it” or did the new algorithm provide “better” results than the old one. You’re just in time to enroll in a Basic Statistics Course on Coursera.
What are you? If you take a set of attributes you can get the computer to sort “things” into their right category. The trick here is coming up with the attribute that matches the “class,” and there is no right answer there. There are a lot of wrong answers. If you think of someone looking through a set of forms and sorting them into categories, this is classification. You’ve run into this with spam filters, which use a list of words spam usually has. You may also be able to diagnose patients or determine which customers are likely to cancel their broadcast cable subscription (people who don’t watch live sports). Essentially classification “learns” to label things based on labels applied to past data and can apply those labels in the future.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
Shift Difficult Problems Left with Graph Analysis on Streaming Data
29 April 2024
12 PM ET – 1 PM ET
Read MoreCategories
You Might Be Interested In
How big data is busting traffic jams
17 Jul, 2018This article was written by Jim McClelland. Jim is a sustainable futurist, speaker, writer and social-media commentator. His specialisms include built …
Artificial empathy: the upgrade AI needs to speak to consumers
15 Apr, 2022In a proliferated, multi-channel world, every brand needs to win the heart and mind of the consumer to acquire and …
Boost Your Digital Marketing with Business Intelligence
16 Jun, 2017With the emergence of social media, search engine optimization and social media have become the foundation of marketing strategies for …
Recent Jobs
Do You Want to Share Your Story?
Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.