What exactly is machine learning?
The simplest definition I came across:
Let’s break that down to set some foundations on which to build our machine learning knowledge.
Branch of AI: Artificial intelligence is the study and development of computer systems that can successfully accomplish tasks that would typically require human intelligence. Machine learning is one part of that field. It's the technology and process by which we train the computer to accomplish such a task.
Explores ways: Machine learning techniques are still emerging. Some models for training a computer are already recognized and used (as we will see below), but more are expected to be developed over time. The key idea is that different models can be used to train a computer, and different business problems require different models.
Get computers to improve their performance: For a computer to accomplish a task with AI, it needs practice and adaptation. A machine learning model needs to be trained using data and, in most cases, a little human help.
Based on experience: Providing an AI with experience is another way of saying providing it with data. As a rule, the more data is fed into the system, the more accurately the computer can respond to it and to future data that it will encounter. Greater accuracy in understanding the data means a better chance of successfully accomplishing the given task, or a higher degree of confidence when providing predictive insight.
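To make the "based on experience" idea concrete, here is a minimal Python sketch. It is not taken from any particular library: the toy task (predicting the square of a number), the nearest-neighbor model, and all function names are assumptions chosen for illustration. The only point is to show prediction error shrinking as the model is given more examples.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training example whose input is closest to x."""
    _, closest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return closest_y


def mean_abs_error(train, test):
    """Average absolute gap between predictions and true labels on test data."""
    gaps = [abs(nearest_neighbor_predict(train, x) - y) for x, y in test]
    return sum(gaps) / len(gaps)


def make_examples(n):
    """Build n evenly spaced (x, x**2) examples on [0, 1]; a made-up toy task."""
    return [(i / (n - 1), (i / (n - 1)) ** 2) for i in range(n)]


# Held-out points the model never sees during "training".
test_set = [((i + 0.5) / 100, ((i + 0.5) / 100) ** 2) for i in range(100)]

# More experience (more examples) should give a smaller error on unseen data.
for n in (5, 50, 500):
    print(n, mean_abs_error(make_examples(n), test_set))
```

On this toy task the printed error drops as n grows. Real data is noisier, so the gains from extra data are usually less tidy than this.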
Machine learning is often referred to as magical or as a black box.
Let’s take a look at the training process itself to better understand how machine learning can create value with data.
Collect: Machine learning is dependent on data. The first step is to make sure you have the right data as dictated by the problem you are trying to solve. Consider your ability to collect it, its source, the required format, and so on.
Clean: Data can be generated by different sources, contained in different file formats, and expressed in different languages. You might need to add or remove information from your data set, as some instances might be missing information while others might contain undesired or irrelevant entries. How the data is prepared will affect its usability and the reliability of the outcome.
Split: Depending on the size of your data set, only a portion might be required. This is usually referred to as sampling. From the chosen sample, your data should be split into two groups: one to train the algorithm and the other to evaluate it.
Train: As commonly seen with neural networks, this stage essentially aims at finding the mathematical function that will accurately accomplish the chosen goal. Using the training portion of your data set, the algorithm processes the data, measures its own performance, and automatically adjusts its parameters until it can consistently produce the desired outcome with sufficient reliability. In neural networks, the adjustments are typically computed with backpropagation.
Evaluate: Once the algorithm performs well on the training data, its performance is measured again with data that it has not yet seen. Additional adjustments are made when needed. This step lets you detect overfitting, which happens when the learning algorithm performs well but only on your training data.
Optimize: The model is optimized for integration into the target application so that it is as lightweight and as fast as possible.
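The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the records are made up and noise-free, the model is a straight line fitted with plain gradient descent (not a neural network, so there is no backpropagation here), and every function name is hypothetical. The Optimize step is left out because it depends on the target application.

```python
import random

# Collect: hypothetical in-memory records (x, y), two of them with missing fields.
raw = [(i / 10, 3.0 * (i / 10) + 2.0) for i in range(20)] + [(None, 1.0), (0.5, None)]


def clean(records):
    """Clean: drop records that are missing a value."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def split(records, train_fraction=0.8, seed=0):
    """Split: shuffle reproducibly, then cut into training and evaluation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mse(w, b, records):
    """Mean squared error of the line y = w * x + b on the given records."""
    return sum((w * x + b - y) ** 2 for x, y in records) / len(records)


def train(records, learning_rate=0.1, epochs=2000):
    """Train: repeatedly measure the error gradient and adjust w and b."""
    w, b = 0.0, 0.0
    n = len(records)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in records) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in records) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


train_set, test_set = split(clean(raw))
w, b = train(train_set)

# Evaluate: compare error on seen and unseen data; a large gap hints at overfitting.
print("train error:", mse(w, b, train_set))
print("test error:", mse(w, b, test_set))
```

Because the toy data follows an exact line, both errors end up close to zero. With real, noisy data, a test error much higher than the training error is the usual sign of overfitting.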
There are many different models that can be used in machine learning, but they are typically grouped into three types of learning: supervised, unsupervised, and reinforcement. Depending on the task to complete, some models are more appropriate and perform better than others.
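To give a feel for how the three types differ, here is a drastically simplified Python sketch. The data, the reward values, and the function names are all made up for illustration, and each function stands in for a whole family of far more capable models. What matters is the kind of input each one learns from: labeled examples, unlabeled values, or rewards received after acting.

```python
# Supervised: learn from examples that come with the right answer (a label).
labeled = [(1.0, "small"), (1.2, "small"), (8.0, "large"), (8.5, "large")]


def classify(x):
    """Predict the label of x by copying its nearest labeled example."""
    _, nearest_label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest_label


# Unsupervised: no labels at all; look for structure in the values themselves.
def two_means(values, iterations=10):
    """Group numbers into two clusters (1-D k-means with k = 2)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low_group, high_group


# Reinforcement: no examples; act, observe a reward, and prefer what paid off.
def reward(action):
    """Hypothetical environment with a fixed reward per action."""
    return {"left": 0.0, "right": 1.0}[action]


def learn_by_trial(actions, rounds=10):
    """Try each action once, then keep choosing the best average reward so far."""
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for _ in range(rounds):
        untried = [a for a in actions if counts[a] == 0]
        if untried:
            action = untried[0]
        else:
            action = max(actions, key=lambda a: totals[a] / counts[a])
        totals[action] += reward(action)
        counts[action] += 1
    return max(actions, key=lambda a: totals[a] / counts[a])


print(classify(7.0))
print(two_means([1.0, 1.2, 8.0, 8.5]))
print(learn_by_trial(["left", "right"]))
```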