Learning Feature Selection for Building and Improving your Machine Learning Model

Curated from medium.com →

Feature selection/creation/transformation is one of the most commonly overlooked areas in model building by aspiring Data Scientists.

Usually, the task of model building gets reduced to trying all sorts of fancy algorithms, from standard machine learning algorithms to deep learning models. But if we feed garbage to our machine learning algorithm, garbage is going to come out of it (GIGO). So, before selecting the best machine learning algorithm, the best thing to do is to build a baseline model, powered by a base algorithm or any go-to algorithm, and then concentrate on improving the accuracy of this model using feature creation/transformation/selection.

In model building, feature selection/creation is the step where the most time should be spent. Feature selection is somewhat easier than feature creation: it is a well-researched area, and many machine learning libraries in Python and R automate much of the process.

Feature creation is a bigger dragon to slay. Smart features can reduce training time and increase accuracy considerably. Creating features requires a lot of problem context. Expertise in feature creation comes with practice and knowledge of the domain.

In this blog, I will explore both feature selection and creation using Python. I will use the dataset at this link. The dataset is very simple. Our task is to build a model that can predict whether a person will be promoted or not. A few of the variables available to us are region, department, education, gender, age, KPIs_met>80%, etc.

Let’s first run the base model. It includes all the given columns and uses an XGBoost classifier.

The numbers above will be our baseline. We will try to improve them through feature selection and feature creation.
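The baseline step described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the article's actual code: the data is synthetic, the column names (`department`, `age`, `avg_training_score`, `is_promoted`) are hypothetical, the F1 score is an assumed choice of metric, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs only scikit-learn. `xgboost.XGBClassifier` exposes the same `fit`/`predict` interface and could be swapped in.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the promotion dataset (all values are made up).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "department": rng.choice(["Sales", "HR", "Tech"], size=n),
    "age": rng.integers(20, 60, size=n),
    "avg_training_score": rng.integers(40, 100, size=n),
})
# Hypothetical target: promotion is more likely with a high training score.
noisy_score = df["avg_training_score"] + rng.normal(0, 10, size=n)
df["is_promoted"] = (noisy_score > 80).astype(int)

# Baseline: use every column, with only the minimum encoding needed to fit.
X = pd.get_dummies(df.drop(columns="is_promoted"), drop_first=True, dtype=int)
y = df["is_promoted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Stand-in for the XGBoost classifier mentioned in the text.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Record the hold-out score; later feature work is compared against it.
baseline_f1 = f1_score(y_test, model.predict(X_test))
print(f"baseline F1: {baseline_f1:.3f}")
```

Keeping the split and the random seed fixed makes later comparisons meaningful: any change in the score can then be attributed to the features rather than to a different sample.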

There are broadly two types of variables: categorical variables and numerical/continuous variables.

Below are the categorical variables from the dataset

Below are the numerical variables from the dataset

Each type of variable is handled in its own way.
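One possible way to separate the two groups in pandas is to split on column dtype. The sketch below uses a tiny made-up frame whose column names loosely follow the variables listed earlier; the values are illustrative only.

```python
import pandas as pd

# Tiny made-up sample; values are illustrative only.
df = pd.DataFrame({
    "department": ["Sales", "HR", "Tech"],
    "education": ["Bachelor's", "Master's", "Bachelor's"],
    "age": [35, 30, 41],
    "avg_training_score": [49, 60, 73],
})

# Split columns by dtype so each group can be handled separately.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

A dtype-based split is only a first pass: a categorical variable stored as integers (a 0/1 flag, for instance) would land in the numerical group and would need to be moved by hand.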

There are two major classes of categorical data: nominal and ordinal.

**Nominal**: there is no concept of order among the values, e.g. racial types such as Asian, American, European. Representing Asian with 1 and American with 2 doesn’t mean that one racial type is superior or inferior to the other.

**Ordinal**: there is some sense of order among the values, e.g. shoe sizes S, M, L, XL, XXL.
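For ordinal data, the order can be carried into the encoding. The following is a minimal sketch using pandas' ordered `Categorical`; the sample values are made up, and the category order is stated explicitly instead of being inferred.

```python
import pandas as pd

sizes = pd.Series(["M", "S", "XL", "L", "XXL", "M"])

# Explicit order, smallest to largest; the integer codes preserve it.
size_order = ["S", "M", "L", "XL", "XXL"]
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)
size_codes = pd.Series(ordered.codes, index=sizes.index)

print(size_codes.tolist())
```

The same integer coding applied to a nominal variable would suggest an order that does not exist, which is why nominal variables are usually encoded differently.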

Example: take the column **recruitment channel**. It has 3 unique values in total, so we create 2 dummy variables. The problem with this approach arises when the variable can take a large number of unique values (high cardinality): the number of columns then increases drastically.
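The dummy-variable step above could look roughly like this with `pandas.get_dummies`. The channel labels and the `recruitment_channel` column name are hypothetical, since the text only states that the column has 3 unique values.

```python
import pandas as pd

# Hypothetical channel labels; the text only says there are 3 unique values.
df = pd.DataFrame({
    "recruitment_channel": ["sourcing", "referred", "other", "sourcing"],
})

# Check cardinality first: each extra unique value adds another column.
n_unique = df["recruitment_channel"].nunique()

# drop_first=True keeps k-1 dummies for k categories (here 3 -> 2 columns).
dummies = pd.get_dummies(
    df["recruitment_channel"], prefix="channel", drop_first=True, dtype=int
)

print(n_unique, dummies.columns.tolist())
```

Checking `nunique()` before encoding gives an early warning for high-cardinality columns such as a region code, where one dummy per value may not be practical.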

Domain specific: one must first check the problem context and try to get as much information as possible. For the given dataset we are provided with the average training score. But before we make any comments on this feature’s importance, we should take a step back and think: how would a person be promoted in a multi-state, multi-department company?

· The number of promotions would depend on the department.
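One way to turn an observation like this into a feature, offered here as an assumption and not as the article's own method, is to attach each department's promotion rate to every row. The rows and the column names (`department`, `is_promoted`, `dept_promotion_rate`) below are made up for illustration.

```python
import pandas as pd

# Made-up training rows; column names are hypothetical.
train = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR"],
    "is_promoted": [1, 0, 0, 0, 1, 0],
})

# Share of promoted employees per department, computed on training data only
# so that the target does not leak into validation or test rows.
dept_rate = train.groupby("department")["is_promoted"].mean()

# Attach the department-level rate to every row as a new feature.
train["dept_promotion_rate"] = train["department"].map(dept_rate)

print(dept_rate.to_dict())
```

Because the rate is derived from the target, each training row still sees its own label in this simple version; computing the rate out-of-fold is a common way to reduce that effect.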

Yves Mulkers

Yves Mulkers is the founder of 7wData and a widely followed voice in the data and AI community. He curates the 7wData and AI Beat newsletters, reaching hundreds of thousands of data and AI professionals, and writes on data strategy, analytics, AI, and the evolving data ecosystem.