How to Make Better Predictions When You Don’t Have Enough Data

When Donald Trump first declared his candidacy for President of the United States, most analysts predicted that he had an incredibly small chance of becoming the Republican nominee. Probably the most prominent of these was Nate Silver of FiveThirtyEight, who estimated that Trump had a 2% chance of winning the nomination. That estimate was based on historical data points about past candidates, such as the backgrounds they came from, whether they were widely endorsed by the party, and their past successes and failures. This is a standard prediction approach, resting on the assumption that what you are trying to predict (Trump) is comparable to its historical antecedents (past GOP candidates) and can therefore be evaluated against their performance. As is now clear, however, in unique cases like the Trump phenomenon, recent direct history teaches us very little.

A similar problem crops up in polling. Political analysts use polls to estimate the likelihood of a candidate’s success. Polls are imperfect, however, and suffer from several types of bias, such as non-response bias, the tradeoff between reaching voters on landlines versus cellphones, and shifts in voter turnout. To overcome these obstacles, political statisticians build models that try to correct polling errors using data from previous elections. This method rests on the assumption that current and historical polls suffer from the same types of errors; for example, analysts might assume that the population of non-responders is distributed similarly across time, an assumption that may or may not be true.
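To make that assumption concrete, here is a minimal sketch of this style of correction, with entirely hypothetical numbers: the poll average is shifted by the mean of past polling errors, which is valid only if today’s polls mis-measure the electorate the same way past polls did.

```python
import numpy as np

# Hypothetical inputs: current poll margins for a candidate (in points)
# and the signed errors (poll minus result) from comparable past elections.
current_polls = np.array([2.1, 3.4, 1.8, 2.9])
past_polling_errors = np.array([-1.2, 0.8, -2.1, -0.5])

# Shift the poll average by the mean historical error. This implicitly
# assumes today's polls err the same way past polls did.
raw_estimate = current_polls.mean()
corrected_estimate = raw_estimate - past_polling_errors.mean()

# Use the spread of past errors as a rough uncertainty band.
error_spread = past_polling_errors.std(ddof=1)
print(f"raw: {raw_estimate:+.1f}, corrected: {corrected_estimate:+.1f} "
      f"(+/- {error_spread:.1f} points)")
```

If the error-generating process has changed, say, non-responders now lean differently than they used to, the correction itself becomes a new source of error.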

Compounding both problems, presidential elections are relatively rare events, so the historical data is limited: the sample size is small, and much of it is outdated.

Predictive statisticians in the private sector face similar problems when trying to predict unexpected events, or when working from flawed or incomplete data. Simply turning the work over to machines won’t help: most machine-learning and statistical mining techniques also rest on the assumption that the historical data used to train the model behaves like the target data the model is later applied to. This assumption often fails because the historical data is obsolete, and collecting fresh data that does satisfy it is often expensive or impractical.
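A toy illustration of what happens when that assumption breaks (synthetic data, scikit-learn): a model fit to “historical” data scores well in-sample but poorly on “current” data whose distribution, and whose relationship between feature and label, has shifted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# "Historical" training data: the label flips around a boundary at x = 0.
X_hist = rng.normal(0.0, 1.0, size=(1000, 1))
y_hist = (X_hist[:, 0] + rng.normal(0, 0.5, 1000) > 0).astype(int)

# "Current" target data: both the feature distribution and the decision
# boundary have drifted (to x = 1.5), violating the train/target assumption.
X_now = rng.normal(1.5, 1.0, size=(1000, 1))
y_now = (X_now[:, 0] - 1.5 + rng.normal(0, 0.5, 1000) > 0).astype(int)

model = LogisticRegression().fit(X_hist, y_hist)
print("accuracy on historical data:", model.score(X_hist, y_hist))
print("accuracy on shifted current data:", model.score(X_now, y_now))
```

On the historical sample the model scores well; on the shifted sample it falls to roughly a coin flip, even though nothing about the model itself changed.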

Thus, in order to stay relevant, statisticians will have to abandon the purist position of fitting models based solely on direct historical data, and enrich their models with recent data from similar domains that better captures current trends.

This is known as Transfer Learning, a field that helps solve these problems by offering algorithms that identify which areas of knowledge are “transferable” to the target domain. This broader set of data can then be used to help “train” the model. These algorithms identify the commonalities between the target task and recent, previous, and similar-but-not-the-same tasks, and thus guide the model to learn only from the relevant parts of the data.
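One common transfer-learning recipe is importance weighting under covariate shift. The sketch below (synthetic data throughout) trains a domain classifier to score how “target-like” each historical example is, then uses those scores to reweight the training set so the task model learns mainly from the transferable regions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Source domain: plentiful but partially outdated labeled data.
X_src = rng.normal(0.0, 1.5, size=(2000, 2))
y_src = (X_src.sum(axis=1) > 0).astype(int)

# Target domain: scarce, unlabeled data reflecting current conditions.
X_tgt = rng.normal(1.0, 0.7, size=(200, 2))

# Step 1: train a domain classifier to tell source from target examples.
X_dom = np.vstack([X_src, X_tgt])
d_dom = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
domain_clf = LogisticRegression().fit(X_dom, d_dom)

# Step 2: weight each source example by how "target-like" it looks,
# clipping extreme ratios for numerical stability.
p_tgt = domain_clf.predict_proba(X_src)[:, 1]
weights = np.clip(p_tgt / (1 - p_tgt), 0.0, 10.0)

# Step 3: fit the task model on the reweighted source data.
model = LogisticRegression().fit(X_src, y_src, sample_weight=weights)
```

Source examples that resemble the scarce target data receive weights near or above one, while examples from regions the target domain never visits are down-weighted toward zero, which is one simple way of learning “only from the relevant parts of the data.”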

In the example of the U.S.