Data is misinterpreted more often than you might expect. Even with the best intentions, important variables may be omitted or a problem may be oversimplified or overcomplicated. Sometimes, organizations act on trends that are not what they seem. And even when two people view the same analytical result, they may interpret it differently.
"Statistics can tell you 'this versus that.' The real questions are, 'Is the difference worth worrying about?' and 'Have we collected enough data to allow us to make a decision?'" said Ken Gilbert, professor emeritus of the department of statistics, operations, and management at the University of Tennessee, in an interview.
It is entirely possible for business leaders to obsess about something that is statistically insignificant, or for data scientists to omit important variables, simply because they do not understand the entire context of the problem they are trying to solve. In short, the path to valuable insights can include a number of obstacles, some of which may not become apparent until after the fact.
Some individuals and groups take a top-down approach to data analysis, meaning that they focus on the business problem they are trying to solve and make a point of identifying variables that have been relevant in the past in the same or a similar context. Others take a bottom-up approach, meaning that they attempt to correlate variables with the outcome they are trying to improve (such as website conversions or sales). The danger of the latter approach is a high probability that some correlations are statistically significant but are an artifact of the way the data has been analyzed, rather than an accurate indicator of underlying relationships, Gilbert said.
There are a lot of ways data can be misinterpreted, and business leaders need to understand how and why it can happen. Here are nine examples.
Insufficient Domain Expertise
Domain expertise and data expertise are both necessary for accurate interpretations of data. However, business professionals are not data scientists, and data scientists generally do not have the same level of subject matter expertise that some others in the organization possess. Although there are roles in between, such as business analysts, an imbalance of data expertise and domain expertise can result in the misinterpretation of data. "The data scientist or analytics person often doesn't understand the context of the variables they're looking at, and that happens a lot in large organizations where people are working in silos," said Carlos Abisambra, founder and CEO of lead-generation consulting firm Vortice Services, in an interview. "The way the numbers are calculated doesn't get put into the conclusion for the analysis." Sometimes a lack of domain expertise can lead to the omission of important variables, which causes strange results. "Subject matter expertise is paramount to the analysis and interpretation of data. You can aggregate the data in such a way that you see relationships that are the opposite of what they should be. There are some variables lurking around in the background that weren't included in the analysis," said University of Tennessee's Ken Gilbert. "When you're analyzing 30 to 40 different variables trying to understand customer behavior, you may leave out something important that causes you to arrive at an erroneous conclusion and you may not even realize it." Before Tron Jordheim, chief marketing officer at self-service storage company StorageMart, uses data to examine a situation or to make a decision, his team questions whether the result is possible in the real world, whether the result is likely, and what real-world experience makes them skeptical about the data or makes them think the data makes sense.
Important Variables Are Omitted
A single missing variable can cause data to be misinterpreted. And when data is misinterpreted, it leads to faulty conclusions and sometimes unwise investments. "This is the 'minefield' of using data," said StorageMart chief marketing officer Tron Jordheim in an interview. "There are so many obvious variables and an unknown quantity of unknown or obscure variables." Even if you have been prudent about identifying variables, Jordheim said, "after you make the decision you have to know that you may [nevertheless] have missed a variable. You will be on the lookout for curious or odd results after the decision that indicate you missed a variable." When retail promotions are more (or less) successful than expected, it may be because important variables are missing. "In low-income areas, there tends to be more shopping at the first of the month, so stores may do promotions to fill in the valleys," said Ken Gilbert, professor emeritus at the University of Tennessee. "If you look at the relationship between promotions and sales, it would look like promotions hurt sales. To find out the real effect, you have to do [tests] where you experiment in high-demand or low-demand periods, or you control for the seasonality in demands."
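Gilbert's promotion example can be sketched in a few lines of Python. The numbers below are invented for illustration (a baseline of 100 sales in high-demand weeks, 60 in low-demand weeks, and a true lift of 10 from a promotion); they are not from StorageMart or any retailer. The sketch assumes promotions are scheduled mostly in low-demand weeks, and compares a naive average with the same comparison made within each demand period.

```python
from statistics import mean

# Hypothetical numbers: baseline weekly sales by demand period, plus a fixed
# true lift from running a promotion.
BASELINE = {"high": 100, "low": 60}
TRUE_LIFT = 10

# (period, promoted, number of weeks): promotions are mostly scheduled in
# low-demand weeks "to fill in the valleys".
SCHEDULE = [("high", False, 9), ("high", True, 1),
            ("low", False, 3), ("low", True, 7)]

weeks = [
    (period, promoted, BASELINE[period] + (TRUE_LIFT if promoted else 0))
    for period, promoted, count in SCHEDULE
    for _ in range(count)
]

def average_sales(rows, promoted):
    """Mean sales over the rows with the given promotion flag."""
    return mean(sales for _, flag, sales in rows if flag == promoted)

# Naive comparison that ignores the demand period.
naive_effect = average_sales(weeks, True) - average_sales(weeks, False)

# The same comparison made separately within each demand period.
effect_by_period = {}
for period in BASELINE:
    rows = [w for w in weeks if w[0] == period]
    effect_by_period[period] = average_sales(rows, True) - average_sales(rows, False)

print(f"Naive effect of promotion: {naive_effect:+.1f}")
for period, effect in effect_by_period.items():
    print(f"Effect within {period}-demand weeks: {effect:+.1f}")
```

With these made-up figures, the naive comparison suggests promotions reduce sales, while the within-period comparison recovers the positive lift that was built into the data.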
Aggregation Obscures Truth
Different levels of aggregation can tell a different story. Knowing that, it is common to vary the levels of aggregation to confirm the trend and to understand at which point the results diverge or reverse. "It's a good strategy to see whether the trends at the aggregate level hold up," said University of Tennessee's Ken Gilbert. "A better strategy is to ask, before you start collecting or analyzing data, what are the potential sources of variation [such as] why would sales vary from location to location and month to month during times of promotion versus when we don't have a promotion? That way, you have a list of things you believe have an impact, and you make sure your data is grouped according to the different values of those variables." Andrew Christopher, head of loyalty consulting in South America for marketing and loyalty analytics company Aimia, said in an interview that aggregation levels helped determine whether or not a pilot program was successful. "We were trying to persuade older customers to move from telephone service to a much lower cost online channel. When we viewed the age group in the aggregate, the results were not encouraging," said Christopher. "By further segmenting the pilot group, we were able to identify some specific subgroups for whom the campaign had been highly successful in changing their behavior. With the results of this analysis, we were able to alter the targeting of the full campaign to include only those subgroups, and drive significant cost savings."
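The kind of segmentation Christopher describes might look something like the following sketch. The segment names, counts, and the 25% targeting threshold are all hypothetical; Aimia's actual segments and results are not given in the interview.

```python
# Hypothetical pilot results: segment -> (customers contacted, customers who
# switched to the online channel). None of these figures come from Aimia.
pilot = {
    "65+, no smartphone": (600, 12),
    "65+, smartphone": (250, 70),
    "55-64, smartphone": (150, 60),
}

# Assumed cutoff for including a segment in the full campaign.
TARGET_THRESHOLD = 0.25

total_contacted = sum(contacted for contacted, _ in pilot.values())
total_switched = sum(switched for _, switched in pilot.values())
overall_rate = total_switched / total_contacted

# Switch rate for each segment taken on its own.
segment_rates = {
    segment: switched / contacted
    for segment, (contacted, switched) in pilot.items()
}

# Segments that clear the threshold even though the aggregate does not.
target_segments = [
    segment for segment, rate in segment_rates.items()
    if rate >= TARGET_THRESHOLD
]

print(f"Overall switch rate: {overall_rate:.1%}")
for segment, rate in segment_rates.items():
    print(f"  {segment}: {rate:.1%}")
print("Segments to target:", target_segments)
```

In this illustration the aggregate rate looks discouraging because the largest segment barely responds, while two smaller segments respond strongly.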
Inferences Are Off Base
All inferences from data are conditional, so it's wise to understand the group about which inferences are being made. If not, you run the risk of inferring the wrong properties about a population. "If you are not trained in statistical thinking, you will tend to misinterpret the data or the results," said University of Tennessee's Ken Gilbert. In a business context, inferences about groups, such as all customers or a subset of customers, are common. "Every one of our properties is in a unique market with a unique set of customers and a unique set of local market conditions," said StorageMart chief marketing officer Tron Jordheim. "Every time we see a data result, we have to then consider the market to determine how that changes what we think the data is actually showing us. You cannot judge the effectiveness of a promotion to sign up customers for credit card auto-payment plans without considering that [in] certain markets ... people primarily use cash and without considering that credit card use habits differ between Canada and the USA."
Sources Of Variation Overlooked
It's important to consider the sources of variation in a process in order to disaggregate the data in a way that leads to insights. "If you are attempting to understand the impact of a promotion on weekly sales, you may find that the 'after promotion' sales are well within the predicted range of variation of weekly sales," said University of Tennessee's Ken Gilbert. "But when you break sales down by stores and day of the week to account for those two important sources of variation, you may find that the impact of the promotion can be more precisely measured." Loyalty programs have become popular among retailers, but it can be challenging to determine the business impact of such programs. "The point of a loyalty program is to drive more profitable behaviors," said Andrew Christopher, head of loyalty consulting in South America at Aimia. "A common approach is to simply compare the level of sales and profitability for the program members versus non-members. However, this approach is flawed, as it is natural that a company's best customers are significantly more likely to join the program, which skews the results. In order to account for this, the two populations must first be stratified by variables such as spend in a pre-period, tenure, and demographics to ensure we have robust test and control groups."
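A small sketch can illustrate Gilbert's point about breaking sales down by store and day of the week. The stores, days, and sales figures are invented; the idea is only that a modest promotion effect can disappear inside the overall spread of sales, yet look consistent once each store-and-day cell is compared with itself.

```python
from statistics import mean, stdev

# Hypothetical average daily sales, keyed by (store, day of week).
before = {("A", "Mon"): 100, ("A", "Sat"): 300,
          ("B", "Mon"): 40, ("B", "Sat"): 120}
after = {("A", "Mon"): 106, ("A", "Sat"): 318,
         ("B", "Mon"): 42, ("B", "Sat"): 127}

# Pooled view: the change in the overall average, set against the spread of
# sales across stores and days before the promotion.
pooled_change = mean(list(after.values())) - mean(list(before.values()))
pooled_spread = stdev(list(before.values()))

# Broken-down view: relative lift for each store/day cell against itself.
relative_lift = {cell: after[cell] / before[cell] - 1 for cell in before}

print(f"Pooled change: {pooled_change:.2f} (spread before promotion: {pooled_spread:.1f})")
for (store, day), lift in relative_lift.items():
    print(f"  Store {store}, {day}: {lift:+.1%}")
```

In the pooled view the change is small next to the store-to-store and day-to-day spread. In the broken-down view every cell shows a lift of roughly 5% to 6%, which is much easier to interpret.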
Statistical Significance Trumps Thinking
Statistical significance is important, but not everything that's statistically significant actually matters. "Statistical significance should be used as a screen to identify variables that might be helpful. With very large datasets, you will find statistically significant results. The real question is, are they practically significant?" said University of Tennessee's Ken Gilbert. "If you are testing 100 different variables in a prediction equation, and you are testing at the 0.05 level of significance (p=0.05), then on average five of those variables will show up as significant just due to chance. The more important questions are: 1. Is there anything that explains why that variable should be a predictor? and 2. Is the relationship strong enough to be of any practical significance?" A manager at a securities trading firm used a fixed statistical significance cutoff to decide when to deploy a strategy. As a result, traders would intentionally overfit strategies to meet the significance criteria. "The group's performance wasn't great because the manager was blindly using statistical significance at the expense of critical thinking," said Michael Li, founder of data science placement and training firm The Data Incubator, in an interview. Some people assume statistical significance is inherent in big data analytical results, which isn't always the case. "In the big data world, people often assume that all data is equally good and forget about statistical significance, assuming if the data point comes from a 'big data' dataset, statistical significance is implicit," said Laura Khoury, principal at business consulting firm Blossom Growth Partners. "This is particularly true in social listening contexts, whereby companies collect all mentions of their name and products and take action on specific mentions that seem important. The beauty of big data is not the individual points in the data, rather the collective points to allow identification of trends and outliers, both of which require statistical significance to work."
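Gilbert's arithmetic (100 variables tested, about five significant by chance) corresponds to a 0.05 significance level, and it can be checked with a short simulation. The sketch below assumes none of the 100 variables has any real effect and that the tests are independent, in which case each p-value behaves like a uniform random number between 0 and 1. The trial count and seed are arbitrary choices.

```python
import random

N_VARIABLES = 100   # variables tested, as in Gilbert's example
ALPHA = 0.05        # significance level
N_TRIALS = 2000     # arbitrary number of simulated analyses

# Analytical answers under the stated assumptions.
expected_false_positives = N_VARIABLES * ALPHA
prob_at_least_one = 1 - (1 - ALPHA) ** N_VARIABLES

def count_false_positives(rng):
    """Count 'significant' results when no variable has a real effect.

    Under the null hypothesis each p-value is modeled as a uniform draw
    on [0, 1), so it falls below ALPHA with probability ALPHA.
    """
    return sum(rng.random() < ALPHA for _ in range(N_VARIABLES))

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [count_false_positives(rng) for _ in range(N_TRIALS)]
simulated_mean = sum(counts) / len(counts)

print(f"Expected false positives per analysis: {expected_false_positives:.1f}")
print(f"Simulated average: {simulated_mean:.2f}")
print(f"Chance of at least one false positive: {prob_at_least_one:.1%}")
```

Under these assumptions, an analysis of 100 irrelevant variables is almost certain to flag at least one of them, which is why Gilbert treats significance as a screen rather than a conclusion.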
Numerical Analysis Missed Something
Data visualizations can reveal anomalies that could be overlooked in numerical analysis. They may also reveal that what is numerically interesting is simply a misconception. For example, outliers (extreme values that can skew analysis) are immediately obvious visually. Non-linear relationships can result in a very small correlation coefficient, but they are also easy to recognize visually. "If I'm looking at the difference between the level of promotion and sales, I notice that when I promote a tiny amount my sales are small, and as I increase promotions my sales increase. But after a while, I reach a point where more promotion doesn't result in more sales. In fact, more promotion reduces sales, maybe because I've overexposed the product," said University of Tennessee's Ken Gilbert. "If I plot that, I see the curve and I understand the relationship. If I simply compute a correlation, it may say at the highest level of promotion I have low sales, so there's very little positive correlation between promotion and sales, and that's obscuring what's really going on."
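The effect Gilbert describes is easy to reproduce. The sketch below uses an invented sales curve that rises with promotion, peaks, and then falls; the formula and the promotion levels are assumptions chosen only to give a symmetric curve. It computes the Pearson correlation over the full range and over each half separately.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical promotion levels 0..10 and a made-up sales curve that peaks
# at level 5 and then declines (overexposure).
promotion = list(range(11))
sales = [50 + 20 * x - 2 * x * x for x in promotion]

r_all = pearson(promotion, sales)
r_rising = pearson(promotion[:6], sales[:6])    # levels 0 through 5
r_falling = pearson(promotion[5:], sales[5:])   # levels 5 through 10

print(f"Correlation over all levels: {r_all:.2f}")
print(f"Correlation while sales rise: {r_rising:.2f}")
print(f"Correlation while sales fall: {r_falling:.2f}")
```

Over the whole range the correlation is essentially zero even though sales depend entirely on promotion level here. A plot of the same eleven points would show the curve immediately.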
Correlation Is Mistaken For Causation
Correlation and causation are often confused. For example, a recent article in the American Journal of Medical Genetics Part B: Neuropsychiatric Genetics stated that eye color may be a potential indicator of alcohol dependence among European Americans. Not surprisingly, news headlines tended to miss one or more of the conditional statements, which resulted in such definitive, yet misleading, headlines as "This Eye Color Is Linked To Alcoholism." The same dynamic happens frequently in organizations, sometimes deliberately, but more often because the subtleties are overlooked or disregarded. "In the workplace, somebody may put up a graph, or even go as far as calculating R-squared or R, and say that the series are positively correlated, but that doesn't necessarily mean that one is causing the other," said Vortice Services founder and CEO Carlos Abisambra. For example, a sales executive wanted his team members to call leads faster, but the salespeople saw no point since they were already calling prospects the same day. Moreover, there was no data indicating that calling sooner would make a difference. The sales executive decided to graph two sales variables: "wins," which meant the prospect had been converted into a customer, and "time to contact." There was a strong negative correlation, which seemed to confirm that faster contact increased wins, so a graph was produced and presented to the sales and marketing teams. In addition, resources from IT and other departments were assigned to help the salespeople call leads faster. However, losses also negatively correlated with the time-to-call variable. "Both wins and losses were strongly correlated, so how do we separate what is causing what? It turns out if you call a lead faster, you have a higher chance of getting the person on the phone, and if you get that person on the phone there is a higher chance of turning the lead into the win," said Abisambra.
"Fortunately, none of the efforts were in vain, but that's not always the case. A lot of times, you'll have charts that have a strong correlation that is positive or negative, and managers, executives, or even analysts will think just because they're correlated one will cause the other. In this case, the numbers were used to prove a point."
Explanation Adds Distortion
The easiest way to communicate results clearly is to use language that everyone in the audience understands. While using jargon may make a presentation sound scientific, it may also confuse the audience to whom results are presented. "It's important to express the results in language that adds clarity rather than distortion," said Ken Gilbert, University of Tennessee professor emeritus. "Stating that a new customer service cuts the monthly churn rate in half sounds impressive, but it's not what the decision maker needs to know, because the results have not been put in the context of the decision we're going to be making. If the current churn rate is 2%, a better way to express the result would be to say, 'To prevent a single customer churn, we need to provide this service to 100 customers.' That way, we can weigh the costs of providing the service against the cost of losing the customer, and decide whether that's what we're going to do."
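Gilbert's restatement follows from simple arithmetic, sketched below. The 2% churn rate and the halving come from his example; the per-customer service cost and the value of a retained customer are hypothetical figures added only to show how the decision could be weighed.

```python
# From Gilbert's example: monthly churn of 2%, cut in half by the service.
baseline_churn = 0.02
relative_reduction = 0.5

new_churn = baseline_churn * (1 - relative_reduction)
absolute_reduction = baseline_churn - new_churn

# Customers who must receive the service to prevent one churn.
customers_per_churn_prevented = 1 / absolute_reduction

# Hypothetical costs, not from the article.
service_cost_per_customer = 3.00    # assumed monthly cost of the service
value_of_retained_customer = 400.0  # assumed value of keeping one customer

cost_to_prevent_one_churn = customers_per_churn_prevented * service_cost_per_customer
worth_doing = cost_to_prevent_one_churn < value_of_retained_customer

print(f"Churn falls from {baseline_churn:.0%} to {new_churn:.0%}")
print(f"Customers served per churn prevented: {customers_per_churn_prevented:.0f}")
print(f"Cost to prevent one churn: ${cost_to_prevent_one_churn:.2f}")
print(f"Worth doing under these assumptions: {worth_doing}")
```

The "cut in half" framing and the "100 customers per churn prevented" framing describe the same result, but only the second plugs directly into a cost comparison.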