Machine Learning’s Poor Fit for Real Data
- by 7wData
There’s a growing sentiment out there with all the wonderful things happening in artificial intelligence, machine learning, and data science that these technologies are ready to solve all the things (including how to kill all humans). The reality is there are still a bunch of significant hurdles between us and the AI dystopia/utopia. One big one that is the main impetus behind my research is the disconnect between the statistical foundations of machine learning and how real data works.
Machine learning technology is built on a foundation of formal theory. Statistical ideas, computer science algorithms, and information-theoretic concepts integrate to yield practical methods that analyze large, noisy data sets to train actionable and predictive models. The power of these methods has caused many to realize the value of data.
Yet, as data collection accelerates, weaknesses of existing machine learning methods reveal themselves. The nature of larger-scale data collection violates key assumptions in the foundation that made machine learning so effective. Most notably, statistical independence is no longer achievable with large-scale data. Data is being collected from highly interacting, entangled, complex systems. Human data describes people interacting in a single global social network; ecological data represents measurements of organisms inhabiting complex, shared ecosystems; and medical data measures the interconnected, biological systems that govern health.
The concept of statistical independence is a natural fit for laboratory experimentation. In laboratory experiments, scientists test hypotheses by running repeated experiments in closed environments. By design, the measurements taken during each experiment are independent. Because one experiment can’t affect another’s result, classical statistics can confidently quantify the effects of factors in the experiment, even in the presence of randomness.
For example, a typical pre-clinical laboratory drug trial would use a population of animal subjects, administering a drug to part of the population and giving no treatment to a separate control subpopulation. The two subpopulations would be managed to ensure that confounding factors, such as genetics, are equally distributed.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
From Text to Value: Pairing Text Analytics and Generative AI
21 May 2024
5 PM CET – 6 PM CET
Read More