Real estate experts like to say that the three most important features of a property are: location, location, location! Likewise, weather events are highly location-dependent. We will see below how a similar perspective is also applicable to machine learning algorithms.
In real estate, the buyer is first and foremost concerned about location for at least three reasons: (a) the desirability of the surrounding neighborhood; (b) the proximity to schools, businesses, services, etc.; and (c) the value of properties in that area. Similarly, meteorologists tell us that all weather is local. Location is also significant in weather for at least three reasons: (a) specific weather events are almost impossible to predict because of the massive complexity of micro-scale atmospheric interactions that are spread over macro-scales of hundreds of miles; (b) the specific outcome of a weather prediction may occur only in highly localized areas; and (c) the minute details of a location (topography, hydrology, structures) are too specific to be included in regional models, and yet they are very significant variables in micro-weather events. Side note: we might have a good start here on generating some predictive models (for real estate sales or for weather), if we could parameterize the above location-based features and score them appropriately.
Another aspect of “location” is the boundary region between different areas. This boundary region can affect real estate sales, especially if a desirable area is adjacent to an undesirable area. While conditions (prices, market factors, resale values) may be well understood deep within each of the two areas, there is more uncertainty in the boundary region. This is similarly true for the weather, as was especially evident in the big snow and ice storms that swept across the United States on March 3, 2014. Those of us in the Baltimore-Washington region were expecting significant snow, ice, rain, and sleet. What we received was a moderate amount of snow across most of the region and not much else. This “less significant” weather event was partially due to the fact that cold dry air from the north won the battle against warm wet air from the south, pushing drier air into the region than was expected. Wrong predictions for massive snowfalls are not unusual in this part of the country, primarily because this latitude is often within the boundary region between the northern weather circulation patterns and the southern circulation patterns. It is often difficult to predict reliably which weather pattern will win the battle in the boundary region during any particular storm.
Location is also very important in many machine learning algorithms. The simplest classification (supervised learning) algorithms in machine learning are location-based: classify a data point based on which side of some decision boundary it falls on (as in a decision tree), or classify a data point based on the classes of its nearest neighbors (k-nearest neighbors, or KNN). Furthermore, clustering (unsupervised learning) is intrinsically location-based, using distance metrics to ascertain similarity among members of the same cluster and dissimilarity between members of different clusters. All of this is a natural consequence of the fact that we humans place things into different categories (or classes) when we see that different categories of items are clearly separated in some feature space (i.e., occupying different locations in that space). The challenge to data scientists is to find the best feature space for distinguishing, disentangling, and disambiguating different classes of behavior. Sometimes (though not often) those “best” features are the ones that we measured at the beginning, but we can usually discover improved classification features as we explore different combinations (linear and nonlinear) of the initial measured attributes.
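To make the location-based ideas above a little more concrete, here is a minimal Python sketch using only the standard library. Everything in it is hypothetical and chosen for illustration: the toy data points, the class names "inner" and "outer", the helper names knn_classify, stump_classify, and radius, and the threshold of 1.0. It is not a production implementation, just one way to picture (1) classifying a point by the labels of its nearest neighbors and (2) how a nonlinear combination of the measured attributes can yield a feature space in which a single decision boundary is enough.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def stump_classify(value, threshold, below, above):
    """One-split decision rule: which side of the boundary does `value` fall on?"""
    return below if value < threshold else above


def radius(point):
    """Derived (nonlinear) feature: distance of an (x, y) point from the origin."""
    return math.hypot(point[0], point[1])


# Hypothetical toy data: "inner" points sit near the origin,
# "outer" points lie on a ring around them.
train = [
    ((0.1, 0.2), "inner"), ((-0.3, 0.1), "inner"), ((0.2, -0.2), "inner"),
    ((2.0, 0.1), "outer"), ((-1.9, 0.4), "outer"),
    ((0.2, 2.1), "outer"), ((0.1, -2.0), "outer"),
]

# (1) KNN: the query is classified purely by where it sits relative to its neighbors.
knn_label = knn_classify(train, (0.0, 0.3), k=3)

# (2) Feature engineering: no single threshold on x alone or y alone separates
# these two classes, but one threshold on the derived radius feature does.
# The threshold 1.0 is an assumption picked by eye for this toy data.
stump_labels = [stump_classify(radius(p), 1.0, "inner", "outer") for p, _ in train]
true_labels = [label for _, label in train]

print(knn_label)
print(stump_labels == true_labels)
```

In this toy data the raw x and y attributes are the "measured at the beginning" features, and the radius is a nonlinear combination of them. The ring-shaped class surrounds the inner class in the original space, yet the two classes occupy cleanly separated locations along the single derived axis.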