In spatial domains, interaction between features gives rise to two types of interaction patterns: co-location and segregation patterns. Existing approaches to finding co-location patterns have several shortcomings: (1) they depend on user-specified thresholds for prevalence measures; (2) they do not take spatial auto-correlation into account; and (3) they may report co-locations even if the features are randomly distributed. Segregation patterns have yet to receive much attention. In this paper, we propose a method for finding both types of interaction patterns, based on a statistical test.
We introduce a new definition of co-location and segregation patterns, propose a model for the null distribution of features that takes spatial auto-correlation into account, and design an algorithm for finding both co-location and segregation patterns. We also develop two strategies to reduce the computational cost compared to a naïve approach based on simulations of the data distribution, and we propose a way to reduce the runtime of our algorithm even further by approximating the neighborhood of features. We evaluate our method empirically on synthetic and real data sets and demonstrate its advantages over a state-of-the-art co-location mining algorithm.
Mining Statistically Significant Co-location and Segregation Patterns
A co-location pattern is a group of spatial features/events that are frequently co-located in the same region. For example, human cases of West Nile Virus often occur in regions with poor mosquito control and the presence of birds. For co-location pattern mining, previous studies often emphasize the equal participation of every spatial feature. As a result, interesting patterns involving events with substantially different frequencies cannot be captured. This is the problem of mining co-location patterns with rare spatial features.
We consider a measure called the maximal participation ratio (maxPR) and show that a co-location pattern with a relatively high maxPR value corresponds to a co-location pattern containing rare spatial events. Furthermore, we identify a weak monotonicity property of the maxPR measure. This property can help in developing an efficient algorithm to mine patterns with high maxPR values. As demonstrated by our experiments, our approach is effective in identifying co-location patterns with rare events, and it is efficient and scalable for large-scale data sets.
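To make the role of maxPR concrete, the following is a minimal sketch for a pattern of two features only. It assumes that an instance of one feature participates in the pattern when some instance of the other feature lies within a fixed distance of it. The function names, the `radius` parameter, and the toy coordinates are illustrative assumptions and are not taken from the text above.

```python
from math import dist

def participation_ratios(points_a, points_b, radius):
    """Fraction of instances of each feature that have at least one
    instance of the other feature within `radius` (size-2 pattern only)."""
    a_in = sum(any(dist(p, q) <= radius for q in points_b) for p in points_a)
    b_in = sum(any(dist(q, p) <= radius for p in points_a) for q in points_b)
    return a_in / len(points_a), b_in / len(points_b)

def min_participation(ratios):
    # Measure that requires every feature to participate strongly.
    return min(ratios)

def max_participation_ratio(ratios):
    # maxPR: it is enough that one feature participates strongly.
    return max(ratios)

# Hypothetical toy data: a rare feature (2 instances) and a
# frequent feature (10 instances, only 2 of them near the rare one).
rare = [(0.0, 0.0), (10.0, 10.0)]
common = [(0.5, 0.0), (10.0, 10.5)] + [(50.0 + 5 * i, 50.0) for i in range(8)]

ratios = participation_ratios(rare, common, radius=1.0)
lowest = min_participation(ratios)
max_pr = max_participation_ratio(ratios)
print(ratios, lowest, max_pr)
```

In this toy data every instance of the rare feature has a neighbor of the frequent feature, but most instances of the frequent feature have no rare neighbor. A measure based on the minimum ratio is therefore low, while maxPR stays high, which is the situation described above for patterns containing rare events.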
We propose a mining algorithm that reports groups of features as co-location or segregation patterns if the participating features have a positive or negative interaction, respectively, among themselves. To determine the type of interaction, we test the statistical significance of the participation index (PI) value of a pattern instead of comparing its frequency against a simple threshold. To this end, we develop appropriate null models that also take the possible spatial auto-correlation of individual features into account. The null distribution is estimated through randomization tests, which are common in spatial statistics, since no closed-form expression for the joint distribution is available.
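The general shape of such a randomization test can be sketched as follows. This is a simplified illustration and not the null model described above: it handles two features only, keeps one feature fixed, and redistributes the other uniformly at random over a square study area, so it ignores spatial auto-correlation. The function names, the `radius` and `extent` parameters, and the generated data are all assumptions made for the example.

```python
import random
from math import dist

def participation_index(points_a, points_b, radius):
    """PI of the pattern {A, B}: the smaller of the two participation ratios."""
    a_in = sum(any(dist(p, q) <= radius for q in points_b) for p in points_a)
    b_in = sum(any(dist(q, p) <= radius for p in points_a) for q in points_b)
    return min(a_in / len(points_a), b_in / len(points_b))

def randomization_test(points_a, points_b, radius, extent, n_sims=199, seed=0):
    """Estimate p-values of the observed PI under a simplified null model.

    Assumption: under the null, B is placed uniformly at random in
    [0, extent] x [0, extent] while A stays fixed.
    """
    rng = random.Random(seed)
    observed = participation_index(points_a, points_b, radius)
    at_least = at_most = 0
    for _ in range(n_sims):
        sim_b = [(rng.uniform(0, extent), rng.uniform(0, extent))
                 for _ in points_b]
        pi = participation_index(points_a, sim_b, radius)
        at_least += pi >= observed   # evidence against co-location
        at_most += pi <= observed    # evidence against segregation
    p_colocation = (at_least + 1) / (n_sims + 1)
    p_segregation = (at_most + 1) / (n_sims + 1)
    return observed, p_colocation, p_segregation

# Hypothetical data: every B instance sits next to an A instance.
data_rng = random.Random(1)
feature_a = [(data_rng.uniform(0, 100), data_rng.uniform(0, 100))
             for _ in range(20)]
feature_b = [(x + 0.5, y) for x, y in feature_a]

observed, p_coloc, p_segr = randomization_test(
    feature_a, feature_b, radius=1.0, extent=100.0)
print(observed, p_coloc, p_segr)
```

A small `p_coloc` suggests that the observed PI is unusually high under the assumed null and would be reported as a co-location; a small `p_segr` suggests that it is unusually low and would be reported as a segregation. The add-one correction in both p-values counts the observed data as one of the simulated outcomes, which is a common convention for Monte Carlo tests.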
We introduce two strategies to improve the runtime of our proposed method. Because randomization tests require a large number of simulations, the statistical significance tests can become computationally expensive. First, we introduce a pruning strategy that identifies candidate patterns for which computing the prevalence measure is unnecessary. Second, by taking the spatial auto-correlation of features into account, we show that a simulation does not need to generate all instances of an auto-correlated feature, which reduces the runtime of the data generation phase.