BORDER: Efficient Computation of Boundary Points
|
This work addresses the problem of finding boundary points in multidimensional data sets. Boundary points are data points that are located at the margin of densely distributed data such as a cluster. We describe a novel approach called BORDER (a BOundaRy points DEtectoR) to detect such points. BORDER employs the state-of-the-art database technique—the Gorder kNN join and makes use of the special property of the reverse k nearest neighbor (RkNN). Experimental studies on data sets with varying characteristics indicate that BORDER is able to detect the boundary points effectively and efficiently.
[1] 289 C.C. Aggarwal and P.S. Yu, “Outlier Detection for High Dimensional Data,” Proc. SIGMOD, 2001.
[2] V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley and Sons, 1994.
[3] M. Basseville and I.V. Nikiforov, Detection of Abrupt Changes. PTR Prentice Hall, 1993.
[4] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles,” Proc. ACM SIGMOD, pp. 322-331, 1990.
[5] C. Bohm, S. Berchtold, and D.A. Keim, “Searching in High Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases,” ACM Computing Surveys, vol. 33, no. 3, pp. 322-373, 2001.
[6] C. Bohm and F. Krebs, “The k-Nearest Neighbor Join: Turbo Charging the KDD Process,” Knowledge and Information Systems (KAIS), vol. 6, no. 6, pp. 728-749, 2004.
[7] C. Bohm and F. Krebs, “Supporting KDD Applications by the k-Nearest Neighbor Join,” Proc. Int'l Conf. Database and Expert Systems Applications, pp. 504-516, 2003.
[8] C. Bohm and H.-P. Kriegel, “A Cost Model and Index Architecture for the Similarity Join,” Proc. Int'l Conf. Data Eng., pp. 411-420, 2001.
[9] M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, “Lof: Identifying Density-Based Local Outliers,” Proc. ACM SIGMOD, pp. 93-104, 2000.
[10] B.E. Brodsky and B.S. Darkhovsky, Nonparametric Methods in Change-Point Problems. Kluwer Academic Publishers, 1993.
[11] K. Chakrabarti and S. Mehrotra, “Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces,” Proc. Int'l Conf. Very Large Data Bases, pp. 89-100, 2000.
[12] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proc. SIGKDD, pp. 226-231, 1996.
[13] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
[14] A. Guttman, “R-Trees: A Dynamic Index Structure for Spatial Searching,” Proc. ACM SIGMOD, pp. 47-57, 1984.
[15] J. Han and M. Kamber, Data Mining Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[16] W. Hsu, M.-L. Lee, B.C. Ooi, P.K. Mohanty, K.L. Teo, and C. Xia, “Advanced Database Technologies in a Diabetic Healthcare System,” Proc. Int'l Conf. Very Large Data Bases, pp. 1059-1062, 2002.
[17] Data Mining Software in Java, http://www.cs.waikato.ac.nz/mlweka, 2005.
[18] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[19] H. Jin, B.C. Ooi, H.T. Shen, C. Yu, and A.Y. Zhou, “An Adaptive and Efficient Dimensionality Reduction Algorithm for High-Dimensional Indexing,” Proc. Int'l Conf. Data Eng., pp. 87-98, 2003.
[20] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: Hierarchical Clustering Using Dynamic Modeling,” Computer, vol. 32, no. 8, pp. 68-75, 1999.
[21] G. Kollios, D. Gunopulos, and V.J. Tsotras, “Nearest Neighbor Queries in a Mobile Environment,” Spatio-Temporal Database Management, pp. 119-134, 1999.
[22] F. Korn and S. Muthukrishnan, “Influence Sets Based on Reverse Nearest Neighbor Queries,” Proc. ACM SIGMOD, pp. 201-212, 2000.
[23] F. Korn, S. Muthukrishnan, and D. Srivastava, “Reverse Nearest Neighbor Aggregates over Data Streams,” Proc. Int'l Conf. Very Large Data Bases, 2002.
[24] L. Horvath and M. Csorgo, Limit Theorems in Change-Point Analysis. Wiley, 1997.
[25] A. Singh, H. Ferhatosmanoglu, and A.S. Tosun, “High Dimensional Reverse Nearest Neighbor Queries,” Proc. Conf. Information and Knowledge Management, pp. 91-98, 2003.
[26] M. Smid, “Closest Point Problems in Computational Geometry,” Handbook on Computational Geometry, Elsevier Science Publishing, 1997.
[27] I. Stanoi, D. Agrawal, and A.E. Abbadi, “Reverse Nearest Neighbor Queries for Dynamic Databases,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery, pp. 44-53, 2000.
[28] I. Stanoi, M. Riedewald, D. Agrawal, and A. El Abbadi, “Discovery of Influence Sets in Frequently Updated Databases,” Proc. Int'l Conf. Very Large Data Bases, pp. 99-108, 2001.
[29] Y. Tao, D. Papadias, and X. Lian, “Reverse KNN Search in Arbitrary Dimensionality,” Proc. Int'l Conf. Very Large Data Bases, pp. 744-755, 2004.
[30] C. Xia, H. Lu, B.C. Ooi, and J. Hu, “Gorder: An Efficient Method for KNN Join Processing,” Proc. Int'l Conf. Very Large Data Bases, 2004.
[31] C. Yang and K.-I. Lin, “An Index Structure for Efficient Reverse Nearest Neighbor Queries,” Proc. Int'l Conf. Data Eng., pp. 485-492, 2001.
Index Terms:
Boundary points, kNN join, k-nearest neighbor, reverse k-nearest neighbor.
Citation:
Chenyi Xia, Wynne Hsu, Mong Li Lee, Beng Chin Ooi, "BORDER: Efficient Computation of Boundary Points," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 289-303, Mar. 2006, doi:10.1109/TKDE.2006.38