%0 Journal Article
%A HE Qian
%A HUANG Huan
%A LI Shuang-fu
%A XU Hong
%T A Fast Clustering Algorithm for Massive Data
%D 2020
%R 10.13190/j.jbupt.2019-078
%J Journal of Beijing University of Posts and Telecommunications
%P 118-124
%V 43
%N 3
%X To meet the requirements of massive data processing, a grid-based <em>K</em>-means fast clustering algorithm (SPGK) is proposed. An algorithm for selecting the optimal initial cluster centers and the number of clusters is presented. The clusters are meshed into grids, and the centroid of each grid is computed. These centroids are used as the sample points for <em>K</em>-means clustering, thereby reducing the number of Euclidean distance calculations required by <em>K</em>-means. SPGK is parallelized on the Spark platform, which further improves its running efficiency. SPGK not only achieves good clustering quality but also greatly reduces the number of Euclidean distance calculations, making it suitable for fast clustering of massive data. Experiments on 10 million data points show that SPGK clearly outperforms the existing <em>K</em>-means++ and recursive-partition-based <em>K</em>-means clustering algorithms.
%U https://journal.bupt.edu.cn/EN/10.13190/j.jbupt.2019-078