%0 Journal Article
%A QIU Xue-song
%A WANG Jia-lu
%A WANG Zhi-li
%A YAN Yi-fei
%T A Shuffle Partition Optimization Scheme Based on Data Skew Model in Spark
%D
%R 10.13190/j.jbupt.2019-092
%J Journal of Beijing University of Posts and Telecommunications
%P 116-121
%V 43
%N 2
%X To address the uneven data distribution that arises during the shuffle phase of the Spark distributed platform, the reasons for Spark's low efficiency in processing skewed data are analyzed, and a skew model that uniformly quantifies the skew degree of key-value data after the shuffle is proposed. Based on this skew model, a shuffle partitioning scheme that can address various data skew problems on the Spark platform is proposed. First, the output of the Map stage is sampled and the size of the intermediate data is predicted; the sampled data is then pre-partitioned using a hash-based best-fit algorithm. Finally, all intermediate data are partitioned according to the resulting pre-partition plan. Experimental results under both key skew and value skew show that the proposed shuffle partitioning scheme is general and efficient, and can effectively handle both types of skew.
%U https://journal.bupt.edu.cn/EN/10.13190/j.jbupt.2019-092
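
The abstract describes a hash-based best-fit pre-partitioning step applied to sampled Map output before the full shuffle. Below is a minimal sketch of that idea, assuming the sampled key sizes are available as a map and the reduce partition count is fixed; names such as BestFitPrePartition and PrePlanPartitioner are illustrative and not taken from the paper.

```scala
import org.apache.spark.Partitioner

object BestFitPrePartition {
  // Greedy best-fit: walk keys in decreasing estimated size and place each one
  // on the currently lightest partition, so large (skewed) keys are spread out.
  def assign(sampleSizes: Map[String, Long], numPartitions: Int): Map[String, Int] = {
    val load = Array.fill(numPartitions)(0L)
    sampleSizes.toSeq.sortBy(-_._2).map { case (key, size) =>
      val target = load.zipWithIndex.minBy(_._1)._2
      load(target) += size
      key -> target
    }.toMap
  }
}

// Custom partitioner that follows the pre-partition plan for sampled keys and
// falls back to plain hash partitioning for keys not seen in the sample.
class PrePlanPartitioner(plan: Map[String, Int], parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    plan.getOrElse(key.toString, ((key.hashCode % parts) + parts) % parts)
}
```

Such a partitioner could be passed to operations like reduceByKey in place of the default HashPartitioner; this is only a sketch of the general technique, not the paper's exact algorithm.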