Tuning Spark Performance

Authors

Zhang, Jenne

Abstract

Big data frameworks play a vital role in storing and processing large amounts of data, providing significant improvements in performance and availability. Spark is one of the most popular of these frameworks, offering high scalability and fault tolerance through its in-memory engine. Spark's execution engine exposes approximately 200 configurable parameters, and default values are assigned to them to hide this complexity from users and provide initial ease of use. However, the default values are not the best setting for every workload. In this work, we propose a general tuning algorithm named QST (Queen's Spark Tuning) to help users tune Spark and improve overall performance. First, we study Spark performance for a variety of workloads and identify 9 tunable parameters that have a significant impact on performance. Then, we propose QST, a general greedy iterative tuning algorithm for this set of 9 key parameters. QST classifies Spark workloads as memory-intensive, shuffle-intensive, or all-intensive and configures the parameters for each type of workload. We evaluate QST experimentally using benchmark workloads and industry workloads. In our experiments, QST yields an average speedup of 65% for our benchmark evaluation workloads and 57% for our industry evaluation workloads.
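
To make the phrase "greedy iterative tuning" concrete, the following is a minimal sketch of a generic greedy, one-parameter-at-a-time search. It is not the QST algorithm itself, whose details are not given in this abstract. The function names `greedy_tune` and `fake_runtime` are hypothetical, the three Spark settings are illustrative and are not claimed to be among the 9 parameters identified in the work, and the cost function is a synthetic stand-in for actually running and timing a Spark job.

```python
from typing import Any, Callable, Dict, List

Config = Dict[str, Any]


def greedy_tune(
    defaults: Config,
    candidates: Dict[str, List[Any]],
    measure: Callable[[Config], float],
    max_rounds: int = 3,
) -> Config:
    """Greedy one-parameter-at-a-time search (hypothetical helper).

    Starting from the defaults, try each candidate value of each parameter
    and keep a change only if it lowers the measured runtime. Repeat until
    a full round brings no improvement or max_rounds is reached.
    """
    best = dict(defaults)
    best_time = measure(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                if value == best[name]:
                    continue  # already the current setting
                trial = {**best, name: value}
                trial_time = measure(trial)
                if trial_time < best_time:
                    best, best_time, improved = trial, trial_time, True
        if not improved:
            break
    return best


def fake_runtime(cfg: Config) -> float:
    """Synthetic cost, assumed for illustration only; a real tuner would
    submit the workload with cfg and return its wall-clock time."""
    return (
        abs(cfg["spark.sql.shuffle.partitions"] - 400) / 100
        + abs(cfg["spark.memory.fraction"] - 0.7) * 10
        + (0.0 if cfg["spark.shuffle.compress"] else 1.0)
    )


# Illustrative starting point and search space (assumptions, not from the thesis).
defaults = {
    "spark.sql.shuffle.partitions": 200,
    "spark.memory.fraction": 0.6,
    "spark.shuffle.compress": True,
}
candidates = {
    "spark.sql.shuffle.partitions": [100, 200, 400, 800],
    "spark.memory.fraction": [0.5, 0.6, 0.7, 0.8],
    "spark.shuffle.compress": [True, False],
}

tuned = greedy_tune(defaults, candidates, fake_runtime)
print(tuned)
```

A greedy search of this kind needs far fewer runs than an exhaustive sweep over all value combinations, at the cost of possibly missing settings that only pay off when several parameters change together.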

Keywords

Apache Spark

Creative Commons license

Except where otherwise noted, this item's license is described as CC0 1.0 Universal