why we are excited about spark 3.2

i wanted to use this post to summarize what is exciting about the spark 3.2.x release from the perspective of tresata.

background: tresata uses spark for analytics and machine learning on large amounts of data. we exclusively use the scala api (mostly dataframes plus a few rdds, no python or sql, although recently we did our first few tests with python). our main resource manager is yarn but we are actively testing kubernetes as an alternative or replacement.

highlights in spark 3.2 from our perspective

lots of progress on the adaptive execution front! in spark 3.2 adaptive execution is now turned on by default. we have had it turned on for a long time so for us this is not a change, but now a lot more people will start using it, which will hopefully lead to more bugs and performance regressions being uncovered and fixed. the community also made some great progress in terms of making adaptive execution more useful: it can now be applied to cached datasets (see the spark.sql.optimizer.canChangeCachedPlanOutputPartitioning setting). this is a pretty big deal given how common cached/persisted datasets are, at least for us.
a new push-based shuffle. spark performance is pretty much shuffle performance, with everything else being secondary, so i am always excited about any improvements to the shuffle. it's too bad this improvement only applies to yarn (and we had to do some hack using shading to get it going on older hadoop 2 clusters).
scala 2.13 finally! i have been googling to find any mentions of what performance is like for spark on scala 2.13 but haven’t found anything so far. we are still on scala 2.12 but we have now started cross-building for scala 2.13. i saw some mentions that the scala 2.13 collections have better performance, which is exciting. but on the other hand spark uses quite a lot of arrays and does a lot of conversions between arrays and seqs, and these conversions have become less efficient in scala 2.13 (where seq is now immutable by default). well, to be precise: these conversions have become less efficient if you want to use the same code for scala 2.12 and 2.13, which most projects will do.
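to make the array/seq point concrete, here is a minimal sketch in plain scala (no spark) that should compile on both 2.12 and 2.13. the object and method names are made up for illustration and this is not how spark itself handles it. the idea: code written against the default seq has to go through toSeq, which as far as i know only wraps the array on 2.12 but copies it on 2.13, while code typed against scala.collection.Seq can still wrap the array without a copy on both versions.

```scala
import scala.collection.{Seq => CSeq}

// Hypothetical helper object, for illustration only.
object ArraySeqConversion {
  // Typed against scala.collection.Seq: the implicit array wrappers in Predef
  // apply on both 2.12 and 2.13, so the array is wrapped and not copied.
  def wrap(arr: Array[Int]): CSeq[Int] = arr

  // Typed against the default Seq: on 2.12 this is scala.collection.Seq and
  // toSeq wraps the array; on 2.13 it is immutable.Seq and toSeq copies.
  // (2.13 offers immutable.ArraySeq.unsafeWrapArray to avoid the copy, but
  // that method does not exist on 2.12, so cross-built code cannot call it
  // without version-specific sources.)
  def convert(arr: Array[Int]): Seq[Int] = arr.toSeq

  def main(args: Array[String]): Unit = {
    val arr = Array(1, 2, 3)
    val wrapped = wrap(arr)
    val converted = convert(arr)
    arr(0) = 99 // mutate the array after both conversions
    // A view that shares the array sees the mutation, a copy does not.
    println(s"wrapped shares the array: ${wrapped.head == 99}")
    println(s"converted shares the array: ${converted.head == 99}")
  }
}
```

the second line printed depends on which scala version you compile with, which is exactly the difference described above. whether the extra copy matters in practice depends on how hot the code path is.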

not so great…

delta keeps breaking with every spark upgrade. yes delta is not technically part of spark itself, but when you upgrade spark and you also use delta you will know what i am talking about: delta compilation fails with the newer spark version like clockwork. it seems delta uses lots of internal spark apis, which is troublesome. a datasource should not have to do this. i expressed my frustration about this here.

sidenote about adaptive execution

some might wonder why adaptive execution matters so much to us… spark after all is pretty efficient at processing tasks with low overhead, so having lots of tasks is no big deal, right? well, that is true, but lots of tasks (which don’t necessarily each process any meaningful amount of data) means lots of part files and lots of pressure on the driver, and it also makes dynamic allocation less useful (since dynamic allocation monitors pending tasks as a measure of resources needed). did i mention we also like dynamic allocation a lot?
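for reference, here is a rough sketch of how these settings can be combined when building a session from scala. this is not our actual configuration: the app name, the toy aggregation and the output path argument are placeholders, and it assumes the job is launched with spark-submit (so the master comes from there) on a cluster where the external shuffle service is available. the deploy-level settings (dynamic allocation, shuffle service) are more commonly passed with --conf at submit time.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example job, for illustration only.
object AdaptiveSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aqe-sketch") // placeholder name
      // Adaptive execution; already on by default in Spark 3.2.
      .config("spark.sql.adaptive.enabled", "true")
      // Let adaptive execution merge small shuffle partitions into fewer tasks.
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      // Allow adaptive execution to change the partitioning of cached plans.
      .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
      // Dynamic allocation, assuming an external shuffle service on the cluster.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()

    import spark.implicits._

    // Toy aggregation with a shuffle: only 10 distinct keys, so most of the
    // default shuffle partitions would hold little or no data.
    val counts = spark.range(0L, 1000000L)
      .groupBy(($"id" % 10).as("bucket"))
      .count()
      .cache()

    // args(0) is a placeholder output path. With partition coalescing the
    // number of part files should end up far below the default shuffle
    // partition count.
    counts.write.mode("overwrite").parquet(args(0))

    spark.stop()
  }
}
```

fewer and larger tasks here means fewer part files and a pending-task count that better reflects the actual work, which is the signal dynamic allocation uses.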

author: koert (koert at tresata.com)

Written on March 23, 2022