developments for delta lake

at tresata we have been using and supporting the delta open source format since 2019 (the year it was open sourced). for us it has been more or less parquet+, i.e. the parquet format with some added benefits. the main benefit to us is better and safer support for concurrent access to the same table (multiple reads, read and write).

however delta also has its issues. for example, for the longest time they showed little interest in supporting dynamic partition overwrite mode (dpo), claiming their own support for replaceWhere was superior. however dpo is a feature we rely heavily on in our code base, and it cannot easily be replaced by replaceWhere. even if it could, this would make the delta format behave inconsistently with all other file-based formats, making it hard to swap out one format for another, which is something we definitely want to support. so we maintained our own internal fork of delta that did support dpo, which we eventually open sourced and turned into a pull request. this became the most up-voted pull request for delta, and i am glad to report that this month it finally got merged.
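
to make the difference concrete, here is a toy model in plain python. it does not use spark or delta: the dict-of-partitions table and the function names are made up for illustration. with dynamic partition overwrite (in spark, partitionOverwriteMode set to dynamic) the data being written decides which partitions get replaced. with replaceWhere the caller has to state a predicate up front, and this sketch assumes that rows falling outside that predicate are rejected.

```python
from collections import defaultdict


def partition_rows(rows, key):
    """Group rows (dicts) by the value of the partition column `key`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def overwrite_dynamic(table, rows, key):
    """Toy dynamic partition overwrite.

    Only partitions that appear in `rows` are replaced (or created);
    every other partition of `table` is left untouched.
    """
    result = dict(table)
    result.update(partition_rows(rows, key))
    return result


def overwrite_replace_where(table, rows, key, predicate):
    """Toy replaceWhere-style overwrite (hypothetical helper).

    The caller names the partitions to replace via `predicate`.
    Assumption of this sketch: incoming rows outside the predicate
    are an error rather than being silently written.
    """
    incoming = partition_rows(rows, key)
    stray = sorted(p for p in incoming if not predicate(p))
    if stray:
        raise ValueError(f"rows fall outside the replaceWhere predicate: {stray}")
    # Drop every partition the predicate selects, then add the new data.
    result = {p: r for p, r in table.items() if not predicate(p)}
    result.update(incoming)
    return result


# A table partitioned by day, and a new batch touching two days.
table = {
    "2022-06-01": [{"day": "2022-06-01", "value": 1}],
    "2022-06-02": [{"day": "2022-06-02", "value": 2}],
}
batch = [
    {"day": "2022-06-02", "value": 20},
    {"day": "2022-06-03", "value": 30},
]

# Dynamic overwrite needs no knowledge of which days are in the batch.
updated = overwrite_dynamic(table, batch, "day")
```

the point of the sketch is that the dynamic version takes no format-specific argument, so the calling code stays the same whatever file-based format sits underneath, while the replaceWhere version forces the caller to compute a predicate that matches the batch.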

one other thing that bugs me currently is that open source delta’s support for concurrent reads and writes is rather limited. this seems to come from the fact that the default isolation level is set to Serializable, which is pretty restrictive (even operations that blindly append data can conflict with other operations). moreover open source delta has no way to change this isolation level, while in commercial databricks on aws and azure the default isolation level is set to the more practical WriteSerializable. i don’t know what’s going on here, and unless i am missing something i think in open source delta we should be able to set the isolation level, and the default should probably be WriteSerializable. i plan to create a pull request for this soon.
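
as a rough illustration of why the isolation level matters for blind appends, here is a toy model in plain python. it is a simplification based on my reading of delta's conflict rules, not delta's actual conflict checker: the Commit class and the conflicts function are hypothetical, it assumes every read covers the whole table, and it only looks at files added by the commit that won the race (removed files and metadata changes are ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A transaction in the toy model (hypothetical, not a delta class)."""
    read_table: bool   # did it read existing data (e.g. update/delete/merge)?
    added_files: bool  # did it add new data files?

    @property
    def blind_append(self):
        # A blind append adds files without reading any existing data.
        return self.added_files and not self.read_table


def conflicts(current, winner, isolation):
    """Return True if `current` must abort because `winner` committed first.

    Simplified rule: a transaction that read the table conflicts with
    files added concurrently. Under WriteSerializable, files added by a
    blind append are exempt from that check; under Serializable they
    are not.
    """
    if isolation not in ("Serializable", "WriteSerializable"):
        raise ValueError(f"unknown isolation level: {isolation}")
    if not winner.added_files or not current.read_table:
        # Nothing was added, or `current` read nothing that could be stale.
        return False
    if isolation == "WriteSerializable" and winner.blind_append:
        return False
    return True


append = Commit(read_table=False, added_files=True)  # blind append
update = Commit(read_table=True, added_files=True)   # read, then rewrite

# The same race, judged under both isolation levels.
strict = conflicts(update, append, "Serializable")
relaxed = conflicts(update, append, "WriteSerializable")
```

in this model an update that races with a blind append is aborted under Serializable but goes through under WriteSerializable, which is the kind of difference that makes the latter more practical for append-heavy pipelines.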

author: koert (koert at tresata.com)

Written on June 29, 2022