What is Gluten?

  • "Gluten" is Latin for "glue"
  • Adds native vectorized execution to Spark
  • Contributed by Intel and Kyligence in 2022

Photon

  • SIGMOD 2022 paper: "Photon: A Fast Query Engine for Lakehouse Systems" [1]
  • Not open source

Why do we need it?

  • Workloads have shifted from IO bound to CPU bound
  • JIT code generation is not enough
    • Spark 1.4: expression code generation
    • Spark 2.0: whole-stage code generation (replacing the Volcano iterator model)
  • Performance has improved at the query plan level, but not at the operator level
  • The JVM is a poor fit for CPU instruction-level optimization (such as SIMD)
  • Mature native engines already exist, such as Velox, ClickHouse, and Arrow
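
The contrast between row-at-a-time and vectorized execution can be sketched in a few lines. This is a toy illustration only: the `(price, quantity)` schema and the function names are assumptions made for the example, and pure Python gains no SIMD benefit. The point is the shape of the loops: the first version pulls one row through a chain of operators, while the second runs each operator as one tight loop over a whole column, which is the layout native engines can vectorize.

```python
from array import array
from typing import Iterator, List, Tuple

Row = Tuple[int, int]  # hypothetical schema: (price, quantity)


def volcano_sum(rows: List[Row], min_qty: int) -> int:
    """Row-at-a-time pipeline: scan -> filter -> project -> aggregate."""
    def scan() -> Iterator[Row]:
        yield from rows

    def filter_(child: Iterator[Row]) -> Iterator[Row]:
        for price, qty in child:
            if qty >= min_qty:
                yield price, qty

    def project(child: Iterator[Row]) -> Iterator[int]:
        for price, qty in child:
            yield price * qty

    # Every row travels through every operator before the next row starts.
    return sum(project(filter_(scan())))


def columnar_sum(price: array, qty: array, min_qty: int) -> int:
    """Batch-at-a-time: each operator is one loop over a whole column."""
    # Filter: build a selection vector of qualifying positions.
    selected = [i for i, q in enumerate(qty) if q >= min_qty]
    # Project: compute price * quantity for the selected positions.
    revenue = array("q", (price[i] * qty[i] for i in selected))
    # Aggregate: sum the projected column.
    return sum(revenue)
```

Both functions compute the same result; only the order in which data and operators are traversed differs.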



Spark Plugin

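As a rough sketch of what enabling the plugin looks like, the helper below assembles `--conf` arguments for `spark-submit`. `spark.plugins`, `spark.memory.offHeap.enabled`, `spark.memory.offHeap.size`, and `spark.shuffle.manager` are standard Spark settings. The two class names are assumptions recalled from the oap-project/gluten README [2] and have changed in later releases, so check them against the version you deploy. The helper names `gluten_conf` and `to_submit_args` and the default off-heap size are hypothetical.

```python
from typing import Dict, List


def gluten_conf(off_heap_size: str = "20g") -> Dict[str, str]:
    """Spark settings that a Gluten deployment typically needs (assumed values)."""
    return {
        # Assumption: plugin class name as used by oap-project/gluten.
        "spark.plugins": "io.glutenproject.GlutenPlugin",
        # Native operators allocate from Spark's off-heap memory pool.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": off_heap_size,
        # Assumption: class name of Gluten's columnar shuffle manager.
        "spark.shuffle.manager":
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a config dict into spark-submit command-line arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The Gluten jar for the chosen backend also has to be on the driver and executor classpath; that step is omitted here.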


Architecture



Design Goals

  • Transform Spark's whole-stage physical plan into a Substrait plan
  • Offload performance-critical data processing to native library
  • Define clear JNI interfaces for native libraries
  • Switch available native backends easily
  • Reuse Spark’s distributed control flow
  • Manage data sharing between JVM and native
  • Extensible to support more native accelerators

Plan Conversion & Fallback

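The idea of plan conversion with fallback can be sketched as a bottom-up tree rewrite: operators the native backend supports are marked columnar, unsupported ones stay on the JVM row-based path, and a transition node is inserted wherever the data format changes. This is a minimal sketch, not Gluten's implementation: the `NATIVE_SUPPORTED` set, the `Node` class, and the `PythonUDF` example operator are hypothetical, and no actual Substrait plan is produced.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical set of operators the native backend can execute.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class Node:
    op: str
    children: List["Node"] = field(default_factory=list)
    columnar: bool = False  # True if this node emits columnar batches


def convert(node: Node) -> Node:
    """Offload supported operators; insert transitions at format boundaries."""
    kids = [convert(child) for child in node.children]
    columnar = node.op in NATIVE_SUPPORTED
    adapted = []
    for kid in kids:
        if kid.columnar and not columnar:
            # Fallback operator consumes rows: convert batches to rows.
            kid = Node("ColumnarToRow", [kid], columnar=False)
        elif columnar and not kid.columnar:
            # Native operator consumes batches: convert rows to batches.
            kid = Node("RowToColumnar", [kid], columnar=True)
        adapted.append(kid)
    return Node(node.op, adapted, columnar)


def finalize(root: Node) -> Node:
    """The caller expects rows, so a columnar root gets a final transition."""
    root = convert(root)
    return Node("ColumnarToRow", [root], False) if root.columnar else root


def render(node: Node) -> str:
    inner = ", ".join(render(child) for child in node.children)
    return f"{node.op}({inner})" if inner else node.op
```

Each transition copies data between formats, so a plan with many small fallback islands can lose much of the benefit of offloading.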


Memory Management



Columnar Shuffle

  • Row-to-column conversion
  • In the shuffle read phase
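
The row-to-column step itself is a pivot: a stream of rows is cut into fixed-size chunks and each chunk is turned into one list per column. The sketch below shows only that pivot, under the assumption that rows arrive as plain tuples; the function name, the dict-of-lists batch layout, and the `batch_size` parameter are illustrative and do not reflect Gluten's serialized batch format.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple


def rows_to_columns(
    rows: Sequence[Tuple[Any, ...]],
    names: Sequence[str],
    batch_size: int,
) -> Iterator[Dict[str, List[Any]]]:
    """Yield column batches of at most batch_size rows each."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One list per column, in the same row order as the input chunk.
        yield {name: [row[i] for row in chunk] for i, name in enumerate(names)}
```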

Compatibility

  • Native side: a clear JNI interface
  • Spark side: a shim layer that absorbs differences between Spark versions
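
A shim layer is commonly built as a small registry keyed by the Spark major and minor version: shared code talks to one interface, and a version-specific implementation is picked at startup. The sketch below illustrates that pattern only. Every name in it (`Shim`, `register`, `load_shim`, the two shim classes) is hypothetical and does not mirror Gluten's actual shim classes.

```python
from typing import Callable, Dict, Tuple, Type


class Shim:
    """Interface the shared code depends on."""
    def name(self) -> str:
        raise NotImplementedError


_SHIMS: Dict[Tuple[int, int], Type[Shim]] = {}


def register(major: int, minor: int) -> Callable[[Type[Shim]], Type[Shim]]:
    """Class decorator that records a shim for one Spark major.minor line."""
    def wrap(cls: Type[Shim]) -> Type[Shim]:
        _SHIMS[(major, minor)] = cls
        return cls
    return wrap


@register(3, 2)
class Spark32Shim(Shim):
    def name(self) -> str:
        return "spark-3.2"


@register(3, 3)
class Spark33Shim(Shim):
    def name(self) -> str:
        return "spark-3.3"


def load_shim(version: str) -> Shim:
    """Pick the shim matching a version string such as '3.3.1'."""
    major, minor = (int(part) for part in version.split(".")[:2])
    try:
        return _SHIMS[(major, minor)]()
    except KeyError:
        raise ValueError(f"unsupported Spark version: {version}") from None
```

Patch releases share one shim in this sketch, so `load_shim("3.3.1")` and `load_shim("3.3.2")` resolve to the same class.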

Performance



Who is using it?


References

  1. https://cs.stanford.edu/people/matei/papers/2022/sigmod_photon.pdf
  2. https://github.com/oap-project/gluten
  3. https://cn.kyligence.io/blog/gluten-spark/
  4. https://github.com/facebookincubator/velox/
  5. https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e
  6. https://www.databricks.com/dataaisummit/session/best-exploration-columnar-shuffle-design/