Understanding Spark Connect API - Part 2: Introduction to Architecture

In version 3.4, Apache Spark released a new client/server-based API called Spark Connect. This API will help improve how we develop and deploy Spark applications.

In this series of posts, we are going to explore the various functionalities exposed by the Spark Connect API. This is the second post in the series, where we discuss the architecture of Spark Connect. You can read all the posts in the series here.

The Spark Connect API is a gRPC-based API that runs as a server to connect Spark client applications with the Spark driver and cluster. As you can observe from the diagram, the Spark driver is no longer part of the client application. It now lives on the server side of the Spark Connect API.
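To make the client/server split concrete, here is a minimal sketch of connecting to a Spark Connect server from a client process. It assumes a server is already running locally, for example one started with the start-connect-server.sh script shipped in the Spark 3.4 distribution, and that it listens on the default endpoint sc://localhost:15002; both are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Assumes a Spark Connect server is already running locally
# (e.g. started via Spark's start-connect-server.sh) and reachable
# at the default endpoint below -- adjust host/port for your setup.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The returned session is a thin gRPC client; the actual Spark driver
# runs on the server side, so this prints the remote driver's version.
print(spark.version)
```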

As the Spark driver is no longer part of the client, how do we write Spark applications? For this, Spark now ships a separate client library that wraps the Spark Connect API in a familiar DataFrame-based API. This means Spark client applications no longer need a full Spark installation or all of its dependencies. We will explore this API in more detail in future posts.
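As a rough illustration of this thin client, the sketch below assumes only the PySpark package with the Connect extras is installed on the client machine (e.g. pip install "pyspark[connect]") and that a Spark Connect server is reachable at the assumed endpoint sc://localhost:15002.

```python
# Client side: only the PySpark client library is needed here
# (e.g. `pip install "pyspark[connect]"`); no full Spark installation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# These DataFrame calls are turned into logical plans by the client
# library, sent to the server over gRPC, and executed by the remote driver.
df.filter(df.age > 30).show()
```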

One of the challenges of introducing a new API is that it forces all client applications to be rewritten against it. Spark Connect smartly sidesteps this issue by exposing the same DataFrame/Dataset API as standard Spark code and internally converting those API calls into gRPC calls. This helps application developers tremendously, as they don't need to learn yet another API and can reuse all their existing code.
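As a sketch of this reuse, the function below contains ordinary DataFrame logic and is not specific to Spark Connect. Assuming a server is reachable at sc://localhost:15002 (an illustrative endpoint), the only line that changes when moving from a classic in-process session to Spark Connect is how the session is created.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def top_words(spark, lines):
    """Existing DataFrame logic: unchanged whether it runs against a
    classic in-process driver or a Spark Connect session."""
    df = spark.createDataFrame([(line,) for line in lines], ["line"])
    return (
        df.select(F.explode(F.split("line", " ")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(F.col("count").desc())
    )

# Classic mode: driver runs inside the client process.
# spark = SparkSession.builder.master("local[*]").getOrCreate()

# Spark Connect: only the session construction changes.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
top_words(spark, ["spark connect api", "spark driver"]).show()
```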
