Spark structured streaming — Checkpointing

Spark Structured Streaming is a great feature. As you might already know, it can be used to stream from files or, in our specific case, from Delta tables. But eventually a failure occurs, and when it does, we need the stream to resume from the same place it stopped. We certainly wouldn’t want to reread and reprocess years of data we’ve already processed. A stream needs fault tolerance, and for that it needs to answer the questions of identity (which stream is it?) and location (where is it reading from on the stream, which is continuous by nature?). So a way to reason about checkpoints might be: checkpoints are the “state” of your stream.

We initiate the actual stream when we trigger the writeStream, providing it (among other options) the `checkpointLocation`.
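
As a rough sketch (PySpark; the table paths, app name, and checkpoint directory are made up for illustration, and the session is assumed to already have the Delta Lake extensions configured), a Delta-to-Delta stream with a checkpoint might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-stream-example").getOrCreate()

# Read side: a streaming DataFrame over an existing Delta table (path is illustrative).
events = spark.readStream.format("delta").load("/data/events")

# Write side: the query only starts when writeStream is triggered with start().
# checkpointLocation is where Spark persists offsets and state, which is what
# lets a restarted query resume where it stopped instead of rereading everything.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events_stream")
    .outputMode("append")
    .start("/data/events_mirror")
)

query.awaitTermination()
```

If this query fails or is stopped and is later restarted with the same `checkpointLocation`, it picks up from the last committed offsets rather than reprocessing the whole source table.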

Note that for this specific stream format (a Delta table) we can also provide the `maxFilesPerTrigger` option. For file sources, this option controls the “maximum number of new files to be considered in every trigger” (the Kafka source has similar `maxOffsetsPerTrigger` semantics, and file sources also have `maxBytesPerTrigger`, which is a soft limit). Sometimes, when you start a stream from a pre-existing file source with lots of existing files, you’ll want to limit the number of files per trigger so your cluster can keep up, as I did. In our specific use case, I also needed to tune the maximum shuffle partitions for this job.
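
Here is a hedged sketch of that throttling on the read side (again PySpark; the path and the limit values are illustrative and should be tuned to your cluster and data volume):

```python
# Lower shuffle parallelism for this particular job (value is an example, not a recommendation).
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Throttled streaming read from the same Delta source.
throttled = (
    spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 100)   # at most 100 new files per micro-batch
    .option("maxBytesPerTrigger", "1g")  # soft cap on data volume per micro-batch
    .load("/data/events")
)
```

The same `writeStream` call shown earlier then applies unchanged; the rate limits only affect how much of the backlog each micro-batch picks up.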
