GitHub - canimus/cuallee: A data quality acceleration library to get data sets verified in a friendly interface

Submitted by Style Pass, 2022-09-21 00:00:41

This library provides an intuitive API for describing checks on Apache PySpark DataFrames (v3.3.0). It is a pure-Python replacement for the pydeequ framework.

This implementation goes hand in hand with the latest PySpark API and uses the Observation API to collect metrics at a lower computational cost. When benchmarked against pydeequ, cuallee uses fewer than roughly 3k Java classes underneath and remarkably less memory.

This is a very fresh implementation using the Observation API in PySpark v3.3.0. The next round of validations on the roadmap will cover more practical use cases.

Apache License 2.0: free for commercial use, modification, distribution, patent use, and private use. Just preserve the copyright and license.