From Medium, 5 days ago
Data Quality on Spark, Part 4: Deequ
Deequ is a library built by Amazon on top of Apache Spark for defining "unit tests for data": checks that express and evaluate data quality in large datasets at scale. Besides "regular" checks and verifications, it ships some interesting features such as profiling, analyzers, and automatic suggestions, which will be demonstrated later in this post.

The main library is written in Scala, although a Python wrapper (PyDeequ) is also available. To keep the focus on a single implementation, the examples in this post are written in Scala.
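To make the idea of "unit tests for data" concrete, here is a minimal sketch of a Deequ verification run. It assumes Spark and Deequ are already on the classpath in matching versions; the `DeequSketch` object, the `Item` record, and its column names are hypothetical and exist only for this illustration.

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object DeequSketch {
  // Hypothetical record type, used only for this illustration.
  case class Item(id: Long, name: String, views: Long)

  // Builds a tiny in-memory dataset and runs a few constraints against it.
  def runChecks(spark: SparkSession): VerificationResult = {
    import spark.implicits._

    val data = Seq(
      Item(1L, "Thingy A", 0L),
      Item(2L, "Thingy B", 12L),
      Item(3L, "Thingy C", 5L)
    ).toDF()

    VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "basic item checks")
          .hasSize(_ == 3)        // exactly three rows expected
          .isComplete("id")       // no nulls in id
          .isUnique("id")         // no duplicate ids
          .isNonNegative("views") // views must not be negative
      )
      .run()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-sketch")
      .getOrCreate()
    try {
      val result = runChecks(spark)
      if (result.status == CheckStatus.Success) println("All checks passed")
      else println(s"Some checks did not pass: ${result.status}")
    } finally {
      spark.stop()
    }
  }
}
```

Each constraint is chained onto a single `Check`, and the overall `status` summarizes whether all of them held. In a real pipeline the DataFrame would come from a table or file rather than an in-memory sequence.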
Data science

































