actio_python_utils.spark_functions.count_distinct_values

actio_python_utils.spark_functions.count_distinct_values(self, columns_to_ignore=set(), approximate=False)

Return a new PySpark dataframe containing the number of distinct values in each column. Uses pyspark.sql.functions.count_distinct() by default, or pyspark.sql.functions.approx_count_distinct() if approximate is True.

Parameters:
  • self (DataFrame) – The dataframe to summarize

  • columns_to_ignore (Container[str], default: set()) – Optional names of columns to exclude from the summary

  • approximate (bool, default: False) – Return approximate counts instead of exact ones (faster)

Return type:

DataFrame

Returns:

The new dataframe with counts of distinct values per column
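
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual source: the names columns_to_count and count_distinct_sketch are hypothetical, and the sketch assumes PySpark 3.2 or later (where pyspark.sql.functions.count_distinct() is available) and at least one column left to summarize.

```python
from typing import Container, List


def columns_to_count(
    all_columns: List[str], columns_to_ignore: Container[str] = frozenset()
) -> List[str]:
    """Hypothetical helper: keep the columns not listed in columns_to_ignore, in order."""
    return [c for c in all_columns if c not in columns_to_ignore]


def count_distinct_sketch(df, columns_to_ignore=frozenset(), approximate=False):
    """Hypothetical re-implementation of the documented behaviour (an assumption)."""
    # Imported lazily so the helper above works without a Spark installation.
    from pyspark.sql import functions as F

    # Pick the exact or the approximate aggregate, as the docstring describes.
    counter = F.approx_count_distinct if approximate else F.count_distinct
    exprs = [
        counter(c).alias(c)
        for c in columns_to_count(df.columns, columns_to_ignore)
    ]
    # A single aggregation yields one row holding one count per kept column.
    return df.agg(*exprs)


# Usage, assuming an active SparkSession named `spark`:
# df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "b")], ["id", "label"])
# count_distinct_sketch(df, columns_to_ignore={"id"}).show()
```

With the documented function itself, the equivalent call would presumably be count_distinct_values(df, columns_to_ignore={"id"}), following the signature shown above.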