actio_python_utils.spark_functions

Spark-related functionality.

Functions

convert_chromosome(self, current_column_name)

Return a PySpark dataframe in which the human chromosome names in current_column_name are cast to integers and stored in new_column_name (which defaults to overwriting the original column).
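
A minimal usage sketch, assuming the self parameter is the dataframe being operated on; the column name chrom, the sf alias, and the sample data are illustrative, not from the package:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # "X" is included as a non-numeric human chromosome; exactly how such
    # values are mapped to integers is not specified by this summary
    df = spark.createDataFrame([("1",), ("21",), ("X",)], ["chrom"])
    converted = sf.convert_chromosome(df, "chrom")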

convert_dicts_to_dataframe(self[, ...])

Convert either a list of dicts (dict_list) or a function that returns an iterator of dicts (iter_func) to a PySpark dataframe.
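
A hedged sketch: dict_list and iter_func are named in the summary, but treating them as keyword arguments and passing the SparkSession as self are assumptions:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    rows = [{"id": 1, "gene": "BRCA1"}, {"id": 2, "gene": "TP53"}]
    df = sf.convert_dicts_to_dataframe(spark, dict_list=rows)

    # or lazily, via a function returning an iterator of dicts
    def gen_rows():
        yield {"id": 3, "gene": "EGFR"}
    df2 = sf.convert_dicts_to_dataframe(spark, iter_func=gen_rows)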

count_columns_with_string(self[, string])

Return a PySpark dataframe with the number of times a given string occurs in each string column of the dataframe.
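
A sketch, assuming self is the dataframe to scan; the search term and sample data are illustrative:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("unknown", "a"), ("b", "unknown")], ["col1", "col2"]
    )
    # counts occurrences of "unknown" across every string-typed column
    sf.count_columns_with_string(df, string="unknown").show()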

count_distinct_values(self[, ...])

Return a new PySpark dataframe with the number of distinct values in each column.
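
A sketch under the same assumption that self is the dataframe:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "b")], ["x", "y"])
    # expect 2 distinct values in x and 2 in y; the output layout may vary
    sf.count_distinct_values(df).show()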

count_nulls(self)

Return a PySpark dataframe with the number of null values in each column of the dataframe.
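
A sketch with some nulls to count; the data is illustrative:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, None), (2, "b"), (None, None)], ["x", "y"]
    )
    # expect 1 null reported for x and 2 for y
    sf.count_nulls(df).show()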

load_dataframe(self, path[, format, ...])

Load and return the specified data source using PySpark.
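
A sketch; passing the session as self, the path, and the format value are all assumptions (further options are elided in the signature above):

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # path and format are illustrative placeholders
    df = sf.load_dataframe(spark, "data/variants.parquet", format="parquet")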

load_db_to_dataframe(self[, pgpass_record, ...])

Return a PySpark dataframe from either a relation or a query.
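
A sketch; only pgpass_record is named in the signature above, and the relation/query argument names are elided there, so they are deliberately not guessed at here:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # a .pgpass-style record: host:port:database:user:password
    df = sf.load_db_to_dataframe(
        spark, pgpass_record="localhost:5432:mydb:me:secret"
    )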

load_excel_to_dataframe(self, xl_fn[, ...])

Load and return the specified Excel spreadsheet with PySpark.
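
A sketch; the filename is illustrative, and passing the session as self is an assumption:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # reading Excel with Spark typically relies on an external connector
    # such as spark-excel; how this function does it is not specified here
    df = sf.load_excel_to_dataframe(spark, "samples.xlsx")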

load_xml_to_dataframe(self, xml_fn, row_tag)

Load and return the specified XML file with PySpark.
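
A sketch; the filename and tag are illustrative, and passing the session as self is an assumption:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # row_tag presumably names the XML element treated as one row,
    # as in the spark-xml convention
    df = sf.load_xml_to_dataframe(spark, "records.xml", "record")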

serialize_array_field(self, column, ...[, ...])

Serializes a pyspark.sql.types.ArrayType field for output.

serialize_bool_field(self, column, new_column)

Serializes a pyspark.sql.types.BooleanType field for output.

serialize_field(self, column[, new_column, ...])

Operates on a PySpark dataframe, converting any field of atoms or structs, or any non-nested array of either, to a properly formatted string for the PostgreSQL TEXT loading format, and assigns it the column name new_column.
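
A sketch of this generic entry point, again assuming self is the dataframe; the column names are illustrative:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2],), ([3],)], ["vals"])
    # produces a TEXT-format string column (PostgreSQL renders arrays as
    # e.g. {1,2}); the exact formatting is up to the implementation
    out = sf.serialize_field(df, "vals", new_column="vals_text")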

serialize_string_field(self, column, new_column)

Serializes a pyspark.sql.types.StringType field for output.

serialize_struct_field(self, column, ...[, ...])

Serializes a pyspark.sql.types.StructType field for output.

setup_spark([cores, memory, use_db, ...])

Configures and creates a PySpark session according to the supplied arguments.
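
A sketch; the cores/memory value formats are assumptions, and the remaining options are elided in the signature above:

    from actio_python_utils import spark_functions as sf

    # presumably returns the configured SparkSession
    spark = sf.setup_spark(cores=4, memory="8g", use_db=False)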

split_dataframe_to_csv_by_column_value(self, ...)

Split a dataframe with PySpark into a set of gzipped CSVs, one per distinct value of a given column; e.g., a dataframe with columns col1,col2,col3 and rows 1,1,1 / 1,2,3 / 2,1,1 split on col1 yields one CSV for col1=1 (the first two rows) and one for col1=2.
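
A sketch; every argument after self is elided in the signature above, so the positional column and output-path arguments below are pure guesses at the call shape:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1, 1), (1, 2, 3), (2, 1, 1)], ["col1", "col2", "col3"]
    )
    # intent: one gzipped CSV per distinct col1 value (col1=1 and col1=2)
    sf.split_dataframe_to_csv_by_column_value(df, "col1", "out/")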