actio_python_utils.spark_functions

Spark-related functionality.

Functions

convert_chromosome(self, current_column_name)

Return a PySpark dataframe in which the human chromosome names in current_column_name are cast to integers and stored in new_column_name (which defaults to overwriting the original column).
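
A minimal usage sketch, assuming the self parameter is the dataframe being operated on; the column name chrom, the sf alias, and the sample data are illustrative, not from the package:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # "X" is included as a non-numeric human chromosome; exactly how such
    # values are mapped to integers is not specified by this summary
    df = spark.createDataFrame([("1",), ("21",), ("X",)], ["chrom"])
    converted = sf.convert_chromosome(df, "chrom")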

convert_dicts_to_dataframe(self[, ...])

Convert either a list of dicts (dict_list) or a function that returns an iterator of dicts (iter_func) to a PySpark dataframe.
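
A hedged sketch: dict_list and iter_func are named in the summary, but treating them as keyword arguments and passing the SparkSession as self are assumptions:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    rows = [{"id": 1, "gene": "BRCA1"}, {"id": 2, "gene": "TP53"}]
    df = sf.convert_dicts_to_dataframe(spark, dict_list=rows)

    # or lazily, via a function returning an iterator of dicts
    def gen_rows():
        yield {"id": 3, "gene": "EGFR"}
    df2 = sf.convert_dicts_to_dataframe(spark, iter_func=gen_rows)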

count_columns_with_string(self[, string])

Return a PySpark dataframe with the number of times a given string occurs in each string column of the dataframe.
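
A sketch, assuming self is the dataframe to scan; the search term and sample data are illustrative:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("unknown", "a"), ("b", "unknown")], ["col1", "col2"]
    )
    # counts occurrences of "unknown" across every string-typed column
    sf.count_columns_with_string(df, string="unknown").show()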

count_distinct_values(self[, ...])

Return a new PySpark dataframe with the number of distinct values in each column.
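
A sketch under the same assumption that self is the dataframe:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "b")], ["x", "y"])
    # expect 2 distinct values in x and 2 in y; the output layout may vary
    sf.count_distinct_values(df).show()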

count_nulls(self)

Return a PySpark dataframe with the number of null values in each column of the dataframe.
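
A sketch with some nulls to count; the data is illustrative:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, None), (2, "b"), (None, None)], ["x", "y"]
    )
    # expect 1 null reported for x and 2 for y
    sf.count_nulls(df).show()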

load_dataframe(self, path[, format, ...])

Load and return the specified data source using PySpark.
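
A sketch; passing the session as self, the path, and the format value are all assumptions (further options are elided in the signature above):

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # path and format are illustrative placeholders
    df = sf.load_dataframe(spark, "data/variants.parquet", format="parquet")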

load_db_to_dataframe(self[, pgpass_record, ...])

Return a PySpark dataframe from either a relation or a query.
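
A sketch; only pgpass_record is named in the signature above, and the relation/query argument names are elided there, so they are deliberately not guessed at here:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # a .pgpass-style record: host:port:database:user:password
    df = sf.load_db_to_dataframe(
        spark, pgpass_record="localhost:5432:mydb:me:secret"
    )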

load_excel_to_dataframe(self, xl_fn[, ...])

Load and return the specified Excel spreadsheet with PySpark.
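
A sketch; the filename is illustrative, and passing the session as self is an assumption:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # reading Excel with Spark typically relies on an external connector
    # such as spark-excel; how this function does it is not specified here
    df = sf.load_excel_to_dataframe(spark, "samples.xlsx")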

load_xml_to_dataframe(self, xml_fn, row_tag)

Load and return the specified XML file with PySpark.
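
A sketch; the filename and tag are illustrative, and passing the session as self is an assumption:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    # row_tag presumably names the XML element treated as one row,
    # as in the spark-xml convention
    df = sf.load_xml_to_dataframe(spark, "records.xml", "record")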

serialize_array_field(self, column, ...[, ...])

Serializes a pyspark.sql.types.ArrayType field for output.

serialize_bool_field(self, column, new_column)

Serializes a pyspark.sql.types.BooleanType field for output.

serialize_field(self, column[, new_column, ...])

Operates on a PySpark dataframe, converting any field of atoms or structs, or any non-nested array of either, to a properly formatted string for the PostgreSQL TEXT loading format, and assigns it the column name new_column.
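
A sketch of this generic entry point, again assuming self is the dataframe; the column names are illustrative:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2],), ([3],)], ["vals"])
    # produces a TEXT-format string column (PostgreSQL renders arrays as
    # e.g. {1,2}); the exact formatting is up to the implementation
    out = sf.serialize_field(df, "vals", new_column="vals_text")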

serialize_string_field(self, column, new_column)

Serializes a pyspark.sql.types.StringType field for output.

serialize_struct_field(self, column, ...[, ...])

Serializes a pyspark.sql.types.StructType field for output.

setup_spark([cores, memory, use_db, ...])

Configures and creates a PySpark session according to the supplied arguments.
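
A sketch; the cores/memory value formats are assumptions, and the remaining options are elided in the signature above:

    from actio_python_utils import spark_functions as sf

    # presumably returns the configured SparkSession
    spark = sf.setup_spark(cores=4, memory="8g", use_db=False)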

split_dataframe_to_csv_by_column_value(self, ...)

Split a dataframe with PySpark into a set of gzipped CSVs, one per distinct value of a given column; e.g., a dataframe with columns col1,col2,col3 and rows 1,1,1 / 1,2,3 / 2,1,1 split on col1 yields one CSV for col1=1 (the first two rows) and one for col1=2.
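
A sketch; every argument after self is elided in the signature above, so the positional column and output-path arguments below are pure guesses at the call shape:

    from pyspark.sql import SparkSession
    from actio_python_utils import spark_functions as sf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1, 1), (1, 2, 3), (2, 1, 1)], ["col1", "col2", "col3"]
    )
    # intent: one gzipped CSV per distinct col1 value (col1=1 and col1=2)
    sf.split_dataframe_to_csv_by_column_value(df, "col1", "out/")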