actio_python_utils.spark_functions.setup_spark

actio_python_utils.spark_functions.setup_spark(cores='*', memory='1g', use_db=False, use_excel=False, use_glow=False, use_xml=False, show_console_progress=True, extra_options=None, extra_packages=None, postgresql_jdbc='/usr/share/java/postgresql-42.6.0.jar', excel_package='com.crealytics:spark-excel_2.12:3.3.1_0.18.7', glow_codec='io.projectglow.sql.util.BGZFCodec', glow_package='io.projectglow:glow-spark3_2.12:1.2.1', xml_package='com.databricks:spark-xml_2.12:0.15.0', spark_logging_level=40)

Configures and creates a PySpark session according to the supplied arguments.

Parameters:
  • cores (int | str, default: '*') – The number of cores to configure PySpark with; '*' uses all available cores

  • memory (str, default: '1g') – The amount of memory to configure PySpark with

  • use_db (bool, default: False) – Configure PySpark to be able to query a database via JDBC

  • use_excel (bool, default: False) – Configure PySpark to be able to parse Excel spreadsheets

  • use_glow (bool, default: False) – Configure PySpark to use glow (e.g. to parse a VCF)

  • use_xml (bool, default: False) – Configure PySpark to be able to parse XML files

  • show_console_progress (bool, default: True) – Configure PySpark to show console progress

  • extra_options (Optional[Iterable[tuple[str, str]]], default: None) – Any additional (option, value) pairs to configure PySpark with

  • extra_packages (Optional[Iterable[str]], default: None) – Any additional packages for PySpark to load

  • postgresql_jdbc (str, default: '/usr/share/java/postgresql-42.6.0.jar') – The path to the PostgreSQL JDBC jar, used when use_db is True

  • excel_package (str, default: 'com.crealytics:spark-excel_2.12:3.3.1_0.18.7') – The name of the package PySpark needs to parse Excel spreadsheets

  • glow_codec (str, default: 'io.projectglow.sql.util.BGZFCodec') – The name of the codec PySpark needs to load glow

  • glow_package (str, default: 'io.projectglow:glow-spark3_2.12:1.2.1') – The name of the package PySpark needs to load glow

  • xml_package (str, default: 'com.databricks:spark-xml_2.12:0.15.0') – The name of the package PySpark needs to parse XML files

  • spark_logging_level (int | str, default: 40) – The logging level to configure py4j and pyspark with (40 corresponds to logging.ERROR)

Return type:

SparkSession

Returns:

The configured PySpark session
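
Example:

A minimal usage sketch, assuming the package is importable and, for the JDBC case, that the PostgreSQL driver jar exists at the default postgresql_jdbc path; the spark.sql.shuffle.partitions key is an ordinary Spark option used here only to illustrate extra_options and is not required by this function:

    from actio_python_utils.spark_functions import setup_spark

    # Create a session with 4 cores and 8 GB of memory that can query a
    # database over JDBC and parse Excel spreadsheets; extra Spark options
    # are supplied as (option, value) pairs.
    spark = setup_spark(
        cores=4,
        memory="8g",
        use_db=True,
        use_excel=True,
        extra_options=[("spark.sql.shuffle.partitions", "64")],
    )

    # Confirm the returned SparkSession works with a trivial DataFrame.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()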