actio_python_utils.spark_functions.setup_spark

actio_python_utils.spark_functions.setup_spark(cores='*', memory='1g', use_db=False, use_excel=False, use_glow=False, use_xml=False, show_console_progress=True, extra_options=None, extra_packages=None, postgresql_jdbc='/usr/share/java/postgresql-42.6.0.jar', excel_package='com.crealytics:spark-excel_2.12:3.3.1_0.18.7', glow_codec='io.projectglow.sql.util.BGZFCodec', glow_package='io.projectglow:glow-spark3_2.12:1.2.1', xml_package='com.databricks:spark-xml_2.12:0.15.0', spark_logging_level=40)

Configures and creates a PySpark session according to the supplied arguments.

Parameters:
  • cores (int | str, default: '*') – The number of cores to configure PySpark with; '*' uses all available cores

  • memory (str, default: '1g') – The amount of memory to configure PySpark with

  • use_db (bool, default: False) – Configure PySpark to be able to query a database via JDBC

  • use_excel (bool, default: False) – Configure PySpark to be able to parse Excel spreadsheets

  • use_glow (bool, default: False) – Configure PySpark to use glow (e.g. to parse a VCF)

  • use_xml (bool, default: False) – Configure PySpark to be able to parse XML files

  • show_console_progress (bool, default: True) – Configure PySpark to show console progress

  • extra_options (Optional[Iterable[tuple[str, str]]], default: None) – Any additional (option, value) pairs to configure PySpark with

  • extra_packages (Optional[Iterable[str]], default: None) – Any additional packages for PySpark to load

  • postgresql_jdbc (str, default: '/usr/share/java/postgresql-42.6.0.jar') – The path to the PostgreSQL JDBC jar, used when use_db is True

  • excel_package (str, default: 'com.crealytics:spark-excel_2.12:3.3.1_0.18.7') – The name of the package PySpark needs to parse Excel spreadsheets

  • glow_codec (str, default: 'io.projectglow.sql.util.BGZFCodec') – The name of the codec PySpark needs to load glow

  • glow_package (str, default: 'io.projectglow:glow-spark3_2.12:1.2.1') – The name of the package PySpark needs to load glow

  • xml_package (str, default: 'com.databricks:spark-xml_2.12:0.15.0') – The name of the package PySpark needs to parse XML files

  • spark_logging_level (int | str, default: 40) – The logging level to configure py4j and pyspark with (40 corresponds to logging.ERROR)

Return type:

SparkSession

Returns:

The configured PySpark session
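
Example:

A minimal usage sketch, assuming the package is importable and, for the JDBC case, that the PostgreSQL driver jar exists at the default postgresql_jdbc path; the spark.sql.shuffle.partitions key is an ordinary Spark option used here only to illustrate extra_options and is not required by this function:

    from actio_python_utils.spark_functions import setup_spark

    # Create a session with 4 cores and 8 GB of memory that can query a
    # database over JDBC and parse Excel spreadsheets; extra Spark options
    # are supplied as (option, value) pairs.
    spark = setup_spark(
        cores=4,
        memory="8g",
        use_db=True,
        use_excel=True,
        extra_options=[("spark.sql.shuffle.partitions", "64")],
    )

    # Confirm the returned SparkSession works with a trivial DataFrame.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()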