actio_python_utils.spark_functions.setup_spark¶
- actio_python_utils.spark_functions.setup_spark(cores='*', memory='1g', use_db=False, use_excel=False, use_glow=False, use_xml=False, show_console_progress=True, extra_options=None, extra_packages=None, postgresql_jdbc='/usr/share/java/postgresql-42.6.0.jar', excel_package='com.crealytics:spark-excel_2.12:3.3.1_0.18.7', glow_codec='io.projectglow.sql.util.BGZFCodec', glow_package='io.projectglow:glow-spark3_2.12:1.2.1', xml_package='com.databricks:spark-xml_2.12:0.15.0', spark_logging_level=40)[source]¶
Configures and creates a PySpark session according to the supplied arguments
- Parameters:
  - cores (int | str, default: '*') – The number of cores to configure PySpark with
  - memory (str, default: '1g') – The amount of memory to configure PySpark with
  - use_db (bool, default: False) – Configure PySpark to be able to query a database via JDBC
  - use_excel (bool, default: False) – Configure PySpark to be able to parse Excel spreadsheets
  - use_glow (bool, default: False) – Configure PySpark to use glow (e.g. to parse a VCF)
  - use_xml (bool, default: False) – Configure PySpark to be able to parse XML files
  - show_console_progress (bool, default: True) – Configure PySpark to show console progress
  - extra_options (Optional[Iterable[tuple[str, str]]], default: None) – Any additional options to configure PySpark with
  - extra_packages (Optional[Iterable[str]], default: None) – Any additional packages for PySpark to load
  - postgresql_jdbc (str, default: '/usr/share/java/postgresql-42.6.0.jar') – The path to the PostgreSQL JDBC jar, used if use_db is specified
  - excel_package (str, default: 'com.crealytics:spark-excel_2.12:3.3.1_0.18.7') – The name of the package PySpark needs to parse Excel spreadsheets
  - glow_codec (str, default: 'io.projectglow.sql.util.BGZFCodec') – The name of the codec PySpark needs to load glow
  - glow_package (str, default: 'io.projectglow:glow-spark3_2.12:1.2.1') – The name of the package PySpark needs to load glow
  - xml_package (str, default: 'com.databricks:spark-xml_2.12:0.15.0') – The name of the package PySpark needs to parse XML files
  - spark_logging_level (int | str, default: 40) – The logging level to configure py4j and pyspark with
- Return type:
  SparkSession
- Returns:
  The configured PySpark session
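To illustrate how these arguments could translate into Spark configuration, here is a minimal sketch. It is an assumption about the mapping, not the actual implementation: the function name `build_spark_config`, the `local[...]` master string, and the specific option keys are hypothetical, chosen to mirror common PySpark configuration conventions (`spark.jars` for a local JDBC jar, `spark.jars.packages` for Maven coordinates).

```python
# Hypothetical sketch of how setup_spark's arguments might map to Spark
# configuration options; the real setup_spark may differ.

def build_spark_config(
    cores="*",
    memory="1g",
    use_db=False,
    use_excel=False,
    use_glow=False,
    use_xml=False,
    show_console_progress=True,
    extra_options=None,
    extra_packages=None,
    postgresql_jdbc="/usr/share/java/postgresql-42.6.0.jar",
    excel_package="com.crealytics:spark-excel_2.12:3.3.1_0.18.7",
    glow_codec="io.projectglow.sql.util.BGZFCodec",
    glow_package="io.projectglow:glow-spark3_2.12:1.2.1",
    xml_package="com.databricks:spark-xml_2.12:0.15.0",
):
    # Base options: local master with the requested core count,
    # driver memory, and console progress toggle.
    options = {
        "spark.master": f"local[{cores}]",
        "spark.driver.memory": memory,
        "spark.ui.showConsoleProgress": str(show_console_progress).lower(),
    }
    # Collect Maven package coordinates for the requested features.
    packages = list(extra_packages or [])
    if use_db:
        # A local jar path goes on spark.jars rather than spark.jars.packages.
        options["spark.jars"] = postgresql_jdbc
    if use_excel:
        packages.append(excel_package)
    if use_glow:
        packages.append(glow_package)
        # Glow needs its BGZF codec registered to read bgzipped VCFs.
        options["spark.hadoop.io.compression.codecs"] = glow_codec
    if use_xml:
        packages.append(xml_package)
    if packages:
        options["spark.jars.packages"] = ",".join(packages)
    # Caller-supplied (key, value) pairs override or extend the defaults.
    options.update(extra_options or [])
    return options


config = build_spark_config(cores=4, memory="4g", use_db=True, use_glow=True)
```

The resulting dict would then be fed into `SparkSession.builder.config(...)` calls before `getOrCreate()`.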