actio_python_utils.spark_functions.setup_spark
- actio_python_utils.spark_functions.setup_spark(cores='*', memory='1g', use_db=False, use_excel=False, use_glow=False, use_xml=False, show_console_progress=True, extra_options=None, extra_packages=None, postgresql_jdbc='/usr/share/java/postgresql-42.6.0.jar', excel_package='com.crealytics:spark-excel_2.12:3.3.1_0.18.7', glow_codec='io.projectglow.sql.util.BGZFCodec', glow_package='io.projectglow:glow-spark3_2.12:1.2.1', xml_package='com.databricks:spark-xml_2.12:0.15.0', spark_logging_level=40)
Configures and creates a PySpark session according to the supplied arguments.
- Parameters:
  - **cores** (`int | str`, default: `'*'`) – The number of cores to configure PySpark with.
  - **memory** (`str`, default: `'1g'`) – The amount of memory to configure PySpark with.
  - **use_db** (`bool`, default: `False`) – Configure PySpark to be able to query a database via JDBC.
  - **use_excel** (`bool`, default: `False`) – Configure PySpark to be able to parse Excel spreadsheets.
  - **use_glow** (`bool`, default: `False`) – Configure PySpark to use glow (e.g. to parse a VCF).
  - **use_xml** (`bool`, default: `False`) – Configure PySpark to be able to parse XML files.
  - **show_console_progress** (`bool`, default: `True`) – Configure PySpark to show console progress.
  - **extra_options** (`Optional[Iterable[tuple[str, str]]]`, default: `None`) – Any additional options to configure PySpark with.
  - **extra_packages** (`Optional[Iterable[str]]`, default: `None`) – Any additional packages for PySpark to load.
  - **postgresql_jdbc** (`str`, default: `'/usr/share/java/postgresql-42.6.0.jar'`) – The path to the PostgreSQL JDBC jar, used if `use_db` is specified.
  - **excel_package** (`str`, default: `'com.crealytics:spark-excel_2.12:3.3.1_0.18.7'`) – The name of the package PySpark needs to parse Excel spreadsheets.
  - **glow_codec** (`str`, default: `'io.projectglow.sql.util.BGZFCodec'`) – The name of the codec PySpark needs to load glow.
  - **glow_package** (`str`, default: `'io.projectglow:glow-spark3_2.12:1.2.1'`) – The name of the package PySpark needs to load glow.
  - **xml_package** (`str`, default: `'com.databricks:spark-xml_2.12:0.15.0'`) – The name of the package PySpark needs to parse XML files.
  - **spark_logging_level** (`int | str`, default: `40`) – The logging level to configure py4j and pyspark with.
- Return type:
SparkSession
- Returns:
The configured PySpark session.
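As an illustration of the kind of configuration these flags imply, the sketch below assembles Spark option key/value pairs from a subset of the parameters. This is a hypothetical helper, not the actual implementation of `setup_spark`; the helper name `build_spark_options` and the exact option keys chosen are assumptions, though `spark.master`, `spark.driver.memory`, `spark.jars`, and `spark.jars.packages` are standard Spark configuration properties.

```python
def build_spark_options(
    cores="*",
    memory="1g",
    use_db=False,
    use_glow=False,
    extra_packages=None,
    postgresql_jdbc="/usr/share/java/postgresql-42.6.0.jar",
    glow_codec="io.projectglow.sql.util.BGZFCodec",
    glow_package="io.projectglow:glow-spark3_2.12:1.2.1",
):
    """Collect Spark configuration pairs implied by the flags.

    Hypothetical sketch of the configuration setup_spark() presumably
    builds before creating the session; not the library's actual code.
    """
    options = {
        # Run locally on the requested number of cores ('*' means all)
        "spark.master": f"local[{cores}]",
        "spark.driver.memory": memory,
    }
    packages = list(extra_packages or [])
    if use_db:
        # A JDBC driver is supplied as a local jar rather than a Maven package
        options["spark.jars"] = postgresql_jdbc
    if use_glow:
        # glow needs both its Maven package and its BGZF compression codec
        packages.append(glow_package)
        options["spark.hadoop.io.compression.codecs"] = glow_codec
    if packages:
        # Maven coordinates are passed as a single comma-separated string
        options["spark.jars.packages"] = ",".join(packages)
    return options
```

Pairs like these would typically be applied via repeated `SparkSession.builder.config(key, value)` calls before `getOrCreate()`.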