Introduction to Spark¶

Feng Li¶

Guanghua School of Management¶

Peking University¶

feng.li@gsm.pku.edu.cn¶

Course home page: https://feng.li/bdcf¶

What is Spark¶

Spark is to distributed big-data computing what TensorFlow/PyTorch are to deep learning: the de-facto standard framework in its domain.

Why Spark¶

Speed¶

  • Run workloads up to 100x faster than Hadoop MapReduce.
  • Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine (see the sketch below).

[Figure: Spark speed comparison]
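To see the lazy DAG execution and the query optimizer in action, here is a minimal sketch, assuming a running SparkSession named spark:

df = spark.range(10**6)               # a distributed table with a single `id` column
even = df.filter(df.id % 2 == 0)      # transformations are lazy: nothing runs yet
print(even.count())                   # an action triggers the optimized DAG
even.explain()                        # print the physical plan chosen by the optimizer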

Programming & Usability¶

| Feature | Apache Spark | TensorFlow/PyTorch |
| --- | --- | --- |
| Languages supported | Scala, Java, Python, R | Python (main), C++, Java (limited) |
| Ease of use | High-level APIs (DataFrame, SQL) | PyTorch: easier for research; TensorFlow: more optimized for production |
| Learning curve | Moderate (requires understanding of distributed computing) | PyTorch: easier; TensorFlow: steeper (especially older versions) |
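To illustrate the high-level APIs in the table above, a minimal sketch, assuming a running SparkSession named spark, writing the same query with the DataFrame API and with SQL:

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# High-level DataFrame API ...
df.filter(df.age > 40).show()

# ... and the equivalent SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 40").show()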

Scalability & Performance¶

| Feature | Apache Spark | TensorFlow/PyTorch |
| --- | --- | --- |
| Scalability | Scales horizontally (across clusters) | Scales vertically (leveraging multiple GPUs/TPUs) |
| Distributed computing | Yes (native support for cluster computing) | Requires additional tools such as Horovod or PyTorch Distributed |
| Speed | Optimized for large-scale distributed data processing | Optimized for numerical operations and matrix multiplications |
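A minimal sketch of how horizontal scaling is configured when building a session; the master URL and resource numbers below are placeholders, and the exact configuration keys depend on the cluster manager:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://1.2.3.4:7077")            # a cluster master instead of local mode
    .config("spark.executor.instances", "8")   # scale out across machines
    .config("spark.executor.cores", "4")       # cores per executor
    .config("spark.executor.memory", "8g")     # memory per executor
    .appName("ScalingSketch")
    .getOrCreate())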

Machine Learning Support¶

| Feature | Apache Spark | TensorFlow/PyTorch |
| --- | --- | --- |
| Built-in ML | Yes (MLlib; MLflow integration) | No (but specialized for deep learning) |
| Deep learning support | Limited (via TensorFlowOnSpark, Deep Learning Pipelines) | Specialized for neural networks: CNNs, RNNs, transformers, etc. |
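A minimal MLlib sketch, assuming a running SparkSession named spark; the toy data below is made up for illustration:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([1.5]), 1.0),
     (Vectors.dense([2.0]), 1.0)],
    ["features", "label"])

model = LogisticRegression(maxIter=10).fit(train)   # distributed model fitting
model.transform(train).select("features", "prediction").show()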

Who created Spark¶

  • Spark began as a PhD student project at UC Berkeley.

  • Matei Zaharia was the original author, creating Spark during his PhD at UC Berkeley in 2009.

  • Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).

Ease of Use¶

  • Write applications quickly in Java, Scala, Python, R, and SQL.

  • Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.

  • DataFrames with pandas API support (see the sketch below)
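A minimal sketch of the pandas API on Spark (available as pyspark.pandas since Spark 3.2):

import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
print(psdf.describe())    # familiar pandas syntax, executed by Spark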

Generality¶

  • Combine SQL, streaming, and complex analytics.

  • Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
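As a small illustration of combining these libraries in one application, a minimal Structured Streaming sketch using the built-in rate test source, assuming a running SparkSession named spark:

from pyspark.sql.functions import window

stream = (spark.readStream.format("rate")   # test source emitting timestamped rows
    .option("rowsPerSecond", 10)
    .load())

counts = stream.groupBy(window("timestamp", "5 seconds")).count()

query = (counts.writeStream
    .outputMode("complete")
    .format("console")
    .start())
query.awaitTermination(15)                  # let it run briefly
query.stop()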

Runs Everywhere¶

  • Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

  • You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

[Figure: Spark runs everywhere]
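A minimal sketch of how the same application code can target different cluster managers; the host addresses below are placeholders:

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("MasterSketch")

# builder.master("local[*]")                     # all local cores
# builder.master("spark://1.2.3.4:7077")         # standalone cluster
# builder.master("yarn")                         # Hadoop YARN
# builder.master("k8s://https://1.2.3.4:6443")   # Kubernetes

spark = builder.master("local[*]").getOrCreate()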

Spark architecture¶

[Figure: Spark architecture]

Spark Built-in Libraries:¶

  • SQL and DataFrames
  • Spark Streaming
  • MLlib (RDD-based) and ML (DataFrame-based) machine learning
  • GraphX (graph)

Run a Python application on a Spark cluster¶

PYSPARK_PYTHON=python3.12 spark-submit \
    --master spark://1.2.3.4:7077 \
    examples/src/main/python/pi.py \
    1000
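For reference, a minimal sketch of what the pi.py example computes: a Monte Carlo estimate of π, assuming a running SparkSession named spark:

import random

n = 100000 * 10                      # total random samples

def inside(_):
    # draw a point in the unit square; check whether it falls inside the quarter circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

count = (spark.sparkContext
    .parallelize(range(n), 10)       # distribute the samples over 10 partitions
    .map(inside)
    .reduce(lambda a, b: a + b))
print("Pi is roughly", 4.0 * count / n)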

Interactively Run Spark via Pyspark¶

  • It is also possible to launch the PySpark shell. Set the PYSPARK_PYTHON environment variable to select the appropriate Python interpreter when running the pyspark command:

    PYSPARK_PYTHON=python3.12 pyspark
    

Run Spark interactively within Jupyter Notebook¶

  • You could use Spark as a Python module

  • On PKU HPC, you can access it by creating your own notebook at

    https://scow-jx2.pku.edu.cn/apps/jx2/createApps

  • Then you could import the pyspark module

Install Python modules on PKU HPC¶

  • The compute nodes (where the Jupyter Notebook runs) do not have internet access

  • One has to install everything from the login node, using the matching version of pip

    • In Jupyter, use os.path.dirname(sys.executable) to locate the directory that contains pip. It returns something like

      /nfs-share/software/anaconda/2020.02/envs/python3.12/bin
    • On the login node, run something like

      /nfs-share/software/anaconda/2020.02/envs/python3.12/bin/pip install statsmodels
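A minimal sketch of the first step, run from inside Jupyter:

import os, sys

bin_dir = os.path.dirname(sys.executable)   # the environment's bin directory
print(bin_dir)                              # e.g. /nfs-share/software/anaconda/2020.02/envs/python3.12/bin
print(os.path.join(bin_dir, "pip"))         # the matching pip to call from the login node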
In [1]:
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder \
    .config("spark.ui.enabled", "false")  \
    .appName("MyPySparkApp") \
    .getOrCreate()

# Check if Spark is running
print(spark.version)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/23 21:40:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
3.5.4
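As the log message above suggests, the logging verbosity can be adjusted from the SparkContext:

spark.sparkContext.setLogLevel("ERROR")   # show only errors from now on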
In [2]:
spark
Out[2]:

SparkSession - in-memory

SparkContext (Spark UI)

Version: v3.5.4
Master: local[*]
AppName: MyPySparkApp

Stop a Spark session¶

spark.stop()