As the most active open-source project in the big data community, Apache Spark™ has become the de facto standard for big data processing and analytics. Spark’s ease of use, versatility, and speed have changed the way that teams solve data problems, and that has fostered an ecosystem of technologies around it, including Delta Lake for reliable data lakes, MLflow for the machine learning lifecycle, and Koalas for bringing the pandas API to Spark.
We’re proud to share the complete text of O’Reilly’s new Learning Spark, 2nd Edition with you. It includes the latest updates on new features from the Apache Spark 3.0 release, to help you:
- Learn the Python, SQL, Scala, or Java high-level APIs: DataFrames and Datasets
- Inspect, tune, and debug your Spark operations with Spark configurations and Spark UI
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow
- Use Spark and Koalas, the open source implementation of the pandas API on Spark, for data transformation and feature engineering
Learn more about the latest developments in Spark and its surrounding ecosystem, including Delta Lake, MLflow, and Koalas, in this free ebook.