List: Apache Spark | Curated by Jacek Laskowski

Jan 21, 2025

41 stories

Apache Spark
In State Farm Engineering Blog by State Farm Engineering
When the Spark Execution Plan Gets Too Big
By Hunter Mitchell
Jan 13

Alaukik Harsh
Optimizing Spark Aggregations: How We Slashed Runtime from 4 Hours to 40 Minutes by Fixing GroupBy…
Handling massive datasets efficiently is critical in big data processing, but it’s not uncommon to run into performance bottlenecks. This…
Dec 28, 2024

In Israeli Tech Radar by Yerachmiel Feltzman
DuckDB won’t replace Spark. Nor will Polars.
I’ll say it loud and clear: DuckDB won’t replace Spark. Nor will Polars. They will replace Pandas. Indeed they are already replacing…
Dec 3, 2024

In TDS Archive by Sergey Kotlov
Adopting Spark Connect
How we use a shared Spark server to make our Spark infrastructure more efficient
Nov 7, 2024

Archana Goyal
Adaptive Query Execution (AQE) in Apache Spark 4.0 : Revolutionizing Query Optimization
As big data processing advances, the demand for smarter and more efficient query optimization has never been greater.
Aug 25, 2024

In Data Engineer Things by Vu Trinh
I spent 6 hours learning how Apache Spark plans the execution for us.
Catalyst, Adaptive Query Execution, and how Airbnb leverages Spark 3.
Sep 11, 2024

Kaviprakash Selvaraj
Spark performance optimization in Databricks — A complete guide
In this article, we are going to deep dive into techniques of spark optimization in Databricks. This article is written based on the…
Aug 30, 2024

In Towards Dev by Avin Kohale
Spark — Beyond Basics: Data Skewness and its solution
Skewed data can really mess your code up without you knowing it. Read to learn more…
Jul 25, 2024

Daniel Mantovani
Apache Spark 4.0 Everything You Must Know
Spark Connect
Jun 29, 2024

Rishika Idnani
Solving data skewness in Spark with Salting
Data skewness refers to the non-uniform distribution of data in a dataset. Skewed data causes certain nodes/workers in a Spark cluster to…
Feb 23, 2023

Archana Goyal
Spark Series: Partition Discovery & Production Learning
My articles are open to everyone; non-member readers can read the full article by clicking this link.
Mar 3, 2023

Sephinreji
Spark Lineage vs DAG
Lineage:
Jan 16, 2023

In TDS Archive by Chengzhi Zhao
Deep Dive into Handling Apache Spark Data Skew
The Ultimate Guide To Handle Data Skew In Distributed Compute
Jan 3, 2023

In Getindata Blog by GetInData | Part of Xebia TechTeam
Apache Spark with Apache Iceberg — a way to boost your data pipeline performance and safety
SQL language was invented in 1970 and has powered databases for decades. It allows you not only to query the data, but also to modify it…
Oct 27, 2022

Flomin Ron
Optimizing slow Group By aggregations in Spark: From 20 Hours to 40 minutes
Nov 13, 2022

Atul Verma
Aggregator in Apache Spark
Since Spark 3.0, UserDefinedAggregateFunction (UDAF) is deprecated. “An Aggregator is similar to a UDAF, but the interface is expressed…
Oct 7, 2022

Subham Khandelwal
PySpark — The Magic of AQE Coalesce
With the introduction of Adaptive Query Execution (AQE) in Spark, there have been a lot of changes in terms of performance improvements. Bad…
Oct 13, 2022

In Sync Computing by Sync Computing
Sync Autotuner for Apache Spark — API Launch!
The Sync Autotuner API enables you to continuously monitor and tune your Apache Spark jobs at scale
Sep 19, 2022

Subham Khandelwal
PySpark - Create Data Frame from List or RDD on the fly
PySpark provides several popular methods to create DataFrames on the fly from RDDs and from iterables such as Python lists.
Oct 4, 2022

In The Startup by Tony Liu
Demystifying Spark’s Stream-Stream OUTER Join
Real-time analysis & recommendation is always a fancy idea in data science. Nothing is more exciting than serving the most up-to-date…
Dec 11, 2020

Jacek Laskowski