State Farm Engineering, in State Farm Engineering Blog, "When the Spark Execution Plan Gets Too Big." By Hunter Mitchell. (Jan 13)
Alaukik Harsh, "Optimizing Spark Aggregations: How We Slashed Runtime from 4 Hours to 40 Minutes by Fixing GroupBy…" Handling massive datasets efficiently is critical in big data processing, but it’s not uncommon to run into performance bottlenecks. This… (Dec 28, 2024)
Yerachmiel Feltzman, in Israeli Tech Radar, "DuckDB won’t replace Spark. Nor will Polars." I’ll say it loud and clear: DuckDB won’t replace Spark. Nor will Polars. They will replace Pandas. Indeed they are already replacing… (Dec 3, 2024)
Sergey Kotlov, in TDS Archive, "Adopting Spark Connect." How we use a shared Spark server to make our Spark infrastructure more efficient. (Nov 7, 2024)
Archana Goyal, "Adaptive Query Execution (AQE) in Apache Spark 4.0: Revolutionizing Query Optimization." As big data processing advances, the demand for smarter and more efficient query optimization has never been greater. (Aug 25, 2024)
Vu Trinh, in Data Engineer Things, "I spent 6 hours learning how Apache Spark plans the execution for us." Catalyst, Adaptive Query Execution, and how Airbnb leverages Spark 3. (Sep 11, 2024)
Kaviprakash Selvaraj, "Spark performance optimization in Databricks — A complete guide." In this article, we are going to take a deep dive into Spark optimization techniques in Databricks. This article is written based on the… (Aug 30, 2024)
Avin Kohale, in Towards Dev, "Spark — Beyond Basics: Data Skewness and its solution." Skewed data can really mess your code up without you knowing it. Read to learn more… (Jul 25, 2024)
Rishika Idnani, "Solving data skewness in Spark with Salting." Data skewness refers to the non-uniform distribution of data in a dataset. Skewed data causes certain nodes/workers in a Spark cluster to… (Feb 23, 2023)
Archana Goyal, "Spark Series: Partition Discovery & Production Learning." (Mar 3, 2023)
Chengzhi Zhao, in TDS Archive, "Deep Dive into Handling Apache Spark Data Skew." The Ultimate Guide To Handle Data Skew In Distributed Compute. (Jan 3, 2023)
GetInData | Part of Xebia TechTeam, in Getindata Blog, "Apache Spark with Apache Iceberg — a way to boost your data pipeline performance and safety." SQL language was invented in 1970 and has powered databases for decades. It allows you not only to query the data, but also to modify it… (Oct 27, 2022)
Flomin Ron, "Optimizing slow Group By aggregations in Spark: From 20 Hours to 40 minutes." (Nov 13, 2022)
Atul Verma, "Aggregator in Apache Spark." Since Spark 3.0, UserDefinedAggregateFunction (UDAF) is deprecated. “An Aggregator is similar to a UDAF, but the interface is expressed… (Oct 7, 2022)
Subham Khandelwal, "PySpark — The Magic of AQE Coalesce." With the introduction of Adaptive Query Execution (AQE) in Spark, there have been a lot of changes in terms of performance improvements. Bad… (Oct 13, 2022)
Sync Computing, in Sync Computing, "Sync Autotuner for Apache Spark — API Launch!" The Sync Autotuner API enables you to continuously monitor and tune your Apache Spark jobs at scale. (Sep 19, 2022)
Subham Khandelwal, "PySpark - Create Data Frame from List or RDD on the fly." PySpark provides several popular methods for creating data frames on the fly from an RDD or from iterables such as a Python list. (Oct 4, 2022)
Tony Liu, in The Startup, "Demystifying Spark’s Stream-Stream OUTER Join." Real-time analysis & recommendation is always a fancy idea in data science. Nothing is more exciting than serving the most up-to-date… (Dec 11, 2020)