Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive)
I spent all of yesterday learning Apache Hive. The reason was simple: Spark SQL is so obsessed with Hive that it offers a dedicated HiveContext to work with it (for HiveQL queries, Hive metastore support, user-defined functions (UDFs), SerDes, ORC file format support, etc.).
Since I’m into Apache Spark and have never worked with Hive, I needed to uncover the reasons for so much love for Hive. It was even more frustrating given how often my Spark clients have been asking about Hive. It became evident that I needed to dive into Hive to understand “Why Hive”.
As a long-time and happy user of SafariBooksOnline, I picked up the very first book that showed up in my search for resources about Hive. (I already knew how to build Hive from the sources, since I was doing it every day in the hope that it would somehow make the truth simpler to discover. It has not worked out well, though.)
If I had known that Apache Hive Essentials was published by Packt Publishing, I would not have read the book. I hated their way of presenting technical topics: giving solutions without much explanation of why one should do it that way. Given how much I need to read, a week or two with a Packt book was often a week lost. I’ve been enjoying books from O’Reilly, Manning, and Apress for the very basic reason that they offer me a broader overview of the topic at hand.
How much I would have missed if I had not picked up the Apache Hive Essentials book from Packt!
After the first two chapters I had already had a few Aha! moments. The chapters were short yet comprehensive, and they gave me exactly the help I needed. Thanks, Dayong Du and Packt!
I had already mastered how to use and even create my own user-defined functions in Spark SQL, how to register temporary or permanent tables in Hive, how to work with ORC files, and finally how to execute HiveQL queries directly (through the sql method). I could also recognize and fix the many failures of Spark on Windows (caused by Hive and Hadoop). My eyes were already trained to see the patterns of where and how to use Spark SQL to access Hive. What I was missing was the answer to the question of why people have been using Hive in the first place.
Sorry, but I would personally not recommend going to the home page of Apache Hive to find the answer to “What is Hive?”. The introduction is too convoluted for me.
The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
Go to the Wiki page about Apache Hive instead. It comes closer to touching my heart, yet it is still incomplete.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
So here comes my explanation of Hive (after just a day with the book and a few other articles, with enough confidence to explain the whys and hows of Spark SQL):
Apache Hive is a SQL layer on top of Hadoop. Hive uses a SQL-like query language, HiveQL, to run queries over large volumes of data stored in HDFS. HiveQL queries are executed using Hadoop MapReduce, but Hive can also use other distributed computation engines like Apache Spark and Apache Tez. Since HiveQL is so close to SQL and many data analysts already know SQL, Hive has always been a viable choice for querying data when Hadoop handles both storage (HDFS) and processing (MapReduce).
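To make the “SQL layer on top of MapReduce” idea a bit more tangible, here is a toy sketch in plain Python of how a query like SELECT dept, SUM(salary) FROM employees GROUP BY dept could be broken into map, shuffle, and reduce steps. It is only an illustration and not how Hive actually plans or runs queries; the employees table, its rows, and all the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical rows standing in for an "employees" table stored in HDFS.
ROWS = [
    {"dept": "sales", "salary": 100},
    {"dept": "sales", "salary": 200},
    {"dept": "eng", "salary": 300},
]


def map_phase(rows):
    """Emit (key, value) pairs; the GROUP BY column becomes the key."""
    for row in rows:
        yield row["dept"], row["salary"]


def shuffle_phase(pairs):
    """Collect all values that share a key, as the shuffle step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate function (here SUM) to each key's values."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Toy equivalent of: SELECT dept, SUM(salary) ... GROUP BY dept."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(ROWS))
```

The point is that the analyst writes only the one-line query, and a layer like Hive takes care of turning it into the distributed steps.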
How does it sound? Where do you think I’m missing something important about Hive? I’d like to make it simpler, as I don’t think many people can afford the time to read “that much”. I’d appreciate your support.
Let me know your thoughts. I’d appreciate it if you could help me get better at Spark and Hive. Thanks!