Speed-Querying StackOverflow data with DuckDB ft. Michael Hunger

MotherDuck November 27, 2023

MotherDuck

@motherduckdb

About

Collaborative serverless analytics platform

Video Description

Explore StackOverflow's vast data with Michael Hunger from Neo4j using #duckdb and MotherDuck. A highlight from the DuckDB Meetup in Berlin on November 20th, 2023. Michael is also the co-author of the upcoming book 'DuckDB in Action'.

☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck: https://hubs.la/Q02QnFR40

Thanks to dlthub (https://dlthub.com/) for hosting this!

Michael's LinkedIn: https://www.linkedin.com/in/jexpde/
DuckDB in Action book: https://motherduck.com/duckdb-book/

➡️ Follow Us
LinkedIn: https://www.linkedin.com/company/8192...
Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/

#duckdb #dataengineering

Discover how to perform large-scale data analysis on the entire Stack Overflow dataset using DuckDB and MotherDuck. This practical tutorial walks you through the complete data engineering workflow, from sourcing the massive XML data dump from the Internet Archive to running complex SQL queries in milliseconds. We'll show you how to import gigabytes of data into a local DuckDB instance, demonstrating its speed for exploratory data analysis (EDA) right on your laptop. Learn why you don't need a massive cluster for big data analytics and how DuckDB makes it possible to query 58 million posts in under a second.

Next, we dive into data optimization techniques essential for any data engineer. See a live demo of converting raw CSV files into the efficient, columnar Parquet format. This step is crucial for improving query performance and reducing storage size. We'll run benchmark queries directly against the local Parquet files, showing how DuckDB's architecture leverages columnar storage to deliver fast results, even for complex aggregations and joins. This section provides valuable insights into building an efficient local data analytics pipeline.
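
To make the first step of that workflow concrete, here is a minimal sketch of turning an XML dump into CSV using only the Python standard library. It is not the method shown in the talk. It assumes the dump stores one record per `<row ... />` element with the fields as attributes, and the column list and the function name `xml_rows_to_csv` are hypothetical choices for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed subset of attributes on each <row/>; a real dump may carry more columns.
COLUMNS = ["Id", "PostTypeId", "CreationDate", "Score", "ViewCount", "Title", "Tags"]


def xml_rows_to_csv(xml_source, csv_target, columns=COLUMNS):
    """Copy the attributes of every <row/> element into CSV; return the row count."""
    writer = csv.writer(csv_target)
    writer.writerow(columns)  # header line
    count = 0
    # iterparse hands over each element once its end tag has been read,
    # so rows can be written while the file is still being parsed.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # Missing attributes become empty CSV fields.
            writer.writerow([elem.get(col, "") for col in columns])
            count += 1
            elem.clear()  # drop the row's attributes after writing to limit memory use
    return count


if __name__ == "__main__":
    # Tiny made-up sample standing in for a multi-gigabyte Posts.xml file.
    sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="10" Title="Example question" Tags="&lt;sql&gt;" />
  <row Id="2" PostTypeId="2" Score="5" />
</posts>"""
    out = io.StringIO()
    print(xml_rows_to_csv(io.BytesIO(sample), out), "rows written")
    print(out.getvalue())
```

For a real file, pass a path such as `Posts.xml` as `xml_source` and an open text file (opened with `newline=""`) as `csv_target`. A CSV produced this way is the kind of input DuckDB can read with `read_csv_auto` and rewrite as Parquet with a `COPY ... TO 'posts.parquet' (FORMAT PARQUET)` statement; the file names here are placeholders.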
Take your analysis to the next level by scaling from local to the cloud with MotherDuck, the serverless data warehouse built on DuckDB. We demonstrate how to seamlessly connect to MotherDuck and load your Parquet files directly from an S3 bucket, highlighting the incredible network speed and query pushdown optimizations between S3 and the MotherDuck platform. This setup provides a powerful, hybrid cloud data warehouse solution where you can manage and query large datasets without provisioning any infrastructure.

Finally, explore the cutting-edge features of MotherDuck that accelerate your workflow. Witness a fascinating demo of AI-powered queries, where we use a GPT-based prompt to automatically generate and even fix complex SQL, making data accessible to a wider audience. We also cover MotherDuck's powerful data sharing capabilities, showing you how to create read-only snapshots of your database to collaborate with colleagues securely and efficiently. This video is a complete guide for developers and data analysts looking to master the modern data stack with DuckDB.