Speed-Querying StackOverflow data with DuckDB ft. Michael Hunger

MotherDuck • November 27, 2023

Video Description

Explore Stack Overflow's vast data with Michael Hunger from Neo4j using #duckdb and MotherDuck. A highlight from the DuckDB Meetup in Berlin on November 20th, 2023. Michael is also the co-author of the upcoming book "DuckDB in Action".

☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck: https://hubs.la/Q02QnFR40

Thanks to dlthub (https://dlthub.com/) for hosting this!

Michael's LinkedIn: https://www.linkedin.com/in/jexpde/
DuckDB in Action book: https://motherduck.com/duckdb-book/

➡️ Follow Us
LinkedIn: https://www.linkedin.com/company/8192...
Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/

#duckdb #dataengineering

--------------------------------------

Discover how to perform large-scale data analysis on the entire Stack Overflow dataset using DuckDB and MotherDuck. This practical tutorial walks you through the complete data engineering workflow, from sourcing the massive XML data dump from the Internet Archive to running complex SQL queries in milliseconds. We'll show you how to import gigabytes of data into a local DuckDB instance and demonstrate its speed for exploratory data analysis (EDA) right on your laptop. Learn why you don't need a massive cluster for big data analytics and how DuckDB makes it possible to query 58 million posts in under a second.

Next, we dive into data optimization techniques essential for any data engineer. See a live demo of converting raw CSV files into the efficient, columnar Parquet format, a step that improves query performance and reduces storage size. We'll run benchmark queries directly against the local Parquet files, showing how DuckDB's architecture uses columnar storage to deliver fast results, even for complex aggregations and joins. This section provides practical insights into building an efficient local data analytics pipeline.
Take your analysis to the next level by scaling from local to the cloud with MotherDuck, the serverless data warehouse built on DuckDB. We demonstrate how to seamlessly connect to MotherDuck and load your Parquet files directly from an S3 bucket, highlighting the incredible network speed and query pushdown optimizations between S3 and the MotherDuck platform. This setup provides a powerful, hybrid cloud data warehouse solution where you can manage and query large datasets without provisioning any infrastructure.

Finally, explore the cutting-edge features of MotherDuck that accelerate your workflow. Witness a fascinating demo of AI-powered queries, where we use a GPT-based prompt to automatically generate and even fix complex SQL, making data accessible to a wider audience. We also cover MotherDuck's powerful data sharing capabilities, showing you how to create read-only snapshots of your database to collaborate with colleagues securely and efficiently. This video is a complete guide for developers and data analysts looking to master the modern data stack with DuckDB.