DuckDB: The Hidden Gem Powering the New Era of Local Data Analytics
Introduction
If you’ve ever worked with data, you’ve likely used tools like pandas for quick analysis or databases like PostgreSQL/MySQL for storing and querying larger datasets. Both are powerful — but there’s always been an awkward middle ground between them.
Pandas is easy and intuitive but struggles when your dataset grows beyond memory. Databases, on the other hand, are robust but often come with the burden of setup, configuration, and maintenance.
Enter DuckDB — the perfect bridge between the two.
Think of it as “SQLite for analytics”: a lightweight, in-process analytical database that runs right inside your Python, R, or C++ code. No servers, no configs, no connection strings — just import it and start querying instantly.
Why is everyone talking about DuckDB?
Data work has changed. Analysts and developers today handle millions of rows locally — often inside Jupyter notebooks or simple scripts. Setting up something like Spark for that kind of “medium-sized” data is overkill.
That’s where DuckDB shines. It is fast, lightweight, easy to use, portable across systems, and powerful enough to crunch through analytical workloads that once required massive data engines.
In short, DuckDB brings database-level performance to your laptop. You can analyze CSVs, Parquet files, or pandas DataFrames using pure SQL — without leaving your local environment. That sweet spot between convenience and capability is exactly why DuckDB is quickly becoming a favorite in the modern data stack.
DuckDB vs. Traditional Databases
When we hear “database,” we usually picture systems like PostgreSQL, MySQL, or MongoDB — the kind that run on servers and handle multiple users and transactions at once.
These are client–server databases, ideal for transactional work (OLTP) — things like recording orders, managing users, or updating records in real time.
DuckDB, however, plays a different game. It’s built for analytical workloads (OLAP) — the kind where you run complex queries, aggregations, and summaries over large datasets.
And since it’s in-process, DuckDB doesn’t need a separate server. It runs inside your program, just like NumPy or pandas — making it seamless to integrate into your existing workflows.
Why Is DuckDB So Fast?
DuckDB’s performance isn’t magic — it’s smart engineering. Here’s how it achieves lightning-fast analytics:
- Columnar storage — Data is stored column-by-column, not row-by-row, which makes aggregate operations like SUM(), AVG(), or filtering super efficient.
- Vectorized execution — It processes data in batches (vectors), making full use of modern CPU caches for speed.
- Parallel processing — DuckDB automatically distributes work across multiple CPU cores.
- Direct file querying — You can query data from CSV or Parquet files without even importing them first, as shown below.
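To make that last point concrete, here’s a minimal sketch of querying a Parquet file in place (the file name and column names are hypothetical placeholders for your own data):
import duckdb
# Run SQL over a Parquet file without importing it first.
# 'sales.parquet', region, and amount are hypothetical placeholders.
result = duckdb.query("""
    SELECT region, SUM(amount) AS total
    FROM 'sales.parquet'
    GROUP BY region
""").to_df()
print(result)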
While databases like Postgres are optimized for handling many small transactions, DuckDB is optimized for big analytical queries — scanning millions of rows and summarizing them in seconds.
Installation: As Easy As It Gets
One of DuckDB’s biggest superpowers is zero setup. No servers. No daemons. No configuration files. Just install and go.
In Python:
pip install duckdb
In R:
install.packages("duckdb")
That’s it. You’re ready to start querying data instantly.
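A quick one-liner to confirm the Python install works:
import duckdb
print(duckdb.query("SELECT 42 AS answer").fetchall())
# [(42,)]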
Getting started with SQL
DuckDB follows standard SQL syntax with a few analytical extensions.
Creating Tables in SQL
CREATE TABLE students (id INTEGER, name TEXT, score FLOAT);
Inserting values into the table
INSERT INTO students VALUES (1, 'Alice', 89.5);
Query data from tables
SELECT * FROM students;
SELECT * FROM students WHERE score > 80;
Ordering data
SELECT * FROM students ORDER BY score DESC;
Aggregating data
SELECT AVG(score), COUNT(*) FROM students;
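Grouped aggregates work the same way. As a quick sketch, if the table also had a hypothetical grade column:
SELECT grade, AVG(score) FROM students GROUP BY grade;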
Importing a CSV File
CREATE TABLE t AS SELECT * FROM read_csv_auto('sample_file.csv');
Exporting to a CSV File
COPY (SELECT * FROM t) TO 'output_file.csv' (HEADER, DELIMITER ',');
With these fundamentals in place, you have everything you need to begin querying data with DuckDB right away.
How to use DuckDB?
Let’s start with the basic commands you need to get going with DuckDB. It’s worth noting that a good understanding of the pandas library is assumed in what follows.
import duckdb
# Create an in-memory database (the default). The data lives in RAM and is lost when the process exits
con = duckdb.connect()
# Run an SQL command directly in python
# This creates a table called "students" and inserts a single row
con.execute("CREATE TABLE students AS SELECT 1 AS id, 'Alice' AS name, 90 AS score;")
# This line will extract all the data currently in the "students" table
result = con.execute("SELECT * FROM students").fetchall()
print(result)
# [(1, 'Alice', 90)]
You can also connect to a persistent / existing database file:
con = duckdb.connect('mydatabase.duckdb')
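With a file-backed connection, tables are written to disk and survive across sessions. A minimal sketch:
import duckdb
con = duckdb.connect('mydatabase.duckdb')
con.execute("CREATE TABLE IF NOT EXISTS students (id INTEGER, name TEXT, score FLOAT);")
con.close()
# Reopen later: the table is still there, stored in mydatabase.duckdb
con = duckdb.connect('mydatabase.duckdb')
print(con.execute("SELECT COUNT(*) FROM students").fetchall())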
One of DuckDB’s main features is that you can query pandas dataframes directly using SQL without writing them to the disk. Let us see how that’s done.
import duckdb
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [90, 85, 92]
})
result = duckdb.query("SELECT name, score FROM df WHERE score > 85").to_df()
print(result)
This creates a pandas DataFrame in Python, then uses DuckDB to run a SQL query on that DataFrame, treating it like a database table. Finally, it returns the query result as another DataFrame.
Why is this such an important feature?
Normally, to run SQL on data in pandas, you’d have to write the data out to a database like MySQL or PostgreSQL first. With DuckDB, no data is moved or copied; it reads directly from the pandas DataFrame in memory.
You can also query multiple DataFrames at once using DuckDB:
students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
scores = pd.DataFrame({'id': [1, 2, 3], 'score': [90, 85, 92]})
result = duckdb.query("""
SELECT s.name, sc.score
FROM students s
JOIN scores sc ON s.id = sc.id
""").to_df()
print(result)
You can also explicitly register a pandas DataFrame as a named table for repeated or complex queries; DuckDB will still read from the DataFrame in memory without any copying or conversion.
# create a duckdb connection object
con = duckdb.connect()
# register "df" as "students" inside duckdb. Now df available inside duckdb with the name "students"
con.register('students', df)
# queries the students table and converts the result of the SQL query to a dataframe
result = con.execute("SELECT * FROM students WHERE score > 85").df()
Finally, we come to exporting a table to CSV, pandas, or Parquet.
# To pandas
df_out = con.execute("SELECT * FROM students").df()
# To Parquet
con.execute("COPY students TO 'students.parquet' (FORMAT 'parquet');")
# To CSV
con.execute("COPY students TO 'students.csv' (HEADER, DELIMITER ',');")
Real world use cases
DuckDB is widely used in real-world scenarios where fast, in-process analytics is needed: it can query large CSV or Parquet files directly, integrates seamlessly with pandas, and handles complex SQL operations.
Here are some use cases:
- Fast aggregation, joins, or filtering across multiple CSV files from any domain, as shown in the sketch after this list.
- Creating features from large datasets before feeding them to ML models.
- Analyzing large biological datasets such as gene expression data.
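As a sketch of the first use case, DuckDB can read a whole folder of CSV files in one query using a glob pattern (the path and column names here are hypothetical placeholders):
import duckdb
# Aggregate across every CSV in a folder with a single query.
# 'data/*.csv', category, and value are hypothetical placeholders.
result = duckdb.query("""
    SELECT category, COUNT(*) AS n_rows, SUM(value) AS total
    FROM read_csv_auto('data/*.csv')
    GROUP BY category
    ORDER BY total DESC
""").to_df()
print(result)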

