“import geopandas” also exists and has for some time. Snark aside, WHAT is special about DuckDB? I wish the author had actually shown some practical examples so I could understand their claims better.
I replied to another comment, but I think a big part is that DuckDB's spatial extension provides a SQL interface to a whole suite of standard FOSS GIS packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.e., there are no transitive dependencies except libc.
Yes, DuckDB does a whole lot more: vectorized larger-than-memory execution, columnar compressed storage, and an ecosystem of other extensions that make it more than the sum of its parts. But while I've been working hard on making the spatial extension more performant and more broadly useful (I designed a new geometry engine this year, and spatial join optimization just got merged on the dev branch), the fact that you can e.g. convert to and from a myriad of different geospatial formats by utilizing GDAL, transform through SQL, or pull down the latest Overture dump without having the whole workflow break just because you updated QGIS has probably been the main killer feature for a lot of the early adopters.
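To make that concrete, a typical round-trip looks roughly like this from the Python API (a sketch; the file names are made up, and the geometry column name and exact COPY options are worth checking against the docs for your version):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL spatial")
    con.execute("LOAD spatial")

    # Read a shapefile through the bundled GDAL, reproject with the bundled PROJ,
    # and write GeoJSON back out -- no system-level GDAL/PROJ install required.
    con.execute("""
        CREATE TABLE parcels AS
        SELECT ST_Transform(geom, 'EPSG:27700', 'EPSG:4326') AS geom
        FROM ST_Read('parcels.shp')
    """)
    con.execute("""
        COPY parcels TO 'parcels.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON')
    """)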
(Disclaimer: I work on duckdb-spatial @ duckdblabs)
I'm not the OP, but thank you for such a detailed answer. The integration and reduced barriers to entry you mention mirror my own experiences with tooling in another area, and your explanation made the parallels clear.
Is there any strong reason to use GeoParquet instead of straight up parquet if all I'm interested in is storing and operating on lat/lons?
I'm curious if it compresses them better or something like that. I see lots of people online saying it compresses well (but mostly compared to .shp or similar) but normal parquet (.gz.parquet or .snappy.parquet) already does that really well. So it's not clear to me if I should spend time investigating it...
I mostly process normal parquet with spark and sometimes clickhouse right now.
Based on my reading of the GeoParquet spec, the main difference is that geometries are stored as WKB using Parquet's byte array type. Byte arrays can be delta-encoded. There is also some additional metadata stored like CRS and a bounding box.
When using EPSG:4326 lat/lons, I don't think GeoParquet would give you any benefits over just having separate columns (this is what I typically do, and it's plenty fast).
If you are using range requests to fetch only parts of parquet files at a time, you could potentially sort your data using a hilbert curve, which could limit the number of row groups that need to be fetched to execute a query.
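A rough illustration of the plain-columns approach (paths and column names here are made up); with the files pre-sorted spatially, a bounding-box predicate like this can be answered mostly from row-group min/max statistics:

    import duckdb

    # Plain Parquet with lat/lon columns: DuckDB pushes the range predicates
    # down to the Parquet reader, so spatially sorted files let it skip most
    # row groups without ever decoding them.
    duckdb.sql("""
        SELECT *
        FROM read_parquet('points/*.parquet')
        WHERE lat BETWEEN 51.3 AND 51.7
          AND lon BETWEEN -0.5 AND 0.3
    """).df()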
> a big part is that DuckDB's spatial extension provides a SQL interface to a whole suite of standard FOSS GIS packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.e., there are no transitive dependencies except libc.
And for the last twenty years, not ten, this is what PostGIS was pioneering, and also teaching everyone to get used to. DuckDB was not something that people in the GIS world even knew about. I'm not even sure whether QGIS connects to DuckDB; perhaps it has for a while, but it has sure known SpatiaLite for a very long time, and last but not least, ESRI sure as f*ck still has not heard of DuckDB. This is already half the geospatial world out there.
This whole article is superbly biased and it's very sad.
No surprise I missed it; 3.4 was only released on 25.11.2024, which is like yesterday in ESRI terms, given the price and speed of adoption of new versions (note: many .gov deployments still run the 10.x/11.x branch).
Which does not change my original statement that nobody in the GIS world cared about DuckDB until very recently, while the whole server-side processing idea is largely built on PostGIS.
As someone unfamiliar with DuckDB but at least somewhat with geospatial tools (it's been a few years, though):
Dang - see, now that is seriously cool. The whole ETL shebang always was the biggest hassle, even with serious commercial tools, and the idea of a stable, all-in-one, ready-to-go layer is incredibly appealing.
It's just something the writer of the article should probably have at least mentioned when going full hyperbole with the title (hey, after reading this, it might actually be justified! :) ).
Rereading the article that focuses on the one thing that isn't a standout (the installation itself), though, I can't help but chuckle and think "worst sales pitch ever". ;)
I've been researching DuckDB - while it has many technical merits I think the main argument will be ease of use. It has a lot of the operational advantages of sqlite paired with strong extensibility and good succinct documentation.
Folks who have been doing DevOps work are exasperated with crummy SaaS vendors or antiquated OSS options that have a high setup cost. DuckDB is just a mature project that offers an alternative, hence an easy fan favorite among hobbyists (I imagine at scale the opportunity costs change and it becomes less attractive).
I'm still getting feedback that many devs are not too comfortable with reading and writing SQL. They learned simple SELECT statements in school, but get confused by JOINs and GROUP BYs.
Random voice here: they should get better at SQL. Not 9 joins and GROUP BY and HAVING and other magic, but two joins and a GROUP BY for sure. If one already gets 3NF, then joins and the rest are a quick (two-week) learn.
It’s not a good database for many storage workloads because it only allows a single process to write to it. For example, you wouldn’t want 20 app servers writing their logs or metrics or traces to a single DuckDB instance. You would instead have them write rotating log files that are somehow aggregated, then queried through DuckDB via something like Grafana.
So the operational savings are more for using it as a lightweight data science tool that excels in specific workloads like geospatial. Wherever you need to do those computations more locally, it will excel.
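The rotating-files pattern sketched above might look something like this (made-up paths; each app server appends its own files and a single reader queries the whole directory):

    import duckdb

    # No shared DuckDB database is ever written to, so the single-writer
    # limitation never comes into play; the writers just emit JSONL/Parquet.
    duckdb.sql("""
        SELECT status, count(*) AS requests
        FROM read_json_auto('logs/*/2024-*.jsonl')
        GROUP BY status
        ORDER BY requests DESC
    """).show()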
I don’t really think SQL is a mandated DevOps skill, though a basic understanding of databases certainly is. Between online content and LLMs, as long as you can iteratively improve a query part by part through EXPLAIN ANALYZE etc., it’ll be sufficient.
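For what it's worth, DuckDB supports EXPLAIN ANALYZE too, so that iterate-and-measure loop is cheap (a trivial example):

    import duckdb

    # Prints the physical plan with per-operator timings -- usually enough to
    # spot a missing filter pushdown or an accidental cross join.
    duckdb.sql("EXPLAIN ANALYZE SELECT count(*) FROM range(1000000)").show()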
I don't know if it's ever been a core skill. You can ask any dev. Many tell me SQL is not something they use every day, so even if they learned it, they usually forget it in no time.
SQL is burned into my muscle memory because I work on ML models with large datasets. It took me a lot of trial and error to get decent at SQL. I imagine most devs just don't have the reps because their work rarely exposes them to SQL.
There's not much point in learning deeper SQL anymore; AI assistants have largely solved SQL querying. Just ask for what you want in natural language.
SQL queries are one area where correctness matters a lot, because downstream applications rely on them being correct. If you get a query wrong, especially in an ETL process, you can generate a lot of garbage data for a very long time (I'm speaking from lived experience). It might take a long time to correct (via backfill), and sometimes the original data might no longer be available.
I use LLMs to generate ideas for writing complex SQL, but I always need to understand the query 100% first before I deploy. Mistakes are costly, and sometimes irreversible. I need to trust but verify, which requires deep SQL knowledge.
To write correct SQL, you not only have to know the syntax, but also the schema (easily provided to an LLM) and the expected data values (you can provide a sampling to an LLM, which helps, but domain knowledge helps even more). There are lots of surprising things in data that only reveal themselves via visual inspection. (Though I think with MCP, an LLM will one day be able to inspect the data itself by sampling the database.)
This suggests that I should expect the data coming out of SQL to likely be wrong, since lots of people writing it don't have the deep knowledge, and it's very hard to verify correctness of the results since there is limited to no ground truth.
If I already expect it to be making at least some of a mess, why not have AI as part of the setup?
Testing SQL for correctness is beside the point of how the SQL was generated. It's important, of course, to review and test anything. My point is that assistants are already good enough at writing SQL that it's not worth spending much time hand-crafting complex queries.
While AI assistants can give ideas on how to address something, I'd strongly disagree with the statement "there is no point in learning". You should understand the AI's suggestion, because it could also be very bad or work incorrectly. (If it hallucinated, you'll notice the syntax error; that does indeed not require learning...)
Getting a SQL query to optimal performance is still much more of an art than a science. Having the LLM generate a query that appears to work (correctness issues aside) is much more likely than the LLM generating an optimally performing query. While this may not matter for the one-off queries common in analytics, when you start worrying about scalability, even the tiniest tweaks can make a huge difference.
DuckDB has Parquet support and can operate, in SQL syntax, on enormous 'tables' spread across huge collections of Parquet files as if they were one virtual file. I believe the underlying implication is opportunities to leverage vector instructions on Parquet. It's very "handy".
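For example (hypothetical paths and columns), a whole directory tree of Parquet files can be queried as one table:

    import duckdb

    # The glob is expanded lazily; only the columns and row groups the query
    # needs are read, using Parquet metadata and hive-style partition pruning.
    duckdb.sql("""
        SELECT sensor_id, avg(temperature) AS avg_temp
        FROM read_parquet('measurements/**/*.parquet', hive_partitioning = true)
        WHERE year = 2024
        GROUP BY sensor_id
    """).show()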
Author here: what's special is that you can go from 0 to spatial data incredibly quickly, in the data generalist tool you're already using. It makes the audience of people working with geospatial data much bigger.
Probably no difference for your use case (ST_Distance). If you already have data in Postgres, you should continue using PostGIS.
In my use case, I use DuckDB because of speed at scale. I have 600GBs of lat-longs in Parquet files on disk.
If I wanted to use PostGIS, I would have to ingest all this data into Postgres first.
With DuckDB, I can literally drop into a Jupyter notebook and do this in under 10 seconds, and the results come back in a flash (no need to ingest any data ahead of time):
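A sketch of the kind of query I mean (the paths, coordinates and radius here are made up; ST_Point and ST_Distance come from the spatial extension):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL spatial")
    con.execute("LOAD spatial")

    # Scan the Parquet files straight off disk -- no ingestion step.
    con.sql("""
        SELECT count(*)
        FROM read_parquet('latlons/*.parquet')
        WHERE ST_Distance(
            ST_Point(lon, lat),
            ST_Point(-122.4194, 37.7749)   -- a fixed reference point
        ) < 0.1                            -- in degrees for EPSG:4326 data
    """).show()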
I haven't yet understood this pattern (and I tried using duckdb). Unless you're only ever going to query those files once or twice in your life, importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.
Also as a side note, is everyone just using DuckDB in memory? Because as soon as you want some multiple session stuff I'd assume you'd use DuckDB on top of a local database, so again I don't see the point but I'm sure I'm missing something.
> importing them into postgres shouldn't be that long and then you can do the same or more than with DuckDB.
Usually new data is generated regularly and would require creating a separate ETL process to ingest into Postgres. With DuckDB, no ETL is needed. New Parquet files are just read off the disk.
> Also as a side note, is everyone just using DuckDB in memory?
DuckDB is generally used single-user, and yes, the in-memory use case is most common. I'm not sure about use cases where a single user requires multiple sessions, but DuckDB does have read concurrency, session isolation, etc. I believe write serialization is supported across multiple sessions.
With Parquet files, it's append-only so the "write" use-cases tend to be more limited. Generally another process generates those Parquet files. DuckDB just works with them.
This part was not obvious. In a lot of cases geodata is mostly stable and reads/searches dominate over appends, and that's why we keep it in a DB (usually PostGIS, yes).
So DuckDB is optimised for a very different use case, and that is not always obvious when it's mentioned.
And now I'm curious whether there's a way to actually index external files (make these queries over 600GB faster) and have this index (or many indices) be persistent. I might have missed that when I looked at the docs...
Ah thanks, of course. I was thinking of dealing with millions of (Geo)JSON files adding up to terabytes, without copying/duplicating them though, mostly indexing. I used to do that with Postgres foreign data wrappers and had hopes for DuckDB :-). But that's a question for SO or another forum.
I haven't used DuckDB, but the real comparison is presumably PostGIS? Which is also absent from the discussion, but I think that's what the author alludes to.
I have no major qualm with pandas and geopandas. However I use it when it's the only practical solution, not because I enjoy using it as a library. It sounds like pandas (or similar) vs a database?
Yeah, PostGIS is readily available, and postgres is much more widely used than DuckDB. Either I don't understand OP's argument for why this is so important or I just don't buy it.
If you're using JavaScript you install Turf. The concept that you can readily install spatial libraries is hardly earth shattering.
Ask anyone that's just starting out with geopandas about their experience, and I'd be shocked if anyone calls it intuitive and straightforward. To geopandas' credit, I think it just inherited many of pandas' faults (why does a user need to understand indexes to do basic operations, no multi-core support, very poor type hinting, etc.).