SQLite 3.33 (sqlite.org)
272 points by chmaynard on Aug 16, 2020 | 108 comments


SQLite 3.33 also increases the maximum size of a SQLite database file from 140TB to 281TB: https://www.sqlite.org/releaselog/3_33_0.html

I asked about this on the SQLite forum and D. Richard Hipp said "This change was in response to a customer request. They still have a factor of 4 before reaching the old upper limit, but asked for additional headroom." https://sqlite.org/forum/forumpost/8e40a7f588428077d7f073c00...


It is very good to hear there are others leveraging the full capabilities of this software.

We haven't broken the 100GB barrier for a single SQLite database file yet, but we have strong confidence that everything will simply continue working as expected once we do.


Out of curiosity, I wondered what the biggest SQLite database in my filesystem was:

find / -name '*.sqlite' -printf '%s %p\n' 2>/dev/null | sort -nr | head -n 1

The winner is `favicons.sqlite` from my Firefox profile directory, which is 40 MB.


macOS version:

mdfind "kMDItemDisplayName == *.sqlite" -0 | xargs -0 stat "-f%z %N" | sort -nr | head -n 5

should be fairly fast because mdfind uses the Spotlight backend and already has this data cached.


does it include system files?


Yeah -- the top 5 for me includes photos and notes


Make sure to include .etilqs files as well


I found the source. It appears that it is an etilqs_ prefix rather than .etilqs but the story is somewhat humorous:

https://github.com/mackyle/sqlite/blob/3cf493d4018042c70a4db...


Chrome History at ~500MB


I have a ~1TB SQLite database (genetic data). All my queries run really fast, and overall I'm very impressed with the performance.


Yep, that's a great use-case! However, at such data volumes I would recommend splitting it into several databases - backup, replication, and vacuum will be much easier and better. This doesn't apply only to sqlite3 =) . I assume your data is immutable or append-only, with rare writes and lots of reads?


Yep! The data is for the most part append-only, with massive writes once a month (that's how often the genetic repositories update their data dumps) with a few updates scattered in when annotations change.

What exactly do you mean by splitting it into several databases? It seems to me that would make backup and replication and such more difficult, since now I'd have to manage multiple databases. But I don't have experience there, so I'd love to hear if there are easy ways to do that.


If you are doing this for analytical work, a setup with one database is OK. But imagine that you are running something in production with SQLite and the database is really big. It becomes hard to VACUUM, create indexes, and so on. In that case it's great to shard, even if it's just several files on one machine (of course, only if you have data that can be sharded, like different users' data stored in different dbs).


The largest I've worked with until now is a bit over 0.5TB. It's still performing as well as it did when it was a few GB.


For those of us who haven't had a few-GB SQLite db, does it perform differently to one that is a few MB?


A few MB would likely be fully in the OS buffer cache. I think you would see the difference when it no longer fits.


Just a tip if you experience bad SQLite performance once the size reaches the triple-digit GB range: understand how BEGIN/END TRANSACTION and transaction control in general work, learn why and when you need journaling and how to control it via the appropriate pragmas, and be careful about indexing. You can increase SQLite performance by literally 4 orders of magnitude with simple tweaking, to the point that you could easily serve a website with a million hits per day off a single SQLite database.
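A minimal sketch of the transaction point, using Python's bundled sqlite3 (the path, schema, and row count are invented for illustration): batching many INSERTs inside one explicit transaction avoids paying a sync-to-disk per statement.

```python
import os
import sqlite3
import tempfile

# Hypothetical benchmark sketch: path, schema, and row count are
# invented for illustration.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
con = sqlite3.connect(path, isolation_level=None)  # autocommit; we manage transactions
con.execute("PRAGMA journal_mode=WAL")    # readers no longer block writers
con.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; safe in WAL mode
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

# Without BEGIN/COMMIT each INSERT would be its own transaction and
# pay its own sync; batched, all 10,000 rows share one.
con.execute("BEGIN")
for i in range(10_000):
    con.execute("INSERT INTO t (v) VALUES (?)", (f"row {i}",))
con.execute("COMMIT")

count, = con.execute("SELECT count(*) FROM t").fetchone()
print(count)  # 10000
con.close()
```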


Still, I see that a mere 60GB SQLite database for browser history, accumulated over maybe 10 years, causes trouble with Firefox, and I had to rename it and make Firefox create a new one in order for it to continue storing my recent history.

I bet it's not SQLite's problem.


I'd bet that the appropriate table is missing an index, since whoever designed it didn't anticipate it growing to 60GB.


I always had to fight with corporate DBAs to add indexes - they never wanted to create them because it made their backups take longer, so instead they made everyday usage slower for everyone.

One poor woman I worked with used to come in every day and kick off a bunch of queries that would do a couple of linear scans of a database and take four to six hours to complete. I added a single one-column index and the same query ran in less than a second. Got a hug on the spot. The DBA was upset the backup took a couple minutes longer to complete, but when he complained he got into trouble for not profiling the queries and adding the index himself (ha!).

After that I started using SQLite for a lot of things - instead of confronting the DBAs I’d just dump the official database into an SQLite database and run all my queries against that. Was kind of amazing that you would have these huge Oracle clusters supporting a database with a few tens or hundreds of megabytes (not even gigabytes) of data but perfect size for slurping into SQLite on my desktop.


What you were dealing with is most likely a sysadmin who grew into the role of DBA. The DBAs I know are highly qualified and actually make operations so much smoother. They help with strategies for effective schema migrations, know the profiler in-depth, and actually detect optimization opportunities before the devs realize it.

My guess is that you were dealing with a person who may not actually enjoy their work and is optimizing to make their work as little as possible — but of course I may be way off course here. :)


You may be right. Now that I think about it, there was one guy responsible for all the databases in my business line, and he was the one always launching a two-ocean war when his backup ran a minute slower than the day before.


Which is just petty stuff, and sends the message “I am more important than the rest of the company”.

It sounds like he doesn’t have the organization’s best interests in mind, but rather just his own.


Just goes to show you that just because somebody managed to get a job with the title DBA doesn't mean they're actually an expert in database management.


Pardon my French, but I would call such a DBA an "anti-DBA"


That's not French at all. :)


administrateur bâtard de l'enfer ("bastard administrator from hell")


It is kind of entertaining to think that someone can store a 281TB SQLite file while I am still waiting for companies to sell 1PB hard drives.

Back in 2010 I thought, yeah, soon we will have 50TB hard drives, then 500TB, and someday I can have the 1PB that will be all I will ever need! Sadly, no such luck. I would be happy to be in the tens-of-TBs range for under 200 dollars.

Games are getting stupidly wasteful. Someone told me Fortnite is almost 90GB or so and I flinched, considering GTA5 is around that range and offers a loooot more rich gameplay and features! What a wasteful game, resource-wise.


Kids these days.

My first hard drive was an 8" 20MB drive. That's about 9.5 x 4.6 x 14.2 inches and weighed about 20 pounds. Only cost me $6000, in 1981. And with that massive capacity, I never did run out of room.

Now I have a 200GB MicroSD in my phone. It cost $75. Haven't run out of room on that one either.



Gameplay and features are cheap; they don't take as much storage as assets. All the dances, outfits and gear in Fortnite are heavy, it seems.


GTA V has seven or so years worth of major bi- or tri-annual DLCs. There are probably a few hundred hours worth of (fully voiced) co-op content in GTA Online.

https://gta.fandom.com/wiki/Updates_in_GTA_Online (37 total)


"I can have 1PB it will be all I will need"

You can have a 1PB file without a 1PB drive. I think md has a limit of 8 ZiB. Several filesystems allow files up to 8EB.


Well yeah, I know that, but still it would be nice to have a single 1PB hard drive.


That’s a lot of eggs for one basket...


Right? Better have several of those, bought from different vendors at different times, to handle the backups!

It depends on the size of the eggs though. With 1 PB drives will probably come 3D 360° wat K videos in which you can take a walk, and video games.

These videos will be downloaded from your crappy rural TB connection by the way. Web pages will still be slow as fuck to load, like today, because we will have figured out that downloading the entire NPM registry several times per page load is easier than taking the time to bundle the dependencies, and optimizing image and video size will be considered as a premature optimization that nobody does anymore.


It would take almost a month to fill up assuming 500MB/s write speed.


When I say hard drive I don't mean just hard drives; I usually mean storage media designed for desktops. So a PCI-E drive would suffice. Besides, my assumption was that we'd have more storage space but files wouldn't endlessly increase in size as they seem to be doing.


Yeah, Call of Duty: Modern Warfare (the latest one). https://i.imgur.com/bsiGKFp.png Their patches are 30-40GB each!


The size is typically from assets, not gameplay and features.


Even in that context GTA5 seems to have a lot more though, a massive city with all types of places like deserts and downtown areas, planes, cars, motorcycles, and so on.


Yeah, visual assets at very high resolutions are a large part of it. I wish I remembered the game, but I do recall one (it was a free-to-play battle royale game, not Fortnite or PUBG) that had a ~15GB base game download, and then an additional 30GB for 4K textures.

In the case of GTA I’d imagine the voice lines are a major source of size as well. They’ve been doing updates for the online with new missions and voice lines for ages.


A SAN with a volume manager, or just a big enough RAID NAS, could do it. You need a mere 36 8TB drives, not even a full Backblaze pod.


36 x 8TB = (roughly) 288TB raw storage.

The overhead (depends on partition/filesystem/etc layout) would probably take that under 281TB.

Adding a few more drives would still probably be needed to hold the file, and add some level of redundancy. :)


Json and markdown output for the CLI also...that's nice.


Those limits seem so... arbitrary. And at that scale, why bother having a limit at all? Genuinely curious for the justification.


https://sqlite.org/limits.html

> SQLite was originally designed with a policy of avoiding arbitrary limits. [...] Unfortunately, the no-limits policy has been shown to create problems. Because the upper bounds were not well defined, they were not tested, and bugs were often found when pushing SQLite to extremes.


Thanks -- that makes total sense, all for guaranteed tests.

Though I shudder to imagine the effort that goes into testing a 281 TB database.

Do they purchase all those hard drives and make one enormous RAID concatenated disk set? Or is there a way to more cheaply virtualize that by combining a bunch of max-sized 16 TB AWS EBS instances? What types of errors are even likely to come up at that point -- errors in SQLite's internal logic, or errors in the operating system or drivers?


The TH3 test harness for SQLite (https://www.sqlite.org/th3.html) supports a virtual filesystem in which we can create test database files that appear to be very large but that don't actually contain much data or use much space.

We also have a simple utility program in the SQLite source tree (https://www.sqlite.org/src/file/tool/enlargedb.c) that lets you create a massive database file using a sparse file (https://en.wikipedia.org/wiki/Sparse_file) on systems that support that kind of thing.


If they have customers who need that kind of database, I'm sure they will be willing to help out with the testing. The setup doesn't have to be too crazy. Storage appliances with 20-30 hard drives are fairly common, actually. Even for individual humans, not just corporations. LVM allows you to easily create a logical volume out of multiple physical volumes, e.g. https://www.redhat.com/sysadmin/creating-logical-volumes.

You might find it interesting to look at the actual SQLite source commits where this change was introduced: https://sqlite.org/src/timeline?r=larger-databases It turns out the number comes from having a max of 2^32 pages per database, with the default page size being 4 kB: https://www.sqlite.org/pgszchng2016.html

Working backwards, they must have raised the max page size to 64 KB: `python -c 'print(1024 * 64 * 2**32)'` produces 281,474,976,710,656.


Maybe upping the limit required some changes in the way some things are handled in the code. Therefore, the limit would not be arbitrary, but a consequence of how SQLite is written.

Or the authors want to provide some guarantees, and upper limits are set according to what is known to work / tested.

I'd be interested in knowing the actual answer to your question though.

edit: the actual answer is in the sibling comment, ha ha.


Decimal type, finally. Now SQLite is 100x easier to use in finance.


That's brilliant! I've been waiting for this for years.


If you appreciate SQL, the example (update inventory from daily sales) is a thing of beauty.


Is “GROUP BY 2” a shorthand for “GROUP BY itemId”, the second column in the FROM SELECT? Not sure I’ve seen that syntax before.


You can refer to columns from the select clause by ordinal in group by and order by clauses, in lieu of restating the full expression. Very helpful with complex expressions, but IMO not the greatest for readability, so I tend to only use it in ad hoc queries. It's part of SQL-92, but I believe it is deprecated.

Many interpreters let you use a column alias from the select clause in the group and order clauses. This has better readability IMO; I'm not sure it's in the SQL-92 standard, but I believe it is now standardized.


> Many interpreters let you use a column alias from the select clause in the group and order clauses. This has better readability IMO; I'm not sure it's in the SQL-92 standard, but I believe it is now standardized.

Not sure if it's standardised, but Postgres doesn't let you do this (rather irritatingly).


No, PostgreSQL definitely lets you do this.

> An expression used inside a grouping_element can be an input column name, or the name or ordinal number of an output column (SELECT list item), or an arbitrary expression formed from input-column values. [0]

> Each expression can be the name or ordinal number of an output column (SELECT list item), or it can be an arbitrary expression formed from input-column values. [1]

Perhaps you are thinking of trying to use an output column name in a WHERE clause?

[0]: https://www.postgresql.org/docs/current/sql-select.html#SQL-...

[1]: https://www.postgresql.org/docs/current/sql-select.html#SQL-...


> Perhaps you are thinking of trying to use an output column name in a WHERE clause?

Yeah, I think I am thinking about this.

It's pretty frustrating, because the actual query engine is perfectly capable of executing fairly complex SQL expressions efficiently (they're not really that complex computationally, but they are syntactically, because of SQL's verbosity), but the code becomes quite unmaintainable if you use too many of them.

For example, the following expression:

    ROUND((
      (EXTRACT(EPOCH FROM (s.end_time - s.start_time)) / 60) -- Shift length in minutes
        - (FLOOR((EXTRACT(EPOCH FROM (s.end_time - s.start_time)) / 60) / 380) * 20) -- Break length in minutes
    )::numeric / 60 , 2) as hours_planned,
It repeats the sub-expression `EXTRACT(EPOCH FROM (s.end_time - s.start_time)) / 60`. If I could name that sub-expression and reference it multiple times, the overall expression would be a lot more readable.
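A common workaround is to name the repeated sub-expression once in a CTE (or subquery) and reference it by alias afterwards. A sketch against SQLite via Python's sqlite3 — SQLite lacks EXTRACT, so a plain minutes column stands in, and the schema and the 380-minute/20-minute break rule are illustrative, not from any real system:

```python
import sqlite3

# Sketch: name the repeated sub-expression once in a CTE, then reuse
# it by alias. SQLite lacks EXTRACT, so a minutes column stands in;
# schema and the 380-minute/20-minute break rule are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shifts (start_min REAL, end_min REAL)")
con.execute("INSERT INTO shifts VALUES (0, 480)")  # one 8-hour shift

row = con.execute("""
    WITH s AS (
        SELECT end_min - start_min AS shift_minutes  -- named once here
        FROM shifts
    )
    SELECT ROUND(
        (shift_minutes
         - CAST(shift_minutes / 380 AS INTEGER) * 20)  -- reused for the break
        / 60.0, 2) AS hours_planned
    FROM s
""").fetchone()
print(row[0])  # 7.67
```

In Postgres the CTE would compute `EXTRACT(EPOCH FROM (s.end_time - s.start_time)) / 60` once under a name, with the same effect.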


Yes, it's not in the standard. IIRC they don't do it only because it's not in the standard, which is somewhat annoying when iterating on queries. In stuff like SQLAlchemy it's not as annoying, since you can write it DRY.


I believe it's in SQL:1999.

It's tedious that all this is hearsay without open access standards. It's "only" $195 for the latest.


Yes. GROUP BY n refers to the nth column of the result set.
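A quick illustration with Python's bundled sqlite3 (table and values are invented): `GROUP BY 2` groups by the second item of the SELECT list, `ORDER BY 1` sorts by the first.

```python
import sqlite3

# Minimal demonstration of ordinal references in GROUP BY / ORDER BY.
# The table and rows are invented for the example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (quantity INTEGER, itemId TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(3, "apple"), (2, "apple"), (7, "pear")])

rows = con.execute(
    "SELECT sum(quantity), itemId FROM sales GROUP BY 2 ORDER BY 1"
).fetchall()
print(rows)  # [(5, 'apple'), (7, 'pear')]
```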


ORDER BY [column number] is also a useful shortcut in MSSQL. Very often my columns are a compilation of functions; it feels clean to not repeat them. I'm not sure of its availability in other SQL dialects.


IMO it looks cleaner but adds cognitive overhead for readability so I'm not sure if it is actually better. Feels kind of like a lookup table that a reader now has to reference. YMMV


In code, I agree. But when you are in a REPL, it is handy to have shortcuts.


It isn't idempotent, though - if you execute the query twice, you're kind of screwed, since your inventory numbers are now bad. A better way to do it would be to create a new table that consists of daily snapshots of inventory, select the latest row, and then use that row's date as a WHERE filter when aggregating the latest transactions.
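A hedged sketch of that snapshot approach in Python's sqlite3 — the schema, dates, and quantities are all invented; the point is that re-running the read yields the same answer no matter how many times you execute it:

```python
import sqlite3

# Snapshot-based inventory: derive the current quantity from the
# latest snapshot plus sales recorded after its date, instead of
# mutating inventory in place. Schema and data are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (day TEXT, itemId TEXT, quantity INTEGER);
    CREATE TABLE sales (day TEXT, itemId TEXT, quantity INTEGER);
    INSERT INTO snapshots VALUES ('2020-08-15', 'widget', 100);
    INSERT INTO sales VALUES ('2020-08-16', 'widget', 30);
    INSERT INTO sales VALUES ('2020-08-16', 'widget', 10);
""")

# Running this query any number of times gives the same result,
# which is what makes the approach idempotent.
qty, = con.execute("""
    SELECT s.quantity - COALESCE(SUM(sa.quantity), 0)
    FROM snapshots AS s
    LEFT JOIN sales AS sa
        ON sa.itemId = s.itemId AND sa.day > s.day
    WHERE s.day = (SELECT MAX(day) FROM snapshots)
      AND s.itemId = 'widget'
    GROUP BY s.itemId
""").fetchone()
print(qty)  # 60
```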


> It isn't idempotent, though

Agreed, but the purpose of the example is to demonstrate how to use UPDATE FROM, nothing more. I appreciate how it accomplishes a complex update in a very clear, concise, elegant way.


I'm only passingly familiar with SQL, can anyone explain?


Essentially the FROM part calculates how many of each item has been sold today and then uses this amount to reduce how many are currently in stock (the UPDATE and SET parts).
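For the curious, the release example can be run end to end from Python's sqlite3, provided the linked SQLite is 3.33 or newer (the inventory and sales rows below are invented):

```python
import sqlite3

# UPDATE ... FROM requires SQLite 3.33+ in the linked library.
assert sqlite3.sqlite_version_info >= (3, 33, 0)

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE inventory (itemId TEXT, quantity INTEGER);
    CREATE TABLE sales (itemId TEXT, quantity INTEGER);
    INSERT INTO inventory VALUES ('apple', 50), ('pear', 20);
    INSERT INTO sales VALUES ('apple', 3), ('apple', 2), ('pear', 7);
""")

# The FROM subquery totals today's sales per item; the UPDATE then
# subtracts each total from the matching inventory row.
con.execute("""
    UPDATE inventory
       SET quantity = quantity - daily.amt
      FROM (SELECT sum(quantity) AS amt, itemId FROM sales GROUP BY 2) AS daily
     WHERE inventory.itemId = daily.itemId
""")
rows = con.execute("SELECT itemId, quantity FROM inventory ORDER BY 1").fetchall()
print(rows)  # [('apple', 45), ('pear', 13)]
```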


I see what it's doing but I don't really see what the benefit is. Maybe a comparison to what previously had to be done would help me? Would you have to make the 'daily' table in a separate step and then UPDATE using that?


Previously you would have to SELECT and then UPDATE in separate queries, meaning that the data is round tripping to your application server and back again.

Edit: Or use subqueries apparently.


This is sqlite, you typically don't have a server.


Is the "set" evaluated where it's written or only during "sum()"? If not, what's the sum for?


As someone who uses SQL Server at work, I was surprised to find that this wasn't a standard feature in other DB engines. I use it all the time! Glad to see it spreading.


SQLite isn't a database server so much as a file format with a handy API, so it should be expected that it won't have all of the features you may expect from a full database engine. It doesn't support users and permissions, for example. I don't think a feature's presence in SQLite can be seen as indicative of the feature set of any other database software.


Postgres has this too afaik?

Yeah! [1] With some caveats that also apply to the SQLite implementation, if I'm not mistaken:

> When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_item list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

https://www.postgresql.org/docs/current/sql-update.html


The release notes directly reference that they copied us / PG: "Support for UPDATE FROM following the PostgreSQL syntax." The sqlite folks have done that a few times, and I think that's good. No need to unnecessarily proliferate more variants of doing the same.


I'm intrigued by the decimal extension. I thought SQLite was IEEE double only, but it handles 64-bit integers, and now decimals. They don't seem to be fully incorporated into the documentation yet, and decimal.c leaves me with questions, so I guess I'll just have to play around with it.


Decimals are represented as strings.

They have add, sub, mul and sum functions for text strings as well as comparison.
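For context on why this matters for money: SQLite's default REAL type is IEEE-754 binary floating point, so cents don't add exactly, which is the problem exact text-based decimal arithmetic avoids. (The decimal extension is loadable and not bundled with Python's sqlite3, so only the underlying problem is shown here.)

```python
import sqlite3

# Binary floating point in plain SQLite: the classic 0.1 + 0.2 case.
con = sqlite3.connect(":memory:")
total, = con.execute("SELECT 0.1 + 0.2").fetchone()
print(total == 0.3)  # False
print(total)         # 0.30000000000000004
```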


Oh, decimal support is landing!

I have a lot of stuff dancing around the lack of decimals in SQLite, so this is very welcome!


This software is a piece of magic...


As a data point, the nightly builds of DB Browser for SQLite now include SQLite 3.33:

https://nightlies.sqlitebrowser.org/latest/


UPDATE FROM is a nice extension. In Oracle I've done similar things with an UPSERT/MERGE statement. The linked example of:

  UPDATE inventory
     SET quantity = quantity - daily.amt
    FROM (SELECT sum(quantity) AS amt, itemId FROM sales GROUP BY 2) AS daily
   WHERE inventory.itemId = daily.itemId;
Could be done with MERGE using:

  MERGE INTO inventory i
  USING (
    SELECT
      SUM(quantity) AS amt
    , itemId
    FROM sales
    GROUP BY 2
  ) daily
  ON (
    i.itemId = daily.itemId
  )
  WHEN MATCHED THEN UPDATE
    SET i.quantity = i.quantity - daily.amt;


Is SQLite capable of supporting multiple writes to the database from different processes at the same time?

This question is purely around INSERTS - no UPDATES.

Anything I should be aware of? Looking to write data from multiple services to an SQLite file.


I think that this applies to all modifying operations - INSERTs, UPDATEs, DELETEs.

It's perfectly OK to write to SQLite from different processes at the same time, but to achieve good results it's better to:

* use WAL mode, so readers and writers do not block each other (you can turn it on with `PRAGMA journal_mode=WAL;` in the CLI, and it's better to also add `PRAGMA main.synchronous=NORMAL;`).

* note that all concurrent writes will be queued by the sqlite3 library and done sequentially, and if any write attempt waits longer than BUSY_TIMEOUT ( see https://www.sqlite.org/pragma.html#pragma_busy_timeout ), it will return an error.

Snippet

  # setting things up...
  itroot@l7490:/tmp$ grep -i pragma ~/.sqliterc
  PRAGMA journal_mode=WAL;
  PRAGMA main.synchronous=NORMAL;
  PRAGMA busy_timeout=1000;
  itroot@l7490:/tmp$ sqlite3 test.sqlite 'CREATE TABLE records (id INTEGER PRIMARY KEY, record TEXT);' > /dev/null 2>&1

  # running 10 parallel processes that inserts numbers from 1 to 1000...
  itroot@l7490:/tmp$ echo {1..1000} | xargs -n1 -d' ' -P 10 -i% sqlite3 test.sqlite 'INSERT INTO records (record) VALUES (%);' >/dev/null 2>&1

  # getting number of records
  itroot@l7490:/tmp$ sqlite3 test.sqlite 'SELECT count(*) FROM records;'
  count(*) = 1000
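The same experiment can be sketched from Python's sqlite3, with the connect-time `timeout` playing the role of the busy timeout (the path and row count are illustrative, and threads stand in for the parallel processes above):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: threads stand in for parallel processes;
# path and row count are invented.
path = os.path.join(tempfile.mkdtemp(), "test.sqlite")

con = sqlite3.connect(path)
con.execute("PRAGMA journal_mode=WAL")  # WAL persists in the file for all connections
con.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, record TEXT)")
con.commit()

def insert(n):
    # timeout= is sqlite3's equivalent of PRAGMA busy_timeout (in seconds)
    c = sqlite3.connect(path, timeout=5.0)
    c.execute("PRAGMA synchronous=NORMAL")
    c.execute("INSERT INTO records (record) VALUES (?)", (str(n),))
    c.commit()
    c.close()

with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(insert, range(1, 1001)))

count, = con.execute("SELECT count(*) FROM records").fetchone()
print(count)  # 1000
con.close()
```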


From the FAQ:

> Multiple processes can have the same database open at the same time. Multiple processes can be doing a SELECT at the same time. But only one process can be making changes to the database at any moment in time, however.

Sounds like you can, just not at the same moment as it locks the file during writes. It appears to retry if the file is locked, so not the end of the world.

Faq- https://www.sqlite.org/faq.html#q5

Relevant SO post - https://stackoverflow.com/questions/15383615/multiple-access...


SQLite is great because I don't want to run a database server if I don't need to, and whenever it seems like I might have to give up and run a database server instead, I find a configuration option in SQLite that solves my scaling problem or performance problem or concurrency problem. I don't like to recompile SQLite to set compile-time options, though I'm perfectly capable of doing it.

I wish WAL mode were the default. I wish read only WAL mode were easier to do without putting the database file in a sticky directory. Oh well.


Why do you need read only WAL mode? Or do you just mean some processes are read only, while others can write?


The real reason I need WAL mode is so I can read the database while vacuum is running.

The web server runs unprivileged and only needs read permission. Insert and vacuum are done as a user which has write permission.

The documentation says read only WAL mode is possible if there is write permission on the directory. So I put the database file in its own directory and set the mode of the database file to 0664 and set the mode of the directory to 3777. So an unprivileged process has permission to create WAL files which become group writable.


Thank you for explaining.


My understanding is that this is because a temporary file needs to be created next to the db in WAL mode, so write permissions are necessary on the directory. (https://www.sqlite.org/tempfiles.html, search for 'WAL')


This is useful, but the clunkiness of using FROM/WHERE vs JOIN ... USING has always annoyed me in Postgres, which they've based this on. The MySQL version is more like what my brain expects.


I wonder if they could just make this syntax work in Postgres:

    UPDATE inventory 
    SET quantity = inventory.quantity - daily.amt
    FROM
        (SELECT sum(quantity) AS amt, itemId FROM sales GROUP BY 2) daily USING( itemId );
Basically just allowing an ON/USING clause after the first FROM entry, as if it were joining to the table being updated.

Otherwise it's kind of annoying when you end up joining to several tables in the update, but have to use the WHERE clause to join back to the main table.


Personally I prefer UPDATE JOIN as I've used it in mysql:

  UPDATE inventory
  INNER JOIN (SELECT sum(quantity) AS amt, itemId FROM sales GROUP BY 2) AS daily USING (itemId)
  SET quantity = quantity - daily.amt;
I had thought postgres could do it this way without the extra FROM syntax, but apparently not.


The idea of being able to update multiple tables in a single query this way is intriguing to me! But I like the fact that the PostgreSQL syntax makes it crystal clear which table is being updated.


In fact, MySQL allows update of multiple tables. https://stackoverflow.com/questions/4361774/mysql-update-mul...


Nice feature, I hate having to work around it with subqueries or worse


Agreed. It would be interesting to compare the UPDATE FROM query plan with the equivalent subquery-based update.


Bummer, Python 3.9.0 RC1 just missed it! It bundles SQLite 3.32.3. Ah well, I guess we can look forward to it in 3.9.1 in a few months.


Is UPDATE FROM part of ISO SQL or is it just a very popular extension?


Going by the SQLite documentation it isn't.

https://www.sqlite.org/lang_update.html



Fossil also got a new release today: https://fossil-scm.org/home/uv/download.html


A better link to the changes in this version than what the entry provides: https://www.sqlite.org/releaselog/3_33_0.html


Interesting. Great for Command Query Responsibility Segregation (CQRS), but I could see this leading to a reduction in data normalization for a lot of teams not using CQRS too. CQRS is a great tool - but the complexity is high so you only want to use it where you need it.

CQRS is overused a lot though, like using a $75k surveillance robotic dog from Boston Dynamics [https://spectrum.ieee.org/automaton/robotics/industrial-robo...] (Massive Dynamics?) to see who's at the door, when you could have just looked through the peephole.

I could see this feature becoming the CQRS of the SQL world for a while, used in many places where it should be considered harmful, in addition to the places where it is helpful.


I don’t understand why you focus on CQRS; I don’t think it’s specifically related? At least, not any more related than say, materialized views are useful for CQRS.

I’d expect a CQRS system on top of SQL to be implemented using a single event table and a lot of triggers, each triggering a different aggregate.


Not sure why I got downvoted. Also, Domain Driven Design (which is where I've usually seen aggregates discussed), Event Sourcing, and CQRS are three different things that might be used together in an implementation but don't have to be.

The reason I noted how it's great for CQRS is that I've usually seen CQRS implemented with two data models. One is queried, and one is modified via commands. The one that's queried is updated according to some strategy, sometimes on an event triggered by the command model being modified. This way you could trigger an update on your query model when your command model changes, all within SQL.


I think you're mostly downvoted because:

* It's only about CQRS, which in no way was mentioned in the article;

* You're not properly qualifying why this feature is so important for CQRS;

* In the same breath, you're also claiming it's overused and complex, while in fact you're the one who's "promoting" it.

I agree that in my reply, I was actually thinking about ES + CQRS rather than pure CQRS, but the gist of my comment still stands: I would expect a CQRS implementation to work with triggers (for incremental updates) or materialized views (one table insert-only, the "view" is kept in sync automatically by the database, which is what users query). Could you elaborate on why you think this new feature is better than those two approaches?



