
I've been doing this for a long time, and all I can say after reading this multiple times is ... "I don't get it".

I mean, I get it, from a technical standpoint. Ok, so you're going to send read-only SQLite databases to everybody.

Am I missing what happens when the API (which you still need) inserts or updates something and all the client DBs are now stale? Is there a central database? How often are you pushing out read-only database replicas across the wire to all clients? Is that really less "chatty"? If so, how much bandwidth are you really saving by pushing an entire database multiplied by the number of clients?

None of this seems logical. Maybe I'm missing the real-world use case. Are we discussing tiny SQLite databases that are essentially static? Because in the last 30 years I've not run into a situation where I needed clients to execute SQL queries on tiny, static databases, let alone ones that still need updating.



Author here. We're using LiteFS to replicate SQLite databases in real time so the changes sent are only incremental. I think there are several ideal use cases for sharing databases across applications:

1. Internal tooling: you're able to manage the contract better since you have control of the source & destination applications.

2. Reporting & analytics: these tend to need a lot of query flexibility & they tend to use a lot of resources. Offloading the query computation to the client makes it easier on the source application.

As for database size, the Corrosion program mentioned in the post is about 8GB and has a continuous write load so this doesn't have to just be for tiny databases.
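For a sense of what consumption looks like: once LiteFS has replicated the file, the client just opens it as a local SQLite database. A minimal sketch in Python, assuming a replica mounted at /litefs/app.db and a hypothetical metrics table (both names are examples, not from the post):

    import sqlite3

    # Open the replicated file read-only; under LiteFS all writes go
    # through the primary, so a replica is safe to query freely.
    con = sqlite3.connect("file:/litefs/app.db?mode=ro", uri=True)
    for name, value in con.execute("SELECT name, value FROM metrics"):
        print(name, value)
    con.close()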


Looks interesting, although personally I don't see those as compelling use cases; I may very well be missing something.

> 1. Internal tooling: you're able to manage the contract better since you have control of the source & destination applications.

This has not been my experience in anything but tiny companies or companies with very strict mono-repo processes. One big point of separate teams is to minimize communication overhead as your organization grows (otherwise the communication channels scale quadratically, as N(N−1)/2). That means you do not want a lot of inter-department dependencies arising from the internal tooling and APIs they leverage.

> 2. Reporting & analytics: these tend to need a lot of query flexibility & they tend to use a lot of resources. Offloading the query computation to the client makes it easier on the source application.

Depends on the resources the client has, versus the server, to devote to a single query. The resources are also high because of how much data is analyzed, and 8GB seems tiny to me (i.e., the whole thing can be kept in some DBs' memory).


Ben, fan of your work. You guys have really moved the needle on SQLite.

Are there any plans for Corrosion to be published as OSS?


hey Ryan, thanks! As for Corrosion, I can't say anything publicly but something may be announced in the near future wink wink :)


Curious if you've tested it with Jepsen. Anytime I come across some distributed system, my stomach ulcers start playing up while I wonder about all the weird failure modes. I kinda looked for a caveats page on the LiteFS web site, didn't really see one.


LiteFS doesn't try to be a correct distributed system, as you can see from: https://fly.io/docs/litefs/how-it-works/#cluster-management-...

Basically, the solution they have is:

1. There is a single writer. There's optional best-effort leader election for the writer.

2. If there's a network partition, split-brain, etc, availability is chosen over consistency.

Jepsen's testing is focused on databases that pick "consistency". Since LiteFS didn't pick consistency, there's really not any point in running Jepsen against it. Like, Jepsen would immediately find "you can lose acknowledged writes", and LiteFS would say "Yes! That's working exactly as intended!"

However, another way of running LiteFS is with only a single writer ever (as in one app, one server, one SQLite database only), and all clients as read-only replicas that are never eligible for taking writes. In that case, you don't have a proper distributed system either, just read-only replicas, which is quite easy to get right, and mostly what this post is talking about.


LiteFS works similarly to async replication you'd find in Postgres or MySQL so it doesn't try to be as strict as something running a distributed consensus protocol like Raft. The guarantees for async replication are fairly loose so I'm not sure Jepsen testing would be useful for that per se.

On the LiteFS Cloud side, it currently does streaming backups so it has similar guarantees but we are expanding its feature set and I could see running Jepsen testing on that in the future. We worked with Kyle Kingsbury in the past on some distributed systems challenges[1] and he was awesome to work with. Would definitely love to engage with him again.

[1]: https://fly.io/dist-sys/


I could imagine this technique being useful for kepler.gl or other data visualization tools.


Do you realize most reporting & analytics use cases don't use SQLite databases?


Do you realise that there are many reporting and analytics cases where SQLite is a great fit?


I am looking at the data analytics industry as a whole and am involved in communities of data practitioners. Most of these people use cloud DBs (Snowflake, BigQuery & co) since there are fewer dependencies on DB-admin type work.

Some might be using Postgres or whatever the Engineering team provided them with, but I don't think I have really heard of a data person preferring to use SQLite.

Might still be a great fit, but as the other comment pointed out, it might not be a good fit for the target audience.


The parent is saying this isn't a great idea because target users don't use SQLite. Your reply is that it's a good idea if people changed how they do things to fit the idea.

I don't have a horse in this race, but the reply isn't very well argued, so I don't know what to make of it.


I have more interest in this: https://electric-sql.com/

They are adding a sync mechanism for SQLite so local writes can be synced to a remote location and on into Postgres and so on. CRDT-based eventual consistency.

So local write latency, eventually consistent sync, and tools for partial sync and user access are the primary areas of development.

Early days but an interesting approach to shipping the db.


Oh, I implemented something like that for my Android app. It seems to work quite well. I don't have many users yet, though.

I replicate the client's SQLite DB to a central server (over a REST API) where it is synced with existing data. I use a CRDT, so changes don't get lost, and as long as the clocks of all the users' devices are accurate, the merges happen in the correct order.

This enables offline access, but also the use of my app without an account. You can always merge the data later. Multi-device merge is also possible if you create an account. The multi-device merge especially was a big headache until I settled on this approach.
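As a rough illustration of the merge rule, here's a minimal last-write-wins upsert in Python/SQLite (my real stack is Kotlin/PHP/MySQL; the table and column names here are made up):

    import sqlite3

    def merge_rows(con, incoming_rows):
        # incoming_rows: (id, payload, updated_at) tuples from a client;
        # id is the primary key. Last-write-wins: an incoming row only
        # replaces the stored one if its timestamp is newer, so batches
        # converge in any order as long as device clocks are accurate.
        con.executemany(
            """
            INSERT INTO items (id, payload, updated_at)
            VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                payload = excluded.payload,
                updated_at = excluded.updated_at
            WHERE excluded.updated_at > items.updated_at
            """,
            incoming_rows,
        )
        con.commit()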

Since it is a home-grown solution, I still have to do some manual stuff that could be abstracted away, like converting datatypes between JSON, Kotlin, PHP, and MySQL. There's not always an equivalent datatype across those technologies.

This approach probably won't work well for big databases that share data between multiple users, though.


Are there any advantages you noticed as a dev to using this stack? Did you find there were better affordances? Were there challenges and growing pains?


It probably makes most sense if you have a middle-tier-less application where the UI IS the application, and effectively it just dumps the whole application state out through APIs anyway. We have ended up with this scenario playing out and you eventually just throw up your hands and make the UI hit the server with a giant "send me a huge blob of JSON that is most of the tables in the database on first page load" query anyway.

So the assumption that "if anybody changes anything everybody needs to know it" is close to true. In that scenario, putting a bunch of APIs in the way just makes the data less coherent rather than more. In most other scenarios, yeah, it's hard to see that it really makes sense.


Never in my career spanning decades have I had to ask a server to send me a dataset so large that it's "most tables in the database on first page load".

In what use case do you run into that?

I'm lead and an architect on an enterprise application at the moment that drives the whole company. It's your standard configuration: front-end, APIs, SQL. The system requests only what it needs to fulfill the functionality the user is currently dealing with.

Earlier in my career I was dealing with large enterprise desktop applications that talked directly to the database, with no centralized API. Some of them had thousands of individual desktop clients hitting a single multi-tenant SQL server. No problem, SQL Server would handle it without breaking a sweat. The bandwidth to an individual client was fine. It was fast. And that was 20 years ago.


I did face this scenario a couple of times when working on applications that needed to work offline. A few of those I was involved in:

- trying to reduce the paper-based catalogs we were sending out to customers; those were a couple thousand pages, not cheap to produce, not cheap to send, and would get outdated very quickly. The web app would pull the entire catalog at first load so customers could go to remote locations and still be able to use the catalog

- a web app for sales, intended to be used on customer sites, containing all the marketing materials and much more for on-site presentations, without ever having to connect anywhere


> In what use case do you run into that?

Simple single-tenant sites with limited content volume and no permission checks.

Of course you can also do the same in a multi-tenant environment but naturally you wouldn’t be returning all rows in the database.


This is pretty much what many SPAs were/are: dump the entire relevant app state on first load, and then offline is no problem.


Imagine you're building a SaaS which lets your users create a website, hotel booking platform, and channel manager.

The user can open the application, get the database, and the frontend does all the offline updates the user wants to perform. The user can update their website, add bookings they received on the phone, check what prices they set. This is all blazingly fast with no loading time, no matter your internet connection, because no further communication with the server is needed.

At some point they press a magic Publish button and the database gets synced with upstream and uploaded. The upstream service can take the data and publish a website for its users, update availabilities, update prices, etc.

It would be a better user experience than 99% of the SaaS out there.


Most of the Internet spent the last 10-20 years moving away from this because business metrics generally benefit from realtime updates and background autosaves. By and large, users expect systems to autosave their work-in-progress nowadays.

You might remember horror stories from the late 1900s when people would lose critical data and documents because they "forgot to hit Save and the computer crashed." Or, maybe you've never experienced this — because a couple decades of distributed-systems engineers and UX researchers made that problem obsolete.

So now we've... reinvented the Save button, but somehow needed a Paxos implementation to do it?

Everything old is new again, I guess.


Heh, the last company I worked for had a (very popular, very successful) 20 year old desktop app that worked by downloading the client's entire database on login, then doing regular syncs with the back end and sending batch updates as the user made changes. It was an awful system that everyone hated, and the company was desperately trying to figure out how to move away from it without doing a complete rewrite. Maybe I should let them know that they're actually ahead of the curve if they can just hold out for a few more years...


I feel like I'm taking crazy pills. As HNers, we should want more control, not less, over our work. Auto-save means that the company, no matter how nefarious, decides on its terms when my work is saved, regardless of what I've entered into the document (I write documents in a very stream-of-consciousness manner).

I want to decide when to save, and how, not the application. I am not a product, I am a user!!!


Seems like on-by-default autosave with the option to disable, plus a separate save button, would satisfy all your requirements. That's a pretty common set of customizations on saves from what I've seen.


This makes me smile, because it's satire, right?


No. I value a meaningful "Last updated at" timestamp.

If I open a Word document, redact a few bits, save to PDF, and close it, I don't want my changes saved, but they are (real scenario from last week; I'd only just moved the file into OneDrive, so autosave had turned itself on).

Ditto sorting and filtering spreadsheets.

Software engineers: By all means save in the background so I never lose anything, but don't assume I want the latest changes.


What you want, then, is versioning of your documents. MS Office and Google Suite do that.

This is not trivial to implement, though. Complexity will depend a lot on the application and the data. I'd say it'll only make sense for mature apps. A startup most likely won't consider this in its roadmap.


Hell, even word processors have had autosave enabled by default for at least a decade.


> The user can open the application, get the database, and the frontend does all the offline updates the user wants to perform.

"get the database"

How small do you think these databases are?!

You're going to download the entire hotel booking platform's database?

For how many hotels? One at a time, and then get another database? Or are you getting a SQLite booking database for every hotel in the world? And you're going to send them to each user? For what date range?

And even if that were possible, you then have to commit your offline updates. What if someone else booked the date range before you? Now your SQLite copy is stale. Download the entire thing again? There could have been countless changes from other users in the time since you last got your copy.

This explanation just leaves me even more confused. It's illogical.


I have some experience with a comparable platform. In a typical CouchDB/PouchDB design, syncs and offline work are common and easy, and it's pretty close to database-based design if you get fancy with views and JavaScript-based logic.

For this project, I'd do:

* A database for each user (this is natural on CouchDB; one can enable a setting to create the user DB automatically when a user is created, and the platform won't choke on many databases) - for per-user data, like comments; if the user has already reserved a booking we can mirror the booking data there, plus some info regarding the hotel.

* Common synced databases - some general info regarding the platform and hotels. Preferably not too large (IIRC there's no limit to attachments with 3.x, but it sounds risky). Anything too large we'd do online or use data provided with the application.

* A large booking database which isn't synced and must be modified online with a REST call - we don't have to allow every action offline, so here I wouldn't entirely dispense with an API. This obviously needs the online component for making sure bookings don't conflict. It could even be a regular relational database.

I think it is possible to implement this the CouchDB database-centric way: a provisional offline reservation written to the user database, followed by some rejection mechanism on the server when syncing, but I don't think there's much value to the user there. This design, however, would let us sync not all the data but a much smaller portion while still supporting offline use.
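A sketch of the server-side plumbing for the first two pieces, against CouchDB's HTTP API (credentials and database names are placeholders; the couch_peruser setting would normally create the per-user DBs for you):

    import requests

    COUCH = "http://admin:password@localhost:5984"

    # Per-user database (couch_peruser automates this on user creation).
    requests.put(f"{COUCH}/userdb-alice")

    # One-shot replication of the shared hotel-info DB into the user DB;
    # the client (e.g. PouchDB) then syncs the user DB locally.
    requests.post(
        f"{COUCH}/_replicate",
        json={"source": f"{COUCH}/hotel-info",
              "target": f"{COUCH}/userdb-alice"},
    )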

---

It sounds very doable, rather similar to a project I was involved with, but I miss SQL a lot on that platform (JavaScript NoSQL is not my favorite to work with). A SQLite-based approach is an interesting alternative.


Web applications aren't going to get bigger just by themselves! /s

Jokes aside, this extreme optimization for development does have impacts on user experience. The amount of bandwidth/storage etc. used by "just ship the whole DB" type applications would surely suck for most people outside of the "I measure my bandwidth in Gbps" crowd?


Not the entire database, just your own data for your own hotel.

It doesn't sound that unreasonable.

Even if you have 3-4 hotels your data won't be significant.


You did not read the use case. It's not an equivalent of Booking. More like an equivalent of Wix for hotels.


Thanks for the example, it helped me understand the idea.

The thing that I'm unclear on is how do I figure out what data to ship to the user? Like if I don't already have a database-per-user then I have to extract only the data the user has permission to see, and ship that as a database?

That would be the case even if I had database-per-customer - not every user is necessarily able to see all the data owned by the organization they're a part of.

It seems like a lot of extra work, and error-prone, too (what could be worse than accidentally shipping the wrong data _as an operational database_ to a client?)
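Concretely, I'm imagining something like this for every user, which is exactly the extra machinery that worries me (schema and permission model invented for illustration):

    import sqlite3

    def build_user_db(central_path, out_path, user_id):
        # Carve a per-user database out of the central one, copying only
        # rows this user is permitted to see; out_path is then shipped.
        con = sqlite3.connect(central_path)
        con.execute("ATTACH DATABASE ? AS export", (out_path,))
        # Clone the schema, then copy only the permitted rows.
        con.execute("CREATE TABLE export.bookings AS SELECT * FROM bookings WHERE 0")
        con.execute(
            "INSERT INTO export.bookings SELECT * FROM bookings WHERE owner_id = ?",
            (user_id,),
        )
        con.commit()
        con.close()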

Edit: the article covers this at the bottom, but IMO it's a show-stopper. How many applications of any real complexity can actually implement database-per-user (database-per-tenant is probably not enough, as I mentioned above)? As soon as you need any kind of management or permissions functionality in your app, you can't ship the database anymore, so you may as well start off not shipping it.


Once upon a time, this is how applications worked and this was a pretty good experience.

Now, you'll introduce a huge amount of user frustration around why their changes never made it to the central database. If you have users closing a form, especially if it's on a webpage, they are going to expect the write to happen at the same time the form closes. Making a bunch of changes and batching them in a single sync is going to be confusing to a ton of users.


What happens if two users want to make changes at the same time? What if their changes conflict? How do you even detect conflicts?


It could be done with "select for update" and etags: triggers checking the etag received against the stored one, plus triggers generating a new etag on every create/update.
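Roughly like this, minus the trigger plumbing (sketched with SQLite from Python; SQLite has no SELECT FOR UPDATE, so this leans entirely on the compare-and-swap UPDATE, and the table/column names are made up):

    import sqlite3
    import uuid

    def update_booking(con, booking_id, new_status, etag_seen):
        # Compare-and-swap: apply the change only if the row still
        # carries the etag the client originally read.
        cur = con.execute(
            "UPDATE bookings SET status = ?, etag = ? WHERE id = ? AND etag = ?",
            (new_status, str(uuid.uuid4()), booking_id, etag_seen),
        )
        con.commit()
        if cur.rowcount == 0:
            raise RuntimeError("conflict: booking changed since it was read")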


If everyone's working on their own local copy of the database then the select for update isn't going to do anything. The issue is later syncing and detecting conflicts. It's actually easier to do this with a centralized DB, hence why everything works this way. If an app is mostly offline then, yeah, ship it with a SQLite DB and life will be good, but for something like a hotel booking app that doesn't actually solve many problems and makes existing ones harder.


Wasn't this Lotus Notes?


What is this? A sci-fi novel?

By the time your dude downloads the database half the hotels on your imaginary website have changed status.


A lot of SPAs already operate this way by leveraging things like localStorage in the user's browser.


To do what? Isn't this limited to 5 MB?

I just checked a heavy user for our enterprise SPA. localStorage is using ~12 KB.

I don't know what people are just throwing in localStorage but they certainly aren't pulling down much of the database (which still would have to be checked to ensure cached data in there isn't stale).


> Because in the last 30 years I've not run into a situation where I needed clients to execute SQL queries on tiny, static databases, let alone ones that still need updating.

It was not uncommon in the early 2000s for applications targeting mobile professionals and sales types to work this way, except that instead of SQLite they used things like Notes¹, Jet² and MSDE³. By modern standards these would be considered "tiny, mostly static databases", often not exceeding a gigabyte. Instead of connecting over the internet they used serial dial-up file transfer protocols to synchronize with a central database or a mainframe. People would typically download information in the morning, make updates offline, and synchronize at the end of the day. A slow trickle of periodic bidirectional updates ensured everyone had a reasonably fresh copy of the information they needed.

1- https://en.wikipedia.org/wiki/HCL_Domino

2- https://en.wikipedia.org/wiki/Access_Database_Engine

3- https://en.wikipedia.org/wiki/MSDE


I think the point is basically "unless there's a good reason for your API to look different than your DB schema, ontologically speaking, then the schema has already effectively defined your API and just let people interact with the DB", or, alternatively, "bake in as much of the data constraints as possible into your DB schema, and only when you can't enforce them with column constraints or DB extensions should you add APIs on top".
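A toy example of that second formulation, pushing the contract into column constraints (SQLite via Python; the table is invented):

    import sqlite3

    con = sqlite3.connect(":memory:")
    # The schema enforces the rules an API layer would otherwise police.
    con.execute("""
        CREATE TABLE bookings (
            id     INTEGER PRIMARY KEY,
            nights INTEGER NOT NULL CHECK (nights > 0),
            status TEXT NOT NULL
                   CHECK (status IN ('pending', 'confirmed', 'cancelled'))
        )
    """)
    try:
        # Rejected at the database: nights must be positive.
        con.execute("INSERT INTO bookings (nights, status) VALUES (0, 'pending')")
    except sqlite3.IntegrityError as e:
        print("rejected:", e)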


I have seen some setups where a streaming tool (Kafka or whatever) is used to just ship all updates to a database out to certain clients. But I think this is a dubious architecture, since it comes with basically all the downsides of a database that many applications all share, minus the write-contention one.
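For concreteness, that pattern tends to look something like this on the client side (using kafka-python; the topic name and event shape are invented):

    import json
    import sqlite3
    from kafka import KafkaConsumer  # pip install kafka-python

    con = sqlite3.connect("local_replica.db")
    consumer = KafkaConsumer(
        "db-changes",  # hypothetical change-feed topic
        bootstrap_servers="localhost:9092",
        value_deserializer=json.loads,
    )
    # Apply each change event to the local copy as it arrives.
    for msg in consumer:
        ev = msg.value
        con.execute(
            "INSERT OR REPLACE INTO items (id, payload) VALUES (?, ?)",
            (ev["id"], ev["payload"]),
        )
        con.commit()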


Supabase Realtime handles this really elegantly.



