
We’re a small dev team (3 full-stack web devs + 1 mobile dev) working on a B2B IoT monitoring platform for an industrial energy component manufacturer. Think: batteries, inverters, chargers. We have 3 device types now, with plans to support 6–7 soon.

We're building:

  • A minimalist mobile app for clients (React Native)
  • A web dashboard for internal teams (Next.js)
  • An admin panel for system control

Load Characteristics

  • ~100,000 devices sending data every minute
  • Message size: 100–500 bytes
  • Time-series data that needs long-term storage
  • Real-time updates needed for dashboards
  • Multi-tenancy — clients can only view their own devices
  • We prefer self-hosted infrastructure for cost control

Current Stack Consideration

  • Backend: Node.js + TypeScript + Express
  • Frontend: Next.js + TypeScript
  • Mobile: React Native
  • Queue: Redis + Bull or RabbitMQ
  • Database: MongoDB (self-hosted) vs TimescaleDB + PostgreSQL
  • Hosting: Self-hosted VPS vs Dedicated Server
  • Tools: PM2, nginx, Cloudflare, Coolify (deployments), Kubernetes (maybe, later)

The Two Major Questions We're Facing:

1. MongoDB vs TimescaleDB for dynamic IoT schemas and time-series ingestion? We need to store incoming data with flexible schemas (new product types have different fields), but also support efficient time-series querying (e.g., trends, performance over time).

  • MongoDB seems flexible schema-wise, but might fall short on time-series performance.
  • TimescaleDB has strong time-series support but feels more rigid schema-wise.
  • Is there a proven pattern or hybrid approach that allows schema flexibility and good time-series performance?

2. How to structure ingestion for 100K writes/min while supporting schema evolution? We’re worried about bottlenecks and future pain if we handle ingestion, schema evolution, and querying in one system.

  • Should we decouple ingestion (e.g., raw JSON into a write-optimized store), then transform/normalize later?
  • How do we avoid breaking the system every time a new product with a new schema is introduced?
  • We’ve also considered storing a "data blob" per device and extracting fields on-demand — not sure if that scales.

Additional Sub-Questions (feel free to address any of these if they fall within your area of expertise):

  • RabbitMQ vs Kafka — Is Kafka worth adopting now or premature for our stage?
  • Real-time updates — Any architectural patterns that work well at this scale? (Polling, WebSockets, SSE?)
  • Multi-tenancy — Best-practice for securely scoping data per client in both DB and APIs?
  • Queue consumers — Should we custom-load-balance our job consumers or rely on built-in scaling?
  • VPS sizing — Any heuristics for choosing VPS sizes for this workload? When to go dedicated?
  • DevOps automation — We’re small. What lightweight CI/CD or IaC tools would you suggest? (Currently using Coolify)
  • Any known bottlenecks, security traps, or reliability pitfalls from similar projects?

We're still early in the build phase and want to make smart decisions upfront. If any of you have dealt with similar problems in IoT, real-time dashboards, or large-scale data ingestion — your advice would mean a lot.

Thanks!

asked Aug 4 at 14:08
  • I haven't used it but MQTT is purported to be designed for this use case. Commented Aug 4 at 15:27
  • see Green fields, blue skies, and the white board - what is too broad? Commented Aug 4 at 17:54
  • Although there are a lot of questions in this post, the overall question is limited and coherent: "How do you scale ingestion and reporting on large amounts of time-based data with many schemas?" Commented Aug 5 at 8:01
  • Please include the actual update frequency requirements rather than just "real time". Real time might mean very different things depending on context. Some control systems might need an update interval measured in microseconds, while something like weather considers minutes to be "real time". Commented Aug 5 at 9:24

3 Answers


We need to store incoming data with flexible schemas (new product types have different fields), but also support efficient time-series querying

That's the neat thing, you can't!

The essential problem is that schemas enable fast and coherent queries. Just because you can store a JSON blob with no schema doesn't mean your reports will run fast or continue to work when the data structure changes.

Also, though 100k writes per minute is fairly low in the scheme of things, you can easily imagine problems as you scale.

You have already mentioned the general solutions to these issues:

  1. Split ingestion into steps: ingest whatever is sent, then have a second post-processing layer which adapts the data into a reportable schema (a concrete sketch follows after this list).

    This gives you a protection/anti-corruption/Extract-Transform-Load layer, allowing you to deal with schema changes and with having multiple schemas live at any given time.

  2. Shard the database.

    You will want to allow the ingestion to branch out: try to collect as locally as possible and package up the data to keep the bandwidth down, but also allow for sharding by product and tenant to protect your system from scaling issues.

  3. Reporting/Querying

    100k data points are more than you can display on a normal graph, even if you can ingest them. Add another transform layer which rolls up metrics into averages and totals for fast reporting (a sketch of such a rollup follows below).
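
To make point 1 concrete, here is a minimal sketch of a raw-first ingestion path, assuming the Node.js + TypeScript + PostgreSQL/TimescaleDB options mentioned in the question. The table and column names (raw_readings, battery_metrics, voltage, temperature) are illustrative assumptions, not anything from the post.

    // Sketch only: "ingest whatever arrives, transform later".
    // Assumes PostgreSQL with a JSONB payload column; all names are made up.
    import { Pool } from "pg";

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Step 1: accept whatever the device sends, untouched, plus routing metadata.
    export async function ingestRaw(tenantId: string, deviceId: string, payload: unknown): Promise<void> {
      await pool.query(
        `INSERT INTO raw_readings (tenant_id, device_id, received_at, payload)
         VALUES ($1, $2, now(), $3)`,
        [tenantId, deviceId, JSON.stringify(payload)] // payload column is JSONB
      );
    }

    // Step 2, in a separate worker: copy the fields you understand into a typed,
    // indexed table. Unknown product types simply stay in raw_readings until a
    // mapping exists for them, so a new schema never breaks ingestion.
    export async function transformBatteryRow(row: {
      tenant_id: string;
      device_id: string;
      received_at: Date;
      payload: { voltage?: number; temperature?: number }; // assumed field names
    }): Promise<void> {
      if (row.payload.voltage === undefined) return; // no mapping yet: leave it raw
      await pool.query(
        `INSERT INTO battery_metrics (tenant_id, device_id, time, voltage, temperature)
         VALUES ($1, $2, $3, $4, $5)`,
        [row.tenant_id, row.device_id, row.received_at, row.payload.voltage, row.payload.temperature ?? null]
      );
    }

This split also gives you a natural place to enforce the tenant scoping mentioned in the question, since tenant_id travels with every row from the first insert onwards.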

If you do these things it can work well, but you have to abandon, to some extent, the idea that you can throw any form of data at the system and make a quick change to a report, which is the feature that things like Splunk or Elastic promise.

You have to plan and understand changes to the schemas, and add transforms and indexes to support your queries.
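
On point 3, the rollup can be as simple as a scheduled job that pre-aggregates into a summary table (if TimescaleDB is chosen, its continuous aggregates automate this). A rough sketch using the same made-up names as above, and assuming a unique constraint on (tenant_id, device_id, bucket):

    // Sketch only: hourly pre-aggregation so dashboards never scan raw points.
    import { Pool } from "pg";

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    export async function rollupPreviousHour(): Promise<void> {
      await pool.query(`
        INSERT INTO battery_metrics_hourly
          (tenant_id, device_id, bucket, avg_voltage, max_temperature, samples)
        SELECT tenant_id,
               device_id,
               date_trunc('hour', time) AS bucket,
               avg(voltage),
               max(temperature),
               count(*)
        FROM battery_metrics
        WHERE time >= date_trunc('hour', now()) - interval '1 hour'
          AND time <  date_trunc('hour', now())
        GROUP BY tenant_id, device_id, date_trunc('hour', time)
        ON CONFLICT (tenant_id, device_id, bucket) DO NOTHING
      `);
    }

    // Run it from whatever scheduler you already have (cron, a Bull repeatable
    // job, etc.); dashboards then read battery_metrics_hourly instead of the
    // raw table.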

Sub-questions:

These are a bit too detailed to go into, but I think you should check out some of the off-the-shelf ETL systems before you pick a language for your backend. This wheel has definitely been invented, and the product you choose will determine, to some extent, what language you write in.

answered Aug 4 at 16:50

MongoDB vs TimescaleDB for dynamic IoT schemas and time-series ingestion?

Both are very capable of handling this sort of load. Note that you have one really handy aspect, I quote: "multi-tenancy — clients can only view their own devices." This means, essentially, that in this particular case you can scale horizontally with ease. Say you have a farm of five servers handling 1,000 users. After a big marketing event, you know that the number of users will double. No problem: you'll now need five extra servers.

How to structure ingestion for 100K writes/min while supporting schema evolution? We’re worried about bottlenecks and future pain if we handle ingestion, schema evolution, and querying in one system.

I don't see an actual problem here. Don't worry now about the bottlenecks you might have at some future moment. That's premature optimization. Measure what you have now, identify the current bottlenecks, if any; solve them.

Since you can scale horizontally, the likely issue you'll encounter is at the network level, if you are unlucky enough to have an on-premises solution. Right now it's not a problem, as you only receive about 500 KB/s—that's nothing by today's standards. But if it starts growing (for instance, if the messages you receive start containing more and more information), then it could become a problem, forcing you to move to the cloud (or to a more capable data center). But, again, that's not what you should be focusing on right now.
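
For a rough sense of scale (my arithmetic, using the figures from the question): 100,000 devices × 100–500 bytes per minute is 10–50 MB per minute, i.e. roughly 170–830 KB/s of ingress, which is where the ~500 KB/s ballpark above comes from.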

How do we avoid breaking the system every time a new product with a new schema is introduced?

By allowing the schema to be flexible. That's what you already did by choosing a database such as MongoDB.
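
To illustrate what "flexible" can mean without giving up all structure, here is a minimal TypeScript sketch of a document shape with a fixed envelope plus a free-form, product-specific payload; the field names are illustrative assumptions, not something from the question.

    // Sketch only: every product type shares the same envelope; product-specific
    // fields go into `payload`, so a new device type never changes the envelope.
    interface DeviceReading {
      tenantId: string;                 // who is allowed to see this reading
      deviceId: string;
      productType: string;              // e.g. "battery", "inverter", "charger"
      receivedAt: string;               // ISO-8601 timestamp
      payload: Record<string, unknown>; // product-specific fields live here
    }

    // Dashboards and access control key off the envelope (tenant, device, time,
    // type); only the per-product transforms need to understand payload.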

want to make smart decisions upfront

This is not how software development works. You should postpone the decisions until the very last moment, not the other way around. This also means you should make it both possible and easy to switch from one thing to another.

Practical example: I start writing a piece of software that would need to store some data, but I'm unsure how the data will be structured, and whether it would be better to store it in PostgreSQL or MongoDB.

Instead of spending my time speculating about which one is better, I would keep it easy: start drafting the piece of software and store everything in a JSON file on disk, while knowing that I will have to switch to another storage mechanism later on.

Knowing that means I'll put in place the abstractions that will help me make the switch.

At the same time, I would release the MVP earlier, as I didn't spend time provisioning any database—storing everything in a plain JSON file is easy.

Now, I have an MVP that provides me with data about the actual usage patterns and the different storage issues I should address. With this data, I can make a concrete decision about the database I would switch to. No more speculation here. Or if I have doubts about the two databases, I can always draft two implementations of the database layer, and test them.

If, later on, it appears that I made a mistake, or circumstances changed, I can always swap the database layer for another one. No big deal.
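
For what it's worth, here is a minimal TypeScript sketch of that kind of swappable storage layer. The interface, its methods, and the file format are just illustrations of the idea, not a prescription.

    // Sketch only: the application codes against a small interface, and the
    // first implementation is newline-delimited JSON in a local file.
    import { promises as fs } from "fs";

    export interface ReadingStore {
      append(reading: Record<string, unknown>): Promise<void>;
      latestForDevice(deviceId: string): Promise<Record<string, unknown> | undefined>;
    }

    export class FileReadingStore implements ReadingStore {
      constructor(private readonly path: string) {}

      async append(reading: Record<string, unknown>): Promise<void> {
        await fs.appendFile(this.path, JSON.stringify(reading) + "\n", "utf8");
      }

      async latestForDevice(deviceId: string): Promise<Record<string, unknown> | undefined> {
        let text: string;
        try {
          text = await fs.readFile(this.path, "utf8");
        } catch {
          return undefined; // no file yet means no readings yet
        }
        const lines = text.split("\n").filter((line) => line.trim().length > 0);
        for (let i = lines.length - 1; i >= 0; i--) {
          const reading = JSON.parse(lines[i]) as Record<string, unknown>;
          if (reading.deviceId === deviceId) return reading;
        }
        return undefined;
      }
    }

    // Later, a PostgresReadingStore or MongoReadingStore implements the same
    // interface, and nothing else in the codebase has to change.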

If any of you have dealt with similar problems in IoT, real-time dashboards, or large-scale data ingestion — your advice would mean a lot.

Not really.

You see, I do have real-time dashboards and do handle IoT data. I can write a post about how I do it, and explain why I'm using C++, or plain text files. This would be completely irrelevant to you, as my project addresses the quirks that exist in my context (such as hosting the dashboards on a Raspberry Pi 3), and not the ones you will encounter.

Similarly, you could find out how large companies are solving this problem. But here again, different contexts require different solutions. Maybe some project developed by Boeing to gather metrics from all the robotics involved in assembling a helicopter looks very close to what you do. But the fact that they have chosen Fortran and Oracle may have to do with corporate policies that you do not have.

answered Aug 4 at 15:00
  • "You should postpone the decisions until the very last moment, not the other way around." Well said. Flexibility is the smart decision, and that means waiting if you can. Commented Aug 6 at 8:45

~100,000 devices sending data every minute

I think your main problems are likely to arise in this area.

100k different concurrent connections to a single central data store, with 100k reliable writes a minute sustained, is something that sounds improbable.

My experience of seeing real-time data collected from unattended hardware devices is that sustaining a few dozen simultaneous connections can be problematic, and a maintenance man was occupied full-time dealing with a few hundred.

Your average web developer is probably coping with a website that averages fewer than one concurrent user a minute, and probably no more than a few hundred concurrent (typically human) users at peak times. And for the tricky stuff this entails, he's probably using standard technologies and public infrastructure within ordinary assumptions (assumptions that definitely don't fit your application).

One hundred thousand concurrent connections, producing writes mechanically every minute, 24/7/365? That seems stratospheric in relation to my frame of reference. Personally I don't think it's commensurate with a four-man web+mobile development team.

The range of potential circumstances is too great to address even a small fraction of possible issues that could be in play, but I think at the very least your system would require something to concentrate the many connections and consolidate data at multiple levels, so that the fan-out at each level is far less than 100,000-to-1, and the number of concurrent writes at the central store is far less than 100k/min sustained.
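
As one illustration of what such a concentrator could look like, here is a minimal TypeScript sketch of an edge gateway that buffers readings and forwards them to the central store in batches. The batch size, flush interval, and endpoint URL are placeholder assumptions, and a production gateway would also need retries, local spooling, and back-pressure.

    // Sketch only: turn many small device writes into a few bulk writes toward
    // the central store. Assumes Node 18+ for the global fetch.
    type Reading = { deviceId: string; receivedAt: string; payload: unknown };

    const buffer: Reading[] = [];
    const MAX_BATCH = 1000;         // flush once this many readings are queued...
    const FLUSH_INTERVAL_MS = 2000; // ...or at least this often

    async function flush(): Promise<void> {
      if (buffer.length === 0) return;
      const batch = buffer.splice(0, buffer.length);
      // One bulk request instead of batch.length individual connections.
      await fetch("https://ingest.example.internal/batch", {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify(batch),
      });
    }

    // Called by whatever receives device messages at the edge (an MQTT broker
    // callback, an HTTP handler, etc.).
    export function onDeviceMessage(reading: Reading): void {
      buffer.push(reading);
      if (buffer.length >= MAX_BATCH) void flush();
    }

    setInterval(() => void flush(), FLUSH_INTERVAL_MS);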

answered Aug 4 at 21:37
  • "Concurrent connections to a single central data store"—it doesn't need to be a single store, as sharding is a perfectly viable option here. Also, for example, SQL Server's limit on concurrent connections is 32,767, far from the "few dozen" connections you mentioned. Finally, the author never said there are 100,000 connections at the same time. It may be that there are 100,000 requests per minute (although it's not clear from the original question). If every request takes 10 ms to process, that's about 17 concurrent connections. Commented Aug 4 at 21:54
  • @ArseniMourzenko, that's just not my experience with these things. The assumption that 100,000 requests per minute will be steady and evenly distributed over the entire minute is not justifiable, and a request that takes 10 ms under ideal conditions might be blocked for just a second; you now have a backlog of over 1,600 connections waiting, on a system reckoned to handle just 17 at once. A server or system that has just an hour of downtime could, on resuming, face 100,000 concurrent connections, each maybe pushing 60 times more payload than normal. Commented Aug 4 at 23:08
  • Those are indeed well-known problems. Look up the term "back pressure" for one way to solve them. In short, the side that sends data gets a feedback signal telling it when the other side is ready to continue receiving. You will get information loss (quite expected anyway during downtime), but at least you won't crush the system that receives the data. Commented Aug 5 at 6:21
  • I agree with Steve: 100k per minute can be handled, but you know some customer is going to say "OK, now do my 1 million smart lightbulbs, oh, and I need per-second accuracy," and it's going to fall over. You need some sort of local, or near-local, collection and batching. Commented Aug 5 at 8:05
  • @ArseniMourzenko, I think with this, it's a case of fools rush in where angels fear to tread. It's not the laws of physics that stop these applications, it's the misapprehension of overall complexity involved, and the amount of resources and expertise available for a solution. Commented Aug 5 at 13:16
