June 26, 2026 · 15 min read
System Design for Working Engineers, Not Interview Prep
The Interview Trap
If you look at most system design tutorials, you get an extreme use case. Design Twitter. Design YouTube. Scale it to a billion users. Draw boxes on a whiteboard for 45 minutes.
Do you think your app will be used by a billion users on day one? The answer is almost always no. But the tutorials don't teach you what to do when you have 500 users, unclear requirements, a team of four, and a quarter to ship something that works.
Real system design is nothing like a whiteboard interview. You don't get clean requirements, you don't design from scratch, and nobody asks you to handle a billion requests per second on day one.
Real System Design Starts with Questions, Not Diagrams
The very first thing that matters in system design is something most tutorials skip entirely: unclear and chaotic requirements. In the real world, requirements don't come as a clean problem statement. They come from non-technical business teams, and you have to keep asking follow-up questions until you have the clarity you need.
Ask as many questions as possible. Understand your functional and non-functional requirements. Which features need to be synchronous and which can be async? What are the read and write load patterns? What is the maximum and average number of concurrent users right now? What does authentication look like? Do you need role-based access control?
These questions drive your choices. You don't always need an axe where a knife will do. A minimal design, a reasonable growth forecast, and a plan for 3, 6, and 9 months out will take you in the right direction.
Some things the situation demands immediately will take longer to build properly than the schedule allows. Taking a known, bounded hit now and fixing it at a planned point later, without losing that balance, is important. Weighing what will be expensive to change later, in dollar cost or human effort, is how real architectural decisions get made.
Pushing Back on Bad Requirements
Requirements often come from non-technical business teams, and sometimes you need to push back and explain why something should not be done the way they expect.
Here is a real example. A business stakeholder once asked us to duplicate data into a second Kafka topic because they predicted the existing topic would not handle the load from a new subscriber. The technical reality? Kafka is built for exactly this. A new consumer group on the same topic works without impacting existing consumers at all. If you don't push back, you end up creating tech debt with support and maintenance costs forever, just for replicating data that never needed to be replicated.
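To make the consumer-group point concrete, here is a toy in-memory model. It is not the Kafka client API: the `Topic` class, its methods, and the group names are all hypothetical. The only thing it tries to show is that each consumer group keeps its own offset into the same log, so a new subscriber neither copies data nor moves anyone else's read position.

```python
from collections import defaultdict


class Topic:
    """Toy append-only log with per-group offsets (illustrative, not Kafka's API)."""

    def __init__(self) -> None:
        self.log: list[str] = []
        # Each consumer group tracks its own read position into the same log.
        self.offsets: dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> None:
        self.log.append(message)

    def poll(self, group: str, max_records: int = 10) -> list[str]:
        """Return the next unread records for `group`, advancing only its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


topic = Topic()
for i in range(3):
    topic.publish(f"order-{i}")

existing = topic.poll("billing")     # the existing subscriber reads all three records
new = topic.poll("analytics")        # a new group reads the same three records
again = topic.poll("billing")        # billing's position was untouched by analytics
```

The data is stored once, and the second group costs one extra offset entry rather than a second topic to maintain.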
Trade-off Decisions Nobody Teaches
Monolith vs Microservices
Typically the very first thing engineers want to talk about is microservices and how they can help. But do you realistically have even 100 users on the product? Why do you need K8s, Docker, distributed tracing, cross-cutting async messaging, and service mesh? Do you really need that scale, or are you doing it to make your resume look better?
If your real users do not yet number in the thousands, a modular monolith is usually the best choice. Deploy everything as one server on Linux with a reverse proxy and a CNAME record. That simple. You need a database, sure. But you don't need Kafka, distributed tracing, auto-scaling, or any complex distributed computing to begin with.
When predictable growth comes, add monitoring and observability to understand which requests are hitting hardest. Decouple the modules doing heavy work into independent microservices. Then pivot. That is the right sequence.
Synchronous vs Async
If you don't need to process something immediately, decoupling via async helps. If it is fire-and-forget, use a simple queue. If you need multiple services to consume the same event with highly scalable producers and consumers, use Kafka. If the user is waiting for a response, keep it synchronous via a RESTful API because it needs to happen right now.
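The rules in that paragraph are simple enough to write down as a function. This is only a sketch of the decision, under the assumption that two inputs are enough: whether a user is waiting, and how many services need the same event. The names `Channel` and `choose_channel` are mine, not from any library.

```python
from enum import Enum


class Channel(Enum):
    SYNC_API = "synchronous API call"
    SIMPLE_QUEUE = "simple queue"
    EVENT_LOG = "event log such as Kafka"


def choose_channel(user_is_waiting: bool, consumers: int) -> Channel:
    """Pick a communication style from the two questions in the text."""
    if user_is_waiting:
        # The response is needed right now, so stay synchronous.
        return Channel.SYNC_API
    if consumers > 1:
        # Several services consume the same event independently.
        return Channel.EVENT_LOG
    # Fire-and-forget work with a single consumer.
    return Channel.SIMPLE_QUEUE
```

Real decisions have more inputs (ordering, retention, replay), but if you cannot answer these two questions yet, you are not ready to pick the tool.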
Build vs Buy
Rule of thumb: never reinvent the wheel. If something already exists at low cost and does the job, buy it. If companies like OpenAI and Anthropic are not building their own payment systems and instead use established financial integrators, then you should trust that. If giants are not building everything from scratch, why should you? Building only makes sense when no existing solution fits your needs.
Consistency vs Availability: Real-World CAP
When you are dealing with transactions that need ACID guarantees, use SQL. Ticket booking, inventory updates, financial debits and credits. These cannot tolerate stale reads or lost writes.
If you need consistency and partition tolerance, where a request should return an error rather than stale data, a CP-oriented NoSQL store works better. Social media feeds, messaging, analytics, and streaming. If you need availability and partition tolerance and can accept eventual consistency, wide-column stores like Cassandra fit well. IoT data, time series, high write throughput with low read frequency.
Perfect Architecture vs Shipping This Quarter
Perfect architecture is always the goal, but if you can balance it with shipping this quarter, that brings real business value and revenue. Find the healthy mix. Build a base that requires very little change even if the actual decision evolves later.
For example, call your audit logging service synchronously, tightly coupled to the request, because you don't have async processing yet. That ships real business value now. Later, when async communication is added, you decouple it without changing the end-user experience.
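One way to build that kind of base is to hide the audit call behind a small interface from day one. The sketch below is an assumption about how that could look, not a prescribed design: `AuditSink`, `SyncAuditSink`, `QueuedAuditSink`, and `transfer_funds` are hypothetical names, and a plain list stands in for the audit store.

```python
import queue
import threading
from typing import Protocol


class AuditSink(Protocol):
    def record(self, event: str) -> None: ...


class SyncAuditSink:
    """Day-one version: writes the audit entry inline, before the request returns."""

    def __init__(self, store: list) -> None:
        self.store = store

    def record(self, event: str) -> None:
        self.store.append(event)


class QueuedAuditSink:
    """Later version: hands the entry to a background worker and returns at once."""

    def __init__(self, store: list) -> None:
        self.store = store
        self._queue: queue.Queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, event: str) -> None:
        self._queue.put(event)

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            self.store.append(event)
            self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued entry has been written (useful in tests)."""
        self._queue.join()


def transfer_funds(amount: int, audit: AuditSink) -> str:
    """Business code depends only on the AuditSink interface."""
    audit.record(f"transfer:{amount}")
    return "ok"
```

Swapping `SyncAuditSink` for `QueuedAuditSink` changes one line where the sink is constructed. `transfer_funds` and everything the user sees stay the same.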
Analytics is another one. You might not have the full setup of MySQL CDC to Debezium to ClickHouse yet. But you can start by ingesting specific tables into ClickHouse directly for analytics. Solve it elegantly later when DevOps capacity allows the full event streaming pipeline.
When to Scale and When Not To
The time to scale is based on observability data and predictive customer expansion patterns. Your business understanding combined with analytical thinking will surface the signals that tell you when scaling is actually needed.
Before jumping to horizontal or vertical scaling, check the basics first. Does your database have optimal indexing? Is your application connection pool configured properly? Are there N+1 queries firing hundreds of calls where one would do? These are high-level checks. Deeper concepts like partitioning and sharding are problems you encounter with billions of records, not a few million.
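If the N+1 pattern is unfamiliar, here is a minimal reproduction. I am using Python's built-in `sqlite3` with an in-memory database purely so the example runs anywhere; the `authors` and `posts` tables and both function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Linus'), (3, 'Grace');
    INSERT INTO posts VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'L1'), (4, 3, 'G1');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    queries = 0
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """The same data from a single JOIN (every author here has at least one post)."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1


slow, slow_queries = titles_n_plus_one(conn)    # 1 + 3 = 4 queries
fast, fast_queries = titles_single_join(conn)   # 1 query, same result
```

With three authors the difference is four queries versus one. With a few hundred rows on a page, it is the difference between a fast page and a support ticket.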
Horizontal scaling is generally the better approach because it can raise throughput and lets you add or remove capacity without downtime, provided the service is stateless enough to run as multiple instances. But only when you actually need it.
A Real Story: Premature Scaling Gone Wrong
I worked with a company that had fewer than 100 customers and fewer than 5,000 business transactions in total. They had already added Docker, Kubernetes, Kafka, distributed monitoring, and auto-scaling. Now they had two problems instead of one: the real business problem and a tech problem.
Very few developers on the team understood microservices as a whole. Nobody knew DevOps practices well enough to manage how scaling actually works. It was not just slowing business delivery but also burning money on cloud costs, because nobody knew how to optimize the infrastructure bill.
A double-edged sword. Premature scaling without proper architectural guidance creates more problems than it solves.
Every Architecture Decision Is a Cost Decision
Will you use managed Kubernetes or bare K8s? New Relic or Dynatrace or open-source alternatives? Managed database or self-hosted? It all depends on who owns what. If you have DevOps engineers who can manage the nightmare of persistent storage, networking, constant upgrades, and maintenance, then self-hosted can work. If the answer is no, managed is the better choice, but it comes with a higher price tag.
It is equally important to monitor your cloud costs and understand the incremental bills. Is your Docker image lifecycle policy set to delete old images within a few days? Is your S3 storage persistent forever or only for a retention period? Have you optimized or dropped high-cardinality metrics in your distributed tracing to save cost? How about networking costs for transporting data across regions? It all adds up.
Here is a question I ask teams all the time: will you optimize your MySQL queries and indexing, or will you throw more money at bigger database instances so the app functions at a dollar cost that keeps increasing? Unless the root cause is identified and fixed, you are just burning money.
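Finding the root cause often starts with asking the database how it plans to run a query. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in so it runs without a server; MySQL has its own `EXPLAIN` with different output. The `orders` table, the index name, and the `query_plan` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def query_plan(conn, sql, params):
    """Return SQLite's plan description for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn, QUERY, (42,))   # reports a scan of the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, QUERY, (42,))    # reports a search using the new index
```

The exact wording of the plan varies by SQLite version, so read it rather than parse it. The point is that one `CREATE INDEX` can do what a bigger instance only postpones.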
Small teams with few users almost always face overly expensive microservices hosting and management. The operational overhead, debugging complexity, and cognitive load on the team need to be balanced against the actual benefits. A mentor can help you find that balance before the cloud bill teaches you the hard way.
Database Design Is Architecture
I covered this in detail in my 726 mentoring sessions post: pages taking 10+ seconds to load because nobody thought about indexing, N+1 queries firing hundreds of database calls, unused columns bloating tables, no caching layer, and complex business logic with if-else ladders that nobody can follow.
Schema design decisions haunt you for years. The table structure you choose in month one determines how painful your queries are in year two. Foreign keys, indexes, data types, normalization vs denormalization. These are architecture decisions, not database admin tasks.
For SQL vs NoSQL, the real answer is simpler than the blog posts make it: if you need transactions and relationships, use SQL. If you need flexible schema with high write throughput and can tolerate eventual consistency, use NoSQL. Most applications should start with SQL and add NoSQL for specific use cases when needed.
Caching strategy is another design decision that gets treated as an afterthought. Cache the data that is read frequently but changes rarely. Product catalogs, user profiles, configuration data. Invalidate on write. Start with a simple TTL-based approach and add event-driven invalidation when your system complexity demands it.
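Here is a minimal sketch of that starting point: a TTL cache with invalidate-on-write, sitting in front of a dict that stands in for the database. `TTLCache` and `ProductCatalog` are hypothetical names, and the injectable `clock` parameter is my addition so expiry can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-process cache where each entry expires after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]   # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: object) -> None:
        self._entries[key] = (self.clock() + self.ttl, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


class ProductCatalog:
    """Read-through cache in front of a dict standing in for the database."""

    def __init__(self, db: dict, cache: TTLCache) -> None:
        self.db = db
        self.cache = cache
        self.db_reads = 0

    def get_price(self, sku: str):
        cached = self.cache.get(sku)
        if cached is not None:
            return cached
        self.db_reads += 1           # cache miss: go to the database
        price = self.db[sku]
        self.cache.put(sku, price)
        return price

    def set_price(self, sku: str, price) -> None:
        self.db[sku] = price
        self.cache.invalidate(sku)   # invalidate on write
```

A usage example: `catalog = ProductCatalog({"sku-1": 10}, TTLCache(ttl_seconds=60))`. Two consecutive `get_price("sku-1")` calls hit the database once, and a `set_price` forces the next read back to the database. Once you run more than one process, this in-memory version stops being enough and a shared cache takes its place.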
Observability Is a Design Decision, Not an Afterthought
Will your app monitoring give you real analytical insights, or is it just collecting logs nobody reads? P99 and P95 latency metrics, structured logs, alerts that tell you production broke at 12 PM instead of finding out at 2 PM when customers start calling.
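Why P95 and P99 instead of the average? A small worked example makes it obvious. The `percentile` helper below uses the nearest-rank method, which is one of several common definitions, and the latency numbers are invented for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 hypothetical request latencies in milliseconds: mostly fast, a few very slow.
latencies_ms = [20] * 94 + [45] * 4 + [900, 1500]

mean_ms = statistics.mean(latencies_ms)   # 44.6
p50 = percentile(latencies_ms, 50)        # 20
p95 = percentile(latencies_ms, 95)        # 45
p99 = percentile(latencies_ms, 99)        # 900
```

The mean says 44.6 ms and looks healthy. P99 says one request in a hundred waits 900 ms or more, which is the number your unhappiest customers actually experience.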
The real power of observability is the transactional breakdown: cache hit vs miss ratio, time spent in business logic, time spent in SQL queries, external API call latency, queue processing time. When you can see all of this in one trace, debugging goes from guesswork to precision.
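A real tracing library does far more than this, but the core idea of a per-request breakdown fits in a few lines. The `Trace` class below is a hypothetical, hand-rolled stand-in rather than the API of any tracing product, and the span names are just examples.

```python
import time
from contextlib import contextmanager


class Trace:
    """Accumulates time spent in named spans during a single request."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.spans = {}

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            elapsed = self.clock() - start
            self.spans[name] = self.spans.get(name, 0.0) + elapsed

    def breakdown(self):
        """Return each span's share of the total recorded time, in percent."""
        total = sum(self.spans.values())
        if not total:
            return {}
        return {name: round(100 * spent / total, 1)
                for name, spent in self.spans.items()}


def handle_request(trace):
    with trace.span("cache_lookup"):
        pass   # check the cache here
    with trace.span("sql"):
        pass   # run queries here
    with trace.span("external_api"):
        pass   # call the third-party service here
```

Call `handle_request(trace)` and then `trace.breakdown()` to see where the time went. If `sql` is 70% of the request, you know where to look before anyone suggests a bigger instance.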
Being preventive rather than reactive is what separates teams that sleep well from teams that get paged at 2 AM. I wrote more about this in my vibe coding post: logging everywhere but observability nowhere is the most common pattern I see.
The Architecture Review Nobody Does
Most teams never review their architecture after the initial design. Not a priority. Time constraints. Always something more urgent. But it does not take long for a system to become legacy if it does not keep up with the right patterns.
Consider how fast things move in our industry. Installed software gave way to cloud. Cloud is now being reshaped by AI. Teams still running Java 8 in 2026 because nobody reviewed the stack. Apps built on deprecated frameworks because there was never a checkpoint to evaluate alternatives. It takes no time for things to become stale if nobody reviews how they are done.
What should an architecture review cover? Coding practices. Failure handling. Coupling between services. Operational burden. Whether the team can actually maintain what was built.
In many architecture review meetings, I see people arrive without having done their homework. They cannot say which alternatives they considered or why they rejected them. They have no POC results to support their design. If you made a choice without evaluating options, it is pure assumption that you struck gold. In reality it might be silver, and you would not know without evidence.
The best architecture reviews are the ones where someone walks in and says “here is what I considered, here is what I ruled out, and here is why this approach wins.” That is the difference between engineering and guessing.
System Design Is Not a Diagram. It Is a Series of Decisions.
Every section in this post is a decision. Monolith or microservices. Sync or async. SQL or NoSQL. Build or buy. Scale now or scale later. Spend on managed services or invest in DevOps. Ship the perfect version or ship what works this quarter.
The tutorials teach you to draw boxes and arrows. Real system design is about making the right call at the right time with incomplete information and real constraints. That is a skill you build through experience, not through watching YouTube videos about designing Netflix.
If you want to get better at these decisions, the fastest path is working with someone who has made them before and can help you see what you are not seeing yet.
Want to sharpen your system design thinking?
I mentor working engineers on real-world architecture decisions, not interview prep. 726+ sessions, 5.0 rating. System design, trade-off analysis, scaling decisions, and production engineering.