Building Resilient Platforms: Insights from Over Twenty Years in Mission-Critical Infrastructure
Nov 10, 2025 11 min read
Key Takeaways
- Great platforms deliver an intuitive experience by hiding complexity and appearing magical; they operate so seamlessly that users take them for granted and never need to think about the underlying infrastructure.
- Platform builders must balance the "three Ss" (stability, security, scalability) as non-negotiable requirements while maintaining an evergreen approach to continuous updates and patching.
- Success requires being opinionated about what to build and saying "no" frequently; it's better to do fewer things exceptionally well than many things poorly.
- Open source has been instrumental in building modern platforms at scale, providing community innovation, portability across environments, and the ability to read and extend the underlying code.
- Building the right culture with empowered teams and diversity of thought is the foundation; great culture drives great teams, which, in turn, build great products.
Introduction
Building resilient platforms requires understanding both the art and science of creating infrastructure that others depend on for critical applications. Drawing from over twenty years of experience building various platforms that support critical applications, this perspective applies to anyone who builds software consumed by others at scale, whether developing infrastructure platforms, software development platforms, messaging systems, or banking platforms.
My journey into financial services began unexpectedly. I started my career in oil and gas, working on data acquisition and communications with ships offshore, then transitioned to networking and telecommunications. After moving to New York, I found my way into the financial sector, which at the time offered the best opportunities for low-level infrastructure professionals. This proved fortuitous, because banks care deeply about uptime and resiliency, and invest significantly in underlying infrastructure. The field has provided continuous investment opportunities in platform work at scale, where downtime is unacceptable and the stakes remain consistently high.
Working across major financial institutions, including American Express, JPMorgan Chase, and Goldman Sachs, has meant building platforms supporting trading systems, banking systems, and credit card processing. All are systems that cannot tolerate downtime, security breaches, or inability to scale with business needs.
What is a Platform?
The dictionary describes a platform as a raised, level surface where people or things can stand. This metaphor extends beautifully to technology. Think about train platforms or city infrastructure like sewage systems. These platforms exist quietly in the background, hiding complexity underneath that users never need to consider. They become so integral that they're taken for granted.
In technology terms, a platform represents a set of integrated technologies used as a base to develop other applications or processes. The best platform builders succeed when they are taken for granted, seeing success not in recognition, but in silence.
Users can work without ever thinking about the underlying infrastructure, because the platforms simply function, consistently and reliably, making them invisible. Like the power infrastructure energizing our devices, the complexity behind these systems remains hidden while enabling users to focus on their primary tasks.
The best manifestation of platforms today is the cloud platform, public or private, which underpins the vast majority of modern software applications, from brands we know well (e.g., Uber, Netflix) to complex enterprise systems. The complexity is hidden so that consumers never have to think about it.
The Principles of Infrastructure Platforms
Building a platform that performs this smoothly is no accident; it’s the product of careful engineering. It takes experience, discipline, and a set of principles that guide how to think about building at scale. Over the years, I wrote these principles down in a white paper for my broader team to capture what has helped me on my journey of building platforms.
Deliver an Intuitive Experience
Great platforms prioritize intuitive experiences above all else. Fifteen years ago, using public cloud infrastructure was painful: the complexity, difficulty, and obtuseness made consumption challenging. Today's public cloud platforms provide intuitive experiences because they integrate seamlessly while hiding complexity beneath elegant interfaces. They appear magical.
Arthur C. Clarke's observation that sufficiently advanced technology becomes indistinguishable from magic applies perfectly to infrastructure platforms. The magic comes from seamless integration that invisibly handles complexity. Similarly, Steve Jobs championed simplicity as the ultimate sophistication: Making something appear simple requires tremendous effort. Exposing complexity is easy; creating something that just works intuitively is hard.
Successfully hiding complexity while delivering powerful functionality defines platform excellence. The sophisticated engineering underneath should remain invisible to users who simply want to accomplish their tasks without friction.
Build Common and Interchangeable Components
Platforms succeed through integration, ensuring components work together seamlessly. Typical systems involve many different building blocks (e.g., messaging, databases, web tiers) that must interlock perfectly. Common observability exemplifies this principle. Observability was once fragmented, and troubleshooting across components proved impossible; modern cloud platforms now provide unified logging, telemetry, and dashboards.
The Lego metaphor illustrates this perfectly: a finite set of interlocking blocks provides infinite creative possibilities. Platforms must dictate standards for interoperability, such as observability systems or identity management approaches. This insistence on commonality delivers seamless integration for consumers. While representing freedom from choice for the internal components that underpin a platform, commonality creates freedom through simplicity for users.
Consider identity management in a three-tier application: Deployment becomes seamless when identity management works consistently across all layers. The beauty of interchangeable components lies in this predictable, reliable interaction pattern that users can trust.
Use the Three Ss: Stability, Security, and Scalability
Financial services platforms support mission-critical operations such as trading systems and credit card processing. These systems have zero tolerance for downtime, security breaches, or scaling failures. The three Ss represent non-negotiable requirements. Unlike real estate, where you might optimize for two out of three (e.g., location, price, size), platforms must deliver all three without compromise.
Stability means consistent, reliable operation at all times. However, achieving stability through stagnation creates security vulnerabilities from unpatched systems. Patching introduces changes that can impact stability while enabling security. Scalability requires building for 10x growth: Successful platforms attract users like an unstoppable force, and many platforms fail because they cannot scale with customer demand.
Balancing these three requirements demands continuous attention and investment. While cost can fluctuate based on business needs and priorities, these three fundamentals establish an inviolable foundation. Sometimes scaling takes precedence, requiring temporary adjustments to patching cycles. The key lies in maintaining minimum acceptable levels across all three dimensions while optimizing based on immediate needs.
Be Evergreen
Maintaining security requires continuous environmental maintenance: staying evergreen. This becomes incredibly challenging at scale. Managing tens of thousands of servers, hundreds of thousands or millions of VMs, containers, databases, and messaging brokers means maintaining millions of components with individual lifecycles. Some require quarterly updates, others bi-annual, some monthly, and increasingly, even daily patches.
Managing this maintenance without customer disruption proves exceptionally difficult. Many financial services applications were not designed for rolling upgrades. The complexity compounds rapidly: patching databases that cannot tolerate downtime, coordinating change windows, managing API contract changes.
The temptation to defer maintenance always exists, but falling behind creates insurmountable technical debt. From a security perspective, the increased exploitation of zero-day vulnerabilities by bad actors demonstrates how quickly deferred maintenance becomes crisis management. Staying evergreen requires eternal vigilance and commitment. Once you fall behind, catching up becomes nearly impossible. This principle demands upfront planning and unwavering execution.
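The patch cadences described in this section (daily, monthly, quarterly, bi-annual) can be tracked mechanically. The following is a minimal sketch, not a description of any real tooling: the `Component` type, the `overdue` function, and the component names and dates are all hypothetical, and it assumes each component records its last patch date and a maximum allowed interval between patches.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record for one patchable component."""
    name: str
    last_patched: date
    cadence_days: int  # maximum allowed days between patches (assumed policy)


def overdue(components, today):
    """Return (name, days_overdue) for stale components, most overdue first."""
    stale = []
    for c in components:
        due = c.last_patched + timedelta(days=c.cadence_days)
        if today > due:
            stale.append((c.name, (today - due).days))
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative fleet; names, dates, and cadences are made up.
    fleet = [
        Component("os-image", date(2025, 1, 1), cadence_days=30),
        Component("database", date(2025, 3, 1), cadence_days=90),
        Component("broker", date(2025, 1, 15), cadence_days=180),
    ]
    for name, days in overdue(fleet, today=date(2025, 6, 1)):
        print(f"{name}: {days} days overdue")
```

Sorting by days overdue puts the oldest debt first, which reflects the point above that the further a component falls behind, the harder it is to catch up.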
Avoid Undifferentiated Heavy Lifting
Engineering teams naturally gravitate toward building everything from scratch. Smart infrastructure engineers might propose building custom databases, messaging brokers, or even operating systems. However, financial institutions don't need custom database engines when excellent options already exist.
Focus must remain on what clients actually need. For example, making databases enterprise-ready might mean wrapping Postgres with automatic failover, backups, and compliance controls, but not rewriting the database engine itself. Use existing foundations to add only the minimum necessary controls, and apply Occam's razor (the simplest way is usually the best) rigorously: do only what's necessary.
Building on existing foundations narrows the focus to business value rather than reinventing the wheel. Resist the temptation to overengineer, regardless of how intellectually stimulating the challenge might be.
Be Opinionated
Platform owners must make decisive choices about scope. Clients will request everything imaginable, but saying "no" protects platform integrity. Being unpopular often indicates good decision-making, because doing fewer things with excellence beats doing many things poorly.
Consider relational database engines: Do you really need MySQL, Postgres, Oracle, CockroachDB, and MS SQL? Focus on the mainstream eighty percent of use cases. Accept that you cannot please everyone and that sometimes clients must adapt their software to your platform rather than demand endless customization.
Experience shows that saying "yes" too often leads to unsustainable sprawl. Resources spread too thinly across too many initiatives result in poor maintenance, instability, and unhappy clients. It is better to disappoint some users initially than fail everyone eventually.
Retiring technical debt requires similar conviction. Beloved but obsolete systems must be decommissioned despite user protests. No client wants to port their application, but platform sustainability demands these difficult decisions. The magic lies in managing such transitions with empathy while maintaining resolve.
Be Long-Term "Greedy"
Platform decisions require long-term thinking. Like bringing home a puppy, you're not just committing to the cute beginning, but to years of care and feeding. That adorable puppy becomes a dog requiring walks, food, and veterinary care for a decade or more. Platforms demand similar long-term commitment.
Once clients deploy on your platform, you've made an implicit contract to support them for years. Before accepting new responsibilities, consider whether you're willing to fund feature teams for the next decade, handle maintenance, and provide support through thick and thin.
Learning from mistakes helps refine decision-making. For example, building multiple container platforms, first with Tomcat, then Docker, then Mesosphere, before finally adopting Kubernetes, created migration nightmares. Each platform accumulated clients who resisted moving. Sometimes waiting for industry consensus proves wiser than early adoption.
Letting the community innovate provides valuable signals. When six different groups independently build Kafka implementations, it indicates a genuine need for a common platform. The art of innovation lies in timing: not too early, when consensus remains fluid, and not too late, when fragmentation has already occurred.
Share Responsibility
Platform providers and consumers share an implicit contract that benefits from explicit documentation. Beyond API contracts and SDKs, clarify control boundaries, uptime expectations, Service-Level Objectives (SLOs), and maintenance windows.
Explicitly reserving quarterly patch windows for stateful databases prevents future conflicts. Without upfront communication, clients will resist any downtime. Document SLOs, measure them consistently, and discuss them openly.
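One way to make the trade-off between an SLO and a reserved patch window explicit is to express both in minutes of allowed downtime. The sketch below is illustrative only: the `Slo` type, the `fits_in_budget` helper, the 99.9% target, and the 90-day window are assumptions chosen for the example, not figures from any real agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability objective over a fixed window."""
    name: str
    target: float        # e.g. 0.999 means 99.9% availability
    window_minutes: int  # length of the measurement window

    def error_budget_minutes(self) -> float:
        """Total downtime the objective allows within the window."""
        return (1.0 - self.target) * self.window_minutes

    def budget_remaining(self, downtime_minutes: float) -> float:
        """Allowed downtime left; a negative value means the SLO is breached."""
        return self.error_budget_minutes() - downtime_minutes


def fits_in_budget(slo: Slo, downtime_so_far: float, planned_window: float) -> bool:
    """Check whether a planned maintenance window fits in the remaining budget."""
    return planned_window <= slo.budget_remaining(downtime_so_far)


if __name__ == "__main__":
    quarter = 90 * 24 * 60  # a 90-day quarter, in minutes
    db_slo = Slo(name="managed-postgres", target=0.999, window_minutes=quarter)
    print(f"budget: {db_slo.error_budget_minutes():.1f} minutes per quarter")
    print(fits_in_budget(db_slo, downtime_so_far=30.0, planned_window=60.0))
```

Under these assumed numbers, a 99.9% quarterly target allows roughly 130 minutes of downtime, so a 60-minute patch window fits only if unplanned outages stay small. Writing the arithmetic down gives provider and client a shared basis for the conversation.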
Be clear about the division of responsibilities. Platforms can provide many services, but not everything. Think of the problem as ice cubes versus snowflakes: Platforms can efficiently produce various ice cube shapes, but each unique snowflake requires custom craftsmanship. Clarity about capabilities and limitations enables productive partnerships.
Abstract, Don't Obfuscate
Clients have varying needs and comfort levels. Some prefer UI interaction without understanding underlying mechanics, such as most car drivers, who simply want reliable transportation without understanding engines. Others prefer Terraform configurations or direct API access.
Build multiple abstraction levels while maintaining accessibility to all of them. When something breaks, users need visibility into underlying systems. If platforms completely obfuscate internals, troubleshooting becomes impossible.
The evolution from assembly language to compilers to AI-assisted coding demonstrates the continuous development of abstraction layers. Each new layer provides value while maintaining access to lower levels when needed. Modern developers can generate entire applications through AI while still being able to inspect and modify the resulting code when necessary.
The key is not to dumb down the platform interaction for all, but to offer multiple levels of abstraction and allow the customer to enter where they see fit.
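The idea of multiple entry points can be sketched in a few lines. Everything here is hypothetical: `DatabaseSpec`, `PRESETS`, `expand`, `provision`, and `create_database` are invented names, the preset values are placeholders, and `provision` is a stand-in that returns a dictionary instead of calling a real service.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class DatabaseSpec:
    """Low-level request: every knob is visible and settable."""
    engine: str
    cpu_cores: int
    storage_gb: int
    replicas: int
    backup_retention_days: int


# Opinionated presets behind the high-level entry point (illustrative values).
PRESETS = {
    "small": DatabaseSpec("postgres", 2, 100, 2, 14),
    "large": DatabaseSpec("postgres", 16, 2000, 3, 35),
}


def expand(size: str) -> DatabaseSpec:
    """Show exactly what a preset resolves to, so nothing is hidden."""
    return PRESETS[size]


def provision(spec: DatabaseSpec) -> dict:
    """Low-level entry point: stand-in for a real provisioning call."""
    return {"status": "requested", **asdict(spec)}


def create_database(size: str = "small") -> dict:
    """High-level entry point: one argument, same path underneath."""
    return provision(expand(size))


if __name__ == "__main__":
    # A user who wants simplicity calls the high-level function.
    print(create_database())
    # A user who needs control starts from the preset and overrides one field.
    custom = replace(expand("small"), storage_gb=500)
    print(provision(custom))
```

Both entry points go through the same `provision` path, and `expand` lets anyone see what the simple call resolves to. That is the difference between abstracting and obfuscating in this sketch.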
Stand on the Shoulders of Giants
Open source has transformed platform building, especially in financial services. Isaac Newton's famous observation about how the evolution of science builds upon prior research and insights ("If I have seen further it is by standing on the shoulders of giants.") perfectly captures open source's value proposition. The community provides enormous mindshare, shared engineering resources, continuous innovation, portability due to common components, and unprecedented transparency through visible code.
For example, Linux now powers mission-critical trading and banking systems across major financial institutions. Open-source foundations such as Kafka, Cassandra, Kubernetes, Postgres, MySQL, and Terraform allow platform builders to focus on differentiation rather than commodity infrastructure.
Open source is not about free software: There is no free lunch, and developers need compensation regardless of licensing models. The value of open source comes from community collaboration, portability, and standardization. For example, Postgres runs on AWS Aurora, Azure, and GCP with minimal modification, providing true portability.
Open source does not diminish proprietary software's value; instead, it provides unique advantages for platform building. The combination of open source foundations with proprietary differentiation creates powerful, sustainable platforms.
Build Culture, the Rest Takes Care of Itself
Platform building at scale requires multiple teams building different components that must integrate consistently. Achieving this consistency demands the right culture. Great culture creates great teams, and great teams build great products. Don't confuse temporary success with sustainable excellence; repeatability requires a cultural foundation.
Three cultural dimensions prove critical. First, empowered teams make decisions without seeking permission and within defined boundaries. Teams need freedom to innovate on their core responsibilities, while adhering to platform standards.
Second, diversity of thought prevents groupthink and drives innovation. Homogeneous teams lack the creative tension necessary for excellence. When friends hire friends with similar backgrounds and experiences, you end up with ten versions of the same person thinking the same way. Building diverse teams requires intentional hiring, promotion, and management practices.
Third, team-oriented leadership sometimes requires sacrificing individual preferences for team dynamics. Great teams need balanced composition, which means talented individuals might need to move to where they contribute most effectively.
Cultural transformation takes months to years, not days to weeks. This long-term investment in culture represents the most important platform-building activity. Daily focus on culture creates the foundation for everything else.
Conclusion
Platforms permeate modern life, building upon each other in endless layers. Like geological strata, each platform layer provides a foundation for the next. Great platforms achieve such stability that users take them for granted. They possess permanence without requiring thought.
Platform building remains a job for unsung heroes. No one calls to celebrate when platforms work perfectly; they only call when things break. The highest compliment is silence. Success means remaining transparent, unknown, and unsung. When users start calling you directly, something has gone wrong.
These principles provide a framework for building platforms that others can depend upon, platforms that hide complexity while delivering value, platforms built to last. Whether building infrastructure, creating internal tools, or developing any software others will consume, these principles guide the path toward truly resilient platforms at scale.
The journey requires patience, discipline, and commitment to excellence, but the result enables others to build amazing things they could never have created alone.