
Building Distributed Event-Driven Architectures Across Multi-Cloud Boundaries

Nov 19, 2025 14 min read

Key Takeaways

  • Multi-cloud is inevitable, not optional. With eighty-six percent of organizations already operating in a multi-cloud environment, it's a reality driven by modernization and FinTech competition.
  • Latency requires code-level optimizations, including compression, batch optimization, calibrated timeouts, and account-based partitioning.
  • Resilience extends beyond immediate availability. Event stores, comprehensive policies, and systematic replay help to both survive failures and automatically recover.
  • Event ordering and duplicates need a multi-layer defense: Apply sequence numbers with deferred processing, unique IDs, idempotent configs, and duplicate checking.
  • Start small with comprehensive observability, embrace failures, and invest in robust event backbones and team training.

*All thoughts and opinions shared below are my own and don’t represent my employer’s views

Picture this: It is 3 AM, and your phone is buzzing with alerts. A critical financial transaction processing system has gone down, but here's the twist: The failure cascades across AWS, Azure, and your on-premise infrastructure. Neither the failure nor your debugging session will respect cloud boundaries.

Welcome to multi-cloud event-driven architectures. Event-driven architectures, the backbone of modern distributed systems, face particular challenges in multi-cloud environments.

The elegant simplicity of "fire an event and forget" becomes a complex orchestration of latency optimization, failure recovery, and data consistency across provider boundaries. Yet, when done right, multi-cloud event-driven architectures offer unprecedented resilience, performance, and business agility.

This article covers the real-world challenges of building event-driven systems across multiple cloud providers.

We'll explore four critical areas of multi-cloud complexity (latency optimization, resilience patterns, event ordering, and duplicate handling) through practical code examples and proven solutions.

Using a realistic financial services case study, you'll learn actionable strategies for managing cross-cloud communication and building systems that thrive in multi-provider environments.

The Multi-Cloud Reality

If you're still debating whether multi-cloud is worth the complexity, you're asking the wrong question. That ship has sailed. With eighty-six percent of organizations already operating across multiple cloud providers, the question isn't whether you'll build distributed systems across cloud boundaries, it's whether you'll do it well or spend your nights troubleshooting cascading failures.

The Flexera 2025 State of the Cloud Report delivers a stark reality check: Only twelve percent of organizations operate on a single cloud provider, with a mere two percent maintaining single private cloud environments. The remaining eighty-six percent have embraced multi-cloud strategies, with seventy percent utilizing hybrid architectures that weave together on-premise systems with multiple cloud providers.

This isn't a trend, it is a fundamental shift in how we architect systems. But why has multi-cloud become inevitable rather than optional?

The answer lies in the collision between business needs and technical reality:

  • Regulatory requirements often dictate data residency across different regions and providers.
  • Best-of-breed services are scattered across providers. For example, an organization might prefer AWS for security, Azure for AI, and Google Cloud for analytics for its particular use case.
  • Concentration risk mitigation demands avoiding single points of failure at the provider level.
  • Legacy modernization creates natural migration paths to different clouds based on workload characteristics.
  • Vendor negotiation power increases when you're not locked into a single provider.

Yet most architectural discussions still treat multi-cloud as an edge case rather than the default scenario.

Fictional Case Study: FinBank's Multi-Cloud Journey

Let's examine a realistic scenario through a fictitious, century-old traditional bank (let's call it FinBank) that is modernizing its infrastructure to compete with agile FinTech startups while maintaining stringent regulatory compliance.

FinBank's original architecture represents decades of evolution: core banking capabilities, credit decisioning, and card services wrapped in microservices and exposed through API gateways. Their on-premise, event-driven architecture has microservices emitting events and logs to platform services, with hundreds of interconnected components handling customer engagement, analytics, business intelligence, regulatory reporting, and third-party integrations.

Figure 1: FinBank's Architecture


I am purposefully showing the complex architecture to highlight that this isn't just a high-level diagram; it's a living ecosystem with hundreds of interconnected components. Now imagine moving those interconnected components across multiple clouds: that is the complexity FinBank is dealing with.

Their migration strategy follows a pragmatic approach:

  • Core banking systems remain on-premise due to regulatory requirements and migration complexity.
  • Risk management components migrate to AWS to leverage their comprehensive security services and compliance certifications.
  • Advanced analytics and business intelligence move to Azure for robust data and AI capabilities.
  • DevOps services centralize in Azure to provide unified pipeline deployment across all environments.

Figure 2: FinBank's Multi-Cloud Strategy


This seemingly straightforward migration immediately surfaces complex challenges that every multi-cloud architect must address. The real question isn't whether these challenges will emerge, it's whether you'll be prepared for them.

Challenges in Multi-Cloud Event-Driven Architecture

Latency

Multi-cloud latency isn't just about network speed, it's about the compound effect of architectural decisions across cloud boundaries. Consider a transaction that needs to traverse from on-premise to AWS for risk assessment, then to Azure for analytics processing, and back to on-premise for core banking updates. Each hop introduces latency, but the cumulative effect can transform a sub-100 ms transaction into a multi-second operation.

Different cloud providers have different mechanisms to handle connectivity between components. For example, Azure offers ExpressRoute and AWS has Direct Connect. These provide reliable, dedicated links that bypass the public internet and deliver high-performing, low-latency connectivity between components.

But networking improvements alone won't solve code-level inefficiencies. You also need to ensure you're taking care of latency considerations at your code level.

Here is a typical transaction service implementation that ignores multi-cloud realities:

------------------------
Publisher:
private readonly IKafkaProducer _kafkaProducer;

public TransactionService(IKafkaProducer kafkaProducer)
{
  _kafkaProducer = kafkaProducer;
}
public async Task CreateTransaction(Transaction transaction)
{
  // Create and publish event immediately
  var transactionEvent = new TransactionCreatedEvent
  {
    TransactionId = transaction.Id,
    AccountId = transaction.AccountId,
    Amount = transaction.Amount,
    Timestamp = DateTime.UtcNow
  };
  await _kafkaProducer.ProduceAsync("transactions", transactionEvent);
}
--------------------------------
Subscribers:
------------------------------
// Risk management subscriber (migrated to AWS in the FinBank scenario)
private async Task ProcessTransactionEvent(TransactionCreatedEvent transactionEvent)
{
  try
  {
    // Perform risk analysis
    var riskScore = await PerformRiskAnalysis(transactionEvent);
    // Take action based on risk score
    if (riskScore > RiskThreshold)
    {
      await FlagForReview(transactionEvent);
    }
  }
  catch (Exception ex)
  {
    // Log error and continue
    _logger.LogError(ex, "Error processing transaction event");
  }
}
-----------------------------
// Analytics subscriber (migrated to Azure in the FinBank scenario)
private async Task ProcessTransactionEvent(TransactionCreatedEvent transactionEvent)
{
  try
  {
    // Process analytics
    var insights = await GenerateInsights(transactionEvent);
    // Store in data lake
    await _dataLakeClient.StoreAsync("transaction-insights", insights);
  }
  catch (Exception ex)
  {
    // Log error and continue
    _logger.LogError(ex, "Error processing transaction for analytics");
  }
}   
  
--------------------------------------

This simple implementation has no specific considerations for latency between clouds. How can you fix it? There are several optimizations you can implement.

The optimizations include:

  • Compression configuration to reduce bandwidth use when events cross cloud boundaries, because every byte impacts latency.
  • Batch optimization with larger batch sizes to reduce end-to-end latency by forty to sixty percent, even after accounting for the small linger delay needed to fill larger batches.
  • Calibrated timeout values; defaults tend to be either too long, wasting resources, or too short, causing premature retries and cascading failures.
  • Account-based partitioning to ensure related transactions follow defined routes, enabling effective caching strategies.

Sample fix for Publisher:

------------------
public async Task CreateTransaction(Transaction transaction)
{
  #region Create Event...
  // Configure message with appropriate settings for multi-cloud
  var producerConfig = new ProducerConfig
  {
    // Compression to reduce bandwidth usage
    CompressionType = CompressionType.Snappy,
    // Batch optimization for cross-cloud transfers
    BatchSize = 32768,  // Larger batches for efficient transfer
    LingerMs = 20,    // Small delay to improve batching
    // Network optimizations for cross-cloud latency
    SocketTimeoutMs = 30000,    // Longer socket timeout for cross-cloud
    DeliveryTimeoutMs = 30000,   // Extended timeout for cross-cloud
    SocketNagleDisable = true    // Disable Nagle's algorithm
  };
  // Use account-based partitioning for consistent routing
  string topic = $"transactions-{transaction.AccountId % 10}";
  string key = transaction.AccountId.ToString();
  // Publish to Kafka with configured message settings
  await _kafkaProducer.ProduceAsync(topic, transactionEvent, key, producerConfig);
}
   
----------------

Sample fix for Subscriber:

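On the subscriber side, similar tuning applies to the consumer configuration: favor fewer, larger fetches across the cross-cloud link, calibrate session and poll timeouts for longer round trips, and commit offsets only after processing succeeds. The following is a minimal sketch, assuming the Confluent.Kafka consumer API; the broker address, consumer group, topic name, and timeout values are illustrative placeholders to be tuned for your workload rather than definitive settings.

------------------
// Minimal consumer-side sketch (Confluent.Kafka assumed; values are illustrative).
var consumerConfig = new ConsumerConfig
{
  BootstrapServers = "<cross-cloud-broker-endpoint>",  // placeholder
  GroupId = "risk-analysis-consumers",                 // placeholder consumer group
  // Favor fewer, larger fetches over the cross-cloud link
  FetchMinBytes = 65536,
  FetchWaitMaxMs = 100,
  // Calibrated timeouts for longer cross-cloud round trips
  SocketTimeoutMs = 30000,
  SessionTimeoutMs = 45000,
  MaxPollIntervalMs = 300000,
  // Commit offsets only after successful processing
  EnableAutoCommit = false,
  AutoOffsetReset = AutoOffsetReset.Earliest
};

using var consumer = new ConsumerBuilder<string, string>(consumerConfig).Build();
consumer.Subscribe("transactions-0");  // one of the account-partitioned topics from the publisher example

while (true)
{
  var result = consumer.Consume(TimeSpan.FromSeconds(1));
  if (result == null) continue;

  var transactionEvent = JsonSerializer.Deserialize<TransactionCreatedEvent>(result.Message.Value);
  await ProcessTransactionEvent(transactionEvent);

  consumer.Commit(result);  // commit only after the handler succeeds
}
------------------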

These are the considerations you should look at when thinking about latency in multi-cloud distributed environments. All of these settings differ based on your specific application requirements, message sizes, and throughput needs. The key is recognizing that you need deliberate code-level optimizations to address latency in multi-cloud scenarios.

These latency optimizations aren't micro-optimizations, they are fundamental architectural decisions that determine whether your system scales gracefully or fails under load.

Resilience: Beyond Immediate Availability

Here is an uncomfortable truth: Most resilience strategies focus on the wrong problem. As engineers, we typically put our effort into handling failures while an outage is in progress or a service component is down. Equally important is how you recover from those failures after the outage is over. Neglecting recovery creates systems that "fail fast" but "recover never".

Consider what happens during a typical multi-cloud outage:

Figure 3: Multi-cloud outage scenario

Without proper resilience design:

  • Failed events are lost forever
  • Services continue hammering failed dependencies without circuit breakers
  • After outages, missing replay capabilities mean data inconsistencies persist indefinitely

The problem compounds in multi-cloud environments where different providers have different failure modes, recovery times, and SLA guarantees. Your system's resilience is only as strong as its weakest cross-cloud dependency.

Building comprehensive resilience requires a systematic approach:

  • Event stores: persistent event storage using the Outbox Pattern, a dedicated event store, or Kafka retention

Figure 4: Event Store in EDA

  • Circuit breakers prevent cascading failures by failing fast when dependencies are unhealthy
  • Systematic replay including automated recovery mechanisms that restore lost events after outages

The following code example shows a sample implementation with the fixes above:

------------
public async Task CreateTransaction(Transaction transaction)
{
  // Save to database first
  await _transactionRepository.SaveAsync(transaction);
  // Event Creation - Same as before
  // Latency settings from previous example
  // Use resilience policy with retry pattern for publishing
  await _resiliencePolicy
    .WaitAndRetryAsync(
      5, // Configurable, Retry 5 times at application level
      attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)), // Exponential backoff
      onRetry: (ex, timeSpan, attempt, ctx) =>
      {
        _logger.LogWarning(ex, "Retry {Attempt} publishing transaction event {TransactionId}", 
                  attempt, transaction.Id);
      })
    .ExecuteAsync(async () =>
    {
      // Partitioning strategy for consistent routing
      var topic = $"transactions-{transaction.AccountId % 10}";
      var key = transaction.AccountId.ToString();
      // Publish with delivery handler to confirm delivery
      var deliveryResult = await _kafkaProducer.ProduceAsync(
        topic,
        transactionEvent,
        key,
        producerConfig);
      if (deliveryResult.Status != PersistenceStatus.Persisted)
      {
        throw new KafkaDeliveryException(
          $"Failed to deliver message: {deliveryResult.Status}");
      }
      // Log successful delivery
      _logger.LogInformation(
        "Transaction event published successfully: {TransactionId}, Partition: {Partition}, Offset: {Offset}",
        transaction.Id, deliveryResult.Partition, deliveryResult.Offset);
    });
}
   
----------------------
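The retry policy above covers transient publish failures, but the circuit breaker called out in the list is not shown. Here is a minimal sketch assuming the Polly library; the thresholds, exception types, and the event-store replay call are illustrative rather than taken from the code above.

------------
// Illustrative circuit breaker around the cross-cloud publish (Polly assumed).
// After three consecutive failures the circuit opens for 30 seconds, so callers
// fail fast instead of waiting on a dead cross-cloud link, then a probe is allowed.
private static readonly AsyncCircuitBreakerPolicy _publishBreaker =
  Policy
    .Handle<KafkaDeliveryException>()
    .Or<TimeoutException>()
    .CircuitBreakerAsync(
      exceptionsAllowedBeforeBreaking: 3,
      durationOfBreak: TimeSpan.FromSeconds(30));

public async Task PublishWithBreakerAsync(TransactionCreatedEvent transactionEvent, Func<Task> publish)
{
  try
  {
    // While the circuit is open this throws BrokenCircuitException immediately.
    await _publishBreaker.ExecuteAsync(publish);
  }
  catch (BrokenCircuitException)
  {
    // Park the event for systematic replay once the dependency recovers
    // (hypothetical event store API; see the event store discussion above).
    await _eventStore.SaveForReplayAsync(transactionEvent);
  }
}
------------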

The combination of event stores, resilient policies, and systematic event replay capabilities creates a distributed system that not only survives failures, but also recovers automatically, which is a critical requirement for multi-cloud architectures.

Event Ordering

In single-node systems, event ordering is trivial; events are processed in the order they occur. In distributed systems, especially across cloud providers with different network characteristics, event ordering becomes a complex coordination problem.

Consider this scenario: An on-premise system sends a "transaction" event to both AWS and Azure simultaneously. Due to network latency differences, Azure receives and processes the event first, performs fraud analysis, and sends the results to AWS, all before AWS receives the original creation event.

Figure 5: Event Ordering Scenario

The consequences are severe:

  • Risk management systems process events out of sequence
  • Fraud checks complete before transaction validation
  • Regulatory reports contain inconsistent data
  • Financial audits fail due to temporal inconsistencies

Traditional solutions like distributed locks or consensus algorithms don't work well across cloud boundaries due to latency and availability trade-offs. Instead, successful multi-cloud architectures use a multi-layered approach.

At the publisher level, whichever service creates the transaction is responsible for assigning every outgoing event a strictly increasing sequence number.

If you're already using account-based partitioning, that also helps because you have similar transactions going through similar services, giving you inherent message ordering internally.

At the subscriber level, you need verification: each subscriber must check that events arrive in the expected sequence and defer processing when they do not. In our example, where AWS receives the message from Azure earlier than the original from on-premise, it needs to defer processing message 2 until it has received and finished processing message 1, as sketched below.
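Here is a minimal sketch of that subscriber-side check, assuming each event carries an AccountId and a publisher-assigned SequenceNumber; the in-memory dictionaries are illustrative, and a production implementation would persist this state and handle concurrent consumers.

---------------
// Illustrative sequence verification with deferred processing, per account.
private readonly ConcurrentDictionary<long, long> _nextExpected = new();
private readonly ConcurrentDictionary<long, SortedDictionary<long, TransactionCreatedEvent>> _deferred = new();

private async Task HandleInOrder(TransactionCreatedEvent evt)
{
  var expected = _nextExpected.GetOrAdd(evt.AccountId, 1);

  if (evt.SequenceNumber > expected)
  {
    // Arrived early (e.g., the enriched event from Azure beat the original
    // creation event from on-premise): defer it until the gap is filled.
    _deferred.GetOrAdd(evt.AccountId, _ => new SortedDictionary<long, TransactionCreatedEvent>())[evt.SequenceNumber] = evt;
    return;
  }

  if (evt.SequenceNumber < expected)
  {
    return; // already processed: stale delivery or duplicate
  }

  await ProcessTransactionEvent(evt);
  _nextExpected[evt.AccountId] = ++expected;

  // Drain any deferred events that are now in order.
  if (_deferred.TryGetValue(evt.AccountId, out var buffer))
  {
    while (buffer.TryGetValue(expected, out var next))
    {
      buffer.Remove(expected);
      await ProcessTransactionEvent(next);
      _nextExpected[evt.AccountId] = ++expected;
    }
  }
}
---------------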

Consistency in distributed systems is also really important in these scenarios. Do you want your system to be strongly consistent, or are you okay with eventual consistency? This choice comes with real compromises: stronger consistency guarantees typically mean slower performance (due to coordination overhead) and higher costs (more complex infrastructure, additional network calls, and resource usage). Eventual consistency can offer better performance and lower costs, but it requires your application to handle temporary inconsistencies. Different components may have different consistency requirements based on these trade-offs. The right choice depends on your specific application requirements and which trade-offs you're willing to accept.

Here is one such pattern.

Figure 6: Consistency Pattern

The key insight: Consistency isn't binary, it is on a spectrum. For example, Azure Cosmos DB has five consistency levels, a broad spectrum that depends on your application requirements and how you want to handle them in a distributed setup.
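For example, with the Azure Cosmos DB .NET SDK the level is a single client option; the endpoint and key below are placeholders, and note that the SDK only lets you relax, not strengthen, the account-level default.

-----------
// Illustrative: selecting a consistency level with the Azure Cosmos DB .NET SDK.
// Session consistency is a common middle ground: read-your-own-writes within a
// session without paying the cross-region latency cost of Strong consistency.
var cosmosClient = new CosmosClient(
  accountEndpoint: "<cosmos-endpoint>",    // placeholder
  authKeyOrResourceToken: "<cosmos-key>",  // placeholder
  clientOptions: new CosmosClientOptions
  {
    // Strong | BoundedStaleness | Session | ConsistentPrefix | Eventual
    ConsistencyLevel = ConsistencyLevel.Session
  });
-----------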

Duplicate Events

Network failures, retries, and cross-cloud communication inherently create duplicate events. While duplicate risk processing merely wastes resources, duplicate financial transactions create regulatory nightmares and audit failures.

The challenge multiplies in multi-cloud environments where different providers have different retry policies, timeout behaviors, and failure modes. A single business transaction might trigger multiple technical events across cloud boundaries, and each of those events is subject to duplication.

Figure 7: Duplicate Transaction Processing

Successful duplicate handling requires a four-layer defense strategy:

Figure 8: Handling duplicate events

First, start with your publisher: the code generating the event must assign it a unique identifier, for example by following the CloudEvents schema, an open specification for describing events consistently across providers.

Second, at the producer configuration level: if you're using Kafka as the message broker, its idempotent producer setting ensures that retries over the network do not result in duplicate events being written.

Third, at the subscriber level: handle duplicates in your implementation. When you receive an event, first check your processed-events table. If the transaction already exists there, it's a duplicate, so ignore it. If it doesn't, process the event and add a row to that table for future duplicate checks.

Lastly, ensure that the event handler itself is idempotent, meaning that re-running it for the same event has no adverse effect.

This defense-in-depth approach, sketched below, recognizes that duplicates are inevitable in distributed systems; the goal is handling them gracefully rather than preventing them entirely.
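Here is a minimal sketch of the second and third layers, assuming the Confluent.Kafka producer configuration and a hypothetical processed-events repository; the EventId property and the repository methods are illustrative rather than part of the article's code.

----------
// Layer 2: idempotent producer configuration (Confluent.Kafka).
var producerConfig = new ProducerConfig
{
  EnableIdempotence = true,  // the broker de-duplicates retried sends
  Acks = Acks.All            // required for idempotent delivery guarantees
};

// Layer 3: subscriber-side duplicate check against a processed-events table.
private async Task HandleOnce(TransactionCreatedEvent transactionEvent)
{
  // EventId is the publisher-assigned unique identifier (e.g., the CloudEvents "id").
  if (await _processedEvents.ExistsAsync(transactionEvent.EventId))  // hypothetical repository
  {
    _logger.LogInformation("Duplicate event {EventId} ignored", transactionEvent.EventId);
    return;
  }

  // Layer 4: the handler itself must be idempotent, so a crash between these two
  // calls (processed but not yet recorded) cannot corrupt state on redelivery.
  await ProcessTransactionEvent(transactionEvent);

  await _processedEvents.AddAsync(transactionEvent.EventId);
}
----------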

Additional Multi-Cloud Considerations

Let's also look at some additional considerations that become very important when talking about multi-cloud.

Security and Compliance: The Expanding Attack Surface

Multi-cloud environments exponentially increase attack surfaces. Each cloud provider has different security models, IAM systems, network policies, and compliance frameworks. A security vulnerability in cross-cloud communication can compromise your entire distributed system.

Schema Evolution: Change Without Breaking

Event-driven architectures inevitably require schema changes. In multi-cloud environments, schema evolution becomes more complex because different components may be deployed independently across different providers with different release cycles.
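One common mitigation is to version event contracts and keep changes additive, so a consumer in one cloud that has not yet been redeployed can safely ignore fields introduced by a newer producer in another. A minimal sketch follows; the added field, version property, and types are illustrative, not FinBank's actual contract.

----------
// Additive, versioned event contract: new fields are optional so existing
// subscribers using a tolerant reader keep working without a coordinated release.
public class TransactionCreatedEvent
{
  public int SchemaVersion { get; set; } = 2;  // bumped whenever the contract changes

  public string TransactionId { get; set; }
  public long AccountId { get; set; }
  public decimal Amount { get; set; }
  public DateTime Timestamp { get; set; }

  // Added in v2; optional so v1 producers and consumers remain compatible.
  public string MerchantCategoryCode { get; set; }
}
----------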

Observability, Logging, and Distributed Tracing

We need to follow the whole event journey across all clouds, which means having good observability and distributed tracing across these environments. There are cloud-native platforms designed to handle the complexities of multi-cloud environments; make sure you identify the right platform for observability across your different distributed components.
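A minimal sketch of propagating a W3C trace context across cloud boundaries using .NET's built-in ActivitySource follows; an OpenTelemetry exporter or vendor agent would ship these spans to your chosen platform, and the source name, topic, and producer overload are illustrative assumptions.

----------
// Start a producer span and carry its W3C traceparent in the message headers,
// so subscribers in other clouds can attach their spans to the same trace.
private static readonly ActivitySource Source = new ActivitySource("FinBank.Transactions");

public async Task PublishWithTracingAsync(TransactionCreatedEvent transactionEvent)
{
  using var activity = Source.StartActivity("transaction.publish", ActivityKind.Producer);
  activity?.SetTag("transaction.id", transactionEvent.TransactionId);

  var headers = new Headers();
  if (activity != null)
  {
    // activity.Id is the W3C traceparent string when the default ID format is used (.NET 5+).
    headers.Add("traceparent", Encoding.UTF8.GetBytes(activity.Id));
  }

  // Hypothetical overload of the article's producer wrapper that accepts headers.
  await _kafkaProducer.ProduceAsync("transactions-0", transactionEvent, headers);
}
----------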

Cloud-Native vs. Cloud-Agnostic: The Eternal Trade-off

The tension between leveraging cloud-specific capabilities and maintaining portability requires careful architectural decisions. Cloud-native approaches offer better performance and deeper integration, while cloud-agnostic approaches provide flexibility and reduce vendor lock-in.

Actionable Insights: The DEPOSITS Framework

Successful multi-cloud event-driven architectures follow these principles:

Design for failure. Assume components will fail at the worst possible times. Build failure handling into every aspect of your system from day one.

Embrace event stores. Persistent event storage naturally addresses many distributed system challenges and enables powerful recovery scenarios.

Prioritize regular reviews. Continuously audit your architecture for optimization opportunities. Multi-cloud systems evolve rapidly, and yesterday's optimal configuration may be today's bottleneck.

Observability first. You cannot debug what you cannot see. Invest heavily in distributed tracing, metrics, and logging across all cloud boundaries.

Start small, scale gradually. Migrating entire architectures simultaneously is a recipe for disaster. Begin with isolated, well-understood workloads.

Invest in a robust event backbone. Your messaging infrastructure is the nervous system of your distributed architecture. Don't skimp on reliability, performance, or operational tooling.

Team education. Distributed systems require specialized knowledge. Cloud providers innovate rapidly, and your teams need continuous upskilling to keep pace.

Success. Following these principles leads to systems that thrive in multi-cloud environments rather than merely surviving them.

The Path Forward: Embracing Complexity

Multi-cloud event-driven architectures represent a fundamental shift in how we design and operate distributed systems. The challenges are real, complex, and often counterintuitive. But the alternative, avoiding multi-cloud, isn't viable in today's technology landscape.

The organizations that succeed are those that treat multi-cloud complexity as a design constraint rather than an operational afterthought. They invest in the right abstractions, build comprehensive failure handling from the beginning, and create teams with deep distributed systems expertise.

The question isn't whether multi-cloud is worth the complexity, it's whether you'll control the complexity before it controls you. The choice is yours, but the 3 AM phone calls won't wait for your decision.

Start with comprehensive observability, embrace failures as learning opportunities, and invest in both robust technical infrastructure and team capabilities. Most importantly, remember that successful multi-cloud architectures aren't built, they're evolved, tested under fire, and continuously refined.

The future belongs to systems that can thrive across cloud boundaries. The time to start building them is now.

For further details on the subject discussed in this article, you can watch the video presentation.

About the Author

Teena Idnani
