One Org or Many? The Postmortem Nobody Wants to Write

DEV Community

organizations.amazonaws.com events of type AttachPolicy and DetachPolicy and publishes to an SNS topic with subscriptions to the security Slack channel and PagerDuty. MTTD for this type of change dropped from ~31 minutes (the time it took to correlate during the incident) to under 2 minutes in post-implementation validation tests.
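As a rough illustration of the detection rule described above, the event pattern could look like the sketch below. The function names are hypothetical, and the small matches helper is only a local stand-in for EventBridge matching so the pattern can be sanity-checked offline. Organizations API calls are recorded by CloudTrail in us-east-1, so I would expect the rule to live in that region of the Management Account.

```python
def build_scp_change_pattern():
    """Event pattern for SCP attach/detach API calls recorded by CloudTrail."""
    return {
        "source": ["aws.organizations"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["organizations.amazonaws.com"],
            "eventName": ["AttachPolicy", "DetachPolicy"],
        },
    }


def matches(pattern, event):
    """Simplified local matcher (assumption: exact-value lists and nested dicts only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

The dictionary can be serialized with json.dumps and passed as the EventPattern of an EventBridge rule whose target is the SNS topic.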

Single-Org vs. Multi-Org: Real Trade-offs

Dimension	Single Organization	Multiple Organizations
SCP blast radius	Root affects 100% of accounts instantly	Isolated by Organization boundary; changes are independent
Operational complexity	Lower: single landing zone pipeline, single Control Tower	Higher: multiple pipelines, multiple Control Tower enrollments
Cost visibility	Native via Consolidated Billing	Requires cross-account CUR + Athena or AWS Cost Explorer linked accounts
Regulatory isolation (PCI, SOC 2)	Possible via OUs, but policy boundary is logical, not physical	Physical boundary between orgs; auditors accept more readily
Management Account compromise	One compromised account = potential access to entire organization	Blast radius limited to specific org; other orgs unaffected
SCP propagation latency	Seconds to a few minutes across all accounts	Same behavior within each org; orgs are independent

The Real Problem with SCPs: No Staging, No Automatic Rollback

One of the most important findings from the postmortem was that the team had treated SCPs like ordinary infrastructure code — with the same deployment pipeline as a security group or IAM role. This is a mental model error.

SCPs are access control policies with immediate propagation and no native progressive rollout or rollback mechanism. There is no aws organizations deploy-policy --canary 10%. When you run attach-policy against an OU with 200 accounts, all 200 accounts are affected at once. AWS Organizations has no concept of deployment rings for policies.

The practical implication is that an SCP change must be handled like a production database change: with a maintenance window, dual approval, and a tested rollback plan. On paper, the rollback plan for an SCP is simple: detach-policy. But if you do not know which policies were attached before, or if the change was composed of multiple operations, rollback may not be trivial.

What we implemented: an immutable state registry of SCPs per OU/Root, stored in S3 with versioning enabled and Object Lock in COMPLIANCE mode for 90 days. Before any attach-policy, the pipeline saves the current state. Automated rollback is a Lambda that reads the previous state from S3 and executes the inverse operations. Automated rollback time in tests was 45 seconds — compared to the 7 minutes it took to identify and manually execute during the actual incident.
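The core of that rollback logic can be sketched roughly as follows. This is a minimal sketch, not our actual Lambda: the function names are hypothetical, the S3 persistence of the snapshot is left out, and org_client is assumed to be a boto3 Organizations client (or any object exposing the same list_policies_for_target, attach_policy and detach_policy calls).

```python
def snapshot_scps(org_client, target_id):
    """Return the sorted IDs of the SCPs attached directly to target_id."""
    policy_ids = []
    kwargs = {"TargetId": target_id, "Filter": "SERVICE_CONTROL_POLICY"}
    while True:
        page = org_client.list_policies_for_target(**kwargs)
        policy_ids.extend(p["Id"] for p in page["Policies"])
        token = page.get("NextToken")
        if not token:
            return sorted(policy_ids)
        kwargs["NextToken"] = token


def inverse_operations(before, after):
    """Operations that turn the `after` attachment set back into `before`.

    Attaches come first because Organizations rejects detaching the last
    SCP from a target.
    """
    to_attach = sorted(set(before) - set(after))
    to_detach = sorted(set(after) - set(before))
    return [("attach", p) for p in to_attach] + [("detach", p) for p in to_detach]


def rollback(org_client, target_id, before):
    """Restore the saved attachment state and return the operations applied."""
    ops = inverse_operations(before, snapshot_scps(org_client, target_id))
    for action, policy_id in ops:
        if action == "attach":
            org_client.attach_policy(PolicyId=policy_id, TargetId=target_id)
        else:
            org_client.detach_policy(PolicyId=policy_id, TargetId=target_id)
    return ops
```

One caveat with the attach-first ordering: a target that is already at the per-target SCP attachment quota would need an interleaved order instead, so a real implementation should probably check for that case.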

A frequently overlooked detail: an explicit Deny in an SCP takes precedence over any Allow in identity policies, including roles with AdministratorAccess attached. Not even the member account's root user can execute actions blocked by an SCP unless the policy explicitly exempts it through a condition. Note that the root user's aws:PrincipalType value is Account, not Root, so exemptions are usually written against aws:PrincipalArn. In our incident, this is what made the situation so severe: there was no escape hatch in the payments account.
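To make the escape hatch concrete, here is a minimal sketch of a Deny statement that exempts a dedicated break-glass role through a condition on aws:PrincipalArn. The role name BreakGlassAdmin, the Sid and the denied action used below are hypothetical, and the size constant reflects the SCP quota as I understand it, so verify it against the current Organizations quotas.

```python
import json

# Assumption: SCP document size quota in characters; check the current AWS quota.
MAX_SCP_CHARS = 5120


def build_deny_scp(actions, exempt_role_arn_pattern):
    """Return SCP JSON that denies `actions` for every principal except the exempt role."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessBreakGlass",
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
                # The Deny does not apply when the caller's ARN matches this pattern.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": exempt_role_arn_pattern}
                },
            }
        ],
    }
    text = json.dumps(document, separators=(",", ":"))
    if len(text) > MAX_SCP_CHARS:
        raise ValueError(f"SCP is {len(text)} characters, above {MAX_SCP_CHARS}")
    return text
```

For example, build_deny_scp(["kms:ScheduleKeyDeletion"], "arn:aws:iam::*:role/BreakGlassAdmin") produces a policy under which only the break-glass role can still perform the denied action in the affected accounts, provided its identity policy allows it.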

FinOps in Multi-Org: The Argument That Overcomes Resistance

The most common argument against multiple Organizations is the loss of consolidated cost visibility. That argument was valid in 2018. In 2024, it is a solved problem — with some important caveats.

AWS Cost and Usage Report (CUR 2.0) can be configured for delivery to a centralized S3 bucket in a dedicated billing account, even across multiple Organizations, using a cross-account S3 bucket policy pattern with s3:PutObject allowed for the billingreports.amazonaws.com service principal from multiple Management Account IDs. Athena + AWS Glue Crawler over this data produces a unified cost view that the CFO can consume via QuickSight with row-level security per business unit.
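A minimal sketch of that bucket policy pattern is shown below. The bucket name and account IDs are placeholders, the Sid values and function name are hypothetical, and restricting delivery with aws:SourceAccount is my assumption about how to scope the grant. For CUR 2.0 delivered through Data Exports, the AWS documentation also lists the bcm-data-exports.amazonaws.com principal, so check the current requirements before relying on this shape.

```python
import re


def build_cur_bucket_policy(bucket, management_account_ids):
    """Bucket policy letting the billing service write reports for the listed accounts."""
    accounts = sorted(management_account_ids)
    for account_id in accounts:
        if not re.fullmatch(r"\d{12}", account_id):
            raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    principal = {"Service": "billingreports.amazonaws.com"}
    # Only requests made on behalf of these Management Accounts are allowed.
    condition = {"StringEquals": {"aws:SourceAccount": accounts}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCurBucketChecks",
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetBucketAcl", "s3:GetBucketPolicy"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": condition,
            },
            {
                "Sid": "AllowCurDelivery",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": condition,
            },
        ],
    }
```

Each Organization's report would then be configured with its own S3 prefix in the shared bucket, so the Glue Crawler can partition the data by source Organization.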

What is not natively solved: Reserved Instances and Savings Plans are not shared across Organizations. This is a real cost. In our analysis, the payments account used approximately $18k/month in Compute Savings Plans that, upon moving to a new Organization, could no longer be shared with tooling accounts in the original org. The solution was to consolidate Savings Plans in the new payments org and use On-Demand for tooling workloads with more variable usage. The cost delta was approximately $1.2k/month, which was accepted as the cost of regulatory isolation.

A pattern I recommend: use AWS Cost Categories with rules based on CostCenter and BusinessUnit tags applied via tag policies in both Organizations. This allows financial reporting to be agnostic to Organizations topology — the CFO sees by cost center, not by org.
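A sketch of what such rules might look like follows. The rule shape mirrors what I understand the Cost Explorer CreateCostCategoryDefinition API to expect (rule version CostCategoryExpression.v1); the cost center codes and business unit names are made up, and classify is only a local approximation of how the rules would be evaluated, useful for reviewing a mapping before applying it in each Organization.

```python
def build_cost_category_rules(cost_center_to_unit):
    """One rule per CostCenter tag value, mapping it to a business unit."""
    rules = []
    for cost_center, unit in sorted(cost_center_to_unit.items()):
        rules.append(
            {
                "Value": unit,
                "Rule": {
                    "Tags": {
                        "Key": "CostCenter",
                        "Values": [cost_center],
                        "MatchOptions": ["EQUALS"],
                    }
                },
            }
        )
    return rules


def classify(resource_tags, rules, default="Unallocated"):
    """Local approximation: the first rule whose tag value matches wins."""
    for rule in rules:
        tag = rule["Rule"]["Tags"]
        if resource_tags.get(tag["Key"]) in tag["Values"]:
            return rule["Value"]
    return default
```

Because the same rule list is applied in both Organizations, a resource tagged CostCenter=CC-100 lands in the same category regardless of which org its account belongs to.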

AWS Well-Architected: Affected Pillars

Security: SCPs must have a change management lifecycle equivalent to production database changes. Build an explicit escape hatch into critical SCPs, for example a condition on aws:PrincipalArn that exempts a break-glass principal (the root user's aws:PrincipalType value is Account, not Root). Implement EventBridge + SNS for immediate detection of policy attach/detach in the Management Account. Consider the Organization boundary as physical isolation for PCI-DSS and BACEN 4.893 regimes.
Reliability: Organizations design must minimize the blast radius of operational changes. Circuit breakers in downstream services (EKS/Resilience4j, Lambda with Dead Letter Queue) are necessary but insufficient: they mask the error without resolving the cause. Add specific health checks for AccessDeniedException with low-latency alarms (< 5 minutes MTTD). Implement automated SCP rollback with immutable state in S3 Object Lock.
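The AccessDeniedException alarm could be wired roughly as in the sketch below, assuming CloudTrail is already delivered to a CloudWatch Logs group. The filter name, alarm name, metric namespace and threshold are all hypothetical, and the two functions only build the keyword arguments for the CloudWatch Logs put_metric_filter and CloudWatch put_metric_alarm calls, so the values can be reviewed before anything is created.

```python
# Filter pattern for CloudTrail records whose errorCode indicates a denied call.
ACCESS_DENIED_PATTERN = (
    '{ ($.errorCode = "AccessDenied*") || ($.errorCode = "*UnauthorizedOperation") }'
)


def metric_filter_kwargs(log_group, namespace="Security/SCP", metric="AccessDeniedCount"):
    """Arguments for logs.put_metric_filter: count each denied API call as 1."""
    return {
        "logGroupName": log_group,
        "filterName": "scp-access-denied",
        "filterPattern": ACCESS_DENIED_PATTERN,
        "metricTransformations": [
            {"metricName": metric, "metricNamespace": namespace, "metricValue": "1"}
        ],
    }


def alarm_kwargs(sns_topic_arn, threshold=10, namespace="Security/SCP",
                 metric="AccessDeniedCount"):
    """Arguments for cloudwatch.put_metric_alarm: fire on a one-minute spike."""
    return {
        "AlarmName": "scp-access-denied-spike",
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

CloudTrail delivery to CloudWatch Logs adds latency of its own, so it is worth measuring the end-to-end detection time against the target rather than assuming the alarm period is the whole story.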

Anti-Patterns That Led to the Incident

Treating SCPs as ordinary infrastructure in the CI/CD pipeline without differentiated approval by target level (Root, production OU, sandbox OU)
Using a single Organization to consolidate cost governance without evaluating the blast radius of security policies on regulated workloads
Configuring 5xx alarms on API Gateway as the only detection signal without specific alarms for AccessDeniedException in CloudTrail
Assuming that circuit breakers in downstream services substitute for architectural isolation — they are complementary, not equivalent
Failing to map cross-region dependencies of accounts before applying SCPs with aws:RequestedRegion conditions
Relying on OUs as regulatory isolation boundaries for PCI-DSS auditors without explicitly documenting that the boundary is logical, not physical
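The first anti-pattern, undifferentiated approval, can be addressed with a small gate in the pipeline. The sketch below is one possible shape, not what we run: the approval counts and the production_ou_ids set are hypothetical, while the ID formats follow the patterns documented for the Organizations API.

```python
import re

# Hypothetical policy: how many distinct approvers each target level requires.
REQUIRED_APPROVALS = {
    "root": 2,
    "production_ou": 2,
    "non_production_ou": 1,
    "account": 1,
}


def target_level(target_id, production_ou_ids):
    """Classify an SCP attachment target by the format of its ID."""
    if re.fullmatch(r"r-[0-9a-z]{4,32}", target_id):
        return "root"
    if re.fullmatch(r"ou-[0-9a-z]{4,32}-[0-9a-z]{8,32}", target_id):
        if target_id in production_ou_ids:
            return "production_ou"
        return "non_production_ou"
    if re.fullmatch(r"\d{12}", target_id):
        return "account"
    raise ValueError(f"unrecognized target ID: {target_id!r}")


def change_allowed(target_id, approvers, production_ou_ids):
    """True when enough distinct people approved the change for this target level."""
    level = target_level(target_id, production_ou_ids)
    return len(set(approvers)) >= REQUIRED_APPROVALS[level]
```

A pipeline step would call change_allowed before any attach-policy and fail the run when it returns False, which keeps a sandbox OU change cheap while forcing dual approval on Root and production OUs.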

Architect's Note: After this incident, I started recommending a simple rule: if you have workloads with distinct regulatory regimes (PCI-DSS, SOC 2, BACEN) or with availability SLOs above 99.9%, they belong in separate Organizations — not separate OUs. The additional operational cost of multiple Organizations is real, but it is a predictable and manageable engineering cost; the cost of a 47-minute incident on a payments pipeline is not. The hardest lesson was realizing we treated SCPs as code when we should have treated them like production database schema changes: with staging, dual approval, tested rollback, and a maintenance window. That is not bureaucracy — it is reliability engineering applied to the control plane.

Verdict: When to Use One or Multiple Organizations

Use a single AWS Organization when: all your workloads share the same regulatory regime, the same availability SLO, and the platform team has capacity to implement rigorous guardrails in the SCP change pipeline. Use multiple Organizations when: you have distinct regulatory regimes (especially PCI-DSS or BACEN 4.893), SLOs above 99.9% on critical workloads, or when auditors require evidence of physical policy isolation. The cost of non-shared Savings Plans is quantifiable and generally lower than the cost of a single incident caused by unrestricted blast radius. The decision is not about operational simplicity — it is about where you accept that inevitable human error has consequences.

Originally published at fernando.moretes.com. By Fernando F. Azevedo — Senior Solutions Architect.