How Dropbox Created a Distributed Async Task Framework at Scale
Nov 23, 2020 2 min read
Engineers at Dropbox recently published how they designed a distributed async task framework (ATF) that can handle tens of thousands of async tasks scheduled per second.
Dropbox's engineers needed an infrastructure system that would allow them to power all task scheduling requirements in their system, "from dragging a file into a Dropbox folder to scheduling a marketing campaign." They decided to design an in-house system since there were no open-source or off-the-shelf solutions that could meet their scaling needs.
A year after its introduction, ATF currently handles 9,000 async tasks scheduled per second and is used by 28 engineering teams internally at Dropbox.
Arun Sai Krishnan, a software engineer at Dropbox, told InfoQ:
"We had two in-house systems that were replaced by ATF. One was a system for immediate task execution built on top of Cape, and another was a system used for delayed task execution. ATF replaced these to get better features (like control over a dedicated execution layer) and system guarantees (like isolation across lambdas). In addition, consolidating these systems, which performed very similar roles, reduced our maintenance overhead."
ATF lets developers define callbacks and schedule tasks that execute against those pre-defined callbacks. It invokes callbacks at least once while guaranteeing that the same task instance never runs more than once concurrently. The engineers chose this behavior so that users do not have to design their callbacks around concurrency and possible race conditions. ATF is also highly available, with 99.9% availability for receiving scheduling requests.
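One common way to combine at-least-once invocation with "never two concurrent runs of the same task" is a time-bounded lease on each task record. The sketch below is illustrative only and is not taken from Dropbox's implementation: `LeaseStore`, `try_claim`, the state names, and the 30-second lease are all hypothetical, and the sketch assumes a worker either finishes or stops working before its lease expires.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    state: str = "enqueued"        # hypothetical states: enqueued, claimed, succeeded
    lease_expires_at: float = 0.0  # claim is only valid until this time
    attempts: int = 0              # how many times the task was handed out


class LeaseStore:
    """In-memory stand-in for a task store (hypothetical, not ATF's API)."""

    def __init__(self, lease_seconds: float = 30.0):
        self._lock = threading.Lock()
        self._tasks = {}
        self._lease_seconds = lease_seconds

    def add(self, task_id: str) -> None:
        with self._lock:
            self._tasks.setdefault(task_id, TaskRecord(task_id))

    def try_claim(self, task_id: str, now=None) -> bool:
        """Return True if the caller may run the task right now."""
        now = time.monotonic() if now is None else now
        with self._lock:
            task = self._tasks[task_id]
            if task.state == "succeeded":
                return False  # finished tasks are never handed out again
            if task.state == "claimed" and now < task.lease_expires_at:
                return False  # another worker holds a live lease
            # Either never claimed, or the previous lease expired: hand it
            # out (again). Re-delivery after expiry is what makes this
            # at-least-once rather than exactly-once.
            task.state = "claimed"
            task.lease_expires_at = now + self._lease_seconds
            task.attempts += 1
            return True

    def complete(self, task_id: str) -> None:
        with self._lock:
            self._tasks[task_id].state = "succeeded"

    def attempts(self, task_id: str) -> int:
        with self._lock:
            return self._tasks[task_id].attempts
```

With a 10-second lease, a claim at `now=0.0` succeeds, a second claim at `now=5.0` is refused, and a claim at `now=11.0` succeeds again because the first lease has lapsed. The callback may therefore run twice, but the two runs do not overlap as long as the stated assumption holds.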
The following diagram describes ATF's architecture.
Source: https://dropbox.tech/infrastructure/asynchronous-task-scheduling-at-dropbox
This diagram depicts the following process:
- An ATF Frontend service receives requests to schedule tasks via gRPC.
- The Frontend service registers the task in the task store, which is implemented by Dropbox's in-house Edgestore data store.
- A Store Consumer service periodically polls the task store. It pushes tasks ready for execution into a queue.
- The queue is implemented using AWS Simple Queue Service (SQS). Each callback and priority pair gets a dedicated queue. This setup allows for prioritizing more important tasks per registered callback.
- Controllers pull tasks for callbacks for which they are in charge and store them on local queues. This layer also prioritizes the work for the executors.
- Multiple executors per controller then poll the controller for work in a message processing loop and execute the required tasks.
- Both controllers and executors update the HeartBeat & Status Controller (HSC) regarding their progress. The HSC, in turn, updates the task store with each task's state, thus allowing clients to query various tasks' progress.
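The flow above can be sketched in a few dozen lines of Python. This is a toy, single-process model under stated assumptions, not ATF's code: a dict stands in for Edgestore, in-memory deques stand in for the per-(callback, priority) SQS queues, and the names `Pipeline`, `poll_store`, `pull`, `report`, the state strings, and the "lower number means more important" priority convention are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    callback: str
    priority: int        # assumption: lower number = more important
    run_at: float        # earliest time the task may run
    state: str = "new"   # hypothetical states: new, enqueued, claimed, succeeded


class Pipeline:
    def __init__(self):
        self.task_store = {}              # stand-in for the task store
        self.queues = defaultdict(deque)  # one queue per (callback, priority)

    def schedule(self, task: Task) -> None:
        """Frontend: register the task in the task store."""
        self.task_store[task.task_id] = task

    def poll_store(self, now: float) -> int:
        """Store consumer: move tasks that are due into their queue."""
        moved = 0
        for task in self.task_store.values():
            if task.state == "new" and task.run_at <= now:
                task.state = "enqueued"
                self.queues[(task.callback, task.priority)].append(task)
                moved += 1
        return moved

    def pull(self, callback: str):
        """Controller: take the next task from the most important non-empty queue."""
        keys = sorted(k for k in self.queues if k[0] == callback and self.queues[k])
        if not keys:
            return None
        task = self.queues[keys[0]].popleft()
        task.state = "claimed"
        return task

    def report(self, task_id: str, state: str) -> None:
        """Heartbeat and status reporting: write the task's state back to the store."""
        self.task_store[task_id].state = state


def run_executor(pipeline: Pipeline, callback_name: str, callbacks: dict) -> list:
    """Executor loop: pull work, run the callback, report the result."""
    done = []
    while (task := pipeline.pull(callback_name)) is not None:
        callbacks[callback_name](task)
        pipeline.report(task.task_id, "succeeded")
        done.append(task.task_id)
    return done
```

Keeping one queue per (callback, priority) pair means a backlog of low-priority tasks for a callback cannot delay that callback's high-priority tasks, which is the property the dedicated queues are described as providing.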
Dropbox's engineers designed ATF as a self-serve framework for all developers at Dropbox. To support this, all worker clusters (controllers and executors) are owned and managed entirely by the callback owners. This separation of concerns makes ATF easier for its consumers to use and lets the ATF team focus on the core parts of the system.
Krishnan said the following regarding the process of getting feedback from consumers:
"Feedback was collected from the teams that would be using the system during the design process. It was also widely vetted during our design review process. In addition, we incorporated continuous feedback from our clients on issues such as maintenance costs, alerting, and host management."
Possible future extensions to ATF include periodic task execution, better support for task chaining, and support for dead-letter queues for misbehaving tasks.