
I am looking for advice on selecting or building a job placement algorithm.

In my company, we have a simple computing platform built on top of Kubernetes. Multiple clients send compute jobs to a single Kafka topic; multiple workers continuously poll the queue for new tasks, execute them, and go back to pick up the next task. Any worker can execute any job, so the system effectively has a single queue.

I need to modify this system to make it data/cache aware. Imagine that each job requires some data [D1, D2, ... Dn]; when this job lands on a worker [Wi], the worker first needs to retrieve this data and cache it locally before it can start execution. When a new job comes in requiring some data [Dx], I want it to be assigned to a "worker pool" whose workers have already cached [Dx].
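For illustration, the routing I have in mind could be expressed roughly like this (all names here are hypothetical, not part of the existing system):

```python
# Hypothetical sketch of data-aware routing: derive a stable queue/topic
# name from a job's dataset set, and let a worker check whether it can
# serve a job from its local cache. Illustrative only, not the real system.

def queue_for(datasets):
    """Derive a deterministic queue/topic name from a job's dataset set."""
    return "jobs." + "-".join(sorted(datasets))

class Worker:
    def __init__(self):
        self.cached = set()  # dataset IDs already cached locally

    def can_serve(self, datasets):
        # The worker can serve a job only if every required dataset is cached.
        return set(datasets) <= self.cached

job = {"id": 42, "datasets": ["D2", "D1"]}
topic = queue_for(job["datasets"])  # same name regardless of input order

w = Worker()
w.cached = {"D1", "D2", "D3"}
assert w.can_serve(job["datasets"])
```

The point of deriving the name deterministically is that clients and workers can agree on a topic without any coordination beyond the dataset IDs themselves.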

I am free to modify the architecture to implement the new functionality; for example, we can replace Kafka, use multiple topics, or introduce lookup tables.

Requirements:

  • The number of datasets is not known in advance; new datasets arrive at runtime.
  • Assume that a worker can store/cache an unlimited number of datasets.
  • It has to be a pull-based mechanism, where the worker itself selects a queue/topic to get its next task from.
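One standard way to handle the first requirement (datasets appearing at runtime) is to map each dataset ID to a worker pool deterministically, e.g. by hashing. A minimal sketch, assuming a fixed set of pool names purely for the example:

```python
# Illustrative sketch: hash a dataset ID to one of a fixed set of worker
# pools, so a dataset first seen at runtime still gets a deterministic
# pool with no central coordination. Pool names are assumptions.
import hashlib

POOLS = ["pool-a", "pool-b", "pool-c"]

def pool_for(dataset_id: str) -> str:
    # Stable hash (unlike Python's built-in hash(), which is randomized
    # per process), so every client/worker computes the same pool.
    h = int(hashlib.sha256(dataset_id.encode()).hexdigest(), 16)
    return POOLS[h % len(POOLS)]
```

A real deployment would likely use consistent hashing instead of plain modulo, so that adding or removing a pool reshuffles only a fraction of the datasets.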

Perhaps an algorithm like this already exists. I would appreciate any direction.

rzickler
asked Jun 28, 2024 at 13:17
  • I think @IvovanderVeeken has the best approach. Use a distributed cache, and then your system doesn't need to care which worker processes the job. Commented Jun 28, 2024 at 13:57
  • A distributed cache is simple, but there are cases like the one described here where the data to be cached is large (>GB) and needs fast, low-latency access, such that a distributed cache would impose excessive bandwidth requirements. Commented Jun 28, 2024 at 14:14
  • The normal Kafka architecture would be multiple partitions per topic (so that you can have multiple consumers), and to give each message a partition key. Here, you might want the data ID as the key. Unfortunately, you need multiple items of data (so you can't have a single partition key), and it seems you don't really want to utilize partition-based concurrency. A better solution may be a shared cache (k8s volume) between the workers – storage is probably virtualized anyway, so there's no latency difference between my cache and another worker's cache. You might want a network file system like Ceph. Commented Jun 29, 2024 at 17:49
  • Thank you for all of your suggestions. I have considered a shared cache (and it might be the final solution if I don't find anything better), but it might not work well when sharing large datasets (as mentioned above). A shared k8s volume works well within the scope of a node, but we have hundreds of nodes, so this only reduces the problem by the order of Pods per node. Commented Jun 30, 2024 at 19:46
  • These are my naive thoughts: (1) the set of datasets required per task [Dk, Dj, Dm] forms a unique Kafka topic; (2) if a client doesn't see a topic, it creates it; (3) we have a fast lookup of existing topics in some cache (or use the Kafka API); (4) a worker, based on the data it has, knows which topics it can consume and chooses the one with the largest number of tasks; (5) the worker consumes tasks until the queue is empty; (6) then it uses the lookup to select another topic by the same logic. Feels a bit complicated, but I can't find a better option so far. Commented Jun 30, 2024 at 19:48
