We run jobs on a Spark cluster against large files, with job parameters supplied from a web page. After processing, we want to display the results, which are written to text files using

rdd.saveAsTextFile(path) 

All output folders share a common root derived from the session ID: the folder name is random but linked to the user's session ID.

What is a good way to keep track of pointers to the different files and send pages of results back to the front end?

That is, we want a list of the files so we can send the results back to a monitoring (summary) page and a detail page showing the contents of the files.

asked Nov 14, 2016 at 15:19
  • Do you want to show a list of those files? Commented Sep 6, 2018 at 18:36
  • Both: a list of files, and sending the results back to a monitoring (summary) page and a detail page showing the contents of the files. Thank you @Łukasz-gawron Commented Sep 7, 2018 at 8:46
  • 1. Where are those files stored? Commented Sep 7, 2018 at 19:24
  • 2. The root dir is the session ID. What does the hierarchy of those output folders look like? Commented Sep 7, 2018 at 19:32

1 Answer

Without getting into premature optimization, consider the following design principles:

  1. Convention. It seems you have already chosen predictable path names in HDFS (based on a user session ID). You can extend this to predictable paths for each job. If the jobs are initiated by a web application, that web app can generate whatever name or ID is associated with the job and build the HDFS path for the Spark job output in a consistent, predictable fashion.
  2. Authority. Every data element should have exactly one authoritative home, no matter how many copies of its values are scattered around the architecture. In your example, it seems proper for the web app to be authoritative on user session IDs and job IDs, and for HDFS to be authoritative on which files are present in a directory and what their contents are. The web app must therefore keep a record of the job IDs associated with each user session, and query HDFS (following the predictable path convention) to get the list of output files and their contents.
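
The two principles above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the root `/jobs/output`, the `<root>/<session_id>/<job_id>` layout, and the helper names `job_output_path`, `list_part_files` and `read_page` are all hypothetical. To stay runnable without a cluster it reads a local directory through `pathlib`; against real HDFS you would swap the listing and reading calls for whatever HDFS client you use. It also assumes the default output layout, in which `saveAsTextFile` typically writes `part-*` files plus a `_SUCCESS` marker once the job has committed.

```python
from itertools import islice
from pathlib import Path, PurePosixPath

# Hypothetical root under which all job output lives (an assumption).
OUTPUT_ROOT = PurePosixPath("/jobs/output")


def job_output_path(session_id, job_id, root=OUTPUT_ROOT):
    """Convention: the web app derives the output path from IDs it owns."""
    return root / session_id / job_id


def list_part_files(output_dir):
    """Authority: ask the file system which output files exist.

    Returns an empty list until the _SUCCESS marker is present, so a
    half-written job is not shown to the user.
    """
    directory = Path(output_dir)
    if not (directory / "_SUCCESS").exists():
        return []
    return sorted(p for p in directory.glob("part-*") if p.is_file())


def iter_lines(part_files):
    """Lazily yield lines across all part files, in part order."""
    for part in part_files:
        with open(part, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


def read_page(part_files, page, page_size=100):
    """Return one zero-based page of lines for the detail view."""
    start = page * page_size
    return list(islice(iter_lines(part_files), start, start + page_size))
```

With this split, the summary page only needs the result of `list_part_files` for each job ID the web app has recorded for the session, and the detail page calls `read_page` with the page number requested by the front end. Nothing about the files themselves is stored in the web app, so there is no second copy to fall out of sync.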
answered Nov 3, 2018 at 13:56
