Spark output back to web page

Question 1

We run jobs on a Spark cluster against large files, with the job parameters coming from a web page. After processing, we want to display the results, which are written to text files using

rdd.saveAsTextFile(path)

The output folders share a common root derived from the session ID: the folder name is random but linked to the user's session ID.

What is a good way to keep track of pointers to the different output files and send pages of results back to the front end?

That is, we want to build a list of the files and send the results back to a monitoring (summary) page and a detail page showing the contents of the files.

Question 2

Do you want to show a list of those files?

Question 3

Both: a list of the files, and the results sent back to a monitoring (summary) page and a detail page showing the contents of the files. Thank you @Łukasz-gawron

Question 4

1. Where are those files stored?

Question 5

2. The root directory is the session ID. What does the hierarchy of the output folders under it look like?

Answer

Without getting into premature optimization, consider the following design principles:

Convention. It seems like you already made the choice to have predictable path names in HDFS (based on a user session ID). You can extend this to have predictable paths for each job. If the jobs are initiated by a web application, then that web app can generate whatever name or ID is associated with the job, and create the HDFS path for the Spark job output in a consistent and predictable fashion.
Authority. Every data element should have exactly one authoritative home, no matter how many copies of its values are scattered around the architecture. In your example, it seems proper for the web app to be authoritative on user session IDs and job IDs, and for HDFS to be authoritative on what files are present in a directory and what their contents are. So your web app then must maintain job IDs associated with a user session somewhere, and query HDFS (following the predictable path convention) to get a list of output files and their contents.
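
The two principles above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the root `/spark-output`, the `JobRegistry` class, and the `list_dir` callable are all hypothetical names that do not come from your setup. The sketch relies on the default Hadoop output behavior, where `saveAsTextFile` writes `part-NNNNN` files into the target directory and a `_SUCCESS` marker once the job has committed.

```python
import posixpath
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical HDFS root for all job output; adjust to your cluster layout.
OUTPUT_ROOT = "/spark-output"

# Only allow simple IDs so a session or job ID can never escape the root.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def job_output_path(session_id: str, job_id: str, root: str = OUTPUT_ROOT) -> str:
    """Convention: the output path is derived purely from the two IDs."""
    for component in (session_id, job_id):
        if not _SAFE_ID.fullmatch(component):
            raise ValueError(f"unsafe path component: {component!r}")
    return posixpath.join(root, session_id, job_id)


class JobRegistry:
    """Authority: the web app owns the session -> job IDs mapping.

    In-memory stand-in for whatever store the web app really uses
    (assumption: a database or session store in practice).
    """

    def __init__(self) -> None:
        self._jobs: Dict[str, List[str]] = defaultdict(list)

    def register(self, session_id: str, job_id: str) -> str:
        """Record the job and return the path to pass to saveAsTextFile."""
        path = job_output_path(session_id, job_id)
        self._jobs[session_id].append(job_id)
        return path

    def jobs_for(self, session_id: str) -> List[str]:
        return list(self._jobs.get(session_id, []))


def summarize(
    session_id: str,
    registry: JobRegistry,
    list_dir: Callable[[str], List[str]],
) -> Dict[str, dict]:
    """Build the data for the summary page by asking the file system.

    list_dir is a hypothetical callable returning the file names in a
    directory; HDFS stays the authority on which files exist.
    """
    summary = {}
    for job_id in registry.jobs_for(session_id):
        names = list_dir(job_output_path(session_id, job_id))
        summary[job_id] = {
            # _SUCCESS is written by default when the output is committed.
            "complete": "_SUCCESS" in names,
            "parts": sorted(n for n in names if n.startswith("part-")),
        }
    return summary
```

In a real deployment `list_dir` would be backed by an HDFS client of your choice (for example the WebHDFS `LISTSTATUS` operation), and the detail page would read the listed part files in order. Nothing about file names or contents is stored in the web app, so there is no second copy to keep in sync.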

Tajh Taylor · Accepted Answer · 2018-11-03 13:56:10Z
