Implement document processing system

Question 1

I have to implement a class with a method that takes a document:

struct TDocument {
 string url; // document identifier
 uint64_t pubDate;
 uint64_t fetchTime;
 string text;
 uint64_t firstFetchTime = 0; 
};
The method should return the document with its fields updated according to the following rules:

text and fetchTime must be the same as the document with max fetchTime (take into account only documents with the same URL)

pubDate must be the same as the pubDate in document with min fetchTime (take into account only documents with the same URL)

firstFetchTime must be the same as the fetchTime in the document with min fetchTime (take into account only documents with the same URL)

Example:
Input:
url="url1", pubDate=1, fetchTime=1, text="text1"
url="url1", pubDate=2, fetchTime=3, text="text2"

Output:
url="url1", pubDate=1, fetchTime=1, text="text1", firstFetchTime=1 // for the first input line return the same document with updated firstFetchTime
url="url1", pubDate=1, fetchTime=3, text="text2", firstFetchTime=1 // for the second input line pubDate is updated to 1 (the pubDate of the document with min fetchTime among documents with the same URL)

My idea is to store minFetchTime and maxFetchTime among documents with the same URL.

Could you please review my solution?
Are there similar tasks on LeetCode.com?
Code

struct TDocument {
    string url;
    uint64_t pubDate; // pubdate of the minFetchTime
    uint64_t fetchTime; // maxFetchTime
    string text; // maxFetchTime
    uint64_t firstFetchTime = 0; // minFetchTime
};
struct UtilityDocument {
    UtilityDocument(string url_, uint64_t pubDate_, string text_, uint64_t firstFetchTime_, uint64_t maxFetchTime_, uint64_t minFetchTime_)
        : url(url_), pubDate(pubDate_), text(text_), firstFetchTime(firstFetchTime_), maxFetchTime(maxFetchTime_), minFetchTime(minFetchTime_) {}
public:
    string url;
    uint64_t pubDate;
    string text;
    uint64_t firstFetchTime;
    uint64_t maxFetchTime;
    uint64_t minFetchTime;
};
class TProcessor {
public:
    std::shared_ptr<TDocument> Process(std::shared_ptr<TDocument> input) {
        if (docs_.contains(input->url)) {
            if (input->fetchTime > docs_[input->url]->maxFetchTime) {
                docs_[input->url]->maxFetchTime = input->fetchTime;
                docs_[input->url]->text = input->text;
                input->pubDate = docs_[input->url]->pubDate;
                input->firstFetchTime = docs_[input->url]->minFetchTime;
            }
            if (input->fetchTime < docs_[input->url]->minFetchTime) {
                docs_[input->url]->minFetchTime = input->fetchTime;
                docs_[input->url]->pubDate = input->pubDate;
                input->text = docs_[input->url]->text;
                input->fetchTime = docs_[input->url]->maxFetchTime;
            }
        } else {
            input->firstFetchTime = input->fetchTime;
            docs_[input->url] = shared_ptr<UtilityDocument>(new UtilityDocument(
                input->url,
                input->pubDate,
                input->text,
                input->firstFetchTime,
                input->fetchTime, // maxFetchTime
                input->fetchTime // minFetchTime
            ));
        }
        return input;
    };
private:
    unordered_map<string, std::shared_ptr<UtilityDocument>> docs_;
};

Answer 1

There should be some documentation, I think at least of TDocument and TProcessor.Process().

It looks tempting to have something modelling what's common to TDocument and whatever holds the info about all TDocuments sharing a URL for Process(). I don't think the latter is a (single) document, so I name it differently.
Without any methods in TDocument, the question of is-a seems moot,
but seeing no use of UtilityDocument.firstFetchTime, I'm unconvinced derivation helps.

As far as I remember, specifying public visibility is pointless in a struct unless a more restrictive specifier precedes it: struct members are public by default.

In Process(), all those repetitions of docs_[input->url] are too verbose for my taste.

As 37307554 points out, not setting the doc's pubDate and firstFetchTime looks flaky, I'll extend that to text and fetchTime.

// assumes <cstdint>, <memory>, <string>, <unordered_map> and using namespace std, as in the question
/*** a TDocument keeps URL, text, publication date and time fetched
 * and in addition the time this URL was first fetched _when Process()ed_ */
struct TDocument {
    string url; // document identifier
    uint64_t pubDate; // processed: of doc with min fetchTime
    uint64_t fetchTime; // processed: of doc with max fetchTime
    string text; // processed: of doc with max fetchTime
    uint64_t firstFetchTime = 0; // processed: of doc with min fetchTime
};
struct URLInfo { // for documents with same URL
    URLInfo(string url_, string text_,
            uint64_t pubDate_, uint64_t maxFetchTime_, uint64_t minFetchTime_)
        : url(url_), text(text_),
          pubDate(pubDate_), maxFetchTime(maxFetchTime_), minFetchTime(minFetchTime_)
    {}
    string url;
    string text;
    uint64_t pubDate;
    uint64_t maxFetchTime;
    uint64_t minFetchTime;
};
class TProcessor {
public:
    /** URL info instantiated or updated as necessary;
     * text and fetchTime are set to the URL's text & max fetchTime;
     * pubDate & firstFetchTime to the URL's pubDate & min fetchTime. */
    std::shared_ptr<TDocument> Process(std::shared_ptr<TDocument> input) {
        auto found = docs_.find(input->url);
        if (found != docs_.end()) {
            // found points at a (url, shared_ptr<URLInfo>) pair
            URLInfo& info = *found->second;
            // assert(info.minFetchTime <= info.maxFetchTime) ?
            if (input->fetchTime > info.maxFetchTime) {
                info.maxFetchTime = input->fetchTime;
                info.text = input->text;
            } else if (input->fetchTime < info.minFetchTime) {
                info.minFetchTime = input->fetchTime;
                info.pubDate = input->pubDate;
            } // XXX what if ==?
            input->pubDate = info.pubDate;
            input->fetchTime = info.maxFetchTime;
            input->text = info.text;
            input->firstFetchTime = info.minFetchTime;
        } else {
            input->firstFetchTime = input->fetchTime;
            docs_[input->url] = shared_ptr<URLInfo>(new URLInfo(
                input->url,
                input->text,
                input->pubDate,
                input->fetchTime, // maxFetchTime
                input->fetchTime // minFetchTime
            ));
        }
        return input;
    };
private:
    unordered_map<string, std::shared_ptr<URLInfo>> docs_;
};

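Using an else if when comparing the document's fetchTime with the URL's max/minFetchTime makes me suspect a gap in the specification: it does not say what should happen when fetchTime equals the stored minimum or maximum.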
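One way to make the equality case explicit is to pick a tie policy and write it down next to the comparisons. The sketch below is only an illustration under an assumed policy (on equal fetchTime, the values seen first are kept); the specification does not state this. UrlState and Update are hypothetical names, not part of the code above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical per-URL aggregate; not part of the reviewed code.
struct UrlState {
    std::string text;            // text of the document with max fetchTime
    std::uint64_t pubDate;       // pubDate of the document with min fetchTime
    std::uint64_t minFetchTime;
    std::uint64_t maxFetchTime;
};

// Assumed tie policy: on equal fetchTime the values seen first are kept.
// Using >= and <= instead would let the most recently processed document win.
inline void Update(UrlState& state, std::uint64_t fetchTime,
                   std::uint64_t pubDate, const std::string& text) {
    assert(state.minFetchTime <= state.maxFetchTime);
    if (fetchTime > state.maxFetchTime) {
        state.maxFetchTime = fetchTime;
        state.text = text;
    }
    if (fetchTime < state.minFetchTime) {
        state.minFetchTime = fetchTime;
        state.pubDate = pubDate;
    }
}
```

Because minFetchTime never exceeds maxFetchTime, at most one of the two branches fires per call, so two separate ifs behave like the else if here; what matters is that the choice between strict and non-strict comparison is a deliberate, documented decision.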

Caveat: I haven't coded C++ in earnest for decades, and no modern C++, obviously.

Answer 2

When a new document with the same URL but a different fetchTime is processed in your Process method, you correctly update maxFetchTime and minFetchTime as needed. However, the logic that updates pubDate and firstFetchTime in the input document could be aligned more explicitly with the rules. Specifically, pubDate should always reflect the pubDate of the document with the minimum fetchTime, and firstFetchTime should always be updated to reflect the minimum fetchTime.

You can optimize updating the input document with values from docs_ and vice versa by determining whether an update is necessary before making any changes. You can do this by first checking if the current document's fetchTime is indeed the new minimum or maximum for that URL before proceeding with updates.

Consider breaking down the Process method into smaller helper functions for handling specific tasks, such as updating the document for a new maximum fetchTime or a new minimum fetchTime.
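As a rough illustration of such a split, here is a minimal sketch. It passes documents by value instead of through shared_ptr to keep the example short, and the helper names UpdateNewest, UpdateOldest and Merge, as well as the UrlInfo struct, are hypothetical rather than taken from the code under review.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

struct TDocument {
    std::string url;
    std::uint64_t pubDate = 0;
    std::uint64_t fetchTime = 0;
    std::string text;
    std::uint64_t firstFetchTime = 0;
};

// Hypothetical per-URL aggregate.
struct UrlInfo {
    std::string text;           // from the document with max fetchTime
    std::uint64_t pubDate;      // from the document with min fetchTime
    std::uint64_t minFetchTime;
    std::uint64_t maxFetchTime;
};

class TProcessor {
public:
    TDocument Process(TDocument doc) {
        auto it = docs_.find(doc.url);
        if (it == docs_.end()) {
            // First document for this URL: it is both oldest and newest.
            it = docs_.emplace(doc.url, UrlInfo{doc.text, doc.pubDate,
                                                doc.fetchTime, doc.fetchTime}).first;
        } else {
            UpdateNewest(it->second, doc);
            UpdateOldest(it->second, doc);
        }
        return Merge(it->second, std::move(doc));
    }

private:
    // A later fetch replaces the stored text and max fetchTime.
    static void UpdateNewest(UrlInfo& info, const TDocument& doc) {
        if (doc.fetchTime > info.maxFetchTime) {
            info.maxFetchTime = doc.fetchTime;
            info.text = doc.text;
        }
    }
    // An earlier fetch replaces the stored pubDate and min fetchTime.
    static void UpdateOldest(UrlInfo& info, const TDocument& doc) {
        if (doc.fetchTime < info.minFetchTime) {
            info.minFetchTime = doc.fetchTime;
            info.pubDate = doc.pubDate;
        }
    }
    // Copies the aggregated values into the document that is returned.
    static TDocument Merge(const UrlInfo& info, TDocument doc) {
        doc.pubDate = info.pubDate;
        doc.fetchTime = info.maxFetchTime;
        doc.text = info.text;
        doc.firstFetchTime = info.minFetchTime;
        return doc;
    }

    std::unordered_map<std::string, UrlInfo> docs_;
};
```

With the aggregate updated first and the output always produced by Merge, every returned field comes from one place, which also avoids the case where only some fields of the input get overwritten.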

If and when TProcessor::Process might be called from multiple threads, consider thread safety for your docs_ map as the current implementation does not appear to be thread-safe.
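The simplest option is probably a single mutex around every access to the map. The sketch below shows the pattern on a deliberately reduced, hypothetical class (TFetchTimeTracker, tracking only the minimum fetchTime per URL); the same lock_guard would wrap the whole body of Process in the real class. Whether one coarse lock is adequate depends on contention, which the question does not describe.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical reduced example: tracks only the first (minimum) fetchTime per URL.
class TFetchTimeTracker {
public:
    // Records fetchTime for url and returns the smallest fetchTime seen so far.
    std::uint64_t RecordAndGetFirst(const std::string& url, std::uint64_t fetchTime) {
        std::lock_guard<std::mutex> lock(mutex_);  // held until the function returns
        auto it = first_.find(url);
        if (it == first_.end()) {
            first_.emplace(url, fetchTime);
            return fetchTime;
        }
        if (fetchTime < it->second) {
            it->second = fetchTime;
        }
        return it->second;
    }

private:
    std::mutex mutex_;  // guards first_
    std::unordered_map<std::string, std::uint64_t> first_;
};
```

The lookup, the update and the read of the result all happen under the same lock, so two threads processing the same URL cannot interleave between the find and the write.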

Consider adding error handling or validation for the input document, such as checking for null pointers or invalid data.
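A minimal sketch of such a check follows. ValidateInput is a hypothetical helper, and treating an empty url as invalid is an assumption; the task statement does not say which inputs count as invalid, nor whether throwing is the preferred way to report them.

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

struct TDocument {
    std::string url;
    std::uint64_t pubDate = 0;
    std::uint64_t fetchTime = 0;
    std::string text;
    std::uint64_t firstFetchTime = 0;
};

// Hypothetical helper: rejects inputs that Process could not handle.
// Throws std::invalid_argument; an empty url being invalid is an assumption.
inline void ValidateInput(const std::shared_ptr<TDocument>& input) {
    if (!input) {
        throw std::invalid_argument("Process: input document is null");
    }
    if (input->url.empty()) {
        throw std::invalid_argument("Process: input document has an empty url");
    }
}
```

Calling this at the top of Process keeps the dereferences of input that follow it safe.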

Accepted answer (the one beginning "There should be some documentation") by greybeard · 2024-01-31 12:35:13Z
