Software Engineering


spelling and wording improved, very long sentence split in two
Source Link
Doc Brown


In machine learning we have modules that perform operations on data in a sequential manner. The modules are generally the following:

  1. Data Collection Module: Takes raw data from a source (filesystem, database, ...) and feeds it into the pipeline. We can assume that the result of this module is an object of type "Dataset".

  2. Data Cleaning Module: Takes the Dataset object as input, checks the data for errors (mainly missing values), and outputs a new Dataset object with clean data.

  3. Data Preprocessing Module: Takes the new Dataset object, applies mathematical operations to the data it wraps (normalization, standardization, ...), and outputs a new Dataset object in a new format.

  4. Training Module: Responsible for training machine learning models using different algorithms; it can run one algorithm or several algorithms in multiple stages, so that the results can be compared and the best-performing model selected.

  5. Testing Module: Takes the trained and selected model and ensures it has low error rates by feeding it a sample of test data.
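
Just to make the data flow concrete, here is a minimal sketch of how each module could take a Dataset and hand a new one to the next stage. The Dataset shape, the stage functions and the stubbed file source are hypothetical illustrations, not something fixed by the question:

```python
# Minimal sketch: an immutable Dataset passed from stage to stage.
# All names here (Dataset, collect, clean, preprocess) are hypothetical.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Dataset:
    """Immutable container handed between pipeline stages."""
    records: list[dict[str, Any]]


def collect(source: str) -> Dataset:
    """Data Collection: read raw rows from some source (stubbed here)."""
    return Dataset(records=[{"value": 1.0}, {"value": None}, {"value": 3.5}])


def clean(ds: Dataset) -> Dataset:
    """Data Cleaning: drop rows with missing values, return a new Dataset."""
    return Dataset(records=[r for r in ds.records if r["value"] is not None])


def preprocess(ds: Dataset) -> Dataset:
    """Data Preprocessing: e.g. min-max normalization of the 'value' column."""
    values = [r["value"] for r in ds.records]
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) if hi != lo else 0.0 for v in values]
    return Dataset(records=[{"value": v} for v in scaled])


if __name__ == "__main__":
    print(preprocess(clean(collect("data.csv"))))  # Dataset(records=[{'value': 0.0}, {'value': 1.0}])
```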

As you can see, the pipeline we want to develop encompasses a set of objects, each of which performs a set of operations on our input data and passes the output to the next object in the sequence. So our data is forced to follow the path we specify by configuring our pipeline, with some variable objects that can change at any link of the chain.

The pipeline can be represented as follows:

raw_data ==> data_collection > data_cleaning > data_preprocessing > model_training > model_testing ==> model

These modules have some common operations, such as execute() and validate(). Imagine a single abstract class (let's say we call it IMLOperation) that works as an interface holding these common operations, from which all the operations in the pipeline are derived (sub-classed as Preprocess, Collect, ... objects). Do you think that this approach, together with the Iterator design pattern to ensure the order, is suitable for developing this pipeline? Or is the Strategy pattern, along with a client that provides a stack of ordered operations, a better solution for this pipeline architecture?
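
To make the two options concrete, here is a rough sketch of the single abstract class with execute() and validate(), plus a client that simply walks an ordered list of operations. The concrete subclasses, the Pipeline client and the toy data below are hypothetical illustrations, not part of the original question:

```python
# Sketch only: one abstract base class, interchangeable concrete stages,
# and a client that runs them in the order it was given.
from abc import ABC, abstractmethod
from typing import Any


class IMLOperation(ABC):
    """Common interface for every stage of the pipeline."""

    @abstractmethod
    def validate(self, data: Any) -> None:
        """Raise if the incoming data is not acceptable for this stage."""

    @abstractmethod
    def execute(self, data: Any) -> Any:
        """Transform the incoming data and return the result."""


class Collect(IMLOperation):
    def validate(self, data: Any) -> None:
        pass  # raw input, nothing to check yet

    def execute(self, data: Any) -> Any:
        return [1.0, None, 3.5]  # stub: pretend we read from `data` (a path)


class Clean(IMLOperation):
    def validate(self, data: Any) -> None:
        if not isinstance(data, list):
            raise TypeError("Clean expects a list of values")

    def execute(self, data: Any) -> Any:
        return [x for x in data if x is not None]


class Preprocess(IMLOperation):
    def validate(self, data: Any) -> None:
        if not data:
            raise ValueError("Preprocess expects a non-empty list")

    def execute(self, data: Any) -> Any:
        lo, hi = min(data), max(data)
        return [(x - lo) / (hi - lo) if hi != lo else 0.0 for x in data]


class Pipeline:
    """Client holding the ordered operations and running them in sequence."""

    def __init__(self, operations: list[IMLOperation]) -> None:
        self.operations = operations

    def run(self, data: Any) -> Any:
        for op in self.operations:  # plain iteration over the ordered stages
            op.validate(data)
            data = op.execute(data)
        return data


if __name__ == "__main__":
    pipeline = Pipeline([Collect(), Clean(), Preprocess()])
    print(pipeline.run("raw_data.csv"))  # [0.0, 1.0]
```

In this sketch the ordering lives entirely in the client, and each stage is an interchangeable strategy behind the same interface; an explicit Iterator class adds little in Python, since a plain list of operations is already iterable.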

edited title
Link
Cloo

Is Python suitable for Machine Learning pipeline design patterns?

added 381 characters in body
Source Link
Cloo


And is Python a suitable programming language for developing such patterns, given that it does not contain 'interfaces' and does not fully support OOP principles such as encapsulation and inheritance?
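
As a point of reference, Python's standard abc module can express an interface-like contract (instantiating the abstract class raises a TypeError), and properties plus underscore-prefixed attributes give a conventional form of encapsulation. A small sketch, with hypothetical class names:

```python
# Sketch of interface-like behaviour and conventional encapsulation in Python.
from abc import ABC, abstractmethod


class IMLOperation(ABC):
    @abstractmethod
    def execute(self, data):
        ...


class Clean(IMLOperation):
    def __init__(self):
        self._dropped = 0  # "private" by convention (leading underscore)

    @property
    def dropped(self):
        """Read-only view of internal state, a simple form of encapsulation."""
        return self._dropped

    def execute(self, data):
        cleaned = [x for x in data if x is not None]
        self._dropped = len(data) - len(cleaned)
        return cleaned


# IMLOperation() raises TypeError: abstract classes cannot be instantiated,
# which is the guarantee an explicit interface would give.
print(Clean().execute([1, None, 2]))  # [1, 2]
```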


Refocusing the question to something more specific.
Source Link
maple_shaft
Post Reopened by maple_shaft
Post Undeleted by maple_shaft
Post Deleted by Jörg W Mittag, gnat, Doc Brown
added 164 characters in body
Source Link
Cloo
Post Undeleted by Community Bot
Post Deleted by Community Bot
edited tags; edited title
Link
Cloo
Post Closed as "Needs more focus" by gnat, Doc Brown, Philip Kendall, Christophe, joshp
added 1535 characters in body
Source Link
Cloo
Source Link
Cloo