Software Engineering


spelling and wording improved, very long sentence split in two
Source Link
Doc Brown


In machine learning we have modules that perform operations on data in a sequential manner. The modules are generally the following:

  1. Data Collection Module: Takes raw data from a source (filesystem, database, ...) and feeds it into the pipeline. We can assume that the result of this module is an object of type "Dataset".

  2. Data Cleaning Module: Takes the Dataset object as input, checks the data for errors (mainly missing values), and outputs a new Dataset object with clean data.

  3. Data Preprocessing Module: Takes the new Dataset object, applies mathematical operations to the data it wraps (normalization, standardization, ...), and outputs a new Dataset object in a new format.

  4. Training Module: Responsible for training machine learning models using different algorithms; it can run one algorithm or several algorithms in multiple stages, so that the results can be compared and the best-performing model selected.

  5. Testing Module: Takes the trained and selected model and ensures it has low error rates by feeding it a sample of test data.
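
Just to make the data flow concrete, here is a minimal sketch of how each module could take a Dataset and hand a new one to the next stage. The Dataset shape, the stage functions and the stubbed file source are hypothetical illustrations, not something fixed by the question:

```python
# Minimal sketch: an immutable Dataset passed from stage to stage.
# All names here (Dataset, collect, clean, preprocess) are hypothetical.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Dataset:
    """Immutable container handed between pipeline stages."""
    records: list[dict[str, Any]]


def collect(source: str) -> Dataset:
    """Data Collection: read raw rows from some source (stubbed here)."""
    return Dataset(records=[{"value": 1.0}, {"value": None}, {"value": 3.5}])


def clean(ds: Dataset) -> Dataset:
    """Data Cleaning: drop rows with missing values, return a new Dataset."""
    return Dataset(records=[r for r in ds.records if r["value"] is not None])


def preprocess(ds: Dataset) -> Dataset:
    """Data Preprocessing: e.g. min-max normalization of the 'value' column."""
    values = [r["value"] for r in ds.records]
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) if hi != lo else 0.0 for v in values]
    return Dataset(records=[{"value": v} for v in scaled])


if __name__ == "__main__":
    print(preprocess(clean(collect("data.csv"))))  # Dataset(records=[{'value': 0.0}, {'value': 1.0}])
```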

As you can see, the pipeline we want to develop encompasses a set of objects, each of which performs a set of operations on our input data and passes the output to the next object in the sequence. So our data is forced to follow the path we specify by configuring our pipeline, with some variable objects that can change at any link of the chain.

The pipeline can be represented as follows:

raw_data ==> data_collection > data_cleaning > data_preprocessing > model_training > model_testing ==> model

These modules have some common operations, such as execute() and validate(). Imagine a single abstract class (let's say we call it IMLOperation) that works as an interface holding these common operations, from which all the operations in the pipeline are derived (sub-classed as Preprocess, Collect, ... objects). Do you think that this approach, together with the Iterator design pattern to ensure the order, is suitable for developing this pipeline? Or is the Strategy pattern, along with a client that provides a stack of ordered operations, a better solution for this pipeline architecture?
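
To make the two options concrete, here is a rough sketch of the single abstract class with execute() and validate(), plus a client that simply walks an ordered list of operations. The concrete subclasses, the Pipeline client and the toy data below are hypothetical illustrations, not part of the original question:

```python
# Sketch only: one abstract base class, interchangeable concrete stages,
# and a client that runs them in the order it was given.
from abc import ABC, abstractmethod
from typing import Any


class IMLOperation(ABC):
    """Common interface for every stage of the pipeline."""

    @abstractmethod
    def validate(self, data: Any) -> None:
        """Raise if the incoming data is not acceptable for this stage."""

    @abstractmethod
    def execute(self, data: Any) -> Any:
        """Transform the incoming data and return the result."""


class Collect(IMLOperation):
    def validate(self, data: Any) -> None:
        pass  # raw input, nothing to check yet

    def execute(self, data: Any) -> Any:
        return [1.0, None, 3.5]  # stub: pretend we read from `data` (a path)


class Clean(IMLOperation):
    def validate(self, data: Any) -> None:
        if not isinstance(data, list):
            raise TypeError("Clean expects a list of values")

    def execute(self, data: Any) -> Any:
        return [x for x in data if x is not None]


class Preprocess(IMLOperation):
    def validate(self, data: Any) -> None:
        if not data:
            raise ValueError("Preprocess expects a non-empty list")

    def execute(self, data: Any) -> Any:
        lo, hi = min(data), max(data)
        return [(x - lo) / (hi - lo) if hi != lo else 0.0 for x in data]


class Pipeline:
    """Client holding the ordered operations and running them in sequence."""

    def __init__(self, operations: list[IMLOperation]) -> None:
        self.operations = operations

    def run(self, data: Any) -> Any:
        for op in self.operations:  # plain iteration over the ordered stages
            op.validate(data)
            data = op.execute(data)
        return data


if __name__ == "__main__":
    pipeline = Pipeline([Collect(), Clean(), Preprocess()])
    print(pipeline.run("raw_data.csv"))  # [0.0, 1.0]
```

In this sketch the ordering lives entirely in the client, and each stage is an interchangeable strategy behind the same interface; an explicit Iterator class adds little in Python, since a plain list of operations is already iterable.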

edited title
Link
Cloo

Is Python suitable for Machine Learning pipeline design patterns?

added 381 characters in body
Source Link
Cloo


And is Python a suitable programming language for developing such patterns, given that it does not contain 'interfaces' and does not fully support OOP principles such as encapsulation and inheritance?
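
As a point of reference, Python's standard abc module can express an interface-like contract (instantiating the abstract class raises a TypeError), and properties plus underscore-prefixed attributes give a conventional form of encapsulation. A small sketch, with hypothetical class names:

```python
# Sketch of interface-like behaviour and conventional encapsulation in Python.
from abc import ABC, abstractmethod


class IMLOperation(ABC):
    @abstractmethod
    def execute(self, data):
        ...


class Clean(IMLOperation):
    def __init__(self):
        self._dropped = 0  # "private" by convention (leading underscore)

    @property
    def dropped(self):
        """Read-only view of internal state, a simple form of encapsulation."""
        return self._dropped

    def execute(self, data):
        cleaned = [x for x in data if x is not None]
        self._dropped = len(data) - len(cleaned)
        return cleaned


# IMLOperation() raises TypeError: abstract classes cannot be instantiated,
# which is the guarantee an explicit interface would give.
print(Clean().execute([1, None, 2]))  # [1, 2]
```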


Refocusing the question to something more specific.
Source Link
maple_shaft
Post Reopened by maple_shaft
Post Undeleted by maple_shaft
Post Deleted by Jörg W Mittag, gnat, Doc Brown
added 164 characters in body
Source Link
Cloo
Post Undeleted by Community Bot
Post Deleted by Community Bot
edited tags; edited title
Link
Cloo
Post Closed as "Needs more focus" by gnat, Doc Brown, Philip Kendall, Christophe, joshp
added 1535 characters in body
Source Link
Cloo
Source Link
Cloo