I have completed an ETL project to collect, parse and load files. I decided to do it in a clean OOP way using interfaces and abstract classes, but I have some questions below.
Sub Main()
    Dim collectionOfParsers As New List(Of EtlParser)
    Dim xmlparser1 As New XmlParser
    Dim xmlparser2 As New XmlParser
    Dim xmlparser3 As New XmlParser
    Dim txtparser1 As New TxtParser
    Dim txtparser2 As New TxtParser
    collectionOfParsers.Add(xmlparser1)
    collectionOfParsers.Add(xmlparser2)
    collectionOfParsers.Add(xmlparser3)
    collectionOfParsers.Add(txtparser1)
    collectionOfParsers.Add(txtparser2)
    For Each parser As EtlParser In collectionOfParsers
        parser.SaySomething()
        Dim canOpenFiles = TryCast(parser, ICanOpenFiles)
        If (canOpenFiles IsNot Nothing) Then
            canOpenFiles.OpenFiles()
        End If
        Dim canReadFiles = TryCast(parser, ICanReadFiles)
        If (canReadFiles IsNot Nothing) Then
            canReadFiles.Readfiles()
        End If
        Dim canTransFiles = TryCast(parser, ICanTransformFiles)
        If (canTransFiles IsNot Nothing) Then
            canTransFiles.TransformFile()
        End If
        Dim canSaveFiles = TryCast(parser, ICanSaveFiles)
        If (canSaveFiles IsNot Nothing) Then
            canSaveFiles.Savefiles()
        End If
    Next
End Sub
Public MustInherit Class Etl
End Class
Public MustInherit Class EtlParser : Inherits Etl
    Protected Sub CanParse()
        Console.WriteLine("Yes")
    End Sub
    Protected Overridable Sub SaySomething()
        Console.WriteLine("EtlParser say something")
    End Sub
    Protected MustOverride Sub CanParseFormat()
End Class
Public Interface ICanOpenFiles
    Sub OpenFiles()
End Interface
Public Interface ICanReadFiles
    Sub Readfiles()
End Interface
Public Interface ICanSaveFiles
    Sub Savefiles()
End Interface
Public Interface ICanTransformFiles
    Sub TransformFile()
End Interface
Public Class XmlParser : Inherits EtlParser
    Implements ICanOpenFiles, ICanReadFiles, ICanTransformFiles, ICanSaveFiles
    Public Sub OpenFiles() Implements ICanOpenFiles.OpenFiles
        Throw New NotImplementedException()
    End Sub
    Public Sub Readfiles() Implements ICanReadFiles.Readfiles
        Throw New NotImplementedException()
    End Sub
    Public Sub TransformFile() Implements ICanTransformFiles.TransformFile
        Throw New NotImplementedException()
    End Sub
    Public Sub Savefiles() Implements ICanSaveFiles.Savefiles
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub CanParseFormat()
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub SaySomething()
        'MyBase.SaySomething()
        Console.WriteLine("XmlParser say something")
    End Sub
End Class
Public Class CsvParser : Inherits EtlParser
    Implements ICanOpenFiles, ICanReadFiles, ICanTransformFiles, ICanSaveFiles
    Public Sub OpenFiles() Implements ICanOpenFiles.OpenFiles
        Throw New NotImplementedException()
    End Sub
    Public Sub Readfiles() Implements ICanReadFiles.Readfiles
        Throw New NotImplementedException()
    End Sub
    Public Sub TransformFile() Implements ICanTransformFiles.TransformFile
        Throw New NotImplementedException()
    End Sub
    Public Sub Savefiles() Implements ICanSaveFiles.Savefiles
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub CanParseFormat()
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub SaySomething()
        'MyBase.SaySomething()
        Console.WriteLine("CsvParser say something")
    End Sub
End Class
Q1: Once I collect the files from the network drive (this will be done by a Collector later on), should I make the XmlParser class handle many files or just one? If just one, then as you can see I already create many XmlParser instances (one instance per file). However, I am not sure: maybe I should have one XmlParser prepared for all files and call it just once?
Q2: In the For Each loop I declared the loop variable as the common type EtlParser so that I can pass different specific parsers (is that OK, by the way?). Can you explain how it is possible that a specific parser is still seen as its own type inside the loop? For instance, I passed an XmlParser and inside the loop I see it as an XmlParser as well. I thought that when a specific parser such as XmlParser is passed through a variable of its parent type EtlParser, it becomes an EtlParser and I have to cast it back to XmlParser inside the loop. I would like to understand that.
Q3: As far as I know, the purpose of interfaces is to "provide common functionality to unrelated classes". What is the real benefit in my example code, given that all of my specific parsers use the same interfaces in the end? All of them can open, read, transform and save...
Q4: As you can see, I have 3 specific parser classes: CsvParser, XmlParser, TxtParser, inheriting from their base class EtlParser. Wouldn't it be better to make one parser class and instead create interfaces IXml, ITxt, ICsv for it to implement? At the moment I think what I have is proper.
Q5: Why can't I call parser.SaySomething() in the Main method, even though inspecting the parser item shows exactly the correct type?
Q6: Do you have any other ideas or advice regarding my current code?
1 Answer
Q1: It takes nanoseconds to create an object and milliseconds to access a file; i.e. roughly one million times longer! Don't try to optimize things that will have absolutely no noticeable effect at the expense of clarity!
Q2: Since XmlParser has no methods specific to XmlParser (i.e. existing only in XmlParser), there is no advantage in casting the object to it. But since the base class EtlParser does not implement the interfaces, you must cast the object to these interfaces (which is what you are doing).
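To illustrate the point behind Q2: a variable declared as EtlParser only limits which members you can call through it; the object it refers to keeps the type it was created with. The following is a minimal sketch, not your actual code. The members are simplified to functions returning strings, and Describe and InMemoryParser are hypothetical names introduced only for this illustration.

```vb
Imports System
Imports System.Collections.Generic

Public MustInherit Class EtlParser
    Public Overridable Function Describe() As String
        Return "EtlParser"
    End Function
End Class

Public Interface ICanOpenFiles
    Function OpenFiles() As String
End Interface

Public Class XmlParser
    Inherits EtlParser
    Implements ICanOpenFiles

    Public Overrides Function Describe() As String
        Return "XmlParser"
    End Function

    Public Function OpenFiles() As String Implements ICanOpenFiles.OpenFiles
        Return "opening XML files"
    End Function
End Class

' Hypothetical parser that does not implement ICanOpenFiles
Public Class InMemoryParser
    Inherits EtlParser
End Class

Module RuntimeTypeDemo
    Sub Main()
        Dim parsers As New List(Of EtlParser) From {New XmlParser(), New InMemoryParser()}
        For Each parser As EtlParser In parsers
            ' The declared type is EtlParser, but GetType() reports the
            ' type the object was created with, and overrides are used.
            Console.WriteLine($"{parser.GetType().Name}: {parser.Describe()}")

            ' TryCast returns Nothing when the object does not implement the interface.
            Dim opener = TryCast(parser, ICanOpenFiles)
            If opener IsNot Nothing Then
                Console.WriteLine(opener.OpenFiles())
            End If
        Next
    End Sub
End Module
```

The XmlParser entry reports its own type name and uses its own override of Describe, while the InMemoryParser entry is skipped by the TryCast check because it does not implement ICanOpenFiles.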
Q3, Q4, Q6: This is one possible approach. I will suggest another one.
Q5: SaySomething() is Protected, which means that it is only visible within the class defining it and its descendants. Make it Public.
Criticism: Your interfaces make operations like opening files public. The caller then must know whether these operations are available and call them. But this is a technical implementation detail which should be kept private. A public interface should concentrate on the desired high-level logic, i.e. read data, transform data and maybe write data.
Suggestion: I would choose a more flexible approach allowing you to compose parsers from single components (like Lego bricks). Define this set of interfaces:
Public Interface IDataSource(Of T)
    Function Read() As IEnumerable(Of T)
End Interface
Public Interface ITransformer(Of TSource, TResult)
    Function Transform(ByVal source As IEnumerable(Of TSource)) As IEnumerable(Of TResult)
End Interface
Public Interface IDataSink(Of T)
    Sub Write(ByVal data As IEnumerable(Of T))
End Interface
The idea is to implement these interfaces by different classes. You would have one class for an XML-data-source, one for a file-data-source, one for a transformation, etc.
A data source can be a text file, an XML file, a database, or a dummy data source for test purposes. It is the data source's responsibility to open, read and close files, etc. You don't need separate interfaces for all these operations.
Note that file names and connection strings can be passed as constructor parameters and don't need to be specified in the interfaces.
Define classes serving as transport vehicles for single data records, like RawData, PreProcessedData, RefinedData, used as generic type arguments for the interfaces. You will probably choose names for these classes that are better suited to your specific problem.
You can even chain several transformations like this:
read >>(RawData)>> transform 1 >>(PreProcessedData)>> transform 2 >>(RefinedData)>> write
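As a rough sketch of such a chain, assuming hypothetical record shapes (the Line, Fields, Name and Total properties are made up for this illustration, as are the SplitTransformer and SumTransformer classes), two transformers can be nested so that the output type of the first matches the input type of the second:

```vb
Imports System
Imports System.Collections.Generic

Public Interface ITransformer(Of TSource, TResult)
    Function Transform(source As IEnumerable(Of TSource)) As IEnumerable(Of TResult)
End Interface

' Hypothetical record shapes, for illustration only
Public Class RawData
    Public Property Line As String
End Class

Public Class PreProcessedData
    Public Property Fields As String()
End Class

Public Class RefinedData
    Public Property Name As String
    Public Property Total As Integer
End Class

' Transform 1: split each raw line into fields
Public Class SplitTransformer
    Implements ITransformer(Of RawData, PreProcessedData)

    Public Iterator Function Transform(source As IEnumerable(Of RawData)) _
        As IEnumerable(Of PreProcessedData) _
        Implements ITransformer(Of RawData, PreProcessedData).Transform
        For Each record As RawData In source
            Yield New PreProcessedData With {.Fields = record.Line.Split(","c)}
        Next
    End Function
End Class

' Transform 2: keep the first field as name, sum the remaining numeric fields
Public Class SumTransformer
    Implements ITransformer(Of PreProcessedData, RefinedData)

    Public Iterator Function Transform(source As IEnumerable(Of PreProcessedData)) _
        As IEnumerable(Of RefinedData) _
        Implements ITransformer(Of PreProcessedData, RefinedData).Transform
        For Each record As PreProcessedData In source
            Dim sum As Integer = 0
            For i As Integer = 1 To record.Fields.Length - 1
                sum += Integer.Parse(record.Fields(i))
            Next
            Yield New RefinedData With {.Name = record.Fields(0), .Total = sum}
        Next
    End Function
End Class

Module ChainDemo
    Sub Main()
        Dim raw As New List(Of RawData) From {
            New RawData With {.Line = "Joe,3,4"},
            New RawData With {.Line = "Mike,6,2"}
        }
        ' read >> transform 1 >> transform 2 >> write (here: write to the console)
        Dim refined = New SumTransformer().Transform(New SplitTransformer().Transform(raw))
        For Each item As RefinedData In refined
            Console.WriteLine($"{item.Name},{item.Total}")
        Next
    End Sub
End Module
```

Because each stage only depends on IEnumerable(Of T), either transformer can be swapped for another one with the same type arguments without touching the rest of the chain.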
One advantage of this approach is that you can apply the same transformations to different types of data sources (having the same TSource) and store the result into different types of destinations (having the same TResult).
Note: Iterators (Visual Basic) will help you to implement these interfaces.
Let's make a very simple example. We have a CSV-File with a name column and two number columns. We want to transform this file into another one containing the name column and one number column containing the sum of the two numbers.
Input file:
Joe,3,4
Mike,6,2
Sue,10,3
Expected output file:
Joe,7
Mike,8
Sue,13
We need two data classes
Public Class InputData
    Public Property Name As String
    Public Property X As Integer
    Public Property Y As Integer
End Class
Public Class OutputData
    Public Property Name As String
    Public Property Sum As Integer
End Class
A reader
Imports System.IO ' needed for File (goes at the top of the source file)

Public Class ExampleCsvReader
    Implements IDataSource(Of InputData)
    Private m_filename As String
    Public Sub New(ByVal filename As String)
        m_filename = filename
    End Sub
    Public Iterator Function Read() As IEnumerable(Of InputData) _
        Implements IDataSource(Of InputData).Read
        ' File.ReadLines reads the file lazily, one line at a time
        For Each line As String In File.ReadLines(m_filename)
            Dim parts = line.Split(","c)
            ' Skip lines that do not have exactly three columns
            If parts.Length = 3 Then
                Yield New InputData With {.Name = parts(0), _
                    .X = CInt(parts(1)), .Y = CInt(parts(2))}
            End If
        Next
    End Function
End Class
A transformer
Public Class ExampleTransformer
    Implements ITransformer(Of InputData, OutputData)
    Public Iterator Function Transform(source As IEnumerable(Of InputData)) _
        As IEnumerable(Of OutputData) _
        Implements ITransformer(Of InputData, OutputData).Transform
        For Each record As InputData In source
            Yield New OutputData With {.Name = record.Name, .Sum = record.X + record.Y}
        Next
    End Function
End Class
A writer
Imports System.IO ' needed for File and StreamWriter (goes at the top of the source file)

Public Class ExampleCsvWriter
    Implements IDataSink(Of OutputData)
    Private m_filename As String
    Public Sub New(ByVal filename As String)
        m_filename = filename
    End Sub
    Public Sub Write(data As IEnumerable(Of OutputData)) _
        Implements IDataSink(Of OutputData).Write
        ' The Using block closes the file even if an exception occurs
        Using sw As StreamWriter = File.CreateText(m_filename)
            For Each record As OutputData In data
                sw.WriteLine($"{record.Name},{record.Sum}")
            Next
        End Using
    End Sub
End Class
And finally we can stitch the parts together
Dim reader = New ExampleCsvReader(inputFile)
Dim transformer = New ExampleTransformer()
Dim writer = New ExampleCsvWriter(outputFile)
Dim inputData = reader.Read()
Dim outputData = transformer.Transform(inputData)
writer.Write(outputData)
Generic solution: This approach also lets you realize a more generic solution. You are free to create generic readers that, for instance, return data in a dictionary. The data type could be a Dictionary(Of String, Object), storing property name/value pairs. A reader could then implement an IDataSource(Of Dictionary(Of String, Object)).
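A possible shape for such a generic reader is sketched below. GenericCsvReader and its columnNames constructor parameter are hypothetical; the sketch assumes a simple comma-separated file without quoting or a header row, and it stores every value as a String boxed in Object.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Public Interface IDataSource(Of T)
    Function Read() As IEnumerable(Of T)
End Interface

' Hypothetical generic reader: the column names are supplied by the caller,
' so one class can serve CSV files with different structures.
Public Class GenericCsvReader
    Implements IDataSource(Of Dictionary(Of String, Object))

    Private ReadOnly m_filename As String
    Private ReadOnly m_columnNames As String()

    Public Sub New(filename As String, ParamArray columnNames As String())
        m_filename = filename
        m_columnNames = columnNames
    End Sub

    Public Iterator Function Read() As IEnumerable(Of Dictionary(Of String, Object)) _
        Implements IDataSource(Of Dictionary(Of String, Object)).Read
        For Each line As String In File.ReadLines(m_filename)
            Dim parts = line.Split(","c)
            ' Skip lines whose column count does not match the expected layout
            If parts.Length = m_columnNames.Length Then
                Dim record As New Dictionary(Of String, Object)
                For i As Integer = 0 To parts.Length - 1
                    record(m_columnNames(i)) = parts(i)
                Next
                Yield record
            End If
        Next
    End Function
End Class

Module GenericReaderDemo
    Sub Main()
        ' Write a small sample file to a temporary location
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllLines(tempFile, New String() {"Joe,3,4", "Mike,6,2"})

        Dim reader As New GenericCsvReader(tempFile, "Name", "X", "Y")
        For Each record In reader.Read()
            Dim personName As String = CStr(record("Name"))
            Dim sum As Integer = CInt(record("X")) + CInt(record("Y"))
            Console.WriteLine($"{personName},{sum}")
        Next

        File.Delete(tempFile)
    End Sub
End Module
```

The trade-off is that you lose compile-time checking of property names and types: a misspelled key or a non-numeric value only fails at run time, whereas the InputData class catches such mistakes earlier.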
VB specific: The Yield statement is like a Return statement that returns a value, but unlike the latter, it does not exit the function and continues its execution to return the next value of the enumeration, and so on, until the end of the function is reached.
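A small sketch makes this behavior visible. CountTo is a made-up iterator function used only for this demonstration; the Console.WriteLine calls show when each part of the code actually runs.

```vb
Imports System
Imports System.Collections.Generic

Module YieldDemo
    ' Hypothetical iterator: produces the numbers 1..limit one at a time
    Iterator Function CountTo(limit As Integer) As IEnumerable(Of Integer)
        For i As Integer = 1 To limit
            Console.WriteLine($"producing {i}")
            Yield i ' hands i to the caller, then resumes here on the next request
        Next
    End Function

    Sub Main()
        Dim numbers = CountTo(3) ' the body of CountTo has not started running yet
        Console.WriteLine("before loop")
        For Each n As Integer In numbers
            Console.WriteLine($"consumed {n}")
        Next
    End Sub
End Module
```

The "producing" and "consumed" lines alternate, and "before loop" is printed first. This is why a reader, a transformer and a writer built this way process one record at a time instead of loading the whole file into memory.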
Besides iterators I also used Object Initializers, String Interpolation (Point 12.), Using Statement.
-
First of all, thank you very much, Olivier, for taking the time to help me out. To be honest with you, I read your post several times and can't get the full picture of the proposed solution, which seems to be very good. What is not clear to me: do you propose to have one class inheriting from EtlParser, for instance MainParser, for all the different sources, and implement all your proposed interfaces in it? For now I have XmlParser/CsvParser/TxtParser; do you mean to make one class for all of them and implement your interfaces? – Arie, Sep 16, 2017 at 20:38
-
Would it be possible to extend your answer to show the whole change to my solution, so that I can understand it fully? I also do not get the iterator approach that could help me in the solution, and would appreciate seeing it in an example. It would also be best to see where (you mentioned constructors) or how a specific file path should be passed and, more interestingly, where specific parser functionality, like parsing CSV files, text files and databases, should be placed within the solution. If you could put it all in one piece I would really appreciate it. Thanks so much! – Arie, Sep 16, 2017 at 20:41
-
Looks nice. However, you said before that "You would have one class for an XML-data-source, one for a file-data-source..". From what I see, if I have a different CSV file than the one shown in your example, I'd have to implement a new class, e.g. ExampleCsvReaderX. That would mean each different file structure requires creating a new ExampleCsvReaderX class plus both InputDataX and OutputDataX classes for it. Am I right about this? Also, for a different transformation a new ExampleTransformerX has to be created. If I am not correct, can you show, based on your example, how a new, different CSV file would be implemented? – Arie, Sep 18, 2017 at 9:58
-
P.S. I can open a new topic if you like. Let me know. I already marked this as the answer. Thanks so much, Olivier, your help is invaluable to me. – Arie, Sep 18, 2017 at 9:59
-
Well, you are free to create generic readers that for instance return data in a dictionary. The data type could be a Dictionary(Of String, Object) for instance, storing property name/value pairs. – Olivier Jacot-Descombes, Sep 18, 2017 at 10:10