2
\$\begingroup\$

I have completed an ETL project to collect, parse and load files. I decided to do it the clean OOP way, using interfaces and abstract classes, but I have some questions below.

Sub Main()
    Dim collectionOfParsers As New List(Of EtlParser)
    Dim xmlparser1 As New XmlParser
    Dim xmlparser2 As New XmlParser
    Dim xmlparser3 As New XmlParser
    Dim txtparser1 As New TxtParser
    Dim txtparser2 As New TxtParser
    collectionOfParsers.Add(xmlparser1)
    collectionOfParsers.Add(xmlparser2)
    collectionOfParsers.Add(xmlparser3)
    collectionOfParsers.Add(txtparser1)
    collectionOfParsers.Add(txtparser2)
    For Each parser As EtlParser In collectionOfParsers
        parser.SaySomething()
        Dim canOpenFiles = TryCast(parser, ICanOpenFiles)
        If (canOpenFiles IsNot Nothing) Then
            canOpenFiles.OpenFiles()
        End If
        Dim canReadFiles = TryCast(parser, ICanReadFiles)
        If (canReadFiles IsNot Nothing) Then
            canReadFiles.Readfiles()
        End If
        Dim canTransFiles = TryCast(parser, ICanTransformFiles)
        If (canTransFiles IsNot Nothing) Then
            canTransFiles.TransformFile()
        End If
        Dim canSaveFiles = TryCast(parser, ICanSaveFiles)
        If (canSaveFiles IsNot Nothing) Then
            canSaveFiles.Savefiles()
        End If
    Next
End Sub

Public MustInherit Class Etl
End Class

Public MustInherit Class EtlParser : Inherits Etl
    Protected Sub CanParse()
        Console.WriteLine("Yes")
    End Sub
    Protected Overridable Sub SaySomething()
        Console.WriteLine("EtlParser say something")
    End Sub
    Protected MustOverride Sub CanParseFormat()
End Class

Public Interface ICanOpenFiles
    Sub OpenFiles()
End Interface

Public Interface ICanReadFiles
    Sub Readfiles()
End Interface

Public Interface ICanSaveFiles
    Sub Savefiles()
End Interface

Public Interface ICanTransformFiles
    Sub TransformFile()
End Interface

Public Class XmlParser : Inherits EtlParser
    Implements ICanOpenFiles, ICanReadFiles, ICanTransformFiles, ICanSaveFiles
    Public Sub OpenFiles() Implements ICanOpenFiles.OpenFiles
        Throw New NotImplementedException()
    End Sub
    Public Sub Readfiles() Implements ICanReadFiles.Readfiles
        Throw New NotImplementedException()
    End Sub
    Public Sub TransformFile() Implements ICanTransformFiles.TransformFile
        Throw New NotImplementedException()
    End Sub
    Public Sub Savefiles() Implements ICanSaveFiles.Savefiles
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub CanParseFormat()
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub SaySomething()
        'MyBase.SaySomething()
        Console.WriteLine("XmlParser say something")
    End Sub
End Class

Public Class CsvParser : Inherits EtlParser
    Implements ICanOpenFiles, ICanReadFiles, ICanTransformFiles, ICanSaveFiles
    Public Sub OpenFiles() Implements ICanOpenFiles.OpenFiles
        Throw New NotImplementedException()
    End Sub
    Public Sub Readfiles() Implements ICanReadFiles.Readfiles
        Throw New NotImplementedException()
    End Sub
    Public Sub TransformFile() Implements ICanTransformFiles.TransformFile
        Throw New NotImplementedException()
    End Sub
    Public Sub Savefiles() Implements ICanSaveFiles.Savefiles
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub CanParseFormat()
        Throw New NotImplementedException()
    End Sub
    Protected Overrides Sub SaySomething()
        'MyBase.SaySomething()
        Console.WriteLine("CsvParser say something")
    End Sub
End Class

Q1: Once I collect the files from the network drive (this will be done by a Collector later on), should the XmlParser class handle many files or just one? If just one, then, as you can see, I already create many XmlParser instances (one instance per file). However, I am not sure about this: maybe I should have a single XmlParser prepared for all files and call it just once?

Q2: In the For Each loop I declared the loop variable as the common type EtlParser so that I can pass different specific parsers (is that OK, by the way?). Can you explain how it is possible that a specific parser is still seen inside the loop as its own type? For instance, I passed an XmlParser and inside the loop I see it as an XmlParser as well. I thought that when a specific parser such as XmlParser is passed through a variable of its parent type EtlParser, it becomes an EtlParser and I have to cast it back to XmlParser inside the loop. I would like to understand that.

Q3: As far as I know, one definition of interfaces is "to provide common functionality to unrelated classes". What is the real benefit in my example code, given that all of my specific parsers end up using the same interfaces? All of them can open, read, transform and save...

Q4: As you can see, I have 3 specific parser classes, CsvParser, XmlParser and TxtParser, inheriting from their base class EtlParser. Wouldn't it be better to make one parser class and instead create interfaces IXml, ITxt and ICsv for it to implement? At the moment I think what I have is proper.

Q5: Why can't I call parser.SaySomething() in the Main method? When I inspect the parser item, it shows exactly the correct type.

Q6: Any other ideas or advice on my current code?

200_success
asked Sep 16, 2017 at 15:35
\$\endgroup\$

1 Answer

3
\$\begingroup\$

Q1: It takes nanoseconds to create an object and milliseconds to access a file; i.e. roughly one million times longer! Don't try to optimize things that will have absolutely no noticeable effect at the expense of clarity!

Q2: Since XmlParser has no methods specific to XmlParser (i.e. existing only in XmlParser), there is no advantage in casting the object to it. But since the base class EtlParser does not implement the interfaces, you must cast the object to these interfaces (which is what you are doing).
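To illustrate why the object still shows up as an XmlParser inside the loop: a variable's declared type only restricts which members the compiler lets you call, while the object it refers to keeps the type it was created with. The following is a minimal sketch; ParserBase and XmlParserDemo are hypothetical stand-ins, not the classes from the question.

```vbnet
Imports System

Public MustInherit Class ParserBase
    Public Overridable Function Describe() As String
        Return "ParserBase"
    End Function
End Class

' Hypothetical stand-in for a concrete parser.
Public Class XmlParserDemo
    Inherits ParserBase

    Public Overrides Function Describe() As String
        Return "XmlParserDemo"
    End Function
End Class

Module RuntimeTypeDemo
    Sub Main()
        ' The declared (compile-time) type is ParserBase,
        ' but the object itself is still an XmlParserDemo.
        Dim parser As ParserBase = New XmlParserDemo()

        Console.WriteLine(parser.GetType().Name)            ' prints XmlParserDemo
        Console.WriteLine(parser.Describe())                ' prints XmlParserDemo (the override runs)
        Console.WriteLine(TypeOf parser Is XmlParserDemo)   ' prints True
    End Sub
End Module
```

Assigning to a base-typed variable never converts the object; a cast is only needed to call members that the declared type does not expose.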

Q3, Q4, Q6: This is one possible approach. I will suggest another one below.

Q5: SaySomething() is Protected, which means that it is only visible within the class defining it and its descendants. Make it Public.
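The difference between the two access levels can be sketched as follows. GreeterBase, XmlGreeter and Secret are hypothetical names chosen for this illustration only.

```vbnet
Imports System

Public MustInherit Class GreeterBase
    ' Protected: callable only from GreeterBase and from classes derived from it.
    Protected Overridable Function Secret() As String
        Return "base secret"
    End Function

    ' Public: callable from any code that holds a reference to the object.
    Public Overridable Function SaySomething() As String
        Return "GreeterBase: " & Secret()
    End Function
End Class

Public Class XmlGreeter
    Inherits GreeterBase

    Protected Overrides Function Secret() As String
        Return "xml secret"
    End Function
End Class

Module AccessDemo
    Sub Main()
        Dim g As GreeterBase = New XmlGreeter()
        Console.WriteLine(g.SaySomething())   ' prints GreeterBase: xml secret
        ' Console.WriteLine(g.Secret())       ' would not compile: Secret is Protected
    End Sub
End Module
```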


Critique: Your interfaces make operations like opening files public. The caller then must know whether these operations are available and call them. But this is a technical implementation detail which should be kept private. A public interface should concentrate on the desired high-level logic, i.e. read data, transform data and maybe write data.

Suggestion: I would choose a more flexible approach allowing you to compose parsers from single components (like Lego bricks). Define this set of interfaces:

Public Interface IDataSource(Of T)
    Function Read() As IEnumerable(Of T)
End Interface

Public Interface ITransformer(Of TSource, TResult)
    Function Transform(ByVal source As IEnumerable(Of TSource)) As IEnumerable(Of TResult)
End Interface

Public Interface IDataSink(Of T)
    Sub Write(ByVal data As IEnumerable(Of T))
End Interface

The idea is to implement these interfaces by different classes. You would have one class for an XML-data-source, one for a file-data-source, one for a transformation, etc.

A data source can be a text file, an XML file, a database, or a dummy data source for test purposes. It is the data source's responsibility to open, read and close files etc. You don't need separate interfaces for all these operations.

Note that file names and connection strings can be passed as constructor parameters and don't need to be specified in the interfaces.
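As a rough sketch of the dummy data source mentioned above, a class can serve records from memory and receive them through its constructor, just as a file-based source would receive a file name. InMemoryDataSource is a hypothetical name; IDataSource is repeated only to keep the example self-contained.

```vbnet
Imports System
Imports System.Collections.Generic

Public Interface IDataSource(Of T)
    Function Read() As IEnumerable(Of T)
End Interface

' Hypothetical test double: serves records from memory instead of a file.
Public Class InMemoryDataSource(Of T)
    Implements IDataSource(Of T)

    Private ReadOnly m_items As IEnumerable(Of T)

    ' The data is passed through the constructor, so the interface stays free of such details.
    Public Sub New(ByVal items As IEnumerable(Of T))
        m_items = items
    End Sub

    Public Function Read() As IEnumerable(Of T) Implements IDataSource(Of T).Read
        Return m_items
    End Function
End Class

Module InMemoryDemo
    Sub Main()
        Dim lines As String() = {"Joe,3,4", "Mike,6,2"}
        Dim source As IDataSource(Of String) = New InMemoryDataSource(Of String)(lines)
        For Each line As String In source.Read()
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

Code that depends only on IDataSource(Of T) cannot tell this source from one backed by a file, which is what makes it usable in tests.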

Define classes that serve as transport vehicles for single data records, such as RawData, PreProcessedData and RefinedData, and use them as generic type arguments for the interfaces. You will probably choose names for these classes that are better suited to your specific problem.

You can even chain several transformations like this:

read >>(RawData)>> transform 1 >>(PreProcessedData)>> transform 2 >>(RefinedData)>> write

One advantage of this approach is that you can apply the same transformations to different types of data sources (having the same TSource) and store the result into different types of destinations (having the same TResult).
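Chaining works because each transformer consumes an IEnumerable and returns another one, so the output of one step can be fed straight into the next. A minimal sketch, assuming two hypothetical transformers (ParseNumbers and DoubleNumbers) and plain String/Integer records instead of dedicated data classes:

```vbnet
Imports System
Imports System.Collections.Generic

Public Interface ITransformer(Of TSource, TResult)
    Function Transform(ByVal source As IEnumerable(Of TSource)) As IEnumerable(Of TResult)
End Interface

' Hypothetical first step: text to numbers.
Public Class ParseNumbers
    Implements ITransformer(Of String, Integer)

    Public Iterator Function Transform(ByVal source As IEnumerable(Of String)) As IEnumerable(Of Integer) _
        Implements ITransformer(Of String, Integer).Transform
        For Each item As String In source
            Yield Integer.Parse(item)
        Next
    End Function
End Class

' Hypothetical second step: doubles every number.
Public Class DoubleNumbers
    Implements ITransformer(Of Integer, Integer)

    Public Iterator Function Transform(ByVal source As IEnumerable(Of Integer)) As IEnumerable(Of Integer) _
        Implements ITransformer(Of Integer, Integer).Transform
        For Each number As Integer In source
            Yield number * 2
        Next
    End Function
End Class

Module ChainDemo
    Sub Main()
        Dim raw As IEnumerable(Of String) = New String() {"1", "2", "3"}
        Dim parser As New ParseNumbers()
        Dim doubler As New DoubleNumbers()

        ' The result of the first transformation is the input of the second one.
        Dim parsed As IEnumerable(Of Integer) = parser.Transform(raw)
        Dim doubled As IEnumerable(Of Integer) = doubler.Transform(parsed)

        Console.WriteLine(String.Join(",", doubled))   ' prints 2,4,6
    End Sub
End Module
```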

Note: Iterators (Visual Basic) will help you to implement these interfaces.


Let's make a very simple example. We have a CSV-File with a name column and two number columns. We want to transform this file into another one containing the name column and one number column containing the sum of the two numbers.

Input file:

Joe,3,4
Mike,6,2
Sue,10,3

Expected output file:

Joe,7
Mike,8
Sue,13

We need two data classes

Public Class InputData
    Public Property Name As String
    Public Property X As Integer
    Public Property Y As Integer
End Class

Public Class OutputData
    Public Property Name As String
    Public Property Sum As Integer
End Class

A reader

' Requires: Imports System.IO (for File)
Public Class ExampleCsvReader
    Implements IDataSource(Of InputData)

    Private m_filename As String

    Public Sub New(ByVal filename As String)
        m_filename = filename
    End Sub

    Public Iterator Function Read() As IEnumerable(Of InputData) _
        Implements IDataSource(Of InputData).Read
        ' File.ReadLines reads lazily, one line at a time.
        For Each line As String In File.ReadLines(m_filename)
            Dim parts = line.Split(","c)
            ' Lines that do not have exactly three columns are skipped.
            If parts.Length = 3 Then
                Yield New InputData With {.Name = parts(0), _
                    .X = CInt(parts(1)), .Y = CInt(parts(2))}
            End If
        Next
    End Function
End Class

A transformer

Public Class ExampleTransformer
    Implements ITransformer(Of InputData, OutputData)

    Public Iterator Function Transform(source As IEnumerable(Of InputData)) _
        As IEnumerable(Of OutputData) _
        Implements ITransformer(Of InputData, OutputData).Transform
        For Each record As InputData In source
            Yield New OutputData With {.Name = record.Name, .Sum = record.X + record.Y}
        Next
    End Function
End Class

A writer

' Requires: Imports System.IO (for File and StreamWriter)
Public Class ExampleCsvWriter
    Implements IDataSink(Of OutputData)

    Private m_filename As String

    Public Sub New(ByVal filename As String)
        m_filename = filename
    End Sub

    Public Sub Write(data As IEnumerable(Of OutputData)) _
        Implements IDataSink(Of OutputData).Write
        ' Using closes the file even if an exception occurs while writing.
        Using sw As StreamWriter = File.CreateText(m_filename)
            For Each record As OutputData In data
                sw.WriteLine($"{record.Name},{record.Sum}")
            Next
        End Using
    End Sub
End Class

And finally we can stitch the parts together

Dim reader = New ExampleCsvReader(inputFile)
Dim transformer = New ExampleTransformer()
Dim writer = New ExampleCsvWriter(outputFile)
Dim inputData = reader.Read()
Dim outputData = transformer.Transform(inputData)
writer.Write(outputData)

Generic solution: This approach also lets you realize a more generic solution. You are free to create generic readers that, for instance, return each record as a Dictionary(Of String, Object) storing property name/value pairs. Such a reader would implement IDataSource(Of Dictionary(Of String, Object)).
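One way such a generic reader might look is sketched below. It rests on an assumption that is not part of the example above: the first line of the CSV file holds the column names. HeaderCsvReader is a hypothetical name, and all values are stored as strings.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Public Interface IDataSource(Of T)
    Function Read() As IEnumerable(Of T)
End Interface

' Hypothetical generic reader: assumes the first line contains the column names.
Public Class HeaderCsvReader
    Implements IDataSource(Of Dictionary(Of String, Object))

    Private ReadOnly m_filename As String

    Public Sub New(ByVal filename As String)
        m_filename = filename
    End Sub

    Public Iterator Function Read() As IEnumerable(Of Dictionary(Of String, Object)) _
        Implements IDataSource(Of Dictionary(Of String, Object)).Read
        Dim headers As String() = Nothing
        For Each line As String In File.ReadLines(m_filename)
            Dim parts = line.Split(","c)
            If headers Is Nothing Then
                headers = parts   ' first line: column names
            ElseIf parts.Length = headers.Length Then
                Dim record As New Dictionary(Of String, Object)
                For i As Integer = 0 To headers.Length - 1
                    record(headers(i)) = parts(i)   ' values are kept as strings
                Next
                Yield record
            End If
        Next
    End Function
End Class

Module GenericReaderDemo
    Sub Main()
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllLines(tempFile, New String() {"Name,X,Y", "Joe,3,4", "Mike,6,2"})

        Dim reader As New HeaderCsvReader(tempFile)
        For Each record As Dictionary(Of String, Object) In reader.Read()
            Console.WriteLine(String.Format("{0}: {1} + {2}", record("Name"), record("X"), record("Y")))
        Next

        File.Delete(tempFile)
    End Sub
End Module
```

The trade-off is that one reader class covers every file layout, but type safety is lost: a transformer working on dictionaries has to know the column names and convert the values itself.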

VB specific: The Yield statement is like a Return statement in that it hands a value back to the caller, but unlike Return it does not end the function. When the caller asks for the next element of the enumeration, execution resumes right after the Yield and continues until the next Yield or the end of the function is reached.
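The small sketch below makes this behavior visible. The names are made up for the illustration; the point is that the body of an iterator function does not run when the function is called, only while the result is being enumerated.

```vbnet
Imports System
Imports System.Collections.Generic

Module YieldDemo
    ' Records the order of events so that it can be inspected afterwards.
    Public ReadOnly Log As New List(Of String)

    Iterator Function Numbers() As IEnumerable(Of Integer)
        Log.Add("producing 1")
        Yield 1
        Log.Add("producing 2")
        Yield 2
    End Function

    Sub Main()
        Dim sequence = Numbers()   ' the function body has not started yet
        Log.Add("before loop")
        For Each n As Integer In sequence
            Log.Add("consumed " & n.ToString())
        Next

        ' Prints: before loop, producing 1, consumed 1, producing 2, consumed 2
        Console.WriteLine(String.Join(", ", Log))
    End Sub
End Module
```

This laziness is why a reader built on File.ReadLines and Yield can process a large file record by record without loading it into memory first.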

Besides iterators I also used Object Initializers, String Interpolation (Point 12.), Using Statement.

answered Sep 16, 2017 at 17:00
\$\endgroup\$
9
  • \$\begingroup\$ First of all, thank you very much, Olivier, for taking the time to help me out. To be honest with you, I have read your post several times and can't get the full picture of the proposed solution, which nevertheless seems to be very good. What is not clear to me: do you propose to have one class inheriting from EtlParser, for instance MainParser, for all the different sources, and have it implement all your proposed interfaces? For now I have XmlParser/CsvParser/TxtParser; do you mean to make one class for all of them and implement your interfaces? \$\endgroup\$ Commented Sep 16, 2017 at 20:38
  • \$\begingroup\$ Would it be possible to extend your answer to show the whole change to my solution, so that I could understand it fully? I also do not get the iterator approach that could help me with the solution, and would appreciate seeing it in an example. It would also be best to see where (you mentioned constructors) or how a specific file path should be passed and, more interestingly, where specific parser functionality, such as parsing CSVs, TXTs and databases, should be placed within the solution. If you could put it all in one piece I would really appreciate it. Thanks so much, dude! \$\endgroup\$ Commented Sep 16, 2017 at 20:41
  • \$\begingroup\$ Looks nice. However, you said before that "You would have one class for an XML-data-source, one for a file-data-source..". From what I see, if I have a different CSV file than the one shown in your example, I'd have to implement a new class, e.g. ExampleCsvReaderX. So each different file structure would mean creating a new ExampleCsvReaderX class plus InputDataX and OutputDataX classes for it. Am I right about this? Also, for a different transformation, a new ExampleTransformerX has to be created. If I am not correct, can you show, based on your example, how a new, different CSV file would be implemented? \$\endgroup\$ Commented Sep 18, 2017 at 9:58
  • \$\begingroup\$ P.S. I can open a new topic if you like. Let me know. I have already marked this as the answer. Thanks so much, Olivier, your help is invaluable to me. \$\endgroup\$ Commented Sep 18, 2017 at 9:59
  • \$\begingroup\$ Well, you are free to create generic readers that for instance return data in a dictionary. The data type could be a Dictionary(Of String, Object) for instance, storing property name/value pairs. \$\endgroup\$ Commented Sep 18, 2017 at 10:10
