I am working on a project which connects to different data sources and fetches data. The problem is that each of these data sources needs different parameters to fetch the data:
s3 = S3(ACCESS_KEY, SECRET_KEY, BUCKET, HOST)
db = DB(HOST, USERNAME, PASSWORD, DB_NAME, SCHEMA)
sftp = SFTP(HOST, USERNAME, PASSWORD)
The fetch data function also has a different signature for each source:
s3.fetch_data(folder_path, filename)
db.fetch_data(table_name, filter_args)
sftp.fetch_data(file_path)
How do I design a common interface that can stream data to and from any of the above data sources (defined dynamically via a config)? Is there a design pattern that addresses this problem?
I have looked into the strategy pattern, but I assume it applies to cases where the behavior changes while the is-a relationship prevails. In the case of the repository pattern, there needs to be a common object across multiple storage backends. Neither applies here.
2 Answers
Connection Strings, Paths, and URIs
A connection string is a string containing all the information needed to connect to a service. Good examples are ODBC connection strings: they identify the specific kind of service provider, which further processes the string to connect to its service.
A path is a string which directs a given service to a particular resource of interest. The most prolific example is a simple file path string. It has separators and, depending on the file service, special wildcard characters for selecting sets of resources. XPath is another good example.
A URI is the synergy of these two concepts in a single string. There is even a standard pattern for these: `protocol://user:pass@server/path`. But any string which accomplishes picking a service, contains the service location/configuration, and directs the service to a resource (or resources) will do the job.
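As a sketch, assuming hypothetical URIs for the question's three sources (the credentials, hosts, and paths here are made up), Python's standard `urllib.parse` can already split such strings into their parts:

```python
from urllib.parse import urlparse

# Hypothetical URIs for the question's three sources; the scheme picks the
# service, the rest carries its configuration and the resource path.
examples = [
    "s3://ACCESS_KEY:SECRET_KEY@host/bucket/folder_path/filename",
    "db://USERNAME:PASSWORD@host/db_name/schema/table_name",
    "sftp://USERNAME:PASSWORD@host/file_path",
]

for uri in examples:
    parts = urlparse(uri)
    print(parts.scheme, parts.username, parts.path)
```

Each source's differing constructor arguments become fields of one string, so the caller only ever handles a URI.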
Service Locators
A service locator is the bootstrapper for this whole scheme. It is a central location with which each protocol/service handler registers itself, along with two key pieces of information:
- How to identify a URI it can handle, be it a pattern, a string prefix, or a callable function.
- A function to handle connecting to the service.
e.g. web browsers have a service locator which looks at the string before `://` and checks it against a list of implementers for: http, https, ftp, ...
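A minimal sketch of such a locator (the handler predicates and the "connections" they return are placeholders, not real S3/SFTP clients):

```python
class ServiceLocator:
    """Central registry: each provider says which URIs it can handle."""

    def __init__(self):
        self._providers = []

    def register(self, can_handle, connect):
        # can_handle: predicate over the URI string.
        # connect: factory that forms the connection for a matching URI.
        self._providers.append((can_handle, connect))

    def connect(self, uri):
        for can_handle, connect in self._providers:
            if can_handle(uri):
                return connect(uri)
        raise LookupError(f"no provider registered for {uri!r}")


locator = ServiceLocator()
# Placeholder handlers keyed on the scheme prefix, like a browser's
# http/https/ftp table; real ones would return live connections.
locator.register(lambda u: u.startswith("s3://"),
                 lambda u: f"S3 connection to {u}")
locator.register(lambda u: u.startswith("sftp://"),
                 lambda u: f"SFTP connection to {u}")
```

New backends then become a matter of registering one more pair, with no change to the calling code.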
Service Providers
The service provider is responsible for the next part of the processing. It will be passed all or some of the URI (perhaps missing the protocol selector) and will then be expected to form the connection to the service.
How the connection is formed, and whether it connects directly to the resource path or connects to the root and then traverses through the resource path, is up to the provider. The traversal may even be figurative, in the sense that it returns an object representing the resource at the end of the path without first confirming that it exists, which is useful for operations such as creating that resource.
Once connected, the service provider is responsible for presenting that resource using one or more knowledge abstractions, and for providing implementations of the operations on that abstraction.
There is another way that a service provider can handle a URI. That is by not handling the URI at all, I'll cover this at the bottom.
Generic Knowledge Representation
All knowledge is representable by a graph.
Consider a File-system.
Directories/folders contain entries that are either directories or files (at the simple end; more complex file systems might include pipes, semaphores, devices, etc.). There is also no rule that an entry exists in only one directory; it's even reasonable for a directory to hold itself.
Files contain a blob of unstructured data of supposed meaning to someone, just not to the file system or general file-system watchers. Some file systems treat files as a different kind of collection containing meta-data and forks (forks being named blobs of unstructured data).
The takeaway is that the file system has leaves, like the meta-data values and blobs of unstructured data. On top of this it has branches, which are collections of one style or another, usually maps but perhaps just sets or lists. One of those collections is the root collection (if the graph is a tree), or has been blessed as the root because from it all other collections/files can be reached (there may be several candidates).
The same thinking can be applied to other knowledge representations like sql databases, no-sql databases, json documents, etc...
Databases for example are like this: Server > Database > Schema > Tables > Records > Map
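To make the parallel concrete, here is a toy sketch (all names invented) of both hierarchies as plain nested collections: branches are maps/lists, leaves are blobs and values:

```python
# A file system as a graph: directories are maps, file contents are leaves.
filesystem = {
    "home": {
        "notes.txt": b"unstructured blob",
    },
}

# A database as the same shape of graph.
database = {                                 # Server
    "sales_db": {                            # Database
        "public": {                          # Schema
            "orders": [                      # Table -> Records
                {"id": 1, "total": 9.99},    # Record -> Map
            ],
        },
    },
}
```

Once both are seen this way, one traversal mechanism can serve them all.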
Knowledge Operations
Representing knowledge is good, but knowledge must often be mutated. Operations such as:
- authentication and authorisation (revealing more of the graph),
- changing root (revealing a different graph),
- altering the graph structure (like moving a directory)
- altering the blob contents (like changing a file's contents)
- altering operations (like changing who is authorised to read/write xyz, or adding a script to be run before/after another operation)
Not to mention the most obvious operations:
- access parent (not sensible in every knowledge system, but useful where it makes sense)
- list children (who is contained)
- read (.. and get their value)
Fortunately these operations are generic, and can be broadly expressed even though this or that provider may or may not support them.
You can express these in a number of ways depending on the language, but it boils down to having an `Entry` or `Node` interface which supports a series of operations. That's it.
If the `Entry` is a directory it will respond to the list-children operation with some children (unless it's empty). But a file `Entry` would be empty, or return its meta-data or its forks. A table `Entry` might return its records.
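A minimal sketch of such an `Entry` interface in Python (the class and method names are illustrative, not a prescribed API):

```python
from abc import ABC, abstractmethod


class Entry(ABC):
    """Generic node: every knowledge system answers the same operations,
    even if a given provider answers some with 'nothing here'."""

    @abstractmethod
    def children(self):
        """List the contained entries (empty for leaves)."""

    @abstractmethod
    def read(self):
        """Return this entry's value, if it has one."""


class DirectoryEntry(Entry):
    def __init__(self, entries):
        self._entries = entries

    def children(self):
        return list(self._entries)

    def read(self):
        raise NotImplementedError("directories have no blob contents")


class FileEntry(Entry):
    def __init__(self, blob):
        self._blob = blob

    def children(self):
        return []          # a plain file contains no further entries

    def read(self):
        return self._blob
```

A table entry or an S3-key entry would be further implementations of the same two operations.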
Traversal and Pathing
Which leads us full circle back to the start. How to describe the traversal from a root, be it well known (fully qualified) or from some contextual/passed in root (relative). The answer: Paths.
More specifically a Path Object - which is a series of Operations. Now most pathing systems only support traversal and selection operations. But oddly enough they don't have to be just traversal or selection operations. Just operations - which means the Path can describe creating a new something at a location, or deleting a selection of things.
Even better these Path objects aren't at the level of your service providers (which might have their own ideas about pathing) thus you only have to care about your specific abstraction and its supported operations. When you follow a Path the series of operations are literally called one after the other on the result of the previous operation, starting from a given root.
- If the operation doesn't exist the path is invalid (or not real)
- If the result was empty then the path points to nothing
- Otherwise there is something remaining at the end, which could be a range of things or a single specific thing.
Along the way resources may have been created, deleted, skipped, selected, traversed, or ignored.
The simplest path is just a chain of `select this entry` operations, which also represents over 80% of the uses for a path. This just highlights that paths are limited domain-specific languages, not unlike SQL for databases.
Obviously the more powerful operations available to the pathing DSL the more work is needed to sanitise paths specified by users, but conversely the easier it is to express certain kinds of actions.
This does leave the thorny issue of how you serialise the path, but I think that is a different sort of question.
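A bare-bones sketch of such a Path object, here simply a list of callables applied one after the other starting from a given root; `select` is the "select this entry" operation, and the names are illustrative:

```python
# A Path as a series of operations, each applied to the result of the
# previous one, starting from a root.
def follow(root, path):
    node = root
    for operation in path:
        node = operation(node)
        if node is None:
            return None   # the path points to nothing
    return node


def select(name):
    # The common case: a "select this entry" operation.
    return lambda node: node.get(name) if isinstance(node, dict) else None


# A toy root to traverse.
root = {"home": {"user": {"notes.txt": "hello"}}}
```

Because operations are just callables, a create or delete step slots into the same list as a select.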
Service Providers Mark 2
Assuming you have built out all of this high-level abstraction machinery (you could have stopped at each heading and said "nope, don't need more"), we can circle back to the service providers.
Truth is, we don't have to pass a URI to them, or any paths, or even a connection string. Instead our pathing machinery can digest the URI into a series of operations that include not only traversal of the knowledge system itself but also traversal of the connection to that knowledge system.
In short, the first operation operates on the service locator (or a service locator that was passed in) to traverse to the entry linked to that protocol. In fact this might be a multi-part process now, e.g. `git+ssh://` first selects git, then instructs it to connect via ssh.
Each successive operation provides configuration information of one sort or another until the path calls an operation that needs the service provider to contact the backend service. At which point the connection is made, and accessing starts to happen. Which operation that is is up to the service provider.
Thus the service provider needs only to expose entries that contain only the semantically reasonable operations at that point in time, from its perspective.
Any client can list those operations and apply something (like a human) to reason its way through. Or it can have a pre-constructed path that it hands over and hopes the provider will agree that it is indeed valid.
Even more elegantly, any entry object can be passed in with a path, so instead of the first operation being applied to the service locator, it's instead applied to the passed-in entry. Thus we can entertain knowledge systems that aren't globally registered, or move from one location to another without reconstructing a fully qualified traversal path.
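A toy sketch of this idea (every name here is invented): the URI is digested into operations, and the very first operation traverses the locator registry itself, so picking the provider is no different from any other traversal step:

```python
# Digest a URI into a series of operations: the first selects the provider
# from a registry, the rest traverse into the knowledge system.
def digest(uri):
    scheme, _, rest = uri.partition("://")
    ops = [lambda registry: registry[scheme]]            # pick the provider
    ops += [lambda node, s=seg: node[s] for seg in rest.split("/")]
    return ops


def resolve(start, ops):
    node = start
    for op in ops:
        node = op(node)
    return node


# A toy registry mapping schemes to in-memory "services".
registry = {"sftp": {"host": {"file.txt": "contents"}}}
```

Passing a different starting object to `resolve` gives exactly the relative-traversal behaviour described above.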
- Sounds like a good start of a post which could actually become an answer to the question. – Doc Brown, Jan 5, 2021
- Hey Kain, thanks for answering the question. I can see how the URI-based method solves the object initialisation, but how do we extend it to `read` and `write` methods? Even if it's URI/query construction, aren't we just moving the problem to a different place? – Bhavani Ravi, Jan 5, 2021
- Under the hood, you are only moving bytes from source A to source B. Well, arrays of bytes, probably. That's your bare minimum interface. However, it's likely the programming language you use already provides abstractions (or functions) for the matter, so you only have to build your own abstraction on top of them, or use the ones provided out of the box by the language. In the end, you will probably have several concrete components, each of which solves a different integration (SFTP, HTTP, file, etc.). Each of these classes knows how to read and write those bytes. – Laiv, Jan 5, 2021
- Each of these concrete elements deals with the differences between the data sources. That's encapsulation, and it's what prevents you from implementing God components that are good at everything and nothing at all at the same time. – Laiv, Jan 5, 2021
- @BhavaniRavi Hopefully this update explains the patterns of thought in this field a bit better. It is quite a deep topic, touching on language design, semiotics, topology, ontology, and a fair few others, not to mention the many, many ways of achieving similar outcomes. – Kain0_0, Jan 6, 2021
The thing about an interface is that to the consumer of said interface, all implementations of that interface look alike. In essence, the consumer shouldn't be able to know (nor care) which implementation is being used.
The examples you use already violate this premise. The consumer of your three data sources treats the db source differently than the other two, because it supplies the db source with filters that it simply does not provide to the other two data sources.
Therefore, it's not possible to fit these three into a contract.
Secondly, you've been very quiet about the return type of these three fetch operations, which is another cause for concern when trying to fit an interface. I highly doubt that your s3, sftp, and db sources natively return the same type, so you already need some kind of conversion of the returned data into a reusable object.
What you're going to find here is that this often leads to making a DTO class that is specific to a db table (not just any table the consumer freely chooses), which in turn suggests that the consumer shouldn't be freely choosing the table name, but should rather be provided with a specific method that accesses a predetermined and hardcoded table name, and which in turn will return a specific DTO that matches the content of that database table.
This is incongruous with your file-based approach, unless you have an implicit expectation that certain files contain certain data that is also serializable to the same DTO class, but you didn't mention any of that.
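As an illustration of that direction, a hypothetical sketch (the `Order` DTO, the `orders` table name, and the repository class are invented for the example): the table name is hardcoded inside a dedicated method rather than chosen by the consumer:

```python
from dataclasses import dataclass


# Hypothetical DTO tied to one specific table.
@dataclass
class Order:
    id: int
    total: float


class OrderRepository:
    """Wraps a db source; consumers get typed Orders, not free-form rows."""

    def __init__(self, db):
        self._db = db

    def fetch_orders(self, **filters):
        # The table name is fixed here, not supplied by the consumer.
        rows = self._db.fetch_data("orders", filters)
        return [Order(id=row["id"], total=row["total"]) for row in rows]
```

The consumer now depends on `Order`, not on knowing there is a database (or a table name) behind it.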
Overall, your approach seems to be missing several layers' worth of complexity and analysis. You present an example in which the consumer is expected to provide a table name and query filters, without being aware of whether you're even accessing a database. That doesn't make sense.
You're going to need indirections and reusability patterns here, but this is more than I can reasonably write an answer for based on the information you have provided. It requires in-depth knowledge of the requirements, context, and what leads you to believe that you are (and need to be) able to handle these three data sources interchangeably.