The code works fine, but I'm not sure about the way I've created different functions and called them from others. I usually separate BigQuery-related functions into a utility module, but since there were only a couple, I included them inside the class; I'm not sure I like them that way.
Another thing I don't like is how none of the functions take any arguments. Is that bad practice, or is this type of structuring fine?
I also feel like I have declared too many property variables and made the class cluttered.
```python
import os
import re
from datetime import datetime

import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

from settings import logger
# gbq_config and accruals_config are loaded elsewhere in the project


@logger.catch(reraise=True)
class AccrualTable:
    __credentials = service_account.Credentials.from_service_account_file(gbq_config["CREDENTIALS"])
    client = bigquery.Client(credentials=__credentials)

    def check_snapshot_eligibility(self):
        snapshot_date = ''
        eligible = False
        query_result = self.client.query(<sql>).to_dataframe()
        if not query_result.empty:
            eligible = True
            snapshot_date = query_result.iloc[0]['snapshot_date'].strftime('%Y_%m_%d')
        return eligible, snapshot_date

    def __init__(self, accrual_type):
        assert re.fullmatch(r'aa|ab', accrual_type, re.IGNORECASE), "Invalid accruals type. Should be 'aa' or 'ab'"
        self.accrual_type = accrual_type.upper()
        self.dataset_name = bigquery.DatasetReference(accruals_config["PROJECT_ID"], accruals_config["DATASET_ID"])
        self.table_name = self.dataset_name.table(accruals_config['TABLES'][self.accrual_type])
        self.full_table_name = '.'.join([self.table_name.project,
                                         self.table_name.dataset_id,
                                         self.table_name.table_id])
        self.snapshot_eligible, self.snapshot_date = self.check_snapshot_eligibility()
        self.snapshot_name = accruals_config['SNAPSHOT_FILENAME'].format(accruals_type=self.accrual_type,
                                                                         timestamp=self.snapshot_date)
        self.destination_uri = os.path.join(accruals_config["SNAPSHOT_BUCKET"],
                                            self.accrual_type,
                                            self.snapshot_name,
                                            '*')

    def extract_to_bucket(self):
        extract_job = self.client.extract_table(
            self.table_name,
            self.destination_uri
        )  # API request
        extract_job.result()  # Waits for job to complete.
        return 'Success'

    def build_snapshot_ddl(self) -> str:
        accruals_metadata = self.client.query(<query to extract ddl>).to_dataframe()
        # convert to external table with the snapshot name
        snapshot_ddl_partial = re.sub(f'CREATE TABLE `{self.full_table_name}`',
                                      f'CREATE EXTERNAL TABLE `{accruals_config["PROJECT_ID"]}.{accruals_config["DATASET_ID"]}.{self.snapshot_name}`',
                                      accruals_metadata.iloc[0]['ddl'])
        # remove semicolon before options
        snapshot_ddl_complete = re.sub(';', '', snapshot_ddl_partial) + f'''
        OPTIONS (
          format = 'CSV',
          uris = ['{self.destination_uri}']
        );
        '''
        return snapshot_ddl_complete

    def store_metadata(self):
        logger.info('storing snapshot metadata')
        metadata_record = pd.DataFrame({'backup_timestamp': [datetime.now()],
                                        'accrual_type': [self.accrual_type],
                                        'backup_location': [self.destination_uri],
                                        'table_schema': [self.build_snapshot_ddl()]})
        dest_table = accruals_config['ACCRUALS_SNAPSHOT_METADATA']
        logger.info(f'inserting contents of records dataframe to {dest_table}')
        job = self.client.load_table_from_dataframe(dataframe=metadata_record,
                                                    destination=dest_table)
        logger.debug(job.result())
        logger.info('done')
        return
```
I have a wrapper that calls this class and performs operations based on the flag:
```python
from settings import logger
from accrual_table import AccrualTable

logger.info('Beginning Accruals Snapshot Process')
for accrual in ['aa', 'ab']:
    logger.info(f'Begin {accrual}')
    accrual_table = AccrualTable(accrual)
    take_snapshot = accrual_table.extract_to_bucket()
    if take_snapshot == 'Success':
        accrual_table.store_metadata()
logger.info('Complete!')
```
2 Answers
```python
self.accrual_type = accrual_type.upper()
```
Looks like you're using strings to represent values from a finite set. Enums are better suited for this. The main advantage in your case is that it becomes impossible to pass a value that doesn't exist in the enumeration, while with strings (which are not restricted by anything) you can't be sure whether the passed value is a real `accrual_type` or not.
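For illustration, a minimal sketch of what that could look like (the `AccrualType` and `make_accrual_type` names are hypothetical, not from the original code):

```python
from enum import Enum

class AccrualType(Enum):
    AA = 'AA'
    AB = 'AB'

def make_accrual_type(raw: str) -> AccrualType:
    # the Enum constructor raises ValueError for anything outside the set,
    # replacing the assert-based check
    return AccrualType(raw.upper())
```

Anywhere the code currently compares strings, it can compare enum members instead, and an invalid input fails immediately at construction time.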
```python
return 'Success'
```
Returning hardcoded strings is an instant red flag: calling code doesn't know what to expect from such a method. In this case there is only one possible value that can be returned, which makes it even less meaningful: just don't return anything instead. If there is a chance of failure, a boolean value can be used (although it's still not ideal from a readability standpoint).
```python
def check_snapshot_eligibility(self):
    ...
    return eligible, snapshot_date
```
This method's name doesn't reflect its return value: you would expect such a method to return a boolean, but it also returns a second value that isn't mentioned anywhere. There are two good ways to solve this:

- Rename the method to `get_snapshot_date` and have it return `None` when the table isn't eligible, which can easily be accounted for.
- Split the method into two: one that checks eligibility, and another that obtains the snapshot date.
```python
self.snapshot_name = accruals_config['SNAPSHOT_FILENAME'].format(accruals_type=self.accrual_type,
                                                                 timestamp=self.snapshot_date)
```
This line will produce an incorrect name if the previous check failed and `snapshot_date` is empty.
```python
def build_snapshot_ddl(self) -> str:
```
This lonely type hint looks out of place. Since you're not using them everywhere (which would be totally fine), I don't see a reason why this particular return value needs a hint; it's clear from the code that it returns a string.
All the manual SQL-query machinations are very hard to comprehend and extremely error-prone. Consider using something like a query-building library to do the dirty work for you.
```python
logger.debug(job.result())
```
Don't put logic inside logging calls. You may delete this line without a second thought (it's just logging, after all) and then wonder why the code doesn't work any more. It may not be relevant in this particular case, but it has to be said anyway: good logging libraries may not even evaluate the arguments of logging calls in a production environment, which would, once again, break the code here.
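A minimal sketch of the fix: evaluate first, log afterwards. `FakeJob` here is a hypothetical stand-in for the BigQuery load job:

```python
import logging

logger = logging.getLogger(__name__)

def store_metadata_result(job):
    # run the job first, so the side effect survives even if the
    # logging line is later deleted or optimised away
    result = job.result()
    logger.debug(result)
    return result

class FakeJob:
    """Hypothetical stand-in for a BigQuery load job."""
    def result(self):
        return 'load complete'
```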
And another nitpick regarding logging: make sure it's clear what is being logged.
```python
logger.info('done')
```
In this case the person reading logs will have no idea what this 'done' stands for. A simple tag referencing the name of the class solves this problem.
As for your questions: methods with no arguments apart from `self` are completely fine; fewer is generally better. The number of properties doesn't matter as long as they are actually being used. You can get rid of `dataset_name`, since it's only used on the next line to declare another property. `snapshot_eligible` is not used anywhere, and `snapshot_date` is, once again, only used on the next line.
It's hard to say more without looking at where this class is being used externally.
Comment (mankand007, Apr 18, 2023): Thank you for the detailed critique. I'm actually using a wrapper where I use the `snapshot_eligible` flag. I'm trying to add it to the question.
What @QuasiStellar said! And...
```python
take_snapshot = accrual_table.extract_to_bucket()
if take_snapshot == 'Success':
    accrual_table.store_metadata()
```
design of Public API
Take care to expose as public just the methods and attributes that callers actually need and may rely on in future revs. Make all others private by prepending an `_` underscore. This makes it easier for folks to figure out your API and call into it. The smaller public surface area will make it easier to issue future releases that you are confident won't break existing client code.
In the three lines of code above, the middle one should just be deleted, as `raise` suffices to report lack of success.
Consider converting that pair of methods to private helpers, and exporting a new `.backup_table()` which extracts and then stores the metadata. If a client use case arises where extract-only is of interest, it's always trivial to expose it publicly again.
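As a sketch only (the method bodies here are stand-ins, not the real BigQuery calls), the refactor might look like:

```python
class AccrualTable:
    """Sketch of the suggested public surface; the bodies below are
    stand-ins for the real BigQuery calls."""

    def _extract_to_bucket(self) -> None:
        # real code: self.client.extract_table(...).result()
        self._extracted = True

    def _store_metadata(self) -> None:
        # real code: self.client.load_table_from_dataframe(...).result()
        self._metadata_stored = True

    def backup_table(self) -> None:
        # the single public entry point: extract, then record metadata;
        # any failure propagates as an exception instead of a status string
        self._extract_to_bucket()
        self._store_metadata()
```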
My prediction?
YAGNI!
Such simplification and hiding plays into your "cluttered" critique.
noun vs verb identifiers
Even if you kept the current three source lines, the `take_snapshot` identifier should still be renamed. Both "extract" and "store" are excellent action verbs, so they're a good fit for those method names. But getting back a Thing called "take" is odd. Better to call it a noun: `status` or `snapshot_status`.
comments suggest helpers
You felt a need to write this bit of English:

```python
# remove semicolon before options
```

and I thank you; I find it helpful. But maybe you'd like to delete the comment and write a trivial helper whose single responsibility is to `_remove_semicolon_before_options()`.
performance regression
You're using a logger instead of `print()`, and that's good! Among its advantages, I imagine it gives you timestamps "for free". Still, you might consider recording

```python
from time import time
...
t0 = time()
```

Then do a cloud operation, then log `elapsed = time() - t0`.
You might also log a row count.
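Putting the elapsed time and row count together, one possible sketch (with a hypothetical `operation` callable standing in for the cloud call):

```python
import logging
from time import time

logger = logging.getLogger(__name__)

def timed_operation(operation, label: str):
    # `operation` is a hypothetical callable that performs the cloud call
    # and returns a row count; the real code would wrap extract_table or
    # load_table_from_dataframe here
    t0 = time()
    row_count = operation()
    elapsed = time() - t0
    logger.info(f'{label}: {row_count} rows in {elapsed:.3f} s')
    return row_count, elapsed
```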
Here is the operational concern. As months go by, a table may grow unexpectedly large, or a cloud resource might respond unexpectedly slowly. The ability to conveniently graph historic row counts, elapsed times, and their ratio is invaluable for diagnosing and remediating sluggish performance. The ratio is called "app level throughput" and has units of row/sec.
A common scenario is to see a sudden jump in elapsed time starting on a certain day. Then we get to ask "what changed on that day?" Often a config change or a production release will be the culprit, and `git diff` will both reveal the details and guide maintainers who are testing a fix.