The code works fine, but I'm not sure about the way I've created different functions and called them from others. I usually separate BigQuery-related functions into a utility module, but since there were only a couple, I included them inside the class; I'm not sure I like them that way.
Another thing I don't like is how none of the functions take any arguments. Is that bad practice, or is this type of structuring fine?
I also feel like I have declared too many property variables and made the class cluttered.
```python
import os
import re
from datetime import datetime

import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

from settings import logger
# gbq_config and accruals_config are loaded elsewhere in the project


@logger.catch(reraise=True)
class AccrualTable:
    __credentials = service_account.Credentials.from_service_account_file(gbq_config["CREDENTIALS"])
    client = bigquery.Client(credentials=__credentials)

    def check_snapshot_eligibility(self):
        snapshot_date = ''
        eligible = False
        query_result = self.client.query(<sql>).to_dataframe()
        if not query_result.empty:
            eligible = True
            snapshot_date = query_result.iloc[0]['snapshot_date'].strftime('%Y_%m_%d')
        return eligible, snapshot_date

    def __init__(self, accrual_type):
        assert re.fullmatch(r'aa|ab', accrual_type, re.IGNORECASE), "Invalid accruals type. Should be 'aa' or 'ab'"
        self.accrual_type = accrual_type.upper()
        self.dataset_name = bigquery.DatasetReference(accruals_config["PROJECT_ID"], accruals_config["DATASET_ID"])
        self.table_name = self.dataset_name.table(accruals_config['TABLES'][self.accrual_type])
        self.full_table_name = '.'.join([self.table_name.project,
                                         self.table_name.dataset_id,
                                         self.table_name.table_id])
        self.snapshot_eligible, self.snapshot_date = self.check_snapshot_eligibility()
        self.snapshot_name = accruals_config['SNAPSHOT_FILENAME'].format(accruals_type=self.accrual_type,
                                                                         timestamp=self.snapshot_date)
        self.destination_uri = os.path.join(accruals_config["SNAPSHOT_BUCKET"],
                                            self.accrual_type,
                                            self.snapshot_name,
                                            '*')

    def extract_to_bucket(self):
        extract_job = self.client.extract_table(
            self.table_name,
            self.destination_uri
        )  # API request
        extract_job.result()  # Waits for job to complete.
        return 'Success'

    def build_snapshot_ddl(self) -> str:
        accruals_metadata = self.client.query(<query to extract ddl>).to_dataframe()
        # convert to external table with the snapshot name
        snapshot_ddl_partial = re.sub(f'CREATE TABLE `{self.full_table_name}`',
                                      f'CREATE EXTERNAL TABLE `{accruals_config["PROJECT_ID"]}.{accruals_config["DATASET_ID"]}.{self.snapshot_name}`',
                                      accruals_metadata.iloc[0]['ddl'])
        # remove semicolon before options
        snapshot_ddl_complete = re.sub(';', '', snapshot_ddl_partial) + f'''
        OPTIONS (
          format = 'CSV',
          uris = ['{self.destination_uri}']
        );
        '''
        return snapshot_ddl_complete

    def store_metadata(self):
        logger.info('storing snapshot metadata')
        metadata_record = pd.DataFrame({'backup_timestamp': [datetime.now()],
                                        'accrual_type': [self.accrual_type],
                                        'backup_location': [self.destination_uri],
                                        'table_schema': [self.build_snapshot_ddl()]})
        dest_table = accruals_config['ACCRUALS_SNAPSHOT_METADATA']
        logger.info(f'inserting contents of records dataframe to {dest_table}')
        job = self.client.load_table_from_dataframe(dataframe=metadata_record,
                                                    destination=dest_table)
        logger.debug(job.result())
        logger.info('done')
        return
```
I have a wrapper that calls this class and performs operations based on the flag:
```python
from settings import logger
from accrual_table import AccrualTable

logger.info('Beginning Accruals Snapshot Process')
for accrual in ['aa', 'ab']:
    logger.info(f'Begin {accrual}')
    accrual_table = AccrualTable(accrual)
    take_snapshot = accrual_table.extract_to_bucket()
    if take_snapshot == 'Success':
        accrual_table.store_metadata()
logger.info('Complete!')
```
2 Answers
```python
self.accrual_type = accrual_type.upper()
```
Looks like you're using strings to represent values from a finite set. Enums are better suited for this. The main advantage in your case is that it becomes impossible to pass a value that doesn't exist in the enumeration, while with strings (which are not restricted by anything) you can't be sure whether the passed value is a real `accrual_type` or not.
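For illustration, a minimal sketch of what that could look like (the `AccrualType` and `make_accrual_type` names are hypothetical, not from the original code):

```python
from enum import Enum

class AccrualType(Enum):
    AA = 'AA'
    AB = 'AB'

def make_accrual_type(raw: str) -> AccrualType:
    # the Enum constructor raises ValueError for anything outside the set,
    # replacing the assert-based check
    return AccrualType(raw.upper())
```

Anywhere the code currently compares strings, it can compare enum members instead, and an invalid input fails immediately at construction time.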
```python
return 'Success'
```
Returning hardcoded strings is an instant red flag: calling code doesn't know what to expect from such a method. In this case there is only one possible value that can be returned, which makes it even less meaningful: just don't return anything instead. If there is a chance of failure, a boolean value can be used (although it's still not ideal from a readability standpoint).
```python
def check_snapshot_eligibility(self):
    ...
    return eligible, snapshot_date
```
This method's name doesn't reflect its return value: you would expect such a method to return a boolean, but it also returns a second value that isn't mentioned anywhere. There are two good ways to solve this:

- Rename the method to `get_snapshot_date` and have it return `None` when the table isn't eligible, which can easily be accounted for.
- Split the method into two: one that checks eligibility, and another that obtains the snapshot date.
```python
self.snapshot_name = accruals_config['SNAPSHOT_FILENAME'].format(accruals_type=self.accrual_type,
                                                                 timestamp=self.snapshot_date)
```
This line will produce an incorrect name if the previous check failed and `snapshot_date` is empty.
```python
def build_snapshot_ddl(self) -> str:
```
This lonely type hint looks out of place. Since you're not using them everywhere (which would be totally fine), I don't see a reason why this particular return value needs a hint; it's clear from the code that it returns a string.
All the manual SQL-query machinations are very hard to comprehend and extremely error-prone. Consider using something like a query-building library to do the dirty work for you.
```python
logger.debug(job.result())
```
Don't put logic inside logging calls. You may delete this line without a second thought (it's just logging, after all) and then wonder why the code doesn't work any more. It may not be relevant in this particular case, but it has to be said anyway: good logging libraries may not even evaluate the arguments of logging calls in a production environment, which would, once again, break the code here.
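A minimal sketch of the fix: evaluate first, log afterwards. `FakeJob` here is a hypothetical stand-in for the BigQuery load job:

```python
import logging

logger = logging.getLogger(__name__)

def store_metadata_result(job):
    # run the job first, so the side effect survives even if the
    # logging line is later deleted or optimised away
    result = job.result()
    logger.debug(result)
    return result

class FakeJob:
    """Hypothetical stand-in for a BigQuery load job."""
    def result(self):
        return 'load complete'
```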
And another nitpick regarding logging: make sure it's clear what is being logged.
```python
logger.info('done')
```
In this case the person reading logs will have no idea what this 'done' stands for. A simple tag referencing the name of the class solves this problem.
As for your questions: methods with no arguments apart from `self` are completely fine; fewer is generally better. The number of properties doesn't matter as long as they are actually being used. You can get rid of `dataset_name`, since it's only used on the next line to declare another property. `snapshot_eligible` is not used anywhere, and `snapshot_date` is, once again, only used on the next line.
It's hard to say more without looking at where this class is being used externally.
Comment (mankand007, Apr 18, 2023): Thank you for the detailed critique. I'm actually using a wrapper where I use the `snapshot_eligible` flag. I'm trying to add it to the question.
What @QuasiStellar said! And...
```python
take_snapshot = accrual_table.extract_to_bucket()
if take_snapshot == 'Success':
    accrual_table.store_metadata()
```
design of Public API
Take care to expose as public just the methods and attributes that callers actually need and may rely on in future revs. Make all others private by prepending an `_` underscore. This makes it easier for folks to figure out your API and call into it. The smaller public surface area will make it easier to issue future releases that you are confident won't break existing client code.
In the three lines of code above, the middle one should just be deleted, as `raise` suffices to report lack of success.
Consider converting that pair of methods to private helpers, and exporting a new `.backup_table()` which extracts and then stores the metadata. If a client use case arises where extract-only is of interest, it's always trivial to expose it publicly again.
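As a sketch only (the method bodies here are stand-ins, not the real BigQuery calls), the refactor might look like:

```python
class AccrualTable:
    """Sketch of the suggested public surface; the bodies below are
    stand-ins for the real BigQuery calls."""

    def _extract_to_bucket(self) -> None:
        # real code: self.client.extract_table(...).result()
        self._extracted = True

    def _store_metadata(self) -> None:
        # real code: self.client.load_table_from_dataframe(...).result()
        self._metadata_stored = True

    def backup_table(self) -> None:
        # the single public entry point: extract, then record metadata;
        # any failure propagates as an exception instead of a status string
        self._extract_to_bucket()
        self._store_metadata()
```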
My prediction?
YAGNI!
Such simplification and hiding plays into your "cluttered" critique.
noun vs verb identifiers
Even if you kept the current three source lines, the `take_snapshot` identifier should still be renamed. Both "extract" and "store" are excellent action verbs, so they're a good fit for those method names. But getting back a Thing called "take" is odd. Better to call it a noun: `status` or `snapshot_status`.
comments suggest helpers
You felt a need to write this bit of English:

```python
# remove semicolon before options
```

and I thank you; I find it helpful. But maybe you'd like to delete the comment and write a trivial helper whose single responsibility is to `_remove_semicolon_before_options()`.
performance regression
You're using a logger instead of `print()`, and that's good! Among its advantages, I imagine it gives you timestamps "for free". Still, you might consider recording

```python
from time import time
...
t0 = time()
```

Then do a cloud operation, then log `elapsed = time() - t0`.
You might also log a row count.
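Putting the elapsed time and row count together, one possible sketch (with a hypothetical `operation` callable standing in for the cloud call):

```python
import logging
from time import time

logger = logging.getLogger(__name__)

def timed_operation(operation, label: str):
    # `operation` is a hypothetical callable that performs the cloud call
    # and returns a row count; the real code would wrap extract_table or
    # load_table_from_dataframe here
    t0 = time()
    row_count = operation()
    elapsed = time() - t0
    logger.info(f'{label}: {row_count} rows in {elapsed:.3f} s')
    return row_count, elapsed
```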
Here is the operational concern. As months go by, a table may grow unexpectedly large, or a cloud resource might respond unexpectedly slowly. The ability to conveniently graph historic row counts, elapsed times, and their ratio is invaluable for diagnosing and remediating sluggish performance. The ratio is called "app level throughput" and has units of row/sec.
A common scenario is to see a sudden jump in elapsed time starting on a certain day. Then we get to ask "what changed on that day?" Often a config change or a production release will be the culprit, and `git diff` will both reveal the details and guide maintainers who are testing a fix.