Improve pit performance #1673
Conversation
Can anyone fix the main branch? CI fails due to a main branch problem.
It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?
Lines 198 to 204 in 98f569e
if not overwrite and index_file.exists():
    with open(index_file, "rb") as fi:
        (first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
        n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
        if interval == self.INTERVAL_quarterly:
            n_years //= 4
        start_year = first_year + n_years
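For reference, the header arithmetic in that snippet can be exercised on a toy in-memory index file. The dtype formats and sizes below are assumptions for illustration, not qlib's actual constants:

```python
import io
import struct

# Assumed layout: a 4-byte unsigned int holding the first year,
# followed by fixed-size yearly index entries.
PERIOD_DTYPE = "I"
PERIOD_DTYPE_SIZE = struct.calcsize(PERIOD_DTYPE)
INDEX_DTYPE_SIZE = 4

buf = io.BytesIO()
buf.write(struct.pack(PERIOD_DTYPE, 2015))   # header: first year covered
buf.write(b"\x00" * (INDEX_DTYPE_SIZE * 5))  # five yearly entries follow
buf.seek(0)

# Same steps as the snippet above: read the header, then count entries.
(first_year,) = struct.unpack(PERIOD_DTYPE, buf.read(PERIOD_DTYPE_SIZE))
n_years = len(buf.read()) // INDEX_DTYPE_SIZE
start_year = first_year + n_years
print(first_year, n_years, start_year)  # -> 2015 5 2020
```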
The whole dump_pit.py should be rewritten since we implemented FilePitStorage. So the current dump code should look like:
s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)
@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.
Force-pushed from 702de78 to 194284b.
CharlieChi commented Sep 2, 2024
The current online update tools seem to be incompatible with these modifications, mind checking it out?
@CharlieChi, which command failed? It has been a long time since this PR was created, and I am not sure about the current workflow.
qlib/workflow/online/update.py
start_time_buffer = get_date_by_shift(
self.last_end, -hist_ref + 1, clip_shift=False, freq=self.freq # pylint: disable=E1130
)
start_time = get_date_by_shift(self.last_end, 1, freq=self.freq)
seg = {"test": (start_time, self.to_date)}
return self.rmdl.get_dataset(
start_time=start_time_buffer, end_time=self.to_date, segments=seg, unprepared_dataset=unprepared_dataset
)
Here, when using a model with PIT features and updating predictions over a short time range (like one day), the dataset returns an empty dataframe, while with a long time range (one year between start_time and end_time) it works fine.
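A minimal sketch of the symptom (the function and names below are hypothetical, not qlib's actual dataset code): a feature that needs several past rows is NaN during its warm-up, so a one-day test segment prepared without the history buffer from start_time_buffer comes out empty:

```python
import pandas as pd

def prepare_segment(data: pd.DataFrame, start: str, end: str, hist_ref: int) -> pd.DataFrame:
    """Toy dataset preparation: a feature needing `hist_ref` prior rows is
    NaN for the warm-up rows, and fully-NaN rows are dropped."""
    feat = data["value"].rolling(hist_ref).mean()
    return pd.DataFrame({"feat": feat}).loc[start:end].dropna()

idx = pd.date_range("2024-01-01", periods=10, freq="D")
data = pd.DataFrame({"value": range(10)}, index=idx)

# One-day range with no history before start_time -> empty frame
short = prepare_segment(data.loc["2024-01-05":], "2024-01-05", "2024-01-05", hist_ref=3)
# Long range that includes the warm-up rows -> works fine
long_seg = prepare_segment(data, "2024-01-05", "2024-01-09", hist_ref=3)
print(len(short), len(long_seg))  # -> 0 5
```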
@Abhijais4896 left a comment:
def get_default_backend(self):
    backend = {}
    if hasattr(self, "provider_name"):
        provider_name = getattr(self, "provider_name")
    else:
        provider_name: str = re.findall("[A-Z][^A-Z]*", self.__class__.__name__)[-2]
    # set default storage class
    backend.setdefault("class", f"File{provider_name}Storage")
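To see what that fallback regex does: "[A-Z][^A-Z]*" splits a CamelCase class name into words, and [-2] takes the second-to-last word as the provider name. For an all-caps word the split degrades, which is presumably why an explicit provider_name attribute takes precedence:

```python
import re

# Normal CamelCase: each capital starts a word, [-2] is the provider name.
words = re.findall("[A-Z][^A-Z]*", "LocalFeatureProvider")
print(words)        # -> ['Local', 'Feature', 'Provider']
print(words[-2])    # -> Feature

# All-caps run: every capital becomes its own "word", so [-2] is just 'T'.
print(re.findall("[A-Z][^A-Z]*", "LocalPITProvider")[-2])  # -> T
```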
Description
see #1671
Consider PIT data: assume we have T trade days and N report_period records. We access the PIT table in 3 ways:

1. Observe the latest data on each trade day: just loop through the table and keep only the latest report_date value; this consumes O(N).

2. Observe the latest several report_period records, for expressions like P(Mean($$roewa_q, 2)): read the data file once and keep the latest X items. The algorithm could be improved by looping back from the end until X different periods are found, but groupby uses a C-level loop, which should be faster.

3. Observe a specific period on each trade day: get all data belonging to the given period.
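The three access patterns can be sketched with a toy PIT table. The column names here are illustrative, not the on-disk format:

```python
import pandas as pd

# Toy PIT table: rows ordered by announcement date, as in the data file.
# Periods 202001/202002 each get a later revision.
pit = pd.DataFrame({
    "report_period": [202001, 202002, 202001, 202003, 202002],
    "value":         [0.10,   0.12,   0.11,   0.15,   0.13],
})

# Way 1: latest announced value per report_period, via groupby's
# C-level loop rather than a Python loop over N rows.
latest = pit.groupby("report_period")["value"].last()
print(latest[202001], latest[202002])  # -> 0.11 0.13

# Way 2: latest X=2 periods, e.g. for P(Mean($$roewa_q, 2))
mean2 = latest.iloc[-2:].mean()

# Way 3: all revisions belonging to one specific period
p2020q1 = pit[pit["report_period"] == 202001]["value"].tolist()
print(p2020q1)  # -> [0.1, 0.11]
```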
How Has This Been Tested?
Run pytest qlib/tests/test_all_pipeline.py under the parent directory of qlib.
Screenshots of Test Results (if appropriate):
Types of changes