
Improve pit performance #1673


Open

PaleNeutron wants to merge 36 commits into microsoft:main from PaleNeutron:pit_fix

Conversation


@PaleNeutron commented Oct 20, 2023 (edited)

Description

see #1671

Consider PIT data: assume we have T trade days and N report_period records:

   date                 report_period  value
0  2011-10-18 00:00:00  201103         0.318919
1  2012-03-23 00:00:00  201104         0.4039
2  2012-04-11 00:00:00  201004         0.403925
3  2012-04-11 00:00:00  200904         0.403925

We access the PIT table in three ways:

1. Observe the latest data on each trade day

Just loop through the table and keep the latest reported value seen so far; this costs O(N).
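
A minimal sketch of that single pass, assuming the records sit in a pandas DataFrame shaped like the example table above and interpreting "latest" as the newest report_period's most recent revision (names and layout are illustrative, not the PR's implementation):

import pandas as pd

def observe_latest(pit_df: pd.DataFrame, trade_days: pd.DatetimeIndex) -> pd.Series:
    # pit_df columns: date (announcement time), report_period, value; sorted by date.
    out = {}
    best_period, best_value = None, None
    rows = iter(pit_df.itertuples(index=False))
    row = next(rows, None)
    for day in trade_days:
        # Consume every record announced up to this trade day and remember the
        # value of the newest report_period seen so far (later revisions win).
        while row is not None and row.date <= day:
            if best_period is None or row.report_period >= best_period:
                best_period, best_value = row.report_period, row.value
            row = next(rows, None)
        out[day] = best_value
    return pd.Series(out)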

2. Observe the latest several report_period records, for an expression like P(Mean($$roewa_q, 2))

Read the data file once.

  • Loop through the trade days, slicing data[:tradeday],
    • group by report_period and take the last item of each group,
    • return the last X items.

The algorithm could be improved by looping back from the end until X different periods are found, but groupby uses a C-level loop, which should be faster.
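
A rough sketch of the groupby variant, under the same assumed DataFrame layout (for P(Mean($$roewa_q, 2)) the caller would ask for x = 2; illustrative only, not the PR's code):

import pandas as pd

def last_periods(pit_df: pd.DataFrame, trade_day, x: int) -> pd.Series:
    # Keep only records announced on or before this trade day.
    visible = pit_df[pit_df["date"] <= trade_day]
    # For each report_period keep its latest revision, then return the newest x periods.
    latest_per_period = visible.groupby("report_period")["value"].last()
    return latest_per_period.tail(x)

With the example table and a trade day of 2012-04-12, x = 2 would return the latest values for periods 201103 and 201104.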

3. Observe a specific period from each trade day

Get all data belonging to the given period.
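
A sketch of the period query under the same assumptions (hypothetical helper; the forward fill expresses "the latest announced revision of that period as of each trade day"):

import pandas as pd

def observe_period(pit_df: pd.DataFrame, period: int, trade_days: pd.DatetimeIndex) -> pd.Series:
    # All revisions of the requested report_period, keyed by announcement date.
    revisions = pit_df[pit_df["report_period"] == period].groupby("date")["value"].last()
    # For each trade day, the observed value is the last revision announced so far.
    return revisions.reindex(trade_days, method="ffill")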

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under the upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

[screenshot of test results]

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@github-actions bot added the "waiting for triage" (Cannot auto-triage, wait for triage.) label Oct 20, 2023
@PaleNeutron (Author)

Can anyone fix the main branch? CI fails due to a problem on the main branch.

@Fivele-Li (Contributor) left a comment


It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

qlib/scripts/dump_pit.py

Lines 198 to 204 in 98f569e

if not overwrite and index_file.exists():
    with open(index_file, "rb") as fi:
        (first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
        n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
        if interval == self.INTERVAL_quarterly:
            n_years //= 4
        start_year = first_year + n_years

@PaleNeutron (Author) replied:

> It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?


The whole dump_pit.py should be rewritten since we implement FilePitStorage. The dump code should then look like:

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)
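
As a rough illustration of that interface (the exact record dtype, import path, and write semantics are defined by the FilePitStorage introduced in this PR; the field layout below is only an assumption based on the example table):

import numpy as np
# from qlib.data.storage.file_storage import FilePitStorage  # assumed import path

# Assumed record layout (date, period, value); the real dtype is whatever FilePitStorage expects.
np_data = np.array(
    [
        (20111018, 201103, 0.318919),
        (20120323, 201104, 0.4039),
    ],
    dtype=[("date", "<u4"), ("period", "<u4"), ("value", "<f8")],
)

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)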

@PaleNeutron (Author)
@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.

@CharlieChi

Current online update tools seem to be incompatible with these modifications; would you mind checking it out?

@PaleNeutron (Author)

@CharlieChi, which command failed? It has been a long time since this PR was created and I am not sure about the current workflow.

@CharlieChi commented Sep 3, 2024 (edited)

> @CharlieChi, which command failed? It has been a long time since this PR was created and I am not sure about the current workflow.

qlib/workflow/online/update.py
start_time_buffer = get_date_by_shift(
    self.last_end, -hist_ref + 1, clip_shift=False, freq=self.freq  # pylint: disable=E1130
)
start_time = get_date_by_shift(self.last_end, 1, freq=self.freq)
seg = {"test": (start_time, self.to_date)}
return self.rmdl.get_dataset(
    start_time=start_time_buffer, end_time=self.to_date, segments=seg, unprepared_dataset=unprepared_dataset
)

Here, when using a model with PIT features and updating predictions over a short time range (like one day), this dataset returns an empty DataFrame, while with a long time range (one year between start_time and end_time) it works fine.

@Abhijais4896 left a comment


def get_default_backend(self):
    backend = {}
    if hasattr(self, "provider_name"):
        provider_name = getattr(self, "provider_name")
    else:
        provider_name: str = re.findall("[A-Z][^A-Z]*", self.__class__.__name__)[-2]
    # set default storage class
    backend.setdefault("class", f"File{provider_name}Storage")
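
For context on the hasattr branch: the regex splits the provider class name on capital letters, which breaks for acronym-style names, so an explicit provider_name attribute is needed (the class names below are illustrative assumptions):

import re

pattern = "[A-Z][^A-Z]*"
print(re.findall(pattern, "LocalFeatureProvider")[-2])  # "Feature" -> FileFeatureStorage
print(re.findall(pattern, "LocalPITProvider")[-2])      # "T" -> "FileTStorage", which is wrong
# Setting an explicit provider_name attribute avoids relying on the class-name regex.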


Reviewers

@Abhijais4896 left review comments
@Fivele-Li left review comments


Labels

waiting for triage (Cannot auto-triage, wait for triage.)
