
Improve pit performance #1673


Open

PaleNeutron wants to merge 36 commits into microsoft:main from PaleNeutron:pit_fix

Conversation


@PaleNeutron commented Oct 20, 2023 (edited)

Description

see #1671

Consider PIT data: assume we have T trade days and N report_period records:

   date                 report_period  value
0  2011-10-18 00:00:00  201103         0.318919
1  2012-03-23 00:00:00  201104         0.4039
2  2012-04-11 00:00:00  201004         0.403925
3  2012-04-11 00:00:00  200904         0.403925

We access the PIT table in three ways:

1. Observe the latest data on each trade day

Just loop through the table and keep the latest reported value seen so far; this costs O(N).
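
A minimal sketch of that single pass, assuming the records sit in a pandas DataFrame shaped like the example table above and interpreting "latest" as the newest report_period's most recent revision (names and layout are illustrative, not the PR's implementation):

import pandas as pd

def observe_latest(pit_df: pd.DataFrame, trade_days: pd.DatetimeIndex) -> pd.Series:
    # pit_df columns: date (announcement time), report_period, value; sorted by date.
    out = {}
    best_period, best_value = None, None
    rows = iter(pit_df.itertuples(index=False))
    row = next(rows, None)
    for day in trade_days:
        # Consume every record announced up to this trade day and remember the
        # value of the newest report_period seen so far (later revisions win).
        while row is not None and row.date <= day:
            if best_period is None or row.report_period >= best_period:
                best_period, best_value = row.report_period, row.value
            row = next(rows, None)
        out[day] = best_value
    return pd.Series(out)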

2. Observe the latest several report_period records, for an expression like P(Mean($$roewa_q, 2))

Read the data file once.

  • Loop through the trade days, slicing data[:tradeday],
    • group by report_period and take the last item of each group,
    • return the last X items.

The algorithm could be improved by looping back from the end until X different periods are found, but groupby uses a C-level loop, which should be faster.
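
A rough sketch of the groupby variant, under the same assumed DataFrame layout (for P(Mean($$roewa_q, 2)) the caller would ask for x = 2; illustrative only, not the PR's code):

import pandas as pd

def last_periods(pit_df: pd.DataFrame, trade_day, x: int) -> pd.Series:
    # Keep only records announced on or before this trade day.
    visible = pit_df[pit_df["date"] <= trade_day]
    # For each report_period keep its latest revision, then return the newest x periods.
    latest_per_period = visible.groupby("report_period")["value"].last()
    return latest_per_period.tail(x)

With the example table and a trade day of 2012-04-12, x = 2 would return the latest values for periods 201103 and 201104.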

3. Observe a specific period from each trade day

Get all data belonging to the given period.
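
A sketch of the period query under the same assumptions (hypothetical helper; the forward fill expresses "the latest announced revision of that period as of each trade day"):

import pandas as pd

def observe_period(pit_df: pd.DataFrame, period: int, trade_days: pd.DatetimeIndex) -> pd.Series:
    # All revisions of the requested report_period, keyed by announcement date.
    revisions = pit_df[pit_df["report_period"] == period].groupby("date")["value"].last()
    # For each trade day, the observed value is the last revision announced so far.
    return revisions.reindex(trade_days, method="ffill")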

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under the upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

[screenshot of test results]

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@github-actions bot added the "waiting for triage" (Cannot auto-triage, wait for triage.) label Oct 20, 2023
@PaleNeutron (Author)

Can anyone fix the main branch? CI fails due to a problem on the main branch.

@Fivele-Li (Contributor) left a comment


It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

qlib/scripts/dump_pit.py

Lines 198 to 204 in 98f569e

if not overwrite and index_file.exists():
    with open(index_file, "rb") as fi:
        (first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
        n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
        if interval == self.INTERVAL_quarterly:
            n_years //= 4
        start_year = first_year + n_years

@PaleNeutron (Author) replied:

> It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?


The whole dump_pit.py should be rewritten since we implement FilePitStorage. The dump code should then look like:

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)
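
As a rough illustration of that interface (the exact record dtype, import path, and write semantics are defined by the FilePitStorage introduced in this PR; the field layout below is only an assumption based on the example table):

import numpy as np
# from qlib.data.storage.file_storage import FilePitStorage  # assumed import path

# Assumed record layout (date, period, value); the real dtype is whatever FilePitStorage expects.
np_data = np.array(
    [
        (20111018, 201103, 0.318919),
        (20120323, 201104, 0.4039),
    ],
    dtype=[("date", "<u4"), ("period", "<u4"), ("value", "<f8")],
)

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)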

@PaleNeutron (Author)
@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.

@CharlieChi

Current online update tools seem to be incompatible with these modifications; would you mind checking it out?

@PaleNeutron (Author)

@CharlieChi, which command failed? It has been a long time since this PR was created and I am not sure about the current workflow.

@CharlieChi commented Sep 3, 2024 (edited)

> @CharlieChi, which command failed? It has been a long time since this PR was created and I am not sure about the current workflow.

qlib/workflow/online/update.py
start_time_buffer = get_date_by_shift(
    self.last_end, -hist_ref + 1, clip_shift=False, freq=self.freq  # pylint: disable=E1130
)
start_time = get_date_by_shift(self.last_end, 1, freq=self.freq)
seg = {"test": (start_time, self.to_date)}
return self.rmdl.get_dataset(
    start_time=start_time_buffer, end_time=self.to_date, segments=seg, unprepared_dataset=unprepared_dataset
)

Here, when using a model with PIT features and updating predictions over a short time range (like one day), this dataset returns an empty DataFrame, while with a long time range (one year between start_time and end_time) it works fine.

@Abhijais4896 left a comment


def get_default_backend(self):
    backend = {}
    if hasattr(self, "provider_name"):
        provider_name = getattr(self, "provider_name")
    else:
        provider_name: str = re.findall("[A-Z][^A-Z]*", self.__class__.__name__)[-2]
    # set default storage class
    backend.setdefault("class", f"File{provider_name}Storage")
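
For context on the hasattr branch: the regex splits the provider class name on capital letters, which breaks for acronym-style names, so an explicit provider_name attribute is needed (the class names below are illustrative assumptions):

import re

pattern = "[A-Z][^A-Z]*"
print(re.findall(pattern, "LocalFeatureProvider")[-2])  # "Feature" -> FileFeatureStorage
print(re.findall(pattern, "LocalPITProvider")[-2])      # "T" -> "FileTStorage", which is wrong
# Setting an explicit provider_name attribute avoids relying on the class-name regex.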


Reviewers

@Abhijais4896 left review comments
@Fivele-Li left review comments


Labels

waiting for triage (Cannot auto-triage, wait for triage.)
