Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct
From: Chanwoo Choi
Date: Wed Feb 24 2021 - 02:49:42 EST
On 2/16/21 7:41 PM, Lukasz Luba wrote:
>
Hi Chanwoo,
>
>
On 2/15/21 3:00 PM, Chanwoo Choi wrote:
>
> Hi Lukasz,
>
>
>
> On Fri, Feb 12, 2021 at 7:28 AM Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
>
>>
>
>>
>
>>
>
>> On 2/11/21 11:07 AM, Lukasz Luba wrote:
>
>>> Hi Chanwoo,
>
>>>
>
>>> On 2/3/21 10:21 AM, Lukasz Luba wrote:
>
>>>> Hi Chanwoo,
>
>>>>
>
>>>> Thank you for looking at this.
>
>>>>
>
>>>> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
>
>>>>> Hi Lukasz,
>
>>>>>
>
>>>>> When accessing the max_freq and min_freq at devfreq-cooling.c,
>
>>>>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
>
>>>>> instance,
>
>>>>> I think that the direct access of variables
>
>>>>> (lock/user_max_freq/user_min_freq)
>
>>>>> of struct devfreq are not good.
>
>>>>>
>
>>>>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
>
>>>>> notification with following changes of 'struct devfreq_freq'?
>
>>>>
>
>>>> I like the idea with devfreq notification. I will have to go through the
>
>>>> code to check that possibility.
>
>>>>
>
>>>>> Also, need to add codes into devfreq_set_target() for initializing
>
>>>>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
>
>>>>> notification.
>
>>>>>
>
>>>>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
>
>>>>> index 147a229056d2..d5726592d362 100644
>
>>>>> --- a/include/linux/devfreq.h
>
>>>>> +++ b/include/linux/devfreq.h
>
>>>>> @@ -207,6 +207,8 @@ struct devfreq {
>
>>>>> struct devfreq_freqs {
>
>>>>> unsigned long old;
>
>>>>> unsigned long new;
>
>>>>> + unsigned long new_max_freq;
>
>>>>> + unsigned long new_min_freq;
>
>>>>> };
>
>>>>>
>
>>>>>
>
>>>>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
>
>>>>> You can get the current max_freq/min_freq by using the following steps:
>
>>>>>
>
>>>>> get_freq_range(devfreq, &min_freq, &max_freq);
>
>>>>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
>
>>>>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>
>>>>>
>
>>>>> So that you can get the 'max_freq/min_freq' and then
>
>>>>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
>
>>>>> with them as following:
>
>>>>>
>
>>>>> in devfreq_set_target()
>
>>>>> get_freq_range(devfreq, &min_freq, &max_freq);
>
>>>>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
>
>>>>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>
>>>>> freqs.new_max_freq = min_freq;
>
>>>>> freqs.new_max_freq = max_freq;
>
>>>>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
>
>>>>
>
>>>> I will plumb it in and check that option. My concern is that function
>
>>>> get_freq_range() would give me the max_freq value from PM QoS, which
>
>>>> might be my thermal limit - lower that user_max_freq. Then I still
>
>>>> need
>
>>>>
>
>>>> I've been playing with PM QoS notifications because I thought it would
>
>>>> be possible to be notified in thermal for all new set values - even from
>
>>>> devfreq sysfs user max_freq write, which has value higher that the
>
>>>> current limit set by thermal governor. Unfortunately PM QoS doesn't
>
>>>> send that information by design. PM QoS also by desing won't allow
>
>>>> me to check first two limits in the plist - which would be thermal
>
>>>> and user sysfs max_freq.
>
>>>>
>
>>>> I will experiment with this notifications and share the results.
>
>>>> That you for your comments.
>
>>>
>
>>> I have experimented with your proposal. Unfortunately, the value stored
>
>>> in the pm_qos which is read by get_freq_range() is not the user max
>
>>> freq. It's the value from thermal devfreq cooling when that one is
>
>>> lower. Which is OK in the overall design, but not for my IPA use case.
>
>>>
>
>>> What comes to my mind is two options:
>
>>> 1) this patch proposal, with simple solution of two new variables
>
>>> protected by mutex, which would maintain user stored values
>
>>> 2) add a new notification chain in devfreq to notify about new
>
>>> user written value, to which devfreq cooling would register; that
>
>>> would allow devfreq cooling to get that value instantly and store
>
>>> locally
>
>>
>
>> 3) How about new define for existing notification chain:
>
>> #define DEVFREQ_USER_CHANGE (2)
>
>
>
> I think that if we add the notification with specific actor like user change
>
> or OPP change or others, it is not proper. But, we can add the notification
>
> for min or max frequency change timing. Because the devfreq already has
>
> the notification for current frequency like DEVFREQ_PRECHANGE,
>
> DEVFREQ_POSTCHANGE.
>
>
>
> Maybe, we can add the following notification for min/max_freq.
>
> The following min_freq and max_freq values will be calculated by
>
> get_freq_range().
>
> DEVFREQ_MIN_FREQ_PRECHANGE
>
> DEVFREQ_MIN_FREQ_POSTCHANGE
>
> DEVFREQ_MAX_FREQ_PRECHANGE
>
> DEVFREQ_MAX_FREQ_POSTCHANGE
>
>
Would it be then possible to pass the user max freq value written via
>
sysfs? Something like in the example below, when writing into max sysfs:
>
>
1) starting in max_freq_store()
>
freqs.new_max_freq = max_freq;
>
devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_PRECHANGE);
>
dev_pm_qos_update_request()
When we use the PRECHANGE and POSTCHANGE notification,
we should keep the consistent value.
When PRECHANGE, notify the previous min/max frequency
containing the user input/cooling policy/OPP.
When POSTCHANGE, notify the new min/max frequency
containing the user input/cooling policy/OPP.
But, in case of your suggestion, DEVFREQ_MAX_FREQ_PRECHANGE considers
only user input without cooling policy/opp.
>
>
2)then after a while in devfreq_set_target()
>
get_freq_range(devfreq, &min_freq, &max_freq);
>
dev_pm_opp_find_freq_floor(pdev, &min_freq);
>
dev_pm_opp_find_freq_floor(pdev, &max_freq);
>
freqs.new_min_freq = min_freq;
>
freqs.new_max_freq = max_freq;
>
devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_POSTCHANGE);
>
>
This 2nd part is called after the PM QoS has changed that limit,
>
so might be missing (in case value was higher that current),
>
but thermal would know about that, so no worries.
It doesn't focus on only thermal. We need to consider
all potential user of max_freq notification.
In the devfreq subsystem like devfreq governor,
we might use the user min/max_freq without any restrictions.
But, in this case, devfreq provides the min/max_freq
to outside subsystem/drivers like devfreq-cooling.c of thermal.
IMHO, it is difficult to agree this approach.
If devfreq provides the various min/max_freq value to outside
of devfreq, it makes the confusion to understand the meaning
of min/max_freq. Actually, the other user doesn't need to
know the user input for min/max_freq.
>
>
>
>
>
>
>>
>
>> Then a modified devfreq_notify_transition() would get:
>
>> @@ -339,6 +339,10 @@ static int devfreq_notify_transition(struct devfreq
>
>> *devfreq,
>
>>
>
>> srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>
>> DEVFREQ_POSTCHANGE, freqs);
>
>> break;
>
>> + case DEVFREQ_USER_CHANGE:
>
>> + srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>
>> + DEVFREQ_USER_CHANGE, freqs);
>
>> + break;
>
>> default:
>
>> return -EINVAL;
>
>> }
>
>>
>
>> If that is present, I can plumb your suggestion with:
>
>> struct devfreq_freq {
>
>> + unsigned long new_max_freq;
>
>> + unsigned long new_min_freq;
>
>>
>
>> and populate them with values in the max_freq_store() by adding at the
>
>> end:
>
>>
>
>> freqs.new_max_freq = max_freq;
>
>> mutex_lock();
>
>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_USER_CHANGE);
>
>> mutex_unlock();
>
>>
>
>> I would handle this notification in devfreq cooling and keep the
>
>> value there, for future IPA checks.
>
>>
>
>> If you agree, I can send next version of the patch set.
>
>>
>
>>>
>
>>> What do you think Chanwoo?
>
>
>
> I thought that your suggestion to expose the user input for min/max_freq.
>
> But, these values are not valid for the public user. Actually, the devfreq core
>
> handles these values only internally without any explicit access from outside.
>
>
>
> I'm not sure that it is right or not to expose the internal value of
>
> devfreq struct.
>
> Until now, I think that it is not proper to show the interval value outside.
>
>
>
> Because the devfreq subsystem only provides the min_freq and max_freq
>
> which reflect the all requirement of user input/cooling policy/OPP
>
> instead of user_min_freq, user_max_freq.
>
>
>
> If we provide the user_min_freq, user_max_freq via DEVFREQ notification,
>
> we have to make the new sysfs attributes for user_min_freq and user_max_freq
>
> to show the value to the user. But, it seems that it is not nice.
>
>
I would say we don't have to expose it. Let's take a closer look into
>
an example. The main problem is with GPUs. The middleware is aware of
>
the OPPs in the GPU. If the middleware wants to switch into different
>
power-performance mode e.g. power-saving, it writes into this sysfs
>
the max allowed freq. IPA does not know about it and makes wrong
>
decisions. As you said, the sysfs read operation combines all:
>
user input/cooling policy/OPP, but that's not a problem for this aware
>
middleware. So it can stay as is.
>
The only addition would be this 'notification about user attempt of
>
reducing max device speed' internally inside the kernel, for those
>
subsystems which are interested in it.
As I commented on above, I'm not sure to provide the multiple
min/max_freq to outside of devfreq subsytem. Instead, it is ok
to use user min/max_freq inside of devfreq subsystem.
Unfortunately, I didn't suggests the good solution.
It is very important changes. So that I want to consider
the all users of devfreq.
>
>
>
>
> Actually, I have no other idea how to support your feature.
>
> We try to find the more proper method.
>
>
>
>
Thank you for coming back with your comments. I know it's not
>
an easy feature but I hope we can find a solution.
>
>
Regards,
>
Lukasz
>
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics