Codeberg/Community

[RfC] Reconsidering OSI license approval in Terms of Use #1654

Open
opened 2024-09-21 11:39:03 +02:00 by mikolaj · 28 comments

Comment

Background

As of today, Codeberg's [Terms of Use](https://codeberg.org/Codeberg/org/src/branch/main/TermsOfUse.md) (ToU) require repository contents to be licensed under a license approved by the [Free Software Foundation](https://www.fsf.org/) (FSF) or the [Open Source Initiative](https://opensource.org/) (OSI), as stated in the following two clauses (bolding mine):

> § 1 (2) Our service is open for all projects covered by a free software or open source licence, as defined by either the Free Software Foundation (FSF) or the **Open Source Initiative (OSI)**.

and

> § 2 (1) Repository content shall be licensed under an open-source license approved by the Free Software Foundation ([see list of the FSF](https://www.gnu.org/licenses/license-list.html)) or the **Open Source Initiative** ([see list of the OSI](https://opensource.org/licenses/)).
> (...)

The OSI is currently working on a new definition of open source to be applied to "artificial intelligence" systems, which the OSI calls the "[Open Source AI Definition](https://opensource.org/deepdive)" (OSAID) and intends to announce in late October. However, the OSAID appears to differ significantly in spirit from the OSI's original [Open Source Definition](https://opensource.org/osd) (OSD): it allows the dataset used to train a system (and thus generally necessary to replicate it) to be proprietary, as the [checklist](https://opensource.org/deepdive/drafts/the-open-source-ai-definition-checklist-draft-v-0-0-9) attached to the latest OSAID draft permits publishing a research paper, a technical report, or a data card instead.

The OSI is explicit in its intention of not requiring training datasets to be open, having [stated](https://discuss.opensource.org/t/draft-v-0-0-9-of-the-open-source-ai-definition-is-available-for-comments/513) that (bolding mine)

> The role of training data is one of the most hotly debated parts of the definition. After long deliberation and co-design sessions we have concluded that defining **training data as a benefit, not a requirement, is the best way to go**.

You may learn more about this controversy from a recent [opinion piece](https://www.theregister.com/2024/09/14/opinion_column_osi/) by Steven J. Vaughan-Nichols published in The Register.

Request for comments (RfC)

In response to this news, I would like to ask the Codeberg community the following question:

**Should Codeberg update its ToU if the OSI publishes the OSAID without a requirement for training datasets to be open? If so, what should be changed?**

Author

I have two suggestions for changes to the ToU as a starting point:

  1. Clarify that "open source license" only refers to what is approved under the OSD and that OSAID is not relevant to this criterion.
  2. Require repository content licenses to be approved by both the FSF and the OSI, not by merely one of them as is the case now.

Switching to "both FSF and OSI" sounds troublesome: There is unfortunately still a growing list of (often barely thought through but still used) licenses, and it's hard enough to get either seal of approval on them, let alone both. (Sometimes the first / only party to make a decision is Debian, whose DFSG free criteria would also make a nice inclusion to our ToS, but their outcomes tend to be more case-by-case than FSF and OSI's lists of licenses).

On the OSAID front, I do hope that the OSI keeps a clear distinction between what they call OSI approved licenses and Open Source AI. Unless they start mixing those, the current ToU seems clear: the requirement that projects use "an open-source license approved by [... or] the OSI" does *not* allow using "OSAI" or anything else that the OSI happens to endorse that is not an open-source license. A rewording may make that more visible, and I wouldn't consider that more than an editorial change.

If they *do* start mixing those terms, things are different, but I hope they are more professional than that.


I agree that we should exclude proprietary works (incl. proprietary training data) from being hosted on Codeberg.

Requiring repo licenses to be approved by both FSF and OSI, though, would mean excluding widely used licenses that are currently approved by only one of the two (e.g. 0BSD, see https://spdx.org/licenses/).


Meh.

Source code is defined as the preferred form used to create/modify/!?! something. By that definition alone the availability of training data is a requirement we shouldn't deviate from.

I really don't like the "both FSF and OSI" idea. The FSF has a somewhat skewed idea of what free software even means IMHO (as in "a ROM you can't update is OK while the exact same code as uploadable firmware is not").

I'd prefer to simply clarify that (until further notice) the OSAID is not relevant for us.


Thanks for starting this discussion!

👍 to what most people have said: I’d also be in favour of clarifying that the OSAI definitions aren’t used at Codeberg - but rather the code specific ones.

And I’d also be cautious of trying to require both FSF & OSI - as the overlap isn’t perfect, which could have unintended consequences!


I agree with the above comments.

This feels very much like an attempt to blur the lines between ethically and unethically trained models, and to effectively allow more models to be eligible to slap on an OSS badge to gain legitimacy.

I would argue that if the source of the training data isn't disclosed, then the source of its ability to function for its intended purpose is effectively proprietary/closed source.

If this were any other kind of software, it wouldn't be happening, so why should models get an exception to the requirements that all other software has to meet?

As for the best course of action, I echo the above comments that however this is handled, there needs to be care taken to ensure that it doesn't cause unintended consequences.


For the OSAID issue, I think it is clear that we must state explicitly that we do not mean the OSAID when we refer to the OSD. For the "FSF and OSI" vs. "FSF or OSI" issue, I honestly don't know how sensible it is to just blindly trust some controversial outside institutions that, of course, are also not able to think about every license that is out there. As a sort of compromise solution I would propose that stuff has to be approved by at least two of the following:

  • FSF
  • OSI's OSD
  • Debian
  • Codeberg e.V.'s presidium (in exceptional cases)
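
To make the "at least two of" rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the body identifiers, the license names (`License-A` etc.), and the approval data are placeholders, not real verdicts from the FSF, the OSI, Debian, or Codeberg e.V.

```python
from typing import Dict, Set

# The four approving bodies from the list above (identifiers are made up).
BODIES: Set[str] = {"FSF", "OSI-OSD", "Debian", "Codeberg-presidium"}

# Hypothetical approval data; placeholder licenses, not real verdicts.
APPROVALS: Dict[str, Set[str]] = {
    "License-A": {"FSF", "OSI-OSD", "Debian"},
    "License-B": {"OSI-OSD"},
    "License-C": {"Debian", "Codeberg-presidium"},
}

QUORUM = 2  # "approved by at least two of the following"


def is_allowed(license_id: str,
               approvals: Dict[str, Set[str]] = APPROVALS,
               quorum: int = QUORUM) -> bool:
    """Return True if at least `quorum` recognised bodies approve the license."""
    approving = approvals.get(license_id, set()) & BODIES
    return len(approving) >= quorum
```

With this placeholder data, `License-A` and `License-C` would pass and `License-B` would not. A check like this would also make it easy to test which licenses would stop being allowed under different quorum values.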

Also I'm not entirely sure whether we even should allow non-ideal licenses (like the 0BSD as mentioned previously), because a) I think we should incentivize everyone to use good and well-known/understood licenses and b) it's not like there is no alternative or that we're GitHub. Everyone can use something else if we're too restrictive for them (as for legacy projects that for historical reasons are stuck with some bad license, like TeX or TrueCrypt) and we do have a mission to promote free software (which I would consider to include promotion of higher standards for what constitutes free software).


I don't think we should rely on definitions religiously, neither the FSF's nor the OSI's (personally I also believe CC's non-commercial licenses are fine, even though they contradict both the FSF and OSI, but that's a discussion for another day).

I'd say @Smurf's suggestion to disregard the OSAID for now is a good way to go.

Personally, the OSI's plan to not require training data to be open makes no sense to me.


I am generally very happy that the discussion is constructive so far and seems to contain a large set of different perspectives on the subject. Thank you, @mikolaj!

> Personally, the OSI's plan to not require training data to be open makes no sense to me.

> This feels very much an attempt to blur the lines between ethically and unethically trained models, and to effectively allow more models to be eligible to slap on an OSS badge to gain legitimacy.

+1. Even if datasets are "not source code", they impact the functionality under which a specific work operates, presumably in a way that the author of the said work *can* control. We're not talking about firmware blobs.

We should probably disregard the policy for now.

> As a sort of compromise solution I would propose that stuff has to be approved by at least two of the following:

Regardless of what we choose to do, this proposed policy is extremely interesting: that way, we would not have to depend on a single point of failure / "gatekeeper". If the others here think that this policy sounds like a good idea, perhaps we could try to flesh it out a little and investigate e.g. whether any licenses would stop being allowed on Codeberg if such a change were to take place (or if e.g. Debian approves any licenses that we may not like).


> if e.g. Debian approves any licenses that we may not like

It is my impression that in terms of "Free Software purity", Debian is the strictest of the three. Not only do they apply their [list of criteria](https://www.debian.org/social_contract#guidelines), they also apply a few [extra](https://wiki.debian.org/DissidentTest) [tests](https://wiki.debian.org/DesertIslandTest) for whether something is Free Software.

The downside of Debian as a source is that they reason about packages, not licenses. The most prominent example thereof is the [LaTeX license](https://en.wikipedia.org/wiki/LaTeX_Project_Public_License#cite_ref-2), an older version of which made strict requirements on files being renamed on modification: AFAICT that was only accepted for the LaTeX packages because there was some mechanism in place that allowed dealing with such renames without much of a hassle. (It may actually be aligned more with the Codeberg project to do the same, because hosting binary blobs that are declared Apache-2.0, as seen in ESP32's SDK, is probably contrary to the spirit, but that's a different concern).

If we go for a quorum regulation (such as "at least 2 of OSI, FSF, Debian and executive exception"), I'd appreciate Debian being one source of definitions, with the second party acting as a check against corner cases. A phrasing for that could be:

> We consider a license as "accepted by Debian" if a file that is not a license itself and that is part of Debian's main repository is either licensed exclusively under that license, or the license is part of its licensing options when all alternative licenses are clearly not Free Software licenses.
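
As a rough illustration of how the phrasing above could be read as a check, here is a sketch in Python. The `DebianFile` record, its fields, and the `clearly_non_free` set are hypothetical; Debian publishes no data in this shape, so this only spells out the logic of the sentence.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class DebianFile:
    """Hypothetical record for one file in Debian's main repository."""
    path: str
    is_license_text: bool             # True if the file is itself a license
    license_options: FrozenSet[str]   # licenses the file may be used under


def accepted_by_debian(license_id: str,
                       main_files: Iterable[DebianFile],
                       clearly_non_free: FrozenSet[str]) -> bool:
    """Apply the proposed wording to a set of files from Debian main."""
    for f in main_files:
        if f.is_license_text or license_id not in f.license_options:
            continue
        alternatives = f.license_options - {license_id}
        # No alternatives means "licensed exclusively under that license";
        # otherwise every alternative must be clearly not Free Software.
        if alternatives <= clearly_non_free:
            return True
    return False
```

Under this reading, a dual-licensed file only counts as evidence for one of its licenses if the other option is clearly non-free, which is what keeps a questionable license from riding along with a well-known one.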

(Debian also has a list of generally accepted licenses at https://www.debian.org/legal/licenses/ and an even smaller set in /usr/share/common-licenses, but those are all just the common ones, which are probably all FSF and OSI cleared anyway.)


> It is my impression that in terms of "Free Software purity", Debian is the strictest of the three.

For the record, I was only familiar with how Debian was strict, but not *strictest of the three* strict. Thank you for clarifying!


What about data that is in the Public Domain? While not really relevant for software, it definitely is for data. For example, if you create a repo of UN Resolutions or of Shakespeare's or Victor Hugo's works, e.g. for testing purposes, they wouldn't be under any license at all. (I have done something similar, not as an entire repo but as some files in a repo, here: https://codeberg.org/Elshid/LetterDistributionViewer/src/branch/main/data)


> What about data that is in the Public Domain?

I am not sure what gave the impression that data in the Public Domain (disregarding the fact that the definition varies from country to country) would cause a problem or otherwise be controversial, given that it is possible to inspect these works and use them freely as you wish. I guess I'd say that attribution would help, so that another user can understand how you trained something. It would give them the ability to reconstruct something that you tried doing.


I don't think that it is controversial, it's just that these files wouldn't comply with §2(1), as no copyright applies to them at all.
Therefore, it may be a good idea to clarify this in a future version of the TOS.

Author

The ToU says "Repository content shall be licensed under an open-source license (...)", but public domain is not a license. Even though public domain is [present](https://www.gnu.org/licenses/license-list.en.html#PublicDomain) on the FSF's license list as a special case ([but not on OSI's](https://opensource.org/faq#public-domain)), this probably should be clarified.


> Therefore, it may be a good idea to clarify this in a future version of the TOS.

Well, I agree with you just on general principle.

However, in (legal as well as factual) practice, one non-negotiable prerequisite for having a problem with a copyright or license is the very existence of a copyright owner / license holder – and for PD works there is no such entity.


Well, if you only allow things under a certain license and stuff isn't under that license, then it's not allowed, even if the reason is that it doesn't fall under copyright protection at all. Of course, I guess mods won't do anything against, let's say, a collection of Victor Hugo works, at least not because of the license. But still, I think that should be clarified.


OK, so, to recap, we have the following suggestions:

  1. Change how we approve licenses
  2. Issue clarification for public domain material

The AI problem is not something that should concern us now, but we could do those two things in the meantime.


> The downside of Debian as a source is that they reason about packages, not licenses. The most prominent example thereof is the LaTeX license an older version of which made strict requirements on files being renamed on modification: AFAICT that was only accepted for the LaTeX packages because there was some mechanism in place that allowed dealing with such renames without much of a hassle.

I don't know all the history here, but even the current version requires making it obvious it is not the original etc. (I don't personally object to this in the (La)TeX case and I typically licence anything I write under LPPL, but I wouldn't use the license for non-TeX software, for sure.)

> A phrasing for that could be:
>
> > We consider a license as "accepted by Debian" if a file that is not a license itself and that is part of Debian's main repository is either licensed exclusively under that license, or the license is part of its licensing options when all alternative license are clearly not Free Software licenses.

I would be a little bit cautious about this. It is very likely there are some files in Debian's repos which shouldn't be there, however careful they are, because there are a vast number of files and human beings miss things. I would be surprised if every TeX file they distribute has a licence Codeberg would be comfortable with. I did some work on licensing for TeX Live some years back and managed to get some of the problematic files licensed properly, but I certainly did not solve the problem and I doubt others have managed to in the meantime. These days, CTAN won't take packages without a clear licence and TeX Live won't take software without a free licence acceptable to Debian, Fedora etc. But there are a lot of files which were accepted earlier and where the situation is likely to remain extremely unclear.

Author

I've opened a [pull request](https://codeberg.org/Codeberg/org/pulls/64) to accept public domain content in the ToU.

Owner

The ToU should also explicitly allow non-software free content under the CC ‘free culture’ licences. (I can make a separate issue for that if needed.)


Excuse me for an ignorant question, but how does this affect private repositories?

Author

@AdamWysokinski

> Excuse me for an ignorant question, but how does this affect private repositories?

It doesn't appear that there will be any narrowing of what is currently (pre-OSAID) allowed to be hosted on Codeberg, so I don't think that currently existing private repositories will be affected in any way.

If you mean future private repositories, then § 2 (2) of the ToU should apply:

> Private repositories are only allowed for things required for FLOSS projects, like storing secrets, team-internal discussions or hiding projects from the public until they're ready for usage and/or contribution. They are also allowed for really small & personal stuff like your journal, config files, ideas or notes, but explicitly not as a personal cloud or media storage.

This means that if the ToU is updated to exclude OSAID from its definition of FLOSS, all else being equal, then this exclusion will apply to both public and private repositories.


The simplest and safest approach for the community right now is to explicitly do nothing: state clearly that, until community consensus says otherwise, Open Source is as defined in OSD 1.9.

Several of us have been working on this and have already frozen the important CC-BY licensed artefacts from the OSI at this point in time, to prevent another attempted fork when they return from a long vacation and decide to "harmonize" the OSAID with the OSD (or, more likely, vice versa). These are hosted at https://opensourcedefinition.org, and we've also tracked down several prior versions and built a repo at https://opensourcedefinition.org/osd.git, which I've mirrored here on Codeberg: https://codeberg.org/osd/osd

Several of us have just started working on a self-signing Save Open Source (SOS) statement, like the various RMS letters, and would be happy to have help on that today for tomorrow. I expect it will live here, but we'll want to reach the Open Source communities on GitHub, Hugging Face, etc. too: https://codeberg.org/osd/sos

The OSD is proven on "openness" but weak (or at least implicit) on "completeness" (the OSAID is neither). Frameworks like the OSAID checklist and Linux Foundation's Model Openness Framework (MOF) attempt to address completeness, but both completely fail at openness; the former doesn't even require the data and the latter (Class I) allows it under "any license or unlicensed".

Per Bruce Perens, "the training data IS the source code" and you just need to apply the Open Source Definition (OSD) to both the source code and the data. While the default and safest option is to lock down and declare OSD 1.9 as the authoritative source for Open Source (wherever it's hosted/replicated), should consensus be achieved it would be possible to solve the completeness problem with a single sentence addition to the intro, as proposed in the WIP: https://opensourcedefinition.org/wip

Please join the discussions at https://discuss.opensourcedefinition.org - an uncensored version of discuss.os.o


Shouldn't this question involve actual data?

What would we lose by rejecting OSAID? Are there any existing projects on Codeberg that would have to be purged? Would any projects that simply make use of an LLM have their openness questioned? What do we lose and what do we gain?

It's difficult because, as far as I know, there are no "open" LLMs; I don't even know of any open datasets substantial enough to train an LLM. And even then, good luck actually training a model from scratch. It's certainly unfortunate that there are no truly open models, but at the same time, people are using these algorithms.

Personally, I'm in favor of just waiting. We should probably wait until actual open models start popping up before we decide whether to discriminate against closed ones. I think that's the entire point behind the decisions made for OSAID. If nothing else, we could simply create a `BLOB` badge to mark repositories that contain binary blobs or models with undisclosed training data.

Author

**Update**.

Turns out that someone in Debian has already been working on a [policy draft](https://salsa.debian.org/deeplearning-team/ml-policy) for ML models. The last update in the repository was 7 months ago, however.

LWN.net wrote an [article](https://lwn.net/SubscriberLink/995159/fb948a90f9c42339/) summarizing the most recent activities around the OSAID, which I recommend to everyone interested. It covers an [announcement](https://www.fsf.org/news/fsf-is-working-on-freedom-in-machine-learning-applications) made by the Free Software Foundation, an [aspirational statement](https://sfconservancy.org/news/2024/oct/25/aspirational-on-llm-generative-ai-programming/) issued by the Software Freedom Conservancy, some more opinions of people both opposing and supporting the OSAID, and a response by the OSI's executive director, Stefano Maffulli, to the editor's request for comment.

Steven J. Vaughan-Nichols has published another [opinion piece](https://www.theregister.com/2024/10/25/opinion_open_washing/) criticizing the OSAID in The Register.

The OSI too has published a few articles on its [blog](https://opensource.org/), which you may want to read to better understand their way of thinking.

Indeed, I'm a DD (Debian Developer) too and the OSI was founded by DDs, so it's not surprising we're trying to right wrongs. lumin@'s ML-Policy refers to much of what's covered by OSAID as "ToxicCandy", where they show you how to train a model but don't give you the data you need to do it. If I had the training code for Llama and ran it on texts with my mother, for example, I would not end up with Llama.

There's a lot of mental masturbation over whether source code is a perfect analogy for training data, and even "preferred form" has proven surprisingly subjective, but you can still be objective by looking to the [free software definition](https://www.gnu.org/philosophy/free-sw.html#make-changes), which further clarifies that what the developer _actually changes_ is the source code; too bad that didn't go in the definition:

> Source code is defined as the preferred form of the program for making changes in. Thus, **whatever form a developer changes to develop the program** is the source code of that developer's version.

While we're here, there's also the idea that being able to make _any_ changes (i.e., fine-tuning) is equivalent to being able to make _all_ changes (e.g., re-training, re-architecting), as required by the four freedoms:

> Whether a change constitutes an improvement is a subjective matter. If your right to modify a program is limited, in substance, to changes that someone else considers an improvement, that program is not free.

If we look at what a[n original or downstream] developer _actually changes_ to make any/all changes to an AI system, it's the training data, so that's what needs to be made available. If we're to compromise at all, then it's on "public" datasets like Common Crawl (for which even they don't have a license to be able to re-license to you), but doing that would only be because they've managed to shift the Overton window of Open Source from open data to literally any data you can get your hands on.

@mikolaj

Thank you for the clarification.
