Generative AI/ML models, such as those used for text generation, image synthesis, and other creative tasks, rely on inference parameters that control model behavior, such as temperature, Top P, and Top K. Together with training-time hyperparameters such as the learning rate, these parameters shape the model's internal decision-making and the probability distributions from which outputs are sampled. Incorrect settings can lead to unusual behavior such as text "hallucinations," unrealistic images, or failure to converge during training. Such misconfigurations can compromise the integrity of the application. If the results are used in security-critical operations or decisions, then this could violate the intended security policy, i.e., introduce a vulnerability.
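As a minimal illustration of how one such parameter works, temperature rescales the logits before the softmax that produces next-token probabilities. The sketch below uses made-up logits to show how a higher temperature flattens the distribution, giving unlikely tokens a better chance of being sampled:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T).
    Higher T flattens the distribution; lower T sharpens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits, for illustration only.
logits = [4.0, 2.0, 0.5]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
# At T=0.2 the top token dominates almost completely; at T=2.0 the
# lower-probability tokens receive substantially more probability mass.
```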
Common Consequences
| Scope | Impact | Details |
|---|---|---|
| Integrity; Other | Varies by Context; Unexpected State | The product can generate inaccurate, misleading, or nonsensical information. |
| Other | Alter Execution Logic; Unexpected State; Varies by Context | If outputs are used in critical decision-making processes, errors could be propagated to other systems or components. |
Potential Mitigations
| Phase(s) | Mitigation |
|---|---|
| Implementation; System Configuration; Operation | Develop and adhere to robust parameter tuning processes that include extensive testing and validation. |
| Implementation; System Configuration; Operation | Implement feedback mechanisms to continuously assess and adjust model performance. |
| Documentation | Provide comprehensive documentation and guidelines for parameter settings to ensure consistent and accurate model behavior. |
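As one possible shape for the tuning-and-validation process above, the sketch below gates candidate parameter settings behind an empirical acceptance test. The evaluation function, scores, and threshold are hypothetical placeholders, not part of this entry:

```python
# Hypothetical validation gate for proposed inference-parameter settings.
CANDIDATE_SETTINGS = [
    {"temperature": 0.2},
    {"temperature": 0.7},
    {"temperature": 1.0},
]

def evaluate_quality(settings) -> float:
    """Placeholder: a real harness would run a fixed benchmark of prompts
    under `settings` and score outputs (accuracy, hallucination rate, etc.).
    Simulated scores are used here so the sketch runs end to end."""
    simulated = {0.2: 0.93, 0.7: 0.95, 1.0: 0.88}
    return simulated[settings["temperature"]]

def select_validated_settings(candidates, threshold=0.9):
    """Reject any setting that fails the empirical bar; among those
    that pass, return the best-scoring one."""
    scored = [(evaluate_quality(s), s) for s in candidates]
    passing = [pair for pair in scored if pair[0] >= threshold]
    if not passing:
        raise RuntimeError("no candidate settings passed validation")
    return max(passing, key=lambda pair: pair[0])[1]

best = select_validated_settings(CANDIDATE_SETTINGS)  # -> {"temperature": 0.7}
```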
Relationships
| Nature | Type | ID | Name |
|---|---|---|---|
| ChildOf | Base | 440 | Expected Behavior Violation |
| ChildOf | Class | 665 | Improper Initialization |
| PeerOf | Pillar | 691 | Insufficient Control Flow Management |
| CanPrecede | Class | 684 | Incorrect Provision of Specified Functionality |
Modes of Introduction
| Phase | Note |
|---|---|
| Build and Compilation | During model training, hyperparameters may be set without adequate validation or understanding of their impact. |
| Installation | During deployment, model parameters may be adjusted to optimize performance without comprehensive testing. |
| Patching and Maintenance | Updates or modifications may be made to the model that alter its behavior without thorough re-evaluation. |
Applicable Platforms
Languages: Class: Not Language-Specific (Undetermined Prevalence)
Architectures: Class: Not Architecture-Specific (Undetermined Prevalence)
Technologies: AI/ML (Undetermined Prevalence); Class: Not Technology-Specific (Undetermined Prevalence)
Example 1
Assume the product offers an LLM-based AI coding assistant to help users write code within an Integrated Development Environment (IDE). Assume the model has been trained on real-world code and behaves normally under its default settings. Suppose the temperature defaults to 1, with a valid range from 0 (most deterministic) to 2.
Consider the following configuration.
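The sketch below is hypothetical; the parameter names and values are illustrative rather than taken from any specific vendor API:

```python
# Hypothetical inference configuration for the IDE's coding assistant.
# All names and values here are illustrative only.
ASSISTANT_CONFIG = {
    "model": "code-assistant-v1",  # assumed model identifier
    "temperature": 1.8,            # well above the default of 1 (range: 0-2)
    "top_p": 1.0,
    "max_output_tokens": 512,
}
```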
The problem is that the configuration sets the temperature hyperparameter higher than the default. This significantly increases the likelihood that the LLM will suggest a package that did not exist at training time, a behavior sometimes referred to as "package hallucination." Note that a higher temperature can produce other problematic behaviors as well, not just package hallucination.
An adversary could anticipate which package names are likely to be generated and publish a malicious package under such a name. For example, it has been observed that the same LLM may hallucinate the same package name regularly. Any code generated by the LLM, when run by the user, would then download and execute the malicious package. This attack is similar to typosquatting.
The risk could be reduced by lowering the temperature, which reduces unpredictable outputs and keeps generation more closely aligned with the training data. If the temperature is set too low, some of the model's power is lost, and it may be less capable of producing solutions for rarely-encountered problems that are not well reflected in the training data. However, if the temperature is not set low enough, the risk of hallucinated package names may still be too high. Unfortunately, the "best" temperature cannot be determined a priori, so sufficient empirical testing is needed.
In addition to more restrictive temperature settings, consider adding guardrails that independently verify any referenced package to ensure that it exists, is not obsolete, and comes from a trusted party.
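As a sketch of one such guardrail, assuming Python packages and the public PyPI JSON API (which returns HTTP 404 for names that are not registered), a pre-suggestion existence check might look like this:

```python
import requests

def package_exists_on_pypi(name: str, timeout: float = 5.0) -> bool:
    """Return True if `name` is a registered PyPI package.

    Existence alone is not sufficient: an attacker may already have
    registered a commonly-hallucinated name, so this check should be
    combined with signals such as package age, maintainer reputation,
    and download history before a suggestion is surfaced to the user.
    """
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=timeout)
    return resp.status_code == 200
```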
Note that reducing the temperature does not entirely eliminate the risk of package hallucination. Even with a very low temperature or other restrictive settings, there is still a small chance that a non-existent package name will be generated.
Weakness Ordinalities
| Ordinality | Description |
|---|---|
| Primary | (where the weakness exists independent of other weaknesses) |
Detection Methods
| Method | Details |
|---|---|
| Automated Dynamic Analysis | Manipulate inference parameters and perform comparative evaluation to assess the impact of selected values. Build a suite of systems using targeted tools that detect problems such as prompt injection (CWE-1427). Consider statistically measuring token distributions to see if they are consistent with expected results. Effectiveness: Moderate. Note: Given the large variety of outcomes, it can be difficult to design testing to be comprehensive enough, and there is still a risk of unpredictable behavior. |
| Manual Dynamic Analysis | Manipulate inference parameters and perform comparative evaluation to assess the impact of selected values. Build a suite of systems using targeted tools that detect problems such as prompt injection (CWE-1427). Consider statistically measuring token distributions to see if they are consistent with expected results. Effectiveness: Moderate. Note: Given the large variety of outcomes, it can be difficult to design testing to be comprehensive enough, and there is still a risk of unpredictable behavior. |
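One way to realize the token-distribution measurement suggested above is to compare the token frequencies of outputs generated under a candidate parameter setting against a trusted baseline. The sketch below treats whitespace tokenization and a smoothed KL divergence as simplifying assumptions; a real harness would use the model's own tokenizer and an empirically calibrated threshold:

```python
import math
from collections import Counter

def token_distribution(samples):
    """Normalized token frequencies over a corpus of generated outputs.
    Whitespace tokenization is a simplification for this sketch."""
    counts = Counter(tok for text in samples for tok in text.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """Smoothed KL(P || Q): how far observed frequencies P have drifted
    from baseline frequencies Q."""
    return sum(pv * math.log(pv / q.get(k, eps)) for k, pv in p.items())

def distribution_is_consistent(candidate_samples, baseline_samples,
                               threshold=0.5):
    """Flag a candidate parameter setting whose outputs diverge from
    the baseline by more than an empirically chosen threshold."""
    p = token_distribution(candidate_samples)
    q = token_distribution(baseline_samples)
    return kl_divergence(p, q) <= threshold
```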
Memberships
| Nature | Type | ID | Name |
|---|---|---|---|
| MemberOf | Category | 1412 | Comprehensive Categorization: Poor Coding Practices |
Vulnerability Mapping Notes
Rationale
This CWE entry is at the Base level of abstraction, which is a preferred level of abstraction for mapping to the root causes of vulnerabilities.
Comments
Carefully read both the name and description to ensure that this mapping is an appropriate fit. Do not try to 'force' a mapping to a lower-level Base/Variant simply to comply with this preferred level of abstraction.
Research Gap
Submissions
| Submission Date | Submitter | Organization |
|---|---|---|
| 2024年06月28日 (CWE 4.18, 2025年09月09日) | Lily Wong | MITRE |

Contributions
| Contribution Date | Contributor | Organization |
|---|---|---|
| 2025年02月28日 (CWE 4.18, 2025年09月09日) | AI WG "New Entry" subgroup | |

Note: Participated in regular meetings from February to August 2025 to develop and refine most elements of this entry.