| Home > CAPEC List > CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic (Version 3.9) |
|
High
High
| Nature | Type | ID | Name |
|---|---|---|---|
| ChildOf | Standard Attack PatternStandard Attack Pattern - A standard level attack pattern in CAPEC is focused on a specific methodology or technique used in an attack. It is often seen as a singular piece of a fully executed attack. A standard attack pattern is meant to provide sufficient details to understand the specific technique and how it attempts to accomplish a desired goal. A standard level attack pattern is a specific type of a more abstract meta level attack pattern. | 267 | Leverage Alternate Encoding |
| PeerOf | Detailed Attack PatternDetailed Attack Pattern - A detailed level attack pattern in CAPEC provides a low level of detail, typically leveraging a specific technique and targeting a specific technology, and expresses a complete execution flow. Detailed attack patterns are more specific than meta attack patterns and standard attack patterns and often require a specific protection mechanism to mitigate actual attacks. A detailed level attack pattern often will leverage a number of different standard level attack patterns chained together to accomplish a goal. | 64 | Using Slashes and URL Encoding Combined to Bypass Validation Logic |
| PeerOf | Detailed Attack PatternDetailed Attack Pattern - A detailed level attack pattern in CAPEC provides a low level of detail, typically leveraging a specific technique and targeting a specific technology, and expresses a complete execution flow. Detailed attack patterns are more specific than meta attack patterns and standard attack patterns and often require a specific protection mechanism to mitigate actual attacks. A detailed level attack pattern often will leverage a number of different standard level attack patterns chained together to accomplish a goal. | 71 | Using Unicode Encoding to Bypass Validation Logic |
| View Name | Top Level Categories |
|---|---|
| Domains of Attack | Software |
| Mechanisms of Attack | Manipulate Data Structures |
Survey the application for user-controllable inputs: Using a browser or an automated tool, an attacker follows all public links and actions on a web site. They record all the links, the forms, the resources accessed and all other potential entry-points for the web application.
| Techniques |
|---|
| Use a spidering tool to follow and record all links and analyze the web pages to find entry points. Make special note of any links that include parameters in the URL. |
| Use a proxy tool to record all user input entry points visited during a manual traversal of the web application. |
| Use a browser to manually explore the website and analyze how it is constructed. Many browsers' plugins are available to facilitate the analysis or automate the discovery. |
Probe entry points to locate vulnerabilities: The attacker uses the entry points gathered in the "Explore" phase as a target list and injects various UTF-8 encoded payloads to determine if an entry point actually represents a vulnerability with insufficient validation logic and to characterize the extent to which the vulnerability can be exploited.
| Techniques |
|---|
| Try to use UTF-8 encoding of content in Scripts in order to bypass validation routines. |
| Try to use UTF-8 encoding of content in HTML in order to bypass validation routines. |
| Try to use UTF-8 encoding of content in CSS in order to bypass validation routines. |
| Scope | Impact | Likelihood |
|---|---|---|
Confidentiality Access Control Authorization | Bypass Protection Mechanism | |
Confidentiality Integrity Availability | Execute Unauthorized Commands | |
Integrity | Modify Data | |
Availability | Unreliable Execution |
The exact response required from an UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
It is possible for a decoder to behave in different ways for different types of invalid input.
RFC 3629 only requires that UTF-8 decoders must not decode "overlong sequences" (where a character is encoded in more bytes than needed but still adheres to the forms above). The Unicode Standard requires a Unicode-compliant decoder to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded but older specifications for UTF-8 only gave a warning and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high profile products including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.
To maintain security in the case of invalid input, there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless. Another possibility is to avoid conversion out of UTF-8 altogether but this relies on any other software that the data is passed to safely handling the invalid data.
Another consideration is error recovery. To guarantee correct recovery after corrupt or lost bytes, decoders must be able to recognize the difference between lead and trail bytes, rather than just assuming that bytes will be of the type allowed in their position.
Perhaps the most famous UTF-8 attack was against unpatched Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker made a request that looked like this
the server didn't correctly handle %c0%af in the URL. What do you think %c0%af means? It's 11000000 10101111 in binary; and if it's broken up using the UTF-8 mapping rules, we get this: 11000000 10101111. Therefore, the character is 00000101111, or 0x2F, the slash (/) character! The %c0%af is an invalid UTF-8 representation of the / character. Such an invalid UTF-8 escape is often referred to as an overlong sequence.
So when the attacker requested the tainted URL, they accessed
In other words, they walked out of the script's virtual directory, which is marked to allow program execution, up to the root and down into the system32 directory, where they could pass commands to the command shell, Cmd.exe.
See also: CVE-2000-0884| CWE-ID | Weakness Name |
|---|---|
| 173 | Improper Handling of Alternate Encoding |
| 172 | Encoding Error |
| 180 | Incorrect Behavior Order: Validate Before Canonicalize |
| 181 | Incorrect Behavior Order: Validate Before Filter |
| 73 | External Control of File Name or Path |
| 74 | Improper Neutralization of Special Elements in Output Used by a Downstream Component ('Injection') |
| 20 | Improper Input Validation |
| 697 | Incorrect Comparison |
| 692 | Incomplete Denylist to Cross-Site Scripting |
| Submissions | ||
|---|---|---|
| Submission Date | Submitter | Organization |
| 2014年06月23日 (Version 2.6) | CAPEC Content Team | The MITRE Corporation |
| Modifications | ||
| Modification Date | Modifier | Organization |
| 2018年07月31日 (Version 2.12) | CAPEC Content Team | The MITRE Corporation |
| Updated References | ||
| 2020年07月30日 (Version 3.3) | CAPEC Content Team | The MITRE Corporation |
| Updated Example_Instances, Execution_Flow, Mitigations, Skills_Required | ||
| 2021年06月24日 (Version 3.5) | CAPEC Content Team | The MITRE Corporation |
| Updated Related_Weaknesses | ||
| 2022年09月29日 (Version 3.8) | CAPEC Content Team | The MITRE Corporation |
| Updated Example_Instances, Mitigations | ||
|
Use of the Common Attack Pattern Enumeration and Classification (CAPEC), and the associated references from this website are subject to the Terms of Use. Copyright © 2007–2025, The MITRE Corporation. CAPEC and the CAPEC logo are trademarks of The MITRE Corporation. |
||