
unicode escape sequence

edited Aug 23, 2019 at 18:14

Maintainability

No matter the possibilities of a regex, it's not intended for parsing complex languages. Nobody (no sane person) is able to make changes to this regex (even if provided with comments):

^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)

In addition, even if you think you have covered it all, people may still find edge cases that render the regex incomplete and require even more subtle changes to it. At some point you might (and probably will) realise that the problem is harder than you imagined, and that it may even require context that just isn't solvable through regex. Then it's back to the drawing board to think about an alternative solution to your problem.

As P.Roe so eloquently states in the comments:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Missing Edge Case(s)

C# allows unicode escape sequences. Literals can have \xnnnn escape sequences, while \unnnn and \Unnnnnnnn escape sequences can also appear outside literals. The compiler transforms these latter two into characters, but only in the following cases:
- identifiers
- character literals
- regular string literals

This means that \u002F, which represents a slash /, is not found by your regex.
I'm also not sure whether your regex can track inline comments in interpolated strings. This should be verified.
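
To make the escape sequence cases concrete, here is a small sketch of the three places where the compiler resolves a unicode escape. The class and member names are made up for illustration.

```csharp
public static class UnicodeEscapeDemo
{
    // Escape inside a regular string literal: compiled as the two characters "//".
    public const string Text = "\u002F\u002F";

    // Escape inside a character literal: compiled as '/'.
    public const char Slash = '\u002F';

    // Escape inside an identifier: \u0061 is 'a', so this method is named "a".
    public static int \u0061() => 1;

    // The method declared above can be called by its plain name.
    public static int CallByPlainName() => a();
}
```

In the source text the string literal contains no slash character at all, even though `UnicodeEscapeDemo.Text` holds `//` at run time. Anything that scans the raw source has to decide how to treat such sequences.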

Alternative Approach

Scanning text in order to parse it into some intermediate form, in this case according to the grammar of C# comments, requires a process called lexing or tokenizing. In a lexer, you scan the text and use an internal state machine to determine what you are currently looking at. One state might be QuoteState, in which a comment delimiter is literal text and not the start of a comment. The end result of lexing is a list of tokens (comments in this case) and their positions in the input text.
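
A minimal sketch of such a state machine could look like the following. The names `CommentLexer`, `FindComments` and the `State` members are hypothetical, chosen only for this illustration. It is a starting point under simplifying assumptions, not a complete C# lexer: it ignores nested expressions in interpolated strings, raw string literals and preprocessor directives.

```csharp
using System.Collections.Generic;

public static class CommentLexer
{
    // Hypothetical state names; Quote plays the role of the QuoteState described above.
    private enum State { Code, Quote, VerbatimQuote, CharLiteral, LineComment, BlockComment }

    // Returns the start index and text of every comment found in the source.
    public static List<(int Start, string Text)> FindComments(string source)
    {
        var comments = new List<(int Start, string Text)>();
        var state = State.Code;
        int start = 0; // start index of the comment currently being read

        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            char next = i + 1 < source.Length ? source[i + 1] : '\0';

            switch (state)
            {
                case State.Code:
                    if (c == '/' && next == '/') { state = State.LineComment; start = i; i++; }
                    else if (c == '/' && next == '*') { state = State.BlockComment; start = i; i++; }
                    else if (c == '@' && next == '"') { state = State.VerbatimQuote; i++; }
                    else if (c == '"') state = State.Quote;
                    else if (c == '\'') state = State.CharLiteral;
                    break;

                case State.Quote:
                    if (c == '\\') i++;               // skip the escaped character
                    else if (c == '"' || c == '\n') state = State.Code;
                    break;

                case State.VerbatimQuote:
                    if (c == '"' && next == '"') i++; // "" is an escaped quote
                    else if (c == '"') state = State.Code;
                    break;

                case State.CharLiteral:
                    if (c == '\\') i++;               // skip the escaped character
                    else if (c == '\'' || c == '\n') state = State.Code;
                    break;

                case State.LineComment:
                    if (c == '\n' || c == '\r')
                    {
                        comments.Add((start, source.Substring(start, i - start)));
                        state = State.Code;
                    }
                    break;

                case State.BlockComment:
                    if (c == '*' && next == '/')
                    {
                        i++; // consume the closing '/'
                        comments.Add((start, source.Substring(start, i - start + 1)));
                        state = State.Code;
                    }
                    break;
            }
        }

        // A line comment may run to the end of the input without a newline.
        if (state == State.LineComment)
            comments.Add((start, source.Substring(start)));

        return comments;
    }
}
```

Calling `CommentLexer.FindComments(source)` returns one entry per comment with its position in the input. Each state is a separate `case`, so a new edge case becomes a local change to one branch, which is the maintainability advantage over the single regex.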

answered Aug 23, 2019 at 16:48

dfhwze
