I am writing a regex to filter out invalid URLs. This should be simple enough, and plenty of examples are available online. I ended up using this one: ((https?|ftp|file)://)[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|].

However, our specific requirements state that the URL must end in either "?" or "&". This should also be fairly simple: add (\\?|\\&) to the end of the regex.
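A note on the escaping, as a small sketch: the doubled backslashes in (\\?|\\&) are only appropriate when the pattern is written as a string and passed to `new RegExp`. The question does not say whether a string or a regex literal is used, so both forms are shown here; the variable names are illustrative only.

```javascript
// Pattern built from a string: "\\?" in the source becomes \? in the regex.
const suffixFromString = new RegExp("(\\?|\\&)$");

// Equivalent regex literal: a single backslash, and & needs no escape.
const suffixFromLiteral = /(\?|&)$/;

// Pitfall: inside a literal, \\? means "an optional literal backslash",
// which can match the empty string, so every input passes.
const suffixMistaken = /(\\?|\\&)$/;

console.log(suffixFromString.test("https://helloworld.io/foobar?")); // true
console.log(suffixFromLiteral.test("https://helloworld.io/foobar")); // false
console.log(suffixMistaken.test("https://helloworld.io/foobar"));    // true
```

If the doubled backslashes end up inside a regex literal, the suffix check silently stops checking anything, which is worth ruling out before debugging the rest of the pattern.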

However, the requirements are further complicated by the following: if "?" is already present in the string, then the URL must end in "&", and vice versa (if "&" is already present, it must end in "?").

Note that the regex above, and this question in general, are in the context of JavaScript.

Edit, per a commenter's request:

Examples of input urls:

No "?" or "&" at all:

https://helloworld.io/foobar returns false

No "?" or "&" at end:

https://helloworld.io/foo&bar returns false

https://helloworld.io/foo?bar returns false

Single special character found at end:

https://helloworld.io/foobar? returns true

https://helloworld.io/foobar& returns true

Alternating special characters in url:

https://helloworld.io/foo&bar? returns true

https://helloworld.io/foo?bar& returns true

Alternating special characters in url without unique ending:

https://helloworld.io/foo&bar?baz& returns false

https://helloworld.io/foo?bar&baz? returns false

Repeated special character found at end:

https://helloworld.io/foo?bar? returns false

https://helloworld.io/foo&bar& returns false

Alternating special characters with no special character at end:

https://helloworld.io/foo&bar?baz returns false

https://helloworld.io/foo?bar?baz returns false

Second edit in response to another comment:

With this regex most of my problems are solved:

((https?|ftp|file):\/\/)[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|](\\?|\\&)

However, it does not handle cases such as this:

https://helloworld.io/foo&bar?baz?bum&

This evaluates as valid. However, because "&" is already present in the string before the last character, the URL must not end with "&".
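One reading of the requirement that is consistent with all the examples above: the URL must end in "?" or "&", and that final character must not occur anywhere earlier in the string. As a rough sketch, that rule alone can be expressed with plain string checks instead of a regex. The function name is hypothetical, and this checks only the trailing-delimiter rule, not whether the rest of the URL is valid.

```javascript
// Returns true when the URL ends in "?" or "&" and that final character
// does not occur anywhere earlier in the string.
// (Hypothetical helper; it does not validate the scheme or host.)
function hasUniqueTrailingDelimiter(url) {
  const last = url.slice(-1);
  if (last !== "?" && last !== "&") return false;
  // indexOf finds the first occurrence; it must be the final position.
  return url.indexOf(last) === url.length - 1;
}

console.log(hasUniqueTrailingDelimiter("https://helloworld.io/foo?bar&"));         // true
console.log(hasUniqueTrailingDelimiter("https://helloworld.io/foo&bar?baz?bum&")); // false
```

A check like this could run alongside the general URL regex, so that the regex does not have to encode the "not seen before" condition itself.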

asked May 12, 2022 at 15:09
  • If a URL contains & then it cannot end with ? Commented May 12, 2022 at 15:10
  • This is true, however, given our specific use case this requirement does not hold. We are filtering the urls to ensure they are ready for us to parse and customize. @anubhava Commented May 12, 2022 at 15:16
  • Can you share sample strings that should and shouldn't be matched (including borderline cases)? Commented May 12, 2022 at 15:20
  • Per your request I have added some examples to the question. @lemon Commented May 12, 2022 at 15:27
  • Please share your regexp + examples using regex101.com and mention the inputs that don't work, specifically Commented May 12, 2022 at 15:29

2 Answers

2

You can use the following regex:

(https|ftp|file):\/\/[^\/]+\/\w+((\?[^&\s]+)?&|(&[^\?\s]+)?\?)(\s|$)

Explanation:

  • (https|ftp|file): prefix
  • :\/\/: colon and double slash
  • [^\/]+: one or more characters other than a slash (everything up to the next slash)
  • \/: slash
  • \w+: one or more word characters (letters, digits, or underscore)

Then there are two options.

Option 1: (\?[^&\s]+)?&:

  • (\?[^&\s]+)?: optional ? followed by one or more characters other than & or whitespace
  • &: &

Option 2: (&[^\?\s]+)?\?:

  • (&[^\?\s]+)?: optional & followed by one or more characters other than ? or whitespace
  • \?: ?

Ending with (\s|$): whitespace or end of string

This matches the examples you provided. If you need further refinements, please add new examples.

Try it here.
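To check the pattern locally, here is a minimal sketch that runs it against the examples from the question. The regex is copied from above as a JavaScript literal; the variable names and the test harness are illustrative and not part of the original answer.

```javascript
// Pattern copied from the answer, written as a regex literal.
const answerRegex =
  /(https|ftp|file):\/\/[^\/]+\/\w+((\?[^&\s]+)?&|(&[^\?\s]+)?\?)(\s|$)/;

// [url, expected] pairs taken from the question's examples.
const questionCases = [
  ["https://helloworld.io/foobar", false],
  ["https://helloworld.io/foo&bar", false],
  ["https://helloworld.io/foo?bar", false],
  ["https://helloworld.io/foobar?", true],
  ["https://helloworld.io/foobar&", true],
  ["https://helloworld.io/foo&bar?", true],
  ["https://helloworld.io/foo?bar&", true],
  ["https://helloworld.io/foo&bar?baz&", false],
  ["https://helloworld.io/foo?bar&baz?", false],
  ["https://helloworld.io/foo?bar?", false],
  ["https://helloworld.io/foo&bar&", false],
  ["https://helloworld.io/foo&bar?baz", false],
  ["https://helloworld.io/foo?bar?baz", false],
];

// Collect every case where the regex disagrees with the expected result.
const answerFailures = questionCases.filter(
  ([url, expected]) => answerRegex.test(url) !== expected
);
console.log(answerFailures.length === 0 ? "all cases pass" : answerFailures);
```

Two things to keep in mind when adapting it: as written, the pattern accepts https but not plain http, and it expects a single path segment of word characters after the host, so a URL such as https://helloworld.io/foo/bar? would not match.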

answered May 12, 2022 at 15:41

3 Comments

For some reason I didn't get alerted when you answered - your regex is awesome! It still has a small false positive problem with URLs that fit the following pattern: https://helloworld.io/foo&bar?baz?bum& but that is something I can work on. Thank you!
Actually I changed my initial regex to match it, because it didn't match before. Let me do a rollback.
Very minor edit to your final solution: adding ($|\s) to the end makes it perfect! (entire string should be evaluated) @lemon
2

Working from your initial regex:

((https?|ftp|file)://)[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]

Then modifying it for each case:

((https?|ftp|file)://)[-A-Za-z0-9+@#/%?=~_|!:,.;]+[-A-Za-z0-9+@#/%=~_|]&

and

((https?|ftp|file)://)[-A-Za-z0-9+&@#/%=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]\?

Then joining them and de-duplicating the common prefix:

((https?|ftp|file)://)([-A-Za-z0-9+@#/%?=~_|!:,.;]+[-A-Za-z0-9+@#/%=~_|]&|[-A-Za-z0-9+&@#/%=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]\?)

Adding ^, $, and the correct escaping for javascript, this would be:

^((https?|ftp|file):\/\/)([-A-Za-z0-9+@#\/%?=~_|!:,.;]+[-A-Za-z0-9+@#\/%=~_|]&|[-A-Za-z0-9+&@#\/%=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]\?)$

Tests over on regex101
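For a quick local check, here is a small sketch that runs the final regex against the question's examples plus the two edge cases raised in the discussion. The harness and variable names are illustrative; only the pattern itself comes from this answer.

```javascript
// Final pattern from the answer, as a JavaScript regex literal.
const combinedRegex =
  /^((https?|ftp|file):\/\/)([-A-Za-z0-9+@#\/%?=~_|!:,.;]+[-A-Za-z0-9+@#\/%=~_|]&|[-A-Za-z0-9+&@#\/%=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]\?)$/;

// [url, expected] pairs: the question's examples, then two edge cases.
const combinedCases = [
  ["https://helloworld.io/foobar", false],
  ["https://helloworld.io/foo&bar", false],
  ["https://helloworld.io/foo?bar", false],
  ["https://helloworld.io/foobar?", true],
  ["https://helloworld.io/foobar&", true],
  ["https://helloworld.io/foo&bar?", true],
  ["https://helloworld.io/foo?bar&", true],
  ["https://helloworld.io/foo&bar?baz&", false],
  ["https://helloworld.io/foo?bar&baz?", false],
  ["https://helloworld.io/foo?bar?", false],
  ["https://helloworld.io/foo&bar&", false],
  ["https://helloworld.io/foo&bar?baz", false],
  ["https://helloworld.io/foo?bar?baz", false],
  // Ends in & but & already appears earlier: rejected.
  ["https://helloworld.io/foo&bar?baz?bum&", false],
  // Ends in & with several ? before it: accepted.
  ["https://helloworld.io/foo?bar?baz&", true],
];

// Collect every case where the regex disagrees with the expected result.
const combinedFailures = combinedCases.filter(
  ([url, expected]) => combinedRegex.test(url) !== expected
);
console.log(combinedFailures.length === 0 ? "all cases pass" : combinedFailures);
```

Because the pattern is anchored with ^ and $, each URL has to be tested as its own string; a trailing space or a line that also contains " returns true" will not match.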

answered May 12, 2022 at 15:54

3 Comments

So far tests are failing, but I really like the idea you are going with by stringing together the or statements ... I'm going to play around with that, and if I can crack this nut with your help I'll upvote you anyway since I know this one was a pain. @Ouroborus
There's a particular edge case that isn't covered in your examples where a string ends in one character and contains multiple of the other character as in https://helloworld.io/foo?bar?baz&. This solution would return true for those but I'm not sure if that's your intent.
That is my intent (that they return true) - thank you for double checking
