Skip to main content
Code Review

Return to Revisions

6 of 6
replaced http://stackoverflow.com/ with https://stackoverflow.com/

What you are doing is way beyond borderline abusive of regex. You may not be aware, but there are ways to parse CSV data without complicated string expressions. Here is a PHP reference.


See this demo:

regex

As you can clearly see, the moment you stop taking into account the second field 204 for filtering, your regex matches everything. This can be problematic in a few ways:

  1. Can you guarantee that all 204 lines will only contain address credentials? Database rows sometimes contain unexpected data due to misuse of the database fields. This is especially true of user-input fields from applications that rely on the database.

  2. Can you guarantee that all lines that are not 204 will never contain any data that you want to replace? See (1) above.

If you are unsure about either of the above, you might want to consider an alternative solution using your preferred language to give you more control over how parse the data. Your reliance on this 204 field could corrupt your data set if you are not careful. Regular expressions are powerful, but sometimes you need a scalpel and not a hammer.


So, let's look at the regex code itself. If you are going to be using PHP, I would recommend first looking into extended/verbose regex mode. This will allow you to document the expression in your code, so that the maintainer of it (which includes you 6 months in the future) will know what it is doing.

It could look something like this:

$LINE_204_PATTERN = 
 '/(?x) #enable verbose regex mode
 (
 [0-9] # any numeric char
 {5} # match 5 times
 ) # = exactly 5 numeric chars
 : # 1 colon
 (.*?) # lazy 0 or more of any char except line breaks
 : # 1 colon
 (.*?) # lazy 0 or more of any char except line breaks
 : # 1 colon
 (.*?) # lazy 0 or more of any char except line breaks
 : # 1 colon
 (.*) # 0 or more of any char except line breaks
 : # 1 colon
 (.*) # 0 or more of any char except line breaks
 /';

When you use ([0-9]{5}) you make the assumption that all IDs are exactly 5 numeric characters long. With that potentially being an auto-increment column in the database, chances are that some will have fewer than 5, and some will have more than 5; your expressions would not match either of those. You may consider using ([0-9]+) instead, which will match 1 or more numeric character. If some of those fields could be null/empty, use * instead of + to match 0 or more.

The other matching groups are really just "match anything, even nothing" groups and there's only so much about them that could be improved, unless you inspect your data set very closely to look for patterns or tendencies.

Phrancis
  • 20.5k
  • 6
  • 69
  • 155
default

AltStyle によって変換されたページ (->オリジナル) /