
Tweeted twitter.com/#!/StackCodeReview/status/198506045608435712

occurred May 4, 2012 at 20:15

improved formatting


edited Apr 28, 2012 at 8:20

palacsint


Unicode parsing in PHP

Firstly, apologies if this is not the correct type of question for here; I had it on Stack Overflow but it was closed with a suggestion that I post here.

I want to parse form data in PHP to ensure it is safe from SQL injection and email header attacks, and any other security holes I've not considered.

Although I’m using UTF-8, I just want to cater for English plus some of the extra acute, tilde etc. characters that one would normally encounter, plus the Euro symbol. Everything else is disallowed and throws an error, as opposed to silently being replaced/removed.

My main question is: does this do what I want (it seems to work), or have I missed anything? My understanding is that the iconv function will remove any invalid sequences (i.e. hack attempts at the bit level) and leave just valid UTF-8, then my regexp checks for the characters I want allowed.
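Since the stated goal is to raise an error rather than silently drop input, one option is to test for valid UTF-8 instead of stripping invalid sequences. A minimal sketch, assuming PHP 7 or later; the function name is_valid_utf8 is hypothetical, and it relies on preg_match() returning false (not 0) when the subject is not valid UTF-8 under the /u modifier:

```php
<?php
// Hypothetical helper: report whether $text is well-formed UTF-8.
// The empty pattern matches any valid subject, so the result is 1 for
// valid input and false when the /u modifier rejects the byte sequence.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// "café" with é written as its two UTF-8 bytes (C3 A9)
$wellFormed = "caf\xC3\xA9";
// The same string with the second byte of é cut off
$truncated = "caf\xC3";

$wellFormedOk = is_valid_utf8($wellFormed); // true
$truncatedOk  = is_valid_utf8($truncated);  // false
```

With a check like this the input is rejected as a whole, whereas iconv with //IGNORE would quietly return "caf" for the truncated string.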

And I have seen both, say, \x{20}-\x{7e} and \x20-\x7e used – what is the difference with and without the braces?

For example, if expecting an integer of the 0-9 kind, can I just use preg_match("/[^0-9]/", $text) without the /u modifier and specify literal characters (from x00 to x7e)? And suppose I want to allow 0-9 and the Euro in, is this the correct way: preg_match("/[^0-9\x{20ac}]/u", $text)? And if I’m expecting a hidden field with "ADD" or "EDIT", is if (!preg_match("/^(ADD|EDIT)$/", $text)) still valid to test that?

Thanks, Kevin



asked Apr 27, 2012 at 12:24

kevins


Unicode parsing in PHP - please review my method?

Firstly, apologies if this is not the correct type of question for here; I had it on Stack Overflow but it was closed with a suggestion that I post here.

I’m in the process of converting from Latin 15 to Unicode/UTF-8 and have researched several tutorials. I am looking here for a critique of what I have implemented based on them (in other words, did I understand them?):

I want to parse form data in PHP to ensure it is safe from SQL injection and email header attacks, and any other security holes I've not considered.

This is my code so far:

// Ensure the text is valid Unicode: strip any invalid UTF-8 sequences
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);
// Allow only basic English-ish characters through: no control characters, Chinese, etc.
$match_list = "\x{09}\x{0a}\x{0d}\x{20}-\x{7e}"; // basic ASCII characters plus TAB, LF and CR
$match_list .= "\x{a1}-\x{ff}"; // extended Latin-1 characters, excluding control characters
$match_list .= "\x{20ac}"; // Euro symbol
// Flag the input if it contains any character outside the allowed list
if (preg_match("/[^$match_list]/u", $text)) {
    $error_text_array[] = "<b>INVALID UNICODE characters</b>";
}

This code should only allow the characters shown in yellow here http://solomon.ie/unicode/

Although it seems to work, I’m still confused by the regexp and the hex notation. Am I matching the Unicode code points or the actual binary UTF-8 representation of those code points?* It appears to be the former.

*My understanding is that a code point is basically the virtual location of a specific character in the Unicode "world" of characters, i.e. the Euro symbol is the 20ACth character from the start, but its actual binary code, and number of bytes, depends on whether you use UTF-8, UTF-16, UCS-2, etc. So the code point never changes but the bit sequence can.
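The distinction between code point and byte sequence can be checked directly. A small sketch, assuming PHP 5.4 or later and writing the Euro sign as explicit bytes so the example does not depend on the source file's own encoding; the variable names are illustrative only:

```php
<?php
// The Euro sign is code point U+20AC; encoded as UTF-8 it is the bytes E2 82 AC.
$euro = "\xE2\x82\xAC";

$byteLength = strlen($euro);  // strlen() counts bytes: 3
$hexBytes   = bin2hex($euro); // "e282ac"

// With /u, pattern and subject are treated as sequences of code points,
// so \x{20ac} matches the single character U+20AC.
$matchesCodePoint = preg_match('/^\x{20ac}$/u', $euro);

// Without /u, matching is byte by byte, so the same string is only
// matched by spelling out its three UTF-8 bytes.
$matchesBytes = preg_match('/^\xE2\x82\xAC$/', $euro);

// "." consumes one code point with /u and one byte without it.
$unitsWithU    = preg_match_all('/./su', $euro); // 1
$unitsWithoutU = preg_match_all('/./s', $euro);  // 3
```

So with the /u modifier the hex values in the pattern are code points, which is consistent with "the former" above.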

And, I have seen both, say, \x{20}-\x{7e} and \x20-\x7e used – what is the difference with and without the braces?
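As far as I can tell from the PCRE documentation, the two spellings are interchangeable for values up to FF, and the braces are only required for longer values such as \x{20ac}; what changes the meaning is the /u modifier. A hedged sketch, assuming PHP 5.4 or later; patterns are in single quotes so that PCRE, not PHP, interprets the escapes (inside PHP double quotes \x20 is converted by PHP itself, while \x{20} is passed through unchanged):

```php
<?php
$ascii = "Hello, world";

// For values up to FF the two notations denote the same character.
$plain  = preg_match('/^[\x20-\x7e]+$/', $ascii);     // 1
$braced = preg_match('/^[\x{20}-\x{7e}]+$/', $ascii); // 1

// "é" is code point U+00E9, stored in UTF-8 as the two bytes C3 A9.
$eAcute = "\xC3\xA9";

// Without /u the pattern looks for the single byte E9, which is not present.
$byteMode = preg_match('/^\xe9$/', $eAcute);      // 0

// With /u both notations mean code point U+00E9 and match the character.
$utfPlain  = preg_match('/^\xe9$/u', $eAcute);    // 1
$utfBraced = preg_match('/^\x{e9}$/u', $eAcute);  // 1
```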

I intend to use the above few lines on all form fields, then follow it by further checks depending on the nature of the input.
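One way to apply the same check to every field is to wrap it in a function. This is a sketch only, assuming PHP 7.1 or later; find_text_error, ALLOWED_CHARS and the sample field names are hypothetical, and unlike the snippet above it reports invalid UTF-8 as an error instead of stripping it with iconv:

```php
<?php
// Same whitelist as the snippet above: TAB, LF, CR, printable ASCII,
// Latin-1 A1-FF and the Euro sign. Single quotes keep the escapes
// intact so that PCRE interprets them.
const ALLOWED_CHARS = '\x{09}\x{0a}\x{0d}\x{20}-\x{7e}\x{a1}-\x{ff}\x{20ac}';

// Hypothetical helper: returns an error message, or null if the text is acceptable.
function find_text_error(string $text): ?string
{
    $result = preg_match('/[^' . ALLOWED_CHARS . ']/u', $text);
    if ($result === false) {
        // Under /u, preg_match() fails on a subject that is not valid UTF-8.
        return 'invalid UTF-8';
    }
    if ($result === 1) {
        return 'disallowed character';
    }
    return null;
}

// Illustrative form data: "Señor" (ñ as bytes C3 B1) and a string with a BEL control character.
$fields = ['name' => "Se\xC3\xB1or", 'note' => "bell\x07"];

$errors = [];
foreach ($fields as $fieldName => $value) {
    $error = find_text_error($value);
    if ($error !== null) {
        $errors[$fieldName] = $error;
    }
}
// $errors now holds an entry for 'note' only.
```

Field-specific checks can then run only on the fields that passed this first stage.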

For example, if expecting an integer of the 0-9 kind, can I just use preg_match("/[^0-9]/", $text) without the /u modifier and specify literal characters (from x00 to x7e)? And suppose I want to allow 0-9 and the Euro in, is this the correct way: preg_match("/[^0-9\x{20ac}]/u", $text)? And if I’m expecting a hidden field with "ADD" or "EDIT", is if (!preg_match("/^(ADD|EDIT)$/", $text)) still valid to test that?
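These three checks could be sketched as follows, assuming PHP 7 or later; the function names are hypothetical. The sketch uses positive, anchored patterns rather than the negated character classes above, and \z instead of $, because $ also matches just before a trailing newline:

```php
<?php
// Digits only: pure ASCII, so the /u modifier is not needed.
function is_digits(string $text): bool
{
    return preg_match('/^[0-9]+\z/', $text) === 1;
}

// Digits plus the Euro sign: /u is needed so \x{20ac} is read as a code point.
function is_digits_or_euro(string $text): bool
{
    return preg_match('/^[0-9\x{20ac}]+\z/u', $text) === 1;
}

// A fixed set of values can be compared directly, without a regular expression.
function is_valid_action(string $text): bool
{
    return in_array($text, ['ADD', 'EDIT'], true);
}

// With $, a value ending in a newline still passes; with \z it does not.
$dollarAcceptsNewline = preg_match('/^(ADD|EDIT)$/', "ADD\n");  // 1
$zRejectsNewline      = preg_match('/^(ADD|EDIT)\z/', "ADD\n"); // 0
```

Unlike the negated form, the anchored patterns also reject an empty string, which may or may not be wanted for optional fields.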

Thanks, Kevin

php unicode
