
I have a problem with the output of DOMDocument::saveHTML():

$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);
echo $dom->saveHTML();

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><h1>&ldquo;</h1></body></html>

OK, this works correctly. But if I want to extract a single node, like this:

$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);
$h1 = $dom->getElementsByTagName('h1');
echo $dom->saveHTML($h1->item(0));

it outputs unrecognized text:

<h1>“</h1>

Does anyone know how to solve this?

hakre
asked Feb 23, 2012 at 14:31
  • All those DOM functions return UTF-8 encoded strings; better check the manual. There is nothing to solve here, only something to display properly: tell your browser which encoding to use, either by configuring your response headers correctly or, if you don't know how to do that automatically, by picking the charset encoding from your browser's menu. See webstandards.org/learn/articles/askw3c/dec2002 Commented Feb 23, 2012 at 14:37

2 Answers


Your code example works for me; the output is <h1>“</h1>.

&ldquo;  &#8220;  “  Left double quotation mark

The binary UTF-8 sequence of “ is:

0xE2 (226)  0x80 (128)  0x9C (156)
 |           |           `------ Windows-1252: œ
 |           `--- most Windows 125x encodings: €
 `--- ISO 8859-1, 2, 3, 4, 9, 10, 14, 15, 16: â
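The byte breakdown above can be reproduced with a minimal sketch. It assumes the mbstring extension is available; the variable names are illustrative only. It prints the three UTF-8 bytes of “ and then shows what they look like when a viewer wrongly treats them as Windows-1252.

```php
<?php
// U+201C LEFT DOUBLE QUOTATION MARK, written out as its three UTF-8 bytes.
$quote = "\xE2\x80\x9C";

// bin2hex() exposes the raw bytes of the string.
$hex = bin2hex($quote);

// Pretend the bytes are Windows-1252 and convert them to UTF-8 so the
// misreading itself can be displayed (assumes mbstring is loaded).
$misread = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

echo $hex, PHP_EOL;     // e2809c
echo $misread, PHP_EOL; // â€œ
```

Seeing â€œ where a curly quote was expected is a typical sign that correct UTF-8 output is being displayed under a Windows-1252 (or similar) assumption.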

So where are you viewing that output?

Probably in your browser on Windows? If so, have you tried adding

header('Content-Type: text/html; charset=utf-8');

at the top of your script?
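Put together with the code from the question, the suggestion might look like the sketch below. The `headers_sent()` guard and the `$fragment` variable are additions for illustration, not part of the original code.

```php
<?php
// Declare the response encoding before any output is sent.
// The guard is an added precaution: header() has no effect once output has started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);

// Serialize only the first <h1> node; the result is a UTF-8 string.
$h1 = $dom->getElementsByTagName('h1')->item(0);
$fragment = $dom->saveHTML($h1);

echo $fragment;
```

With the charset declared, a browser should decode the three bytes of the quotation mark as one character instead of three.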

See also: Setting the HTTP charset parameter and Checking HTTP Headers.

answered Feb 23, 2012 at 14:48
  • This might be a defect with saveHTML and using the $node parameter (not using entities while saveHTML w/o $node does). Commented Feb 23, 2012 at 15:35
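If entities are wanted in the node output as well, one possible workaround is to encode the non-ASCII characters after serializing. This is a sketch that nobody in the thread proposed; it assumes the mbstring extension, and `$convmap` and `$asciiOnly` are illustrative names.

```php
<?php
$source = "<html><body><h1>&#8220;</h1></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($source);
$fragment = $dom->saveHTML($dom->getElementsByTagName('h1')->item(0));

// Turn every code point from U+0080 upward into a decimal numeric entity.
// Convmap layout: [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiOnly = mb_encode_numericentity($fragment, $convmap, 'UTF-8');

echo $asciiOnly, PHP_EOL;
```

The result contains only ASCII bytes, so it displays the same way regardless of which charset the browser assumes.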

You need the second parameter of the DOMDocument constructor (check out http://nl.php.net/manual/en/domdocument.construct.php):

$dom = new DOMDocument('1.0', 'utf-8');
answered Feb 23, 2012 at 14:36
  • The HTML source I load is already entity-encoded (&#8220;). It is output correctly with $dom->saveHTML(), but it turns into unrecognized characters if I print selected nodes with $dom->saveHTML($nodes). Commented Feb 23, 2012 at 14:42
