This formats HTML for use in a locally hosted iframe so that you can manipulate the content in the iframe freely, without running into cross domain issues. It uses Goutte to retrieve the HTML. I'd like to improve the code, whether it's to fit a certain design pattern or just more efficient. I don't feel like it's as clean as it could be.
<?php
class HTMLFixerClass {
public $client;
public $crawler;
public $url;
public $originalHTML;
public $validHTML;
public function __construct($url) {
$this->url = $url;
$this->client = new GoutteClient();
$this->crawler = $this->client->request('GET', $url);
if ($this->crawler->filter('html')->count()) {
$this->originalHTML = $this->crawler->filter('html')->html();
$this->validHTML = $this->correct_directories($this->originalHTML);
}
}
public function get_html() {
if ($this->validHTML !== '') {
return "<!DOCTYPE HTML>\n<html>\n" . $this->validHTML . "\n</html>";
} else {
return 'Invalid URL';
}
}
public function correct_directories($html) {
$temp_url = $this->url;
preg_match_all('/<.*?\.{2}\/.*?>/', $html, $matches);
$initial_matches = $matches;
if ($matches) {
//foreach match in the entire doc
for ($i = 0; $i < count($matches[0]); $i++) {
//foreach match in each match ex ../../
preg_match_all('/[\.]{2}[\/]/', $matches[0][$i], $sub_matches);
$replacement_string = '';
for ($j = 0; $j < count($sub_matches[0]); $j++) {
$replacement_string .= '../';
//reduces base url by one directory foreach ../ found in
if ($j == 0) {
preg_match_all('/.*(?=\/.*\/)/', $temp_url, $replacement_url);
} elseif ($j > 0) {
preg_match_all('/.*(?=.*\/)/', $replacement_url[0][0], $replacement_url);
}
}
//in the end $replacement_url[0] should be the desired url for appending to css or script in order to become absolute link
$matches[0][$i] = str_replace($replacement_string, $replacement_url[0][0] . '/', $matches[0][$i]);
}
for ($k = 0; $k < count($matches[0]); $k++) {
$html = str_replace($initial_matches[0][$k], $matches[0][$k], $html);
}
return $html;
} else {
var_dump($matches);
}
}
}
-
\$\begingroup\$ I've rolled back your edit. Please read what you can and cannot do after receiving answers. \$\endgroup\$Kid Diamond– Kid Diamond2015年05月14日 20:15:27 +00:00Commented May 14, 2015 at 20:15
1 Answer 1
SOLID
It crawls HTML code from sites and then it manipulates the returned code, not adhering to the Single Responsibility Principle.
Bad Naming
Don't name your class HTMLFixerClass
, obviously it is a class. Also with OOP, in PHP, the convention is camelCase for variables and PascalCase for class names. E.g. yours would become HtmlFixerClass
.
What does correct_directories()
do? Programmers should be reading this method signature and immediately get an idea of what it is that this method is doing. Currently I have no clue, and it requires me to dig inside the body of the method to figure it out. Especially when it's doing a bunch of regex string manipulations which looks like gibberish at first sight (WTF moment).
Encapsulation
Make your class as private
as possible.
If you have no means of having certain components extended or directly called from the public, you can make those private.
A Constructor Constructs
Your constructor is doing way too much.
A constructor is exactly as the name implies it to be, it constructs the object. You are creating a new instance of an object, crawling from a URL, doing a whole bunch of stuff that shouldn't be in there.
Dependency Injection
Your class is tightly coupled to GoutteClient
.
What if you wanted to use another instance of GoutteClient
? You'd now have to modify your class, which is a bad thing. You can utilize Dependency Injection to make it loosely coupled.
Comments
I do appreciate the comments, but they don't really help all that much. Maybe you should include an example in it, to give the programmer a quicker idea of what's being done.
Also //foreach match in the entire doc
, pretty redundant. It doesn't add any additional information. So I would just remove it.
I would also start commenting your entire class. See PHPDoc for examples.
Coding Standards
Try to adhere to the coding standards defined by the PHP Framework Interop Group. It gives better readability to your code, especially for other programmers who do it as well. Of course, you do not have to be perfect with it.
Conclusion
Trying to fathom the use case of this class from a higher level, I came to realize that a class like this is pretty much unnecessary.
Basically it contains just one relevant utility method that does something helpful, correct_directories()
, although I can't be bothered putting some more effort into actually understanding what it does.
I would probably just make this a standalone function, or contain it in a static utility class HtmlFixer
which would contain every method that fixes anything HTML code related.
As far as crawling the HTML from URLs, this should be done somewhere in the application level, where different objects and functions work together to achieve something.
-
\$\begingroup\$ Thank you for all your advice, I've taken your advice and modified so that the main function correct_directories is named relativeToAbsolute and it is part of a helper class, it will be passed the html and the temporary url variables. \$\endgroup\$zacbrac– zacbrac2015年05月14日 19:05:44 +00:00Commented May 14, 2015 at 19:05