Formatting HTML for use in a locally hosted iframe

Question 1

This formats HTML for use in a locally hosted iframe so that you can manipulate the content in the iframe freely, without running into cross domain issues. It uses Goutte to retrieve the HTML. I'd like to improve the code, whether it's to fit a certain design pattern or just more efficient. I don't feel like it's as clean as it could be.

<?php
class HTMLFixerClass {
 public $client;
 public $crawler;
 public $url;
 public $originalHTML;
 public $validHTML;
 public function __construct($url) {
 $this->url = $url;
 $this->client = new GoutteClient();
 $this->crawler = $this->client->request('GET', $url);
 if ($this->crawler->filter('html')->count()) {
 $this->originalHTML = $this->crawler->filter('html')->html();
 $this->validHTML = $this->correct_directories($this->originalHTML);
 }
 }
 public function get_html() {
 if ($this->validHTML !== '') {
 return "<!DOCTYPE HTML>\n<html>\n" . $this->validHTML . "\n</html>";
 } else {
 return 'Invalid URL';
 }
 }
 public function correct_directories($html) {
 $temp_url = $this->url;
 preg_match_all('/<.*?\.{2}\/.*?>/', $html, $matches);
 $initial_matches = $matches;
 if ($matches) {
 //foreach match in the entire doc
 for ($i = 0; $i < count($matches[0]); $i++) {
 //foreach match in each match ex ../../
 preg_match_all('/[\.]{2}[\/]/', $matches[0][$i], $sub_matches);
 $replacement_string = '';
 for ($j = 0; $j < count($sub_matches[0]); $j++) {
 $replacement_string .= '../';
 //reduces base url by one directory foreach ../ found in
 if ($j == 0) {
 preg_match_all('/.*(?=\/.*\/)/', $temp_url, $replacement_url);
 } elseif ($j > 0) {
 preg_match_all('/.*(?=.*\/)/', $replacement_url[0][0], $replacement_url);
 }
 }
 //in the end $replacement_url[0] should be the desired url for appending to css or script in order to become absolute link
 $matches[0][$i] = str_replace($replacement_string, $replacement_url[0][0] . '/', $matches[0][$i]);
 }
 for ($k = 0; $k < count($matches[0]); $k++) {
 $html = str_replace($initial_matches[0][$k], $matches[0][$k], $html);
 }
 return $html;
 } else {
 var_dump($matches);
 }
 }
}

Question 2

I've rolled back your edit. Please read what you can and cannot do after receiving answers.

Question 3

SOLID

It crawls HTML code from sites and then it manipulates the returned code, not adhering to the Single Responsibility Principle.

Bad Naming

Don't name your class HTMLFixerClass, obviously it is a class. Also with OOP, in PHP, the convention is camelCase for variables and PascalCase for class names. E.g. yours would become HtmlFixerClass.

What does correct_directories() do? Programmers should be reading this method signature and immediately get an idea of what it is that this method is doing. Currently I have no clue, and it requires me to dig inside the body of the method to figure it out. Especially when it's doing a bunch of regex string manipulations which looks like gibberish at first sight (WTF moment).

Encapsulation

Make your class as private as possible.

If you have no means of having certain components extended or directly called from the public, you can make those private.

A Constructor Constructs

Your constructor is doing way too much.

A constructor is exactly as the name implies it to be, it constructs the object. You are creating a new instance of an object, crawling from a URL, doing a whole bunch of stuff that shouldn't be in there.

Dependency Injection

Your class is tightly coupled to GoutteClient.

What if you wanted to use another instance of GoutteClient? You'd now have to modify your class, which is a bad thing. You can utilize Dependency Injection to make it loosely coupled.

Comments

I do appreciate the comments, but they don't really help all that much. Maybe you should include an example in it, to give the programmer a quicker idea of what's being done.

Also //foreach match in the entire doc, pretty redundant. It doesn't add any additional information. So I would just remove it.

I would also start commenting your entire class. See PHPDoc for examples.

Coding Standards

Try to adhere to the coding standards defined by the PHP Framework Interop Group. It gives better readability to your code, especially for other programmers who do it as well. Of course, you do not have to be perfect with it.

Conclusion

Trying to fathom the use case of this class from a higher level, I came to realize that a class like this is pretty much unnecessary.

Basically it contains just one relevant utility method that does something helpful, correct_directories(), although I can't be bothered putting some more effort into actually understanding what it does.

I would probably just make this a standalone function, or contain it in a static utility class HtmlFixer which would contain every method that fixes anything HTML code related.

As far as crawling the HTML from URLs, this should be done somewhere in the application level, where different objects and functions work together to achieve something.

Question 4

Thank you for all your advice, I've taken your advice and modified so that the main function correct_directories is named relativeToAbsolute and it is part of a helper class, it will be passed the html and the temporary url variables.

Kid Diamond Kid Diamond 2,61117 silver badges35 bronze badges · Answer 1 · 2015-05-14 09:04:17Z

SOLID

It crawls HTML code from sites and then it manipulates the returned code, not adhering to the Single Responsibility Principle.

Bad Naming

Don't name your class HTMLFixerClass, obviously it is a class. Also with OOP, in PHP, the convention is camelCase for variables and PascalCase for class names. E.g. yours would become HtmlFixerClass.

What does correct_directories() do? Programmers should be reading this method signature and immediately get an idea of what it is that this method is doing. Currently I have no clue, and it requires me to dig inside the body of the method to figure it out. Especially when it's doing a bunch of regex string manipulations which looks like gibberish at first sight (WTF moment).

Encapsulation

Make your class as private as possible.

If you have no means of having certain components extended or directly called from the public, you can make those private.

A Constructor Constructs

Your constructor is doing way too much.

A constructor is exactly as the name implies it to be, it constructs the object. You are creating a new instance of an object, crawling from a URL, doing a whole bunch of stuff that shouldn't be in there.

Dependency Injection

Your class is tightly coupled to GoutteClient.

What if you wanted to use another instance of GoutteClient? You'd now have to modify your class, which is a bad thing. You can utilize Dependency Injection to make it loosely coupled.

Comments

I do appreciate the comments, but they don't really help all that much. Maybe you should include an example in it, to give the programmer a quicker idea of what's being done.

Also //foreach match in the entire doc, pretty redundant. It doesn't add any additional information. So I would just remove it.

I would also start commenting your entire class. See PHPDoc for examples.

Coding Standards

Try to adhere to the coding standards defined by the PHP Framework Interop Group. It gives better readability to your code, especially for other programmers who do it as well. Of course, you do not have to be perfect with it.

Conclusion

Trying to fathom the use case of this class from a higher level, I came to realize that a class like this is pretty much unnecessary.

Basically it contains just one relevant utility method that does something helpful, correct_directories(), although I can't be bothered putting some more effort into actually understanding what it does.

I would probably just make this a standalone function, or contain it in a static utility class HtmlFixer which would contain every method that fixes anything HTML code related.

As far as crawling the HTML from URLs, this should be done somewhere in the application level, where different objects and functions work together to achieve something.

Thank you for all your advice, I've taken your advice and modified so that the main function correct_directories is named relativeToAbsolute and it is part of a helper class, it will be passed the html and the temporary url variables.

Stack Exchange Network

Formatting HTML for use in a locally hosted iframe

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Formatting HTML for use in a locally hosted iframe

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions