PHP local document crawler

Question 1

I've written a quick "local document crawler" that fetches the title tag and an expandable amount of metatag information from files on a webserver.

I develop in .NET for a living and don't have a clue what I'm doing, but the site I'm helping with only has PHP hosting.

The goal is to gather metadata from files on a server, hopefully cache the output that uses the data, and display it to the user.

We experienced some x-files stuff when the first cache file was written, and the system itself is rather slow, even when not recursing (about 200 files are read per request). The x-files stuff was PHP files disappearing from FTP view, which might be due to permissions being set automatically by the hosting provider.

Another thing I really don't understand is why some pages just don't seem to match my regular expressions for the metatags, so if anyone spots the issue you have my thanks.

General class:

<?php
class MetaEnumerator
{
 private $patterns = array(
 "title" => "/<title>([^<]*)<\\x2Ftitle>/ix",
 "keywords" => '/<meta(?=[^>]*name="keywords")\s[^>$]*content="([^"]*)[">]*$/ixu',
 "description" => '/<meta(?=[^>]*name="description")\s[^>$]*content="([^"]*)[">]*$/ixu'
 );
 private $endPattern = "/<\/head>/ixu";
 private $path = "";
 private $recursive = false;
 private $files = null;
 function __construct($path, $recursive) {
 $this->path = $path;
 $this->recursive = $recursive;
 }
 public function AddPattern($key, $pattern)
 {
 $this->patterns[$key] = $pattern;
 }
 public function GetFiles()
 {
 $this->files = array();
 $this->AddItems($this->path);
 usort($this->files, array("MetaEnumerator", "CompareTitle"));
 return $this->files;
 }
 private static function CompareTitle($a, $b) {
 return strcmp($a["title"], $b["title"]);
 }
 private function AddItems($path)
 {
 foreach(scandir($path) as $item) {
 $this->AddItem($path, $item);
 }
 }
 private function AddItem($path, $item)
 {
 $fullPath = "$path/$item";
 if ($this->IsFolder($fullPath, $item) && $this->recursive) {
 $this->AddItems($fullPath);
 }
 else if ($this->IsHtmlFile($fullPath)) {
 $this->AddFile($fullPath);
 }
 }
 private function AddFile($fullPath)
 {
 $fileInfo = $this->GetFileInfo($fullPath);
 array_push($this->files, $fileInfo);
 }
 private function GetFileInfo($file)
 {
 $fileInfo = array();
 $fileInfo["path"] = $file;
 $fileInfo["modified"] = filemtime($file);
 $ptr = fopen($file, "r");
 foreach ($this->patterns as $key => $value) {
 $fileInfo[$key] = $this->FindPattern($ptr, $value);
 }
 fclose($ptr);
 return $fileInfo;
 }
 private function FindPattern($ptr, $pattern)
 {
 $retVal = "";
 rewind($ptr);
 while (($line = fgets($ptr)) !== FALSE) {
 if (preg_match($pattern, $line) > 0) {
 $retVal = preg_replace($pattern, "$1", $line);
 break;
 }
 if (preg_match($this->endPattern, $line) > 0) {
 break;
 }
 }
 return $retVal;
 }
 private function IsFolder($path, $item)
 {
 return is_dir($path) && $this->IsPhysical($item);
 }
 private function IsPhysical($folderPath) {
 return $folderPath !== "." && $folderPath !== "..";
 }
 private function IsHtmlFile($filePath)
 {
 $pathInfo = pathinfo($filePath);
 return !is_dir($filePath) && $pathInfo["extension"] == "html";
 }
}

A page using it:
(This hasn't been refactored yet, so lay off with the clean code comments.)

<?
include "../../../utils/MetaEnumerator.php";
$files = scandir("..");
$maxDate = null;
foreach($files as $file) {
 $date = filemtime("../$file");
 if ($maxDate == null || $date > $maxDate) {
 $maxDate = $date;
 }
}
$cacheFile = "thispage.cache";
$cacheDate = file_exists($cacheFile) ? filemtime($cacheFile) : null;
if ($cacheDate >= $maxDate) {
 include($cacheFile);
 exit;
}
else
{
 ob_start();
?>
<html>
<head>
 <title>Our stuff</title>
</head>
<body>
<?
 echo date("d.m.Y",$maxDate);
 function AddTag($enumerator, $name) {
 $metaPrefix = '/<meta(?=[^>]*name="';
 $metaSuffix = '")\s[^>$]*content="([^"]*)[">]*$/ixu';
 $enumerator->AddPattern($name, $metaPrefix.$name.$metaSuffix);
 }
 $enumerator = new MetaEnumerator("..", false);
 AddTag($enumerator, "name");
 AddTag($enumerator, "country");
 AddTag($enumerator, "status");
 AddTag($enumerator, "active");
 $files = $enumerator->GetFiles();
 echo "<table>";
 echo "<tr>";
 echo "<th>Name</th>".
 "<th>Country</th>".
 "<th>Status</th>".
 "<th>Last update</th>";
 echo "</tr>";
 foreach($files as $file) {
 if ($file["name"] == null) continue;
 echo "<tr style=\"vertical-align: top;\">";
 echo "<td><a href=\"".$file["path"]."\" target=\"_blank\">".$file["name"]."</a></td>".
 "<td>".$file["country"]."</td>".
 "<td>".$file["eruption"]."</td>".
 "<td>".date("d.m.Y", $file["modified"])."</td>";
 echo "</tr>";
 }
 echo "</table>";
?>
</body>
</html>
<?
$fp = fopen($cacheFile, 'w');
fwrite($fp, ob_get_contents());
fclose($fp);
ob_end_flush();
}
?>

Question 2

Why use PHP to crawl documents? Wouldn't the parsing be much faster if you did it with Bash or another language? (PHP is more of a server-side scripting language, and I don't believe it will parse the fastest.)

Question 3

Because the only server technology available on the given site is PHP, and the point is to be able to deploy an HTML file to a subfolder and get the overview/list/news/sitemap pages updated automatically. Requirements, requirements. ;)

Question 4

I would use a DOMDocument instead of Regex to parse whenever possible

Question 5

There is no benefit to adding the x pattern modifier to your regexes.
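
A small sketch of why the x modifier makes no difference here: x only tells PCRE to ignore literal whitespace and #-comments inside the pattern, and the patterns in the question contain neither. The sample line and variable names below are made up for illustration.

```php
<?php
// Hypothetical sample line, standing in for one line of a crawled file.
$line = '<title>Our stuff</title>';

// The title pattern from the question, with and without the x modifier.
$withX    = '/<title>([^<]*)<\x2Ftitle>/ix';
$withoutX = '/<title>([^<]*)<\x2Ftitle>/i';

// Both calls match and capture the same text in group 1.
$matchedWithX    = preg_match($withX, $line, $a);
$matchedWithoutX = preg_match($withoutX, $line, $b);

// x only changes behaviour when the pattern itself contains whitespace:
// here the space is ignored, so PCRE looks for "Ourstuff" and finds nothing.
$spacedWithX = preg_match('/Our stuff/x', $line);
$spacedPlain = preg_match('/Our stuff/', $line);
```

So the modifier can be dropped from these patterns without changing what they match.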

Question 6

It is unfortunate that no answers have been received since this was posted more than 10 years ago. Perhaps you've learned a few things since then and/or the code has changed. Regardless, hopefully the advice below will be helpful to you and/or others.

Better tool for finding HTML tags

"You can't parse [X]HTML with regex". As was suggested in the comments, using DOMDocument would likely be a more robust solution (or see this SO answer for a list containing other solutions as well). As this SO answer explains:

"...instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML)."

For example, one could use DOMDocument::loadHTML() to look for tags. DOMXPath could be used to query the DOM with expressions similar to CSS rules.
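
A minimal sketch of that approach, not a drop-in replacement for the class: the function name extractMeta() and its parameters are hypothetical, and it takes the whole file contents as a string rather than reading line by line. One thing it should help with is the tags the regexes miss: the patterns in the question need the meta tag on a single line ending right after the closing quote, so a self-closing " />" or a tag wrapped across lines will not match, whereas a parser does not care.

```php
<?php
/**
 * Hypothetical helper: returns the <title> text plus the content of each
 * requested <meta name="..."> tag. Missing tags yield an empty string,
 * mirroring the "" default used by FindPattern() in the question.
 */
function extractMeta(string $html, array $metaNames): array
{
    $info = ['title' => ''] + array_fill_keys($metaNames, '');
    if ($html === '') {
        return $info; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath string() returns the text of the first matching node, or "".
    $xpath = new DOMXPath($doc);
    $info['title'] = trim($xpath->evaluate('string(//title)'));

    // Attribute order, quoting style and line breaks do not matter here.
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $name = strtolower($tag->getAttribute('name'));
        if (in_array($name, $metaNames, true)) {
            $info[$name] = $tag->getAttribute('content');
        }
    }
    return $info;
}

// Made-up sample document for illustration.
$sample = "<html><head>\n<title>Our stuff</title>\n"
    . "<meta content='Norway'\n      name=\"country\" />\n"
    . "<meta name=\"status\" content=\"active\">\n"
    . "</head><body></body></html>";
$result = extractMeta($sample, ['country', 'status', 'keywords']);
```

In GetFileInfo() this could be fed with file_get_contents($file), which would also replace the repeated rewind-and-scan per pattern with a single parse per file.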

Style

I've maintained code that was started prior to 2007, and much of it uses PascalCase (a.k.a. StudlyCaps) for class names as well as method names. This is okay to do; however, idiomatic PHP uses camelCase for method names, which is in line with PSR-1:

4.3. Methods

Method names MUST be declared in camelCase().

Array pushing

The method AddFile() uses array_push().

array_push($this->files, $fileInfo);

This is fine; however, when adding a single element to the array, the same can be achieved by assigning to the next available key (which can be omitted):

$this->files[] = $fileInfo;

Detecting if file is HTML file

The method IsHtmlFile checks that the path is not a directory and that the file's extension equals html. While it may be unlikely for your files, an HTML file could have an extension other than .html (including .HTML). A more robust technique would be to use mime_content_type() to check for the MIME type 'text/html'.
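
A rough sketch of such a check, written as a standalone function for brevity. The case-insensitive extension test in front of the MIME check is my own addition, as is the function_exists() guard, since mime_content_type() comes from the fileinfo extension and I am assuming it may not be enabled on every shared host.

```php
<?php
/**
 * Sketch of a more tolerant HTML check (camelCase per PSR-1).
 * Assumption: a .html/.htm extension in any letter case is trusted,
 * and anything else is decided by its detected MIME type.
 */
function isHtmlFile(string $path): bool
{
    if (!is_file($path)) {
        return false; // directories and missing paths are never HTML files
    }

    // PATHINFO_EXTENSION yields "" when there is no extension, which avoids
    // the undefined "extension" key that pathinfo($path)["extension"] can hit.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (in_array($extension, ['html', 'htm'], true)) {
        return true;
    }

    // Fall back to content sniffing when the fileinfo extension is loaded.
    return function_exists('mime_content_type')
        && mime_content_type($path) === 'text/html';
}
```

Note that sniffing the MIME type opens each file, so with around 200 files per request it is worth keeping the cheap extension test first.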

score 2 · Answer 1 · 2022-06-18 06:02:50Z
