paquettg/php-html-parser

An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery.

Sunday, February 11, 2018
by paquettg

The README.md

PHP Html Parser

PHPHtmlParser is a simple, flexible HTML parser that lets you select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools that need a quick, easy way to scrape HTML, whether it's valid or not!

Install

Install the latest version using Composer.

$ composer require paquettg/php-html-parser

This package can be found on Packagist and is best loaded using Composer. We support PHP 7.2, 7.3, and 7.4.

Basic Usage

You can find many examples of how to use the DOM parser and any of its parts (which you will most likely never touch) in the tests directory. The tests use PHPUnit and are very small, a few lines each, so they are a great place to start. Even so, here are a few examples of how the package should be used. The following is a very simple usage of the package.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
$a = $dom->find('a')[0];
echo $a->text; // "click here"

The above will output "click here". Simple, no? There are many ways to get the same result from the DOM, such as $dom->getElementsByTag('a')[0] or $dom->find('a', 0), all of which can be found in the tests or in the code itself.
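The alternatives named above can be compared side by side. This is a minimal sketch, assuming the same HTML string as the basic example and that each call returns the same anchor node; the variable names are illustrative only.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');

// Three ways to reach the first <a> tag in the document.
$viaIndex = $dom->find('a')[0];              // index into the result collection
$viaNth   = $dom->find('a', 0);              // ask find() for the nth match
$viaByTag = $dom->getElementsByTag('a')[0];  // look the tag up by name

foreach ([$viaIndex, $viaNth, $viaByTag] as $a) {
    echo $a->text, PHP_EOL; // "click here"
}
```

Which form to use is mostly a matter of taste; find('a', 0) avoids building an intermediate variable when only one match is needed.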

Support PHP Html Parser Financially

Get supported PHP Html Parser and help fund the project with the Tidelift Subscription.

Tidelift delivers commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use.

Loading Files

You may also seamlessly load a file into the DOM instead of a string, which is much more convenient and is how I expect most developers will load their HTML. The following example is taken from our tests and uses the "big.html" file found there.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadFromFile('tests/data/big.html');
$contents = $dom->find('.content-border');
echo count($contents); // 10

foreach ($contents as $content)
{
    // get the class attr
    $class = $content->getAttribute('class');

    // do something with the html
    $html = $content->innerHtml;

    // or refine the find some more
    $child   = $content->firstChild();
    $sibling = $child->nextSibling();
}

This example loads the HTML from big.html, a real page found online, and gets all elements with the content-border class to process. It also shows a few things you can do with a node, but it is not an exhaustive list of the methods a node has available.

Loading URLs

Loading a URL is very similar to the way you would load the HTML from a file.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadFromUrl('http://google.com');
$html = $dom->outerHtml;

loadFromUrl will, by default, use an implementation of \Psr\Http\Client\ClientInterface to perform the HTTP request and a default implementation of \Psr\Http\Message\RequestInterface to create the body of the request. You can easily implement your own version of either the client or the request to use a custom HTTP connection with loadFromUrl.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
use App\Services\MyClient;

$dom = new Dom;
$dom->loadFromUrl('http://google.com', null, new MyClient());
$html = $dom->outerHtml;

As long as the client object implements the interface properly, it will use that object to get the content of the URL.
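A custom client only has to provide the single sendRequest method that the PSR-18 ClientInterface declares. The sketch below is one hypothetical way to do it: CannedClient is an illustrative name, not part of this package, and it assumes guzzlehttp/psr7 is installed to supply a ResponseInterface implementation. It returns a stored page instead of opening a connection, which can be handy in tests.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use GuzzleHttp\Psr7\Response;
use PHPHtmlParser\Dom;
use Psr\Http\Client\ClientInterface;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

// Hypothetical client: answers every request with the same HTML string.
final class CannedClient implements ClientInterface
{
    /** @var string */
    private $html;

    public function __construct(string $html)
    {
        $this->html = $html;
    }

    public function sendRequest(RequestInterface $request): ResponseInterface
    {
        // The request is ignored; no network connection is made.
        return new Response(200, ['Content-Type' => 'text/html'], $this->html);
    }
}

$dom = new Dom;
$dom->loadFromUrl('http://example.test', null, new CannedClient('<p>Hello</p>'));
$text = $dom->find('p')[0]->text;
echo $text, PHP_EOL;
```

A real client would forward the request to an HTTP library and return its response unchanged.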

Loading Strings

Loading a string directly is also easily done.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<html>String</html>');
$html = $dom->outerHtml;

Options

You can also set parsing options that affect the behavior of the parsing engine. You can set a global option array using the setOptions method of the Dom object, or an instance-specific option by passing it to the load method as an extra (optional) parameter.

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
use PHPHtmlParser\Options;

$dom = new Dom;
$dom->setOptions(
    // this is set as the global option level.
    (new Options())
        ->setStrict(true)
);

$dom->loadFromUrl('http://google.com', 
    (new Options())->setWhitespaceTextNode(false) // only applies to this load.
);

$dom->loadFromUrl('http://gmail.com'); // will not have whitespaceTextNode set to false.

At the moment we support 12 options.

strict

Strict, by default false, will throw a StrictException if it finds that the HTML is not strictly compliant (all tags must have a closing tag, no attribute without a value, etc.).
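As an illustration of strict mode, the following sketch feeds the parser a br tag written without a closing slash and catches the resulting exception. It assumes the exception class lives at PHPHtmlParser\Exceptions\StrictException; the HTML string and the $strictFailed flag are illustrative.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use PHPHtmlParser\Dom;
use PHPHtmlParser\Exceptions\StrictException; // assumed namespace
use PHPHtmlParser\Options;

$dom = new Dom;
$dom->setOptions((new Options())->setStrict(true));

$strictFailed = false;
try {
    // <br> has no closing slash, which strict mode is expected to reject.
    $dom->loadStr('<div><p id="hey">Hey you</p><br><p id="ya">Ya you!</p></div>');
} catch (StrictException $e) {
    $strictFailed = true;
    echo 'Strict parsing failed: ', $e->getMessage(), PHP_EOL;
}
```

With strict left at its default of false, the same string should load without an exception.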

whitespaceTextNode

The whitespaceTextNode option, by default true, tells the parser to save text nodes even if the content of the node is empty (only whitespace). Setting it to false will ignore all whitespace-only text nodes found in the document.

enforceEncoding

The enforceEncoding option, by default null, enforces a character set to be used for reading the content and returning the content in that encoding. Setting it to null triggers an attempt to detect the encoding from the content of the given string instead.

cleanupInput

Set this to false to skip the entire cleanup phase of the parser. If this is set to false the next 3 options will be ignored. Defaults to true.

removeScripts

Set this to false to skip removing the script tags from the document body. This might have adverse effects. Defaults to true.

removeStyles

Set this to false to skip removing the style tags from the document body. This might have adverse effects. Defaults to true.

preserveLineBreaks

Preserves line breaks if set to true. If set to false, line breaks are cleaned up as part of the input cleanup process. Defaults to false.

removeDoubleSpace

Set this to false if you want to preserve whitespace inside of text nodes. It is set to true by default.

removeSmartyScripts

Set this to false if you want to preserve Smarty scripts found in the HTML content. It is set to true by default.

htmlSpecialCharsDecode

By default this is set to false. Setting this to true will apply the PHP function htmlspecialchars_decode to all attribute values and text nodes.

selfClosing

This option contains an array of all self-closing tags. These tags must be self-closing, and the parser will force them to be so if you have strict turned on. You can update this list with any additional tags that can be used as a self-closing tag when using strict. You can also remove tags from this array or clear it out completely.

noSlash

This option contains an array of all tags that cannot be self-closing. The list starts off empty, but you can add elements as you wish.
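Several options can be chained on one Options object and passed to a single load call. This is a sketch only: setStrict and setWhitespaceTextNode appear earlier in this README, while the other setter names are assumed to follow the same set<OptionName> pattern and should be checked against the Options class.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use PHPHtmlParser\Dom;
use PHPHtmlParser\Options;

// Setter names below other than setWhitespaceTextNode are assumptions
// based on the naming pattern of the documented setters.
$options = (new Options())
    ->setWhitespaceTextNode(false)    // drop whitespace-only text nodes
    ->setRemoveScripts(false)         // keep <script> tags in the tree
    ->setPreserveLineBreaks(true)     // do not strip line breaks during cleanup
    ->setHtmlSpecialCharsDecode(true); // decode entities such as &amp;

$dom = new Dom;
// These options apply to this load only.
$dom->loadStr('<div><script>var a = 1;</script><p>Tom &amp; Jerry</p></div>', $options);

$scripts = $dom->find('script');
$text    = $dom->find('p')[0]->text;

echo count($scripts), PHP_EOL;
echo $text, PHP_EOL;
```

Passing the options to the load call rather than to setOptions keeps them from leaking into later loads on the same Dom object.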

Static Facade

You can also mount a static facade for the Dom object.

PHPHtmlParser\StaticDom::mount();

Dom::loadFromFile('tests/data/big.html');
$objects = Dom::find('.content-border');

The above PHP block does the same load and find as the Loading Files example, but through the static facade, which supports all public methods found in the Dom object.

Modifying The Dom

You can always modify the DOM that was created by any loading method. To change an attribute of any node, just call the setAttribute method.

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
$a = $dom->find('a')[0];
$a->setAttribute('class', 'foo');
echo $a->getAttribute('class'); // "foo"

You may also get the PHPHtmlParser\Dom\Tag class directly and manipulate it as you see fit.

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
/** @var Dom\Node\AbstractNode $a */
$a   = $dom->find('a')[0];
$tag = $a->getTag();
$tag->setAttribute('class', 'foo');
echo $a->getAttribute('class'); // "foo"

It is also possible to remove a node from the tree. Simply call the delete method on any node to remove it. Note that you should unset the node after removing it from the DOM; it will keep taking up memory until it is unset.

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
/** @var Dom\Node\AbstractNode $a */
$a   = $dom->find('a')[0];
$a->delete();
unset($a);
echo $dom; // '<div class="all"><p>Hey bro, <br /> :)</p></div>'

You can modify the text of TextNode objects easily. Please note that, if you set an encoding, the new text will be encoded using the existing encoding.

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
/** @var Dom\Node\InnerNode $a */
$a   = $dom->find('a')[0];
$a->firstChild()->setText('biz baz');
echo $dom; // '<div class="all"><p>Hey bro, <a href="google.com">biz baz</a><br /> :)</p></div>'

The Versions

Every version listed is MIT licensed and tagged with the keywords parser, html, dom. Each entry gives the release date, the version with its normalized form, and the source repository.

11/02 2018
dev-dev (dev-dev) https://github.com/paquettg/php-html-parser
Requires: php >=5.6, ext-mbstring *, paquettg/string-encode ~0.1.0
by Gilles Paquette

06/04 2016
dev-master (9999999-dev) https://github.com/paquettg/php-html-parser
Requires: paquettg/string-encode ~0.1.0, php >=5.6
by Gilles Paquette

06/04 2016
1.7.0 (1.7.0.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4, paquettg/string-encode ~0.1.0
by Gilles Paquette

20/03 2016
1.6.9 (1.6.9.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4, paquettg/string-encode ~0.1.0
by Gilles Paquette

08/11 2015
1.6.8 (1.6.8.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4, paquettg/string-encode ~0.1.0
by Gilles Paquette

09/12 2014
1.6.4 (1.6.4.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4, paquettg/string-encode 0.1.0
by Gilles Paquette

15/04 2014
1.6.3 (1.6.3.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4, paquettg/string-encode 0.1.0
Development requires: phpunit/phpunit 3.7.*
by Gilles Paquette

08/04 2014
1.6.2 (1.6.2.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4, paquettg/string-encode 0.1.0
Development requires: phpunit/phpunit 3.7.*
by Gilles Paquette

03/01 2014
1.6.1 (1.6.1.0) https://github.com/paquettg/php-html-parser
Development requires: phpunit/phpunit 3.7.*
by Gilles Paquette

12/12 2013
1.6.0 (1.6.0.0) https://github.com/paquettg/php-html-parser
Requires: php >=5.4
Development requires: phpunit/phpunit 3.7.*
by Gilles Paquette

04/05 2013
1.5.1 (1.5.1.0) https://github.com/sunra/php-simple-html-dom-parser
Composer adaptation of: A HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way! Requires PHP 5+. Supports invalid HTML. Find tags on an HTML page with selectors just like jQuery. Extract contents from HTML in a single line.
Requires: php >=5.3.2
by Sunra
