library php-readability

Automatic article extraction from HTML

j0k3r/php-readability

  • Tuesday, June 5, 2018
  • by j0k3r
  • Repository
  • 8 Watchers
  • 106 Stars
  • 41,208 Installations
  • PHP
  • 3 Dependents
  • 0 Suggesters
  • 28 Forks
  • 3 Open issues
  • 22 Versions
  • 6 % Grown

The README.md

Readability

This library is the Readability class extracted from a full-text-rss fork. It can be seen as an improved version of the original php-readability.

Differences

The default php-readability lib is really old and needs to be improved. I found a great fork of full-text-rss from @Dither which improves the Readability class.

  • I've extracted the class from that fork so it can be used out of the box
  • I've added some simple tests
  • I've changed the coding style, run php-cs-fixer, and added a namespace

But the code is still really hard to read and understand.

Requirements

By default, this lib uses the Tidy extension if it's available. Tidy is only used to clean up the given HTML and avoid problems caused by bad HTML structure. Composer will suggest it.

Also, if you have problems parsing content without Tidy installed, please install it and try again.
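To check whether Tidy is available before parsing, a minimal sketch like the following may help. It relies only on PHP's built-in extension_loaded(). The variable name $useTidy is illustrative, and passing it as the fourth constructor argument is an assumption based on the "without Tidy" call shown in the Usage section.

```php
<?php
// Sketch: detect whether the Tidy extension is loaded in this PHP runtime.
// $useTidy is an illustrative name, not something the library defines.
$useTidy = extension_loaded('tidy');

echo $useTidy
    ? "Tidy is available: the HTML can be cleaned up before parsing.\n"
    : "Tidy is missing: consider installing it if parsing gives poor results.\n";

// Assumption: with the library installed, the flag could be passed as the
// fourth constructor argument, mirroring the "without Tidy" call in Usage:
// $readability = new Readability($html, $url, 'libxml', $useTidy);
```

Running this on the command line and on the web server separately can be useful, since the two may load different sets of extensions.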

Usage

use Readability\Readability;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';

// you can use whatever you want to retrieve the html content (Guzzle, Buzz, cURL ...)
$html = file_get_contents($url);

$readability = new Readability($html, $url);
// or without Tidy
// $readability = new Readability($html, $url, 'libxml', false);
$result = $readability->init();

if ($result) {
    // display the title of the page
    echo $readability->getTitle()->textContent;
    // display the *readability* content
    echo $readability->getContent()->textContent;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
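The example above fetches the page with a bare file_get_contents() call, which returns false on failure and has no explicit timeout. As a hedged sketch, the fetch could be wrapped in a small helper. fetchHtml() is a hypothetical function, not part of php-readability, and the timeout and user agent values are illustrative.

```php
<?php
/**
 * Hypothetical helper (not part of php-readability): fetch a page with a
 * timeout and return null instead of false or an empty string on failure.
 */
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    // The "http" context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeoutSeconds,
            'follow_location' => 1,
            'user_agent' => 'readability-example/1.0', // illustrative value
        ],
    ]);

    // The @ operator suppresses the warning emitted on failure; the false
    // return value is checked explicitly instead.
    $html = @file_get_contents($url, false, $context);

    return ($html === false || $html === '') ? null : $html;
}

// Possible usage with the library installed:
// $html = fetchHtml($url);
// if ($html !== null) {
//     $readability = new Readability($html, $url);
// }
```

Returning null keeps the failure case explicit, so Readability is never constructed with an empty document.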

If you want to debug it or check what's going on, you can inject a logger (it must implement Psr\Log\LoggerInterface; Monolog, for example):

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';
$html = file_get_contents($url);

$logger = new Logger('readability');
$logger->pushHandler(new StreamHandler('path/to/your.log', Logger::DEBUG));

$readability = new Readability($html, $url);
$readability->setLogger($logger);

The Versions

All versions share the description "Automatic article extraction from HTML", the Apache-2.0 license, and the keywords html, content, extraction, article, article extraction, content extraction. Dates are day/month year. Requirements are shown where listed.

  • dev-master (9999999-dev), 05/06 2018
  • 1.1.10 (1.1.10.0), 05/06 2018
  • 1.1.9 (1.1.9.0), 30/06 2017
  • 1.1.8 (1.1.8.0), 19/05 2017
  • 1.1.7 (1.1.7.0), 18/03 2017
  • 1.1.6 (1.1.6.0), 02/02 2017
  • 1.1.5 (1.1.5.0), 14/01 2017
  • 1.1.4 (1.1.4.0), 11/01 2017
  • 1.1.3 (1.1.3.0), 21/10 2016
  • 1.1.2 (1.1.2.0), 03/10 2016
  • 1.1.1 (1.1.1.0), 01/03 2016
  • 1.1.0 (1.1.0.0), 01/03 2016
  • v1.0.9 (1.0.9.0), 10/11 2015. Requires: php >=5.3.3
  • v1.0.8 (1.0.8.0), 23/09 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.7 (1.0.7.0), 20/09 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.6 (1.0.6.0), 15/09 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.5 (1.0.5.0), 14/09 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.4 (1.0.4.0), 24/08 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.3 (1.0.3.0), 19/08 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.2 (1.0.2.0), 11/06 2015. Requires: php >=5.3.3, ext-tidy >=1.2
  • v1.0.1 (1.0.1.0), 29/04 2015. Requires: php >=5.4, ext-tidy >=1.2
  • v1.0 (1.0.0.0), 12/12 2014. Requires: php >=5.4, ext-tidy >=1.2