# Arachnid Web Crawler
This library will crawl all unique internal links found on a given website
up to a specified maximum page depth.
This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related info, including:

- title
- h1 elements
- h2 elements
- statusCode
- contentType
- meta description
- meta keyword
- canonicalLink
This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte
Josh Lockhart adapted the original blog post's code (with permission)
for Composer and Packagist and updated the syntax to conform with
the PSR-2 coding standard.
## How to Install
You can install this library with Composer. Drop this into your `composer.json` manifest file:
```json
{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}
```
Then run `composer install`.
## Getting Started
### Basic Usage:
Here's a quick demo to crawl a website:
```php
<?php
// Initiate crawling 'http://www.example.com' up to 3 levels deep
$crawler = new \Arachnid\Crawler('http://www.example.com', 3);
$crawler->traverse();

// Get link data
$links = $crawler->getLinksArray(); //to get links as objects use getLinks() method
print_r($links);
```
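For illustration, here is a minimal sketch of inspecting the returned data; it assumes each array entry is keyed by the discovered URI and holds whatever metadata was collected for that page (the exact fields may vary by version):

```php
<?php
// Dump every collected entry; print_r shows whichever fields
// (title, status code, depth, ...) the crawler gathered.
foreach ($crawler->getLinksArray() as $uri => $info) {
    echo $uri, PHP_EOL;
    print_r($info);
}
```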
### Enabling Headless Browser mode:
Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.
The `enableHeadlessBrowserMode` method sets the scraping adapter to `PantherChromeAdapter`, which is based on the [Symfony Panther](https://github.com/symfony/panther) library:
```php
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->enableHeadlessBrowserMode()
->traverse()
->getLinksArray();
```
In order to use this, you need to have [chrome-driver](https://sites.google.com/a/chromium.org/chromedriver/) installed on your machine. You can use `dbrekelmans/browser-driver-installer` to install chromedriver locally:
```
composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers
```
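If the driver binary is not picked up automatically, Symfony Panther also honours the `PANTHER_CHROME_DRIVER_BINARY` environment variable; a minimal sketch, assuming chromedriver was installed into the `drivers` directory as above:

```php
<?php
// Point Panther at the locally installed chromedriver binary
// (only needed if it is not detected automatically).
putenv('PANTHER_CHROME_DRIVER_BINARY=' . __DIR__ . '/drivers/chromedriver');
```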
## Advanced Usage:
Set additional options for the underlying HTTP client by specifying an array of options in the constructor,
or by creating the HTTP client scraper with the desired options:
```php
<?php
use \Arachnid\Adapters\CrawlingFactory;

// third parameter is an array of options for the underlying http client
$clientOptions = ['auth_basic' => array('username', 'password')];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);

//or by creating and setting scrap client
$options = array(
    'verify_host' => false,
    'verify_peer' => false,
    'timeout' => 30,
);
$scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
$crawler->setScrapClient($scrapperClient);
```
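Other options accepted by the underlying Symfony HttpClient, such as custom headers or a proxy, can be passed the same way. A hedged sketch, assuming the options array is forwarded to the client as-is:

```php
<?php
$clientOptions = [
    'headers' => ['User-Agent' => 'ArachnidBot/1.0'], // custom request headers
    'proxy'   => 'http://127.0.0.1:8080',             // route requests through a proxy
    'timeout' => 15,
];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);
```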
You can inject a [PSR-3](https://www.php-fig.org/psr/psr-3/) compliant logger object to monitor crawler activity (like [Monolog](https://github.com/Seldaek/monolog)):
```php
<?php
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);
```
You can set the crawler to visit only pages that match specific criteria by passing a callback closure to the `filterLinks` method:
```php
<?php
//filter links according to specific callback as closure
$links = $crawler->filterLinks(function($link) {
        //crawling only links with /blog/ prefix
        return (bool)preg_match('/.*\/blog.*$/u', $link);
    })
    ->traverse()
    ->getLinks();
```
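As another illustrative sketch using the same `filterLinks` callback, you could skip common binary assets so that only HTML pages are crawled:

```php
<?php
// Sketch: exclude links pointing to common binary file extensions.
$links = $crawler->filterLinks(function($link) {
        return !preg_match('/\.(jpe?g|png|gif|pdf|zip)$/i', $link);
    })
    ->traverse()
    ->getLinks();
```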
You can use the `LinksCollection` class to get simple statistics about the links, as follows:
```php
<?php
use Arachnid\LinksCollection;

$links = $crawler->traverse()
                 ->getLinks();
$collection = new LinksCollection($links);

//getting broken links
$brokenLinks = $collection->getBrokenLinks();

//getting links for specific depth
$depth2Links = $collection->getByDepth(2);

//getting external links inside site
$externalLinks = $collection->getExternalLinks();
```
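For illustration, a small sketch of reporting the results, assuming the returned collections can be counted and iterated like standard collections:

```php
<?php
// Sketch: print a summary of the broken links that were found.
echo count($brokenLinks) . ' broken links found' . PHP_EOL;
foreach ($brokenLinks as $uri => $info) {
    echo $uri . PHP_EOL;
}
```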
## How to Contribute
- Fork this repository
- Create a new branch for each feature or improvement
- Apply your code changes along with corresponding unit tests
- Send a pull request from each feature branch
It is very important to separate new features or improvements into separate feature branches,
and to send a pull request for each branch. This allows me to review and pull in new features
or improvements individually.

All pull requests must adhere to the PSR-2 standard.
## System Requirements

## Authors
## License

MIT Public License