
codeguy/arachnid

A crawler to find all unique internal pages on a given website

  • Monday, December 26, 2016
  • by codeguy
  • 20 Watchers
  • 145 Stars
  • 3,121 Installations
  • PHP
  • 2 Dependents
  • 0 Suggesters
  • 47 Forks
  • 5 Open issues
  • 7 Versions
  • 9% Growth

The README.md

Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.

This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related information, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keywords and canonicalLink.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.


How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

```json
{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}
```

Then run `composer install`.
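
As a quick sanity check that the package autoloads, a minimal sketch (the class name mirrors the examples below):

```php
<?php
// Verify the install: Composer's autoloader should resolve the Crawler class.
require __DIR__ . '/vendor/autoload.php';

var_dump(class_exists(\Arachnid\Crawler::class)); // bool(true) when installed correctly
```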

Getting Started

Basic Usage:

Here's a quick demo to crawl a website:

```php
<?php
require 'vendor/autoload.php';

// create the crawler with a start URL and a maximum link depth
$crawler = new \Arachnid\Crawler('http://www.example.com', 3);
$crawler->traverse();

// Get link data
$links = $crawler->getLinksArray(); //to get links as objects use getLinks() method
print_r($links);
```
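
Each entry in the returned array describes one crawled page. The exact shape depends on the library version, but a sketch along these lines shows how you might read the SEO fields listed above (the key names here are assumptions):

```php
<?php
// Hypothetical sketch: iterate the collected pages and print a couple of the
// SEO-related fields. Key names ('statusCode', 'title') are assumed from the
// field list above; adjust to whatever print_r() shows for your version.
foreach ($crawler->getLinksArray() as $uri => $info) {
    echo $uri, ' => status: ', $info['statusCode'] ?? 'n/a',
         ', title: ', $info['title'] ?? 'n/a', PHP_EOL;
}
```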

### Enabling Headless Browser mode:

Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.

The `enableHeadlessBrowserMode` method sets the scraping adapter to `PantherChromeAdapter`, which is based on the [Symfony Panther](https://github.com/symfony/panther) library:
```php
    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->enableHeadlessBrowserMode()
            ->traverse()
            ->getLinksArray();
```

In order to use this, you need to have [chrome-driver](https://sites.google.com/a/chromium.org/chromedriver/) installed on your machine. You can use `dbrekelmans/browser-driver-installer` to install chromedriver locally:
```
composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers
```
    
## Advanced Usage:

Set additional options on the underlying HTTP client either by passing an array of options to the constructor or by creating an HTTP client scraper with the desired options:

```php
<?php
use Arachnid\Adapters\CrawlingFactory; // namespace assumed; adjust to your installed version

// pass options for the underlying http client as the third constructor argument,
// e.g. basic-auth credentials
$clientOptions = ['auth_basic' => array('username', 'password')];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);

//or by creating and setting scrap client
$options = array(
    'verify_host' => false,
    'verify_peer' => false,
    'timeout' => 30,
);

$scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
$crawler->setScrapClient($scrapperClient);
```

You can inject a [PSR-3](https://www.php-fig.org/psr/psr-3/) compliant logger object to monitor crawler activity (like [Monolog](https://github.com/Seldaek/monolog)):
```php
<?php
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);
?>
```

You can set the crawler to visit only pages matching specific criteria by specifying a callback closure using the `filterLinks` method:

```php
<?php
//filter links according to specific callback as closure
$links = $crawler->filterLinks(function($link) {
                    //crawling only links with /blog/ prefix
                    return (bool)preg_match('/.*\/blog.*$/u', $link);
                })
                ->traverse()
                ->getLinks();
```
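
The same mechanism works for excluding links; for example, a sketch that skips common binary assets so only HTML pages are traversed:

```php
<?php
// Keep a link only when the closure returns true; here we drop obvious binary assets.
$links = $crawler->filterLinks(function ($link) {
                    return !preg_match('/\.(jpe?g|png|gif|pdf|zip)$/i', $link);
                })
                ->traverse()
                ->getLinks();
```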

You can use the `LinksCollection` class to get simple statistics about the links, as follows:

```php
<?php
$links = $crawler->traverse()
                 ->getLinks();
$collection = new LinksCollection($links);

//getting broken links
$brokenLinks = $collection->getBrokenLinks();

//getting links for specific depth
$depth2Links = $collection->getByDepth(2);

//getting external links inside site
$externalLinks = $collection->getExternalLinks();
```
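
If you only need rough totals, a minimal sketch like the following works, assuming the collections returned above are countable (an assumption about the underlying collection type):

```php
<?php
// Summarize the groupings above; count() assumes the returned collections
// implement Countable (swap in iterator_to_array() first if they do not).
echo 'Broken links:   ', count($brokenLinks), PHP_EOL;
echo 'Depth-2 links:  ', count($depth2Links), PHP_EOL;
echo 'External links: ', count($externalLinks), PHP_EOL;
```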

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with corresponding unit test
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

  • PHP 7.2.0+

Authors

License

MIT Public License

The Versions

Every release carries the same description ("A crawler to find all unique internal pages on a given website"), the MIT license, the keywords search, spider, scrape, crawl, and the repository URL http://github.com/codeguy/arachnid.

| Date       | Version    | Normalized version |
|------------|------------|--------------------|
| 26/12 2016 | dev-master | 9999999-dev        |
| 25/12 2016 | 1.1        | 1.1.0.0            |
| 02/11 2015 | 1.0.4      | 1.0.4.0            |
| 12/09 2015 | 1.0.3      | 1.0.3.0            |
| 10/01 2014 | v1.0.2     | 1.0.2.0            |
| 06/01 2014 | 1.0.1      | 1.0.1.0            |
| 06/01 2014 | 1.0.0      | 1.0.0.0            |