PHPOAIPMH
A PHP OAI-PMH harvester client library
, (*1)
This library provides an interface to harvest OAI-PMH metadata
from any OAI 2.0 compliant endpoint., (*2)
Features:
* PSR-12 Compliant
* Composer-compatible
* Unit-tested
* Prefers Guzzle (v6, v7, or v5) for HTTP transport layer, but can fall back to cURL, or implement your own
* Easy-to-use iterator that hides all the HTTP junk necessary to get paginated records, (*3)
Installation Options
Install via Composer by including the following in your composer.json file:, (*4)
{
"require": {
"caseyamcl/phpoaipmh": "^3.0",
"guzzlehttp/guzzle": "^7.0"
}
}
Or, drop the src
folder into your application and use a PSR-4 autoloader to include the files., (*5)
Note: Guzzle v6.0 or v7.0 is recommended, but if you do not wish to use Guzzle v6 for whatever reason, you can
use any one of the following:, (*6)
- Guzzle 5.0 - You can use Guzzle v5 instead of v6.
- cURL - This library will fall back to using cURL if Guzzle is not installed.
- Build your own - You can use a different HTTP client library by passing your own
implementation of the
Phpoaipmh\HttpAdapter\HttpAdapterInterface
to the Phpoaipmh\Client
constructor.
Upgrading
There are several backwards-incompatible API improvements in major version changes. See UPGRADE.md for
information about how to upgrade your code to use the new version., (*7)
Usage
Setup a new endpoint client:, (*8)
// Quick and easy 'build' method
$myEndpoint = \Phpoaipmh\Endpoint::build('http://some.service.com/oai');
// Or, create your own client instance and pass it to `Endpoint::__construct()`
$client = new \Phpoaipmh\Client('http://some.service.com/oai');
$myEndpoint = new \Phpoaipmh\Endpoint($client);
Get basic information:, (*9)
// Result will be a SimpleXMLElement object
$result = $myEndpoint->identify();
var_dump($result);
// Results will be iterator of SimpleXMLElement objects
$results = $myEndpoint->listMetadataFormats();
foreach($results as $item) {
var_dump($item);
}
Retrieving records
// Recs will be an iterator of SimpleXMLElement objects
$recs = $myEndpoint->listRecords('someMetaDataFormat');
// The iterator will continue retrieving items across multiple HTTP requests.
// You can keep running this loop through the *entire* collection you
// are harvesting. All OAI-PMH and HTTP pagination logic is hidden neatly
// behind the iterator API.
foreach($recs as $rec) {
var_dump($rec);
}
Limiting record retrieval by date/time
Simply pass instances of DateTimeInterface
to Endpoint::listRecords()
or Endpoint::listIdentifiers()
as
arguments two and three, respectively., (*10)
If you want one and not another, you can pass null
for either argument., (*11)
// Retrieve records from Jan 1, 2018 through October 1, 2018
$recs = $myEndpoint->listRecords('someMetaDataFormat', new \DateTime('2018-01-01'), new \DateTime('2018-10-01'));
foreach($recs as $rec) {
var_dump($rec);
}
Setting date/time granularity
This library will attempt to retrieve granularity automatically from the OAI-PMH
Identify
endpoint, but in case you want to set it your self manually, you can pass
an instance of Granularity
to the Endpoint
constructor:, (*12)
use Phpoaipmh\Client,
Phpoaipmh\Endpoint,
Phpoaipmh\Granularity;
$client = new Client('http://some.service.com/oai');
$myEndpoint = new Endpoint($client, Granularity::DATE_AND_TIME);
Record sets
Some OAI-PMH endpoints sub-divide records into sets., (*13)
You can list the record sets available for a given endpoint by calling Endpoint::listSets()
:, (*14)
foreach ($myEndpoint->listSets() as $set) {
var_dump($set);
}
You can specify the set you wish to retrieve by passing the set name as the fourth argument to
Endpoint::listIdentifiers()
or Endpoint::listRecords()
:, (*15)
foreach ($myEndpoint->listRecords('someMetadataFormat', null, null 'someSetName') as $record) {
var_dump($record);
}
Getting total record count
Some endpoints provide a total record count for your query. If the endpoint
provides this, you can access this value by calling: RecordIterator::getTotalRecordCount()
., (*16)
If the endpoint does not provide this count, then RecordIterator::getTotalRecordCount()
returns null
., (*17)
$iterator = $myEndpoint->listRecords('someMetaDataFormat');
echo "Total count is " . ($iterator->getTotalRecordCount() ?: 'unknown');
Handling Results
Depending on the verb you use, the library will send back either a SimpleXMLELement
or an iterator containing SimpleXMLElement
objects., (*18)
- For
identify
and getRecord
, a SimpleXMLElement
object is returned
- For
listMetadataFormats
, listSets
, listIdentifiers
, and listRecords
a Phpoaipmh\ResponseIterator
is returned
The Phpoaipmh\ResponseIterator
object encapsulates the logic to iterate through paginated sets of records., (*19)
Handling Errors
This library will throw different exceptions under different circumstances:, (*20)
- HTTP request errors will generate a
Phpoaipmh\Exception\HttpException
- Response body parsing issues (e.g. invalid XML) will generate a
Phpoaipmh\Exception\MalformedResponseException
- OAI-PMH protocol errors (e.g. invalid verb or missing params) will generate a
Phpoaipmh\Exception\OaipmhException
All exceptions extend the Phpoaipmh\Exception\BaseoaipmhException
class., (*21)
Customizing Default Request Options
You can customize the default request options (for example, request timeout) for both cURL and Guzzle
clients by building the adapter objects manually., (*22)
If you're using Guzzle v6, you can set default options by building your own
Guzzle client and setting parameters in the constructor:, (*23)
use GuzzleHttp\Client as GuzzleClient;
use Phpoaipmh\Client;
use Phpoaipmh\Endpoint;
use Phpoaipmh\HttpAdapter\GuzzleAdapter;
$guzzle = new GuzzleAdapter(new GuzzleClient([
'connect_timeout' => 2.0,
'timeout' => 10.0
]));
$myEndpoint = new Endpoint(new Client('http://some.service.com/oai', $guzzle));
If you're using cURL, you can set request options by passing them in as an
array of key/value items to CurlAdapter::setCurlOpts()
:, (*24)
use Phpoaipmh\Client,
Phpoaipmh\HttpAdapter\CurlAdapter;
$adapter = new CurlAdapter();
$adapter->setCurlOpts([CURLOPT_TIMEOUT => 120]);
$client = new Client('http://some.service.com/oai', $adapter);
$myEndpoint = new Endpoint($client);
If you're using Guzzle v5, you can set default options by building your own
Guzzle client,, (*25)
use Phpoaipmh\Client,
Phpoaipmh\HttpAdapter\GuzzleAdapter;
$adapter = new GuzzleAdapter();
$adapter->getGuzzleClient()->setDefaultOption('timeout', 120);
$client = new Client('http://some.service.com/oai', $adapter);
$myEndpoint = new Endpoint($client);
Dealing with XML Namespaces
Many OAI-PMH XML documents make use of XML Namespaces. For non-XML experts, it can be confusing to implement
these in PHP. SitePoint has a brief but excellent overview of how to use Namespaces in SimpleXML., (*26)
The Phpoaipmh\RecordIterator
iterator contains some helper methods:, (*27)
-
getNumRequests()
- Returns the number of HTTP requests made thus far
-
getNumRetrieved()
- Returns the number of individual records retrieved
-
reset()
- Resets the iterator, which will restart the record retrieval from scratch.
Handling 503 Retry-After
Responses
Some OAI-PMH endpoints employ rate-limiting so that you can only make X number
of requests in a given time period. These endpoints will return a 503 Retry-AFter
HTTP status code if your code generates too many HTTP requests too quickly., (*28)
Guzzle v6
If you have installed Guzzle v6, then you can use the
Guzzle-Retry-Middleware library
to automatically handle OAI-PMH endpoint rate limiting rules., (*29)
First, include the middleware as a dependency in your app:, (*30)
composer require caseyamcl/guzzle_retry_middleware
Then, when loading the Phpoaipmh libraries, build a Guzzle client manually, and add
the middleware to the stack. Example:, (*31)
use GuzzleRetry\GuzzleRetryMiddleware;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;
// Setup the the Guzzle client with the retry middleware
$stack = HandlerStack::create();
$stack->push(GuzzleRetryMiddleware::factory());
$guzzleClient = new GuzzleClient(['handler' => $stack]);
// Setup the Guzzle adpater and PHP OAI-PMH client
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter($guzzleClient);
$client = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);
This will create a client that automatically retries requests when OAI-PMH endpoints send
503
rate-limiting responses., (*32)
The Retry middleware contains a number of options. Refer to the README for that package
for details., (*33)
Guzzle v5
If you have installed Guzzle v5, then you can use the
Retry-Subscriber to automatically
handle OAI-PMH endpoint rate-limiting rules., (*34)
First, include the retry-subscriber as a dependency in your composer.json
:, (*35)
require: {
/* ... */
"guzzlehttp/retry-subscriber": "~2.0"
}
Then, when loading the Phpoaipmh libraries, instantiate the Guzzle adapter
manually, and add the subscriber as indicated in the code below:, (*36)
// Create a Retry Guzzle Subscriber
$retrySubscriber = new \GuzzleHttp\Subscriber\Retry\RetrySubscriber([
'delay' => function($numRetries, \GuzzleHttp\Event\AbstractTransferEvent $event) {
$waitSecs = $event->getResponse()->getHeader('Retry-After') ?: '5';
return ($waitSecs * 1000) + 1000; // wait one second longer than the server said to
},
'filter' => \GuzzleHttp\Subscriber\Retry\RetrySubscriber::createStatusFilter(),
]);
// Manually create a Guzzle HTTP adapter
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter();
$guzzleAdapter->getGuzzleClient()->getEmitter()->attach($retrySubscriber);
$client = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);
This will create a client that automatically retries requests when OAI-PMH endpoints send
503
rate-limiting responses., (*37)
Sending Arbitrary Query Parameters
If you wish to send arbitrary HTTP query parameters with your requests, you can
send them via the \Phpoaipmh\Client
class:, (*38)
$client = new \Phpoaipmh\Client('http://some.service.com/oai');
$client->request('Identify', ['some' => 'extra-param']);
Alternatively, if you wish to send arbitrary parameters while taking advantage of the
convenience of the \Phpoaipmh\Endpoint
class, you can use the Guzzle Param Middleware
library:, (*39)
First, include the middleware as a dependency in your app:, (*40)
$ composer require emarref/guzzle-param-middleware
Then, when loading the Phpoaipmh libraries, build a Guzzle client manually, and add
the middleware to the stack. Example:, (*41)
use Emarref\Guzzle\Middleware\ParamMiddleware
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
// Setup the the Guzzle stack
$stack = HandlerStack()::create();
$stack->push(new ParamMiddleware(['api_key' => 'xyz123']));
// Setup Guzzle client, adapter, and PHP OAI-PMH client
$guzzleClient = new GuzzleClient(['handler' => $stack])
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter($guzzleClient)
$client = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);
This will add the specified query parameters to all requests for the client., (*42)
Sending arbitrary query parameters with Guzzle v5
If you are using Guzzle v5, you can use the Guzzle event system:, (*43)
// Create a function or class to add parameters to a request
$addParamsListener = function(\GuzzleHttp\Event\BeforeEvent $event) {
$req = $event->getRequest();
$req->getQuery()->add('api_key', 'xyz123');
// You could do other things to the request here, too, like adding a header..
$req->addHeader('Some-Header', 'some-header-value');
};
// Manually create a Guzzle HTTP adapter
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter();
$guzzleAdapter->getGuzzleClient()->getEmitter()->on('before', $addParamsListener);
$client = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);
Implementation Tips
Harvesting data from a OAI-PMH endpoint can be a time-consuming task, especially when there are lots of records.
Typically, this kind of task is done via a CLI script or background process that can run for a long time.
It is not normally a good idea to make it part of a web request., (*44)
Credits
License
MIT License; see LICENSE file for details, (*45)