This is a simple PHP wrapper for Apache Tika.
It allows the developer to retrieve text, metadata and language from complex
documents.
It supports OpenDocument, Office .doc and .docx, PDF, images, videos and
a lot more!
See http://tika.apache.org/1.1/formats.html for details.
Install with composer
Add the package dependency enzim/tika-wrapper
in your composer.json:
{
"require": {
"ninoskopac/php-tika-wrapper": "~1.0"
}
}
Install the new package with composer, and that's it!
php composer.phar install
For convenience, the package includes the tika-app jar file, which is
quite big (25MB). The download may take time!
See http://packagist.org for more details. (Don't forget to add
require 'vendor/autoload.php';
in your autoloading PHP file.)
Example installation/usage
See example/ (more docs to come soon) for an example:
git clone git@github.com:pierroweb/PhpTikaWrapper.git
cd PhpTikaWrapper
cd example/with-composer
curl -s http://getcomposer.org/installer | php
php composer.phar install
php usage.php
Usage
In your own project, assuming you have an OpenDocument file test.odt in the
current directory:
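<?php
// Load Composer's autoloader (adjust the path to your vendor directory).
require __DIR__.'/vendor/autoload.php';

use Enzim\Lib\TikaWrapper\TikaWrapper;

$testFile = __DIR__."/test.odt";
$plaintext = TikaWrapper::getText($testFile);          // plain-text content
$metadataArray = TikaWrapper::getMetaData($testFile);  // PHP array of metadata
$language = TikaWrapper::getLanguage($testFile);       // language code, e.g. en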
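Since the calls above need a real file on disk, it can help to check the path before handing it to the wrapper. The following is only a minimal sketch: `extractIfReadable` is a hypothetical helper that is not part of this library, and it assumes that failures surface as exceptions, which this README does not confirm.

```php
<?php
// Hypothetical helper (not part of the wrapper): run an extractor only on
// readable files and return null instead of propagating a failure.
function extractIfReadable(string $file, callable $extractor): ?string
{
    if (!is_file($file) || !is_readable($file)) {
        return null; // nothing usable to hand to Tika
    }
    try {
        return (string) $extractor($file);
    } catch (\Throwable $e) {
        // Assumption: the wrapper reports failures by throwing.
        return null;
    }
}
```

With the wrapper loaded, the extractor could be passed as a callable, for example extractIfReadable($testFile, [TikaWrapper::class, 'getText']).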
Available methods (they all take a string, the full path of the file, as argument):
- getText($file): returns a string containing the document in plain text
- getTextMain($file): returns a string containing only the main text of the document
- getXHTML($file): returns a string containing an XHTML (XML-valid) conversion of the document
- getHTML($file): returns a string containing an HTML conversion of the document
- getContentType($file): returns the content type of the document. Example outputs for OpenDocument, docx and PDF:
application/vnd.oasis.opendocument.text
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/pdf
- getLanguage($file): returns the language of the document. Example output: en for English, fr for French, etc.
- getMetaData($file): returns a PHP array with the metadata. Example:
Array
(
[Character Count] => 41
[Content-Length] => 8686
[Content-Type] => application/vnd.oasis.opendocument.text
[Creation-Date] => 2012-04-12T11:44:14
[Edit-Time] => PT00H00M39S
[Image-Count] => 0
[Object-Count] => 0
[Page-Count] => 1
[Paragraph-Count] => 2
[Table-Count] => 0
[Word-Count] => 9
[creator] => *******
[date] => 2012-04-12T11:44:52
[editing-cycles] => 1
[generator] => OpenOffice.org/3.2$Linux OpenOffice.org_project/320m12$Build-9483
[initial-creator] => Pierre B
[nbCharacter] => 41
[nbImg] => 0
[nbObject] => 0
[nbPage] => 1
[nbPara] => 2
[nbTab] => 0
[nbWord] => 9
[resourceName] => test.odt
[xmpTPg:NPages] => 1
)
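Because getMetaData returns a plain PHP array, individual fields can be read with ordinary array access. The sketch below is illustrative only: `summarizeMetadata` is a hypothetical helper, not part of the library, and it assumes the keys shown in the example output above, which may be missing for other file types.

```php
<?php
// Hypothetical helper: pick a few fields out of a metadata array shaped like
// the example output above. Missing keys fall back to null.
function summarizeMetadata(array $meta): array
{
    $created = $meta['Creation-Date'] ?? null;

    return [
        'type'    => $meta['Content-Type'] ?? null,
        'pages'   => isset($meta['Page-Count']) ? (int) $meta['Page-Count'] : null,
        'words'   => isset($meta['Word-Count']) ? (int) $meta['Word-Count'] : null,
        // DateTimeImmutable throws on an unparseable date string.
        'created' => $created !== null ? new DateTimeImmutable($created) : null,
    ];
}

// Sample values copied from the example output above.
$summary = summarizeMetadata([
    'Content-Type'  => 'application/vnd.oasis.opendocument.text',
    'Creation-Date' => '2012-04-12T11:44:14',
    'Page-Count'    => '1',
    'Word-Count'    => '9',
]);
```

The counts are cast to integers so the result is the same whether the array holds them as strings or as numbers.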
TODO
- set a pretty print option (to use option -r for html/xhtml)
- allow transparent use of tika-server to avoid loading the JVM on
each request
Support for Tika-server
Supported by this library: https://github.com/vaites/php-apache-tika
Credits
- http://tika.apache.org
- It uses the Symfony Process component
http://symfony.com/doc/current/components/process.html