
library php-tika-wrapper

This is a simple PHP Wrapper for Apache Tika (using the tika-app jar)

ninoskopac/php-tika-wrapper

  • Sunday, April 23, 2017
  • by NinoSkopac

The README.md

This is a simple PHP wrapper for Apache Tika.

It lets the developer retrieve text, metadata, and language from complex documents.

Supported formats

It supports OpenDocument, Office .doc and .docx, PDF, images, videos, and many more formats.

See http://tika.apache.org/1.1/formats.html for details.

Install with composer

Add the package dependency ninoskopac/php-tika-wrapper to your composer.json:

    {
        "require": {
            "ninoskopac/php-tika-wrapper": "~1.0" 
        }   
    }

Install the new package with Composer, and that's it!

    php composer.phar install

For convenience, the package includes the tika-app jar file, which is quite big (25 MB), so the download may take some time.

See http://packagist.org for more details. (Don't forget to add require 'vendor/autoload.php'; to the PHP file that bootstraps your project.)

Example installation/usage

See example/ for a working example (more docs to come soon):

    git clone git@github.com:pierroweb/PhpTikaWrapper.git
    cd PhpTikaWrapper

    cd example/with-composer
    curl -s http://getcomposer.org/installer | php
    php composer.phar install
    php usage.php

Usage

In your own project, assuming you have an OpenDocument file test.odt in the current directory:

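    <?php
    // Load Composer's autoloader (adjust the path to your project layout).
    require __DIR__."/vendor/autoload.php";

    use Enzim\Lib\TikaWrapper\TikaWrapper;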

    $testFile = __DIR__."/test.odt";

    $plaintext = TikaWrapper::getText($testFile);
    $metadataArray = TikaWrapper::getMetaData($testFile);
    $language = TikaWrapper::getLanguage($testFile);

Available methods (each takes a single string argument, the full path of the file):

  • getText($file) returns a string containing the document in plain-text
  • getTextMain($file) returns a string containing only the main text of the doc
  • getXHTML($file) returns a string containing an XHTML (xml-valid) conversion of the document
  • getHTML($file) returns a string containing an HTML conversion of the document
  • getContentType($file) returns the content type of the document. Example outputs for OpenDocument, .docx, and PDF:

    application/vnd.oasis.opendocument.text
    application/vnd.openxmlformats-officedocument.wordprocessingml.document
    application/pdf
  • getLanguage($file) returns the language of the document. Example output: en for English, fr for French, etc.

  • getMetaData($file) returns a PHP array with the metadata. Example:

    Array  
    (
        [Character Count] => 41
        [Content-Length] => 8686
        [Content-Type] => application/vnd.oasis.opendocument.text
        [Creation-Date] => 2012-04-12T11:44:14
        [Edit-Time] => PT00H00M39S
        [Image-Count] => 0
        [Object-Count] => 0
        [Page-Count] => 1
        [Paragraph-Count] => 2
        [Table-Count] => 0
        [Word-Count] => 9
        [creator] => *******
        [date] => 2012-04-12T11:44:52
        [editing-cycles] => 1
        [generator] => OpenOffice.org/3.2$Linux OpenOffice.org_project/320m12$Build-9483
        [initial-creator] => Pierre B
        [nbCharacter] => 41
        [nbImg] => 0
        [nbObject] => 0
        [nbPage] => 1
        [nbPara] => 2
        [nbTab] => 0
        [nbWord] => 9
        [resourceName] => test.odt
        [xmpTPg:NPages] => 1
    )
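
As a small illustration of working with the array returned by getMetaData(), the sketch below picks a few fields out of it. The helper name summarizeMetadata is hypothetical and not part of the library. The keys are taken from the sample output above; other document types may well expose different keys, so missing ones fall back to null.

```php
<?php
// Hypothetical helper (not part of the library): pick a few fields out of
// the array returned by TikaWrapper::getMetaData(), tolerating missing keys.
function summarizeMetadata(array $metadata): array
{
    return [
        'type'  => $metadata['Content-Type'] ?? null,
        'pages' => isset($metadata['Page-Count']) ? (int) $metadata['Page-Count'] : null,
        'words' => isset($metadata['Word-Count']) ? (int) $metadata['Word-Count'] : null,
    ];
}

// Values copied from the sample output above; in a real project the array
// would come from TikaWrapper::getMetaData($file).
$sample = [
    'Content-Type' => 'application/vnd.oasis.opendocument.text',
    'Page-Count'   => '1',
    'Word-Count'   => '9',
];

print_r(summarizeMetadata($sample));
```

Casting the counts to integers makes them safe to compare or sum, whether the wrapper returns them as strings or as numbers.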

TODO

  • Add a pretty-print option (using tika-app's -r option for HTML/XHTML output)
  • Allow transparent use of tika-server, to avoid loading the JVM on each request
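
Since the wrapper relies on the tika-app jar, each call starts a new JVM, which is what the second TODO item is about. The sketch below shows roughly how an output mode could map to a tika-app command line. The function buildTikaCommand, the mode names, and the jar path are hypothetical, and this is not the library's actual implementation; only the flags themselves (--text, --metadata, -r, and so on) are tika-app's own command-line options.

```php
<?php
// Hypothetical sketch, not the library's code: map an output mode to the
// corresponding tika-app command-line flag and build a shell command.
function buildTikaCommand(string $jar, string $mode, string $file, bool $pretty = false): string
{
    $flags = [
        'text'     => '--text',
        'textMain' => '--text-main',
        'xhtml'    => '--xml',
        'html'     => '--html',
        'metadata' => '--metadata',
        'language' => '--language',
        'detect'   => '--detect',
    ];
    if (!isset($flags[$mode])) {
        throw new InvalidArgumentException("Unknown mode: $mode");
    }

    // Quote the jar and file paths so spaces and shell metacharacters are safe.
    $parts = ['java', '-jar', escapeshellarg($jar), $flags[$mode]];
    if ($pretty) {
        $parts[] = '-r'; // tika-app's pretty-print option
    }
    $parts[] = escapeshellarg($file);

    return implode(' ', $parts);
}

// The jar path below is a placeholder.
echo buildTikaCommand('/path/to/tika-app.jar', 'html', 'test.odt', true), PHP_EOL;
```

Running a command like this for every request pays the JVM start-up cost each time, which is why a long-lived tika-server is attractive.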

Support for Tika-server

Tika-server is supported by this library: https://github.com/vaites/php-apache-tika

Credits

  • http://tika.apache.org
  • It uses the Symfony Process component http://symfony.com/doc/current/components/process.html

The Versions

  • dev-master (9999999-dev), 23/04 2017
  • 1.0.4 (1.0.4.0), 20/04 2017
  • 1.0.3 (1.0.3.0), 20/04 2017
  • 1.0.2 (1.0.2.0), 20/04 2017
  • 1.0.1 (1.0.1.0), 20/04 2017
  • 1.0 (1.0.0.0), 04/01 2017

Keywords: pdf tika doc odt docx apache tika text processing