library tesa

A simple library to sanitize text elements

onoi/tesa

  • Saturday, January 27, 2018
  • by mwjames
  • Repository
  • 1 Watchers
  • 2 Stars
  • 74,529 Installations
  • PHP
  • 2 Dependents
  • 0 Suggesters
  • 1 Forks
  • 0 Open issues
  • 2 Versions
  • 7 % Grown

The README.md

Tesa (text sanitizer)

The library contains a small collection of helper classes for sanitizing text or string elements of arbitrary length. Its aim is to improve search match confidence during query execution, as required by the Semantic MediaWiki project. The library is deployed independently.

Requirements

  • PHP 5.3 / HHVM 3.5 or later
  • Recommended to enable the ICU extension
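
To illustrate why the ICU extension is recommended, here is a minimal sketch of word tokenization that uses ICU word boundaries when PHP's intl extension is loaded and falls back to a rough regular-expression split otherwise. The function name `tokenizeWords` and the default locale are assumptions made for this example; this is not the library's own tokenizer code.

```php
<?php
// Illustrative helper, not part of onoi/tesa: split text into word tokens,
// preferring ICU word boundaries when the intl extension is available.
function tokenizeWords( $text, $locale = 'en' ) {
    if ( extension_loaded( 'intl' ) && class_exists( 'IntlBreakIterator' ) ) {
        $tokens = array();

        $iterator = IntlBreakIterator::createWordInstance( $locale );
        $iterator->setText( $text );

        // getPartsIterator() yields every segment between two boundaries,
        // including whitespace and punctuation, so keep only segments that
        // contain at least one letter or digit.
        foreach ( $iterator->getPartsIterator() as $part ) {
            if ( preg_match( '/[\p{L}\p{N}]/u', $part ) ) {
                $tokens[] = $part;
            }
        }

        return $tokens;
    }

    // Rough fallback without ICU: split on anything that is not a letter or digit.
    return preg_split( '/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
}

print_r( tokenizeWords( 'Hello, world!' ) );
```

The fallback is deliberately crude; ICU applies language-aware boundary rules that a single regular expression cannot reproduce.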

Installation

The recommended way to install this library is to add the following dependency to your composer.json:

{
    "require": {
        "onoi/tesa": "~0.1"
    }
}

Usage

use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

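// The factory is the entry point for sanitizers and helper services
$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );

// Limit the text length and normalize the case
$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

// Strip apostrophes and common URI scheme prefixes
$sanitizer->replace(
    array( "'", "http://", "https://", "mailto:", "tel:" ),
    array( '' )
);

// Set the minimum token length and whitelist the word 'that'
$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );

// Transliterate diacritics and Greek characters
$sanitizer->applyTransliteration(
    Transliterator::DIACRITICS | Transliterator::GREEK
);

// Tokenize the text; the Null* services skip stopword and synonym handling
$text = $sanitizer->sanitizeWith(
    $sanitizerFactory->newGenericTokenizer(),
    $sanitizerFactory->newNullStopwordAnalyzer(),
    $sanitizerFactory->newNullSynonymizer()
);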

  • SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
  • IcuWordBoundaryTokenizer is a preferred tokenizer in case the ICU extension is available
  • NGramTokenizer is provided to increase CJK match confidence in case the back-end does not provide an explicit ngram tokenizer
  • StopwordAnalyzer together with a LanguageDetector is provided as a means to reduce ambiguity of frequent "noise" words from a possible search index
  • Synonymizer currently only provides an interface
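
As a rough illustration of why n-grams help with CJK text, which has no spaces between words, the following sketch splits a string into overlapping character bigrams. The function `characterNgrams` is a hypothetical helper written for this example and is not the library's NGramTokenizer implementation; it assumes the mbstring extension is available.

```php
<?php
// Illustrative only; not the NGramTokenizer shipped with onoi/tesa.
function characterNgrams( $text, $size = 2 ) {
    // Drop whitespace so that n-grams do not span gaps between words.
    $text = preg_replace( '/\s+/u', '', $text );
    $length = mb_strlen( $text, 'UTF-8' );

    if ( $length === 0 ) {
        return array();
    }

    // Text shorter than the n-gram size is returned as a single token.
    if ( $length <= $size ) {
        return array( $text );
    }

    $ngrams = array();

    // Slide a window of $size characters over the text, one character at a time.
    for ( $i = 0; $i <= $length - $size; $i++ ) {
        $ngrams[] = mb_substr( $text, $i, $size, 'UTF-8' );
    }

    return $ngrams;
}

print_r( characterNgrams( '東京都庁' ) ); // 東京, 京都, 都庁
```

Because every bigram is indexed, a query for 都庁 can match the text above even though the back-end never identified it as a separate word.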

Contribution and support

If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.

Tests

The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the composer phpunit command from the root directory.

Release notes

  • 0.1.0 Initial release (2016-08-07)
    • Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface

Acknowledgments

  • The Transliterator uses the same diacritics conversion table as http://jsperf.com/latinize (except the German diaeresis ä, ü, and ö)
  • The stopwords used by the StopwordAnalyzer have been collected from different sources; each JSON file identifies its origin
  • CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
  • JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
  • TextCatLanguageDetector uses the wikimedia/textcat library to make predictions about a language
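
To make the idea of a diacritics conversion table concrete, here is a minimal sketch of table-driven folding with `strtr`. The four-entry table and the function name `foldDiacritics` are hypothetical and chosen only for illustration; the table used by the Transliterator is far larger and is not reproduced here.

```php
<?php
// Hypothetical excerpt for illustration; not the table used by onoi/tesa.
function foldDiacritics( $text ) {
    $table = array(
        'é' => 'e',
        'ñ' => 'n',
        'ç' => 'c',
        'Å' => 'A',
    );

    // strtr() with an array replaces each key with its value in one pass,
    // which also works for multi-byte UTF-8 keys.
    return strtr( $text, $table );
}

echo foldDiacritics( 'Curaçao café' ), "\n"; // Curacao cafe
```

Folding both the indexed text and the query through the same table lets "cafe" match "café".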

License

GNU General Public License 2.0 or later.

The Versions

  • dev-master (9999999-dev), 2018-01-27, GPL-2.0+ / GPL-2.0-or-later, https://github.com/onoi/tesa
  • 0.1.0 (0.1.0.0), 2016-08-07, GPL-2.0+, https://github.com/onoi/tesa