Tesa (text sanitizer)
The library contains a small collection of helper classes that sanitize
text or string elements of arbitrary length, with the aim of improving
search match confidence during query execution. It is required by the
Semantic MediaWiki project but is deployed independently.
Requirements
- PHP 5.3 / HHVM 3.5 or later
- Recommended to enable the ICU extension
Installation
The recommended installation method for this library is to add
the following dependency to your composer.json:
{
    "require": {
        "onoi/tesa": "~0.1"
    }
}
Usage
use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );

$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

$sanitizer->replace(
    array( "'", "http://", "https://", "mailto:", "tel:" ),
    array( '' )
);

$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );

$sanitizer->applyTransliteration(
    Transliterator::DIACRITICS | Transliterator::GREEK
);

$text = $sanitizer->sanitizeWith(
    $sanitizerFactory->newGenericTokenizer(),
    $sanitizerFactory->newNullStopwordAnalyzer(),
    $sanitizerFactory->newNullSynonymizer()
);
- SanitizerFactory is expected to be the sole entry point for services and
  instances when used outside of this library
- IcuWordBoundaryTokenizer is the preferred tokenizer when the ICU extension
  is available
- NGramTokenizer is provided to increase CJK match confidence in case the
  back-end does not provide an explicit ngram tokenizer
- StopwordAnalyzer, together with a LanguageDetector, is provided as a means
  to reduce the ambiguity that frequent "noise" words introduce into a
  possible search index
- Synonymizer currently only provides an interface
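
To illustrate why n-gram tokenization helps with CJK text, which has no whitespace between words, the following is a minimal, self-contained sketch of overlapping character n-grams. The `ngrams` function is a hypothetical helper written for this example, not part of the library's API, and it does not claim to mirror how NGramTokenizer is implemented. It assumes the mbstring extension is available.

```php
<?php

/**
 * Hypothetical helper (NOT part of onoi/tesa): splits a UTF-8 string
 * into overlapping character n-grams.
 */
function ngrams( $text, $n = 2 ) {
    // Remove whitespace so that no n-gram contains a space
    $text = preg_replace( '/\s+/u', '', $text );
    $length = mb_strlen( $text, 'UTF-8' );

    // Text no longer than n is returned as a single token (or none if empty)
    if ( $length <= $n ) {
        return $length > 0 ? array( $text ) : array();
    }

    // Slide a window of n characters over the text, one character at a time
    $tokens = array();
    for ( $i = 0; $i <= $length - $n; $i++ ) {
        $tokens[] = mb_substr( $text, $i, $n, 'UTF-8' );
    }

    return $tokens;
}

// Bigrams of a four-character string: 東京, 京都, 都庁
print_r( ngrams( '東京都庁' ) );
```

Because every adjacent character pair becomes a token, a query for 京都 can match the indexed text even though no word boundary was ever detected. The trade-off is a larger index and some false matches.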
Contribution and support
If you want to contribute work to the project, please subscribe to the
developers mailing list and have a look at the contribution guidelines. A list
of people who have made contributions in the past can be found here.
Tests
The library provides unit tests that cover the core functionality and are
normally run by the continuous integration platform. Tests can also be executed
manually using the composer phpunit command from the root directory.
Release notes
- 0.1.0 Initial release (2016-08-07)
- Added SanitizerFactory with support for a Tokenizer, LanguageDetector,
  Synonymizer, and StopwordAnalyzer interface
Acknowledgments
- The Transliterator uses the same diacritics conversion table as
  http://jsperf.com/latinize (except the German diaeresis ä, ü, and ö)
- The stopwords used by the StopwordAnalyzer have been collected from
  different sources; each JSON file identifies its origin
- CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external
  database or cache layer (with extra stopwords being available here)
- JaTinySegmenterTokenizer is based on the work of Taku Kudo and his
  tiny_segmenter.js
- TextCatLanguageDetector uses the wikimedia/textcat library to make
  predictions about a language
License
GNU General Public License 2.0 or later.