Libtextcat is a library with functions that implement the
classification technique described in Cavnar & Trenkle, "N-Gram-Based
Text Categorization". It was primarily developed for language
guessing, a task on which it is known to perform with near-perfect
accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this
with the fingerprints of a number of documents of which the categories
are known. The categories of the closest matches are output as the
classification. A fingerprint is a list of the most frequent n-grams
occurring in a document, ordered by frequency. Fingerprints are
compared with a simple out-of-place metric. See the article for more
details.

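The technique described above can be sketched in a few lines of
Python. This is only an illustration of the idea, not libtextcat's
actual C API. The function names (fingerprint, out_of_place,
classify), the underscore padding of words, and the defaults
max_n=5 and top=400 are all assumptions made for this example.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams, most frequent first.

    max_n and top are illustrative defaults, not values taken from libtextcat.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (assumed convention)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Order by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each is from its rank
    in the category fingerprint. N-grams missing from the category
    fingerprint get a fixed maximum penalty (its length)."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return the category whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda c: out_of_place(doc_fp, category_fps[c]))
```

A caller would build one fingerprint per known category from sample
text, store them in a dict keyed by category name, and pass that dict
to classify() together with the unknown document. A lower
out-of-place distance means a closer match.
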
Considerable effort went into making this implementation fast and
efficient. The language guesser processes over 100 documents/second on
a simple PC, which makes it practical for many uses. It was developed
for use in our webcrawler and search engine software, in which it
handles millions of documents a day.