diff --git a/contrib/tsearch2/README.tsearch2 b/contrib/tsearch2/README.tsearch2 index c3581b9be6b3070ee264b739675ad28e298e3364..890868e59af7f2eba0083b648d292ade8b25ac5d 100644 --- a/contrib/tsearch2/README.tsearch2 +++ b/contrib/tsearch2/README.tsearch2 @@ -1,95 +1,106 @@ Tsearch2 - full text search extension for PostgreSQL - [10][Online version] of this document is available - - This module is sponsored by Delta-Soft Ltd., Moscow, Russia. - - Notice: This version is fully incompatible with old tsearch (V1), - which was deprecated in 7.4 and obsoleted in 8.0. - - The Tsearch2 contrib module contains an implementation of a new data - type tsvector - a searchable data type with indexed access. In a - nutshell, tsvector is a set of unique words along with their - positional information in the document, organized in a special - structure optimized for fast access and lookup. Actually, each word - entry, besides its position in the document, could have a weight - attribute, describing importance of this word (at a specific) position - in document. A set of bit-signatures of a fixed length, representing - tsvectors, are stored in a search tree (developed using PostgreSQL - GiST), which provides online update of full text index and fast query - lookup. The module provides indexed access methods, queries, - operations and supporting routines for the tsvector data type and easy - conversion of text data to tsvector. Table driven configuration allows - creation of custom configuration optimized for specific searches using + An [1]online version of this document is available + + Tsearch2 is a full text search engine, fully integrated into the PostgreSQL + RDBMS.
+ +Main features + + * Full online update + * Supports multiple table driven configurations + * flexible and rich linguistic support (dictionaries, stop words), + thesaurus + * full multibyte (UTF-8) support + * Sophisticated ranking functions with support of proximity and + structure information (rank, rank_cd) + * Index support (GiST and Gin) with concurrency and recovery support + * Rich query language with query rewriting support + * Headline support (text fragments with highlighted search terms) + * Ability to plug-in custom dictionaries and parsers + * Template generator for tsearch2 dictionaries with [2]snowball + stemmer support + * It is mature (5 years of development) + + Tsearch2, in a nutshell, provides an FTS operator (contains) for two new + data types representing a document (tsvector) and a query (tsquery). + Table driven configuration allows creation of custom searches using standard SQL commands. - - Configuration allows you to: - * specify the type of lexemes to be indexed and the way they are - processed. - * specify dictionaries to be used along with stop words recognition. - * specify the parser used to process a document. - - See [11]Documentation Roadmap for links to documentation. + + tsvector is a searchable data type representing a document. It is a set + of unique words along with their positional information in the + document, organized in a special structure optimized for fast access + and lookup. Each entry can be labelled to reflect its importance in + the document. + + tsquery is a data type for textual queries with support of boolean + operators. It consists of lexemes (optionally labelled) with boolean + operators between them. + + Table driven configuration allows you to specify: + * the parser used to break a document into lexemes + * which lexemes to index and the way they are processed + * dictionaries to be used along with stop words recognition.
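As a rough illustration of the tsvector and tsquery descriptions above, the following Python sketch models a document as a set of unique words with their positions and evaluates a boolean query against it. This is a toy model under stated assumptions, not how tsearch2 is implemented: the names to_toy_tsvector, matches and STOP_WORDS are hypothetical, the stop-word list is made up, no stemming or dictionary lookup is done, and letting a dropped stop word still consume a position is an assumption of this sketch.

```python
import re

# Assumed toy stop-word list; real configurations take these from dictionaries.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "is"}


def to_toy_tsvector(text):
    """Map each unique non-stop word to the list of its 1-based positions."""
    vector = {}
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        if word in STOP_WORDS:
            continue  # dropped, but the position counter still advances
        vector.setdefault(word, []).append(position)
    return vector


def matches(vector, query):
    """Evaluate a toy query tree against a toy tsvector.

    A query is either a lexeme (str), ("and" | "or", left, right),
    or ("not", operand).
    """
    if isinstance(query, str):
        return query in vector
    op = query[0]
    if op == "and":
        return matches(vector, query[1]) and matches(vector, query[2])
    if op == "or":
        return matches(vector, query[1]) or matches(vector, query[2])
    if op == "not":
        return not matches(vector, query[1])
    raise ValueError(f"unknown operator: {op!r}")


doc = to_toy_tsvector("The cat sat on the mat; the cat slept")
found = matches(doc, ("and", "cat", ("not", "dog")))
```

Here doc maps "cat" to two positions because the word occurs twice, which is the kind of positional information the ranking functions mentioned above rely on.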
OpenFTS vs Tsearch2 - OpenFTS is a middleware between application and database, so it uses - tsearch2 as a storage, while database engine is used as a query executor - (searching). Everything else (parsing of documents, query processing, - linguistics) carry outs on client side. That's why OpenFTS has its own - configuration table (fts_conf) and works with its own set of dictionaries. - OpenFTS is more flexible, because it could be used in multi-server - architecture with separated machines for repository of documents - (documents could be stored in file system), database and query engine. + [3]OpenFTS is a middleware between the application and the database. OpenFTS + uses tsearch2 as storage and the database engine as a query executor + (searching). Everything else, i.e. parsing of documents, query + processing, linguistics, is carried out on the client side. That's why OpenFTS + has its own configuration table (fts_conf) and works with its own set + of dictionaries. OpenFTS is more flexible, because it could be used in a + multi-server architecture with separate machines for the repository of + documents (documents could be stored in the filesystem), database and + query engine. + + See [4]Documentation Roadmap for links to documentation. Authors * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia - * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia - + * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia + Contributors - * Robert John Shepherd and Andrew J.
Kopciuch submitted + "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch v2) - * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 + * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 Reference" and proposed new naming convention for tsearch V2 - -Features Added with Tsearch2 - * Relevance ranking of search results - * Table driven configuration - * Morphology support (ispell dictionaries, snowball stemmers) - * Headline support (text fragments with highlighted search terms) - * Ability to plug-in custom dictionaries and parsers - * Synonym dictionary - * Generator of templates for dictionaries (built-in snowball stemmer - support) - * Statistics of indexed words is available - +Sponsors + + * ABC Startsiden - compound words support + * University of Mannheim for UTF-8 support (in 8.2) + * jfg:networks ([5]http://www.jfg-networks.com/) for Gin - Generalized + Inverted index (in 8.2) + * Georgia Public Library Service and LibLime, Inc. for Thesaurus + dictionary + * PostGIS community - GiST Concurrency and Recovery + + The authors are grateful to the Russian Foundation for Basic Research + and Delta-Soft Ltd., Moscow, Russia for support. + Limitations - * Lexeme should be not longer than 2048 bytes - * The number of lexemes is limited by 2^32. Note, that actual - capacity of tsvector is depends on whether positional information - is stored or not. - * tsvector - the size is limited by approximately 2^20 bytes.
- * tsquery - the number of entries (lexemes and operations) < 32768 - * Positional information - + maximal position of lexeme < 2^14 (16384) - + lexeme could have maximum 256 positions - + * Length of lexeme < 2K + * Length of tsvector (lexemes + positions) < 1Mb + * The number of lexemes < 4^32 + * 0< Positional information < 16383 + * No more than 256 positions per lexeme + * The number of nodes ( lexemes + operations) in tsquery < 32768 + References * GiST development site - - [12]http://www.sai.msu.su/~megera/postgres/gist - * OpenFTS home page - [13]http://openfts.sourceforge.net/ + [6]http://www.sai.msu.su/~megera/postgres/gist + * GiN development - [7]http://www.sigaev.ru/gin/ + * OpenFTS home page - [8]http://openfts.sourceforge.net/ * Mailing list - - [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen - eral - - [15]Documentation Roadmap - + [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene + ral + Documentation Roadmap * Several docs are available from docs/ subdirectory @@ -97,113 +108,103 @@ Documentation Roadmap + "Tsearch2 Guide" by Brandon Rhodes + "Tsearch2 Reference" by Brandon Rhodes * Readme.gendict in gendict/ subdirectory - + [16][Gendict tutorial] - - Online version of documentation is always available from Tsearch V2 - home page - - [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ - + + Also, check [10]Gendict tutorial + * Check [11]tsearch2 Wiki pages for various documentation + Support - Authors urgently recommend people to use [18][openfts-general] or - [19][pgsql-general] mailing lists for questions and discussions. - -Caution + Authors urgently recommend people to use [12]openfts-general or + [13]pgsql-general mailing lists for questions and discussions. - In spite of apparent easy full text searching with our tsearch module - (authors hope it's so), any serious search engine require profound - study of various aspects, such as stop words, dictionaries, special - parsers. 
Tsearch module was designed to facilitate both those cases. - Development History + Latest news + + For the PostgreSQL 8.2 release we added: + * multibyte (UTF-8) support + * Thesaurus dictionary + * Query rewriting + * rank_cd relevance function now supports different weights of + lexemes + * GiN support improves the scalability of tsearch2 + Pre-tsearch era - Development of OpenFTS began in 2000 after realizing that we - needed a search engine optimized for online updates and able to - access metadata from the database. This is essential for online + Development of OpenFTS began in 2000 after realizing that we + needed a search engine optimized for online updates with access + to metadata from the database. This is essential for online news agencies, web portals, digital libraries, etc. Most search - engines available utilize an inverted index which is very fast - for searching but very slow for online updates. Incremental - updates of an inverted index is a complex engineering task - while we needed something light, free and with the ability to - access metadata from the database. The last requirement is very - important because in a real life application a search engine - should always consult metadata ( topic, permissions, date - range, version, etc.). We extensively use PostgreSQL as a - database backend and have no intention to move from it, so the - problem was to find a data structure and a fast way to access - it. PostgreSQL has rather unique data type for storing sets - (think about words) - arrays, but lacks index access to them. A - document is parsed into lexemes, which are identified in - various ways (e.g. stemming, morphology, dictionary), and as a - result is reduced to an array of integer numbers. During our - research we found a paper of Joseph Hellerstein which - introduced an interesting data structure suitable for sets - - RD-tree (Russian Doll tree).
It looked very attractive, but - implementing it in PostgreSQL seemed difficult because of our - ignorance of database internals. Further research lead us to - the idea to use GiST for implementing RD-tree, but at that time - the GiST code had for a long while remained untouched and - contained several bugs. After work on improving GiST for - version 7.0.3 of PostgreSQL was done, we were able to implement - RD-Tree and use it for index access to arrays of integers. This - implementation was ideally suited for small arrays and - eliminated complex joins, but was practically useless for - indexing large arrays. The next improvement came from an idea - to represent a document by a single bit-signature, a so-called - superimposed signature (see "Index Structures for Databases - Containing Data Items with Set-valued Attributes", 1997, Sven - Helmer for details). We developeded the contrib/intarray module - and used it for full text indexing. - + engines available utilize an inverted index which is very fast + for searching but very slow for online updates. Incremental + updates of an inverted index is a complex engineering task + while we needed something light, free and with the ability to + access metadata from the database. The last requirement was + very important because in a real life application search engine + should always consult metadata ( topic, permissions, date + range, version, etc.). We extensively use PostgreSQL as a + database backend and have no intention to move from it, so the + problem was to find a data structure and a fast way to access + it. PostgreSQL has rather unique data type for storing sets + (think about words) - arrays, but lacks index access to them. + During our research we found a paper of Joseph Hellerstein, who + introduced an interesting data structure suitable for sets - + RD-tree (Russian Doll tree). 
Further research led us to the + idea to use GiST for implementing RD-tree, but at that time the + GiST code had been untouched for a long time and contained several + bugs. After work on improving GiST for version 7.0.3 of + PostgreSQL was done, we were able to implement RD-Tree and use + it for index access to arrays of integers. This implementation + was ideally suited for small arrays and eliminated complex + joins, but was practically useless for indexing large arrays. + The next improvement came from an idea to represent a document + by a single bit-signature, a so-called superimposed signature + (see "Index Structures for Databases Containing Data Items with + Set-valued Attributes", 1997, Sven Helmer for details). We + developed the contrib/intarray module and used it for full + text indexing. + tsearch v1 It was inconvenient to use integer id's instead of words, so we - introduced a new data type called 'txtidx' - a searchable data - type (textual) with indexed access. This was a first step of - our work on an implementation of a built-in PostgreSQL full + introduced a new data type called 'txtidx' - a searchable data + type (textual) with indexed access. This was a first step of + our work on an implementation of a built-in PostgreSQL full text search engine. Even though tsearch v1 had many features of - a search engine it lacked configuration support and relevance - ranking. People were encouraged to use OpenFTS, which provided - relevance ranking based on coordinate information and flexible - configuration. OpenFTS v.0.34 is the last version based on + a search engine it lacked configuration support and relevance + ranking. People were encouraged to use OpenFTS, which provided + relevance ranking based on positional information and flexible + configuration. OpenFTS v.0.34 is the last version based on tsearch v1.
- + tsearch V2 - People recognized tsearch as a powerful tool for full text - searching and insisted on adding ranking support, better - configurability, etc. We already thought about moving most of - the features of OpenFTS to tsearch, and in the early 2003 we - decided to work on a new version of tsearch - tsearch v2. We've - abandoned auxiliary index tables which were used by OpenFTS to - store coordinate information and modified the txtidx type to - store them internally. Also, we've added table-driven - configuration, support of ispell dictionaries, snowball - stemmers and the ability to specify which types of lexemes to - index. Also, it's now possible to generate headlines of - documents with highlighted search terms. These changes make - tsearch more user friendly and turn it into a really powerful - full text search engine. After announcing the alpha version, we - received a proposal from Brandon Rhodes to rename tsearch - functions to be more consistent. So, we have renamed txtidx - type to tsvector and other things as well. - - To allow users of tsearch v1 smooth upgrade, we named the module as - tsearch2. - - Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave - people could download it from OpenFTS CVS (see link from [20][OpenFTS - page] + People recognized tsearch as a powerful tool for full text + searching and insisted on adding ranking support, better + configurability, etc. We already thought about moving most of + the features of OpenFTS to tsearch, and in the early 2003 we + decided to work on a new version of tsearch. We abandoned + auxiliary index tables which were used by OpenFTS to store + positional information and modified the txtidx type to store + them internally. We added table-driven configuration, support + of ispell dictionaries, snowball stemmers and the ability to + specify which types of lexemes to index. Now, it's possible to + generate headlines of documents with highlighted search terms. 
+ These changes make tsearch more user friendly and turn it into + a really powerful full text search engine. Brandon Rhodes + proposed renaming tsearch functions for consistency, and we + renamed the txtidx type to tsvector, along with other things. To + give users of tsearch v1 a smooth upgrade path, we named the module + tsearch2. Since version 0.35 OpenFTS uses tsearch2. References - 10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html - 11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap - 12. http://www.sai.msu.su/~megera/postgres/gist - 13. http://openfts.sourceforge.net/ - 14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general - 15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap - 16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict - 17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ - 18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general - 19. http://archives.postgresql.org/pgsql-general/ - 20. http://openfts.sourceforge.net/ + 1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html + 2. http://snowball.tartarus.org/ + 3. http://openfts.sourceforge.net/ + 4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm + 5. http://www.jfg-networks.com/ + 6. http://www.sai.msu.su/~megera/postgres/gist + 7. http://www.sigaev.ru/gin/ + 8. http://openfts.sourceforge.net/ + 9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general + 10. http://www.sai.msu.su/~megera/wiki/Gendict + 11. http://www.sai.msu.su/~megera/wiki/Tsearch2 + 12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general + 13. http://archives.postgresql.org/pgsql-general/