Strategy for creating Oracle Text Index

Table of Contents

Purpose

Purpose of this note is to explain how to manage and control Oracle Text index creation process, storage preference and space usage.

Scope and application

Note is intended for Oracle DBAs that want to control how and where Text index is created, and to find out what is size of Text index.

This document to NOT deal with every aspect of Text index, it concentrates on Text Index creation process, Storage preference and space usage.
For more information about Datastore Types, Filter Types, Lexer Types, Wordlist Type,Section Group Types and Stoplists, you should check:
Oracle Text Application Developer's Guide , and
Oracle Text Reference

In this note we use TXTSUP_UTIL - PL/SQL package to quickly and easily diagnose and find information about Oralce Text index.
You can find source code of this package and instructions how to install it in folowing note:
Note:150546.1 TXTSUP_UTIL - PL/SQL package for diagnosing Oracle Text Index

Text Index creation process

It is not uncommon to Text index a document set with ten's of million's of documents at an average size of ten's of Mb (or much greater for DOC, PPT or PDF documents). The size of the Text index will be a similar order of magnitude to the total volume of filtered indexed data. Index creation time can vary from couple of hours to couple of days. We do not have formula that would predict how long index creation will take, because there are so many unquantifiable variables (nature of documents, I/O performance ..)

If you have DOC, PPT or PDF and you want to get feeling what will be size of filtered document you can use ctxhx utility which is located in %ORACLE_HOME/ctx/bin directory. Usage is:  ctxhx InputFile OutputFile [options]  By default output format is HTML, but is also can be plain text, if you type ctxhx you will get more information about ctxhx options.

When creating Oracle Text index on large document set it is recommended to start by creating a Text index on just 10 or 100 documents, to check do you receive expected results (index creation time, space usage, tokens, stopwords, sections, themes ...). Follow the outlined procedure, and once you are sure that you are getting expected results increase the number of documents indexed to 1000, 10,000, 100,000...

Recommended Text Index creation procedure

1. Prepare table on which you whant to create Text Index

If you are creatin Text index on samll table, which lett say have less than 100 rows, you can start experimineting on this table directly, since we can expect that index creation will take less than 15 minutes. But if you are creating Oracle Text index on large document set it is recommended to start by creating test table which have same structure as original table and just 10 or 100 documents. There is number of ways you can create a test table which is subset of original table. Let say that you have QUICK table created as this:

create table quick  (
    quick_id                number
    constraint quick_pk     primary key,
    text                    varchar2(80)  );

You can create table using 'create table as select' and restrict number of rows selected:

create table quick_test as
select from quick
where rownum <100;

You can also use sample_clause which lets you instruct Oracle to select from a random sample of rows from the table, rather than from the entire table. Sample clause ave form of: SAMPLE [BLOCK] ( sample_percent ) Where BLOCK instructs Oracle to perform random block sampling instead of random row sampling, sample_percent is a number specifying the percentage of the total row or block count to be included in the sample. The value must be in the range .000001 to (but not including) 100.

Lets create quick_test table which have 10% of rows from quick table:

create table quick_test as
select from quick sample (10);

Or lets create quick_test table which have 10% of blocks from quick table:

create table quick_test as
select from quick sample block (10);

Now you continue using test table in index creation proces, which will help you check do you receive expected results (index creation time, space usage, tokens, stopwords, sections, themes ...).

2. Set appropriate preference for Text Index

When you create an index and specify no parameter string, an index is created with default parameters. However it is always recommended to override the defaults and customize your index to suit your query application. Detail coverage of all aspects of text index is out of cope of this document, we will outline all preference types you can use to customize your index and concentrate on Storage preference and space usage. For more information on all aspects of Text index, have look on Oracle Text Application Developer's Guide, and  Oracle Text Reference, and referenced notes.

Oracle Text index is 'DOMAIN' index, oracle provided INDEXTYPE build using Extensible Indexing framework. Text index is not singe object in database, it is implemented using number of underlying 'normal' tables and indexes. Oracle DBA can use storage preference to specify tablespace and creation parameters for tables associated with a Text index.

Oracle provides a single storage type called BASIC_STORAGE which can be used to specify the tablespace and creation parameters for internal database tables and indexes that constitute a Text index. The clause you specify is added to the internal CREATE TABLE (CREATE INDEX for the i_index _clause) statement at index creation. You can specify most allowable clauses, such as storage, LOB storage, or partitioning. However, you cannot specify an index organized table clause.

BASIC_STORAGE has the following attributes:
 
Attribute Attribute Value

i_table_clause 

Parameter clause for dr$indexname$I table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.

The I table is the index data table. 

k_table_clause 

Parameter clause for dr$indexname$K table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.

The K table is the keymap table. 

r_table_clause 

Parameter clause for dr$indexname$R table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement. 

The R table is the rowid table. 

The default clause is: 'LOB(DATA) STORE AS (CACHE)'

n_table_clause 

Parameter clause for dr$indexname$N table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.

The N table is the negative list table. 

i_index_clause 

Parameter clause for dr$indexname$X index creation. Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement. The default clause is: 'COMPRESS e'

p_table_clause 

Parameter clause for the substring index if you have enabled SUBSTRING_INDEX in the BASIC_WORDLIST. 

Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement. The P table is an index-organized table so the storage clause you specify must be appropriate to this type of table. 

Storage Default Behavior

By default, BASIC_STORAGE attributes are not set. In such cases, the Text index tables are created in the index owner's default tablespace. Consider the following statement, issued by user SCOTT, with no BASIC_STORAGE attributes set:

create index SCOTT.idx
on scott.tab(b)
indextype is ctxsys.context;

In this example, the text index is created in SCOTT's default tablespace, with following storage parameters.

  i_index_clause      compress 2
  r_table_clause      lob (data) store as (cache)

To create storage preference you should use Create_Preference procedure in  CTX_DDL package. In following example we will create preference with name 'my_text_storage'. Please note that it is very important beside specifying your own preferences for tablespaces and other storage parameters, not to skip clauses 'compress 2' for i_index_clause and 'lob (data) store as (cache)' for r_table_clause. If you skip this, that will degrade performance of text index.

-------------- Copy from here and save as ctx_storage_test.sql -----------------------

-- Create a user to work with Oracle Text
create user textuser identified by textuser
default tablespace users
temporary tablespace temp;

-- You must grant 'ctxapp' role to textuser
grant connect, resource, ctxapp to textuser;

connect textuser/textuser;

drop table quick;

create table quick  (
    quick_id        number
    constraint quick_pk     primary key,
    text                    varchar2(80)  );

insert into quick values ( 1, 'The cat sat on the mat' );
insert into quick values ( 2, 'The quick brown fox jumps over the lazy dog' );
insert into quick values ( 3, 'The dog barked like a dog' );
commit;

/* Drop my_text_storage preference, if it already exsist */
begin
  Ctx_Ddl.Drop_Preference ( 'my_text_storage' );
end;
/

/* Drop quick_text index, if it already exsist */
drop index quick_text;

/* Create my_text_storage storage preference, and add attributes to it */

begin
  Ctx_Ddl.Create_Preference('my_text_storage', 'BASIC_STORAGE');
  /*  I_TABLE is the Token Table, main table of Text index   */
  ctx_ddl.set_attribute('my_text_storage','I_TABLE_CLAUSE',
  'tablespace users storage (initial 10M next 10M)');

  /*  K_TABLE is the Mapping Table  */
  ctx_ddl.set_attribute('my_text_storage', 'K_TABLE_CLAUSE',
  'tablespace users storage (initial 10M next 10M)');

  /* R_TABLE is the Denormalized Mapping Table (the "rowid row")
  Note: It is very important to add 'lob (data) store as (cache)' clausule */
  ctx_ddl.set_attribute('my_text_storage', 'R_TABLE_CLAUSE',
  'tablespace users storage (initial 1M) lob (data) store as (cache)');

  /* N_TABLE is the Negative List Table, it has rows only when deletes (or updates)
      have been done since the Text index was last sync'd */
  ctx_ddl.set_attribute('my_text_storage', 'N_TABLE_CLAUSE',
  'tablespace users storage (initial 1M)');

  /* I_INDEX is I_TABLE  Token Table
      Note: It is very important not to skip 'compress 2' clausule */
  ctx_ddl.set_attribute('my_text_storage', 'I_INDEX_CLAUSE',
  'tablespace users storage (initial 1M) compress 2');

  /* P_TABLE is the Pattern Table, it exists when SUBSTRING_INDEX is TRUE
      and accelerates %ation queries  */
  ctx_ddl.set_attribute('my_text_storage', 'P_TABLE_CLAUSE',
  'tablespace users storage (initial 1M)');
end;
/

-- Create quick_text Text index using my_text_storage storage preference
create index quick_text on quick ( text )
  indextype is ctxsys.context
  parameters ('storage my_text_storage');

-- Test quick_text index
col text format a45
col s format 999
select text, score(42) s from quick
  where contains ( text, 'dog', 42 ) >= 0
  order by s desc;
---------------- End of ctx_storage_test.sql -------------------------

It is recommended to always set appropriate storage parameters for Text index prior to creating it using above outlined procedure. Bare in mind that size of the Text index will be a similar order of magnitude to the total volume of filtered Text data. It is recommended to create separate tablespace for storing Text indexes. Since I_INDEX ($X) is index on I_TABLE, it is recommended  to create in on different tablespace/disk from I_TABLE. The SORT_AREA_SIZE and the size of the temporary segment have the affect on the creation of the I_INDEX ($X) index.

Size of each individual object (I_TABLE, R_TABLE ...) depends on nature of you text documents, text index preferences (Themes, Section Group,  Stoplists ...), nature of application (query only,  updates intensive, deletes ...). Since there are many parameters that can not be predicted up front prior to index creation, it is recommended to start by creating a Text index on just 10 or 100 documents, monitor text index space usage, using CTX_UTIL.INDEX_SIZE, tune you storage preference, and once you are sure that you are getting expected results increase the number of documents indexed. For more information on index size check index size section.
 

3. Turn on index logging and check the output

A log file can be created so that you can see the progress of Text index creation. The CTX_OUTPUT.START_LOG('<filename>') procedure begins logging to the file specified in '<filename>' and CTX_OUTPUT.END_LOG stops logging. By default, CTX_OUTPUT.START_LOG creates the specified log file in the $ORACLE_HOME/ctx/log directory. Alternatively, the default location for the Text index log files can be changed with the CTX_ADM procedure when connected as CTXSYS, as follows:

connect CTXSYS/CTXSYS
execute CTXSYS.CTX_ADM.SET_PARAMETER ('LOG_DIRECTORY','/tmp');

The user creating the index would start logging with the following command:

execute  CTXSYS.CTX_OUTPUT.START_LOG('textidx.log');

When something goes wrong during indexing -- a core dump or a hang, for instance, you can usually work around it by removing or ignoring the faulty documents. In previous versions it was very difficult to find out which documents were faulty. Starting from 9i v9.0.1 CTX_OUTPUT has a new rowid printing event. When set, the rowid of each row will be printed to the logfile.

exec ctx_output.add_event(CTX_OUTPUT.EVENT_INDEX_PRINT_ROWID);

In that case log would show:
16:29:00 05/14/01 populate index: CTXSYS.LOGTESTX
16:29:00 05/14/01 Begin document indexing
16:29:00 05/14/01 INDEXING ROWID AAABklAACAAACynAAA
16:29:00 05/14/01 INDEXING ROWID AAABklAACAAACynAAB

execute  CTXSYS.CTX_OUTPUT.STOP_LOG;
 

4. Check CTX_USER_INDEX_ERRORS for errors

If there are error generated during Text index creation process, you can find the errors in the view CTX_USER_INDEX_ERRORS for user indexes, and from CTX_INDEX_ERRORS for all indexes. Please note that in Text release 8.1.6 or earlier, you might not be able to see rows in the CTX_%INDEX_ERRORS during index creation process, because the inserted rows are not commited immediately when they have been inserted.

  set linesize 120
  col err_text format a30 word_wrap
  SELECT ERR_INDEX_NAME,
           to_date(ERR_TIMESTAMP,'YYYY-MM-DD:HH24:MI:SS'),
           ERR_TEXTKEY,
           ERR_TEXT
  FROM ctx_user_index_errors
  ORDER BY ERR_INDEX_NAME, ERR_TIMESTAMP ;

In most cases ERR_TEXTKEY column will have the corresponding ROWID which you can use to find the rows on your text table and investigate problem. If the row cannot be found, investigate ERR_TEXT in CTX_USER_INDEX_ERRORS which should have  more information about problems with the index creation.  To find the row that is source of the problem you can use following select:

  SELECT <primary_key_column>
  FROM   <your text table>
  WHERE  ROWID IN (SELECT err_textkey
                   FROM   ctx_user_index_errors
                   WHERE  err_index_name = '<your text index>');

If you ctx_user_index_errors if full of old errors you can delete them using:

  DELETE FROM ctx_user_index_errors;
  commit;

5. Review all the parameters of text index

As already discussed index cretin time, size of text index, performance of text queries, depends on number of factors and preferences set. You can use CTX_UTIL.INDEX_PARAMETERS procedure to investigate parameters of the index. Procedure accepts index name, and produces report which lists all parameters of index. Lets see one example:

set serveroutput on Buffer 100000;
execute TXTSUP_UTIL.INDEX_PARAMETERS ('QUICK_TEXT');

***************************************************
* Atributes of Text Index:  QUICK_TEXT            *
***************************************************
==================================
direct_datastore
--- [no attributes]
==================================
null_filter
--- [no attributes]
==================================
basic_lexer
--- index_themes        YES
==================================
null_section_group
--- [no attributes]
==================================
basic_stoplist
--- stop_word           Corp
--- stop_word           Mr
--- stop_word           Mrs
--- stop_word           Ms
--- stop_word           Mz
--- stop_word           a
--- ....
--- (Some stop word are removed, to reduce size of this note)
--- ....
--- stop_word           who
--- stop_word           will
--- stop_word           with
--- stop_word           would
==================================
basic_storage
--- i_index_clause      tablespace users storage (initial 1M) compress 2
--- i_table_clause      tablespace users storage (initial 10M next 10M)
--- k_table_clause      tablespace users storage (initial 10M next 10M)
--- n_table_clause      tablespace users storage (initial 1M)
--- p_table_clause      tablespace users storage (initial 1M)
--- r_table_clause      tablespace users storage (initial 1M) lob (data) store as (cache)
==================================
basic_wordlist
--- fuzzy_match         GENERIC
--- stemmer             ENGLISH
***************************************************
* Atributes of Text Index:  QUICK_TEXT            *
***************************************************

Review all parameters of Text index, and remove all the functionality from index that your users will not use, and which will slow down performance of index creation and querying. Tune all other aspects of text index functionality that you user will use. We will try to outline most important aspects. Please take note that changes made to preference, do NOT impact you text index directly, because preferences are used by deep copy at index create time, and that a Text index has no dependency on the preferences mentioned in its CREATE INDEX
statement. In order to new preferences take effect, you must drop and recreate index using new preferences. Stoplist is only exception, it is possible to add stopword to already existing index but it is not possible to remove stopword. Because of this we recommend to start by creating a Text index on just 10 or 100 documents, review index parameter and then increase the number of documents indexed to 1000, 10,000...

6. Check that the token table is sensibly populated

set serveroutput on Buffer 100000;
select TOKEN_TEXT, TOKEN_TYPE
from DR$
QUICK_TEXT$I


7. Measure the size parameters of the Text index objects

Measure the size parameters (numbers of rows and total volumes) of the Text index objects: the $I, $K, $R tables and the $X B*-tree index. (We recommend optimizing before making these measurements

set serveroutput on Buffer 100000;
execute TXTSUP_UTIL.INDEX_SIZE('QUICK_TEXT');

============================================================
Space usage of Text Index:  TEXT.QUICK_TEXT
============================================================
TABLE_NAME             =    DR$QUICK_TEXT$I
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
TABLE_NAME             =    DR$QUICK_TEXT$R
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
INDEX_NAME             =    SYS_IOT_TOP_36893
INDEX_TYPE             =    IOT - TOP
TABLE_NAME             =    DR$QUICK_TEXT$K
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
INDEX_NAME             =    SYS_IOT_TOP_36898
INDEX_TYPE             =    IOT - TOP
TABLE_NAME             =    DR$QUICK_TEXT$N
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
INDEX_NAME             =    SYS_IL0000036890C00006$$
INDEX_TYPE             =    LOB
TABLE_NAME             =    DR$QUICK_TEXT$I
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
INDEX_NAME             =    SYS_IL0000036895C00002$$
INDEX_TYPE             =    LOB
TABLE_NAME             =    DR$QUICK_TEXT$R
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
INDEX_NAME             =    DR$QUICK_TEXT$X
INDEX_TYPE             =    NORMAL
TABLE_NAME             =    DR$QUICK_TEXT$I
TABLESPACE NAME        =    USERS
ALLOCATED BLOCKS       =                  16
ALLOCATED BYTES        =             131,072 (       0.13Mb)
USED BLOCKS            =                   2
USED BYTES             =              16,384 (       0.02Mb)
------------------------------------------------------------
============================================================
TOTAL ALLOCATED BLOCKS =                 112
TOTAL ALLOCATED BYTES  =             917,504 (       0.88Mb)
TOTAL USED BLOCKS      =                  14
TOTAL USED BYTES       =             114,688 (       0.11Mb)
============================================================

8. Check that you can execute Text queries with expected results
Test all aspects of text index functionality, sectioning, fuzzy, themes ...

-- Simple text query
SELECT text FROM quick
WHERE CONTAINS ( text,'sat on the mat' ) > 0;

9. Increase number of documents and repeat indexing procedure

Now when you finished first round of outlined procedure with small number of documents let say 1000, and if you are shure that you are getting expected results increase the number of documents indexed to 10,000, 100,000... and repeat indexing procedure

truncate table quick_test;
 

10. Develop and implement apropriate strategy for maintaining an Oracle Text Index

DML operations to the base table refer to when documents are inserted, updated or deleted from the base table. This section describes how you can monitor, synchronize, and optimize
the Oracle Text CONTEXT index when DML operations occur

Managing DML Operations for a CONTEXT Index. in Oracle Text Application Developer's Guide
Note.104262.1 DML Processing in interMedia Text
Note.132689.1 Strategy for maintaining an InterMedia Text Index


Related documents

<Note:150546.1>  TXTSUP_UTIL - Oracle Text diagnostics utility
<Note:116565.1> When to use DBMS_SPACE.UNUSED_SPACE or DBMS_SPACE.FREE_BLOCKS Procedures
<Note:115586.1> How to Deallocate Unused Space from a Table, Index or Cluster.
<Note:96724.1 > Using the CTX_OUTPUT for Logging and Testing extproc

Links

Oracle MetaLink  http://metalink.oracle.com
Oracle Text home page on Oracle Technology Network:   http://otn.oracle.com/products/text/


  .