Problems with CTX_DOC.SNIPPET on HTML documents (XP, Oracle 10g) [message #354803] Tue, 21 October 2008 06:22
dugjason
Hi,

I am running CTX_DOC.SNIPPET on the following HTML document, retrieved from my Oracle database:

<p class="Bodytext">
As <span class="myProduct">myProduct</span> is an Internet-based
application, all you need to do to start it is to go to
http://<span class="myProduct">myProduct</span> installation&gt;.
</p>
<ul>
	<li>&nbsp; Start
	your usual browser.
	</li>
</ul>


I have set the entity_translation parameter of ctx_doc.snippet to FALSE, which removes most of the HTML tags from my text, but in this case it still leaves <p class="Bodytext"> and the <span> tags in the final output.

I have had a play around with it, but some HTML always seems to remain in the output, and I would like it to be displayed as plain text.

I am not sure whether the fix is to change the parameters of ctx_doc.snippet, or whether I will have to change my index, perhaps to generate a plain-text version of the HTML document.

Below are my index definition and my ctx_doc.snippet call:

Index:
create index help_text
on help_page (page_content)
indextype is ctxsys.context
parameters ( 'TRANSACTIONAL SYNC(EVERY "SYSDATE+15/1440")');


Snippet:
ctx_doc.snippet('help_text', to_char(page_results.page_id), p_string, '<B>', '</B>', false);


Any help will be greatly appreciated!
Re: Problems with CTX_DOC.SNIPPET on HTML documents [message #355095 is a reply to message #354803] Wed, 22 October 2008 11:33
Barbara Boehmer
I am using 11g and get slightly different results, but adding FILTER CTXSYS.AUTO_FILTER to the parameters during index creation seems to clean it up. Please see the reproduction and solution below.

-- test environment:
SCOTT@orcl_11g> CREATE TABLE help_page
  2    (page_id       NUMBER PRIMARY KEY,
  3  	page_content  CLOB)
  4  /

Table created.

SCOTT@orcl_11g> SET DEFINE OFF
SCOTT@orcl_11g> INSERT INTO help_page VALUES
  2  (1,
  3  '<p class="Bodytext">
  4  As <span class="myProduct">myProduct</span> is an Internet-based
  5  application, all you need to do to start it is to go to
  6  http://<span class="myProduct">myProduct</span> installation&gt;.
  7  </p>
  8  <ul>
  9  	     <li>&nbsp; Start
 10  	     your usual browser.
 11  	     </li>
 12  </ul>')
 13  /

1 row created.

SCOTT@orcl_11g> create index help_text
  2  on help_page (page_content)
  3  indextype is ctxsys.context
  4  parameters
  5    ( 'TRANSACTIONAL SYNC(EVERY "SYSDATE+15/1440")')
  6  /

Index created.

SCOTT@orcl_11g> VARIABLE g_test CLOB
SCOTT@orcl_11g> COLUMN g_test FORMAT A45 WORD_WRAPPED


-- reproduction:
SCOTT@orcl_11g> DECLARE
  2    p_string VARCHAR2(30) := 'application';
  3  BEGIN
  4    CTX_DOC.SET_KEY_TYPE ('PRIMARY_KEY');
  5    :g_test := ctx_doc.snippet
  6  		    ('help_text',
  7  		     '1',
  8  		     p_string,
  9  		     '<B>',
 10  		     '</B>',
 11  		     false);
 12  END;
 13  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> PRINT g_test

G_TEST
---------------------------------------------
myProduct">myProduct</span> is an
Internet-based
<B>application</B>, all you need to do to
start it is to go
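
For reference, the reproduction call can also be written in named notation, which makes it explicit that the sixth positional argument is entity_translation. As far as the documented signature goes, that parameter controls whether special characters such as < are returned as entities such as &lt; and it is not described as a tag-stripping switch. This is only a sketch: the parameter names are taken from the CTX_DOC.SNIPPET signature as I understand it, and l_snippet is a hypothetical variable that is not part of the original test.

```sql
SET SERVEROUTPUT ON
DECLARE
  l_snippet VARCHAR2(4000);  -- hypothetical holder for the returned snippet
BEGIN
  -- treat textkey as a primary key value rather than a ROWID
  CTX_DOC.SET_KEY_TYPE ('PRIMARY_KEY');
  l_snippet := CTX_DOC.SNIPPET
                 (index_name         => 'help_text',
                  textkey            => '1',
                  text_query         => 'application',
                  starttag           => '<B>',
                  endtag             => '</B>',
                  entity_translation => FALSE);
  DBMS_OUTPUT.PUT_LINE (l_snippet);
END;
/
```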


-- solution:
SCOTT@orcl_11g> DROP INDEX help_text
  2  /

Index dropped.

SCOTT@orcl_11g> create index help_text
  2  on help_page (page_content)
  3  indextype is ctxsys.context
  4  parameters
  5    ( 'TRANSACTIONAL SYNC(EVERY "SYSDATE+15/1440")
  6  	  FILTER	CTXSYS.AUTO_FILTER')
  7  /

Index created.

SCOTT@orcl_11g> DECLARE
  2    p_string VARCHAR2(30) := 'application';
  3  BEGIN
  4    CTX_DOC.SET_KEY_TYPE ('PRIMARY_KEY');
  5    :g_test := ctx_doc.snippet
  6  		    ('help_text',
  7  		     '1',
  8  		     p_string,
  9  		     '<B>',
 10  		     '</B>',
 11  		     false);
 12  END;
 13  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> PRINT g_test

G_TEST
---------------------------------------------
As  myProduct is an Internet-based
<B>application</B>, all you need to do to
start it is to go


SCOTT@orcl_11g> 
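
On the original question of generating a plain-text version of the HTML document: one way to inspect the text that the index's filter produces is CTX_DOC.FILTER with plaintext set to TRUE, using its in-memory CLOB variant. The following is a minimal sketch that assumes the help_text index and the row with primary key 1 from the test above. The variable name l_plain is hypothetical, and no output is shown because the block was not run against this data.

```sql
SET SERVEROUTPUT ON
DECLARE
  l_plain CLOB;  -- hypothetical name; CTX_DOC.FILTER fills it with a temporary LOB
BEGIN
  CTX_DOC.SET_KEY_TYPE ('PRIMARY_KEY');
  CTX_DOC.FILTER
    (index_name => 'help_text',
     textkey    => '1',
     restab     => l_plain,
     plaintext  => TRUE);
  -- print at most the first 4000 characters of the filtered text
  DBMS_OUTPUT.PUT_LINE (DBMS_LOB.SUBSTR (l_plain, 4000, 1));
  -- release the temporary LOB that the call allocated
  DBMS_LOB.FREETEMPORARY (l_plain);
END;
/
```

If the printed text still contains tags, the snippet output is likely to contain them too, which would point at the index's filter or section group settings rather than at the ctx_doc.snippet arguments.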



Re: Problems with CTX_DOC.SNIPPET on HTML documents [message #355463 is a reply to message #355095] Fri, 24 October 2008 03:56
dugjason
That's brilliant, thank you. There are still some slightly odd outputs, but I think that is more down to the input string being a bit messy.
Thanks again.