
'Google-ized' search in your business data

Search within your Oracle table data like searching the web with Google

Author: Krasen Paskalev
Certified Oracle 8i/9i DBA, Senior Oracle Consultant

Semantec GmbH
Benzstr. 32
D-71083 Herrenberg, Germany
www.semantec.de

Agenda

• Motivation
    • Applications contain valuable data
    • How difficult it is to search for it
    • How easy it is in Google
• What makes a good search engine
• Semantec's Direct Info – demo
• Direct Info concepts and architectural elements


Applications contain valuable data

Classical approach: substring search with LIKE

• Too complex to use
• Too slow – often results in a full table scan
• No advanced search expressions
• No text fragments
• No word boundaries: a search for CAT also finds APPLICATION and VACATION
• Not flexible – expensive to add or remove searchable fields
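
To make these drawbacks concrete, here is a minimal sketch of a LIKE-based search. It assumes the demo customers table shown later in this deck; the choice of remarks and company_name as the searchable columns is arbitrary and only for illustration.

```sql
-- Substring search with LIKE. The leading wildcard (and the UPPER() call)
-- keeps Oracle from using an ordinary B-tree index on these columns,
-- so the query typically ends up as a full table scan.
SELECT id, last_name, company_name
  FROM customers
 WHERE UPPER(remarks)      LIKE '%CAT%'
    OR UPPER(company_name) LIKE '%CAT%';

-- LIKE has no notion of word boundaries: the pattern above matches 'cat',
-- but also 'application' and 'vacation'.
-- Every additional searchable column needs another OR branch in each query.
```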

How easy it is in Google

• Results presented in pages
• Link to open the document
• Highlighted text fragments
• Full document location (document context)

How to search here?

[Figure: entity-relationship diagram of a large application schema with dozens of interrelated tables (for example CUSTOMERS, PRJ, ACT, ORG, IND, LOC, USR), their columns, keys and 0..n relationships]

Agenda

• Motivation
• What makes a good search engine
• Semantec's Direct Info – demo
• Direct Info concepts and architectural elements

What makes a good search engine?

• Fast search
• Order by relevance
• Options to narrow and judge the hits
    • Advanced search expressions
    • More information about the object hit
        • Text fragments with highlighted keywords
        • Keyword context – where the keyword is found
        • Object context – extended object information
    • Search by object type
    • Search within a specific object attribute
• Direct access to the object found
• Accessible to a wide user group

Agenda

• Motivation
• What makes a good search engine
• Semantec's Direct Info – demo
• Direct Info concepts and architectural elements

Direct Info

• Framework developed by Semantec
• Builds on the Oracle Text platform
• Built with pure PL/SQL
• All code is stored in Oracle

Data Model

customers
    id               NUMBER
    code             VARCHAR2(20)
    customer_type    VARCHAR2(40)
    first_name       VARCHAR2(40)
    last_name        VARCHAR2(40)
    other_names      VARCHAR(80)
    profession       VARCHAR(80)
    title            VARCHAR2(10)
    nationality      VARCHAR2(2)
    date_of_birth    DATE
    company_name     VARCHAR2(80)
    business_sector  VARCHAR2(40)
    business_phone   VARCHAR2(20)
    private_phone    VARCHAR2(20)
    mobile_phone     VARCHAR2(20)
    fax              VARCHAR2(20)
    email            VARCHAR2(80)
    web_site         VARCHAR2(80)
    remarks          VARCHAR2(1000)

addresses
    id               NUMBER
    customer_id      NUMBER
    country_code     VARCHAR2(2)
    postal_code      VARCHAR2(80)
    city             VARCHAR2(80)
    street           VARCHAR2(80)
    po_box           VARCHAR2(10)

countries
    code             VARCHAR2(2)
    name             VARCHAR2(80)

services
    id               NUMBER
    name             VARCHAR2(100)
    description      VARCHAR2(1000)

bank_accounts
    id               NUMBER
    customer_id      NUMBER
    country_code     VARCHAR2(2)
    bank_name        VARCHAR2(80)
    bank_code        VARCHAR2(20)
    account_no       VARCHAR2(20)
    remarks          VARCHAR2(1000)

contracts
    id               NUMBER
    customer_id      NUMBER
    begin_date       DATE
    end_date         DATE
    service_id       NUMBER
    remarks          VARCHAR2(1000)

Agenda

• Motivation
• What makes a good search engine
• Semantec's Direct Info – demo
• Direct Info concepts and architectural elements
    • What is Oracle Text
    • Indexing data
    • Search results presentation


What is Oracle Text?

Formerly known as ConText (8.0) and interMedia Text (8i)

Uses standard SQL to index, search and analyze text and documents stored in the Oracle database, in files and on the Web

Allows advanced searching including keyword search, pattern matching, boolean expressions, etc.

Supports multiple languages

Oracle Text Index Usage

Oracle Text index creation:

    CREATE INDEX DOC_INDEX_01 ON DOC_TABLE_01(location)
      INDEXTYPE IS CTXSYS.CONTEXT
      PARAMETERS ('DATASTORE USER_DATASTORE_01');

Oracle Text index search:

    SELECT doc_name FROM DOC_TABLE_01
     WHERE CONTAINS(location, 'mouse AND wireless', 1) > 0
     ORDER BY score(1) DESC;

Boolean expressions, proximity search

• AND (&) – mouse AND wireless
• OR (|) – mouse OR wireless
• NOT (~) – mouse NOT wireless
• ACCUMulate (,) – mouse, monitor, cd
• NEAR – NEAR((mouse,wireless),5)
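
As an illustration, the operators above can be placed directly into the query string of CONTAINS. The sketch below reuses the DOC_TABLE_01 table and its indexed location column from the previous slide; the search terms are arbitrary examples.

```sql
-- Both words must occur in the document
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, 'mouse AND wireless') > 0;

-- The first word must occur, the second must not
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, 'mouse NOT wireless') > 0;

-- ACCUM: any of the terms is a hit; documents matching more terms score higher
SELECT doc_name, SCORE(1) AS relevance
  FROM doc_table_01
 WHERE CONTAINS(location, 'mouse, monitor, cd', 1) > 0
 ORDER BY SCORE(1) DESC;

-- Proximity: both words within a span of 5 words
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, 'NEAR((mouse, wireless), 5)') > 0;
```

The word forms (AND, OR, NOT) are used here instead of the symbols because tools such as SQL*Plus treat & as a substitution-variable prefix by default.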

Expansion operators

Expand the list of words searched for

• Wildcard (%, _) – only a portion of the word
    _ing -> sing king ping
    monito% -> monitor monitoring
• Soundex (!) – words that sound similar
    !sing -> sing sink
• Fuzzy – words that are spelled similarly
    fuzzy(sing,70,10,weight) -> sing king sink
• Stem ($) – words having the same linguistic root
    $sing -> sing sang sung
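
A short sketch of how these operators look inside a query, again assuming the DOC_TABLE_01 table and location column from the index usage slide. Which words an expansion actually produces depends on the indexed data and on the lexer and wordlist settings of the index, so the hit lists on the slide should be read as examples.

```sql
-- Wildcard: every indexed word starting with 'monito'
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, 'monito%') > 0;

-- Soundex: words that sound like 'sing'
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, '!sing') > 0;

-- Fuzzy: similarity score of at least 70, at most 10 expansions,
-- results weighted by similarity
SELECT doc_name, SCORE(1) AS relevance
  FROM doc_table_01
 WHERE CONTAINS(location, 'fuzzy(sing, 70, 10, weight)', 1) > 0
 ORDER BY SCORE(1) DESC;

-- Stem: words sharing the linguistic root of 'sing'
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, '$sing') > 0;
```

A wildcard at the start of a term, as in _ing, can be considerably more expensive than a trailing one, so it is worth testing on realistic data volumes.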

Thesauri examples

• Theme search – ABOUT(economics)
• Broader term – BT(cat) -> animal
• Narrower term – NT(animal) -> cat dog
• Associative relation – RT(cat) -> kitten
• Translated term – TR(cat) -> cat gato
• Synonym – SYN(cat) -> cat tiger
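
The thesaurus operators are used in the CONTAINS query string like any other operator. The sketch below assumes that a thesaurus with the hypothetical name animals_thes has already been loaded and contains the relations from the slide; the table and column come from the index usage slide.

```sql
-- Narrower terms of 'animal', one level down, from the assumed thesaurus
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, 'NT(animal, 1, animals_thes)') > 0;

-- Synonyms of 'cat' from the same assumed thesaurus
SELECT doc_name
  FROM doc_table_01
 WHERE CONTAINS(location, 'SYN(cat, animals_thes)') > 0;

-- Theme search: documents about a concept rather than containing a word
SELECT doc_name, SCORE(1) AS relevance
  FROM doc_table_01
 WHERE CONTAINS(location, 'ABOUT(economics)', 1) > 0
 ORDER BY SCORE(1) DESC;
```

If the thesaurus name is omitted, Oracle Text looks for a thesaurus named DEFAULT.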

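Datastore: Direct and Multi-column

• Direct: the text is taken from a single column of the table documents (doc_name, author, text)
• Multi-column: several columns of documents (doc_name, author, text) are combined into one document, each column marked by its own tag:
    <doc_name> ... <author> ... <text> ...

Allowed datatypes:
• CHAR
• VARCHAR
• VARCHAR2
• BLOB
• CLOB
• BFILE
• XMLType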
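
As a sketch of the multi-column case, the following creates a MULTI_COLUMN_DATASTORE preference for the documents table from the slide. The preference and index names (docs_multi_ds, documents_idx) are made up for this example, and depending on the Oracle release, creating this kind of preference may require additional privileges.

```sql
BEGIN
  -- Datastore preference that concatenates several columns into one document
  ctx_ddl.create_preference('docs_multi_ds', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('docs_multi_ds', 'COLUMNS', 'doc_name, author, text');
END;
/

-- The index is created on one column, but its content comes from all three
CREATE INDEX documents_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_multi_ds');
```

For the direct case no preference is needed: an index created without a DATASTORE parameter reads the indexed column itself.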

Datastore: Detail and Nested

• Detail: master table documents (doc_name, author); the text lines are stored in the detail table doc_details (doc_name, seq_no, text)
• Nested: table documents (doc_name, author, doc_nst), where doc_nst is a nested table (seq_no, text)
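
For the detail case, a minimal sketch using the documents and doc_details tables from the slide could look as follows. The preference and index names are hypothetical, and using author as the indexed master column is an assumption made only to have a column to attach the index to.

```sql
BEGIN
  ctx_ddl.create_preference('docs_detail_ds', 'DETAIL_DATASTORE');
  -- FALSE: the rows are text lines, joined with a newline between them
  ctx_ddl.set_attribute('docs_detail_ds', 'BINARY',        'FALSE');
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_TABLE',  'doc_details');
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_KEY',    'doc_name');
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_LINENO', 'seq_no');
  ctx_ddl.set_attribute('docs_detail_ds', 'DETAIL_TEXT',   'text');
END;
/

-- The index is created on the master table; the indexed text
-- is assembled from the matching doc_details rows in seq_no order.
CREATE INDEX documents_detail_idx ON documents(author)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_detail_ds');
```

Changes to detail rows alone do not mark a document for re-indexing; the indexed master column has to be updated as well.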

Indexing data - Data Model

[Figure: the demo data model shown earlier, with the tables customers, addresses, countries, services, bank_accounts and contracts]

Indexing data: Oracle Text features

• User datastore – a PL/SQL procedure delivers the contents to be indexed
• AUTO_SECTION_GROUP – instructs Oracle to create a separate section for each XML tag and index only its value
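
The two features can be combined roughly as follows. This is a sketch under several assumptions, not Direct Info's actual code: the procedure, preference and index names are invented, only three columns of the demo customers table are exported, and remarks is chosen arbitrarily as the indexed column. In older releases the user datastore procedure must be owned by CTXSYS.

```sql
-- Called by Oracle Text once per row; writes the document to be indexed into tlob
CREATE OR REPLACE PROCEDURE customer_to_xml (
  rid  IN            ROWID,
  tlob IN OUT NOCOPY CLOB
) IS
  doc VARCHAR2(4000);
BEGIN
  FOR c IN (SELECT id, first_name, last_name
              FROM customers
             WHERE rowid = rid) LOOP
    -- DBMS_XMLGEN.CONVERT escapes characters such as < and &
    doc := '<customer>'
        || '<id>' || c.id || '</id>'
        || '<first_name>' || DBMS_XMLGEN.CONVERT(c.first_name) || '</first_name>'
        || '<last_name>'  || DBMS_XMLGEN.CONVERT(c.last_name)  || '</last_name>'
        || '</customer>';
    DBMS_LOB.WRITEAPPEND(tlob, LENGTH(doc), doc);
  END LOOP;
END;
/

BEGIN
  ctx_ddl.create_preference('customer_ds', 'USER_DATASTORE');
  ctx_ddl.set_attribute('customer_ds', 'PROCEDURE', 'customer_to_xml');
  -- One section per XML tag, created automatically
  ctx_ddl.create_section_group('customer_sg', 'AUTO_SECTION_GROUP');
END;
/

CREATE INDEX customers_text_idx ON customers(remarks)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE customer_ds SECTION GROUP customer_sg');

-- Search within a specific attribute
SELECT id
  FROM customers
 WHERE CONTAINS(remarks, 'Claus WITHIN last_name') > 0;
```

With a user datastore, a row is queued for re-indexing only when the indexed column itself changes, so an application usually has to touch that column when any of the exported data is modified.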

Indexing data: putting it all together

Tables (customers, addresses, countries, services, bank_accounts, contracts) -> Data + Metadata Extraction -> XML document -> Data Indexing -> Oracle Text Index

Example of an extracted XML document:

    <customer>
      <id>50</id>
      <code>635</code>
      <customer_type>Person</customer_type>
      <personal_data>
        <title />
        <first_name>Jurgen</first_name>
        <last_name>Claus</last_name>
        <other_names />
        <profession>Software Engineer</profession>
        <nationality>Germany</nationality>
        <date_of_birth>28.05.1935</date_of_birth>
      </personal_data>
      <addresses>
        <address>
          <country>Germany</country>
          <postal_code>80995</postal_code>
          <city>München</city>
          <street>Dachauer Str. 665</street>
          <po_box />
        </address>
        <address>
          <country>Germany</country>
          ...

How easy it is in Google

• Results presented in pages
• Link to open the document
• Highlighted text fragments
• Full document location (document context)

Search Results Presentation

• Results presented in pages
• Link to open the customer edit application
• Location of the keyword found
• Extended customer info in balloon window
• Most important info: address and contacts
• Highlighted text fragments
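
Two of these features, paging and highlighted fragments, can be approximated with plain Oracle SQL. The sketch below is not taken from Direct Info; it assumes a CONTEXT index with the hypothetical name customers_text_idx on customers.remarks, a page size of 10, and an Oracle release in which CTX_DOC.SNIPPET is available.

```sql
-- Page 2 of the hit list (hits 11 to 20), ordered by relevance
SELECT id, last_name, relevance
  FROM (SELECT q.*, ROWNUM AS rn
          FROM (SELECT id, last_name, SCORE(1) AS relevance
                  FROM customers
                 WHERE CONTAINS(remarks, 'Claus', 1) > 0
                 ORDER BY SCORE(1) DESC) q
         WHERE ROWNUM <= 20)
 WHERE rn > 10;

-- Address documents by ROWID for the CTX_DOC calls in this session
BEGIN
  ctx_doc.set_key_type('ROWID');
END;
/

-- Text fragment around each hit, with the keyword wrapped in <b> ... </b>
SELECT id,
       ctx_doc.snippet('customers_text_idx', ROWIDTOCHAR(rowid),
                       'Claus', '<b>', '</b>') AS fragment
  FROM customers
 WHERE CONTAINS(remarks, 'Claus') > 0;
```

In an application the search term and the page boundaries would be bind variables rather than literals.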

Summary

Direct Info uses Oracle Text as a solid platform for creating an advanced full-text search solution

• Powerful text search capabilities
• Advanced results presentation features
• Rich features to judge the results
• Pluggable into existing applications

Want to know more?

Company: Semantec GmbH
Name: Krasen Paskalev, Armin Singer
Address: Benzstr. 32, D-71083 Herrenberg, Germany
Telephone: +49(7032)9130-0
Telephone: +49(7032)9130-12
Fax, E-Mail: +49(7032)[email protected]@semantec.de
Internet: www.semantec.de

Meet us here -> booth C10 on the ground floor