
ELSEVIER Data & Knowledge Engineering 21 (1997) 183-195


Semantic integration of conceptual schemas

Isabelle Mirbel

I3S, CNRS-URA 1376, 250 Avenue Albert Einstein, Sophia-Antipolis, 06560 Valbonne, France

Abstract

Our goal is to work out an integration process which produces a global design schema from several schemas, each of them describing the same reality viewed in different ways, in order to obtain the fullest view. Problems and conflicts arise during schema integration. They are due to the various ways of representing semantic knowledge and of structuring knowledge (using the same design model). Whereas the detection and solution of structural problems are model dependent, the detection and solution of semantic problems are not. To represent the semantics of the words used in a schema, we have defined a thesaurus model drawn from the domain dealing with the meaning of words: linguistics. In this paper, we show the interest of using this fuzzy thesaurus when design schemas are being integrated.

Keywords: Design schema; Integration; Semantics; Linguistics

1. Introduction

When designing information systems, it often happens that several designers work together on the same real world. This may be necessary because of the extent of the work to be done; it may also be motivated by the necessity of taking into account the points of view of designers who belong to various domains, since not all these people perceive the real world in the same way. Our goal is to work out a process which provides a global design schema obtained from several schemas, each depicting the same reality according to a different way of perceiving things, in order to obtain the fullest view of the part of the real world being examined.

The first integration tools proposed were based on the relational model [2], or on the entity-relationship model, the extended entity-relationship model, and other semantic models [19]. Some research work on object-oriented models has been published recently [20]. Our approach is in line with this type of work. Our goal is not to define a new object-oriented model. So, we have built our integration process on the most often used notions [17,3,4] (attribute, method, class, reference link, inheritance).

Problems and conflicts arise during schema integration. They are due to the various existing ways of representing knowledge and of structuring knowledge (using one model only) [1,7,16]. The semantics provided by design schemas are increasingly used when design schemas are being integrated [10,14]. Our approach gives priority to the semantic aspect of design schemas.

0169-023X/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0169-023X(96)00032-8


2. The comparison criteria

In order to integrate the design schemas, we must compare their elements (attributes, methods, classes and links). We compare pairs of elements which are composed of one element from the first schema and one element from the second schema. In order to find pairs of elements which are integrable, we have defined several comparison criteria:
• An element has a name which specifies the semantics it provides. However, we can find different names which depict the same concept (synonyms), and similar names which depict different concepts (homonyms).
• An element has a specific function in its schema. It is either an attribute, or a method, or a class, or a reference link.

So, we have defined three criteria to compare the elements of the design schemas to be integrated:
• a semantic-likeness criterion, to find synonym elements;
• a semantic-ambiguity criterion, to find homonym elements;
• a structural-distance criterion, to compare the elements' functions.

2.1. The semantic criteria

Whereas the detection and solving of structural problems are model dependent, the detection and solving of semantic problems are not. To represent the semantics of the words used in the schemas, we have defined a model inspired by a domain especially concerned with the meaning of words: linguistics [8]. For linguists, a knowledge model is a multidimensional space, in which intersecting axes represent conceptual primitives. A concept, a knowledge unit, may be represented and identified uniquely by referring to its coordinates on each axis. Listing the values of the concept with respect to each axis is equivalent to defining its position in the knowledge space [18]. In our data structure, we distinguish words (or terms) from concepts (or meanings), following the work of Quillian, who was the first to make this distinction [9]. Each concept is represented by a sentence, the most explicit one, which can sometimes be a definition. We make the distinction between words and concepts in order to be able to work at a level (the concept level) where knowledge can be defined very precisely, without ambiguity, and without synonym and homonym problems. Concepts are linked together by conceptual relationships of a semantic kind. Interpreting relationships between words and concepts enable meanings to be given to words.

The use of the thesaurus allows us to quantify word likeness with a degree, and to qualify it by a kind of likeness, in order to locate one word semantically with respect to another. Quantifying the likeness allows one possible integration to be chosen among several. Qualifying the likeness tells us what kind of integration must be made between the two elements.

2.1.1. The thesaurus structure

We define two types of relationships in our thesaurus: conceptual relationships of a semantic type, which link concepts together; and interpreting relationships, which link words and concepts together in order to explain their meanings. We will present each of them.

Conceptual relationships of a semantic type: we use two types of conceptual relationships, the generic relationship and the aggregation relationship. These relationships are the most often used. Many other relationships exist which link concepts together [18,6]. But keeping them inside our thesaurus would bring nothing more, because we cannot exploit them. We must not forget that this thesaurus is only a tool for schema integration, which enables us to know the proximity between words in order to know how to integrate them.

In order to capture more semantics, we grade the membership of these relationships with likelihood degrees, varying between 0 and 1. In this way we can represent categories with ill-defined boundaries, situations between everything and nothing, like the quasi-generic relationship [18], or the fact that a component is optional or not [18], and to what extent.

We distinguish several types of aggregation relationships. D.A. Cruse [8] and J. Lyons have shown in their work the existence of transitive and non-transitive aggregation relationships [13], that is to say the preservation, or not, of the part features. (This feature will be used in the thesaurus exploitation phase.)

The interpreting relationships: the interpreting relationships allow words and concepts to be linked together. As was done previously, to have a better representation of the real world, we define them as fuzzy relationships which indicate the probability of the concept being a meaning for the word.

Thanks to this type of relationship, we can determine both the semantic likeness of words and their ambiguity. Two semantically-close words are words linked to concepts which are themselves linked together. An ambiguous word is a word with several interpreting relationships, that is to say several meanings.

Fig. 1 summarizes all the notions described above.
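The structure described above can be sketched as a small data structure. This is only an illustrative sketch, not the author's implementation: the class name `Thesaurus`, its method names, and the example words and concepts are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Thesaurus:
    """Hypothetical container for the two kinds of fuzzy relationships."""
    # (word, concept) -> degree in [0, 1]: interpreting relationships
    interpreting: dict = field(default_factory=dict)
    # (concept, concept) -> (kind, degree): conceptual relationships,
    # where kind is "generic" or "aggregation"
    conceptual: dict = field(default_factory=dict)

    @staticmethod
    def _check(degree):
        if not 0.0 <= degree <= 1.0:
            raise ValueError("degree must lie between 0 and 1")

    def add_meaning(self, word, concept, degree):
        self._check(degree)
        self.interpreting[(word, concept)] = degree

    def add_relationship(self, source, target, kind, degree):
        if kind not in ("generic", "aggregation"):
            raise ValueError("unknown kind of conceptual relationship")
        self._check(degree)
        self.conceptual[(source, target)] = (kind, degree)

    def meanings(self, word):
        """Concepts linked to a word, with their interpreting degrees."""
        return {c: d for (w, c), d in self.interpreting.items() if w == word}

    def is_ambiguous(self, word):
        """A word with several interpreting relationships is ambiguous."""
        return len(self.meanings(word)) > 1
```

Keeping words and concepts in separate keys mirrors the distinction the text draws between the word level and the concept level.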

Fig. 1. The thesaurus structure. (Diagram: words 1 to 5 linked to concepts by weighted interpreting relationships; concepts linked to one another by weighted aggregation and generic relationships.)

2.1.2. The thesaurus exploitation

The use of the thesaurus enables us to know which words are ambiguous, and when they can be used with different meanings. It also allows the detection of similar words in the schemas under study. Two lexical relationships often appear in linguistic papers: the synonym relationship and the homonym relationship [12]. In our structure, lexical relationships do not appear explicitly, because they can be deduced from interpreting relationships. In addition, in order to capture more semantics, we speak more globally about semantic likeness and ambiguity. We deduce weighted knowledge from conceptual and interpreting relationships.

The ambiguous word treatment: to know a word's degree of ambiguity, we must look at its connection to several concepts, that is to say to several meanings. A word linked to several concepts can be considered ambiguous. But we think it is more interesting to qualify this ambiguity. A word is linked to concepts by interpreting relationships, each of which has a probability coefficient, varying between 0 and 1, which indicates to what extent the word is linked to the concept. We can use the information carried by these coefficients to detect words having a tendency to ambiguity. To do so, we compare the probability coefficients of all the interpreting relationships connected to the word. If the coefficients have close values, the different meanings are used as often as each other. If the values are very different, some meanings are used more than the others. In this last case, the word is less ambiguous than in the first one. Therefore, a word's degree of ambiguity is defined by the formula given by Shannon in information theory.

So, when we find the same word in several design schemas, we then know its degree of ambiguity, which helps when carrying out the schema integration. It will avoid integrating a word used with different meanings in the different schemas.
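As an illustration of the ambiguity measure, here is a minimal sketch based on Shannon entropy. The text only says that Shannon's formula is used; the normalisation of the coefficients so that they sum to 1, and the division by log2 of the number of meanings to keep the result between 0 and 1, are assumptions made for this example, as is the function name `ambiguity_degree`.

```python
import math


def ambiguity_degree(coefficients):
    """Normalised Shannon entropy of a word's interpreting coefficients.

    Assumption: coefficients are rescaled to sum to 1 before the entropy
    is computed, and the entropy is divided by its maximum, log2(n).
    """
    positive = [c for c in coefficients if c > 0]
    if len(positive) < 2:
        return 0.0  # zero or one meaning: no ambiguity
    total = sum(positive)
    probabilities = [c / total for c in positive]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    return entropy / math.log2(len(probabilities))
```

With this sketch, coefficients with close values give a degree near 1, and very different coefficients give a lower degree, which matches the behaviour described in the text.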

The semantic likeness of word treatment: to know if two words have semantic likeness is more difficult than determining the ambiguity of one of them, because it requires looking at the concepts linked to the two words examined. The most interesting case is when the first word has some concepts which are not directly linked to the concepts which are linked to the second word. In this case, it is first necessary to evaluate the proximity between the concepts which are linked to the examined words. Therefore, we will present the cases where transitivity is allowed. In the next table, we recapitulate the various cases where different relationships succeed each other. Let $c_a$, $c_b$ and $c_c$ be the examined concepts, such that $\exists f(c_a, c_b)$ and $\exists f(c_b, c_c)$. In rows, we find the different types of relationships that can link $c_a$ to $c_b$. In columns, we find the different types of relationships that can link $c_b$ to $c_c$.

Table: succession of two relationships. Rows: relationship linking $c_a$ to $c_b$ (generic relationship, aggregation relationship). Columns: relationship linking $c_b$ to $c_c$ (generic relationship, aggregation relationship). (The cell contents are illegible in this copy.)

When we replace two relationships by a single one, the new relationship has to be associated with a fuzzy coefficient, calculated from the two initial ones. We have differentiated several cases. Let $c_a$ and $c_c$ be the two examined concepts, and $c_b$ be the concept linked to both $c_a$ and $c_c$.


• If the three concepts are linked together with the same type of relationship, then the semantic distance between the two concepts is:

$$\mathrm{dist} = f(c_a, c_b) \times f(c_b, c_c)$$

• If the two relationships are transitive, we see that the aggregation relationship carries more semantics than the generic one. Therefore, the semantic distance depends on the aggregation coefficient:

$$\mathrm{dist} = 0.9 \times f(x, y),$$

where $f(x, y)$ represents the aggregation relationship.
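The two rules can be sketched as a single function. This is a reading of the text rather than a confirmed implementation: it assumes that the second rule applies when a generic and an aggregation relationship succeed each other, and the name `compose_coefficients` and its parameters are hypothetical.

```python
def compose_coefficients(kind_ab, f_ab, kind_bc, f_bc, mixed_weight=0.9):
    """Coefficient of the relationship deduced between c_a and c_c.

    kind_ab, kind_bc: "generic" or "aggregation".
    f_ab, f_bc: fuzzy coefficients of the two successive relationships.
    Assumption: the 0.9 rule is used when the two kinds differ.
    """
    for kind in (kind_ab, kind_bc):
        if kind not in ("generic", "aggregation"):
            raise ValueError("unknown kind of conceptual relationship")
    if kind_ab == kind_bc:
        # Same type of relationship on both sides: product of coefficients.
        return f_ab * f_bc
    # Mixed case: only the aggregation coefficient is kept, scaled by 0.9.
    aggregation = f_ab if kind_ab == "aggregation" else f_bc
    return mixed_weight * aggregation
```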

Now, we can calculate the degree of likeness of the two words from the deduced relationship between them. Let $m_a$ and $m_b$ be the two examined words. Let $C_a = \{c_{a1}, c_{a2}, \ldots, c_{am}\}$ be the set of concepts linked to $m_a$, and $C_b = \{c_{b1}, c_{b2}, \ldots, c_{bm}\}$ the set of concepts linked to $m_b$. In $C_a$ and $C_b$, concepts from $m_a$ and $m_b$ are directly linked two by two by the new deduced relationships ($\forall k \in [1, m]$, $\exists f = f_s(c_{ak}, c_{bk})$ or $\exists f = f_s^{-1}(c_{ak}, c_{bk})$ or $\exists f = f_a(c_{ak}, c_{bk})$ or $\exists f = f_a^{-1}(c_{ak}, c_{bk})$). We define the semantic likeness degree $D$ between $m_a$ and $m_b$ by:

$$D(m_a, m_b) = \max_{c_{ak} \in C_a,\ c_{bk} \in C_b} \min\big(f_{cpt}(m_a, c_{ak}),\ f(c_{ak}, c_{bk}),\ f_{cpt}(m_b, c_{bk})\big)$$

where $f_{cpt}$ represents the interpreting relationships, and $f$ the deduced ones.

We are also able to qualify this likeness by a type of likeness, which can be: synonym, but also generalization, specialization, composed or component. More information about this thesaurus can be found in [15].
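The max-min definition of the likeness degree can be sketched as follows. The dictionary layout, the function name `likeness_degree`, and the choice to give a coefficient of 1 to a concept shared by both words are assumptions made for this example.

```python
def likeness_degree(meanings_a, meanings_b, deduced):
    """Max-min semantic likeness degree between two words.

    meanings_a, meanings_b: {concept: interpreting coefficient} per word.
    deduced: {(concept_of_a, concept_of_b): coefficient} for the deduced
    relationships; unlisted concept pairs count as unrelated (0).
    Assumption: a concept shared by both words is linked to itself with 1.
    """
    best = 0.0
    for concept_a, f_a in meanings_a.items():
        for concept_b, f_b in meanings_b.items():
            if concept_a == concept_b:
                f = 1.0
            else:
                f = deduced.get((concept_a, concept_b), 0.0)
            # The weakest link of the path word-concept-concept-word decides.
            best = max(best, min(f_a, f, f_b))
    return best
```

Taking the minimum along each path and the maximum over all paths is the usual max-min composition of fuzzy relations, which is what the formula above expresses.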

2.2. The structural-distance criteria

The structural distance between two elements having different functions results from the analysis of their difference in complexity. We define the element functions as follows:
• the attribute function indicates the notion of static information;
• the method function indicates the notion of dynamic information;
• the class function indicates the notion of a set of some static and some dynamic items of information;
• the link function indicates the notion of a relationship between sets of some static and some dynamic items of information.

We give relative weights to the complexities of these elements in order to locate the elements of the schemas in relation to one another. The complexity scale is the following. Let $\alpha$ be the complexity of the attribute function, $\beta$ the complexity of the method function, $\gamma$ the complexity of the class function and $\delta$ the complexity of the link function. These values are defined such that $\alpha = 0$, $\delta = 1$, and $\alpha < \beta < \gamma < \delta$. We calculate the structural distance by subtracting the complexity values of the functions of the two elements being compared.
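A minimal sketch of the structural-distance computation follows. The text fixes only the two ends of the scale (attribute at 0, link at 1) and the ordering; the intermediate values 1/3 and 2/3 for method and class, the use of the absolute value, and the names `COMPLEXITY` and `structural_distance` are assumptions.

```python
# Hypothetical complexity scale: only 0 for attribute, 1 for link and the
# ordering attribute < method < class < link come from the text.
COMPLEXITY = {
    "attribute": 0.0,
    "method": 1.0 / 3.0,
    "class": 2.0 / 3.0,
    "link": 1.0,
}


def structural_distance(function_a, function_b):
    """Difference between the complexities of two element functions.

    Assumption: the absolute value is taken so that the distance does not
    depend on the order of the two elements.
    """
    return abs(COMPLEXITY[function_a] - COMPLEXITY[function_b])
```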

3. The integration process

The process is divided into four steps:
• The first step consists in finding pairs of integrable elements according to the criteria defined above. This is the likeness step;
• The next step consists in studying the different possible integration choices, with regard to the likenesses found in the previous step;
• The third step consists in integrating the schemas;
• The last step consists in superposing a vocabulary level for each designer onto the result schema.

3.1. The likeness step

3.1.1. The different situations

The schema comparison is structural and semantic. Several situations can be found whilst comparing two elements. The two elements can be:
• semantically and structurally near (1),
• semantically near, but structurally distant (2),
• structurally near, but semantically distant (3),
• semantically and structurally distant (4).

We can remove case 4, because it depicts pairs of elements that have nothing in common. We also think that case 3 is not significant, because we cannot rely only on the structure of the elements to decide whether to integrate them; the schemas provide a semantic dimension which must remain a priority. Case 1 is the case we most often detect. It groups pairs of elements which appear with close names and similar structures. Case 2 is also often detected, and we must not neglect it. Indeed, we can find two design schemas where the same concept appears, and it is possible that the two designers depict it with different structures: the first one could depict it as an attribute, the other as a class. All combinations are possible. For example, we illustrate the case of semantic similarity between an attribute and a class in Fig. 2.

The integration of pairs of elements in case 1 needs only the recognition of their similarities, so that they do not appear several times in the result schema. The integration of pairs of elements in case 2 needs a preliminary change in the schemas, in order to remove the detected structural conflicts. That is why we start by dealing with the cases of semantic likeness and structural distance (2) before dealing with the cases of semantic and structural closeness (1).

Fig. 2. The solution of a conflict between an attribute and a class. (Diagram: a class Flower with elements color, flowering time and species in one schema; a class Species with elements name, feature and origin in the other; the resolved schemas show Flower linked to the class Species.)

3.1.2. The likeness of semantically-near and structurally-distant elements

Here, we look at structurally different pairs of elements: attribute/method, attribute/class, attribute/link, method/class, method/link, class/link; we try to find pairs with a semantic similarity. The likeness phase is divided into:
• a semantic-likeness phase
• a structural-likeness phase
• a confrontation step between semantic and structural likenesses

We will present these different phases.

The semantic-likeness step: in the likeness phase, we consider pairs of elements (one element from each schema) which are close to one another. We also consider ambiguous elements.

• The detection of semantically-close elements:
First case: the elements of the pair under study have the same name. We give these pairs the maximum likeness coefficient, 1, and the type of likeness is Sy (for synonym).
Second case: the elements of the pair under study have close names. With the help of the thesaurus and the semantic-likeness criterion, we associate each pair with a likeness coefficient and a type of likeness.
Once this first step dealing with semantic likeness is completed, we have a list of pairs of elements which are candidates for integration. Each pair $(elt_{s1}, elt_{s2})$ has a coefficient $coeff_{sem\_clos}$, which quantifies the likeness between its elements, and a type $type_{sem\_clos}$, which qualifies it. Thus, we obtain a list of results expressed in the following way: $(elt_{s1}, elt_{s2}, coeff_{sem\_clos}, type_{sem\_clos})$. This step allows us to detect any possible synonyms. However, some elements may have a tendency to be used with different meanings; they may tend to be ambiguous. That is why the second part of the semantic-likeness step deals with this point.

• Detection of ambiguous elements: some elements may appear in several pairs and consequently may be candidates for integration several times. This can result from the fact that a word may have several meanings, that is to say it can be ambiguous. If an element from the first schema can be brought closer to two elements of the second schema, it seems wise to examine their ambiguity in order to favour the integration of the less ambiguous ones. That is why we calculate the ambiguity coefficient of each element in the pairs. This coefficient is calculated with the help of the thesaurus and the semantic-ambiguity criterion presented above. We associate with each pair of elements which is a candidate for integration an ambiguity coefficient, which is the average of the ambiguity coefficients of its elements. Thus, to some extent, we avoid integrating homonym elements.

We put together the semantic likeness and the semantic ambiguity of each pair in a global semantic coefficient such as:

$$coeff_{sem} = \frac{n \times coeff_{sem\_clos} + p \times (1 - coeff_{sem\_amb})}{n + p}$$

where $n > p$ (likeness has priority over ambiguity). At the end of the semantic-likeness step, we obtain a set of pairs of elements which are candidates for integration. Each pair looks like: $(elt_{s1}, elt_{s2}, coeff_{sem}, type_{sem})$.

The structural-likeness step: in this step, we work on the pairs resulting from the previous step. We associate with each pair a coefficient and a type of structural likeness, using the structural-distance criterion defined above. We combine the information concerning the structure with that concerning the semantics of each pair of elements in a global coefficient, such as:

$$coeff = \frac{m \times coeff_{sem} + q \times (1 - coeff_{str})}{m + q}$$

where $m > q$ (the semantic dimension has priority over the structural dimension). At the end of this step, we obtain a set of pairs of elements which are candidates for integration. Each pair looks like: $(elt_{s1}, elt_{s2}, coeff, type_{sem}, type_{str})$, where $type_{str}$ indicates the functions of $elt_{s1}$ and $elt_{s2}$ in the schemas they belong to.
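The two weighted combinations used in the likeness step (semantic likeness with ambiguity, then semantic coefficient with structural distance) share the same shape and can be sketched together. The function names and the default weights 2 and 1 are assumptions; the text only requires n > p and m > q.

```python
def _weighted(primary, penalty, w_primary, w_penalty):
    """(w1 * primary + w2 * (1 - penalty)) / (w1 + w2), with w1 > w2 >= 0."""
    if not w_primary > w_penalty >= 0:
        raise ValueError("the first weight must exceed the second")
    return (w_primary * primary + w_penalty * (1.0 - penalty)) / (w_primary + w_penalty)


def semantic_coefficient(closeness, ambiguity, n=2.0, p=1.0):
    """Global semantic coefficient of a pair.

    closeness: semantic likeness coefficient of the pair.
    ambiguity: average ambiguity coefficient of the two elements.
    n > p gives likeness priority over ambiguity (weights are assumed).
    """
    return _weighted(closeness, ambiguity, n, p)


def global_coefficient(coeff_sem, structural_distance, m=2.0, q=1.0):
    """Global coefficient mixing semantics and structure, with m > q."""
    return _weighted(coeff_sem, structural_distance, m, q)
```

For instance, with these assumed weights, a pair with likeness 0.9 and ambiguity 0.3 gets a semantic coefficient of 2.5/3, and a structural distance lowers the global coefficient less than an equal drop in the semantic coefficient would.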

Confrontation between semantic and structural similarities: through the previous steps, we defined several types of structural and semantic likeness in order to qualify the similarities between the elements. It is now necessary to confront these types of similarities to make sure they are compatible. Indeed, the semantic-likeness types known as Sy and Gen/Spe are compatible with all the structural-distance types. But this is not so for the composed/component type, which is not compatible with all the structural types. Just as it seems normal to consider that there may exist a semantic likeness of the composed/component type between an attribute and a class, so it seems difficult to consider it between an attribute and a method, or between a method and a link. That is why we isolated these two incompatibilities through the confrontation step. The pairs corresponding to this description are removed from the sets of pairs of elements which are candidates for integration.
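The confrontation step amounts to a filter over the candidate pairs. The sketch below assumes that each pair is a dictionary with hypothetical keys `type_sem` and `type_str`, and that the composed/component likeness type is spelled `"composed/component"`; none of these names come from the paper.

```python
# The two structural combinations that the text declares incompatible with
# a composed/component semantic likeness.
INCOMPATIBLE_STRUCTURES = {
    frozenset({"attribute", "method"}),
    frozenset({"method", "link"}),
}


def confront(pairs):
    """Drop the candidate pairs whose semantic and structural types clash.

    Each pair is a dict with (assumed) keys:
      "type_sem": "Sy", "Gen/Spe" or "composed/component"
      "type_str": tuple with the functions of the two elements
    """
    kept = []
    for pair in pairs:
        clash = (
            pair["type_sem"] == "composed/component"
            and frozenset(pair["type_str"]) in INCOMPATIBLE_STRUCTURES
        )
        if not clash:
            kept.append(pair)
    return kept
```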

3.1.3. Likeness of semantically-and-structurally close elements

Here, we look at the pairs of elements structurally close to one another: attribute/attribute, method/method, class/class, link/link. And we try to bring together the elements that have similar names. Here only the semantic-likeness step (presented in the previous paragraph) is necessary.

3.2. Definition of integration choices

At the end of the previous step, some elements of the first schema may appear close to several elements of the second schema, and vice-versa. At this stage of our reasoning, it seems difficult to choose which likeness must be favoured in order to achieve the best integration. Therefore, we decided not to neglect any likeness and to try every possible likeness for a given element. But this cannot be done in one single integration (we cannot integrate an element of the first schema with several elements of the second schema, and vice-versa). So, we split all the likenesses resulting from the previous step into several sets, each one containing at most one likeness per element of the first schema, and at most one likeness per element of the second schema. If we apply this method to all the elements which appear several times in the likeness set, we build a tree. Each leaf of this tree is a likeness set, where each pair deals with different elements. This type of set is called a coherent likeness set (see Fig. 3). We can now produce as many solutions as there are different sets. For each solution, we carry out the integration following the choices dictated by the likenesses of its set. The quality of each result schema depends on the likeness coefficients of the pairs of its original set.
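The construction of coherent likeness sets can be sketched as a backtracking enumeration. This is one possible reading, not the author's algorithm: it assumes that a coherent set must also be maximal (no remaining likeness could be added without reusing an element), and the function name `coherent_sets` is hypothetical.

```python
def coherent_sets(pairs):
    """Enumerate the maximal sets of pairs in which every element of either
    schema appears at most once.

    pairs: list of (element_of_schema_1, element_of_schema_2) likenesses.
    """
    results = []

    def extend(index, chosen, used_first, used_second):
        if index == len(pairs):
            # Keep the set only if every likeness is either chosen or
            # blocked by a chosen one (assumed maximality condition).
            if all(a in used_first or b in used_second for a, b in pairs):
                results.append(list(chosen))
            return
        a, b = pairs[index]
        if a not in used_first and b not in used_second:
            extend(index + 1, chosen + [(a, b)],
                   used_first | {a}, used_second | {b})
        extend(index + 1, chosen, used_first, used_second)

    extend(0, [], frozenset(), frozenset())
    return results
```

Applied to the likenesses (A,B), (A,C), (D,C), (D,G) and (E,F), this sketch yields three sets: one with (A,B), (D,C), (E,F), one with (A,B), (D,G), (E,F), and one with (A,C), (D,G), (E,F). Each corresponds to one candidate integration.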

3.3. The integration step

If we find a common concept (via the similarity of the names) expressed with different structures in the two schemas under study, we can say that the two schemas are clashing. That is why the integration step is preceded by a conflict-solving step. The integration step is divided like this:
• schema changes in order to remove conflicts
• class integration
• link-hierarchy integration.

Fig. 3. Illustration of the integration choices level. (Diagram: a likeness list is split by successive integration choices for A, D and E into three coherent sets: (A,B,...), (D,C,...), (E,F,...); (A,B,...), (D,G,...), (E,F,...); and (A,C,...), (D,G,...), (E,F,...).)

In all the previous steps we have tried to find semantic likenesses in order to allow integration at link level. This last step (link-hierarchy integration) only needs structural rules. So, we will not present it in this paper, because it is of no interest here. For more information about it, see [5]. Now, we will briefly present each step, except the last one dealing with link integration.

3.3.1. The conflict-solving step

In order to remove the conflicts existing between the two schemas to be integrated, it is necessary to change them according to the information obtained about their overlapping at the likeness step. We must remove the structural conflicts between the schemas using their semantic dimension. It is important to underline that there is no semantic loss through these schema changes. Indeed, it is always the richest structure that is kept. That is why, for each case, we have described the changes that must be made in order to remove structural conflicts. As an example, we illustrate here the conflict-solving rule between an attribute and a class (Fig. 2).

3.3.2. The integration rules

The goal of our integration process is to obtain a single result schema by superposing the designer schemas. The object of the link-integration step is to succeed in superposing the different hierarchies. That is why it is necessary to define the classes as strictly equivalent or strictly different in the step preceding the link-hierarchy integration.

The class-integration rules: a class is composed of a name and a set of attributes and methods, which we include under the general term: element. These elements correspond to the concept carried by the class, and the name of the class corresponds to the name given to the concept carried. Our approach consists in examining the different possible combinations between the concept carried and the name given to it, in order to detect and solve any possible conflicts. For that purpose, we based ourselves on the works of B.R. Gaines and M.L.G. [11] in the knowledge acquisition domain. This helped us to represent the various cases encountered when classes of design schemas were being integrated. We have listed 6 of them that cover all situations. They are:


• Case 1: the two classes have the same name; they are composed of the same elements. In this case, the two classes are strictly identical. There is no conflict between these classes, we can take them as they are.

• Case 2: the two classes have different names; they have no common element. In this case, the two classes are strictly different. There is no conflict between these classes. The two designers have worked out two different concepts of the real world. These two concepts must appear in the result schema.

• Case 3: the two classes have the same name, but have no common element. In this case, we examine two concepts which are totally different but have the same name. There is a homonymy conflict between the two classes examined [1,7,16]. So we must change one name. After this change, the two classes are in case 2, and are not conflicting.

• Case 4: the two classes have different names but are composed of the same elements. In this case, we examine a single concept perceived by the two designers, but named differently. Here, there is a synonymy conflict between the two classes [1,7,16]. The action to be taken is to choose one of the two names and give it to the concept in the two schemas. After this change, the two classes are as in case 1, and are not conflicting.

Case 5 is more complex, and the changes to be made to remove conflicts differ depending on whether the class belongs to a generalization/specialization hierarchy or to a reference graph. So, we divide case 5 into two parts:

• Case 5a: the two classes have different names but some elements in common; they belong to generalization/specialization hierarchies. It is not possible to say whether the same concept has been perceived by the two designers, because of the divergence in the names used and in the elements found there. But we might think that one or several common abstraction levels have been omitted from the two schemas. In fact they are hidden. By means of the integration process, we extract this abstraction level (see Fig. 4). By integrating the design schemas, we have been able to obtain information which we would not have had if we had only examined one of the two schemas.

• Case 5b: the two classes have different names, but some common elements; they belong to reference graphs. In generalization/specialization hierarchies, we have noted a missing abstraction level in the common elements. In reference graphs, it is a decomposition level that is missing. Fig. 5 illustrates this case.

[Figure 4. Before integration: the class Tree (trunk color, bark kind, species, name) in one schema and the class Plant in the other. After integration: each schema contains a superclass Vegetable (name, species); its subclass is Plant (stem color, stem kind) in one schema and Tree (trunk color, bark kind) in the other.]

Fig. 4. The extraction of an abstraction level.
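The extraction described in case 5a can be sketched in Python as follows. This is only an illustrative sketch, not the paper's implementation: the `DesignClass` type and the `extract_abstraction_level` function are hypothetical names, elements are compared by plain string equality rather than through the fuzzy thesaurus, the name of the new superclass is supplied by the caller, and the elements `species` and `name` of `Plant` are assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DesignClass:
    """Hypothetical schema class: a name, its own elements, an optional superclass."""
    name: str
    elements: set = field(default_factory=set)
    parent: Optional["DesignClass"] = None


def extract_abstraction_level(c2, c3, super_name):
    """Move the elements shared by c2 and c3 into a new common superclass C1."""
    common = c2.elements & c3.elements
    if not common:
        raise ValueError("no common element: this is not case 5a")
    c1 = DesignClass(super_name, set(common))
    for cls in (c2, c3):
        cls.elements -= common  # keep only the specific elements
        cls.parent = c1         # attach the class under the extracted level
    return c1


# Example data modelled on Fig. 4 (Plant's shared elements are assumed).
tree = DesignClass("Tree", {"trunk color", "bark kind", "species", "name"})
plant = DesignClass("Plant", {"stem color", "stem kind", "species", "name"})
vegetable = extract_abstraction_level(tree, plant, "Vegetable")
```

After the call, `vegetable` holds the shared elements, while `tree` and `plant` keep only their specific elements and point to `vegetable` as their superclass.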


[Figure 5. Before integration: the class Plant (leaf color, leaf shape, stem color, stem kind) in Schema 1 and the class Tree (trunk color, bark kind, leaf color, leaf shape) in Schema 2. After integration: Tree (trunk color, bark kind) and Plant (stem color, stem kind) each reference, through a link named leaf, a class Leaf (leaf color, leaf shape).]

Fig. 5. The extraction of a decomposition level.
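The decomposition-level extraction of case 5b can be sketched in the same spirit. The names `RefClass` and `extract_decomposition_level` are hypothetical, elements are again compared by string equality instead of thesaurus likeness, and the component name and link name are given by the caller. For brevity, both classes reference one shared component object, whereas Fig. 5 shows a Leaf class in each schema.

```python
from dataclasses import dataclass, field


@dataclass
class RefClass:
    """Hypothetical class of a reference graph."""
    name: str
    elements: set = field(default_factory=set)
    references: dict = field(default_factory=dict)  # link name -> referenced class


def extract_decomposition_level(c_a, c_b, part_name, link_name):
    """Factor the shared elements out into a component class referenced by both."""
    common = c_a.elements & c_b.elements
    if not common:
        raise ValueError("no common element: this is not case 5b")
    part = RefClass(part_name, set(common))
    for cls in (c_a, c_b):
        cls.elements -= common            # the shared elements leave the class
        cls.references[link_name] = part  # and are reached through a reference
    return part


# Example data taken from Fig. 5.
plant = RefClass("Plant", {"leaf color", "leaf shape", "stem color", "stem kind"})
tree = RefClass("Tree", {"trunk color", "bark kind", "leaf color", "leaf shape"})
leaf = extract_decomposition_level(plant, tree, "Leaf", "leaf")
```

The only difference from the abstraction-level sketch is the kind of link created: a reference to a component instead of an inheritance link to a superclass.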

• Case 6: the two classes have the same name and some common elements; they belong to generalization/specialization hierarchies. This case corresponds to a situation where the two classes have the same name with some, but not all, elements in common. Our idea is to assume, in this case, that the two designers have seen the same reality, but from two different points of view. So, we have to enrich each class with the elements of the other, to obtain the same class in both schemas in the richest possible form. Fig. 6 illustrates this case.
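The case analysis above can be summarized as a small decision function. The sketch below is a simplification under stated assumptions: `classify_pair` is a hypothetical name, names and elements are compared by strict equality (the process described here relies on thesaurus likenesses instead), case 1 is taken to mean "same name, same elements" as implied by the description of case 4, and case 6 is returned for any same-name pair with partial overlap, although the text states it only for generalization/specialization hierarchies.

```python
def classify_pair(name_a, elems_a, name_b, elems_b, structure="hierarchy"):
    """Return the case label ('1', '2', '3', '4', '5a', '5b' or '6') for two classes.

    structure is 'hierarchy' (generalization/specialization) or 'reference'.
    """
    same_name = name_a == name_b
    a, b = set(elems_a), set(elems_b)
    if not (a & b):
        # No common element: homonymy if the names clash, else distinct concepts.
        return "3" if same_name else "2"
    if a == b:
        # Same elements: identical classes, or synonymy if the names differ.
        return "1" if same_name else "4"
    if same_name:
        return "6"  # partial overlap, same name: enrich each class by the other
    return "5a" if structure == "hierarchy" else "5b"


case_homonymy = classify_pair("Tree", {"trunk color"}, "Tree", {"node count"})
case_synonymy = classify_pair("Tree", {"trunk color"}, "Plant", {"trunk color"})
case_hidden_level = classify_pair(
    "Tree", {"name", "trunk color"}, "Plant", {"name", "stem color"}
)
case_enrichment = classify_pair(
    "Tree", {"leaf color", "trunk color"}, "Tree", {"leaf color", "leaf kind"}
)
```

With this example data, the four calls fall into cases 3, 4, 5a and 6 respectively.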

What makes our approach worthy of interest is the discovery of dependency relationships (through references) and generalization/specialization relationships between the elements of the two schemas, even if there are no equivalent classes in the two schemas. For example, we look at the vegetable class, which appears as the superclass of plant and tree in a first schema, and the shrub class, which appears as a class in the second schema. After the semantic-likeness step, we conclude that the shrub element is close to the vegetable element and that the proximity is of the Gen type. We are then able to integrate the two schemas in a proper way, that is to say, to place the class shrub as a subclass of the class vegetable. In Fig. 7, we give another example of a possible type of enrichment during integration. After this integration step at the class level, we can integrate the class hierarchies. The thesaurus allows us to blend the names given to the elements in the result schema. The vocabulary

[Figure 6. Before integration: Tree (trunk color, leaf color, leaf shape) in Schema 1 and Tree (leaf color, leaf kind, flower kind) in Schema 2. After integration: Tree (trunk color, leaf color, leaf shape, leaf kind, flower kind) in both schemas.]

Fig. 6. Enrichment of a class by another one.
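The enrichment of case 6 amounts to giving both classes the union of their elements. A minimal sketch, assuming elements are matched by plain string equality (the function name `enrich` is hypothetical), using the data of Fig. 6:

```python
def enrich(elems_a, elems_b):
    """Return the elements of the first class followed by those only in the second."""
    merged = list(elems_a)
    merged += [e for e in elems_b if e not in elems_a]
    return merged


schema1_tree = ["trunk color", "leaf color", "leaf shape"]
schema2_tree = ["leaf color", "leaf kind", "flower kind"]

# Both schemas receive the same enriched class.
enriched_tree = enrich(schema1_tree, schema2_tree)
# enriched_tree == ['trunk color', 'leaf color', 'leaf shape', 'leaf kind', 'flower kind']
```

Lists are used rather than sets only to keep the element order of the first schema stable in the result.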


[Figure 7: diagrams of Schema 1 and Schema 2 before and after integration; content not recoverable.]

Fig. 7. Example (variety, family, [illegible]).

used in the initial design schemas then retains its precision and richness. Consider the extraction of an abstraction level (a class C1) from two intersecting classes (C2 and C3). If some common elements of C2 and C3 appear in likenesses of Gen or Spe type, then we associate the most generic vocabulary with the elements of class C1, and we rename them with the most specific vocabulary in classes C2 and C3.

3.4. Vocabulary step

This last step consists in presenting the different solutions obtained by superposing one vocabulary level for each designer. During the previous step, we associated all the different vocabularies found in the initial schemas, with each common element, in order to present the result schema to each designer with his own vocabulary.
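One way to picture this step is as a renaming layer applied to the result schema for each designer. The sketch below is an assumption-laden illustration: the function `present`, the dictionary-based schema representation and the example vocabulary are all hypothetical and do not come from the process described above.

```python
def present(result_schema, vocabulary):
    """Rename the classes and elements of a result schema with one designer's words.

    result_schema maps class names to lists of element names;
    vocabulary maps terms of the result schema to the designer's own terms.
    Terms missing from the vocabulary are left unchanged.
    """
    def rename(term):
        return vocabulary.get(term, term)

    return {
        rename(cls): [rename(elem) for elem in elems]
        for cls, elems in result_schema.items()
    }


# Hypothetical result schema and vocabulary for a second designer.
result_schema = {"Tree": ["name", "trunk color", "bark kind"]}
designer_2_words = {
    "Tree": "Plant",
    "trunk color": "stem color",
    "bark kind": "stem kind",
}

view_for_designer_2 = present(result_schema, designer_2_words)
# view_for_designer_2 == {'Plant': ['name', 'stem color', 'stem kind']}
```

The result schema itself is left untouched, so one such view can be produced per designer from the same integrated schema.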

4. Conclusion

Our process allows the integration of object-oriented schemas. The integration is structural but above all semantic. To take into account the semantics provided by the design schemas, a tool like the fuzzy thesaurus is a valuable asset at all levels of the process:

• during the schema-element comparison: similar words (synonyms), but also ambiguous words (homonyms), can be detected;

• during the integration choices: several solutions to the integration of two design schemas can be proposed;

• during the integration: it helps to find possible integrations between the schemas, even if no two elements of the two schemas are strictly equivalent, by using generalization/specialization and/or aggregation likenesses;

• when the integration results are presented: it is possible to superpose a vocabulary level onto each result schema, in order to use the designer's own vocabulary when presenting the solution to him/her.

We hope to improve this process by studying schema-integration strategies, in order to guide the process with criteria such as class-number restriction, or the likeness to one initial schema.

References

[1] C. Batini, M. Lenzerini and S. Navathe, A comparative analysis of methodologies for database schema integration. ACM Computing Surveys 18(4) (1986) 323-364.


[2] J. Biskup and B. Convent, A formal view integration method, in Int. Conf. on the Management of Data, Washington (28-30 May 1986) ACM.

[3] J. Brunet, O*: a model for object-oriented analysis, rapport interne du laboratoire MASI, Université de la Sorbonne, Paris, 1991.

[4] X. Castellani, MCO, méthodologie générale d'analyse et de conception des systèmes d'objets (Masson, 1993).

[5] A. Cavarero and I. Mirbel, A design and integration process of multi-expertise schemata using a fuzzy object-oriented approach, in American Society of Mechanical Engineers Winter Annual Meeting, New Orleans (November 28-December 3 1993).

[6] R. Chaffin and D.J. Herrmann, The nature of semantic relations: a comparison of two approaches, in M.W. Evens, (ed.), Relational models of the lexicon (Cambridge Univ. Press, 1988) 289-334.

[7] I. Comyn-Wattiau, L'intégration de vues dans le système SECSI, Ph.D. thesis, Université Paris VI, France, Octobre 1990.

[8] D.A. Cruse, Lexical semantics, Cambridge textbooks in linguistics (Cambridge Univ. Press, 1986).

[9] M.W. Evens, Relational models of the lexicon: introduction, in M.W. Evens, (ed.), Relational models of the lexicon (Cambridge Univ. Press, 1988) 1-37.

[10] P. Fankhauser, M. Kracker and E.J. Neuhold, Semantic vs. structural resemblance of classes, SIGMOD Record 20(4) (1991) 59-63.

[11] B.R. Gaines and M.L.G. Shaw, Comparing the conceptual systems of experts, in Int. Joint Conf. on Artificial Intelligence (IJCAI), Detroit (1989) 633-638.

[12] G. Hirst, Semantic interpretation and the resolution of ambiguity (Cambridge Univ. Press, 1987).

[13] M.A. Iris, B.E. Litowitz and M.W. Evens, Problems of the part-whole relation, in M.W. Evens, (ed.), Relational models of the lexicon (Cambridge Univ. Press, 1988) 261-288.

[14] E. Métais, J.-N. Meunier and G. Levreau, Database schema design: a perspective from natural language techniques to validation and view integration, in 12th Int. Conf. on Entity-Relationship Approach ERA, Dallas, Maiott, Arlington, Texas, USA (15-17 December 1993) 190-205.

[15] I. Mirbel, A fuzzy thesaurus for semantic integration of design schemes, in J. Sharpe, (ed.), AI System Support for Conceptual Design (LIWED'95), Ambleside, United Kingdom (27-29 March 1995) Springer, 319-335.

[16] C. Parent and S. Spaccapietra, Intégration de vue et relativisme sémantique, in VI journées bases de données avancées, Montpellier, France (Septembre 1990).

[17] J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy and W. Lorensen, Object-oriented modelling and design (Prentice Hall, New Jersey, 1991).

[18] J.C. Sager, A practical course in terminology processing (John Benjamins, 1990).

[19] W.W. Song, P. Johannesson and J.A. Bubenko, Semantic similarity relations in schema integration, in Entity Relationship Approach, ER '92, 10th Int. Conf. on the Entity Relationship Approach, volume 645 of Lecture Notes in Computer Science (Springer Verlag, 1992) 97-120.

[20] C. Thieme and A. Siebes, An approach to schema integration based on transformations and behaviour, in Advanced Information Systems Engineering: Proc. 6th Int. Conf. on Advanced Information System Engineering (CAiSE '94), June 6-10, 1994, volume 811 of Lecture Notes in Computer Science (Springer Verlag, 1994) 297-310.

I. Mirbel received the D.E.A. in computer science (a diploma equivalent to the M.Sc. degree in computer science) from Nice-Sophia Antipolis University, France, in 1993. Since 1994, she has been a PhD student at the I3S Laboratory, Sophia Antipolis, France. Her main research area is data management, in particular design schema integration.