Neural-Symbolic Integration: A Self-Contained Introduction
Sebastian Bader Pascal Hitzler
ICCL, Technische Universität Dresden, Germany
AIFB, Universität Karlsruhe, Germany
Outline of the Course
I Introduction and Motivation
I The History of Neural-Symbolic Integration
I The Core Method for Propositional Logic
I The Core Method for First-Order Logic
Part I
Introduction and Motivation
Outline: Motivation, Connectionist Systems, Symbolic AI, Neural-Symbolic Integration
Why Neural-Symbolic Integration?
As we will see, connectionist systems and symbolic AI systems have quite contrasting advantages and disadvantages. We try to integrate both paradigms while keeping the advantages.
The Neural-Symbolic Cycle
[Figure: the neural-symbolic cycle — a symbolic system (readable and writable) is embedded into a connectionist system (trainable), and knowledge is extracted back from it.]
Connectionist Systems
I Inspired by nature.
I Massively parallel computational model.
I A connectionist system consists of ...
• a set U of units (input, hidden and output),
• a set of connections C ⊆ U × U, each labelled with a weight w ∈ R.
[Figure: a small example network with units x, y, z and connection weights −1.5, 0.3, −1.3, −0.7, 1.8, −1.6.]
Units of Connectionist Systems
A unit is characterised by ...
I Activation function, mapping the inputs ~i to the potential p:
p = ∑_n i_n · w_n   or   p = ∑_n (i_n − w_n)²
I Output function, mapping the potential p to the output o:
[Figure: common output functions — threshold, ramp, sigmoidal, tanh, Gaussian.]
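To make the two function types concrete, here is a minimal Python sketch of a single unit; the helper name and the pairing of a sigmoidal output with either activation are our own illustration, not part of the slides:

import math

# A unit: the activation function maps the inputs to the potential p,
# the output function maps p to the output o (here: sigmoidal).
def unit_output(inputs, weights, activation="weighted_sum"):
    if activation == "weighted_sum":              # p = sum_n i_n * w_n
        p = sum(i * w for i, w in zip(inputs, weights))
    else:                                         # p = sum_n (i_n - w_n)^2
        p = sum((i - w) ** 2 for i, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-p))             # sigmoidal output function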
Dynamics of a Network
[Figure: the example network with units x, y, z and weights −1.5, 0.3, −1.3, −0.7, 1.8, −1.6.]
Unit     Activation function        Output function
input    p set from outside         o = p
hidden   p = ∑_n (i_n − w_n)²       o = e^{−p²}
output   p = ∑_n i_n · w_n          o = p
[Figure: a sample run for the inputs −1.0 and 1.0 — at t=1 the hidden units yield potentials/outputs 0.74/0.5783 and 2.98/0.0001; at t=2 the output unit yields 1.039.]
Dynamics of a Network
[Plots: the resulting input-output behaviour for the two activation functions p = ∑_n (i_n − w_n)² and p = ∑_n i_n · w_n, each combined with its output function.]
Training of Connectionist Systems
? How can we train a network to represent a function given as a set of samples {(i_1, o_1), . . . , (i_n, o_n)}?
[Figure: sample points over (x, y), and the function a trained network fits through them.]
I Learning as generalization.
Backpropagation
I Let a set of samples {(i_1, o_1), . . . , (i_n, o_n)} be given.
I Error of the network: E = ∑_i (N(i_i) − o_i)².
I Idea: minimise E by gradient descent.
[Plot: the error E as a surface over the weight space; gradient descent follows the slope downhill.]
Backpropagation in Detail
1. Present a training sample to the network.
2. Compare the output of the network with the desired output.
3. Calculate the error in each output unit.
4. Modify the weights to the output layer such that the error decreases.
5. Propagate the error back to the last hidden units.
6. Compute the part of the error caused by the hidden units.
7. Modify the weights to the hidden units using this local error.
8. Continue until the input units are reached.
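The eight steps translate directly into code. A self-contained NumPy sketch on a toy 2-2-1 sigmoidal network; the XOR task, learning rate and initialisation are our own choices, and convergence depends on the random seed:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # sample inputs
Y = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs (XOR)

W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden weights
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)   # hidden -> output weights
sig = lambda p: 1.0 / (1.0 + np.exp(-p))

for _ in range(20000):
    H = sig(X @ W1 + b1)                 # step 1: present the samples (forward pass)
    O = sig(H @ W2 + b2)
    dO = (O - Y) * O * (1 - O)           # steps 2-3: error at each output unit
    dH = (dO @ W2.T) * H * (1 - H)       # steps 5-6: error propagated to hidden units
    W2 -= 0.5 * H.T @ dO; b2 -= 0.5 * dO.sum(axis=0)  # step 4: update output weights
    W1 -= 0.5 * X.T @ dH; b1 -= 0.5 * dH.sum(axis=0)  # step 7: update hidden weights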
A sample run ...
[Figure: the network before training (left, with initial weights) and after training (right, with learned weights), together with the functions over (x, y) they compute.]
Funahashi’s Theorem
Theorem (Ken-Ichi Funahashi, 1989)
Every continuous function f : K → R (with K ⊂ R^n compact) can be approximated arbitrarily well using 3-layer feed-forward networks with sigmoidal units.
[Figure: a sigmoidal unit and a function over (x, y) approximated by such a network.]
History of Connectionist Systems
1943 Warren Sturgis McCulloch and Walter Pitts publish "A logical calculus of the ideas immanent in nervous activity".
1969 Marvin Minsky and Seymour Papert publish "Perceptrons".
1974 Paul Werbos (1974), David Parker (1984) and David Rumelhart & Ronald Williams (1985) invent backpropagation.
1989 Ken-Ichi Funahashi publishes "On the approximate realization of continuous mappings by neural networks".
NETtalk
I Terrence J. Sejnowski and Charles R. Rosenberg, 1987
I Network learns the map between letters and phonemes.
I 3-layer feed-forward network with sigmoidal units:
• 203 input units: encoding a window of 7 letters
• 80 hidden units
• 26 output units: representing phonemes, punctuation ...
I Trained using samples of the form:
Word        phonemes    stress and syllable
logic       laJIk       > 1 < 0 <
programme   progr@m--   >> 1 >> 2 <<<
neural      nU-r-L      > 1 << 0 <
network     nEtw-Rk     > 1 <> 2 <<
ALVINN & MANIAC
ALVINN (Autonomous Land Vehicle In a Neural Network)
I Pomerleau, 1993
I Learns to control NAVLAB vehicles by watching a person.
I 3-layer feed-forward network with sigmoidal units:
• 960 input units: 30x32 units serve as a two-dimensional retina
• 5 hidden units
• 30 output units: representing the steering direction
MANIAC (Multiple ALVINN Networks In Autonomous Control)
I Jochem et al., 1993
I Multiple ALVINN networks, each for a certain type of road.
ALVINN, MANIAC & RALPH
The road for ALVINN, MANIAC & RALPH:
RALPH (Rapidly Adapting Lateral Position Handler)
I Pomerleau, 1995
I Drove in 9 days from Pittsburgh to San Diego (all but 50 of the 2850 miles).
TD-Gammon
Artificial Backgammon player (Gerry Tesauro, 1995):
I Based on a standard neural network.
I Learns by playing against itself.
I Reaches championship level.
Btw.:
I We play Backgammon symbolically.
I Can we learn from the network to play better?
Properties of Connectionist Systems
+ Massively parallel paradigm.
+ Continuous functions can be approximated arbitrarily well.
+ Well suited to learn and to adapt to new environments.
• Hardly any knowledge about the function is needed.
• Training from raw, incomplete and noisy data.
+ Degrades gracefully: even if parts of the system fail, the overall behaviour is ok.
+ Successfully used in many application areas.
− Declarative semantics is not available.
− Structured objects are hard to represent.
− Background knowledge cannot be used.
Symbolic AI
I Inspired by human thinking.
I Aristotle tried to formalize human thinking by means of logic (4th century BC).
I First "is-a" hierarchy by Porphyry of Tyros (260 AD).
I Use of logic-like languages for knowledge representation and reasoning.
SHRDLU
I Terry Winograd, MIT, 1968-70
I SHRDLU carried on a simple dialog about a small world.
I A sample dialog:
Person: Pick up a red block.
Computer: Ok. (does it)
Person: Grasp the pyramid.
Computer: I don't understand which pyramid you mean.
Person: Find a block which is taller than the one you are holding and put it into the box.
Computer: By "it", I assume you mean the block which is taller than the one I am holding.
Computer: Ok. (does it)
Person: What does the box contain?
Computer: The blue pyramid and the blue block.
SHRDLU - Internals
I Written in LISP.
I Equipped with some knowledge about the world:
((#IS :B1 #BLOCK))          ((#IS #RED #COLOR))
((#IS :B2 #PYRAMID))        ((#IS #GREEN #COLOR))
((#IS :B3 #BLOCK))          ((#IS #BLACK #COLOR))
((#COLOR :B1 #RED))         ((#CONTAIN :BOX :B4))
((#COLOR :B2 #GREEN))       ((#SHAPE :B1 #RECTANGULAR))
((#COLOR :TABLE #BLACK))
((#SHAPE :B3 #RECTANGULAR))
(DEFPROP TA-AT
  (THANTE (X Y) (#AT $?X $?Y)
    (THRPLACA (CDR (ATAB $?X)) $?Y))
  THEOREM)

(DEFPROP TA-EXISTS
  (THANTE (X) (#EXISTS $?X) (THSUCCEED))
  THEOREM)
I Can be downloaded from http://hci.stanford.edu/winograd/shrdlu
Prolog (Programming in Logic)
I Designed as a tool for man-machine communication in natural language.
I Philippe Roussel and Alain Colmerauer, 1972
I The first Prolog application:
Every psychiatrist is a person.
Every person he analyzes is sick.
Jacques is a psychiatrist in Marseille.
Is Jacques a person? Yes.
Where is Jacques? In Marseille.
Is Jacques sick? I don't know.
I Consisted of 610 clauses.
Applications Involving Prolog
Nowadays:
I Turing-complete programming language.
I Usually with additional (non-logical) features.
Some application areas:
I Expert and rule systems.
I Computational linguistics (e.g. representation of grammars).
I Planning in AI.
I Cognitive robotics.
I Semantic web.
Deterministic Finite Automata
A Moore machine consists of:
I Q – a set of states with an initial state q0 ∈ Q
I Σ – a set of input symbols
I ∆ – a set of output symbols
I δ – a state transition function δ : Q × Σ → Q
I λ – a state output function λ : Q → ∆
Example
Q = {q0, q1}    Σ = {a, b}    ∆ = {0, 1}
δ = {(q0, a) ↦ q0, (q0, b) ↦ q1, (q1, a) ↦ q1, (q1, b) ↦ q0}
λ = {q0 ↦ 1, q1 ↦ 0}
[Figure: the corresponding state diagram — q0 (output 1) and q1 (output 0); a loops on each state, b switches between them.]
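This example machine is a two-line dictionary in Python; a minimal sketch with names of our own choosing:

delta = {('q0', 'a'): 'q0', ('q0', 'b'): 'q1',   # state transition function
         ('q1', 'a'): 'q1', ('q1', 'b'): 'q0'}
lam = {'q0': 1, 'q1': 0}                         # state output function

def run(word, state='q0'):
    # Return the output emitted after each input symbol.
    out = []
    for symbol in word:
        state = delta[(state, symbol)]
        out.append(lam[state])
    return out

# run('abba') returns [1, 0, 1, 1].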
DFAs are everywhere
I Beverage vending machines.
I Elevators.
I Mobile phone menus.
I etc.
Properties of Symbolic Systems
+ Human readable and writable, i.e. background knowledge is directly integrable.
+ Declarative semantics is available.
+ Recursive structures can easily be represented and manipulated.
+ Successfully used in many application areas.
− Hard to learn and to adapt to new environments.
− If parts of the system break, the whole system fails.
− Reasoning can be very hard.
Why Neural-Symbolic Integration?
I Connectionist systems and symbolic knowledge representation are two major approaches in AI.
I Both have complementary advantages and disadvantages.
I We try to integrate both by keeping the advantages:
+ Human readable and writable.
+ Declarative semantics is available.
+ Recursive structures can easily be represented and manipulated.
+ Massively parallel paradigm.
+ Well suited to learn and to adapt to new environments.
+ Graceful degradation.
Major Problems in Neural-Symbolic Integration
I How can symbolic knowledge be represented within connectionist systems?
I How can symbolic knowledge be extracted from connectionist systems?
I How can symbolic knowledge be learned using connectionist systems?
I How can connectionist learning be guided by symbolic background knowledge?
Part II
The History of Neural-Symbolic Integration
Outline: A Joint Start, RAAM, SHRUTI, KBANN, Symmetric Networks, The Core Method
A Joint Start - McCulloch and Pitts
? Can the activities of a neural system be modelled by a logical calculus?
I W. S. McCulloch and W. Pitts, 1943: A logical calculus of the ideas immanent in nervous activity
I S. C. Kleene, 1956: Representation of events in nerve nets and finite automata
Logical Connectives
? Can we model logical connectives using simple units?
I Using binary threshold units and the activations 1 for "true" and 0 for "false", we obtain:
Disjunction x ∨ y: weights 1.0 and 1.0, threshold 0.5.
Conjunction x ∧ y: weights 1.0 and 1.0, threshold 1.5.
Negation ¬x: weight −1.0, threshold −0.5.
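These three units can be checked directly; a minimal Python sketch (the helper name threshold_unit is ours):

def threshold_unit(inputs, weights, theta):
    # Binary threshold unit: output 1 iff the weighted input sum reaches theta.
    p = sum(i * w for i, w in zip(inputs, weights))
    return 1 if p >= theta else 0

def OR(x, y):  return threshold_unit([x, y], [1.0, 1.0], 0.5)
def AND(x, y): return threshold_unit([x, y], [1.0, 1.0], 1.5)
def NOT(x):    return threshold_unit([x], [-1.0], -0.5)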
McCulloch-Pitts Networks
A McCulloch-Pitts network consists of ...
I a set I of input units,
I a set U of binary threshold units,
I a subset O ⊆ U of output units.
Example
I = {x, y}    U = {h, o}    O = {o}
[Figure: the network — inputs x, y; a hidden unit h with threshold 1.5; an output unit o with threshold 0.5; connection weights ±1.0.]
Moore Machines
I A Moore machine consists of:
• a set of states with an initial state
• a set of input symbols
• a set of output symbols
• a state transition function
• a state output function
I Using the Moore machine on the input word a b b a:
Input:   a    b    b    a
State:  q0 → q0 → q1 → q0 → q0
Output:      1    0    1    1
From Moore Machines to McCulloch-Pitts Networks
[Figure: the construction — input units for a and b; ∨-units representing the current state (q0, q1) and the output (0, 1); one ∧-unit per transition, feeding the next-state units q′0, q′1.]
From Moore Machines to McCulloch-Pitts Networks
[Figure: the constructed network processing the input word from before, step by step.]
From McCulloch-Pitts Networks to Moore Machines
I A sample network: the two-input network from above (inputs x, y; hidden unit h with threshold 1.5; output unit o with threshold 0.5; weights ±1.0).
I Read it as a Moore machine:
• the states Q are the possible activation patterns of the network,
• the input symbols Σ are the possible input patterns,
• the output symbols ∆ are the possible output patterns,
• the state transition function δ : Q × Σ → Q is given by the network dynamics,
• the state output function λ : Q → ∆ reads off the output units.
Conclusions
I McCulloch-Pitts networks are finite automata and vice versa.
I The paper ("A logical calculus of the ideas immanent in nervous activity") started the research on artificial neural networks and on finite automata.
I Similar constructions work for other types of automata.
Recursive Autoassociative Memory (RAAM)
I Designed to encode structured data, e.g. trees:
[Figure: a binary tree with leaves A, B, C, D.]
I Terminals are mapped to vectors:
A ↦ (1, 0, 0, 0)    B ↦ (0, 1, 0, 0)
C ↦ (0, 0, 1, 0)    D ↦ (0, 0, 0, 1)
I Nonterminals are learned.
[Figure: the RAAM network encodes (A, B) as [AB] and (C, D) as [CD], then ([AB], [CD]) as [ABCD]; decoding reverses the process.]
RAAM - Usage
Due to the auto-associative structure, a single network can be used for encoding and decoding.
I Encoding: [Figure: the vectors for A and B are presented at the input layer; the hidden layer yields the compressed representation [AB].]
I Decoding: [Figure: [AB] is presented to the hidden layer; the output layer reconstructs the vectors for A and B.]
RAAM - Conclusions
I For details, see (Pollack, 199x).
+ Efficiently implementable.
+ Use of powerful gradient-based learning techniques.
+ System degrades gracefully.
− Difficulties distinguishing terminals and non-terminals for terms with depth ≥ 5.
− Capacity limit ≈ depth 5.
− System needs an external controller.
+ Demonstration that structured data can be represented within a connectionist system.
SHRUTI - A System for Reflexive Reasoning
I Humans can handle certain problems very easily and fast.
I Humans' approximate knowledge base: 10^8 rules and facts, i.e. we perform reflexive reasoning in sublinear time.
I The SHRUTI system (Shastri & Ajjanagadde, 1993) is a connectionist architecture for this type of reasoning.
I Variable binding by synchronization of neurons.
SHRUTI - A Sample Knowledge Base
I Rules:
owns(Y, Z) ← gives(X, Y, Z)
owns(X, Y) ← buys(X, Y)
can-sell(X, Y) ← owns(X, Y)
I Facts:
gives(john, josephine, book)
(∃X) buys(john, X)
owns(josephine, ball)
I Question:
can-sell(josephine, book)? yes
(∃X) owns(josephine, X)? yes (X ↦ book, X ↦ ball)
SHRUTI - A Sample Network
[Figure: the SHRUTI network for this knowledge base — predicate assemblies for Can-sell, Owns, Gives and Buys, connected to the entities book, john, ball and josephine; argument links such as "from john" and "from josephine" realise the variable bindings.]
SHRUTI - A Sample Network Run
SHRUTI - Conclusions
I Answers are derived in time proportional to the depth of the search space (reflexive reasoning).
I Network size is linear in the size of the knowledge base.
I A rule can be used only a fixed number of times.
I Biologically plausible.
SHRUTI - Extensions
I Support of negation and inconsistency (Shastri & Wendelken, 1999).
I Simple learning using Hebbian learning (Wendelken & Shastri, 2003).
I Multiple instantiation of a single rule (Wendelken & Shastri, 2004).
KBANN - Knowledge-Based Artificial Neural Networks
? Can simple "if-then" rules be represented and learned using a connectionist architecture?
I Geoffrey G. Towell and Jude W. Shavlik, 1994
KBANN - The Construction
Original rules:
A ← B ∧ C ∧ ¬D.
A ← D ∧ ¬E.
H ← F ∧ G.
K ← A ∧ ¬H.
Rewritten (at most one clause per head, via new units A′ and A″):
A ← A′ ∨ A″.
A′ ← B ∧ C ∧ ¬D.
A″ ← D ∧ ¬E.
H ← F ∧ G.
K ← A ∧ ¬H.
[Figure: the KBANN network — input units B, C, D, E, F, G; conjunctive units A′, A″, H, K and a disjunctive unit A; the rules are encoded by connections with weights w (positive antecedents) and −w (negated antecedents).]
KBANN - Training
Training
I Add hidden units.
I Fully connect the layers.
I Add small random numbers to the weights and thresholds.
I Apply backpropagation.
[Figure: the KBANN network from the previous slide, prepared for training as described above.]
KBANN - A Problem
+ Works well if rules have only few conditions and there are only few rules with the same consequence, but:
I Towell and Shavlik used sigmoidal output functions:
o = 1 / (1 + e^{−(p−θ)})
I The threshold of conjunctive units is computed as
θ = (P − 0.5) · w
where P is the number of positive antecedents.
I The threshold of disjunctive units is always set to θ = w/2.
KBANN - A Problem ctd.
I The rules:
C ← A_1 ∨ . . . ∨ A_n.           θ_C = w/2
A_i ← A_i1 ∧ . . . ∧ A_iP.       θ_{A_i} = (P − 0.5) · w
I Let all but one A_ij be true for each clause, i.e. p_{A_i} = (P − 1)w. Then:
o_{A_i} = 1 / (1 + e^{−(p−θ)}) = 1 / (1 + e^{−((P−1)w − (P−0.5)w)}) = 1 / (1 + e^{w/2})
p_C = n · w / (1 + e^{w/2})
− For any value of w we can compute an n such that o_C exceeds any threshold to be considered active.
+ Can be solved using bipolar output functions.
KBANN - Conclusions
I Mapping of hierarchical domain knowledge into a connectionist system.
I Refinement using standard backpropagation.
I Successfully applied to a number of problems (e.g. DNA sequence analysis).
I Outperforms purely empirical and purely hand-built classifiers.
Symmetric Networks
A symmetric network consists of ...
I a set U of binary threshold units,
I a set W of symmetric connections W ⊆ U × U, i.e. w_ij = w_ji.
The units are updated asynchronously until a stable state is reached.
Symmetric Networks - A Simple Example
[Figure: a simple symmetric network (thresholds 0, 0, 5, 0; connection weights 2, −1, 2, 2), shown in two states of its asynchronous update.]
Symmetric Networks and Logic Formulae
I It is possible to associate an energy function E(t) describing the state of the network at time t.
I The energy is monotonically decreasing, i.e. E(t) ≥ E(t + 1).
? Is there a link between propositional logic formulae and symmetric networks (Pinkas, 1991)?
I For each propositional logic formula we can define a function τ which is "compatible" with the energy function.
I We can construct a symmetric network such that the activations of the network at the energy minima coincide with the models of the formula.
Symmetric Networks - An Example
Example
F = (¬o ∨ m) ∧ (¬s ∨ ¬m) ∧ (¬c ∨ m) ∧ (¬c ∨ s) ∧ (¬v ∨ ¬m)
τ(F) = vm − cm − cs + sm − om + 2c + o
[Figure: the corresponding symmetric network over the units m, s, o, v, c, with thresholds 0, 1, 2 and connection weights ±1.]
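For this small example the claimed correspondence can be verified by brute force; a Python sketch (the check itself is ours, assuming that the models of F are exactly the global minima of τ):

from itertools import product

def F(o, m, s, c, v):    # the formula, clause by clause
    return ((not o or m) and (not s or not m) and (not c or m)
            and (not c or s) and (not v or not m))

def tau(o, m, s, c, v):  # the associated energy function
    return v*m - c*m - c*s + s*m - o*m + 2*c + o

assignments = list(product([0, 1], repeat=5))
e_min = min(tau(*a) for a in assignments)
# Every global minimum of tau is a model of F, and vice versa:
assert all((tau(*a) == e_min) == F(*a) for a in assignments)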
Symmetric Networks - Conclusions
I Strong link between propositional logic formulae and symmetric networks.
I Further extensions to non-monotonic logics and inconsistency.
I Add penalties to clauses, which define a preference.
I The network settles down to the most preferable interpretation.
The Core Method
I Relate logic programs and connectionist systems.
I Embed interpretations into (vectors of) real numbers.
I Hence, obtain an embedded version of the TP-operator.
I Construct a network computing one application of fP.
I Add recurrent connections from the output to the input layer.
[Diagram: TP : I_L → I_L and fP : R^m → R^m commute via the embedding ι and its inverse ι⁻¹; the network maps ~x to fP(~x).]
Major Problems in Neural-Symbolic Integration
I How can symbolic knowledge be represented within connectionist systems? (What is ι?)
I How can symbolic knowledge be extracted from connectionist systems? (What is ι⁻¹?)
I How can symbolic knowledge be learned using connectionist systems?
I How can connectionist learning be guided by symbolic background knowledge?
Part III
The Core Method for Propositional Logic
Outline: Propositional Logic Programs, The Core Method for Propositional Logic, CILLP and some Derivatives, Conclusions
Propositional Logic Programs – An Example
A ← ¬B.        % A is true if B is false.
B ← A ∧ ¬B.    % B is true if A is true and B is false.
B ← B.         % B is true if B is true.
Propositional Logic Programs – The Syntax
Definition (Propositional Variables & Connectives)
A, B, C, D, . . .    ∧ = "and"    ← = "if-then"    ¬ = "not"

Definition (Clause)
H ← L_1 ∧ L_2 ∧ . . . ∧ L_n.    (H is the head; the body literals L_i are either X or ¬X)

Definition (Propositional Logic Program)
A propositional logic program is a finite set of clauses.
Propositional Logic Programs – The Semantics
Definition (Herbrand Base B_L)
The Herbrand base is the set of all variables occurring in P.

Example (B_L for the running example)
B_L = {A, B}

Definition (Interpretation)
An interpretation is a subset of the Herbrand base.

Example (Interpretations for the running example)
I_1 = ∅    I_2 = {A}    I_3 = {B}    I_4 = {A, B}
Propositional Logic Programs – The Semantics Ctd.
Example (For I_2 = {A})
(A)^{I_2} = true          (¬A)^{I_2} = false
(B)^{I_2} = false         (¬B)^{I_2} = true
(A ← ¬B)^{I_2} = true     (B ← B)^{I_2} = true
(A ∧ ¬B)^{I_2} = true     (B ← A ∧ ¬B)^{I_2} = false
Propositional Logic Programs – The Semantics Ctd.
Definition (Model)
An interpretation M satisfying every clause of a program P is called a model of P (in symbols M |= P).
Example (Models of the running example)
A← ¬B.
B ← A ∧ ¬B.
B ← B.
∅ ⊭ P
{A} ⊭ P
{B} |= P
{A, B} |= P
The Immediate Consequence Operator TP
Definition (TP)
TP(I) = {A | there is a clause A ← body in P and I |= body}
I The TP-operator propagates truth along the clauses.
Example (TP for our running example)
A← ¬B.
B ← A ∧ ¬B.
B ← B.
{} ↦ {A}
{A} ↦ {A, B}
{B} ↦ {B}
{A, B} ↦ {B}
I For definite programs, TP converges to the least model.
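The operator is easy to implement; a Python sketch for the running example (the clause encoding as (head, positive body, negated body) is our own):

P = [('A', [], ['B']),      # A <- not B.
     ('B', ['A'], ['B']),   # B <- A and not B.
     ('B', ['B'], [])]      # B <- B.

def T_P(I):
    return {head for head, pos, neg in P
            if all(a in I for a in pos) and all(a not in I for a in neg)}

# T_P(set()) == {'A'};  T_P({'A'}) == {'A', 'B'};  T_P({'B'}) == {'B'}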
Constructing the Core-Network
1. For each element of B_L, add an input unit and an output unit with threshold 0.5.
2. For each clause H ← L_1 ∧ . . . ∧ L_n do the following:
2.1 Add a hidden unit c and a connection from c to H′ (w = 1.0).
2.2 Connect every L_i and c with w = +1.0 if L_i is positive, and w = −1.0 if L_i is negated.
2.3 Set the threshold of c to "number of positive L_i" − 0.5.
Example
A← ¬B.
B ← A ∧ ¬B.
B ← B.
[Figure: the resulting core network — input units A, B; one ∧-unit per clause with the weights ±1.0 from the construction; ∨-output units A′, B′.]
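Carrying out the construction for the running example yields the following weights; the NumPy sketch (array layout ours) computes one application of TP with threshold units:

import numpy as np

# Input/output order [A, B]; one hidden unit per clause.
W_in = np.array([[0.0, 1.0, 0.0],      # weights from input A to the clause units
                 [-1.0, -1.0, 1.0]])   # weights from input B
theta_h = np.array([-0.5, 0.5, 0.5])   # "number of positive literals" - 0.5
W_out = np.array([[1.0, 0.0],          # clause 1 feeds A'
                  [0.0, 1.0],          # clause 2 feeds B'
                  [0.0, 1.0]])         # clause 3 feeds B'
theta_out = np.array([0.5, 0.5])

def tp_step(I):                        # I is a 0/1 vector for [A, B]
    hidden = (I @ W_in >= theta_h).astype(float)
    return (hidden @ W_out >= theta_out).astype(float)

# tp_step(np.array([0.0, 0.0])) gives [1., 0.], i.e. {} maps to {A}.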
One Application of TP
A← ¬B.
B ← A ∧ ¬B.
B ← B.
[Figure: the core network with output thresholds 0.5 and hidden thresholds −0.5, 0.5, 0.5.]
One pass through the network computes TP:
{} ↦ {A}
{A} ↦ {A, B}
{B} ↦ {B}
{A, B} ↦ {B}
Repetitive Application of TP
A← ¬B.
B ← A ∧ ¬B.
B ← B.
[Figure: the same network with recurrent connections from the output back to the input layer.]
Iterating the network yields {} ↦ {A} ↦ {A, B} ↦ {B} ↦ {B} ↦ . . ., i.e. it settles on the model {B}.
Main Results (Hölldobler & Kalinke, 1994)
I 2-layer networks cannot compute TP.
I For each program P there exists a 3-layer kernel computing TP.
Space and Time Complexity
Let n be the number of clauses and m the number of propositional variables:
I 2m + n units and 2mn connections in the kernel.
I TP(I) is computed in 2 steps.
I The parallel model to compute TP is optimal.
I The recurrent network settles down in at most 3n steps.
Extraction Methods
I Single units do not necessarily correspond to single rules.
I In general, it is NP-complete to find the minimal logical description for a trained network (Golea, 1996).
I There is not always a single minimal program (Lehmann, Bader & Hitzler, 2005).
[Figure: two extraction strategies — decompositional extraction derives individual rules (rule1, . . . , rule4) from parts of the network; pedagogical extraction treats the network as a black box computing ~x ↦ fP(~x) and derives rules from its input-output behaviour.]
Extraction – A Pedagogical Approach
[Figure: a trained network with input units A, B, hidden units c1, c2, c3 and output units A′, B′, with learned weights and thresholds.]
Querying it on all inputs (each entry: potential / output):
A B | c1         | c2          | c3          | A′      | B′
0 0 | 0.0 / 0.0  | 0.0 / 1.0   | 0.0 / 0.0   | 0.0 / 1 | −1.0 / 0
0 1 | 1.5 / 1.0  | 0.3 / 1.0   | 0.8 / 1.0   | 1.8 / 1 | 0.7 / 1
1 0 | 1.0 / 1.0  | −2.0 / 0.0  | −0.5 / 0.0  | 2.0 / 1 | 0.7 / 1
1 1 | 2.5 / 1.0  | −1.7 / 0.0  | 0.3 / 0.0   | 2.0 / 1 | 0.7 / 1
Extraction – A Pedagogical Approach
A B | A′ B′
0 0 | 1  0
0 1 | 1  1
1 0 | 1  1
1 1 | 1  1
A← ¬A ∧ ¬B.
A← ¬A ∧ B.
A← A ∧ ¬B.
A← A ∧ B.
B ← ¬A ∧ B.
B ← A ∧ ¬B.
B ← A ∧ B.
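In code, the pedagogical approach is a plain exhaustive enumeration; a Python sketch, where net is an assumed black-box function from 0/1 input tuples to 0/1 output tuples:

from itertools import product

def pedagogical_extract(net, atoms=('A', 'B')):
    # Query the network on all 2^n inputs; emit one clause per active output.
    rules = []
    for bits in product([0, 1], repeat=len(atoms)):
        body = ' ∧ '.join(a if b else '¬' + a for a, b in zip(atoms, bits))
        for atom, active in zip(atoms, net(bits)):
            if active:
                rules.append(atom + ' ← ' + body + '.')
    return rules

The loop over all 2^n inputs is exactly the exponential blow-up criticised below.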
Extraction – A Pedagogical Approach
A← ¬A ∧ ¬B.
A← ¬A ∧ B.
A← A ∧ ¬B.
A← A ∧ B.
B ← ¬A ∧ B.
B ← A ∧ ¬B.
B ← A ∧ B.
A.
B ← ¬A ∧ B.
B ← A.
Extraction – A Pedagogical Approach
+ Sound, i.e. every extracted rule is a rule implemented by the network.
+ Complete, i.e. every rule implemented by the network will be extracted.
− Bad time complexity, due to the exponential blow-up.
− Does not create the smallest program automatically.
Extraction – A Decompositional Approach
We can do much better (Mayer-Eichberger, 2006):
I Decompositional approach.
+ Implementable (the implementation is under way).
+ Sound.
+ Complete.
+ Creates very small programs automatically.
Main Results (Hölldobler & Kalinke, 1994)
I 2-layer networks cannot compute TP.
I For each program P there exists a 3-layer kernel computing TP.
I For each 3-layer kernel K there exists a program P such that K computes TP.
I Let n be the number of clauses and m the number of propositional variables:
• 2m + n units, 2mn connections in the kernel.
• TP(I) is computed in 2 steps.
• The parallel model to compute TP is optimal.
• The recurrent network settles down in at most 3n steps.
The CILLP-System
? Can the learning capabilities of KBANN be combined with the Core Method (Garcez & Zaverucha, 1999)?
I Using sigmoidal functions, we obtain a standard 3-layer feed-forward neural network.
I This network is trainable using backpropagation.
CILLP - The Construction
I Define ranges for "true" and "false":
[Figure: the unit's output range — values above a count as "true", values below −a as "false", values in between are undefined.]
I Compute a, the weights and the thresholds such that the sigmoidal kernel computes TP (Garcez & Zaverucha, 1999).
CILLP - Extracting a Learned Program
I The pedagogical approach would work, but ...
I the decompositional approach mentioned above does not work for sigmoidal units.
I Garcez, Broda & Gabbay (2001) proposed a suitable method, which ...
+ is sound.
+ is computationally feasible, due to a clever restriction of the search space.
− is not necessarily complete.
− does not necessarily create the smallest programs.
CILLP - The MONK’s Problems
I Robots are described by 6 properties, e.g. head-shape ∈ {round, square, octagon}, ...
I Classification task: "Recognise robots with (body-shape = head-shape) or (jacket-color = red)".
I Network architecture:
• 17 input units: one for each attribute.
• 3 hidden layer units.
• 1 output unit: indicating answer "yes" or "no".
I 100% performance of the network and extracted rules.
I Pruning: of 131072 possible inputs for some hidden unit, only 18724 were queried.
CILLP - Conclusions
I Successfully used for ...
• classification tasks like the MONK's problems.
• DNA sequence analysis (promoter recognition, splice junction determination).
• power system fault diagnosis.
I Extensions of the CILLP-System:
• Metalevel priorities between rules (Garcez, Broda & Gabbay, 2000).
• Intuitionistic logic (Garcez, Lamb & Gabbay, 2003).
• Modal logic (Garcez, Lamb, Broda & Gabbay, 2004).
The Core Method
I Relate logic programs and connectionist systems.
I Embed interpretations into (vectors of) real numbers.
I Hence, obtain an embedded version of the TP-operator.
I Construct a network computing one application of fP.
I Add recurrent connections from the output to the input layer.
[Diagram: TP : I_L → I_L and fP : R^m → R^m commute via the embedding ι and its inverse ι⁻¹; the network maps ~x to fP(~x).]
Major Problems in Neural-Symbolic Integration
I How can symbolic knowledge be represented within connectionist systems? (What is ι?)
I How can symbolic knowledge be extracted from connectionist systems? (What is ι⁻¹?)
I How can symbolic knowledge be learned using connectionist systems?
I How can connectionist learning be guided by symbolic background knowledge?
Conclusions
We have a complete system implementing the NeSy cycle for propositional logic programs.
[Figure: the neural-symbolic cycle — embedding from the symbolic (readable/writable) system into the connectionist (trainable) system, and extraction back.]
Main Results
I 3-layer feedforward networks can compute TP.
I Using sigmoidal units, the network is trainable using backpropagation.
I Extraction is sound (and complete).
I Successfully applied to real-world problems.
Part IV
The Core Method for First-Order Logic
Outline: FOL Programs, Bridging the Gap, FineBlend, Further Topics, Extraction, Conclusions
First Order Logic Programs – Two Examples
nat(0).                    % 0 is a natural number.
nat(succ(X)) ← nat(X).     % The successor succ(X) is a natural number if X is a natural number.

even(0).                   % 0 is an even number.
even(succ(X)) ← odd(X).    % The successor of an odd X is even.
odd(X) ← ¬even(X).         % If X is not even then it is odd.
First Order Logic Programs – The Syntax
Functions, Variables and Terms
F = {0/0, succ/1}    V = {X}
T = {0, succ(0), succ(X), succ(succ(0)), . . .}

Predicate Symbols and Atoms
P = {even/1, odd/1}
A = {even(succ(X)), odd(succ(0)), odd(0), odd(X), . . .}

Connectives, Clauses and Programs: as in propositional logic.
(Abbreviated running example: e(0).  e(s(X)) ← o(X).  o(X) ← ¬e(X).)
First Order Logic Programs – The Semantics
Herbrand Base B_L = set of ground atoms:
B_L = {even(0), even(succ(0)), . . . , odd(0), odd(succ(0)), . . .}

Interpretations = subsets of the Herbrand base:
I_1 = {even(succ^{2n}(0)) | n ≥ 1}     I_2 = {}
I_3 = {odd(succ^{2n+1}(0)) | n ≥ 0}    I_4 = I_2 ∪ I_3
TP for our running examples
Definition (TP)
TP(I) = {A | there is A ← body in ground(P) and I |= body}
Example (Natural numbers)
n(0).               {} ↦ {n(0)}
n(s(X)) ← n(X).     {n(0)} ↦ {n(0), n(s(0))}
                    {n(0), n(s(0))} ↦ {n(0), n(s(0)), n(s(s(0)))}
                    {n(X) | X ∈ T} ↦ {n(X) | X ∈ T}

Example (Even and odd numbers)
e(0).               {} ↦ {e(0), o(X) | X ∈ T}
e(s(X)) ← o(X).     {o(X) | X ∈ T} ↦ {e(0), e(s(X)), o(X) | X ∈ T}
o(X) ← ¬e(X).       {e(s^{2n}(0)) | n ≥ 0} ↦ {e(0), o(s^{2n+1}(0)) | n ≥ 0}
                    {o(s^{2n+1}(0)) | n ≥ 0} ↦ {e(s^{2n}(0)), o(X) | n ≥ 0, X ∈ T}
                    B_L ↦ {e(0), e(s(X)) | X ∈ T}
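For intuition, TP for the even/odd program can be simulated on ground atoms up to a fixed term depth; a Python sketch (the depth bound N and the encoding ('e', n) for e(s^n(0)), ('o', n) for o(s^n(0)) are our own):

N = 5  # consider ground terms up to s^N(0) only

def T_P(I):
    J = {('e', 0)}                                              # e(0).
    J |= {('e', n + 1) for p, n in I if p == 'o' and n < N}     # e(s(X)) <- o(X).
    J |= {('o', n) for n in range(N + 1) if ('e', n) not in I}  # o(X) <- not e(X).
    return J

# T_P(set()) contains e(0) and every o(t), as in the first line of the table.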
Problems
I B_L is usually infinite, and therefore the propositional approach does not work.
I How can we bridge the gap?
• How can first-order terms be represented?
• How can first-order rules be represented?
• How can the variable binding be solved?
Level Mappings
I A level mapping | · | assigns a (unique) natural number to each ground atom ...

Example (Even and odd numbers)
|e(s^n(0))| = 2n + 1    |o(s^n(0))| = 2n + 2

I ... hence, it enumerates the Herbrand base:

Example (Even and odd numbers)
[e(0), o(0), e(s(0)), o(s(0)), e(s(s(0))), . . .] with levels [1, 2, 3, 4, 5, . . .]
Embedding First-Order Terms into the Real Numbers
Using an injective level mapping, we can assign a unique real number to each interpretation:

ι(I) = ∑_{A∈I} 4^{−|A|}

This coincides with a "binary" representation in base 4:

B_L = [e(0), o(0), e(s(0)), o(s(0)), e(s(s(0))), . . .]
ι({e(0)}) = 0.10000_4 = 0.25_10
ι({e(0), e(s(0)), e(s(s(0)))}) = 0.10101_4 ≈ 0.27_10
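A Python sketch of this embedding, using the even/odd level mapping and the atom encoding from above (('e', n) for e(s^n(0)), ('o', n) for o(s^n(0))):

def level(atom):                 # |e(s^n(0))| = 2n+1, |o(s^n(0))| = 2n+2
    pred, n = atom
    return 2 * n + 1 if pred == 'e' else 2 * n + 2

def iota(I):
    return sum(4.0 ** -level(a) for a in I)

# iota({('e', 0)}) == 0.25
# iota({('e', 0), ('e', 1), ('e', 2)}) == 0.2666..., i.e. about 0.27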
The Graph of the Natural Numbers
[Plot: the graph of ι(TP(I)) over ι(I) for the natural-numbers program.]
n(0).
n(s(X)) ← n(X).
|n(s^n(0))| = n + 1
{} (ι = 0.0) ↦ {n(0)} (ι = 0.25)
{n(0)} (ι = 0.25) ↦ {n(0), n(s(0))} (ι = 0.3125)
The Graph of the Even and Odd Numbers
[Plot: the graph of ι(TP(I)) over ι(I) for the even/odd program e(0).  e(s(X)) ← o(X).  o(X) ← ¬e(X).]
Some Results
Theorem (Hölldobler, Kalinke & Störr, 1999)
The TP-operator associated with an acyclic (wrt. an injective level mapping) first-order logic program can be approximated arbitrarily well using standard sigmoidal networks.

Some conclusions and limitations:
+ The Core Method can be applied to first-order logic.
+ First treatment of first-order logic with function symbols in a connectionist setting.
− No algorithm to construct the network.
− Very limited class of logic programs.
Approximating the Embedded TP-Operator
[Plot: a piecewise constant approximation of the embedded TP-operator for the even/odd program with accuracy ε = 0.05.]
Constructions using sigmoidal and RBF units are given in (Bader, Hitzler & Witzel, 2005).
A Problem ...
I The accuracy of this approach is very limited.
I E.g., on a 32-bit computer, only 16 atoms can be represented.
I Therefore, we need to use real vectors instead of a single real number to represent interpretations.
Multi-dimensional Level Mappings
I A multi-dimensional level mapping ‖ · ‖ assigns to each ground atom a level l ∈ N⁺ and a dimension d ∈ {1, . . . , m}:

Example (Even and odd numbers)
‖e(s^n(0))‖ = (n + 1, 1)    ‖o(s^n(0))‖ = (n + 1, 2)

I ... and still "enumerates" the Herbrand base:

Example (Even and odd numbers)
level:  1      2        3            4
dim 1:  e(0)   e(s(0))  e(s(s(0)))   e(s(s(s(0))))
dim 2:  o(0)   o(s(0))  o(s(s(0)))   o(s(s(s(0))))
Embedding First-Order Terms into the Real Numbers
Using an injective m-dimensional level mapping, we can assign a unique m-dimensional vector to each interpretation:

~ι(I) = ∑_{A∈I} ~ι(A)

~ι(A) = (ι_1(A), . . . , ι_m(A)) with ι_i(A) = 4^{−l} if ‖A‖ = (l, d) and i = d, and ι_i(A) = 0 otherwise.
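The vector-valued embedding in the same style; a Python sketch for the 2-dimensional even/odd level mapping:

def vec_iota(I, m=2):
    # Embed an interpretation as an m-dimensional vector.
    v = [0.0] * m
    for pred, n in I:            # ||e(s^n(0))|| = (n+1, 1), ||o(s^n(0))|| = (n+1, 2)
        l, d = n + 1, (1 if pred == 'e' else 2)
        v[d - 1] += 4.0 ** -l
    return v

# vec_iota({('e', 0), ('o', 0)}) == [0.25, 0.25]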
C_m – The Set of all Embedded Interpretations
C_m for the 2-dimensional case: C_m = {~ι(I) | I ∈ I_L}
[Plot: C_m as a set of points in [0, 0.3]², e.g.:]
{} ↦ (0, 0)
{e(0)} ↦ (0.25, 0)
{o(0)} ↦ (0, 0.25)
{e(0), o(0)} ↦ (0.25, 0.25)
C_m – The Set of all Embedded Interpretations
Another construction:
[Figure: C_m obtained as the limit of an iterative construction on the unit square, in the style of a Cantor set.]
Approximating the Embedded TP-Operator
[Plots: the embedded TP-operator over C_m (dimensions d1, d2) and its approximation.]
Implementation
A first prototype implemented by Andreas Witzel (Witzel, 2006):
I Merging of the techniques described above and Supervised Growing Neural Gas (SGNG) (Fritzke, 1998).
I Radial basis function network approximating TP.
I Very robust with respect to noise and damage.
I Trainable using a version of backpropagation together with techniques from SGNG.
Approximating the Embedded TP-Operator
[Plot: the prototype's approximation of the embedded TP-operator over dimensions d1, d2.]
Statistics - FineBlend vs SGNG
[Chart: approximation error and number of units against the number of training examples, for FineBlend 1 and SGNG.]
Statistics - Unit Failure
[Chart: approximation error and number of units against the number of training examples for FineBlend 1 under unit failure.]
Statistics - Iteration of Random Inputs
[Plot: trajectories of iterated network applications on random inputs, over dimension 1 (even) and dimension 2 (odd).]
Conclusions
+ Prototypical implementation.
+ Very robust with respect to noise and damage.
+ Trainable using more or less standard algorithms.
+ System outperforms other architectures (at least for the tested examples).
− System requires many parameters.
− There is no first-order extraction technique yet.
First-order by propositional approximation
Let P be definite and I be its least Herbrand model (Seda & Lane, 2004):
I Choose some error ε.
I There exists a finite ground subprogram Pn (with least model In) such that d(I, In) < ε.
I Use the propositional approach to encode Pn.
I Increasing n yields better approximations of TP (if TP is continuous wrt. d).
I The approach works for other (many-valued) logics similarly.
Comparison of the approaches
I Seda & Lane:
• For definite programs under a continuity constraint.
• Treatment of acyclic programs should be ok.
• Better approximation increases all layers of the network.
• Step functions only.
• Sigmoidal approach (learning) to be investigated.
I Bader, Hitzler & Witzel:
• For acyclic normal programs.
• Treatment of definite (continuous) programs should be ok.
• Better approximation increases only the hidden layer.
• Variety of activation functions.
• Standard learning possible.
Iterated Function Systems
I The Sierpinski triangle:
[Figure: repeatedly applying three contractions of the unit square converges to the Sierpinski triangle.]
From Logic Programs to Iterated Function Systems
I For some logic programs we can explicitly construct an IFS such that the attractor coincides with the graph of the embedded TP-operator.
I Let P be a program such that fP is Lipschitz-continuous. Then there exists an IFS such that the attractor is the graph of fP.
I For a finite set of points taken from a TP-operator, we can construct an interpolating IFS.
I The sequence of attractors of interpolating IFSs for acyclic programs converges to the graph of the program.
I IFSs can be encoded using RBF networks.
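As a concrete IFS, here is the standard three-map system for the Sierpinski triangle (the textbook example, not the authors' construction for TP); its attractor can be approximated with the chaos game:

import random

# Three contractions of the unit square; their attractor is the Sierpinski triangle.
MAPS = [lambda x, y: (0.5 * x, 0.5 * y),
        lambda x, y: (0.5 * x + 0.5, 0.5 * y),
        lambda x, y: (0.5 * x + 0.25, 0.5 * y + 0.5)]

def chaos_game(n=10000):
    points, (x, y) = [], (0.0, 0.0)
    for _ in range(n):
        x, y = random.choice(MAPS)(x, y)   # apply a randomly chosen map
        points.append((x, y))
    return points                          # the points accumulate on the attractor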
Extraction of First-Order Logic Programs
I Very little work has been done on this.
I A general idea:
• Use any initialization method as a base.
• Neural networks are points in R^n, where n is the number of weights.
• Define conditions on programs which may be extracted (e.g. a maximum number of atoms or of term nesting depth).
• ↝ discrete points in R^n via the initialization method.
• The program which lies closest to the network in R^n is the extracted program.
? Could this work?
Conclusions
I 3-layer feedforward networks can approximate TP for certain programs.
I Using sigmoidal units, the network is trainable using backpropagation.
Open Problems
? How can first-order descriptions be extracted from a connectionist system?
? Can a first-order neural-symbolic system be applied to real-world problems, outperforming conventional approaches?
? How does the Core Method relate to reasoning approaches from Cognitive Science?
? ... (many more) ...
www.neural-symbolic.org