Data Analysis in the Hebrew Bible

DESCRIPTION

Joint work with Martijn Naaijer (VU University). With the Hebrew Bible encoded in the Linguistic Annotation Framework (LAF, an ISO standard), and with a new LAF processing tool, we demonstrate how to do practical data analysis. The tool, LAF-Fabric, integrates with the IPython notebook approach. Our example here is lexeme cooccurrence analysis of Bible books. For now, the road from data to visualization is more important than the exact visualization.

DATA ANALYSIS IN THE HEBREW BIBLE

CLIN 2014-01-17

Dirk Roorda (DANS/TLA), Martijn Naaijer and Gino Kalkman (VU ETCBC)

RESEARCH @

just started

EXEGESIS

preaching the word of God

the devil is in the details

meanings of specific words

DISTANT READING

scan large quantities of text

find patterns

signals in the noise

study other aspects than meaning

text transmission

linguistic variation

literary form

VARIATION IN BIBLICAL HEBREW

Timespan of Hebrew Bible writing: ~1000 years

Assumption: we can divide the books into two groups

EBH (early biblical Hebrew)

LBH (late biblical Hebrew)

"PROOF"

Select some features that differ for EBH and LBH

Risk of circularity

We need data analysis that is

comprehensive (not eclectic)

critical (not everything is a signal)

SYNTACTIC VARIATION

syntactic features

phrase, clause, text

large units

chapters

books

drivers of change

diachrony

geography

demography

variation

THE HEBREW BIBLE AS DATA

THE HEBREW BIBLE IN LAF

LAF ISO 24612:2012

SHEBANQ (github)

2.27 GB

1.5 M nodes

1.5 M edges

40 M features

400 K words

13 M XML ids

PROCESSING LAF

it is XML

but not document-like (not like TEI)

and not database-like (not nice for XQuery)

it is graph-like

PROCESSING LAF

eXist (>30 min loading time, simple queries >60 min)

indexes needed: but which ones?

tried POIO (>60 min loading time, needs >20 GB RAM)

straightforward object oriented in Python

scripting language overhead

LAF-FABRIC

LAF-Fabric

loads in a few seconds

executes in a few seconds

on a laptop

can run

in a Terminal

as an IPython notebook

also Python

uses C-like arrays
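
To illustrate why C-like arrays help with a graph of this size, here is a minimal sketch using Python's standard `array` module. It is not LAF-Fabric's actual data layout, which the slides do not describe; the toy graph and the function names `build_csr` and `neighbours` are hypothetical. The outgoing edges of all nodes are packed into two flat arrays of unsigned integers instead of one Python object per node.

```python
from array import array

# Hypothetical toy graph: (source node, target node) pairs.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
n_nodes = 4


def build_csr(n_nodes, edges):
    """Pack edges into two flat arrays (compressed sparse row layout).

    The targets of node i are targets[offsets[i]:offsets[i + 1]].
    """
    counts = [0] * n_nodes
    for src, _ in edges:
        counts[src] += 1

    # offsets[i] is the position in `targets` where node i's edges start.
    offsets = array("I", [0] * (n_nodes + 1))
    for i in range(n_nodes):
        offsets[i + 1] = offsets[i] + counts[i]

    # Fill the targets, keeping a write cursor per source node.
    targets = array("I", [0] * len(edges))
    cursor = list(offsets[:n_nodes])
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbours(offsets, targets, node):
    """Return the target nodes of all edges leaving `node`."""
    return list(targets[offsets[node]:offsets[node + 1]])


offsets, targets = build_csr(n_nodes, edges)
print(neighbours(offsets, targets, 0))
```

Each entry costs a few bytes rather than a full Python object, which is one plausible reason a laptop can hold millions of nodes and edges in memory.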

COOCCURRENCES

1 Common Nouns

2 Proper Nouns

Nodes are books

Edges are cooccurrences of lexemes (1 or 2)
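
The graph described here can be sketched in a few lines of plain Python. This is a minimal illustration on hypothetical toy data, not the code used for the talk: the lexeme sets and the function name `cooccurrence_edges` are assumptions, and in practice the sets would be filled from the annotated text, once for common nouns and once for proper nouns.

```python
from itertools import combinations

# Hypothetical toy data: book name -> set of lexemes of one type.
book_lexemes = {
    "Genesis": {"earth", "water", "king"},
    "Exodus": {"water", "king", "law"},
    "Ruth": {"field", "bread"},
}


def cooccurrence_edges(book_lexemes):
    """Return {(book1, book2): shared lexemes} for every pair that shares any."""
    edges = {}
    for b1, b2 in combinations(sorted(book_lexemes), 2):
        shared = book_lexemes[b1] & book_lexemes[b2]
        if shared:
            # An edge exists only when at least one lexeme occurs in both books.
            edges[(b1, b2)] = shared
    return edges


edges = cooccurrence_edges(book_lexemes)
for (b1, b2), shared in edges.items():
    print(b1, b2, sorted(shared))
```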

WEIGHTED EDGES

S(lex): number of books containing lex

C(b1, b2): intersection of lexemes of b1 and b2

L(b1, b2): union of lexemes of b1 and b2
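
The three quantities above translate directly into set operations. The slides do not state how they are combined into an edge weight, so the `edge_weight` function below is purely an assumption for illustration: each shared lexeme counts for 1/S(lex), so lexemes found in many books contribute less, and the sum is normalised by the size of L(b1, b2). The toy data and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: book name -> set of lexemes.
book_lexemes = {
    "Genesis": {"earth", "water", "king"},
    "Exodus": {"water", "king", "law"},
    "Kings": {"king", "temple"},
}


def support(book_lexemes):
    """S(lex): number of books containing lex."""
    return Counter(lex for lexemes in book_lexemes.values() for lex in lexemes)


def common(book_lexemes, b1, b2):
    """C(b1, b2): intersection of the lexemes of b1 and b2."""
    return book_lexemes[b1] & book_lexemes[b2]


def union(book_lexemes, b1, b2):
    """L(b1, b2): union of the lexemes of b1 and b2."""
    return book_lexemes[b1] | book_lexemes[b2]


def edge_weight(book_lexemes, b1, b2):
    """ASSUMED weight, not taken from the talk: sum of 1/S(lex) over C, divided by |L|."""
    s = support(book_lexemes)
    shared = common(book_lexemes, b1, b2)
    total = union(book_lexemes, b1, b2)
    if not total:
        return 0.0
    return sum(1 / s[lex] for lex in shared) / len(total)


print(edge_weight(book_lexemes, "Genesis", "Exodus"))
print(edge_weight(book_lexemes, "Genesis", "Kings"))
```

Under this assumed weighting, a pair of books that shares only very widespread lexemes gets a weaker edge than a pair that shares rarer ones.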

Common Nouns

no weight

Common Nouns

with weight

Proper Nouns

no weight

Proper Nouns

with weight

DATA-DRIVEN THEOLOGY

m.naaijer@vu.nl

g.j.kalkman@vu.nl

dirk.roorda@dans.knaw.nl

Thank You