
Open Research Online
The Open University's repository of research publications and other research outputs

Positioning for conceptual development using latent semantic analysis
Conference or Workshop Item

How to cite:
Wild, Fridolin; Hoisl, Bernhard and Burek, Gaston (2009). Positioning for conceptual development using latent semantic analysis. In: The 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), Workshop on GEMS: GEometrical Models of Natural Language Semantics, 31 Mar 2009, Athens, Greece.

For guidance on citations see FAQs.

© 2009 Association for Computational Linguistics

Version: Version of Record

Link(s) to article on publisher's website:
http://portal.acm.org/citation.cfm?id=1705421&preflayout=flat

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online's data policy on reuse of materials please consult the policies page.

oro.open.ac.uk


Proceedings of the EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pages 41–48, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Positioning for Conceptual Development using Latent Semantic Analysis

Fridolin Wild, Bernhard Hoisl
Vienna University of Economics and Business Administration

Gaston Burek
University of Tübingen
Computational Linguistics Division

Abstract

With increasing opportunities to learn online, the problem of positioning learners in an educational network of content offers new possibilities for the utilisation of geometry-based natural language processing techniques.

In this article, the adoption of latent semantic analysis (LSA) for guiding learners in their conceptual development is investigated. We propose five new algorithmic derivations of LSA and test their validity for positioning in an experiment, in order to draw conclusions on the suitability of machine learning from previously accredited evidence. Special attention is directed towards the role of distractors and the calculation of thresholds when using similarities as a proxy for assessing conceptual closeness.

Results indicate that learning improves positioning. Distractors are of low value and seem to be replaceable by generic noise to improve threshold calculation. Furthermore, new ways to flexibly calculate thresholds could be identified.

1 Introduction

The path to new content-rich competencies is paved by the acquisition of new and the reorganisation of already known concepts. Learners willing to take this journey, however, are faced with the problem of positioning themselves at the point in a learning network of content where they leave their known trails and step into the unknown – and of receiving guidance in their subsequent conceptual development.

More precisely, positioning requires mapping characteristics from a learner's individual epistemic history (including both achievements and shortcomings) to the characteristics of the available learning materials, and recommending remedial action on how to achieve selected conceptual development goals (Van Bruggen et al., 2006).

The conceptual starting points of learners necessary to guide the positioning process are reflected in the texts they write. Through structure and word choice, most notably the application of professional language, the arrangement and meaning of these texts give cues about the level of competency¹ development.

¹ See (Smith, 1996) for a clarification of the difference between competence and competency.

As learning activities increasingly leave digital traces as evidence for prior learning, positioning support systems can be built that reduce this problem to developing efficient and effective matchmaking procedures.

Latent semantic analysis (LSA) (Deerwester et al., 1990), as one technology in the family of geometry-based natural language models, could in principle provide a technological basis for the positioning aims outlined above. The underlying assumption is that the similarity to and of learning materials can be used as a proxy for similarity in learning outcomes, i.e. the developmental change in conceptual coverage and organisation caused by learning.

In particular, LSA utilises threshold values for the involved semantic similarity judgements. Traditionally, the threshold is obtained by calculating the average similarity between texts that correspond to the same category. This procedure can be inaccurate if a representative set of documents for each category is not available. Furthermore, similarity values tend to decrease with increasing corpus and vocabulary sizes. Also, the role of distractors in this context, i.e. negative evidence used as reference material to sharpen classification for positioning, is largely unknown.

With the following experiment, we intend to validate that geometrical models (particularly latent semantic analysis) can produce near-human results in their propositions on how to account for written learner evidence of prior learning and on where to position these learners so that they find the best-suiting starting points. We will show that latent semantic analysis works for positioning and that it can provide effective positioning.

The main focus of this contribution is to investigate whether machine learning proves useful for the positioning classifiers, whether distractors improve results, and what role thresholds play for the classifiers.

The rest of this paper is structured as follows. First, positioning with LSA and related work are explained. This is followed by an outline of our own approach to positioning. Subsequently, a validation experiment for the set of new algorithms is described, shedding new light on the utilisation of LSA for positioning. The results of this experiment are analysed in the section that follows, in order to, finally, yield conclusions and an outlook.

2 Positioning with LSA

According to Kalz et al. (2007), positioning “is a process that assists learners in finding a starting point and an efficient route through the [learning] network that will foster competence building”. Often, the framework within which this competence development takes place is a formal curriculum offered by an educational provider.

Recognition of prior learning turns out to be crucial for positioning, and not only when considering a lifelong learner, for whom the borders between formal and informal learning are absolutely permeable: each individual background differs, and prior learning needs to be respected or even accredited before new learning activities are taken up – especially before enrolling in a curriculum.

Typically, the necessary evidence of prior learning (i.e., traces of activities and their outcomes) is gathered in a learner's portfolio. This portfolio is then analysed to identify both starting points and a first navigation path by mapping the evidence onto the development plans available within the learning network.

The educational background represented in the portfolio can be of a formal nature (e.g. certified exams), in which case standard admission and exemption procedures may apply. In other cases such standard procedures are not available, and assessors need to intellectually evaluate learner knowledge on specific topics. In procedures for accreditation of prior learning (APL), assessors decide whether the evidence brought forward may lead to exemptions from one or more courses.

For supporting the positioning process with technology (as needed, e.g., for APL), three different computational classes of approaches can be distinguished: mapping procedures based on the analysis of informal descriptions with text-mining technologies, metadata-based positioning, and positioning based on ontology mappings (Kalz et al., 2007). Latent semantic analysis is one of many possible techniques that can be employed to support or even partially automate the analysis of informal portfolios.

2.1 LSA

LSA is an algorithm applied to approximate the meaning of texts, thereby exposing semantic structure to computation. LSA combines the classical vector-space model with a singular value decomposition (SVD), a two-mode factor analysis. Thus, bag-of-words representations of texts can be mapped into a modified vector space that is assumed to reflect semantic structure.

The basic idea behind LSA is that the collocation of terms in a given document-term space reflects a higher-order – latent semantic – structure, which is obscured by word usage (e.g. by synonyms or ambiguities). By using conceptual indices that are derived statistically via a truncated SVD, this variability problem is believed to be overcome.

In a typical LSA process, first a document-term matrix is constructed from a given text base of n documents containing m terms. This matrix M of size m × n is then resolved by the SVD into the term vector matrix T (constituting the left singular vectors), the document vector matrix D (constituting the right singular vectors), both orthonormal, and the diagonal matrix S, such that M = T S D^T.

Multiplying the truncated matrices Tk, Sk and Dk results in a new matrix Mk (see Figure 1), which is the least-squares best-fit approximation of M with k singular values (Berry et al., 1994).

Figure 1: Reconstructing a textmatrix from the lower-order latent-semantic space.
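
As a minimal sketch of this reconstruction step, the truncation can be reproduced with base R's svd() on a toy matrix (the matrix values and the choice of k here are purely illustrative):

    # Toy document-term matrix M (m = 4 terms x n = 3 documents).
    M <- matrix(c(1, 0, 2,
                  0, 1, 1,
                  1, 1, 0,
                  0, 2, 1), nrow = 4, byrow = TRUE)

    # Full SVD: M = T S D^T, with T = s$u, D = s$v, S = diag(s$d).
    s <- svd(M)

    # Keep only k singular values and reconstruct Mk = Tk Sk Dk^T, the
    # least-squares best-fit approximation of M (Berry et al., 1994).
    k  <- 2
    Mk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])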

2.2 Related Work

LSA has been widely used in learning applications such as automatic assessment of essays, provision of feedback, and selection of suitable materials according to the learner's degree of expertise in specific domains.

The Intelligent Essay Assessor (IEA) is an example of the first type of application, where the semantic space is built from materials on the topic to be evaluated. Foltz et al. (1999) report the finding that the IEA's rating performance is close to that of human raters.

Van Bruggen et al. (2004) report that LSA-based positioning requires creating a latent-semantic space from text documents that model learners' and public knowledge on a specific subject. Those texts include written material of the learners' own production, materials that the learner has studied and learned from in the past, and descriptions of learning activities that the learner has completed in the past. Public knowledge on the specific subject includes educational materials of all kinds (e.g. textbooks or articles).

In this case the description of the activity needs to be rich in terminology related to the domain of application, as LSA relies on the use of rich terminology to characterise meaning.

Following the traditional LSA procedure, the similarity (e.g. cosine) between the LSA vector models of the private and the public knowledge is then calculated to obtain the learner's position with respect to the public knowledge.
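
As a sketch of this procedure with the lsa package for R (the package used later in this paper; directory paths and variable names are placeholders of our choosing):

    library(lsa)

    # Build a latent semantic space from the public-knowledge corpus
    # (one plain-text file per document in the directory).
    public <- textmatrix("corpus/public/")
    space  <- lsa(public, dims = dimcalc_share())

    # Fold the learner's writings into the existing space instead of
    # recomputing the SVD.
    learner <- fold_in(textmatrix("corpus/learner/",
                                  vocabulary = rownames(public)), space)

    # Cosine similarity of the first learner text to every public
    # document serves as the proxy for the learner's position.
    sims <- apply(as.textmatrix(space), 2, cosine, y = learner[, 1])
    sort(sims, decreasing = TRUE)[1:5]   # closest public documents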

3 Learning Algorithms for Positioning

In the following, we design an experiment, conduct it, and evaluate the results to shed new light on the use of LSA for positioning.

The basic idea of the experiment is to investigate whether LSA works for advising assessors on the acceptance (or rejection) of documents presented by the learner as evidence of previous conceptual knowledge on specific subjects covered by the curriculum. The assessment is in all cases done by comparing a set of learning materials (model solutions plus previously accepted/rejected reference material) to the documents from learners' portfolios, using cosines as a proxy for their semantic similarity.

In this comparison, thresholds have to be defined for the cosine measure's values, above which two documents are considered to be similar. Depending on how exactly the model solutions and the additional reference material are utilised, different assessment algorithms can be developed.

To validate the proposed positioning services elaborated below, we compare the automatic recommendations for each text presented as evidence with expert recommendations over the same text (external validation).

To train the thresholds and as a method for assessing the provided evidence, we propose the following five different unsupervised and supervised positioning rules. These configurations differ in how their similarity threshold is calculated and in which selection of documents (model solutions and previously expert-evaluated reference material) the 'incoming' documents are compared against. We subsequently run the experiment to investigate their effectiveness and compare the results obtained with them.

Figure 2: The five rules.

The visualisation in Figure 2 depicts the working principle of the rules described below; a code sketch of two of the rules follows the list. In each panel, a vector space is shown. Circles depict radial cosine similarity. The document representatives labelled gn are documents carrying positive evidence ('good' documents); the ones labelled bn carry negative evidence. The test documents carry the labels en ('essay').

Best of Golden: The threshold is computed by averaging the similarities of all three gold-standard essays to each other. The similarity of the investigated essay to the best-matching of the three gold-standard essays is taken as the machine score. If the machine score lies above the threshold, the test essay is classified as correct. This rule assumes that the gold standards show some variation in the correlation among each other and that using their average correlation as a threshold takes this into account.

Best of Good: The best essays of the humanly judged good ones serve as reference. The assumption behind this is that with more positive examples to evaluate an investigated essay against, the precision of the evaluation should rise. The threshold is the average similarity of the positive-evidence essays among each other.

Average to Good > Average among Good: Tests whether the similarity to the 'good' examples is higher than the average similarity among the humanly judged good ones. The assumption is that the gathered positive evidence circumscribes the area in the latent semantic space that is representative of the abstract model solution, and that any new essay should lie within the boundaries characterised by this positive evidence, thus having a higher correlation to the positive examples than they have among each other.

Best of Good > Best of Bad: Tests whether the maximum similarity to the good essays is higher than the maximum similarity to the bad essays. If a tested essay correlates more highly with the best of the good than with the best of the bad, it is classified as accepted.

Average of Good > Average of Bad: The same, but comparing the average similarity to the good essays with the average similarity to the bad ones. The assumption behind this is again that both bad and good evidence circumscribe an area and that the incoming essay falls into either the one or the other class.
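
Two of the rules can be stated compactly in R; the following is a sketch under the assumption that essays are column vectors of a matrix in the latent space (the function names are ours, not the paper's):

    library(lsa)

    # Rule 'Average to Good > Average among Good': accept a test essay
    # if its mean cosine to the positive examples exceeds the mean
    # pairwise cosine among the positive examples themselves.
    avg_to_good <- function(test_vec, good_mat) {
      pairwise  <- cosine(good_mat)          # good-vs-good similarities
      threshold <- mean(pairwise[upper.tri(pairwise)])
      mean(apply(good_mat, 2, cosine, y = test_vec)) > threshold
    }

    # Rule 'Best of Good > Best of Bad': accept if the maximum
    # similarity to the accepted examples beats the maximum similarity
    # to the rejected ones.
    best_vs_best <- function(test_vec, good_mat, bad_mat) {
      max(apply(good_mat, 2, cosine, y = test_vec)) >
        max(apply(bad_mat, 2, cosine, y = test_vec))
    }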

4 Corpus and Space Construction

The corpus for building the latent semantic space is constructed from 2/3 general German-language material (newspaper articles) and 1/3 domain-specific material (a textbook split into smaller units, enhanced by a collection of topic-related documents which Google threw up among the first hits). The corpus has a size of 444k words (59,719 terms, 2,444 textual units); the mean document length is 181 words with a standard deviation of 156. The term frequencies have a mean of 7.4 with a standard deviation of 120.

The latent semantic space is constructed over this corpus using the lsa package for R (Wild, 2008; Wild and Stahl, 2007), with dimcalc_share() as the method for estimating a good number of singular values to keep and with the standard settings of textmatrix() for pre-processing the raw texts. The resulting space utilises 534 dimensions.

For the experiment, 94 essays scored by a human evaluator on a scale from 0 to 4 points were used. The essays have a mean document length of 22.75 terms with a standard deviation of 12.41 (about one paragraph).

To estimate the quality of the latent semantic space, the learner writings were folded into the semantic space using fold_in(). Comparing the non-partitioned (i.e. 0 to 4 in steps of .5) human scores with the machine scores (the average similarity to the three initial model solutions), a highly significant trend can be seen that is far from perfect but still only slightly below what two human raters typically show.

Figure 3: Human vs. Machine Scores.

Figure 3 shows the qualitative human expert judgements versus the machine grade distribution, using the non-partitioned human scores (from 0 to 4 points in .5 intervals) against the rounded average cosine similarity to the initial three model solutions. These machine scores are rounded such that they – again – create the same number of intervals. As can be seen in the figure, the extremes of each score level are displayed by the upper and lower whiskers; additionally, the lower and upper 'hinges' and the median are shown. The overall Spearman's rank correlation of the human versus the (continuous) machine scores, at .51, indicates a medium effect that is highly significant, with a p-value below .001. Compared to untrained human raters, who typically correlate around .6, this lies in a similar range, though the machine's deviations can be expected to be different in nature.
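
This comparison amounts to a rank correlation test. A self-contained sketch with randomly generated stand-in scores (not the study's data) would be:

    # Illustrative stand-ins: 94 human scores on the 0-4 scale and 94
    # continuous machine scores (in the study, the average cosine
    # similarity to the three model solutions).
    set.seed(42)
    human   <- sample(seq(0, 4, by = 0.5), 94, replace = TRUE)
    machine <- runif(94)

    # Spearman's rank correlation; the paper reports rho = .51 with
    # p < .001 for the real scores.
    cor.test(human, machine, method = "spearman")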

A test with 250 singular values was conducted, resulting in a considerably lower Spearman correlation between the non-partitioned human and machine scores.

Both the background and the test corpus have deliberately been chosen from a set of nine real-life cases to serve as a prototypical example.

For the experiment, the essay collection was split in half into a training set (46 essays) and a test set (48) for the validation. Each set was partitioned into roughly equal numbers of accepted essays (scores >= 2; 22 in the training set, 25 in the test set) and rejected essays (scores < 2; 24 in training, 23 in test). All four subsets – test and training, each partitioned into accepted and rejected – thus include a similarly large number of texts.

In order to cross-validate, the training and test sets were randomly sampled ten times to eliminate influences on the algorithms from the sort order of the essays. Both test and training sets were folded into the latent semantic space. Then, random subsamples (see below) of the training set were used to train the algorithms, whereas the test set of 48 essays was deployed in each run to measure precision, recall, and the F-measure and so analyse the effectiveness of the proposed rules.

Within the algorithms, similarity is used as a proxy to determine whether a student writing should be accepted for the concept in question or rejected. As similarity measure, the cosine similarity cosine() was used.

In each randomisation loop, the share of accepted and rejected essays to learn from was varied in a second loop of seven iterations: always half of the training-set essays were used, and the number of accepted essays was decreased from 9 to 2 while the number of rejected essays was increased from 2 to 9. This way, the influence of the number of positive (and negative) examples could be investigated.

This mixture of accepted and rejected evidence to learn from was diversified to investigate the influence of learning from changing shares and rising or falling numbers of positive and/or negative reference documents, as well as to analyse the influence of recalculated thresholds. While varying these training documents, the human judgements were given to the machine in order to model learning from previous human assessor acceptance and rejection.
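
The effectiveness measures used in the next section are the standard ones; a minimal sketch with hypothetical accept/reject labels follows:

    # Precision, recall, and F-measure from predicted vs. actual
    # accept (TRUE) / reject (FALSE) decisions.
    prf <- function(predicted, actual) {
      tp        <- sum(predicted & actual)
      precision <- tp / sum(predicted)
      recall    <- tp / sum(actual)
      f         <- 2 * precision * recall / (precision + recall)
      c(precision = precision, recall = recall, f = f)
    }

    # Hypothetical decisions for a test set of 48 essays.
    set.seed(42)
    actual    <- sample(c(TRUE, FALSE), 48, replace = TRUE)
    predicted <- sample(c(TRUE, FALSE), 48, replace = TRUE)
    prf(predicted, actual)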

5 Findings

5.1 Precision versus Recall

The experiments were run with the five different algorithms and the sampling procedures described above. For each experiment, precision and recall were measured to find out whether an algorithm can learn from previous inputs and whether it is better or worse than the others.

As mentioned above, the following diagrams depict, from left to right, a decreasing number of accepted essays available for training (9 down to 2), while the number of rejected essays made available for training increases (from 2 to 9).

Rules 1 to 3 do not use these negative samples; rule 1 does not even use the positive samples, but just three additional model solutions not contained in the training material of the others. The curves show the average precision, recall, and F-measure² of the ten randomisations necessary for the cross-validation. The size of the circles along the curves symbolises the share of accepted essays in the training set.

² F = 2 · (precision · recall) / (precision + recall)

Figure 4: Rule 1: Best of Three Golden


Figure 4 shows that recall and precision stay stable, as there are no changes to the reference material taken into account: all essays are evaluated using three fixed 'gold standard' texts. This rule serves as a baseline benchmark for the other results.

Figure 5: Rule 2: Best of Good

Figure 5 depicts a falling recall when fewer positively judged essays are in the training sample. In most cases, the recall is visibly higher than for the first rule, 'Best of Gold', especially when given enough good examples to learn from. Precision is rather stable. We interpret the falling recall as traceable to the problem that too few examples cannot model the target area of the latent semantic space.

Figure 6: Rule 3: Avg of Good > Avg among Good

Figure 6 displays that the recall worsens and is very volatile³. Precision, however, is very stable and slightly higher than in the previous rule, especially with rising numbers of positive examples. The recall seems to depend strongly on whether the positive examples are able to characterise representative boundaries: seeing recall change with varying numbers of positive examples indicates that the boundaries are not very well chosen. We assume this is related to the presence of 'just pass' essays, scored with 2.0 or 2.5 points, which distort the boundaries of the target area in the latent semantic concept space.

³ We analysed the recall in two more randomisations of the whole experiment; whereas the other rules showed the same results, the recall of this rule was unstable over the test runs, but in tendency lower than in the other rules.

Figure 7: Rule 4: Best of Good > Best of Bad

Figure 7 exhibits a quickly falling recall, though it starts at a very high level, whereas precision is relatively stable. Having more negative evidence clearly seems to be counterproductive, and it appears more important to have positive examples to learn from. We have two explanations for this: first, bad examples scatter across the space, and a good essay is likely to correlate more highly with a bad one when there is only a low number of positive examples; second, bad essays might contain very few words and thus expose correlation artefacts that would in principle be easy to detect, but not with LSA.

Figure 8 depicts a recall that is generally higher than in the 'Best of Gold' case, while precision is in the same range. The recall seems not to be as stable, but it does not drop with more bad samples (and fewer good ones) to learn from, as it does in the 'Best of Good' case. We interpret this to mean that noise can be added to increase recall while only a low number of positive examples is available.


Figure 8: Rule 5: Avg of Good > Avg of Bad

5.2 Clustering

To gain further insight into the location of the 94 essays and the three gold standards in the higher-order latent-semantic space, a simple cluster analysis of their vectors was applied. To this end, all document-to-document cosine similarities were calculated, filtered by a threshold of .65 to capture only strong associations, and, subsequently, a network plot of the resulting graph was visualised; a code sketch of this analysis follows Figure 9.

Figure 9: Similarity Network (cos >= .65).
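
A sketch of this analysis, assuming the 97 document vectors are columns of a matrix in the latent space and choosing the igraph package for the network plot (the paper does not name its plotting tool):

    library(lsa)
    library(igraph)

    # Illustrative stand-in for the 94 essays plus 3 gold standards as
    # vectors in a 534-dimensional latent space.
    set.seed(42)
    docs <- matrix(rnorm(534 * 97), nrow = 534)

    # All pairwise document-to-document cosines, thresholded at .65 to
    # keep only strong associations.
    sims <- cosine(docs)
    adj  <- 1 * (sims >= 0.65)
    diag(adj) <- 0

    # Visualise the resulting graph as a network plot.
    g <- graph_from_adjacency_matrix(adj, mode = "undirected")
    plot(g, vertex.size = 4, vertex.label = NA)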

As can be seen in the two charts, the humanly positively judged evidence seems to cluster quite well in the latent-semantic space when visualised as a network plot. By filtering the document vectors to the vocabulary used only in the accepted class, the rejected class, or both, an even clearer picture could be generated, shown in Figure 10.

Figure 10: Network with filtered vocabulary.

Both figures clearly depict one big connected component consisting mainly of accepted essays, whereas the rejected essays mainly spread through the unconnected surroundings. The rejected essays are in general not similar to each other, whereas the accepted samples are.

Figure 10 is even more homogeneous than Figure 9 due to the use of the restricted vocabulary (i.e. the terms used in all accepted and rejected essays).

6 Conclusion and Outlook

Distractors are of low value in the rules tested. It seems that generic noise can be added instead to keep recall higher when only a low number of positive examples can be utilised. An explanation for this is that there are always far more heterogeneous ways to make an error: homogeneity can only be assumed for the positive evidence, not for the negative evidence.

Noise seems to be useful for the calculation of thresholds, though it will need further investigation whether our new hypothesis holds that bad samples can be virtually anything (that is not good).

Learning helps. The recall was shown to improve in various cases, while precision stayed at more or less the same level as for the simple baseline rule. The threshold calculation using the difference to good and bad examples, though, seemed to bear the potential of increasing precision.

Thresholds, and the ways to calculate them, are evidently important. We proposed several well-working ways of constructing thresholds from evidence that extend the state of the art. Thresholds usually vary with changing corpus sizes, and the measures proposed can adapt to that.

We plan to investigate the use of support vector machines in the latent semantic space in order to gain more flexible means of characterising the boundaries of the target area representing a concept.

It should be mentioned that this experiment demonstrates that conceptual development can be measured and that texts and their similarity can serve as a proxy for it. Of course, the experiment we have conducted bears the danger of yielding results that are stable only within the chosen topical area.

We were able to demonstrate that textual representations work at a granularity level of around 23 words, i.e. at the typical length of a free-text question in an exam.

Additionally using three model solutions, or at least two positive samples, we were able to show that valid classifiers can be built with relative ease from a textbook split into paragraph-sized textual units combined with generic background material. Furthermore, reference material to score against can be collected along the way.

The most prominent open problem is to try to get rid of model solutions as reference material completely and to assess the lower-level concepts (terms and term aggregates) directly, in order to further reduce corpus construction and the collection of reference material. Using clustering techniques, this will mean identifying useful ways for efficient visualisation and analysis.

7 Acknowledgements

This work has been financially supported by the European Union under the ICT programme of the 7th Framework Programme in the project 'Language Technology for Lifelong Learning'.

References

Michael Berry, Susan Dumais, and Gavin O'Brien. 1994. Using linear algebra for intelligent information retrieval. Technical Report CS-94-270, Department of Computer Science, University of Tennessee.

Scott Deerwester, Susan Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Peter Foltz, Darrell Laham, and Thomas K. Landauer. 1999. Automated essay scoring: Applications to educational technology. In Collis and Oliver, editors, Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 1999, pages 939–944, Chesapeake, VA. AACE.

Marco Kalz, Jan Van Bruggen, Ellen Rusman, Bas Giesbers, and Rob Koper. 2007. Positioning of learners in learning networks with content analysis, metadata and ontologies. Interactive Learning Environments, (2):191–200.

Mark K. Smith. 1996. Competence and competency. http://www.infed.org/biblio/b-comp.htm.

Jan van Bruggen, Peter Sloep, Peter van Rosmalen, Francis Brouns, Hubert Vogten, Rob Koper, and Colin Tattersall. 2004. Latent semantic analysis as a tool for learner positioning in learning networks for lifelong learning. British Journal of Educational Technology, (6):729–738.

Jan Van Bruggen, Ellen Rusman, Bas Giesbers, and Rob Koper. 2006. Content-based positioning in learning networks. In Kinshuk, Koper, Kommers, Kirschner, Sampson, and Didderen, editors, Proceedings of the 6th IEEE International Conference on Advanced Learning Technologies, pages 366–368, Kerkrade, The Netherlands.

Fridolin Wild and Christina Stahl. 2007. Investigating unstructured texts with latent semantic analysis. In Lenz and Decker, editors, Advances in Data Analysis, pages 383–390, Berlin. Springer.

Fridolin Wild. 2008. lsa: Latent semantic analysis. R package version 0.61.
