NetPal: A Dynamic Network Administration Knowledge Base

Ashley George, Adetokunbo Makanju, Evangelos Milios, Nur Zincir-Heywood

Faculty of Computer Science

Dalhousie University

Halifax, NS

{ageorge, makanju, eem, zincir}@cs.dal.ca

Markus Latzel, Sotirios Stergiopoulos

Palomino System Innovations Inc.

Toronto, ON

{markus, sotirios}@palominosys.com

Abstract

NetPal is a web-based dynamic knowledge base system designed to assist network administrators in their troubleshooting tasks, in recalling and storing experience, and in identifying new failure cases and their symptoms. In the context of web hosting environments, NetPal summarises network data and supports retrieval of relevant organisational experience for system administrators. The system design draws on a variety of domains including knowledge management, information retrieval, machine learning and network management. We describe the system architecture, user interface design, user software testing and future directions for development.

1 Introduction

In the course of their duties, system administrators acquire and retain extensive specialised knowledge regarding the operation of complex systems and networks. Experienced system administrators can rapidly narrow the field of candidate solutions when confronted with network faults and associated evidence. Thus the value of experience for a system administrator is high; the administrator uses this knowledge to judge an appropriate initial avenue of investigation. This presents an opportunity to optimise use of time and resources by creating a system which supports experience management [2, 8, 9] and recall in system administrators through suggestion of relevant previously recorded experience cases.

Copyright © 2008 Dr. Evangelos Milios, Dr. Nur Zincir-Heywood, Ashley George, Tokunbo Makanju, Markus Latzel, Sotirios Stergiopoulos. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

In addition to knowing how to solve a recognised fault, an administrator needs to detect the evidence which characterises a fault. The collective data stream generated by hosts in a large network grows to exceed that which can be sorted by ad-hoc visual inspection. Numerical and visual tools for filtering and reducing the data stream allow the administrator to gain a high-level overview of their network’s behaviour.

In the NetPal project, a prototype experience retrieval system for system administrators was produced in the context of the web hosting business. Our goal is to reduce the learning curve faced by junior system administrators and to supplement recall in senior system administrators through retrieval of previous experience cases, integrated with web-based reporting and analysis of network event data, particularly host-based server log data. We approach this problem through a combination of information retrieval, knowledge management, host and network analysis and user interface design.

The process of troubleshooting a network fault is described in [10] as comprising the following steps, which are roughly illustrated in Figure 1.

1. alarm collection
2. customer satisfaction maintenance
3. alarm filtering and correlation
4. fault diagnosis
5. development and implementation of an action plan
6. verification of fault elimination
7. statistical analysis of fault management process

Figure 1: NetPal: network fault lifecycle diagram

The fourth and fifth steps are considered to account for 80% of the troubleshooting effort, based on experience at Palomino; reductions in the effort required from system administrators both for identifying faults and planning resolutions would result in significant savings for an organisation. We aim to achieve these reductions by increasing the amount of relevant prior information available to system administrators as they tackle a particular fault scenario.

2 Motivation

The NetPal project is inspired by previous efforts in constructing network knowledge management systems [4, 7, 10, 6]. Many of these systems incorporate rule-based or case-based reasoning. We also take note of a successful experimental medical diagnosis support system, based on similar principles of retrieving literature relevant to a case using text mining techniques [12]. Drawing on these sources, we focus on constructing a system which acts to ease the recall of experience and the interpretation of system information by administrators. Therefore we are less concerned with automatically producing corrections for problems; we are more interested in prompting both learning and recall in employees who are well-suited for correcting emerging problems.

2.1 Rule- and Case-based Reasoning

Rule-based reasoning (RBR) systems, in the context of computer network management, tend to take a control systems approach, as reviewed in [4]. In a network-oriented control system, there is a database of rules which consists of conditions and procedures. The conditions represent knowledge about the normal operating state of the network. When these conditions are violated, the network enters an irregular operating state. At this point, procedures can be executed to ostensibly return the state of the network to the normal operating state. In [7], Lewis suggests that rule-based reasoning systems tend to suffer from brittleness and a “knowledge acquisition bottleneck”. These faults are respectively characterised by difficulty in adapting to novel problem situations and an unconditioned growth of the rule base as new conditions and procedures are inserted. In the latter case, the database becomes unmanageable and may lead to unpredictable behaviour.

Following this assessment, Lewis proposes a case-based reasoning system to record and adapt “episodes of problem solving” where the adaptation of these episodes consists of observing interactions with the case database (“abstraction/respecialization adaptation”) or modifications directly to the cases themselves (“critic-based adaptation”) [7]. The system, ‘Critter’, enhances a typical trouble ticket database system by introducing constraint-based ticket filtering and user feedback-capturing mechanisms for updating the case database over time while minimising user intervention where possible. Lewis’ augmentation of the trouble ticket system appears to favour a predicate-oriented representation (as with RBR) where we prefer an information retrieval strategy.

An alternate approach to case-based reasoning for system administration is to insert the reasoning system at a different point in the problem resolution process. In [10], the case-based reasoning system is situated closer to the alarm collection source, in the sense that alarms are processed by the system before a ticket is raised, in the hope of reducing the volume of raised tickets. This approach targets a coarse-grained control system: when the collected alarms match the preconditions for a case, that case and its attached procedures are automatically applied to resolve the problem. Contrast this coarse-grained activity with the application of a discrete series of individual rules and procedures geared to incrementally restore operational state, as in a rule-based system. For their experimental testing, pre-defined cases were used. It was found that procedures were automatically applied to 8% of faults over time, leaving 92% of faults for “in-site maintenance”.

Each of the systems described in these works focuses on applying fine-grained or coarse-grained procedures in the context of telecommunications systems when predetermined conditions are matched. In the fine-grained case, the expert is required to craft a number of condition-action statement pairs which incrementally return the system to its ideal operating state. In the coarse-grained case, the domain expert crafts a complex condition and a complex procedural response which returns the system to its ideal operating state. In both cases, the requirement of formal statements for conditions and solutions suggests difficulty in future adaptation of the rule or case base to changing network conditions.
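The contrast between fine-grained condition-action pairs and coarse-grained cases can be made concrete with a small sketch. It is illustrative only and does not reproduce any of the cited systems; the rule names, alarm names, state fields and the helper functions `run_rules` and `match_case` are all hypothetical.

```python
# Fine-grained: each rule pairs one condition with one small corrective action.
def run_rules(state, rules):
    """Apply every rule whose condition holds; return the names of the rules fired."""
    fired = []
    for name, condition, action in rules:
        if condition(state):
            action(state)
            fired.append(name)
    return fired


# Coarse-grained: a case bundles a set of alarm preconditions with one whole procedure.
def match_case(alarms, cases):
    """Return the procedure of the first case whose preconditions are all present."""
    for preconditions, procedure in cases:
        if preconditions <= alarms:  # subset test: every precondition was observed
            return procedure
    return None


# Hypothetical network state and rules.
state = {"disk_used": 0.97, "httpd_up": False}
rules = [
    ("free-disk", lambda s: s["disk_used"] > 0.9, lambda s: s.update(disk_used=0.5)),
    ("restart-httpd", lambda s: not s["httpd_up"], lambda s: s.update(httpd_up=True)),
]
fired = run_rules(state, rules)

# Hypothetical case base with a single pre-defined case.
cases = [(frozenset({"link_down", "high_loss"}), "reroute-and-reset")]
procedure = match_case({"link_down", "high_loss", "fan_warn"}, cases)
```

In both functions the expert must state the conditions formally in advance, which is the adaptation difficulty noted above.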

2.2 Information Retrieval and Medical Diagnosis Support

Patients in a clinical setting may exhibit a mixture of symptoms, giving rise to multiple plausible explanations for their condition. The caregiver for the patient is expected to infer from the observed symptoms (and patient case history) an accurate diagnosis which will provide a basis for corrective and preventative medical action.

In the context of medical diagnosis, ISABEL [12] is a diagnosis support system which, in early trials, has reduced errors of “diagnosis omission” [11, 16, 13]. ISABEL is reminiscent of a search engine and is heavily tailored toward searching medical literature. This content restriction allows for the implementation of domain-specific features, including presentation of probabilistically ranked diagnoses, expert annotations of case material and more. Unlike the rule-based and case-based systems described in the previous section, the purpose of ISABEL is not to automatically apply rules or case-based scripts but to act as a reminder aid for the expert.

Much as a doctor may benefit from prompting to consider alternative diagnoses, network administrators may also benefit from a system which prompts knowledge recall. Our problem domain requires a different approach from ISABEL given that there is no systemic, widely applicable body of diagnostic literature which uniquely characterises root causes of fault symptoms for web hosting environments. Thus we provide the administrator with search and visualisation tools to help them interactively narrow their hypotheses.

We aim to enhance the abilities of administrators to filter generated system events and to recover previous work experience. Where rule-based and case-based systems attempted to automatically detect conditions and apply scripted solutions, we rely on arming administrators with increased prior information so they may better assess conditions and produce solutions. Where ISABEL links medical literature to symptom descriptions, we index and present previous experience and domain-specific documents to aid the administrator in gaining understanding of the current case. We aim not to automate all problem-solving but to improve knowledge transmission and recall for junior and senior network administrators.

3 The Web Hosting Environment

In our work, NetPal is developed and tested in the context of the small to medium web hosting environment. Servers deployed in a web hosting environment handle hosting responsibilities in a distributed manner. In addition to web content, there are many services offered by the web hosting provider to support their customers, including email, domain name services and custom application support. The web hosting environment therefore produces multiple kinds of faults, emanating from different hardware, software, customer and consumer sources.

The hardware and software configuration selected by a web hosting provider will depend on customer needs. Cardellini et al. [3] describe several configuration scenarios for handling high-load web applications. While small customers with relatively static website content consume a fraction of a server’s capacity, customers with high traffic and application processing costs require concurrent solutions consuming multiple physical servers, both for processing and for redundancy.

Allocation of resources in a web hosting environment is therefore multiplied over reliability and responsibilities. To handle these responsibilities, there are typically groups of servers dedicated to

• handling web requests

• application layer processing

• database and object retrieval

NetPal targets this kind of heterogeneous operating environment which contains off-the-shelf components, custom components and per-customer hardware and software configurations. By taking an information retrieval approach, we hope to mitigate some of this heterogeneity when searching through collected data. For our testbed environments, we chose to simulate a complex hosting environment by replaying log data and case history collected at Palomino Inc. into databases on the NetPal test servers. In Section 5, we further describe how our software testers interacted with this system and summarise their feedback.

4 System Architecture

The NetPal system is composed of a set of core components which are intended to integrate with external fault detection systems. The core components include the following.

• case knowledge base

• problem/solution matching engine

• presentation, feedback and editor modules

Their component interactions are summarised in Figure 2.

4.1 Case and Problem/Solution Matching

Underlying the case knowledge and problem/solution matching engine is an information retrieval (IR) subsystem. In an information retrieval system, the prevailing metaphor is that of query-answering, normally by returning a ranked set of documents for the user’s query under a measure of relevancy [1, 5].

Figure 2: NetPal core system components diagram

In our system, queries can be generated manually or automatically as the system administrator interacts with NetPal and builds a case. We then make recommendations for related articles, cases and events (logfile entries, at this time). The administrator can also query the relevant components individually. We would typically generate a list of keywords from case components, but an example of a manual query might be a series of keywords garnered from observing log entries or a best initial hypothesis, such as

• “php open_basedir”

• “domain conf problem”

These kinds of queries can be extended through query expansion: using a thesaurus with domain knowledge to automatically produce a list of synonyms. We use query expansion to optionally broaden the scope of queries generated in NetPal.
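As a minimal sketch of thesaurus-based query expansion, the following assumes a flat dictionary that maps a term to related terms. The dictionary contents and the `expand_query` helper are hypothetical and stand in for whatever domain thesaurus is available.

```python
# Hypothetical thesaurus: each term maps to related domain terms.
THESAURUS = {
    "php": ["mod_php", "php.ini"],
    "domain": ["vhost", "dns"],
    "conf": ["config", "configuration"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms followed by their related terms, without duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


expanded = expand_query("domain conf problem")
```

Original terms are kept first so that a later ranking step could weight them above the added synonyms if desired.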

4.1.1 Information Retrieval

We adopt the vector space model to characterise our cases for retrieval purposes and to capture the similarity between vectors using the cosine angle distance. This model, originally described in [14] and summarised in [5], represents documents as vectors in a term frequency-inverse document frequency (TF-IDF) space. We describe this model here as a basis for our system.

Let $T$ name the vocabulary of unique terms extracted from the document corpus and $t_i$ be the $i$'th term in $T$. Let $D$ be the set of all documents and $d_j$ be a particular document. Then, the normalised term frequency $tf_{ij}$ is given by

$$tf_{ij} = \frac{n_{ij}}{|d_j|} \tag{1}$$

where the frequency of occurrence of term $t_i$ in document $d_j$ is denoted by $n_{ij}$. This measure captures the relative importance of a term in a particular document while accounting for the effect of varying document length.

The in-document term frequency (TF) is one component of the representation. The other is the inverse document frequency (IDF). Given a term $t_i$, the inverse document frequency is calculated as

$$idf_i = \log\left(\frac{|D|}{|\{d_j : t_i \in d_j\}|}\right) \tag{2}$$

This quantity characterises the support of the term $t_i$ in the document corpus; terms which appear in few documents obtain a high IDF score while terms which occur widely throughout the document set will score closer to zero.
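Equations 1 and 2 can be illustrated on a toy corpus. The sketch below assumes that documents are already tokenised, that $|d_j|$ is the number of tokens in the document, and that documents are non-empty; the function names and sample tokens are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(doc_tokens):
    """Equation 1: occurrences of each term divided by the document length."""
    counts = Counter(doc_tokens)
    return {term: n / len(doc_tokens) for term, n in counts.items()}


def inverse_document_frequencies(corpus):
    """Equation 2: log(|D| / number of documents containing the term)."""
    document_frequency = Counter()
    for doc_tokens in corpus:
        document_frequency.update(set(doc_tokens))  # count each term once per document
    return {t: math.log(len(corpus) / n) for t, n in document_frequency.items()}


# Toy corpus of three tokenised documents.
corpus = [
    ["php", "open_basedir", "error"],
    ["domain", "conf", "error"],
    ["php", "error", "error", "timeout"],
]
tf = term_frequencies(corpus[2])
idf = inverse_document_frequencies(corpus)
```

Here "error" occurs in every document, so its IDF is zero, while "timeout" occurs in only one and receives the highest IDF in the corpus.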

For document $d_j$ and term $t_i$, $tfidf_{ij}$ is the product of these quantities. The term frequency captures the in-document importance of $t_i$ and the inverse document frequency captures the whole-corpus importance of $t_i$. These values, multiplied, reward most those terms which are highly characteristic of a particular document, yet not widely occurring with respect to the entire document set. Finally, a particular document $d_j$ is represented as a vector of such TF-IDF scores, with one entry for each term in the vocabulary $T$, as in Equation 4.

$$tfidf_{ij} = tf_{ij} \cdot idf_i \tag{3}$$

$$d_j = \left\{ tfidf_{(1,j)}, tfidf_{(2,j)}, \cdots, tfidf_{(|T|,j)} \right\} \tag{4}$$

We facilitate retrieval of document vectors by constructing an inverted index [1, 5] relating terms to the documents in which they appear. This enables a query document to be matched against terms in the database which in turn generates a set of candidate matching documents. That is, for each term in the query document, the corresponding set of documents in which that term appears is fetched from the database. However, these document sets may overlap and they are unordered. Therefore, we order them based on the cosine angle similarity between the query vector, $q$, and the vector for each document in the result set.

$$\mathrm{cossim}(q, d_j) = \frac{\sum_i tfidf_{iq} \cdot tfidf_{ij}}{\sqrt{\sum_i tfidf_{iq}^2 \cdot \sum_i tfidf_{ij}^2}} \tag{5}$$

The query vector can be generated manually or automatically from different sources; we adapt the information retrieval system to different tasks through modifying the indexing and either requesting query input or generating queries automatically.
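The retrieval steps described in this section (TF-IDF vectors, an inverted index for candidate selection, and cosine ranking) can be sketched end to end as follows. This is an in-memory illustration under stated assumptions, not the NetPal implementation: documents are pre-tokenised, vectors are stored sparsely as dictionaries, and the document identifiers and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (Equations 1 and 3) for one token list."""
    counts = Counter(tokens)
    return {t: (n / len(tokens)) * idf[t] for t, n in counts.items() if t in idf}


def cosine(u, v):
    """Equation 5 on sparse vectors stored as dicts; 0.0 if either vector is all zero."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values()) * sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def build(corpus):
    """Return (idf, document vectors, inverted index) for a dict of token lists."""
    df = Counter(t for tokens in corpus.values() for t in set(tokens))
    idf = {t: math.log(len(corpus) / n) for t, n in df.items()}
    vectors = {doc_id: tfidf_vector(tokens, idf) for doc_id, tokens in corpus.items()}
    index = defaultdict(set)
    for doc_id, tokens in corpus.items():
        for t in tokens:
            index[t].add(doc_id)  # term -> documents in which it appears
    return idf, vectors, index


def search(query_tokens, idf, vectors, index):
    """Fetch candidates through the inverted index, then rank them by cosine similarity."""
    q = tfidf_vector(query_tokens, idf)
    candidates = set().union(*(index.get(t, set()) for t in query_tokens))
    return sorted(candidates, key=lambda d: cosine(q, vectors[d]), reverse=True)


# Hypothetical mini knowledge base of two cases and one article.
corpus = {
    "case-1": ["php", "open_basedir", "restriction"],
    "case-2": ["domain", "conf", "problem"],
    "article-1": ["php", "upload", "limit"],
}
idf, vectors, index = build(corpus)
results = search(["php", "open_basedir"], idf, vectors, index)
```

Only documents sharing at least one query term are scored, so "case-2" is never compared against the query; "case-1" ranks first because it matches both query terms.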

4.1.2 Documents and Queries in NetPal

We have described the vector space model for information retrieval. In NetPal, we use this model for retrieving indexed articles, cases and events. Each of these types of document is represented as a vector of TF-IDF weight values in a corresponding index. We reduce the index size by filtering unrepresentative terms, as is common in information retrieval applications [1, 5]; in particular, we filter terms by their inverse document frequency, reducing the term dimensionality of our indices by an order of magnitude (e.g., from tens of thousands of terms to a thousand terms).
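The text does not state which IDF values are filtered out, so the following sketch assumes a simple band: terms are kept only if their IDF lies between two thresholds, which drops both near-universal and very rare terms. The thresholds, the sample corpus and the `select_vocabulary` helper are hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(corpus, min_idf, max_idf):
    """Keep only the terms whose IDF lies within [min_idf, max_idf]."""
    df = Counter(t for tokens in corpus for t in set(tokens))
    n_docs = len(corpus)
    return {t for t, n in df.items() if min_idf <= math.log(n_docs / n) <= max_idf}


corpus = [
    ["error", "php", "open_basedir"],
    ["error", "php", "timeout"],
    ["error", "dns", "timeout"],
    ["error", "mail", "queue"],
]
# "error" (IDF 0) is too common; terms seen in one document (IDF log 4) are too rare.
vocab = select_vocabulary(corpus, min_idf=0.5, max_idf=1.0)
```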

We produced a system taxonomy through a combination of automatic generation and expert refinement of the vocabulary, and with manual organisation of the taxonomy’s hierarchical structure. We use the taxonomy to expand queries by introducing related terms. For problem/solution case matching, we generate queries automatically from terms in the case itself and any attachments it may hold (including text fields, log events and attached articles or cases). When directly searching through cases, we expressly allow the user to input queries and optionally expand these using the taxonomy.

4.1.3 Log Data Acquisition Engine

We developed a system for aggregating logged data, dubbed the ‘Log Data Acquisition Engine’ (or LDAE). The LDAE is a background process which serves to collect system log data from multiple hosts, process the entries to allow basic grouping of like events, and feed the data into an SQL database for manual or automatic correlation with cases in NetPal.

Feeding log entries through an aggregator enables the collection of global log data statistics. We initially attempted using an association rule mining algorithm for identifying strong subpatterns in log entries, as in Vaarandi [17, 15]. This would allow us to perform grouping of log entries in the user interface by frequent shared tokens. However, the offline cost of this algorithm was prohibitive for our system.

Therefore, we selected a static set of regular expressions for matching common logfile patterns. This included dates, times, IP addresses, common services, local hostnames and more. By replacing highly-variant substrings in consecutive entries with common tokens, we could enable a limited version of the grouping effect we aimed to achieve. We employ this in the user interface for ‘collapsing’ like entries into clusters and reducing the number of log entries on display.
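The token-replacement idea can be sketched as below. The paper does not list its regular expressions, so the four patterns, the placeholder tokens and the sample log lines are illustrative assumptions; `normalise` and `collapse` are hypothetical helper names.

```python
import re
from itertools import groupby

# Illustrative patterns only, applied in order: specific shapes before bare numbers.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d{2}:\d{2}:\d{2}\b"), "<TIME>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "<DATE>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def normalise(entry):
    """Replace highly variant substrings with common tokens."""
    for pattern, token in PATTERNS:
        entry = pattern.sub(token, entry)
    return entry


def collapse(entries):
    """Group consecutive entries that share a template: [(template, count), ...]."""
    return [(template, len(list(group)))
            for template, group in groupby(map(normalise, entries))]


entries = [
    "2008-03-01 12:01:07 sshd[2211]: Failed password for root from 10.0.0.5 port 4312",
    "2008-03-01 12:01:09 sshd[2214]: Failed password for root from 10.0.0.9 port 5120",
    "2008-03-01 12:02:30 httpd: client denied by server configuration",
]
clusters = collapse(entries)
```

`itertools.groupby` only merges adjacent items, which matches the description of collapsing consecutive entries rather than clustering the whole log.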


In sum, the role of the LDAE is to collect and preprocess log data, from multiple services distributed across hosting servers, and to aggregate this data in a central database. Once these log entries are collected in the central database, they can be indexed for subsequent retrieval in NetPal.

4.2 User and Web Interface Design

The main NetPal interface consists of a web-based module which can be integrated in popular web hosting control panel software as a plugin or can be run separately as a standalone web application. Figure 3 displays an overview of the NetPal user interface. A number of components are visible and we address these here.

The main page is divided into two panes. We focus on the left pane which contains the event browser and event graph. The event browser is pictured in greater detail in Figure 4 (a). The browser includes several features to aid in sifting through logged data. Like cases, events are searchable and the list of relevant events can be narrowed by entering search terms in the search field at the top of the event browser. Each event is initially depicted using a one-line summary, here containing the date/time of occurrence, the service responsible for generating the event and the message associated with this event. Notice here that the raw event can be recalled by selecting the desired event in the browsing list; the event then is displayed in its entirety below the list.

Underneath the events browser is the events graph, depicted in Figure 4 (b). The events graph is meant to impart the level of event activity over time. The height of the peaks on the chart corresponds to the number of events occurring during that time interval. The time interval of the graph can be modified for greater detail. The graph is colour-coded, as in the legend, to indicate which services generate which peaks. Note that clicking anywhere within the graph repositions the event browser’s listing to show events surrounding the chosen time.
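The data behind such a graph amounts to counting events per time interval and per service. A minimal sketch, assuming events arrive as (timestamp, service) pairs and intervals are aligned to a fixed epoch; the `bucket_counts` helper and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_counts(events, interval):
    """Count events per (interval start, service) for (timestamp, service) pairs."""
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for timestamp, service in events:
        offset = (timestamp - epoch) // interval  # whole intervals since the epoch
        counts[(epoch + offset * interval, service)] += 1
    return counts


events = [
    (datetime(2008, 3, 1, 12, 1, 7), "sshd"),
    (datetime(2008, 3, 1, 12, 3, 9), "sshd"),
    (datetime(2008, 3, 1, 12, 7, 0), "httpd"),
]
counts = bucket_counts(events, timedelta(minutes=5))
```

Changing the `interval` argument corresponds to modifying the time interval of the graph for greater or lesser detail.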

The left-hand pane also has multiple tabs at the top. The default pane contains the event browser and event graph. Alternate panes include search interfaces for cases and articles, a visual representation of the domain taxonomy (a text-based tree component) and the “vital signs” component. The vitals component indicates the proportion of errors recorded in each logfile on the main host and commonly requested system statistics. The vitals component is depicted in Figure 5.
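How an entry is classified as an error is not described, so the sketch below assumes a simple keyword match; the marker strings, log names and the `error_proportions` helper are hypothetical. It computes the proportion of error lines per logfile, sorted with the highest rate first.

```python
def error_proportions(logs, markers=("error", "fail", "crit")):
    """Return [(log name, fraction of lines containing a marker)], highest first."""
    rates = []
    for name, lines in logs.items():
        hits = sum(any(m in line.lower() for m in markers) for line in lines)
        rates.append((name, hits / len(lines) if lines else 0.0))
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


logs = {
    "maillog": ["delivered", "delivery failed: mailbox full", "delivered", "delivered"],
    "error_log": ["PHP Fatal error: out of memory", "client denied by server configuration"],
}
vitals = error_proportions(logs)
```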

When a new case is opened or a case is retrieved from the case base, the case editor/viewer, as in Figure 6, is populated with data in each relevant field either by the database or by the user. The user can populate a case by interacting with the event search in the left-hand pane, including dragging relevant events from the left pane to the right or adding them at the click of a button. As fields are filled, events added and cases linked to the current case, a heterogeneous data structure representing the case is constructed internally. This structure is stored as the user modifies it and is indexed for matching and retrieval.
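One possible shape for such a case structure, and for deriving query terms from it, is sketched below. The paper does not give the schema, so every field name and the `query_terms` method are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Case:
    # All field names are hypothetical; the paper does not list the schema.
    title: str
    description: str = ""
    events: list = field(default_factory=list)           # attached raw log lines
    linked_case_ids: list = field(default_factory=list)  # references to other cases
    article_ids: list = field(default_factory=list)      # references to articles

    def query_terms(self):
        """Collect lower-cased word tokens from the text fields and attached events."""
        text = " ".join([self.title, self.description, *self.events])
        return re.findall(r"[a-z0-9_]+", text.lower())


case = Case(title="PHP upload fails", description="open_basedir restriction in effect")
case.events.append("httpd: PHP Warning: open_basedir restriction")
terms = case.query_terms()
```

The resulting token list could serve as an automatically generated query for matching the case against indexed articles, cases and events.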

5 Software Testing

As our prototype neared completion, we sought software testing feedback from candidate users about the usability and suitability of the software for two real-world system administration tasks reconstructed from experience by staff at Palomino Inc. At Dalhousie University, we obtained feedback on these scenarios from five testers, of which three were permanent IT staff members and two were co-operative education students working as IT helpdesk assistants. At Palomino Inc., feedback was obtained from two professional system administrators outside the company who volunteered their time to test the system.

5.1 Testing Configuration

At each site (Dalhousie University and Palomino Inc.), we set up a server running the NetPal system. We populated the NetPal database with sample cases, knowledge base articles and log data, previously recorded by Palomino Inc. during instances of problem-solving. We provided each tester with a software manual explaining the components of the NetPal system. The testers received scenario outlines describing observed faults which the testers would attempt to ‘diagnose’, given the hypothetical circumstances.

Figure 3: NetPal prototype user interface

Testers were required to consider the fault symptoms, sift the log data using the web interface, build a case using NetPal, accurately identify the root cause of the observed fault and propose a reasonable solution based on experience (articles and cases) where possible. They were not required to fully implement the proposed solution. We asked the testers to describe their experiences with each scenario on a per-component basis and to provide ratings for the usefulness of each component for each scenario.

5.2 Tester Feedback

The components which we asked our testers to rate and on which we asked them to comment included the event list and graph (Figure 4, (a) and (b), respectively), the vital signs component (Figure 5), the case viewer/editor (Figure 6), the taxonomy tree and the knowledge base search.

The event list component (for searching and browsing of aggregated log data) was most highly rated by the administrators and IT staff. The two help desk assistants gave the event listing the lowest rating. The most requested feature for the event listing was that NetPal should expose additional filtering and search features, particularly filtering on severity. The most common complaints were that there was display lag when scrolling or paging in the log event listing window and that for usability the event list should be more easily resized.

The event graph received mixed ratings. Although the majority of our testers agreed that the graph helped to situate the event listing in time and to give a sense of the period of the log data, very few indicated that they used the zooming or panning feature to support their problem solving. The reasons offered for this included the lack of resolution in the graph display and the latency of the graph refresh rate.

The vital signs component was highly rated by all testers. From the perspective of the administrative group, collecting frequently queried system data into a common display was compelling. From the perspective of the helpdesk assistants, the at-a-glance sorted list of error rate per log file was appealing. One administrator requested that this component provide an expanded summary of local and network features.

Figure 4: (a) The NetPal events browser and (b) the events graph

Figure 5: NetPal ‘vital signs’ component.

Figure 6: NetPal case viewer component.

Most testers suggested a middling rating for the case viewer/editor. An experienced administrator requested further automation at this point, suggesting that there be some method to jumpstart the population of a new case’s fields. Another administrator requested minor interface enhancements (e.g. larger font, resizing). The rest of the testers found the viewer/editor serviceable but felt that the testing scenarios provided little opportunity to exercise this feature to the fullest.

We provided a display of the system taxonomy on a separate tab. This was implemented using a simple tree widget displaying the terms in the taxonomy along with their hierarchical relationships. Most of our testers could not relate the taxonomy to their problem-solving task. Two testers said that the taxonomy helped them to find topical synonyms in their search of the knowledge base.

Finally, we provided an interface for querying our database of collected knowledge articles and cases. Here the testers gave mixed ratings; some were able to locate relevant articles and cases and some were not. The direct search interfaces appealed to some users for the power to refine one’s query directly and manually. Most requested that the search results more strongly resemble those of popular search engines.

5.3 Discussion

The events list component and the vital signs component resonated with the administrators and IT staff. These components capture commonly made system queries. Encoding queries directly in the interface succeeds where the queries capture frequently used working experience and make it immediately accessible and replicable. So we can think of one kind of experience as accumulated patterns of system interaction which emanate from the administrator. If we can strongly capture these in the interface, we should enable the administrator to dispense with rote behaviours in their daily work. Several administrators asked for additional filtering criteria (e.g. by severity); this suggests that either careful statistics or standardised semantics for log files would be valuable. One tester suggested including an embedded terminal, opening an opportunity for automatically capturing command sequences for future reference.
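As one illustration of what standardised log semantics would make possible, the sketch below filters events by severity using the syslog priority header, in which the priority value equals facility × 8 + severity. It assumes, hypothetically, that each log line begins with a `<PRI>` prefix; the names `Event`, `parse_event` and `filter_by_severity` are ours and are not part of NetPal.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Syslog severity names, indexed by severity code (0 is the most severe).
SEVERITY_NAMES = ["emerg", "alert", "crit", "err",
                  "warning", "notice", "info", "debug"]

# Assumption: every usable line starts with a "<PRI>" header.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$")


class Event(NamedTuple):
    facility: int
    severity: int
    message: str


def parse_event(line: str) -> Optional[Event]:
    """Decode the priority header; return None if the line has no valid one."""
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # 23 facilities * 8 + 7 is the largest valid value
        return None
    facility, severity = divmod(pri, 8)
    return Event(facility, severity, match.group(2).strip())


def filter_by_severity(lines: Iterable[str], max_severity: int) -> Iterator[Event]:
    """Yield events at least as severe as max_severity (lower code = more severe)."""
    for line in lines:
        event = parse_event(line)
        if event is not None and event.severity <= max_severity:
            yield event
```

For example, `list(filter_by_severity(lines, 3))` would keep only events of severity `err` or worse. Lines without a recognisable header are skipped here, which is one reason unstandardised logs would instead require the statistical approach mentioned above.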

The events list appealed most to senior staff but held little interest for junior testers. It’s unclear exactly how much the students were put off by network latency in the updating of the list of events. Eliminating network delays from the prototype would allow us to answer this question. It may be that a combination of inexperience and frustration prevented them from fully accessing the implicit experience represented by the events list component.

All our testers expressed some approval of the temporal and visual summary communicated by the events graph, yet none of them found navigation via the graph useful in their testing scenarios. One way to improve the navigation might be to use a visualisation which directly expresses a discrete metric of the log data. The graph, as it stands in Figure 4(b), uses a continuous function to represent frequency over time, whereas events in a log file are discrete occurrences. Our testers might be better served by a discrete visualisation which can be drilled down, rather than a continuous one. (This might permit, in the extreme, the events list and the graph to be merged for additional screen real estate.)
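A minimal sketch of the discrete alternative follows: events are counted in fixed-width time buckets, and a single bucket can be drilled down into finer sub-buckets. This is an illustration under our own assumptions rather than the prototype's implementation; the function names `bucket_counts` and `drill_down` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def bucket_counts(timestamps: Iterable[datetime], start: datetime,
                  width: timedelta, n_buckets: int) -> List[int]:
    """Count events per fixed-width bucket; events outside the range are ignored."""
    counts = [0] * n_buckets
    for ts in timestamps:
        index = (ts - start) // width  # timedelta // timedelta gives an int
        if 0 <= index < n_buckets:
            counts[index] += 1
    return counts


def drill_down(timestamps: Iterable[datetime], start: datetime,
               width: timedelta, index: int, parts: int) -> List[int]:
    """Re-count the events of one bucket in `parts` equal sub-buckets."""
    sub_start = start + index * width
    return bucket_counts(timestamps, sub_start, width / parts, parts)


if __name__ == "__main__":
    day = datetime(2008, 1, 1)
    events = [day + timedelta(minutes=m) for m in (5, 20, 50, 70, 180)]
    hour = timedelta(hours=1)
    print(bucket_counts(events, day, hour, 3))   # hourly counts
    print(drill_down(events, day, hour, 0, 4))   # first hour, by quarter hour
```

Because each bar corresponds to an explicit set of events, selecting a bar could list exactly those events, which is what would allow the events list and the graph to share one view.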

For our test scenarios, in addition to the set of cases we crafted manually, we treated our article database as equivalent to cases for testing purposes. But, while the articles with which we bootstrapped the database are in the same problem domain (that is, web hosting), they are only an approximation of experience cases. One administrator commented that many of the articles in the knowledge base were related to initialisation or permanent configuration problems whereas the testing scenarios related more to transient problems. Additionally, the number of articles we used to bootstrap the initial database was small, on the order of a few thousand; this may have limited the applicability of the case base for our test scenarios.

6 Conclusion

We have described the components, development and testing of a prototype network administration information system, NetPal, which is a synthesis of information retrieval, human-computer interaction, event analysis and experience retrieval. We outlined the significant user interface components, described the inner workings of the retrieval system and our model for acquiring and processing log data for consumption in the main system. We obtained feedback from testers which will aid us in selecting future directions for the project.

Our prototype faces hurdles, yet we have established a footing on which to base future development and research. Future development on NetPal will focus on automatic experience collection for the case database; improving the human factors associated with a dynamic web application; applying more sophisticated data mining techniques to our acquired event data, particularly examining the apriori-like algorithm of Vaarandi [17] as a clustering model and enabling enhanced visualisations; and fine-tuning our search engine for optimum retrieval.
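To indicate the kind of clustering under consideration, the following is a heavily simplified sketch of a frequent-word approach to grouping log lines. It is not Vaarandi's algorithm as published: it assumes whitespace tokenisation, treats a word as frequent when the same (position, word) pair occurs in at least `support` lines, and replaces all other words with a wildcard. The function name `cluster_log_lines` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def cluster_log_lines(lines: List[str], support: int) -> Dict[str, int]:
    """Group lines by templates built from frequent (position, word) pairs."""
    tokenised = [line.split() for line in lines]

    # Pass 1: count how many lines contain each (position, word) pair.
    pair_counts = Counter(
        (pos, word) for words in tokenised for pos, word in enumerate(words)
    )
    frequent = {pair for pair, count in pair_counts.items() if count >= support}

    # Pass 2: build one candidate template per line, masking infrequent words.
    candidates: Counter = Counter()
    for words in tokenised:
        template = tuple(
            word if (pos, word) in frequent else "*"
            for pos, word in enumerate(words)
        )
        candidates[template] += 1

    # Keep only templates that are themselves supported by enough lines.
    return {
        " ".join(template): count
        for template, count in candidates.items()
        if count >= support
    }
```

Lines whose template falls below the support threshold are left unclustered; in an event analysis setting these outliers may be the more interesting lines for an administrator to inspect.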

Acknowledgements

We acknowledge the financial support of Precarn Inc. and the Natural Sciences and Engineering Research Council of Canada.

Ashley George received his BCS and MCS in 2003 and 2006 respectively from Dalhousie University and is now pursuing his Ph.D. His research interests focus on machine learning, document mining, network and system security and philosophy of science.

Adetokunbo Makanju obtained his B.Sc. from the Department of Computer Science at University of Lagos, Nigeria and his M.CS. in 2008 from Dalhousie University. He is presently studying for his Ph.D. with the Faculty of Computer Science at Dalhousie University. His research interests are in the areas of wireless networks, intrusion detection, genetic programming and case-based reasoning.

Evangelos Milios is Professor and Killam Chair of Computer Science with the Faculty of Computer Science at Dalhousie University. He received a diploma in Electrical Engineering from the National Technical University of Athens, Greece, in 1980 and M.Sc. and Ph.D. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology, Cambridge, Massachusetts in 1986. At Dalhousie, he served as Director of the Graduate Program (1999-2002) and he is currently Associate Dean, Research. His current research activity is centered on modelling and mining of content and link structure of Networked Information Spaces.

Nur Zincir-Heywood is Associate Professor in the Faculty of Computer Science at Dalhousie University, NS, Canada. She obtained her B.Sc., M.Sc. and Ph.D. in Computer Engineering from Ege University, Turkey in 1991, 1993 and 1998 respectively. Her research interests include network services and management, network information retrieval, and the effects of the Internet and information technologies on socio-economic development.

Markus Latzel is President & CEO of Palomino System Innovations Inc, a successful university spin-off in its seventh successful year. He received his M.Eng. in Computer Science and Electrical Engineering from University of Paderborn, Germany in 1999. His research and publications are in areas of intelligent systems, robotic systems, knowledge management, hierarchical databases, HCI and computer vision.

Sotirios Stergiopoulos received his B.Sc. in Computer Science from York University, Toronto, ON, Canada. He was previously with MacDonald Dettwiler and Array Systems Computing. He is currently Lead Application Developer for Palomino System Innovations, Inc., Toronto.

References

[1] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[2] Ralph Bergmann. Experience Management: Foundations, Development Methodology, and Internet-Based Applications. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002.

[3] Valeria Cardellini, Emiliano Casalicchio, Michele Colajanni, and Philip S. Yu. The state of the art in locally distributed web-server systems. ACM Comput. Surv., 34(2):263–311, 2002.

[4] R.N. Cronk, P.H. Callahan, and L. Bernstein. Rule-based expert systems for network management and operations: an introduction. Network, IEEE, 2(5):7–21, Sep 1988.

[5] David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics, Second Edition. Springer, Dordrecht, The Netherlands, 2004.

[6] H. Inamura, O. Takahashi, T. Ishikawa, H. Shigeno, and K. Okada. Automating detection of faults in TCP implementations. Advanced Information Networking and Applications, 2004. AINA 2004. 18th International Conference on, 1:315–320 Vol.1, 2004.

[7] L. Lewis. A case-based reasoning approach to the management of faults in communication networks. INFOCOM ’93. Proceedings. Twelfth Annual Joint Conference of the IEEE Computer and Communications Societies. Networking: Foundation for the Future. IEEE, pages 1422–1429 vol.3, 1993.

[8] Mirjam Minor. Erfahrungsmanagement mit fallbasierten Assistenzsystemen (Experience Management with Case-based assistant systems). PhD thesis, 2006.

[9] Mirjam Minor and Christina Biermann. Case acquisition and semantic cross-linking for case-based experience management systems. In Du Zhang, Taghi M. Khoshgoftaar, and Mei-Ling Shyu, editors, IRI, pages 433–438. IEEE Systems, Man, and Cybernetics Society, 2005.

[10] G. Penido, J.M. Nogueira, and C. Machado. An automatic fault diagnosis and correction system for telecommunications management. Integrated Network Management, 1999. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on, pages 777–791, 1999.

[11] P. Ramnarayan, N. Cronje, R. Brown, R. Negus, B. Coode, P. Moss, T. Hassan, W. Hamer, and J. Britto. Validation of a diagnostic reminder system in emergency medicine: a multi-centre study. Emergency Medicine Journal, pages 619–624, 2007.

[12] P. Ramnarayan, G. Kulkarni, A. Tomlinson, and J. Britto. Isabel: a novel internet-delivered clinical decision support system. Current Perspectives in Healthcare Computing, pages 245–246, 2004.

[13] P. Ramnarayan, A. Winrow, M. Coren, V. Nanduri, R. Buchdahl, B. Jacobs, H. Fisher, P.M. Taylor, J.C. Wyatt, and J. Britto. Diagnostic omission errors in acute paediatric practice: impact of a reminder system on decision-making. Medical Informatics and Decision Making, 2006.

[14] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.

[15] J. Stearley. Towards informatic analysis of syslogs. Cluster Computing, 2004 IEEE International Conference on, pages 309–318, 20-23 Sept. 2004.

[16] Neal J. Thomas, Padmanabhan Ramnarayan, Michael J. Bell, Prabhat Maheshwari, Shaun Wilson, Emily B. Nazarian, Lorri M. Phipps, David C. Stockwell, Michael Engel, Frank A. Maffei, Harish G. Vyas, and Joseph Britto. An international assessment of a web-based diagnostic tool in critically ill children. Technology and Health Care, pages 103–110, 2008.

[17] R. Vaarandi. A data clustering algorithm for mining patterns from event logs. IP Operations and Management, 2003. (IPOM 2003). 3rd IEEE Workshop on, pages 119–126, 1-3 Oct. 2003.
