Whole-Product Testing Guidelines...Bare-Metal vs. Virtualized Environment Virtualization is one particularly important consideration in the platform choice. Some products may behave

Whole-ProductTestingGuidelines

Copyright©2016Anti-MalwareTestingStandardsOrganization,Inc.Allrightsreserved.Nopartofthisdocumentmaybereproducedinanyform,inanelectronicretrievalsystemorotherwise,withouttheprior

writtenconsentofthepublisher.

2

NoticeandDisclaimerofLiabilityConcerningtheUseofAMTSODocuments

ThisdocumentispublishedwiththeunderstandingthatAMTSOmembersaresupplyingthisinformationforgeneraleducationalpurposesonly.Noprofessionalengineeringoranyotherprofessionalservicesoradvice isbeingofferedhereby. Therefore,youmustuseyourownskillandjudgmentwhenreviewingthisdocumentandnotsolelyrelyontheinformationprovidedherein.

AMTSObelievesthattheinformationinthisdocumentisaccurateasofthedateofpublicationalthoughithasnotverifieditsaccuracyordeterminedifthereareanyerrors.Further,suchinformationissubjecttochangewithoutnoticeandAMTSOisundernoobligationtoprovideanyupdatesorcorrections.

Youunderstandandagreethat thisdocument isprovidedtoyouexclusivelyonanas-isbasiswithoutanyrepresentationsorwarrantiesofanykindwhetherexpress, impliedorstatutory. Without limitingthe foregoing, AMTSO expressly disclaims all warranties of merchantability, non-infringement,continuousoperation,completeness,quality,accuracyandfitnessforaparticularpurpose.

InnoeventshallAMTSObeliableforanydamagesorlossesofanykind(including,withoutlimitation,any lost profits, lost data or business interruption) arising directly or indirectly out of any use of thisdocument including, without limitation, any direct, indirect, special, incidental, consequential,exemplary and punitive damages regardless of whether any person or entity was advised of thepossibilityofsuchdamages.

Thisdocument isprotectedbyAMTSO’s intellectualpropertyrightsandmaybeadditionallyprotectedbytheintellectualpropertyrightsofothers.



3

Whole-ProductTestingGuidelines(Protection)

Introduction

Thisdocumentdiscussestheissuesinvolvedinwholeproducttesting.Thedocumentoutlinesadditionalissuesinvolvedinbestpracticetestingofsuchproducts,aboveandbeyondotherAMTSOguidelinesandbestpractices.Thisdocumentisnotacomprehensivelistingofallsuchissues.

Unless otherwise defined herein, all terms included in this document are used with their commonmeaning.ThefollowingdocumentshouldbereadinconjunctionwithAMTSO’sFundamentalPrinciplesofTestingandotherinformationavailableatwww.amtso.org.

AMTSOdocumentsarebest read inconjunctionwith theotherdocumentson theAMTSOdocumentspage, including Fundamental Principles of Testing, Best Practices for Testing In-the-Cloud SecurityProductsandBestPracticesforDynamicTesting.

Testing of detection rates alone has been a primary focus of many anti-malware tests. This is notunreasonable: after all, threat detection and/or blocking is a core function of most content securityproductswithin AMTSO’s sphere of interest. However, given the large number of new and dynamicthreats facing end users, security companies have developed additional detection and protectioncapabilities to supplement traditional on-demanddetectionmethods. Toprovidea realisticpictureoftoday’s security products, testersmust expand their testing approach to include all relevant productcapabilitiesandsubjecttheseproductstoatestsetthatcorrelateswiththerealthreatsfacingusers.

Thispaperwillfocusonabalancedlookatproducteffectivenessincludingdetection(andfalsepositivetestingagainstnon-maliciousprograms)inrealistictestscenariosthatsimulatetheexperiencesofrealusers.

Thispaperwillnot focusonusabilityandperformance (asdescribed inAMTSO’sPerformanceTestingGuidelines)althoughsuchtestscouldrelyonthesamebestpractices.

Anti-malwareproductsaresophisticatedapplications,andtestingasecurityproductinawaythatfullyand accurately evaluates its capabilities is a challenging process, requiring testing methodology thatallows for the assessment of multiple technologies and modules which are integrated into a singlepackagesoas toprotectauseragainstmalware.Endpointsecurityproductsaresimplydifferent thantheyusedtobe–theyhaveblendedcountermeasures,whichworktogethertoprotecttheuser.

Forwholeproducttesting,givenitsadditionalcomplexity, it’sespecially importantforthetestertobeclear onwhat kindsof sampleswere tested, howproductswere setup and interactedwith, andhoweffectivenesswasmeasured.



4

WhatConstitutesWhole-ProductTesting

Testersmightconsidertwomethodstoevaluatethewholeproduct:

• Whole-ProductTesting

• Sum-of-the-PartsTesting

AMTSOrecommendsWhole-ProductTesting (rather thanSum-of-the-PartsTesting)asbeing themostrepresentativeofrealworldeffectiveness.

Whole-ProductTestingbestmimicsaproduct’suseintherealworld.Testersarestronglyrecommendedtousethemostcommon(e.g.default)settingsandthensubjecttheproducttoastatisticallyrelevantnumberofreal-worldthreats.Theproductisevaluatedonhowwelltheuserisprotected.Theproductistreatedasawholeratherthanasumofitsparts.Theprimaryfocusisonwhethertheuserisprotectedagainst the threat, but it’s also instructive for the tester to ascertain which specific capability orcapabilitieswereusedtoprotecttheuser.Theultimateeffectivenessisjudgedbylookingatthethreat’simpactonthesystemwithandwithouttheproduct.

Sum-of-the-Parts Testing tries isolating a specific product capability e.g. on-demand scanning or URLfiltering, a tester needn’t guess on how a product blocked a threat. By combining the results frommultiple test parts, a tester can form some conclusions about the product’s full capabilities but theywouldbesubjectiveduetotheassumptionsmadebythetester.Thedrawbackofthisapproachisthatproduct capabilities oftenwork together to stop a given threat and this interaction cannot be shownthrough sum-of-the-parts testing. Another disadvantage is that even if one of the product parts wassuccessfulinstoppingpartofthethreatasmeasuredbythetester,otherpartsofthethreatmayhavebypassedtheprotectionand infectedthecomputer (thiscanbeaddressedbymonitoringorpassivelyassessingtheimpactonthesystem).

FacetsofWhole-ProductTesting

Testersshouldconsiderthefollowingwhenconductingafullproducttest:

• StatingtheTestPurpose

• SelectingSamples

• SettingupTestsandProducts

• IntroducingSamples

• HandlingUserInteraction

• CapturingTestResults

• InterpretingTestResults



5

StatingtheTestPurpose

Duetothemultitudeoftest’sobjectivesitishighlyrecommendedthatthepurposeofthetestisstated(settleddown)before thenextstagesare taken inplanning the test.Asanexample,aproduct test isdoneinordertocheckproductsabilitiestoprotectagainst“allexisting”,“newest”or“determined”typeof threats. In practice the purpose of the tests is such a matter of course. The results are mostlyinterpreted as being theultimate judgment for thequality of product against all existing andnot yetexistingthreats.

SettingUpEnvironmentsandProducts

Theunderlyingtestplatforms,componentsandfeatureschosen,andsettingsusedwilldependonthetestobjective,butinallcases,thetestershouldstriveforalevelplayingfield.Thespecifictestplatform,components, features used and settings chosen should be published in the final report or appendixmaterial.

First,thetestermustchoosethetestenvironment(hardware,operatingsystem,browserversion,patchlevel,applications,etc.)tobeused.Thesechoiceswillmostcertainlyimpacttheworkingsoftheproductandultimatelytheresults.

Then,oncetheselectionofcomponentsismade,choicesforproductconfigurationinclude:

• Default (or the most common) configuration (perhaps appropriate if testing real consumerexperience) – note that some products have different default settings based on underlyinghardware.

• Testerexercisestheir judgment inchoosingenabledfeaturesandsettingsused(e.g.enableallcountermeasures configured for maximum protection, use a vendor-recommendedconfiguration, or choose a different configuration that might offer less protection but mayprovide a more convenient user experience). This approach may be more applicable formanagedenvironments(likecorporateproducts).

• Testers may consult the vendors to determine the appropriate configuration for a specificWhole-Producttest.

The choice of features and settings may also have a significant impact on false positives andperformanceresults.Thetestershouldusethesamefeaturesandsettingsfordetection,falsepositive,and performance tests to provide readers a realistic view of what they can expect. In any case themethodologyapplied(chosenuseractionchain,enabledproductfeatures)requiresproperanddetaileddocumentation.Thiswouldavoidmisunderstandingsandcomplaintsregardingtheobjectivityofatest.

Bare-Metalvs.VirtualizedEnvironment

Virtualization is one particularly important consideration in the platform choice. Some productsmaybehavedifferently ina virtualizedenvironment.Also,wholeproduct testing includes theexecutionofthemalware,andsomemalwaremaybehavedifferentlyinavirtualizedenvironment.



6

If the goal is to test effectiveness in a bare-metal environment, testers are encouraged to considertestingonbare-metaltestmachines.However,virtualenvironmentsareincreasinglyimportantalsoandsotestersmustmakeathoughtfulchoice.

Forexample:

• Itmayterminate

• Itmayremaininmemorybutinactive

• It may continue to execute, but exhibit only behavior that won’t trip detection by behavioranalysis(whetherstaticordynamic)

• Itmaytrytoexploitanyvulnerabilities inthevirtualizedenvironment, for instancetoperformsomemaliciousactiononthehostmachine,notjustinthevirtualenvironment

It’s important to understand both the differences introduced by the virtual environment and theindividualbehaviorofthechosensamplesinavirtualenvironment.

SelectingSamples

A tester should evaluateproducts using valid, realistic and relevant threats from the field as this is agood indication of whether a product’s capabilities are indeed relevant to and effective against thethreatsfacingrealusers.Withwhole-producttesting,atestermustconsiderdiversitynotonlyintypesofattacksandfamiliesofmalwarebutalsothesourceoftheattacksasallofthesefactorscanimpactaproduct’sabilitytostopathreat.

Asalways,abalancebetweenmaliciousandcleantestsamples/scenariosmustbemaintained.

IntroducingSamples

Thesamplesshouldbeintroducedviatheirnaturalpropagationvectors(e.g.,threatscomingviaEmailshouldbeintroducedasEmails).Twoapproachescouldbeused:

• Createatestingenvironmentbasedon“replay”(thathelpsreproducibilityandrepeatabilityofthetests).LiveInternetconnectionoftheproductsshouldnotbealterednorinterferedwith.

• Use liveattacks (this isusuallynot repeatablebut still canbeverifiableprovidedenough logsandrecordsarekept).

In any case care should be taken to ensure fairness with respect to the timing of tests of individualproducts.

EntryVectors

Thelistbelowprovidesanumberofcommonentryvectors:

• The user voluntarily, possibly through a social engineering attack, downloads and potentiallyexecutesafilefromtheWeb



7

• Theuserinvoluntarilydownloadsathreat(drive-bydownload)fromtheWeb

• Theuseropensareale-mail,clicksonalink,anddownloadsandpotentiallyexecutesafile

• Theuseropensane-mailandopensandpotentiallyexecutesanattachment

• The user receives an instant message and either clicks a link, opens an attachment, andpotentiallyexecutesafile

• TheuseraccessesafilefromaUSBstickorotherremovablemediaeithervoluntarilyorviaauto-run

• Theuseraccessesvoluntarilyafilefromanetworkshare

• Theuser’scomputerisinfectedwithafilefromthenetworkinvoluntarily(worms)

• The user receives and potentially executes a file from a P2P network, newsgroup, or otherapplication.

Theselectionanddistributionofvectorsshouldreflectthestatedpurposeofthetest.

HandlingUserInteraction

In handling user interaction, testers should follow the guidelines laid out in the “Popups and UserInteractions”sectioninAMTSO’sBestPracticesforDynamicTesting.

Thefollowinguserinteractionmodelsarepossibletodescribehowauserwillinteractwhenaproductpresentsadialogboxorother interfacecomponents. It’s important tonote that theuser interactionswill depend on choices made during product setup or by policies deployed to the product prior totesting:

• NoAction–theusertakesnoaction

• Default–simulatebypressingEnter

• Safestaction–blockeverything(couldleadtomorefalsepositives)

• Worstaction–alloweverything

• Randomaction– if this is chosen, testermust includeastatistically significantnumberof testcases

• Predefinedaction(e.g.alwaysusethetopbuttonortheleftbutton)–ifthisischosenthetestermustdocumentthereasoningbehindtheirchoices

• Recommendedaction(maynotbetheDefaultone)–ifthisischosenthetestermustdocumentthereasoningbehindtheirchoices

• “Survey-based” action – if this is chosen the testermust document the reasoning behind thechoices.

Notethatproductmaypresentbothmodalandnon-modalwindowsandbothmustbeaddressed.



8

Testingalltheoptionsistheidealcasebutpracticalconsiderationswilltypicallynecessitatechoosingauserinteractionmodel.

Thechoiceofuser interactionwilldependonthefocusofthetestingandwhothetesterenvisionsasthetargetuser.Forenterpriseenvironments,thetestermayleantowardsmorerestrictiveapproachesreflectingthechoicesanITadministratormightmake,andtheend-userexperiencemightbemoresilentgiven choices already made by an administrator. Consumer-focused whole product testing mightsuggest amore permissive user interactionmodelwithmore dialogs butmay not. Although randomactions may be realistic for complex human beings, it is not suggested because it may not allowreproducibleandcredibleresults.

Nomatterwhichuserinteractionmodelischosen,itmustbeappliedconsistentlytoallproductstested.

For a given test, testersmight also wish to count the number of times the user is asked tomake adecision.Thisdatacanthenbeusedalongwitheffectivenessresultstodrawconclusionsonthefulluserexperience.It’sgenerallyagreedthateffectivenessbeingequal,feweruserinteractionsispreferred(nointeractionsbeingthebest).

CapturingTestResults

Firstandforemost,testersshouldhavetheirownverificationsystemratherthanrelyingonlyonvendorproductlogs(itisrecommendedthatproductsgeneratereliableandusablelogs).Testersshouldkeepinmind that product logs/reports can be incomplete or incorrect. The tester can use monitoring tools(includingnetwork)orpassivetechniquesthatcompareasystem’sstatebeforeandafterexposuretothemalware.Logscanbeusedforanecdotalinformationperhapstodeterminehowaproductblockedathreat.

In practical terms, it’s easiest to look for changes in the system rather than looking forwhether themalwareisstillactive,whichcanbemoredifficulttodetermine.

Note:Sometimesitmaybenecessarytoreboottoensureexecution.

InterpretingTestResults

Successcanbedefined indifferentways.Successmaybedefinedas thesecurityproduct renderingamalwaresampleunabletoexecute.Orsuccessmightrequirethatthesystemisbroughtbacktoitspre-infectionstate.Testersmustdefinesuccessclearlyforatestandapplythesamecriteriatoallproducts.

Interpreting test results related to detection and/or blocking can be complex due to the difficulty indetermining whether the malware was still able to cause harm to the system or the user after thecountermeasureshavebeenapplied.

In most cases, testers can treat products as “black boxes” in that it doesn’t matter what productcapabilitywassuccessfulinblockingathreat,onlythatthemaliciousactionofthethreatwasblocked.Toprovideadditionalgranularitytothetestresults,testerscanattempttogaindeeperinsightintohowaproductblockeda threat.Thiscanbedonebyaccessingproduct logsor, in somecases,bybuildingadditionaldetectionsystems.



9

Counting the number of detections and misses just once on a local system may not reflect all theproduct’s protection capabilities. For example, cloud-based security products use external knowledgeandmaywellbecapableofblockingallbuttheveryfirstfewsightingsofanyspecificthreat.ForAMTSOguidelines on testing cloud-based products please refer to AMTSO’sBest Practices for Testing In-the-CloudSecurityProducts.Anotherexamplecouldbeaproductthatwouldgaintheprotectionknowledgeinternally, for example by applying artificial intelligence methods. Similarly, a product may losedetectionsovertime.Forthesereasonsitisstronglyrecommendedtotrackthedetectionoveraperiodoftime,sufficienttocovertheentiretimespanoftheattack.Thetestersshouldalsobeawareofthefactthattestingcloud-basedproductsagainstlowprevalencethreatsmaytriggerasecurityresponse.

In all cases, testers should document test methodology, product settings, and other informationnecessary to interpret the results. Focus on the principles and guidelines as published by AMTSO atwww.amtso.org/documents.

UserImpact

Finally,testersmayattempttoassessthedamagecausedbythesampletothesystem(e.g.,akeyloggersent private data back to a server). In someof these cases the effects cannot beundoneeven if themalwareiseventuallycleanedup.Testersmayalsogiveconsiderationtowhathappenswithuserdata(likebankaccountdata,emailorotherpersonaldata).

______________________________________________________________________________

ThisdocumentwasadoptedbyAMTSOonMay25,2010