Pivotal Greenplum ® Text Version 3.1.0 User Guide Rev: 01 © 2018 Pivotal Software, Inc.


  • Table of Contents

    Table of Contents
    Pivotal® Greenplum® Text 3.1.0 Documentation
    Pivotal® GPText 3.1.0 Release Notes
    Installing GPText
    Upgrading GPText
    Introduction to Pivotal GPText
    Administering GPText
    GPText High Availability
    GPText Best Practices
    Troubleshooting Hadoop Connection Problems
    Working With GPText Indexes
    Querying GPText Indexes
    Customizing GPText Indexes
    Working With GPText External Indexes
    Using Named Entity Recognition with GPText
    GPText Function Reference
    GPText Management Utilities
    GPText and Solr Data Type Mappings
    GPText Schema Tables
    GPText Configuration Parameters

    © Copyright Pivotal Software, Inc, 2013-2018

  • Pivotal® Greenplum® Text 3.1.0 Documentation

    GPText Documentation PDF (http://docs-gptext-staging.cfapps.io/archives/GPText-docs-310.pdf)

    Pivotal GPText 3.1.0 Release Notes

    Installing Pivotal GPText

    Upgrading Pivotal GPText

    Using Pivotal GPText (http://docs-gptext-staging.cfapps.io/310/topics/)

    GPText References (http://docs-gptext-staging.cfapps.io/310/topics/FuncRef_preface.html)

    Additional Resources

    Pivotal Greenplum Database (http://gpdb.docs.pivotal.io)

    Apache Solr Web Site (http://lucene.apache.org/solr/)

    Apache MADlib (http://madlib.apache.org/)

  • Pivotal® GPText 3.1.0 Release Notes

    This document contains release information for Pivotal GPText 3.1.0.

    Released: September 2018

    About Pivotal GPText

    Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.

    GPText includes the following features:

    The GPText database schema provides in-database access to Apache Solr indexing and searching

    Build indexes with database data or external documents and search with the GPText API

    Custom tokenizers for international text and social media text

    A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors

    Faceted search results

    Term highlighting in results

    Natural language processing, including part-of-speech tagging and named entity extraction

    Greater emphasis on high availability

    The GPText management utility suite includes command-line utilities to perform the following tasks:

    Start, stop, and monitor ZooKeeper and GPText nodes

    Configure GPText nodes and indexes

    Add and delete replicas for index shards

    Back up and restore GPText indexes

    Recover a GPText node

    Expand the GPText cluster by adding GPText nodes

    Prerequisites

    Installing GPText also installs Apache SolrCloud and, optionally, Apache ZooKeeper.

    Following are GPText installation prerequisites.

    GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.

    Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io .

    Install Java JRE 1.8.x and add the bin directory to the PATH on all hosts in the cluster. GPText is tested with Oracle Java 1.8 and OpenJDK 1.8.

    Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).

    Installing lsof on all cluster hosts is recommended (sudo yum install lsof).

    GPText cannot be installed onto a shared NFS mount.

    GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.

    If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).

    Apache Solr requires a ZooKeeper cluster with a minimum of three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.

    New Features and Enhancements in GPText 3.1.0

    The GPText 3.1.0 release provides the following features and enhancements.

    Improvements to aid in developing and testing analyzer chains

    The new gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.

    The new gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.

    The new gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.

    Part-of-speech tagging and named entity recognition

    GPText includes OpenNLP libraries and analyzer classes to classify indexed terms’ parts-of-speech (POS), and to recognize named entities, such as the names of persons, locations, and organizations (NER). GPText saves NER terms in the field’s terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.

    The new gptext.ner_terms() function lists NER-tagged terms for documents that match a query.

    GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP web site and use them with GPText.

    Other enhancements and fixes

    The first argument of the gptext.terms() function, an anytable data type, has been made optional.

    Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.

    Apache Solr updated to Solr version 7.3

    GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.

    Apache Solr 7.3 Upgrade Notes (https://lucene.apache.org/solr/guide/7_3/solr-upgrade-notes.html)

    Apache Solr 7.3 Release Highlights (https://wiki.apache.org/solr/ReleaseNote73)

    Following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.

    GPText server-side components are rebuilt and tested with the new Solr JAR files.

    The managed-schema, solrconfig.xml, and other collection configuration files are updated.

    The top-level element in solrconfig.xml is now officially deprecated in favor of the equivalent syntax. This element has been out of use in default Solr installations for several releases already.

    The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json, that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. To revert to the old behavior, set the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:

    $ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true

    With earlier Solr releases, if you drop an index while a Solr node with a replica of the index is down, when the down node comes back on-line, the index comes back and cannot be deleted. Solr 7 fixes this bug. The GPText workaround for this bug is removed.

    PointFields are default numeric types. Solr has implemented *PointField types across the board, to replace Trie* based numeric fields. All Trie* fields are now considered deprecated, and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.

    The following spatial-related fields have been deprecated:

    LatLonType

    GeoHashField

    FieldType

    SpatialTermQueryPrefixTreeFieldType

    Use one of these field types instead:

    LatLonPointSpatialField

    SpatialRecursivePrefixTreeField

    RptWithGeometrySpatialField

    To improve parameter consistency in the Collections API, the parameter name fromNode for the MOVEREPLICA command and the parameter names source and target for the REPLACENODE command have been deprecated and replaced with sourceNode and targetNode. The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.

    The replica core name has changed from _shard#_replica# to _shard#_replica_n# . For example, test_shard0_replica1 becomes test_shard0_replica_n3 .
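    Scripts that parse replica core names may need to accept both forms during an upgrade. The following Python sketch is illustrative only: the regular expression is inferred from the single example above (test_shard0_replica1 and test_shard0_replica_n3), and the function name parse_core_name is hypothetical, not part of GPText or Solr.

```python
import re

# Pattern inferred from the example names in the release notes; it accepts
# both the old "_replica<N>" suffix and the new "_replica_n<N>" suffix.
CORE_NAME = re.compile(
    r"^(?P<index>.+)_shard(?P<shard>\d+)_replica(?:_n)?(?P<replica>\d+)$"
)


def parse_core_name(name):
    """Split a replica core name into (index, shard number, replica number)."""
    match = CORE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized core name: {name!r}")
    return match.group("index"), int(match.group("shard")), int(match.group("replica"))
```

    Note that the replica number is not preserved across the rename (replica1 became replica_n3 in the example), so the parsed number should be treated as an identifier only.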

    New Features and Enhancements in GPText 3.0.0

    GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type for the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.

    The gptext-state utility with the --index (-i) option now includes the date and time the GPText index was last modified.

    New Features and Enhancements in GPText 2.4.0

    GPText 2.4.0 allows adding documents stored in an authenticated FTP server to a GPText external index. This enhancement includes changes to add support for the ftp type to the gptext-external upload command-line utility and the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.

    New Features and Enhancements in GPText 2.3.1

    The gptext-backup command-line utility can now back up GPText indexes to local GPText cluster storage as well as a directory on a shared drive. For local backups, backup metadata and the index configuration files are backed up to the Greenplum Database master data directory and index shards are backed up in the segment data directories on each host.

    The gptext-backup utility has a new option to back up just the index configuration files from ZooKeeper, with no index data.

    The gptext-restore utility is updated to restore backups created on local cluster storage.

    The gptext-restore utility has a new option to restore only the configuration files from a backup. This option loads the configuration files into ZooKeeper and creates an empty GPText index.

    New Features and Enhancements in GPText 2.3.0

    Revised gptext-config Utility Syntax

    The gptext-config command-line utility was revised to have a more user-friendly syntax.

    A new list subcommand was added to gptext-config that you can use to list all of the configuration files for a specified GPText index.

    $ gptext-config list -i

    Index Documents in a Hadoop File System (hdfs) Document Source

    GPText 2.3.0 enables you to add documents stored in an hdfs system to a GPText external index.

    The new gptext-external command-line utility uploads Hadoop configuration and authentication files to a named configuration in ZooKeeper. The utility has subcommands upload, list, and delete to manage the configurations you have uploaded.

    The new gptext.external_login() function logs into the hdfs system using the named configuration you have uploaded. You can log in to only one external document source at a time.

    Use URLs of the form hdfs:// with the gptext.index() and gptext.index_external() functions to add documents to a GPText external index.

    Use the new gptext.index_external_dir() function to add all documents in an hdfs directory to a GPText external index.

    Log out of the hdfs external document source with the new gptext.external_logout() function.

    See Authenticating with an External Document Source for steps to enable access to an hdfs document source.

    Known Issues

    Following are known issues in GPText. Workarounds are provided when available.

    Wildcards in GPText Search Options

    Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names. For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) correctly returns contenta and contentb, but omits the pseudo-field sum(1,1).

    Specifying a wildcard to match all fields ( fl=*,sum(1,1) ) also omits the pseudo-field.

    Index Load Failure After Configuration File Error

    If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit managed-schema or solrconfig.xml and introduce an XML syntax error or a typo in configuration values.

    Workaround:

    1. When an index fails to load, check the Solr log to find the cause.

    2. If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.

    3. If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.

    Startup Failure with Large Numbers of Indexes

    When there is a large number of Solr cores, SolrCloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.

    Setting GPText Configuration Parameters Without First Setting custom_variable_classes

    If the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:

    mydb-# set gptext.replication_factor=4;
    WARNING: Please logon again to make GUC setting take effect.(GucValue.h:301)
    WARNING: Please logon again to make GUC setting take effect.(GucValue.h:301)
    ERROR: unrecognized configuration parameter "gptext.replication_factor"

    In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.

    mydb-# show gptext.replication_factor;
     gptext.replication_factor
    ----------------------------
     0

    Beginning with GPText 2.1, the error message is still generated; however, the value saved in ZooKeeper is the value specified in the set command, 4 in the preceding example.

    To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:

    $ gpconfig -c custom_variable_classes -v 'gptext'

  • Installing GPText

    Prerequisites

    The GPText installation includes the installation of Apache SolrCloud and, optionally, Apache ZooKeeper.

    If you are installing a new GPText release into an existing GPText system, follow the instructions in Upgrading GPText instead.

    Following are GPText installation prerequisites.

    Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io .

    GPText runs on Red Hat Enterprise Linux or CentOS 5.x, 6.x, or 7.x.

    GPText cannot be installed onto a shared NFS mount.

    Install a JRE 1.8.x on all hosts in the cluster.

    Ensure that nc (netcat) is installed on all Greenplum cluster hosts (yum install nc).

    Installing lsof on all cluster hosts is recommended (sudo yum install lsof).

    GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.

    If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).
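    The memory reservation described above is plain arithmetic, sketched here in Python. The function name and the sample figures (64 GB of RAM, two GPText nodes per host, a 2048 MB JVM maximum) are assumptions chosen for illustration. The sketch only computes the RAM left over after the GPText reservation; the gp_vmem_protect_limit formula itself is in the Greenplum Database Reference Guide and is not reproduced here.

```python
def ram_after_gptext_mb(physical_ram_mb, gptext_nodes_per_host, jvm_max_mb):
    """Return the RAM (in MB) left for Greenplum after reserving GPText memory.

    The reservation is (GPText nodes per host) x (JVM maximum size), as
    described in the prerequisites.
    """
    reserved_mb = gptext_nodes_per_host * jvm_max_mb
    if reserved_mb >= physical_ram_mb:
        raise ValueError("GPText reservation meets or exceeds physical RAM")
    return physical_ram_mb - reserved_mb


if __name__ == "__main__":
    # Hypothetical host: 64 GB RAM, two GPText nodes, -Xmx2048M each.
    print(ram_after_gptext_mb(65536, 2, 2048))
```

    The result is the value to use in place of physical RAM when working through the recommended gp_vmem_protect_limit calculation.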

    Apache Solr requires a ZooKeeper cluster with a minimum of three nodes. You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster with at least three nodes (five nodes recommended) on separate hosts with network connectivity to the Greenplum network.

    Install the GPText Binary Distribution

    1. On the Greenplum master host, extract the GPText distribution file. For example:

    $ cd /home/gpadmin
    $ tar xvfz greenplum-text--.tar.gz

    This extracts two files in the current directory: gptext_install_config and the GPText installation binary, which has a name in the format greenplum-text--.bin .

    2. If necessary, grant execute permission to the GPText binary. For example:

    $ chmod +x /home/gpadmin/greenplum-text--.bin

    3. If you are installing GPText in a directory that is only writable by root, such as the default directory /usr/local, perform these steps as root:

    a. Source the greenplum_path.sh file in the Greenplum Database installation directory.

    # source /usr/local/greenplum-db-/greenplum_path.sh

    b. Locate or create a text file containing a list of the names of all hosts where you will install GPText, one per line, including the master and standby host names.

    c. Start gpssh, specifying the text file with host names.

    # gpssh -f hostlist.txt

    d. Create the installation directory and the greenplum-solr directory and set the ownership and permissions. For example, if you are installing GPText in the default directory, /usr/local:

    => mkdir /usr/local/greenplum-text-
    => mkdir /usr/local/greenplum-solr
    => chown gpadmin:gpadmin /usr/local/greenplum-text-
    => chmod 775 /usr/local/greenplum-text-
    => chown gpadmin:gpadmin /usr/local/greenplum-solr
    => chmod 775 /usr/local/greenplum-solr
    => exit

    e. Complete the remaining steps as the gpadmin user.

    4. Edit the gptext_install_config file to set parameters for the installation. See Set Installation Parameters for details.

    5. Run the GPText installation binary as gpadmin on the master server:

    $ ./gptext-.bin -c

    6. Accept the Pivotal license agreement.

    Optional Two-Part GPText Installation

    You can run the GPText installation in two parts by following these steps.

    1. Prepare GPText installation directories as described in steps 1 through 3 in Install the GPText Binary Distribution.

    2. Run the GPText installation binary as gpadmin on the master server:

    $ ./greenplum-text-.bin-b

    Note that the -c option is omitted.

    3. Source the GPText environment script in the GPText installation directory:

    $ source /greenplum-text_path.sh

    4. Edit the gptext_install_config file to set parameters for the GPText installation. See Set Installation Parameters for details.

    5. Deploy the GPText cluster with the following command:

    $ gptext-deploy -c

    Set Installation Parameters

    A GPText configuration file named gptext_install_config contains parameters to configure the GPText installation. Edit the file and set the parameters as described in the following table.

    GPText installation parameters

    The GPTEXT_HOSTS and DATA_DIRECTORY installation parameters determine the number of GPText nodes that are deployed. The number of directories included in the DATA_DIRECTORY array is the number of GPText nodes that are created per host.

    The GPTEXT_HOSTS parameter determines the number of hosts. If set to the constant "ALLSEGHOSTS", the number of GPText node hosts is the same as the number of Greenplum segment hosts. If GPTEXT_HOSTS is set to an array of host names, the length of the array is the number of GPText node hosts.

    The maximum number of GPText nodes is the number of Greenplum Database primary segments. The best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes allowed. For example, if there are eight primary segments per host in the Greenplum Database cluster, the maximum number of GPText nodes per host is eight, but you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.
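    The node count rules above can be sketched in Python. This is an illustration, not GPText code: the function names are hypothetical, and it assumes that GPTEXT_HOSTS lists one name per physical host (the table below notes that multiple interfaces per host are handled differently).

```python
ALLSEGHOSTS = "ALLSEGHOSTS"


def gptext_node_count(gptext_hosts, data_directories, segment_host_count):
    """Return the total number of GPText nodes a configuration would deploy.

    gptext_hosts is either the constant "ALLSEGHOSTS" or a list of host
    names; data_directories is the DATA_DIRECTORY array (one node per entry
    on each host).
    """
    if gptext_hosts == ALLSEGHOSTS:
        host_count = segment_host_count
    else:
        host_count = len(gptext_hosts)
    return host_count * len(data_directories)


def within_maximum(total_nodes, primary_segment_count):
    """The maximum number of GPText nodes is the number of primary segments."""
    return total_nodes <= primary_segment_count


if __name__ == "__main__":
    hosts = ["gptext_h1", "gptext_h2", "gptext_h3"]
    dirs = ["/data/primary", "/data/primary"]
    total = gptext_node_count(hosts, dirs, segment_host_count=4)
    print(total, within_maximum(total, primary_segment_count=32))
```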


  • GPTEXT_HOSTS

    An array of host names on which to install GPText, or use the constant "ALLSEGHOSTS" to install GPText on all Greenplum Database segment hosts. GPText hosts must be passwordless ssh-accessible by the gpadmin user from all other hosts in the Greenplum cluster.

    declare -a GPTEXT_HOSTS=(gptext_h1 gptext_h2 gptext_h3)

    GPTEXT_HOSTS="ALLSEGHOSTS"

    DATA_DIRECTORY

    An array of directory paths where GPText data directories are to be created. The number of directories in the array determines the number of GPText nodes that will be created on each physical host. If GPTEXT_HOSTS lists multiple interfaces per host, the GPText nodes are spread evenly across the interface addresses.

    declare -a DATA_DIRECTORY=(/data/primary /data/primary)

    JAVA_OPTS

    Sets the minimum and maximum memory each SolrCloud JVM can use.

    JAVA_OPTS="-Xms1024M -Xmx2048M"

    GPTEXT_PORT_BASE

    GP_MAX_PORT_LIMIT

    Set a range of port numbers available to GPText nodes. GPText finds unused ports in the specified range.

    GPTEXT_PORT_BASE=18983

    GP_MAX_PORT_LIMIT=28983

    ZOO_CLUSTER

    Whether to deploy a GPText binding ZooKeeper cluster or use an existing ZooKeeper cluster. If set to "BINDING", the installation deploys a ZooKeeper cluster. To use an existing ZooKeeper cluster, set this parameter to a list of ZooKeeper nodes in the format "host1:port,host2:port,host3:port".

    ZOO_CLUSTER="BINDING"

    ZOO_HOSTS

    If ZOO_CLUSTER is set to "BINDING", this parameter is an array of the hosts where the ZooKeeper nodes are to be installed. The array must contain 3, 5, or 7 host names, for example ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5). If you are using a single host for ZooKeeper, specify it multiple times, for example, ZOO_HOSTS=(sdw1 sdw1 sdw1).

    declare -a ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5)

    ZOO_DATA_DIR

    The ZooKeeper data directory, required when ZOO_CLUSTER is set to "BINDING".

    ZOO_DATA_DIR="/data/master/"

    ZOO_GPTXTNODE

    The node path in ZooKeeper for GPText. This parameter is required whether ZOO_CLUSTER is set to "BINDING" or a list of hosts.

    ZOO_GPTXTNODE="gptext"

    ZOO_PORT_BASE

    ZOO_MAX_PORT_LIMIT

    A range of port numbers to use for the ZooKeeper cluster. Unused ports are allocated from within this range. The range must contain at least 4000 port numbers.

    ZOO_PORT_BASE=2188

    ZOO_MAX_PORT_LIMIT=12188

    GPTEXT_JAVA_HOME

    The home directory of the Java installation to run for ZooKeeper and Solr processes. If not set, the JRE specified in the PATH and JAVA_HOME environment variables will be used.

    GPTEXT_JAVA_HOME=/usr/java/jdk1.8.0_131
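    Before running the installer, it can help to sanity-check the ZooKeeper settings against the rules in the table. The Python sketch below is a hypothetical pre-flight check, not a GPText utility; the function names are invented for illustration, and the host names zk1 through zk3 are placeholders.

```python
def parse_zoo_cluster(value):
    """Parse a ZOO_CLUSTER value.

    Returns None for "BINDING" (GPText deploys ZooKeeper itself), otherwise
    a list of (host, port) tuples parsed from "host1:port,host2:port,...".
    """
    if value == "BINDING":
        return None
    nodes = []
    for item in value.split(","):
        host, sep, port = item.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        nodes.append((host, int(port)))
    return nodes


def valid_zoo_hosts(zoo_hosts):
    """ZOO_HOSTS must contain 3, 5, or 7 names (a name may be repeated)."""
    return len(zoo_hosts) in (3, 5, 7)


def valid_zoo_port_range(port_base, max_port_limit):
    """The ZooKeeper port range must contain at least 4000 port numbers."""
    return max_port_limit - port_base + 1 >= 4000


if __name__ == "__main__":
    print(parse_zoo_cluster("zk1:2181,zk2:2181,zk3:2181"))
    print(valid_zoo_hosts(["sdw1", "sdw1", "sdw1"]))
    print(valid_zoo_port_range(2188, 12188))
```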

    Starting GPText

    First, make sure the GPText command-line utilities are in your path by sourcing the Greenplum Database and GPText environment scripts. It is important to source the GPText environment script each time you source the Greenplum Database script. For example:

    $ source /usr/local/greenplum-db-/greenplum_path.sh
    $ source /usr/local/greenplum-text-/greenplum-text_path.sh

    To use GPText in a database, you must first use the gptext-installsql management utility to install the GPText user-defined functions and other objects in the database:

    $ gptext-installsql database [database2 ...]

    The GPText objects are created in the gptext schema.

    The ZooKeeper cluster must be running before you start GPText. If you installed a bound ZooKeeper cluster, start it with the zkManager command-line utility.

    $ zkManager start

    Start GPText with the gptext-start utility.

    $ gptext-start

    Configure Greenplum Database

    GPText configuration parameters are saved in ZooKeeper. You can, however, view and set GPText configuration parameters in a Greenplum Database session using the SHOW and SET commands. This requires adding the GPText custom variable class to the Greenplum Database custom_variable_classes configuration parameter.

    The custom_variable_classes configuration parameter is a comma-separated list of class names. It is unset by default. To see if any custom variable classes have already been configured, run this gpconfig command at the command line.

    $ gpconfig -s custom_variable_classes

    If no custom variable classes have been set, set the parameter with the following command.

    $ gpconfig -c custom_variable_classes -v 'gptext'

    [gpadmin@gpsne ~]$ gpconfig -c custom_variable_classes -v 'gptext'
    20171029:12:29:11:028199 gpconfig:gpsne:gpadmin-[INFO]:-completed successfully

    If other classes have been configured, add gptext to the existing list, separated by a comma.

    Run gpstop -u to have Greenplum Database reload the configuration file.
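    When other classes are already configured, the new value for custom_variable_classes is the existing comma-separated list with gptext appended. A minimal Python sketch of building that value follows; the function name is hypothetical, and the class name plr in the usage line is only a placeholder for whatever gpconfig -s reports on your system.

```python
def add_gptext_class(current_value):
    """Return a custom_variable_classes value that includes 'gptext'.

    current_value is the existing comma-separated list (possibly empty).
    Existing entries keep their order; 'gptext' is appended only if absent.
    """
    classes = [name.strip() for name in current_value.split(",") if name.strip()]
    if "gptext" not in classes:
        classes.append("gptext")
    return ",".join(classes)


if __name__ == "__main__":
    # Placeholder existing value; pass the result to gpconfig -c ... -v '<value>'.
    print(add_gptext_class("plr"))
```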

    When you want to view or set GPText configuration parameters, first execute the gptext.version() function to load the GPText configuration parameters into the session.

    =# SELECT gptext.version();
                version
    --------------------------------
     Greenplum Text Analytics 2.1.2
    (1 row)

    =# SHOW gptext.idx_delim;
     gptext.idx_delim
    ------------------
     ,
    (1 row)

    See Setting GPText Configuration Parameters for more about GPText configuration parameters.

    Uninstalling GPText

    To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.

    gptext-uninstall runs only if there is at least one database with a GPText schema.

    Execute:

    $ gptext-uninstall

  • Upgrading GPText

    Upgrading a GPText system to a new GPText release installs the new GPText software release on all hosts in the Greenplum cluster and then upgrades the GPText system.

    Upgrading GPText and Greenplum Database at the Same Time

    If you are upgrading to new releases of Greenplum Database and GPText at the same time, follow these steps:

    1. Complete the Greenplum Database upgrade first and ensure the database is operational.

    2. Run the GPText gptext-migrator utility to migrate your current GPText system to the newly upgraded Greenplum Database system.

    3. Ensure that the current version of GPText works with the new Greenplum Database version.

    4. Proceed with the GPText upgrade.

    Upgrading a GPText Release

    Upgrading a GPText release is a two-part process: install the new software release on the Greenplum cluster hosts and then upgrade the existing GPText system. The GPText installer performs the first part, installing the new software. The gptext-upgrade utility performs the second part, upgrading the current GPText system to the new version.

    The GPText installer detects an existing GPText system and, after installing the new software release, offers to run the gptext-upgrade utility for you. If you choose to upgrade the GPText system later, you can run the gptext-upgrade utility yourself.

    All upgrade tasks are executed on the Greenplum master host as the gpadmin user. The gpadmin user must have write permission in the directory where the new GPText release is to be installed, /usr/local/greenplum-text-- by default.

    The Greenplum Database, ZooKeeper, and GPText clusters must be running. The procedure stops and restarts GPText during the upgrade.

    Follow these steps:

    1. Download the new GPText release for your platform from Pivotal Network (http://network.pivotal.io).

    2. Extract the release package.

    $ tar xfz greenplum-text--.tar.gz

    3. Make sure that ZooKeeper and GPText are running.

    $ gptext-state

    4. Run the GPText installer.

    $ ./greenplum-text--.bin

    When upgrading GPText, you do not specify an installation configuration file as you do for the initial GPText installation.

    5. The installer prompts you to accept the Pivotal license agreement and to choose and create the installation directory.

    6. The installer verifies the environment to ensure that prerequisites are present, such as Python and Java. If any problems are discovered, the installer outputs an error message and stops. Correct the problem identified by the message and run the installer again.

    7. After the new software has been installed on the Greenplum cluster, the installer looks for an existing GPText installation. If an existing GPText system is found, the installer asks if you wish to upgrade GPText directly.

    If you answer yes, the installer runs the gptext-upgrade script. The gptext-upgrade utility validates the environment to ensure it can complete the upgrade, then executes the upgrade and restarts the GPText system. If any problems are discovered, gptext-upgrade outputs a message and quits. Fix the indicated problems and run the gptext-upgrade utility (at /bin/gptext-upgrade ) to complete the GPText system upgrade. If you answer no, you must run the gptext-upgrade script after the installer completes. See the gptext-upgrade utility reference for instructions.

    Important: If you answer no or if gptext-upgrade quits without upgrading your software, follow these steps to re-run gptext-upgrade at a later time:

    a. Source the greenplum-text_path.sh script in the old GPText installation directory. For example:

    $ source /usr/local/greenplum-text-/greenplum-text_path.sh

    b. Run the gptext-upgrade command from the new GPText installation directory:

    $ /usr/local/greenplum-text-/bin/gptext-upgrade

    8. After the upgrade has completed, source the greenplum-text_path.sh script in the new GPText release directory and run gptext-state healthcheck to verify the GPText system:

    $ source /usr/local/greenplum-text-/greenplum-text_path.sh
    $ gptext-state healthcheck

  • Introduction to Pivotal GPText

    Pivotal GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis. GPText supports business decision making by offering:

    Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which broadens the kinds of information you can find.

    Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.

    Text analytics: GPText supports analysis of text data with machine learning algorithms. The MADlib analytics library is integrated with Greenplum Database and is available for use with GPText.

    This chapter contains the following topics:

    GPText System Architecture

    GPText Sample Use Case

    GPText Workflow

    Text Analysis

    GPText System Architecture

    GPText combines a Greenplum Database cluster with an Apache SolrCloud cluster. Greenplum Database segments and GPText nodes can be deployed on the same hosts or on different hosts with network connectivity.

    The following figure shows the process architecture of the combined Greenplum Database and Apache Solr clusters. The figure shows four cluster nodes with four Greenplum segments and four Solr instances deployed on each. An Apache ZooKeeper service manages the SolrCloud cluster. Because ZooKeeper is most efficient with an odd number of servers, ZooKeeper nodes are deployed on three of the four hosts. Greenplum Database users access SolrCloud services via GPText user-defined functions installed in Greenplum databases and command-line utilities.

    The figure omits the Greenplum master host, secondary master, and mirror segments for the Greenplum primary segments.
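    The preference for an odd number of ZooKeeper servers follows from ZooKeeper's majority rule: an ensemble stays available only while more than half of its servers are up. That rule is general ZooKeeper behavior rather than something stated in this guide, and the Python function below is an illustrative sketch with a hypothetical name.

```python
def tolerated_failures(ensemble_size):
    """Return how many ZooKeeper servers can fail while a majority remains."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    quorum = ensemble_size // 2 + 1  # strict majority of the ensemble
    return ensemble_size - quorum


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, tolerated_failures(size))
```

    Under this rule a four-server ensemble tolerates no more failures than a three-server one, which is why the example deployment places ZooKeeper on three of the four hosts.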


    The Greenplum segments, Solr instances, and ZooKeeper nodes may all be deployed on separate hosts on the same network, depending on application and performance requirements.

    The following sections describe how GPText integrates SolrCloud with Greenplum Database and how the two clusters work together to provide parallel text search capabilities in Greenplum Database and maintain high availability.

    GreenplumDatabaseClusterAGreenplumDatabaseclusteriscomprisedofthefollowingcomponents:

    Amasterdatabaseinstance,executingonadedicatedhost,conventionallynamed mdw .(Notillustrated)

    Asecondarymasterinstance,onahostconventionallynamed smdw ,actingasawarmstandbyforthemasterinstance.(Notillustrated)

    Anarrayofdatabaseprimarysegmentinstancesandmirrorsdeployedonsegmenthosts,byconvention sdw1 through sdwn .AsegmentinstanceisanindependentPostgresdatabaseprocessmanagingaportionofthedistributeddata.Eachsegmenthasamirror(notillustrated)onanotherhostintheclustertoprovideuninterruptedserviceincaseofasegmentorsegmenthostfailure.Thenumberofprimarysegmentsperhostisdeterminedbythehardwareconfiguration—thenumberandtypeofprocessorcores,theamountofphysicalRAM,localstoragecapacity,andnetworkcapacity—aswellasavailabilityandperformancerequirements.

    TheGreenplummasterinstancecoordinatestheworkofthesegmentinstances.OptimalperformanceofaGreenplumDatabaseclusterrequiresthatallsegmenthostsbeconfiguredidenticallywiththesamenumberofprimaryandmirrorsegmentsoneach,andwiththedatabasedatadistributedevenlyamongthesegmentinstances.Thefullcapacityofthedatabaseclusterisutilizedwheneverysegmenthostperformsanequalamountofwork.

    Apache SolrCloud

    Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache SolrCloud is a highly available, fault-tolerant cluster of Apache Solr servers. The term GPText cluster is another way to refer to a SolrCloud cluster deployed by GPText for use with a Greenplum Database system.

    A SolrCloud cluster consists of the following components:

    An Apache ZooKeeper cluster to manage the SolrCloud cluster. SolrCloud uses ZooKeeper to manage server configuration and to coordinate the cluster's activities. GPText can install a ZooKeeper cluster that is bound to the GPText cluster, or it can share an existing ZooKeeper cluster. If GPText installs the ZooKeeper cluster, it can be managed using GPText functions and utilities. The ZooKeeper cluster can be deployed on Greenplum Database cluster hosts or, for best performance, on separate hosts accessible to the Greenplum Database cluster.

    Multiple SolrCloud server instances deployed on the Greenplum segment hosts or on other hosts on the same network. Each instance is a JVM process running a Solr server. SolrCloud instances use local storage, which may be the same local storage volumes that store Greenplum Database data. The number of SolrCloud instances per host can be the same as the number of Greenplum primary segments per host, but this is not a requirement. The number of instances to execute per host is specified during GPText installation.

    GPText provides document indexing and search capabilities for Greenplum Database by adding user-defined functions (UDFs) that access Solr APIs from within database queries.

    GPText UDFs perform the following tasks:

    create and manage GPText indexes

    insert documents into indexes from database tables or, for GPText external indexes, from documents stored outside of Greenplum Database

    search indexes

    There are also GPText UDFs and command-line utilities to configure, monitor, and manage the SolrCloud cluster and to manage replicas, SolrCloud's high-availability mechanism. (More on replicas in the next section.)

    Parallelism in GPText Indexing and Searching

    SolrCloud distributes document indexes in slices called shards. With GPText, the number of shards for an index is the same as the number of Greenplum segments, so each Greenplum segment operates on an equal portion of the index. Each shard is managed by a SolrCloud instance and the shards are distributed evenly among the SolrCloud instances. The SolrCloud instance and Greenplum segment are not required to be on the same host.
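The even distribution of shards can be pictured with a small sketch. It is only an illustration: the round-robin rule, the instance names, and the shard names are assumptions made for the example, and the actual placement is decided by SolrCloud and ZooKeeper.

```python
def assign_shards(num_segments, solr_instances):
    """Spread one shard per Greenplum segment round-robin over Solr instances.

    Hypothetical placement rule used for illustration only.
    """
    placement = {name: [] for name in solr_instances}
    for shard in range(num_segments):
        owner = solr_instances[shard % len(solr_instances)]
        placement[owner].append(f"shard{shard}")
    return placement


# 16 Greenplum segments give 16 shards, spread over 4 Solr instances.
placement = assign_shards(16, ["solr0", "solr1", "solr2", "solr3"])
for instance, shards in placement.items():
    print(instance, shards)
```

With 16 segments and 4 instances, each instance ends up managing 4 shards, which matches the "equal portion" described above.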

    High Availability for GPText Indexes

    SolrCloud provides high availability by maintaining replicas of shards and providing automatic failover if a shard fails or becomes unavailable. One replica of each shard is the lead replica and any changes to it are applied to the other replicas. The replication factor, which determines the number of replicas to maintain for each shard, is set when the index is created. Replicas may also be added or dropped later using GPText UDFs or command-line utilities.

    ZooKeeper determines the locations of shard replicas among the Solr nodes and hosts. When adding a replica using a GPText UDF or command-line utility, the new replica can be explicitly placed on a SolrCloud instance.

    GPText Sample Use Case

    Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:

    1. Load the email records into a Greenplum database.

    2. Create a Solr index of the email records.

    3. Run queries that look for text strings and their authors.

    4. Refine the queries until they pair a dummy company name with the top three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.

    GPText Workflow

    GPText works with Greenplum Database and Apache SolrCloud to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.

    This topic describes the following information:

    Data Loading and Indexing Workflow

    Querying Data Workflow

    Data Loading and Indexing Workflow

    The following diagram shows the GPText workflow for loading and indexing data.

    All client interaction with the system is through the Greenplum master instance.

    1. Load data into your Greenplum Database system. Create a database table to hold data and then add the data to the table. Greenplum provides parallel data loading utilities and protocols that help to transform and load external data in various formats and from various sources. For details, see the Greenplum Database Administrator Guide, at http://gpdb.docs.pivotal.io .

    2. Create an empty GPText index. Use the gptext.create_index() user-defined function (UDF) to create an empty GPText index for the table. Each Greenplum segment will manage a slice of the index, called a shard. SolrCloud creates multiple replicas for each shard, distributed among the Solr instances, and chooses a lead replica for the Greenplum segment to operate upon. Solr manages replication between the replicas.

    3. Populate the index with data from the database table. Use the gptext.index() UDF to add data to the index. This UDF works by dispatching a SQL query to execute on each Greenplum segment. The segments execute the query and add the results to their shards using Solr APIs.

    4. Commit changes to the index. Commit changes to the GPText index by calling the gptext.commit_index() UDF. Until the changes are committed, queries executed on the index cannot access any data added to the index with gptext.index(). If needed, uncommitted changes can be rolled back. SolrCloud replicates changes committed to the lead replica to the shards' non-lead replicas.
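Steps 2 through 4 can be sketched as a sequence of SQL statements. The helper below only builds the statement text so the order of calls is visible. The argument lists shown for gptext.create_index() and gptext.index() are assumptions that this chapter does not confirm, so check the GPText Function Reference before using them. The database.schema.table index name follows the pattern of names such as demo.twitter.message used elsewhere in this guide, and the table and column names are hypothetical.

```python
def indexing_statements(database, schema, table, id_column, search_column):
    """Return SQL text for the create, populate, commit sequence.

    The UDF argument lists are assumed for illustration; verify them
    against the GPText Function Reference.
    """
    index_name = f"{database}.{schema}.{table}"
    return [
        # Step 2: create an empty index for the table (assumed arguments).
        f"SELECT * FROM gptext.create_index('{schema}', '{table}', "
        f"'{id_column}', '{search_column}');",
        # Step 3: populate the index from the table (assumed arguments).
        f"SELECT * FROM gptext.index(TABLE(SELECT * FROM {schema}.{table}), "
        f"'{index_name}');",
        # Step 4: commit so that searches can see the new documents.
        f"SELECT * FROM gptext.commit_index('{index_name}');",
    ]


statements = indexing_statements("demo", "twitter", "message", "id", "message_text")
for statement in statements:
    print(statement)
```

The statements would be run in order in a psql session connected to the database that contains the GPText schema.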

    Querying Data Workflow

    The following diagram shows the high-level GPText query process workflow:

    1. A user submits a SQL query designed to search the indexed data. A GPText search query is a SQL SELECT statement on a GPText search UDF that contains full-text search expressions.

    2. The Greenplum master dispatches the query to the Greenplum segments.

    3. Each segment executes the query, using the Solr API to search its index shard. SolrCloud executes the search query on the lead replica for the shard.

    4. The Greenplum segments return the results of the search query to the Greenplum master.

    5. The Greenplum master aggregates the results from all segments and returns them to the client.
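The dispatch and aggregate pattern in the steps above can be sketched as a toy simulation. Nothing here calls Solr or Greenplum: each shard is a plain list of documents, and the score is a simple term count standing in for Solr's relevance scoring, which is an assumption made to keep the example self-contained.

```python
def search_shard(shard_docs, term):
    """Step 3: a segment searches only its own shard."""
    hits = []
    for doc_id, text in shard_docs:
        count = text.lower().count(term)
        if count:
            hits.append((doc_id, count))
    return hits


def search_index(shards, term):
    """Steps 2, 4 and 5: dispatch to every shard, then aggregate."""
    results = []
    for shard_docs in shards:
        results.extend(search_shard(shard_docs, term))
    # Highest score first, ties broken by document id.
    return sorted(results, key=lambda hit: (-hit[1], hit[0]))


shards = [
    [(1, "offshore transfer"), (2, "lunch menu")],
    [(3, "Offshore account offshore")],
]
hits = search_index(shards, "offshore")
print(hits)
```

In the real system the per-shard searches run in parallel on the segments; the loop here is sequential only for simplicity.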

    Text Analysis

    GPText enables analysis of Solr indexes with Apache MADlib, an open source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.

    Learn more about Apache MADlib at http://madlib.apache.org . A gppkg package for MADlib is available on the Pivotal network at http://network.pivotal.io .

    Administering GPText

    GPText administration includes security considerations, monitoring Solr index statistics, managing and monitoring ZooKeeper, and troubleshooting.

    Changing GPText Server Configuration Parameters

    Configuration parameters used with GPText are built into GPText with default values. You can change the values for these parameters by setting the new values in a Greenplum Database session. The new values are stored in ZooKeeper. GPText indexes use the values of configuration parameters when they are created. Changing configuration parameters affects new indexes, but does not affect existing indexes.

    See GPText Configuration Parameters for a complete list of configuration parameters.

    A one-time Greenplum Database configuration change is needed for Greenplum Database to allow setting and displaying GPText configuration variables. As the gpadmin user, enter the following commands in a shell:

        $ gpconfig -c custom_variable_classes -v 'gptext'
        $ gpstop -u

    Then connect to a database that contains the GPText schema and execute the gptext.version() function to expose the GPText configuration variables:

        =# select * from gptext.version();

    Change the values of GPText configuration variables using the SET command in a session with a database that contains the GPText schema. The following example sets values for three configuration parameters in a psql session:

        =# set gptext.idx_buffer_size=10485760;
        SET
        =# set gptext.idx_delim='|';
        SET
        =# set gptext.extension_factor=5;
        SET

    You can view the current value of a configuration parameter that you have set using the SHOW command:

        =# show gptext.idx_delim;
         gptext.idx_delim
        ------------------
         |
        (1 row)

    Security and GPText Indexes

    GPText security is based on Greenplum Database security. Your privileges to execute GPText functions depend on your privileges for the database table that is the source for the index. For example, if you have SELECT privileges for a table in a Greenplum database, then you have SELECT privileges for an index generated from that table.

    Executing GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.

    ZooKeeper Administration

    Apache ZooKeeper enables coordination between the Apache Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. In ZooKeeper, a node (called a znode) can contain data, like a file, and can have child znodes, like a directory. ZooKeeper replicates data between multiple instances deployed as a cluster to provide a highly available, fault-tolerant service. Both Solr and GPText store configuration files and share status by writing data to ZooKeeper znodes. GPText stores information in the /gptext znode. The configuration files for a GPText index are in the /gptext/configs/ znode.

    The number of ZooKeeper instances in the cluster determines how many ZooKeeper node failures the cluster can tolerate and still remain active. The service remains available as long as a majority of the nodes in the cluster are up and able to communicate with each other. To tolerate a failure of n nodes, the cluster must have 2n + 1 nodes. A cluster of five nodes, for example, can tolerate two failed nodes.
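The majority rule is easy to check with a few lines of arithmetic. This sketch just restates the 2n + 1 relationship from the paragraph above; the function names are made up for the example.

```python
def nodes_required(failures_to_tolerate):
    """Cluster size needed to survive the given number of failed nodes."""
    return 2 * failures_to_tolerate + 1


def failures_tolerated(cluster_size):
    """Largest number of failed nodes that still leaves a majority running."""
    return (cluster_size - 1) // 2


for size in (3, 4, 5, 7):
    print(size, "nodes tolerate", failures_tolerated(size), "failure(s)")
```

A four-node cluster tolerates only one failure, the same as a three-node cluster, which is why an even node count adds cost without adding fault tolerance.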

    ZooKeeper is very fast for read requests because it stores data in memory. If ZooKeeper begins to swap memory to disk, Solr and GPText performance will suffer and the services could experience failures, so it is critical to allocate sufficient memory to the ZooKeeper Java processes. To avoid ZooKeeper instances competing with Greenplum Database segments for memory, you should deploy the ZooKeeper instances and Greenplum Database segments on different hosts. The ZooKeeper and Greenplum Database hosts must be on the same network and accessible with passwordless SSH by the gpadmin user. You can use the Greenplum Database gpssh-exkeys utility to share SSH keys between ZooKeeper and Greenplum Database hosts.

    You must start the ZooKeeper cluster before you start GPText. When you start GPText, the Solr nodes each load the replicas for indexes they manage. With large numbers of indexes, shards, and replicas, starting up the cluster can generate a very high, atypical load on ZooKeeper. It can take a long time to get all indexes loaded and some ZooKeeper requests may time out waiting for responses. Using the gptext-start --slow_start option starts Solr nodes one at a time, providing a more ordered start-up and limiting the number of concurrent ZooKeeper requests.

    The GPText command-line utility zkManager can be used to monitor the ZooKeeper cluster. If the ZooKeeper cluster is bound to GPText, you can also start and stop the cluster using zkManager.

    Checking ZooKeeper Status

    Use the zkManager utility from the command line to check the ZooKeeper cluster status. The utility lists the hosts, ports, latency, and follower/leader mode for each ZooKeeper instance. If a node is down, its mode is listed as Down.

    To check the ZooKeeper cluster status, run the zkManager state command.

        $ zkManager state
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper state process.
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Host port Latency min/avg/max Mode
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb 2189 0/0/22 follower
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb 2190 0/0/29 leader
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb 2188 0/0/27 follower
        20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Done.
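If you want to act on this report from a script, one approach is to parse the per-instance rows. The sketch below assumes the rows are whitespace-separated in the order host, port, latency min/avg/max, mode, as the column header suggests. The exact spacing of the real output, and the layout of a row for a node that is down, are assumptions, so treat the pattern as a starting point.

```python
import re

# Matches rows such as "...-[INFO]:-gpdb 2189 0/0/22 follower".
ROW = re.compile(r"\[INFO\]:-(\S+)\s+(\d+)\s+(\d+)/(\d+)/(\d+)\s+(\w+)$")

SAMPLE = """\
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Host port Latency min/avg/max Mode
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb 2189 0/0/22 follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb 2190 0/0/29 leader
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb 2188 0/0/27 follower
"""


def parse_state(output):
    """Extract one dict per ZooKeeper instance from the report text."""
    nodes = []
    for line in output.splitlines():
        match = ROW.search(line.strip())
        if match:
            host, port, low, avg, high, mode = match.groups()
            nodes.append({
                "host": host,
                "port": int(port),
                "latency": (int(low), int(avg), int(high)),
                "mode": mode,
            })
    return nodes


nodes = parse_state(SAMPLE)
leaders = [node for node in nodes if node["mode"] == "leader"]
print(leaders)
```

A healthy cluster should report exactly one leader, so a monitoring script could alert when the leader list is empty.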

    In a database session, you can use the gptext.zookeeper_hosts() function to list the ZooKeeper hosts.

        =# SELECT * FROM gptext.zookeeper_hosts();
          host  | port
        --------+------
         gpdb51 | 2188
         gpdb51 | 2189
         gpdb51 | 2190
        (3 rows)

    Starting and Stopping the ZooKeeper Cluster

    If the ZooKeeper cluster was installed by the GPText installer, the zkManager utility can start or stop the ZooKeeper cluster. To start the cluster, run the zkManager start command.

        $ zkManager start
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper start process
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-----------------------------------------------
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Starting Zookeeper:
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-----------------------------------------------
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Host Zookeeper Dir
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb /data/master/zoo0
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb /data/master/zoo1
        20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb /data/master/zoo2
        20171016:16:14:48:017845 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
        20171016:16:14:53:017845 zkManager:gpdb:gpadmin-[INFO]:-Done.

    To stop ZooKeeper, run the zkManager stop command.

        $ zkManager stop
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper stop process.
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-----------------------------------------------
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Stop Zookeeper:
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-----------------------------------------------
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Host Zookeeper Dir
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb /data/master/zoo0
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb /data/master/zoo1
        20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb /data/master/zoo2
        20171016:16:14:09:016499 zkManager:gpdb:gpadmin-[INFO]:-Done.

    See the zkManager reference for more information.

    Checking SolrCloud Status

    You can check the status of the SolrCloud cluster and indexes by running the gptext-state utility from the command line.

    To check the state of the GPText nodes and each index, run the gptext-state utility with the -D (--details) option. Example:

        $ gptext-state -D
        20180615:16:09:24:031986 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
        20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
        20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster status...
        20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Current GPText Version: 3.0.0
        20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-All nodes are up and running.
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-----------------------------------------------
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Index state details.
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-----------------------------------------------
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-database index name state
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo demo.twitter.message Green
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo demo.wikipedia.articles Green
        20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Done.

    This command reports the status of the GPText nodes and the status of each GPText index.

    Run gptext-state list to view just the indexes.

    The gptext-state healthcheck command checks the GPText configuration files, the index status, required disk space, user privileges, and index and database consistency. By default, the required disk space check passes if there is at least 20% disk free. You can set a different disk free threshold using the --disk_free option. For example:

        [gpadmin@gpdb-sandbox ~]$ gptext-state healthcheck --disk_free=25
        20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
        20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText config files ...
        20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
        20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText index status ...
        20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
        20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required disk space...
        20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
        20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required user privileges...
        20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
        20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for indexes and database consistency...
        20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
        20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Done.
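The disk space check amounts to comparing the percentage of free space on a volume against a threshold. The following sketch shows that comparison for a single local path. It is not how gptext-state is implemented, and the function names are hypothetical; it only illustrates what a threshold such as 25 means.

```python
import shutil


def percent_free(total_bytes, free_bytes):
    """Free space as a percentage of the volume size."""
    return 100.0 * free_bytes / total_bytes


def disk_check_passes(path, threshold_percent):
    """True when the volume holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    return percent_free(usage.total, usage.free) >= threshold_percent


# A volume with 50 of 200 units free is exactly at a 25 percent threshold.
print(percent_free(200, 50))
print(disk_check_passes(".", 25))
```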

    See the gptext-state utility reference for additional options.

    Recovering GPText Nodes

    Use the gptext-recover utility to recover down GPText nodes, for example after a failed Greenplum Database segment host is recovered.

    With no arguments, the gptext-recover utility discovers down GPText nodes and restarts them.

    With the -f (or --force) option, if a GPText node cannot be restarted and no shards are down, the node is deleted and created again on the same host. Missing replicas are added and the failed node and failed replicas are removed.

    The -H (--new_hosts) option allows recreating down GPText nodes on new hosts that replace failed hosts. The down GPText nodes are deleted and recreated on the new hosts. The argument to the -H option is a comma-separated list of the new hosts that are to replace the failed hosts. The number of new hosts must match the number of failed hosts. If shards are down, the utility advises reindexing. If only some replicas are down, it recreates the replicas on the new hosts and updates gptext.conf.

    The -r option recovers replicas, but does not attempt to recover any down nodes.

    Note: Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:

    Java 1.8

    Python 2.6

    The Linux lsof utility

    Viewing Solr Index Statistics

    You can view Solr index statistics by running the gptext-state utility from the command line.

    To list all GPText indexes, enter the following command at the command line:

        gptext-state list

    A command line that retrieves all statistics for an index:

        gptext-state --index demo.wikipedia.articles

    A command line that retrieves the number of documents in an index:

        gptext-state --index demo.wikipedia.articles --stats_columns=num_docs

    A command line that retrieves num_docs, index size, and the date and time last_modified:

        gptext-state --index demo.wikipedia.articles --stats_columns num_docs,size,last_modified

    Backing Up and Restoring GPText Indexes

    With the gptext-backup management utility, you can back up a GPText index so that, if needed, you can quickly recover from a failure. The backup can be restored to the same GPText system or to another system with the same number of Greenplum Database segments.

    The gptext-backup management utility backs up an index and its configuration files to either a shared file system, which must be mounted on and writable by each host in the Greenplum Database cluster, or to local storage on the Greenplum Database master and segment hosts.

    Backing Up to a Shared File System

    To back up on a shared file system, use the -p (--path) command-line option to specify the location of a directory on the mounted file system and the -n (--name) option to provide a name for the backup. Specify the index to back up with the -i (--index) option.

    $gptext-backup-i-p--n

    The gptext-backup utility then checks that:

    the GPText cluster is up

    the shared file system is valid

    the backup name specified with the -n option does not already exist in the directory specified with the -p option

    The utility creates the new directory and then saves one copy of each index shard to that directory, along with the index's configuration files from ZooKeeper.

    To save the configuration files only, with no data, add the -c (--backup_conf) command-line option.

    To restore an index from a shared file system, use the gptext-restore management utility. The GPText system you restore to must be on a Greenplum Database cluster with the same number of segments. The database and schema for the index must be present.

    The -i (--index) option specifies the name of the GPText index that will be restored. If the index exists, you must first drop it with the gptext.drop_index() user-defined function.

    The -p (--path) option specifies the location of the directory containing the backup files, that is, the directory that gptext-backup created on the shared file system.

    $gptext-restore-i-p

    You can add the -c option to restore only the configuration files to ZooKeeper and create an empty GPText index, without restoring any saved index data.

    Backing Up to Local Storage

    To back up to local storage on the Greenplum Database cluster, add the local keyword to the gptext-backup command line.

    A local GPText backup has a unique name constructed by appending a timestamp to the index name. You do not use the -n option with local backups.

    $gptext-backuplocal-i

    On the master host, in the master data directory by default, the backup utility saves a JSON file with backup metadata and a directory containing the index's configuration files from ZooKeeper.

    The utility backs up each index shard on the Greenplum Database segment host with the GPText node that manages the shard's lead replica. By default, the shard backup files are saved in a segment data directory.

    The gptext-backup command output reports the locations of all backup files.

    You can add the -p (--path) option to the gptext-backup command to specify a local directory where the backup will be saved. The directory must be present on every Greenplum Database host and must be writable by the gpadmin user.

    $gptext-backuplocal-i-p

    The backup files will be saved in the specified directory on each host instead of in the Greenplum Database master and segment data directories.

    To restore a backup saved to local storage, add the local keyword to the gptext-restore command line and specify the path to the backup directory on the master host.

    $gptext-restorelocal-p

    The -p argument is the full path to the directory the gptext-backup command created on the master host, including the timestamp, for example $MASTER_DATA_DIRECTORY/demo.twitter.message_2018-05-08T15:32:21.397779 .
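The example directory name above looks like the index name, an underscore, and an ISO 8601 timestamp with microseconds. A short sketch of that naming pattern follows. The format is inferred from the single example in this section, so the exact rule used by gptext-backup is an assumption, and the helper name is hypothetical.

```python
from datetime import datetime


def local_backup_name(index_name, taken_at):
    """Build a backup name in the index_timestamp pattern seen in the example."""
    return f"{index_name}_{taken_at.isoformat()}"


name = local_backup_name(
    "demo.twitter.message",
    datetime(2018, 5, 8, 15, 32, 21, 397779),
)
print(name)
```

Because the timestamp is part of the name, it is simplest to copy the path from the gptext-backup command output rather than retyping it.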

    See gptext-backup for syntax and examples for running gptext-backup. See gptext-restore for syntax and examples for running gptext-restore.

    Expanding the GPText Cluster

    The gptext-expand management utility adds GPText nodes to the cluster. There are two ways to add nodes:

    Add GPText nodes to existing hosts in the cluster. This option increases the number of GPText nodes on each host.

    Add GPText nodes to new hosts added by using the Greenplum Database gpexpand management utility to expand the Greenplum Database system.

    Adding GPText Nodes to Existing Segment Hosts

    To add nodes to existing segment hosts, run the gptext-expand utility with a command like the following:

        gptext-expand -e -p /data1/nodes,/data2/nodes

    This example adds two GPText nodes to each host.

    The -e (--existing) option specifies that nodes are to be added to existing hosts.

    The -p (--expand_paths) option provides a list of directories where the new nodes' data directories are to be created. These should be the same directories that contain the Greenplum Database segment data directories and existing GPText data directories. The number of directories in the list is the number of new nodes that are added.

    A directory can be repeated in the directory list multiple times to increase the number of new GPText nodes to create. For example, if there is currently one GPText node per host in the /data1/nodes directory, you could add three nodes with a command like the following:

        gptext-expand -e -p /data1/nodes,/data2/nodes,/data2/nodes

    This adds one node to the /data1/nodes directory and two nodes to the /data2/nodes directory so there are two GPText nodes in each directory.
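The node counts in this example can be verified by tallying the directory list. The sketch below only does the counting described in the text. It does not run gptext-expand, and the function name is made up for the illustration.

```python
from collections import Counter


def nodes_per_directory(existing, expand_paths):
    """Count GPText nodes per directory after an expansion.

    `existing` maps directory -> current node count on a host;
    `expand_paths` is the comma-separated list passed with -p.
    """
    counts = Counter(existing)
    counts.update(expand_paths.split(","))
    return dict(counts)


result = nodes_per_directory(
    {"/data1/nodes": 1},
    "/data1/nodes,/data2/nodes,/data2/nodes",
)
print(result)
```

Each entry in the list adds exactly one node per host, so repeating a directory is how you place more than one new node in it.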

    Adding GPText nodes affects new indexes, but not existing indexes. Replicas for new indexes will be distributed across all of the nodes, including both old nodes and the newly created nodes. Replicas for indexes that existed before running gptext-expand are not automatically moved. Rebalancing existing replicas requires reindexing.

    Adding GPText Nodes to New Hosts

    Check that the following GPText prerequisites are installed on each new host added to the Greenplum Database cluster:

    Java 1.8

    Python 2.6 or greater

    Linux lsof utility

    New hosts must be reachable by all hosts in the GPText cluster, including existing hosts and the new hosts you are adding.

    After expanding the Greenplum Database cluster with the gpexpand management utility, call gptext-expand with the -H (--new_hosts) option and a list of the new hosts on which to install GPText:

        gptext-expand -H newhost1,newhost2

    The gptext-expand utility installs GPText binaries on the new hosts and then creates new GPText nodes on the new hosts.

    Expanding a Greenplum Database cluster increases the number of segments, so the number of GPText index shards for existing indexes must be increased to equal the new number of segments. This requires reindexing all existing documents. Newly created indexes will automatically be distributed among the new shards.

    Troubleshooting

    GPText errors are of the following types:

    Solr errors

    gptext errors

    Most of the Solr errors are self-explanatory.

    gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.

    Monitoring Logs

    You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:

        segment-directory/pg_log

    Solr logs reside in:

        /solr/logs

    Determining Segment Status with gptext-state

    Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Management Utilities Reference.

    GPText High Availability

    The GPText high availability feature ensures that you can continue working with GPText indexes as long as each shard in the index has at least one working replica.

    A GPText index has one shard for each Greenplum segment, so there is a one-to-one correspondence between Greenplum segments and GPText index shards. The shard managed by a Greenplum segment is an index of the documents that are managed by that segment.

    The GPText high availability mechanism is to maintain multiple copies, or replicas, of the shard. The ZooKeeper service that manages SolrCloud chooses a GPText instance (SolrCloud node) for each replica to ensure even distribution and high availability. For each shard, one replica is elected leader and the Greenplum segment associated with the shard operates on this leader replica. The GPText instance managing the lead replica may or may not be on another Greenplum host, so indexing and searching operations are passed over the Greenplum cluster's interconnect network. SolrCloud replicates changes made to the leader replica to the remaining replicas.

    The following figure illustrates the relationships between Greenplum segments and GPText index shards and replicas. The leader replica for each shard is shown in green and the followers are gray.

    The number of replicas to create for each shard, the replication factor, is a SolrCloud property. By default, GPText starts SolrCloud with a replication factor of three. The replication factor for each individual index is the value of the SolrCloud replication factor when the index is created. Changing the replication factor does not alter the replication factor for existing indexes.

    Greenplum Segment or Host Failure

    If a Greenplum primary segment fails and its mirror is activated, GPText functions and utilities continue to access the leader replica. No intervention is needed.

    If a host in the cluster fails, both Greenplum and GPText are affected. Mirrors for the Greenplum primary segments located on the failed host are activated on other hosts. SolrCloud elects a new leader replica for affected shards. Because Greenplum segment mirrors and GPText shard replicas are distributed throughout the cluster, a single host failure should not prevent the cluster from continuing to operate. The performance of database queries and indexing operations will be affected until the failed host is recovered and the cluster is brought back into balance.

    ZooKeeper Cluster Availability

    SolrCloud is dependent on a working, available ZooKeeper cluster. For ZooKeeper to be active, a majority of the ZooKeeper cluster nodes must be up and able to communicate with each other. A ZooKeeper cluster with three nodes can continue to operate if one of the nodes fails, since two is a majority of three. To tolerate two failed nodes, the cluster must have at least five nodes so that the number of working nodes remaining after the failure is a majority. To tolerate n node failures, then, a ZooKeeper cluster must have 2n + 1 nodes. This is why ZooKeeper clusters usually have an odd number of nodes.

    The best practice for a high-availability GPText cluster is a ZooKeeper cluster with five or seven nodes so that the cluster can tolerate two or three failed nodes.

    Managing GPText Cluster Health

    GPText document indexing and searching services remain available as long as each shard of an index has at least one working replica. To ensure availability in the event of a failure, it is important to monitor the status of the cluster and ensure that all of the index shard replicas are healthy. You can monitor the SolrCloud cluster and indexes using the SolrCloud Dashboard or using GPText functions and management utilities. Access the SolrCloud Dashboard with a web browser on any GPText instance with a URL such as http://sdw3:18983/solr . (The port numbers for GPText instances are set with the GPTEXT_PORT_BASE parameter in the installation parameters file at installation time.)

    Refer to the Apache SolrCloud documentation for help using the SolrCloud Dashboard.

    Monitoring the Cluster with GPText

    The GPText gptext-state management utility allows you to query the state of the GPText cluster and indexes. You can also use gptext.index_status() to view the status of all indexes or a specified index.

    To see the GPText cluster state, run the gptext-state command-line utility with the -d option to specify a database that has the GPText schema installed.

        gptext-state -d mydb

    The utility reports any GPText nodes that are down and lists the status of every GPText index. For each index, the database name, index name, and status are reported. The status column contains "Green", "Yellow", or "Red":

    Green: all replicas for all shards are healthy

    Yellow: all shards have at least one healthy replica but at least one replica is down

    Red: no replicas are available for at least one index shard
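The three status values follow directly from replica health per shard. The sketch below applies the definitions above to a hypothetical in-memory description of an index, where each shard maps to a list of booleans (True for a healthy replica). The data structure is an assumption made for the example and is not the output format of gptext-state.

```python
def index_status(shards):
    """Classify an index as Green, Yellow or Red from replica health."""
    # Red: at least one shard has no healthy replica at all.
    if any(not any(replicas) for replicas in shards.values()):
        return "Red"
    # Yellow: every shard is served, but some replica is down.
    if any(not all(replicas) for replicas in shards.values()):
        return "Yellow"
    # Green: every replica of every shard is healthy.
    return "Green"


print(index_status({"shard0": [True, True], "shard1": [True, True]}))
print(index_status({"shard0": [True, False], "shard1": [True, True]}))
print(index_status({"shard0": [False, False], "shard1": [True, True]}))
```

A Yellow index is still fully searchable; it signals reduced redundancy, which is the time to add a replacement replica before another failure turns it Red.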

    To see the distribution of index shards and replicas in the GPText cluster, execute this SQL statement.

        SELECT index_name, shard_name, replica_name, node_name
        FROM gptext.index_summary()
        ORDER BY node_name;

    To list all GPText indexes, run the gptext-state list command.

        gptext-state list -d mydb

    The gptext-state healthcheck command checks the health of the cluster. The -f flag specifies the percentage of available disk space required to report a healthy cluster. The default is 10.

        gptext-state healthcheck -f 20 -d mydb

    See gptext-state in the Management Utilities reference for help with additional gptext-state options.

    The gptext.index_status() user-defined function reports the status of all GPText indexes or a specified index.

        SELECT * FROM gptext.index_status();

    Specify an index name to report only the status of that index.

        SELECT * FROM gptext.index_status('demo.twitter.message');

    AddingandDroppingReplicasThe gptext-replica utilityaddsordropsareplicaofasingleindexshard.Usethe gptext.add_replica() and gptext.delete_replica() user-definedfunctionstoperformthesametasksfromwithinthedatabase.

    Ifareplicaofashardfails,use gptext-replica toaddanewreplicaandthendropthefailedreplicatobringtheindexbackto“Green”status.

    gptext-replicaadd-imydb.public.messages-sshard3

    Hereistheequivalent,usingthe gptext.add_replica() function:

        SELECT * FROM gptext.add_replica('mydb.public.messages', 'shard3');

    ZooKeeper determines where the replica will be located, but you can also specify the node where the replica is created:

        gptext-replica add -i mydb.public.messages -s shard3 -n sdw3

    In the gptext.add_replica() function, add the node name as a third argument.

    To drop a replica, run gptext-replica drop or call gptext.delete_replica() with the name of the index, the name of the shard, and the name of the replica. You can find the name of the replica by calling gptext.index_status(index_name). The name is in the format core_node<n>. With gptext-replica, an optional -o flag specifies that the replica is to be deleted only if it is down.

        gptext-replica drop -i mydb.public.messages -s shard3 -r core_node4 -o

    Here is the equivalent of the above command using the gptext.delete_replica() user-defined function.

        SELECT * FROM gptext.delete_replica('mydb.public.messages', 'shard3', 'core_node4', true);

    GPText Best Practices

    Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out-of-memory failures can occur when the nodes have insufficient memory.

    Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.

    This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.

    Indexing Large Numbers of Documents

    Indexing documents consumes Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit_index() user-defined function.

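    You can configure an index to perform automatic soft commits more frequently by editing the solrconfig.xml file for the index:

        $ gptext-config edit -f solrconfig.xml -i <database>.<schema>.<table>

    The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the time between automatic commits. For example, the following settings perform an autocommit every 100,000 documents or 10 minutes.

        <autoSoftCommit>
            <maxDocs>100000</maxDocs>
            <maxTime>600000</maxTime>
        </autoSoftCommit>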

    Indexing Very Large Documents

    Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.

    See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.

    Determining the Number of GPText Nodes to Deploy

    A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.

    The maximum recommended number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes. Use the JAVA_OPTS installation parameter to set memory size for GPText nodes.

    A single GPText node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most GPText installations, a single GPText node per host is sufficient.

    If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a GPText node loses capacity and performance. Therefore, no GPText node should have more than 32GB of memory.

    For example, if you have 48GB memory available for GPText per host, you should deploy two GPText nodes with 24GB memory each. If you have 128GB available, you should deploy at least four JVMs, and more if garbage collection becomes a problem.
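The arithmetic behind these two examples can be sketched as follows. This is an illustration only: `plan_nodes` is a hypothetical helper, not a GPText command, and it assumes the memory figure is a positive whole number of gigabytes.

```shell
#!/bin/sh
# Hypothetical helper: given the memory (in GB) available to GPText on one
# host, print the smallest node count that keeps every node at or below the
# 32GB limit discussed above, and the resulting memory per node.
# Assumes a positive integer argument.
plan_nodes() {
    total_gb=$1
    max_gb=32
    nodes=$(( (total_gb + max_gb - 1) / max_gb ))   # ceiling division
    per_node=$(( total_gb / nodes ))
    echo "$nodes nodes x ${per_node}GB"
}

plan_nodes 48    # 2 nodes x 24GB
plan_nodes 128   # 4 nodes x 32GB
```

The result is a lower bound on the node count; as noted above, more nodes may be appropriate if garbage collection becomes a problem.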

    Configure Maximum JVM Heap Size

    Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out-of-memory error will eventually occur.

    Monitor performance at startup and during index creation and increase the JVM size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit, to monitor Java heap usage. If garbage collections are occurring too frequently and freeing too little memory, the JVM heap should be increased.

    The JVM size is initially configured during GPText installation by setting the JAVA_OPTS parameter in the installation configuration file. After installation, use the gptext-config jvm command to increase the JVM heap size. For example, this gptext-config jvm command sets the JVM maximum heap option to 4GB:

        $ gptext-config jvm -o "-Xmx4096M"

    Manage Indexing and Search Loads

    With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.

    Terms Queries and Out of Memory Errors

    The gptext.terms() function retrieves terms vectors from documents that match a query. An out-of-memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out-of-memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS) and concurrent queries.

    If you experience out-of-memory errors with gptext.terms(), you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out-of-memory errors, but performance of terms queries can be affected.

    See GPText Configuration Parameters for help setting GPText configuration parameters.

    Configure File System Caching for ZooKeeper

    Good Solr performance is dependent on fast response for ZooKeeper requests. ZooKeeper performs best when its database is cached so it does not have to go to disk for lookups. If you find that ZooKeeper JVMs have frequent disk accesses, look for ways to improve file caching or move ZooKeeper disks to faster storage.

    The ZooKeeper zkClientTimeout parameter is the time a client is allowed to not talk to ZooKeeper before having its session expired.

    Troubleshooting Hadoop Connection Problems

    This section describes Hadoop-related problems and potential solutions to these issues.

    DataNode Access Errors

    You may experience Hadoop access errors with GPText if any DataNodes in the Hadoop cluster reside in a multi-homed network. GPText uses an external IP address to access the HDFS NameNode. GPText encounters an error when the NameNode provides an internal IP address for a DataNode. In this situation, additional configuration is required to configure GPText to perform its own DNS resolution of DataNode hostnames.

    Perform the following procedure to explicitly configure DNS resolution of DataNode hostnames:

    1. Locate a local copy of the Hadoop authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_conf:

        $ cd /home/gpadmin/auths/hdfs_conf
        $ ls
        core-site.xml  hdfs-site.xml  user.txt

    2. Open hdfs-site.xml in the editor of your choice. For example:

        $ vi hdfs-site.xml

    3. Add the following property block to the file, and then save the file and exit:

        <property>
            <name>dfs.client.use.datanode.hostname</name>
            <value>true</value>
        </property>

    This property allows GPText hosts to perform their own DNS resolution of HDFS DataNode hostnames.

    4. Re-upload the modified configuration to ZooKeeper. For example, if the hdfs_conf directory includes the authentication configuration files for a Hadoop cluster with the configuration name hdfs_bill_auth:

        $ cd ..
        $ gptext-external upload -t hdfs -c hdfs_bill_auth -p hdfs_conf

    5. Determine the hostname-to-IP address mapping for all DataNodes, and add the associated entries into the /etc/hosts file on all GPText client hosts.
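Because step 5 must be repeated on every GPText client host, it can help to make the edit idempotent. The following is a minimal sketch, not a GPText utility: `add_host_entry` is a hypothetical helper, and the hostname and address shown are placeholders from documentation ranges rather than values from any real cluster.

```shell
#!/bin/sh
# Hypothetical helper: append "<ip> <hostname>" to a hosts file unless an
# entry for that hostname is already present.
add_host_entry() {
    hosts_file=$1 ip=$2 name=$3
    # Look for the hostname as a whole field on any non-comment line.
    if awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
        return 0    # already mapped; leave the file untouched
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example against a scratch copy rather than the real /etc/hosts.
tmp_hosts=$(mktemp)
add_host_entry "$tmp_hosts" 192.0.2.11 datanode1.example.com
add_host_entry "$tmp_hosts" 192.0.2.11 datanode1.example.com   # no duplicate added
cat "$tmp_hosts"
rm -f "$tmp_hosts"
```

Running the helper against /etc/hosts itself would require root privileges; testing on a copy first, as in the example, is the safer choice.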

    Kerberos-Related Errors

    The following problems are specific to Hadoop clusters secured with Kerberos.

    Clock Skew

    A login attempt to a Hadoop cluster secured with Kerberos will fail if clock skew between GPText client hosts and the Kerberos KDC host is too great. In this situation, you may see the following error in the Solr log:

        java.io.IOException caused by a KrbException noting “Clock skew too great”

    To resolve this situation, ensure that the clocks on the Kerberos KDC host and GPText client hosts are synchronized.
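A quick way to reason about "too great" is to compare the two clocks against a tolerance. The sketch below is illustrative only: `skew_ok` is a hypothetical helper, how you obtain the KDC host's time is left to you, and the 300-second default is the usual Kerberos tolerance, which your KDC may override.

```shell
#!/bin/sh
# Hypothetical check: compare two epoch timestamps (in seconds) and succeed
# only if they differ by no more than a tolerance. The tolerance defaults to
# 300 seconds, the usual Kerberos setting; pass a third argument to change it.
skew_ok() {
    local_ts=$1 kdc_ts=$2 limit=${3:-300}
    diff=$(( local_ts - kdc_ts ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))   # absolute value
    [ "$diff" -le "$limit" ]
}

skew_ok 1536000000 1536000120 && echo "within tolerance"       # 120 s apart
skew_ok 1536000000 1536000400 || echo "clock skew too great"   # 400 s apart
```

On a GPText client host, the first argument could come from `date +%s`.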

    Timeout Errors

    A login attempt to a Hadoop cluster secured with Kerberos may fail with timeout errors when the kdc and admin_server settings in the krb5.conf file are specified with a hostname, and the GPText client hosts cannot resolve the hostname. In this situation, you may see one of the following errors in the Solr log:

    - org.apache.solr.common.SolrException: Failed to login HDFS message caused by a java.io.IOException specifying javax.security.auth.login.LoginException: Receive timed out

    - java.nio.channels.UnresolvedAddressException with SocketIOWithTimeout referenced in the stack trace

    In this situation, you may choose either of the following:

    - Update the Kerberos krb5.conf file to specify the kdc and admin_server settings using IP addresses. Or

    - Update all GPText hosts to perform their own DNS resolution of the Kerberos KDC server.

    If you choose to update the krb5.conf file:

    1. Locate a local copy of the Hadoop Kerberos authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_kerb_conf:

        $ cd /home/gpadmin/auths/hdfs_kerb_conf
        $ ls
        core-site.xml  hdfs-site.xml  keytab  krb5.conf  user.txt

    2. Open krb5.conf in the editor of your choice. For example:

        $ vi krb5.conf

    3. Replace the KERBEROS block attributes with their equivalent IP addresses and then save the file and exit. For example:

        [realms]
        KERBEROS = {
            kdc = <kdc_ip_address>
            admin_server = <admin_server_ip_address>
        }

    4. Re-upload the modified configuration to ZooKeeper. For example, if the directory named hdfs_kerb_conf includes the authentication configuration files for a Hadoop cluster defined with the configuration name hdfs_kerb_auth:

        $ cd ..
        $ gptext-external upload -t hdfs -c hdfs_kerb_auth -p hdfs_kerb_conf

    Alternatively, if you choose to configure the GPText hosts to perform their own DNS resolution of the Kerberos KDC server, add an entry for the KDC hostname-to-IP address mapping to the /etc/hosts file on all GPText client hosts.

    Working With GPText Indexes

    Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create GPText indexes and add documents from Greenplum Database tables to them, and how to maintain and customize indexes for your own applications.

    For help indexing and searching documents stored outside of Greenplum Database, see Working With GPText External Indexes.

    Setting Up the Sample Database

    The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.

    1. Log in to the Greenplum Database master as the gpadmin user and create the demo database.

        $ createdb demo

    2. Open an interactive shell for executing queries in the demo database.

        $ psql demo

    3. Create the articles table in the wikipedia schema with the following statements.

        CREATE SCHEMA wikipedia;
        CREATE TABLE wikipedia.articles (
            id int8 primary key,
            date_time timestamptz,
            title text,
            content text,
            refs text
        ) DISTRIBUTED BY (id);

    4. Create the message table in the twitter schema with the following statements.

        CREATE SCHEMA twitter;
        CREATE TABLE twitter.message (
            id bigint,
            message_id bigint,
            spam boolean,
            created_at timestamp without time zone,
            source text,
            retweeted boolean,
            favorited boolean,
            truncated boolean,
            in_reply_to_screen_name text,
            in_reply_to_user_id bigint,
            author_id bigint,
            author_name text,
            author_screen_name text,
            author_lang text,
            author_url text,
            author_description text,
            author_listed_count integer,
            author_statuses_count integer,
            author_followers_count integer,
            author_friends_count integer,
            author_created_at timestamp without time zone,
            author_location text,
            author_verified boolean,
            message_url text,
            message_text text
        )
        DISTRIBUTED BY (id)
        PARTITION BY RANGE (created_at)
        ( START (DATE '2011-08-01') INCLUSIVE
          END (DATE '2011-12-01') EXCLUSIVE
          EVERY (INTERVAL '1 month') );
        CREATE INDEX id_idx ON twitter.message USING btree (id);

    5. Create the store.products table with these statements.

        CREATE SCHEMA store;
        CREATE TABLE store.products (
            id bigint,
            title text,
            category varchar(32),
            brand varchar(32),
            price float
        ) DISTRIBUTED BY (id);

    6. Download test data for the three tables from http://docs-gptext-staging.cfapps.io/demo/gptext-demo-data.tgz. Right-click the link, save the file, and then copy it to the gpadmin user’s home directory.

    7. Extract the data files with this tar command.

        $ tar xvfz gptext-demo-data.tgz

    8. Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.

        \COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;

    The articles table now contains text from 23 Wikipedia articles.

    9. Load the twitter data into the twitter.message table using the following psql \COPY metacommand.

        \COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;

    The message table now contains 1730 tweets from August to October, 2011.

    10. Load the products data into the store.products table with the following psql \COPY metacommand.

        \COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;

    The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.

    Setting up the GPText Command-line Environment

    To work with GPText indexes, you must first set up your environment and add the GPText schema to the database containing the documents (Greenplum Database data) you want to index.

    To set the environment, log in as the gpadmin user and source the Greenplum Database and GPText environment scripts. The Greenplum Database environment must be set before you source the GPText environment script. For example, if both Greenplum Database and GPText are installed in the /usr/local/ directory, enter these commands:

        $ source /usr/local/greenplum-db-<version>/greenplum_path.sh
        $ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh

    With the environment now set, you can access the GPText command-line utilities.

    Adding the GPText Schema to a Database

    Use the gptext-installsql utility to add the GPText schema to databases containing data you want to index with GPText. You perform this task one time for each database. In this example, the gptext schema is installed into the demo database.

        $ gptext-installsql demo

    The gptext schema provides user-defined types, tables, views, and functions for GPText. This schema is reserved for GPText. If you create any new objects in the gptext schema, they will be lost when you reinstall the schema or upgrade GPText.

    Creating GPText Indexes and Indexing Data

    http://docs-gptext-staging.cfapps.io/demo/gptext-demo-data.tgz

    The general steps for creating a GPText index and indexing documents are:

    1. Create an empty Solr index

    2. Customize the index (optional)

    3. Populate the index

    4. Commit the index

    After you complete these steps, you can create and execute a search query or implement machine learning algorithms. Searching GPText indexes is described in the Querying GPText Indexes topic.

    The following steps are completed by executing SQL commands and GPText functions in the database. Refer to the GPText Function Reference for details about the GPText functions described in the following examples.

    Create an empty GPText index

    A GPText index is an Apache Solr collection containing documents added from a Greenplum Database table. There can be one GPText index per Greenplum Database table. Each row in the database table is a document that can be added to the GPText index.

    If the database table is partitioned, there is one GPText index for all partitions. You must specify the root table name when creating the index and adding documents to it. GPText provides search semantics that enable searching partitions efficiently.

    A GPText external index is a Solr index for documents that are located outside of Greenplum Database. GPText provides user-defined functions to create external indexes and insert documents into them. See Working with GPText External Indexes.

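    The gptext.create_index() function creates a new GPText index. This function has two signatures:

        gptext.create_index(<schema_name>, <table_name>, <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

    or

        gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>, <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

    The <schema_name> and <table_name> arguments specify the database table that contains the source documents.

    The <id_col_name> argument is the name of the table column that contains a unique identifier for each row. The <id_col_name> column can be of type int4, int8, varchar, text, or uuid.

    The <def_search_col_name> argument is the name of the table column that contains the content you want to search by default. For example, if you want to index and search just the <def_search_col_name> column, you can use the first signature and specify the content column name in the <def_search_col_name> argument.

    The final, optional argument, <if_check_id_uniqueness>, is a Boolean argument. When true, the default, attempting to add a document with an ID that already exists in the index generates an error. If you set the argument to false, you can add documents with the same ID, but when you search the index all documents with the same ID are returned.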

    The following command creates an index for the twitter.message table, with the id column as the unique ID field and the message_text column for the default search column:

        =# SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text');

    To verify that the demo.twitter.message index was created, call gptext.index_status():

        =# SELECT * FROM gptext.index_status('demo.twitter.message');
         content_id |      index_name      | shard_name | shard_state | replica_name | replica_state |                  core                   |    node_name    |        base_url        | is_leader | partitioned | external_index
        ------------+----------------------+------------+-------------+--------------+---------------+-----------------------------------------+-----------------+------------------------+-----------+-------------+----------------
                  0 | demo.twitter.message | shard0     | active      | core_node3   | active        | demo.twitter.message_shard0_replica_n1  | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
                  0 | demo.twitter.message | shard0     | active      | core_node5   | active        | demo.twitter.message_shard0_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
                  1 | demo.twitter.message | shard1     | active      | core_node7   | active        | demo.twitter.message_shard1_replica_n4  | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
                  1 | demo.twitter.message | shard1     | active      | core_node9   | active        | demo.twitter.message_shard1_replica_n6  | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
                  2 | demo.twitter.message | shard2     | active      | core_node11  | active        | demo.twitter.message_shard2_replica_n8  | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
                  2 | demo.twitter.message | shard2     | active      | core_node13  | active        | demo.twitter.message_shard2_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
                  3 | demo.twitter.message | shard3     | active      | core_node15  | active        | demo.twitter.message_shard3_replica_n12 | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
                  3 | demo.twitter.message | shard3     | active      | core_node16  | active        | demo.twitter.message_shard3_replica_n14 | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
        (8 rows)

    This example was executed on a Greenplum Database cluster with four primary segments. Four shards were created, one for each segment, and each shard has two replicas.

    You can also run the gptext-state -D command-line utility to verify the index was created. See the gptext-state reference for details.

    The GPText index for the demo.twitter.message table is configured, by default, to index all columns in the twitter.message database table. You can write search queries that contain criteria using any column in the table.

    If you want to index and search a subset of the table columns, you can use the second gptext.create_index() signature, specifying the columns to index in the <p_columns> argument and the data types of those columns in the <p_types> argument. The <p_columns> and <p_types> arguments are text arrays. The id column name and default search column name must be included in the arrays.

    Use the second gptext.create_index() signature to create an index for the wikipedia.articles table. This index will allow you to search on the title, content, and refs columns. Note that the id column and default search column are still specified in separate arguments following the <p_columns> and <p_types> arrays.

        =# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{id,title,content,refs}', '{long,text_intl,text_intl,text_intl}', 'id', 'content', true);
        INFO:  Created index demo.wikipedia.articles
         create_index
        --------------
         t
        (1 row)

    Because the date_time column was omitted from the <p_columns> and <p_types> arrays, it will not be possible to search the wikipedia.articles index on date with the GPText search functions.

    Customize the index (optional)

    Creating a GPText index generates a set of configuration files for the index. Before you add documents to the index, you can customize the configuration files to change the way data is indexed and stored. You can customize an index later, after you have added documents to it, but you must then reindex the data to take advantage of your customizations.

    One common customization is to remap data types for some database columns. In the managed-schema configuration file for an index, GPText maps the data types for each field from the Greenplum Database type to an equivalent Solr data type. GPText applies default mappings (see GPText and Solr Data Type Mappings), but your index may be more effective if you use a different mapping for some fields.

    The demo.twitter.message table, for example, has a message_text text column that contains tweets. By default, GPText maps text columns to the Solr text_intl (international text) type. The GPText text_sm (social media text) type is a better mapping for a text column that contains social media idioms such as emoticons.

    Follow these steps to remap the message_text field to the text_sm type.

    1. Use the gptext-config utility to edit the managed-schema file for the demo.twitter.message index.

        $ gptext-config edit -i demo.twitter.message -f managed-schema

    The managed-schema file loads in a text editor (normally vi).

    2. Find the <field> element for the message_text field.

    3. Change the type attribute from text_intl to text_sm.

    4. Save the file and exit the editor.

    There are many other ways to customize a GPText index. For example, you can omit fields from the index by changing the indexed attribute of the <field> element to false, store the contents of the field in the index by changing the stored attribute to true, or use gptext-config to edit the stopwords.txt file to specify additional words to ignore when indexing.

    See Customizing GPText Indexes to learn how data type mapping determines how Solr analyzes and indexes field contents and for more ways to customize GPText indexes.

    Populate the index

    To populate the index, use the table function gptext.index(), which has the following syntax:

        SELECT * FROM gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>);

    To index all rows in the twitter.message table, execute this command:

        =# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
         dbid | num_docs
        ------+----------
            2 |      892
            3 |      838
        (2 rows)

    This command indexes the rows in the wikipedia.articles table.

        =# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
         dbid | num_docs
        ------+----------
            3 |       11
            2 |       12
        (2 rows)

    The results of this command show that 23 documents from two segments were added to the index.
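The total is simply the sum of the num_docs column across segments. As a small illustration, the sketch below adds up that column; `sum_docs` is a hypothetical helper, and it assumes the rows arrive in psql's unaligned, tuples-only form (`psql -At`), that is, as `<dbid>|<num_docs>` lines.

```shell
#!/bin/sh
# Hypothetical helper: total the num_docs column from gptext.index() output.
# Expects unaligned, tuples-only rows of the form "<dbid>|<num_docs>" on stdin.
sum_docs() {
    awk -F'|' 'NF == 2 { total += $2 } END { print total + 0 }'
}

# The wikipedia.articles result shown above: 11 + 12 documents.
printf '3|11\n2|12\n' | sum_docs   # 23
```

Comparing this total with the row count of the source table is a simple way to confirm that every row was indexed.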

    The first argument of the gptext.index() function is a table expression. TABLE(SELECT * FROM wikipedia.articles) creates a table expression from the articles table, using the table function TABLE.

    You can choose the data to index or update by changing the inner select list in the query to select the rows you want to index. When adding new documents to an existing index, for example, specify a WHERE clause in the gptext.index() call to choose only the new rows to index.

    The inner SELECT statement could also be a query on a different table with the same structure, or a result set constructed with an arbitrarily complex join, provided the columns specified in the gptext.create_index() function are present in the results. If you index data from a source other than the table used to create the index, be sure the distribution key for the result set matches the distribution key of the base table. The Greenplum Database SELECT statement has a SCATTER BY clause that you can use to specify the distribution key for the results from a query. See Specifying a distribution key with SCATTER BY for more about the distribution policy and GPText indexes.

    Commit the index

    After you create and populate an index, you commit the index using gptext.commit_index().

    This example commits the documents added to the indexes in the previous example.

        =# SELECT * FROM gptext.commit_index('demo.twitter.message');
         commit_index
        --------------
         t
        (1 row)

        =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
         commit_index
        --------------
         t
        (1 row)

    The gptext.commit_index() function commits any new data added to or deleted from the index since the last commit.

    Managing GPText Indexes

    GPText provides command-line utilities and functions you can use to perform these GPText management tasks:

    - Configuring an index

    - Optimizing an index

    - Specifying a distribution policy with SCATTER BY

    - Deleting from an index

    - Dropping an index

    - Adding a field to an index

    - Dropping a field from an index

    - Listing all indexes

    Configuringaninde