Pivotal Greenplum® Text Version 3.3.0 User Guide Rev: 01 © 2019 Pivotal Software, Inc.


Table of Contents

Pivotal® Greenplum® Text 3.3.0 Documentation
Pivotal® GPText 3.3.0 Release Notes
Installing GPText
Upgrading GPText
Introduction to Pivotal GPText
Administering GPText
GPText High Availability
GPText Best Practices
Troubleshooting Hadoop Connection Problems
Working With GPText Indexes
Querying GPText Indexes
Customizing GPText Indexes
Working With GPText External Indexes
Natural Language Processing with GPText Indexes
GPText Function Reference
GPText Management Utilities
GPText and Solr Data Type Mappings
GPText Schema Tables
GPText Configuration Parameters

© Copyright Pivotal Software, Inc, 2013-2019. Version 3.3.0

Pivotal® Greenplum® Text 3.3.0 Documentation

  • GPText Documentation PDF (http://docs-gptext-staging.cfapps.io/archives/GPText-docs-320.pdf)
  • Pivotal GPText 3.3.0 Release Notes
  • Installing Pivotal GPText
  • Upgrading Pivotal GPText
  • Using Pivotal GPText (http://docs-gptext-staging.cfapps.io/330/topics/)
  • GPText References (http://docs-gptext-staging.cfapps.io/330/topics/FuncRef_preface.html)

Additional Resources

  • Pivotal Greenplum Database (http://gpdb.docs.pivotal.io)
  • Apache Solr Web Site (http://lucene.apache.org/solr/)
  • Apache MADlib (http://madlib.apache.org/)

Pivotal® GPText 3.3.0 Release Notes

This document contains release information for Pivotal GPText 3.3.0.

Released: June 2019

About Pivotal GPText

Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.

GPText includes the following features:

  • The GPText database schema provides in-database access to Apache Solr indexing and searching
  • Build indexes with database data or external documents and search with the GPText API
  • Custom tokenizers for international text and social media text
  • A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
  • Faceted search results
  • Term highlighting in results
  • Natural language processing, including part-of-speech tagging and named entity extraction
  • Greater emphasis on high availability

The GPText management utility suite includes command-line utilities to perform the following tasks:

  • Start, stop, and monitor ZooKeeper and GPText nodes
  • Configure GPText nodes and indexes
  • Add and delete replicas for index shards
  • Back up and restore GPText indexes
  • Recover a GPText node
  • Expand the GPText cluster by adding GPText nodes

Prerequisites

Installing GPText also installs Apache SolrCloud and, optionally, Apache ZooKeeper.

Following are GPText installation prerequisites.

  • GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.
  • GPText runs on Greenplum Database version 4.3.6 or higher, Greenplum Database 5, or Greenplum Database 6. Greenplum Database 6 requires at least GPText 3.3.
  • GPText requires Java 8, OpenJDK 8, Java 11, or OpenJDK 11 to be installed on each host in the Greenplum Database cluster. Add the JRE bin directory to the PATH on all hosts in the cluster.
  • Install and configure your Greenplum Database system before you install GPText. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io .
  • Ensure that nc (netcat) is installed on all Greenplum cluster hosts ( sudo yum install nc ).
  • Installing lsof on all cluster hosts is recommended ( sudo yum install lsof ).
  • GPText cannot be installed onto a shared NFS mount.

GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.

If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).
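The reservation arithmetic described above can be sketched in a few lines of shell. This is only an illustration: the RAM size, node count, and heap size are hypothetical values, and the result is the RAM figure you would feed into the gp_vmem_protect_limit formulas, not the parameter value itself.

```shell
#!/bin/sh
# Hypothetical example values; substitute the figures for your own hosts.
physical_ram_mb=65536      # physical RAM on one segment host, in MB
gptext_nodes_per_host=2    # GPText nodes planned on each segment host
jvm_max_mb=2048            # JVM maximum size per node (the -Xmx value)

# Memory to set aside for GPText = nodes per host x JVM maximum size.
gptext_reserved_mb=$((gptext_nodes_per_host * jvm_max_mb))

# RAM that remains as input to the gp_vmem_protect_limit calculation.
ram_for_greenplum_mb=$((physical_ram_mb - gptext_reserved_mb))

echo "Reserve for GPText: ${gptext_reserved_mb} MB"
echo "RAM for the Greenplum calculation: ${ram_for_greenplum_mb} MB"
```

With these sample numbers, 4096 MB is set aside for GPText and 61440 MB remains for the Greenplum Database calculation.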

Apache Solr requires a ZooKeeper cluster with a minimum of three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.

New Features and Enhancements in GPText 3.3.0

Using GPText 3.3 with Greenplum Database 6.0

GPText 3.3 can be installed on a Greenplum Database 6 system with Java 8 or Java 11.

A GPText binary distribution has been added to Pivotal Network (https://network.pivotal.io/products/pivotal-gpdb) for Red Hat 7/CentOS 7 with Greenplum Database 6.

Using GPText with Greenplum Database 6 differs from using it with earlier Greenplum Database releases in the following ways:

  • The custom_variable_classes server configuration parameter has been removed in Greenplum Database 6. With earlier Greenplum Database versions, it was necessary to add 'gptext' to this parameter in order to set GPText configuration parameters. Greenplum Database 6 allows you to set configuration parameters in a database session without declaring a variable class.

  • In Greenplum Database 4 and 5, the default output format for the binary data type bytea is the PostgreSQL escape format, a sequence of ASCII characters with escape sequences where bytes cannot be represented with ASCII. In Greenplum Database 6, the default output format is the hex format, which represents each byte with hexadecimal digits. In Greenplum Database 5, the hex output format can be specified by setting the bytea_output configuration parameter to hex . To produce the same output in Greenplum Database 4, 5, and 6, you can set the bytea_output configuration parameter to escape .
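To make the hex format concrete, the following shell sketch renders the bytes of a string the way the bytea hex output format displays them: a leading \x followed by two hexadecimal digits per byte. The function name to_bytea_hex is hypothetical and the script does not talk to a database; it only illustrates the notation.

```shell
#!/bin/sh
# Print the bytes of the first argument in bytea "hex" notation.
# od -v -tx1 dumps every byte as two hex digits; tr strips the padding.
to_bytea_hex() {
    hex=$(printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n')
    printf '\\x%s\n' "$hex"
}

to_bytea_hex 'abc'
```

For the input abc the sketch prints \x616263, whereas the escape format would show the same printable bytes as abc.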

Custom Configuration Directory

A new optional installation parameter, GPTEXT_CUSTOM_CONFIG_DIR , can be set in the gptext_install_config file to specify a directory to store custom configuration files.

By default, GPText saves custom configuration files under the $GPTEXTHOME/share/ directory on each Solr host, for example $GPTEXTHOME/share/external_ .

To specify a different directory to store external configuration files, before you run the GPText installer, uncomment the GPTEXT_CUSTOM_CONFIG_DIR parameter in the gptext_install_config file and specify the full path to the directory. For example:

    GPTEXT_CUSTOM_CONFIG_DIR="/home/gpadmin/config_dir"

The gpadmin user must have the OS permissions required to create the directory.

If the parameter is set, the GPText installer creates the custom configuration directory on every Solr host. Configuration files you upload using the gptext-external upload command are stored under this directory on every Solr host to allow Solr to access the external document source from every host.

For example, if the GPTEXT_CUSTOM_CONFIG_DIR parameter is set to /home/gpadmin/config_dir when you install GPText, an s3 configuration with the name s3_conf will be saved in the directory /home/gpadmin/config_dir/external_source/s3/s3_conf on each host.

New Features and Enhancements in GPText 3.2.0

The GPText 3.2.0 release provides the following features and enhancements.

Lemmatization

GPText 3.2.0 enables lemmatizing terms in GPText indexes. You can define Solr analysis chains that include the Apache OpenNLP parts-of-speech filter and the new GPText WordNet Lemmatizer filter, which replaces terms with the root form of the term. The WordNet Lemmatizer filter uses a lexical database from the Princeton University WordNet® project to determine the root form.

GPText Configuration Files Location

GPText now saves configuration files gptext.conf , gptxtenvs.conf , and zookeeper.conf only in the Greenplum Database master and standby master directories. The gptext.conf file is no longer saved in each segment data directory.

Flexible Sharding

By default, GPText creates one Solr index shard for each Greenplum Database primary segment. You can now specify a smaller number of shards by setting the gptext.idx_num_shards parameter to the number of shards you want before you create the index. This works for both regular GPText indexes and external indexes.

When gptext.idx_num_shards is set to the default (0), GPText configures the index to use the Solr implicit router, with one shard per Greenplum Database segment. When the gptext.idx_num_shards parameter is changed to the number of shards desired, GPText creates the index using the Solr compositeId router to route documents to shards. The compositeId router does not support duplicate IDs, so if you set the if_check_id_uniqueness argument to false when you call the gptext.create_index() function, the implicit router is used, and the index will have one shard per Greenplum Database segment.

The content_id column is removed from the output of the gptext.index_status() and gptext.index_summary() functions, since Greenplum Database segments are not always associated with a single index shard.

See Specifying the Number of Shards for more information about this feature.
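The router selection rules above can be summarized as a small decision function. This is a sketch of the documented behavior only, not GPText code; the function name choose_router and its three arguments are hypothetical.

```shell
#!/bin/sh
# choose_router NUM_SHARDS CHECK_ID_UNIQUENESS PRIMARY_SEGMENTS
# Prints "<router> <shard count>" according to the rules described above:
#   - gptext.idx_num_shards = 0 (default)   -> implicit, one shard per segment
#   - if_check_id_uniqueness = false        -> implicit, one shard per segment
#   - otherwise                             -> compositeId, NUM_SHARDS shards
choose_router() {
    num_shards=$1
    check_unique=$2
    primary_segments=$3
    if [ "$num_shards" -eq 0 ] || [ "$check_unique" = "false" ]; then
        echo "implicit $primary_segments"
    else
        echo "compositeId $num_shards"
    fi
}

choose_router 0 true 8
choose_router 4 true 8
choose_router 4 false 8
```

The third call shows the case that is easy to overlook: a shard count is requested, but because ID uniqueness checking is turned off, the index still gets one shard per segment.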

gptext-recover Utility

When using the -f ( --force ) option, the gptext-recover utility now verifies that there are no indexes in a red state before proceeding. If any index is down, the utility exits.

ZooKeeper Upgrade

Apache ZooKeeper included with GPText 3.2.0 has been upgraded to version 3.4.11. This ZooKeeper release includes bug fixes that resolve an inconsistent cluster issue with GPText (MPP-29742).

New Features and Enhancements in GPText 3.1.0

The GPText 3.1.0 release provides the following features and enhancements.

Improvements to aid in developing and testing analyzer chains

The new gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.

The new gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.

The new gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.

Part-of-speech tagging and named entity recognition

GPText includes OpenNLP libraries and analyzer classes to classify indexed terms’ parts-of-speech (POS), and to recognize named entities, such as the names of persons, locations, and organizations (NER). GPText saves NER terms in the field’s terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.

The new gptext.ner_terms() function lists NER-tagged terms for documents that match a query.

GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP website and use them with GPText.

Other enhancements and fixes

  • The first argument of the gptext.terms() function, an anytable data type, has been made optional.
  • Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.

Apache Solr updated to Solr version 7.3

GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.

  • Apache Solr 7.3 Upgrade Notes (https://lucene.apache.org/solr/guide/7_3/solr-upgrade-notes.html)
  • Apache Solr 7.3 Release Highlights (https://wiki.apache.org/solr/ReleaseNote73)

Following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.

  • GPText server-side components are rebuilt and tested with the new Solr JAR files.

  • The managed-schema , solrconfig.xml , and other collection configuration files are updated.

  • The top-level element in solrconfig.xml is now officially deprecated in favor of the equivalent syntax. This element has been out of use in default Solr installations for several releases already.

  • The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json , that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. It is possible to revert to the old behavior by setting the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:

    $ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true

  • With earlier Solr releases, if you drop an index while a Solr node with a replica of the index is down, when the down node comes back online, the index comes back and cannot be deleted. Solr 7 fixes this bug. The GPText workaround for this bug is removed.

  • PointFields are the default numeric types. Solr has implemented *PointField types across the board, to replace Trie*-based numeric fields. All Trie* fields are now considered deprecated, and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.

  • The following spatial-related fields have been deprecated: LatLonType , GeoHashField , FieldType , SpatialTermQueryPrefixTreeFieldType . Use one of these field types instead: LatLonPointSpatialField , SpatialRecursivePrefixTreeField , RptWithGeometrySpatialField .

  • To improve parameter consistency in the Collections API, the parameter name fromNode for the MOVEREPLICA command, and the parameter names source and target for the REPLACENODE command, have been deprecated and replaced with sourceNode and targetNode . The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.

  • The replica core name has changed from _shard#_replica# to _shard#_replica_# . For example, demo.wikipedia.articles_shard0_replica1 becomes demo.wikipedia.articles_shard0_replica_n1 .
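If you have scripts that refer to replica cores by name, the rename shown in the last item can be sketched with sed. The function name new_core_name is hypothetical, and the sketch assumes the old name ends in _replica followed only by digits and that the new name uses the letter n, as in the single documented example.

```shell
#!/bin/sh
# Map an old-style core name to the new form shown in the example above.
# Names that do not end in "_replica<digits>" are printed unchanged.
new_core_name() {
    printf '%s\n' "$1" | sed 's/_replica\([0-9][0-9]*\)$/_replica_n\1/'
}

new_core_name demo.wikipedia.articles_shard0_replica1
```

Run against the documented example, the sketch prints demo.wikipedia.articles_shard0_replica_n1 .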

New Features and Enhancements in GPText 3.0.0

GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type for the gptext.external_login() , gptext.external_logout() , gptext.index_external() , and gptext.index_external_dir() GPText functions.

The gptext-state utility with the --index ( -i ) option now includes the date and time the GPText index was last modified.

New Features and Enhancements in GPText 2.4.0

GPText 2.4.0 allows adding documents stored in an authenticated FTP server to a GPText external index. This enhancement includes changes to add support for the ftp type to the gptext-external upload command-line utility and the gptext.external_login() , gptext.external_logout() , gptext.index_external() , and gptext.index_external_dir() GPText functions.


New Features and Enhancements in GPText 2.3.1

The gptext-backup command-line utility can now back up GPText indexes to local GPText cluster storage as well as a directory on a shared drive. For local backups, backup metadata and the index configuration files are backed up to the Greenplum Database master data directory and index shards are backed up in the segment data directories on each host.

The gptext-backup utility has a new option to back up just the index configuration files from ZooKeeper, with no index data.

The gptext-restore utility is updated to restore backups created on local cluster storage.

The gptext-restore utility has a new option to restore only the configuration files from a backup. This option loads the configuration files into ZooKeeper and creates an empty GPText index.

New Features and Enhancements in GPText 2.3.0

Revised gptext-config Utility Syntax

The gptext-config command-line utility was revised to have a more user-friendly syntax.

A new list subcommand was added to gptext-config that you can use to list all of the configuration files for a specified GPText index.

    $ gptext-config list -i

Index Documents in a Hadoop File System (hdfs) Document Source

GPText 2.3.0 enables you to add documents stored in an hdfs system to a GPText external index.

  • The new gptext-external command-line utility uploads Hadoop configuration and authentication files to a named configuration in ZooKeeper. The utility has subcommands upload , list , and delete to manage the configurations you have uploaded.
  • The new gptext.external_login() function logs in to the hdfs system using the named configuration you have uploaded. You can log in to only one external document source at a time.
  • Use URLs of the form hdfs:// with the gptext.index() and gptext.index_external() functions to add documents to a GPText external index.
  • Use the new gptext.index_external_dir() function to add all documents in an hdfs directory to a GPText external index.
  • Log out of the hdfs external document source with the new gptext.external_logout() function.

See Authenticating with an External Document Source for steps to enable access to an hdfs document source.

Known Issues

See the Apache Jira (https://issues.apache.org/jira/projects/SOLR/summary) for known issues in Apache Solr.

Following are known issues in GPText. Workarounds are provided when available.

Wildcards in GPText Search Options

Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names. For example, given a table with columns contenta and contentb , specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) correctly returns contenta and contentb , but omits the pseudo-field sum(1,1) .

Specifying a wildcard to match all fields ( fl=*,sum(1,1) ) also omits the pseudo-field.

Index Load Failure After Configuration File Error

If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit managed-schema or solrconfig.xml and introduce an XML syntax error or a typo in configuration values.

Workaround:

1. When an index fails to load, check the Solr log to find the cause.

2. If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.

3. If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.

Startup Failure with Large Numbers of Indexes

When there is a large number of Solr cores, SolrCloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.

Setting GPText Configuration Parameters Without First Setting custom_variable_classes

In Greenplum Database versions before Greenplum Database 6, if the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:

    mydb-# set gptext.replication_factor=4;
    WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
    WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
    ERROR: unrecognized configuration parameter "gptext.replication_factor"

In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.

    mydb-# show gptext.replication_factor;
     gptext.replication_factor
    ----------------------------
     0

Beginning with GPText 2.1, the error message is still generated, but the value saved in ZooKeeper is the value specified in the set command, 4 in the preceding example.

To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:

    $ gpconfig -c custom_variable_classes -v 'gptext'

In Greenplum Database 6.0, the custom_variable_classes configuration parameter is removed and custom parameters can be set without errors.


Installing GPText

Prerequisites

The GPText installation includes the installation of Apache SolrCloud and, optionally, Apache ZooKeeper.

If you are installing a new GPText release into an existing GPText system, follow the instructions in Upgrading GPText instead.

Following are GPText installation prerequisites.

  • Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io .
  • GPText runs on Red Hat Enterprise Linux or CentOS 5.x, 6.x, or 7.x.
  • GPText cannot be installed onto a shared NFS mount.
  • Install a Java 8 (1.8) or Java 11 JRE on all hosts in the cluster.
  • Ensure that nc (netcat) is installed on all Greenplum cluster hosts ( yum install nc ).
  • Installing lsof on all cluster hosts is recommended ( sudo yum install lsof ).

GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.

If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).

Apache Solr requires a ZooKeeper cluster with a minimum of three nodes. You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster with at least three nodes (five nodes recommended) on separate hosts with network connectivity to the Greenplum network.

Install the GPText Binary Distribution

1. On the Greenplum master host, extract the GPText distribution file. For example:

    $ cd /home/gpadmin
    $ tar xvfz greenplum-text--.tar.gz

   This creates the directory greenplum-text-- containing the files gptext_install_config and the GPText installation binary, which has a name in the format greenplum-text--.bin .

2. If necessary, grant execute permission to the GPText binary. For example:

    $ chmod +x /home/gpadmin/greenplum-text--.bin

3. If you are installing GPText in a directory that is only writable by root, such as the default directory /usr/local , perform these steps as root:

   a. Source the greenplum_path.sh file in the Greenplum Database installation directory.

    # source /usr/local/greenplum-db-/greenplum_path.sh

   b. Locate or create a text file containing a list of the names of all hosts where you will install GPText, one per line, including the master and standby host names.

   c. Start gpssh, specifying the text file with host names.

    # gpssh -f hostlist.txt

   d. Create the installation directory and the greenplum-solr directory and set the ownership and permissions. For example, if you are installing GPText in the default directory, /usr/local :

    => mkdir /usr/local/greenplum-text-
    => mkdir /usr/local/greenplum-solr
    => chown gpadmin:gpadmin /usr/local/greenplum-text-
    => chmod 775 /usr/local/greenplum-text-
    => chown gpadmin:gpadmin /usr/local/greenplum-solr
    => chmod 775 /usr/local/greenplum-solr
    => exit

   e. Complete the remaining steps as the gpadmin user.

4. Edit the gptext_install_config file to set parameters for the installation. See Set Installation Parameters for details.

5. Run the GPText installation binary as gpadmin on the master server:

    $ ./greenplum-text--.bin -c

6. Accept the Pivotal license agreement.

Optional Two-Part GPText Installation

The GPText two-part installation installs and deploys the GPText software in separate steps. This gives you the option to install the software files to a read-only, shared directory mounted on all GPText hosts in the cluster, rather than installing the software on every GPText host.

If you install the GPText software onto a shared drive, you must set the GPTEXT_CUSTOM_CONFIG_DIR parameter in the installation configuration file. This parameter specifies a writable directory that exists on every GPText host where GPText can store configuration files for external data sources. See GPText installation parameters for more information about this parameter.

Run the GPText installation in two parts by following the steps in this section.

1. Prepare GPText installation directories as described in steps 1 through 3 in Install the GPText Binary Distribution.

2. Run the GPText installation binary as gpadmin on the master server:

    $ ./greenplum-text-.bin-b

   Note that the -c option is omitted.

3. Source the GPText environment script in the GPText installation directory:

    $ source /greenplum-text_path.sh

4. Edit the gptext_install_config file to set parameters for the GPText deployment. See Set Installation Parameters for details. Be sure to uncomment and set the GPTEXT_CUSTOM_CONFIG_DIR parameter if you installed the software on a read-only drive.

5. Deploy the GPText cluster with the gptext-deploy command. The command requires the -c option to specify the installation configuration file. Also include the -m option because you installed the GPText software to a shared drive mounted on all GPText hosts. If you do not include -m , gptext-deploy copies the GPText software to all GPText hosts.

    $ gptext-deploy -m -c

Set Installation Parameters

A GPText configuration file named gptext_install_config contains parameters to configure the GPText installation. Edit the file and set the parameters as described in the following table.

The GPTEXT_HOSTS and DATA_DIRECTORY installation parameters determine the number of GPText nodes that are deployed. The number of directories included in the DATA_DIRECTORY array is the number of GPText nodes that are created per host.

The GPTEXT_HOSTS parameter determines the number of hosts. If set to the constant "ALLSEGHOSTS" , the number of GPText node hosts is the same as the number of Greenplum segment hosts. If GPTEXT_HOSTS is set to an array of host names, the length of the array is the number of GPText node hosts.
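As a rough illustration of how the two parameters combine, the following sketch counts the nodes that an explicit host list and data directory list would produce. It uses plain space-separated strings rather than the declare -a arrays of the real gptext_install_config file so that it runs in any POSIX shell; the variable names and the count_words helper are hypothetical, and the "ALLSEGHOSTS" case is not covered.

```shell
#!/bin/sh
# Print the number of arguments received (one per whitespace-separated word).
count_words() { echo "$#"; }

# Hypothetical stand-ins for the GPTEXT_HOSTS and DATA_DIRECTORY arrays.
gptext_hosts="gptext_h1 gptext_h2 gptext_h3"
data_directory="/data/primary /data/primary"

# The unquoted expansions are intentional: the shell splits each list
# into words so that count_words can count the entries.
hosts=$(count_words $gptext_hosts)
nodes_per_host=$(count_words $data_directory)
total_nodes=$((hosts * nodes_per_host))

echo "$hosts hosts x $nodes_per_host nodes per host = $total_nodes GPText nodes"
```

With three hosts and two data directories, six GPText nodes would be deployed.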

The maximum number of GPText nodes is the number of Greenplum Database primary segments. The best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes allowed. For example, if there are eight primary segments per host in the Greenplum Database cluster, the maximum number of GPText nodes per host is eight, but you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.

GPText installation parameters

GPTEXT_HOSTS

An array of host names on which to install GPText, or the constant "ALLSEGHOSTS" to install GPText on all Greenplum Database segment hosts. GPText hosts must be passwordless ssh-accessible by the gpadmin user from all other hosts in the Greenplum cluster.

    declare -a GPTEXT_HOSTS=(gptext_h1 gptext_h2 gptext_h3)

    GPTEXT_HOSTS="ALLSEGHOSTS"

DATA_DIRECTORY

An array of directory paths where GPText data directories are to be created. The number of directories in the array determines the number of GPText nodes that will be created on each physical host. If GPTEXT_HOSTS lists multiple interfaces per host, the GPText nodes are spread evenly across the interface addresses.

    declare -a DATA_DIRECTORY=(/data/primary /data/primary)

GPTEXT_CUSTOM_CONFIG_DIR

The path to a directory where GPText stores uploaded external data source configuration files and custom libraries. If you do not set this parameter, the default is to store these files in the share subdirectory of the GPText installation directory. If you do specify a directory with this parameter, the directory is created on every Solr host in the cluster, and external configuration files and custom libraries will be stored there, leaving the GPText installation directory free from application data.

JAVA_OPTS

Sets the minimum and maximum memory each SolrCloud JVM can use.

    JAVA_OPTS="-Xms1024M -Xmx2048M"

GPTEXT_PORT_BASE
GP_MAX_PORT_LIMIT

Set a range of port numbers available to GPText nodes. GPText finds unused ports in the specified range.

    GPTEXT_PORT_BASE=18983
    GP_MAX_PORT_LIMIT=28983

ZOO_CLUSTER

Whether to deploy a GPText binding ZooKeeper cluster or use an existing ZooKeeper cluster. If set to "BINDING" , the installation deploys a ZooKeeper cluster. To use an existing ZooKeeper cluster, set this parameter to a list of ZooKeeper nodes in the format "host1:port,host2:port,host3:port" .

    ZOO_CLUSTER="BINDING"

ZOO_HOSTS

If ZOO_CLUSTER is set to "BINDING" , this parameter is an array of the hosts where the ZooKeeper nodes are to be installed. The array must contain 3, 5, or 7 host names, for example ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5) . If you are using a single host for ZooKeeper, specify it multiple times, for example, ZOO_HOSTS=(sdw1 sdw1 sdw1) .

    declare -a ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5)

ZOO_DATA_DIR

The ZooKeeper data directory, required when ZOO_CLUSTER is set to "BINDING" .

    ZOO_DATA_DIR="/data/master/"

ZOO_GPTXTNODE

The node path in ZooKeeper for GPText. This parameter is required whether ZOO_CLUSTER is set to "BINDING" or a list of hosts.

    ZOO_GPTXTNODE="gptext"

ZOO_PORT_BASE
ZOO_MAX_PORT_LIMIT

A range of port numbers to use for the ZooKeeper cluster. Unused ports are allocated from within this range. The range must contain at least 4000 port numbers.

    ZOO_PORT_BASE=2188
    ZOO_MAX_PORT_LIMIT=12188

GPTEXT_JAVA_HOME

The home directory of the Java installation to run for ZooKeeper and Solr processes. If not set, the JRE specified in the PATH and JAVA_HOME environment variables will be used.

    GPTEXT_JAVA_HOME=/usr/java/jdk1.8.0_131
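Two of the constraints in the table are easy to check before running the installer: the number of ZOO_HOSTS entries and the size of the ZooKeeper port range. The sketch below is not part of GPText; the function names are hypothetical, and it assumes the range size is measured as ZOO_MAX_PORT_LIMIT minus ZOO_PORT_BASE, which the guide does not state explicitly.

```shell
#!/bin/sh
# Succeeds only for the host counts the table allows: 3, 5, or 7.
valid_zoo_host_count() {
    case "$1" in
        3|5|7) return 0 ;;
        *)     return 1 ;;
    esac
}

# Succeeds when the range spans at least 4000 port numbers.
# $1 = ZOO_PORT_BASE, $2 = ZOO_MAX_PORT_LIMIT
# Assumption: the span is computed as limit minus base.
valid_zoo_port_range() {
    [ $(( $2 - $1 )) -ge 4000 ]
}

if valid_zoo_host_count 5 && valid_zoo_port_range 2188 12188; then
    echo "ZooKeeper settings look consistent"
else
    echo "Check ZOO_HOSTS and the ZooKeeper port range" >&2
fi
```

The sample values are the ones shown in the table: five hosts and a port range of 2188 to 12188.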

Starting GPText

First, make sure the GPText command-line utilities are in your path by sourcing the Greenplum Database and GPText environment scripts. It is important to source the GPText environment script each time you source the Greenplum Database script. For example:

    $ source /usr/local/greenplum-db-/greenplum_path.sh
    $ source /usr/local/greenplum-text-/greenplum-text_path.sh

To use GPText in a database, you must first use the gptext-installsql management utility to install the GPText user-defined functions and other objects in the database:

    $ gptext-installsql database [database2 ...]

The GPText objects are created in the gptext schema.

The ZooKeeper cluster must be running before you start GPText. If you installed a binding ZooKeeper cluster, start it with the zkManager command-line utility.

    $ zkManager start

Start GPText with the gptext-start utility.

    $ gptext-start

    ConfigureGreenplumDatabaseGPTextconfigurationparametersaresavedinZooKeeper.Youcan,however,viewandsetGPTextconfigurationparametersinaGreenplumDatabasesessionusingthe SHOW and SET commands.

    IfyouareusingGreenplumDatabase4.3.xor5.x,youmustfirstdeclaretheGPTextcustomvariableclassbyaddingittotheGreenplumDatabasecustom_variable_classes configurationparameter.The custom_variable_classes parameterisremovedinGreenplumDatabase6,sothisstepisunnecessaryifyouhaveGreenplumDatabase6.

    The custom_variable_classes configurationparameterisacomma-separatedlistofclassnames.Itisunsetbydefault.Toseeifanycustomvariableclasseshavealreadybeenconfigured,runthis gpconfig commandatthecommandline.

    $gpconfig-scustom_variable_classes

    Ifnocustomvariableclasseshavebeenset,settheparameterwiththefollowingcommand.

    ©CopyrightPivotalSoftware,Inc,2013-2019 13 3.3.0

    [gpadmin@gpsne ~]$ gpconfig -c custom_variable_classes -v 'gptext'
    20171029:12:29:11:028199 gpconfig:gpsne:gpadmin-[INFO]:-completed successfully

    If other classes have been configured, add gptext to the existing list, separated by a comma.

    Run gpstop -u to have Greenplum Database reload the configuration file.
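The rule for combining class names can be sketched in a few lines of Python. This is an illustration only: the helper name `merged_custom_variable_classes` and the example class name `plr` are assumptions made for the sketch and are not part of GPText or Greenplum Database. The helper takes the current parameter value (as reported by gpconfig -s) and returns the value you would pass to gpconfig -c custom_variable_classes -v.

```python
def merged_custom_variable_classes(current, new_class="gptext"):
    """Return a comma-separated class list that contains new_class exactly once.

    `current` is the existing parameter value; an empty string or None
    means the parameter is unset.
    """
    # Split the existing value and discard empty entries and stray whitespace.
    classes = [c.strip() for c in (current or "").split(",") if c.strip()]
    # Append the new class only if it is not already listed.
    if new_class not in classes:
        classes.append(new_class)
    return ",".join(classes)


print(merged_custom_variable_classes(""))            # gptext
print(merged_custom_variable_classes("plr"))         # plr,gptext
print(merged_custom_variable_classes("plr,gptext"))  # plr,gptext
```

The helper is idempotent, so running it against a list that already contains gptext leaves the value unchanged.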

    View or set GPText Configuration Parameters

    When you want to view or set GPText configuration parameters in a psql session, first execute the gptext.version() function to load the GPText configuration parameters into the session.

    =# SELECT gptext.version();
                version
    --------------------------------
     Greenplum Text Analytics 3.2.0
    (1 row)

    =# SHOW gptext.idx_delim;
     gptext.idx_delim
    ------------------
     ,
    (1 row)

    See Setting GPText Configuration Parameters for more about GPText configuration parameters.

    Uninstalling GPText

    To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.

    gptext-uninstall runs only if there is at least one database with a GPText schema.

    Execute:

    $ gptext-uninstall


    Upgrading GPText

    Upgrading a GPText system to a new GPText release installs the new GPText software release on all hosts in the Greenplum cluster and then upgrades the GPText system.

    Upgrading GPText and Greenplum Database at the Same Time

    If you are upgrading to new releases of Greenplum Database and GPText at the same time, follow these steps:

    1. Complete the Greenplum Database upgrade first and ensure the database is operational.

    2. Run the GPText gptext-migrator utility to migrate your current GPText system to the newly upgraded Greenplum Database system.

    3. Ensure that the current version of GPText works with the new Greenplum Database version.

    4. Proceed with the GPText upgrade.

    Upgrading a GPText Release

    Upgrading a GPText release is a two-part process: install the new software release on the Greenplum cluster hosts and then upgrade the existing GPText system. The GPText installer performs the first part, installing the new software. The gptext-upgrade utility performs the second part, upgrading the current GPText system to the new version.

    The GPText installer detects an existing GPText system and, after installing the new software release, offers to run the gptext-upgrade utility for you. If you choose to upgrade the GPText system later, you can run the gptext-upgrade utility yourself.

    All upgrade tasks are executed on the Greenplum master host as the gpadmin user. The gpadmin user must have write permission in the directory where the new GPText release is to be installed, /usr/local/greenplum-text-- by default.

    The Greenplum Database, ZooKeeper, and GPText clusters must be running. The procedure stops and restarts GPText during the upgrade.

    Follow these steps:

    1. Download the new GPText release for your platform from Pivotal Network (http://network.pivotal.io).

    2. Extract the release package.

    $ tar xfz greenplum-text--.tar.gz

    3. Make sure that ZooKeeper and GPText are running.

    $ gptext-state

    4. Run the GPText installer.

    $ ./greenplum-text--.bin

    5. The installer prompts you to accept the Pivotal license agreement and to choose and create the installation directory.

    6. The installer verifies the environment to ensure that prerequisites are present, such as Python and Java. If any problems are discovered, the installer outputs an error message and stops. Correct the problem identified by the message and run the installer again.

    7. After the new software has been installed on the Greenplum cluster, the installer looks for an existing GPText installation. If an existing GPText system is found, the installer asks if you wish to upgrade GPText directly.

    If you answer yes, the installer runs the gptext-upgrade script. The gptext-upgrade utility validates the environment to ensure it can complete the upgrade, then executes the upgrade and restarts the GPText system. If any problems are discovered, gptext-upgrade outputs a message and quits. Fix the indicated problems and run the gptext-upgrade utility (at /bin/gptext-upgrade under the new GPText installation directory) to complete the GPText system upgrade. If you answer no, you must run the gptext-upgrade script after the installer completes. See the gptext-upgrade utility reference for instructions.

    When upgrading GPText, you do not specify an installation configuration file as you do for the initial GPText installation.

    Important: If you answer no, or if the gptext-upgrade utility quits without upgrading your software, follow these steps to re-run gptext-upgrade at a later time:

    a. Source the greenplum-text_path.sh script in the old GPText installation directory. For example:

    $ source /usr/local/greenplum-text-/greenplum-text_path.sh

    b. Run the gptext-upgrade command from the new GPText installation directory:

    $ /usr/local/greenplum-text-/bin/gptext-upgrade

    8. After the upgrade has completed, source the greenplum-text_path.sh script in the new GPText release directory and run gptext-state healthcheck to verify the GPText system:

    $ source /usr/local/greenplum-text-/greenplum-text_path.sh
    $ gptext-state healthcheck


    Introduction to Pivotal GPText

    Pivotal GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search. GPText includes powerful text search as well as support for text analysis. GPText supports business decision making by offering:

    Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which exponentially increases the kinds of information you can find.

    Multiple document sources: GPText can index documents stored in Greenplum Database tables or documents retrieved from external stores, such as HTTP or FTP servers, Amazon S3, or Hadoop HDFS. Most document formats are recognized automatically.

    Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.

    Natural language text processing: GPText provides NLP capabilities with the integrated Apache OpenNLP toolkit.

    Text analytics: You can use Apache MADlib in Greenplum Database for advanced machine learning, graph, statistics, and analytics.

    This chapter contains the following topics:

    GPText System Architecture

    GPText Sample Use Case

    GPText Workflow

    Text Analysis

    GPText System Architecture

    GPText combines a Greenplum Database cluster with an Apache SolrCloud cluster. Greenplum Database segments and GPText nodes can be deployed on the same hosts or on different hosts with network connectivity.

    The following figure shows the process architecture of the combined Greenplum Database and Apache Solr clusters. The figure shows four cluster nodes with four Greenplum segments and four Solr instances deployed on each. An Apache ZooKeeper service manages the SolrCloud cluster. ZooKeeper nodes are deployed on three of the four hosts. Greenplum Database users access SolrCloud services via GPText user-defined functions installed in Greenplum databases and command-line utilities.


    The figure omits the Greenplum master host, secondary master, and mirror segments for the Greenplum primary segments.

    The Greenplum segments, Solr instances, and ZooKeeper nodes may all be deployed on separate hosts on the same network, depending on application and performance requirements.

    The following sections describe how GPText integrates SolrCloud with Greenplum Database and how the two clusters work together to provide parallel text search capabilities in Greenplum Database and maintain high availability.

    Greenplum Database Cluster

    A Greenplum Database cluster consists of the following components:

    A master database instance, executing on a dedicated host, conventionally named mdw. (Not illustrated)

    A secondary master instance, on a host conventionally named smdw, acting as a warm standby for the master instance. (Not illustrated)

    An array of database primary segment instances and mirrors deployed on segment hosts, by convention sdw1 through sdwn. A segment instance is an independent Postgres database server managing a portion of the distributed data. Each segment has a mirror (not illustrated) on another host in the cluster to provide uninterrupted service in case of a segment or segment host failure. The number of primary segments per host is determined by the hardware configuration (the number and type of processor cores, the amount of physical RAM, local storage capacity, and network capacity) as well as availability and performance requirements.

    The Greenplum Database master instance, which stores no user data, coordinates the work of the segment instances. Database users log in to the master instance and submit SQL queries. The master instance creates a plan for executing the query, distributes the work to the segments, and gathers and returns the results to the user.

    Apache SolrCloud

    Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache SolrCloud is a highly available, fault-tolerant cluster of Apache Solr servers. The term GPText cluster is another way to refer to a SolrCloud cluster deployed by GPText for use with a Greenplum Database system.

    A SolrCloud cluster consists of the following components:

    An Apache ZooKeeper cluster to manage the SolrCloud cluster. SolrCloud uses ZooKeeper to manage server and index configurations and to coordinate the cluster's activities. GPText can install a ZooKeeper cluster that is bound to the GPText cluster, or it can share an existing ZooKeeper cluster. If GPText installs the ZooKeeper cluster, it can be managed using GPText functions and utilities. The ZooKeeper cluster can be deployed on Greenplum Database cluster hosts or, for best performance, on separate hosts accessible to the Greenplum Database cluster.

    Multiple SolrCloud server instances deployed on the Greenplum segment hosts or on other hosts on the same network. Each instance is a JVM process running Solr server. SolrCloud instances use local storage, which may be the same local storage volumes that store Greenplum Database data. The number of SolrCloud instances per host can be the same as the number of Greenplum primary segments per host, but this is not a requirement. The number of instances to execute per host is specified during GPText installation.

    GPText provides document indexing and search capabilities for Greenplum Database with user-defined functions (UDFs) that access Solr APIs from within database queries.

    GPText UDFs perform the following tasks:

    create and manage GPText indexes

    provide status information about indexes

    insert documents into indexes from database tables or, for GPText external indexes, from documents stored outside of Greenplum Database

    search indexes

    There are also GPText UDFs and command-line utilities to configure, monitor, and manage the SolrCloud cluster, and to manage replicas, SolrCloud's high-availability mechanism. (More on replicas in the next section.)

    Parallelism in GPText Indexing and Searching

    SolrCloud distributes document indexes in slices called shards. Each shard is managed by a SolrCloud instance, and ZooKeeper ensures that the shards are distributed evenly among the SolrCloud instances. The SolrCloud instances and Greenplum segments are not required to be on the same hosts.

    With GPText, the default number of shards for an index is the number of Greenplum Database segments, so that each segment operates on an equal portion of the index. Optionally, a smaller number of shards can be specified when you create a GPText index, allowing indexing workloads to be scaled for performance requirements and resource usage.

    High Availability for GPText Indexes

    SolrCloud provides high availability by maintaining replicas of shards and providing automatic failover if a shard fails or becomes unavailable. One replica of each shard is the lead replica and any changes to it are applied to the other replicas. The replication factor, which determines the number of replicas to maintain for each shard, is set when the index is created. Replicas may also be added or dropped later using GPText UDFs or command-line utilities.

    ZooKeeper determines the locations of shard replicas among the Solr nodes and hosts. When adding a replica using a GPText UDF or command-line utility, a new shard can be explicitly placed on a SolrCloud instance.

    GPText Sample Use Case

    Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:

    1. Load the email records into a Greenplum database.

    2. Create a Solr index of the email records.

    3. Run queries that look for text strings and their authors.

    4. Refine the queries until they pair a dummy company name with the top three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.

    GPText Workflow

    GPText works with Greenplum Database and Apache SolrCloud to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.

    This topic describes the following information:


    Data Loading and Indexing Workflow

    Querying Data Workflow

    Data Loading and Indexing Workflow

    The following diagram shows the GPText workflow for loading and indexing data.

    All client interaction with the system is through the Greenplum master instance.

    1. Load data into your Greenplum Database system. Create a database table to hold data and then add the data to the table. Greenplum provides parallel data loading utilities and protocols that help to transform and load external data in various formats and from various sources. For details, see the Greenplum Database Administrator Guide, at http://gpdb.docs.pivotal.io. You can also create an external index for documents you retrieve from a web server, FTP server, Amazon S3, or HDFS.

    2. Create and configure an empty GPText index. Use the gptext.create_index() user-defined function (UDF) to create an empty GPText index for a database table. GPText stores configuration files for the index in ZooKeeper.

    3. Customize the index, if desired, by editing the index configuration files with the gptext-config command-line utility. You can customize the way document text is tokenized, filtered, and transformed before storing in the index and how query text is prepared to search the index.

    4. Populate the index with data from the database table or external data source. Use the gptext.index() or gptext.index_external() UDF to add data to the index. These UDFs work by dispatching SQL queries to execute on each Greenplum segment. The segments execute the queries and add the results to the index using Solr APIs.

    5. Commit changes to the index. Commit changes to the GPText index by calling the gptext.commit_index() UDF. Until the changes are committed, queries executed on the index cannot access any data added to the index with gptext.index(). If needed, uncommitted changes can be rolled back. SolrCloud replicates changes committed to the lead replica to the shards' non-lead replicas.

    Querying Data Workflow

    The following diagram shows the high-level GPText query process workflow:


    1. A user submits a SQL query designed to search the indexed data. A GPText search query is a SQL SELECT statement on a GPText search UDF that contains full-text search expressions.

    2. The Greenplum master dispatches the query to the Greenplum Database segments.

    3. Each segment executes the query, using the Solr API to search its index shard. Solr analyzes and executes the search query on the lead replica for the shard.

    4. The Greenplum Database segments return the results of the search query to the Greenplum Database master.

    5. The Greenplum Database master aggregates the results from all segments and returns them to the client.

    Text Analysis

    GPText enables analysis of Solr indexes with Apache MADlib, an open source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.

    Learn more about Apache MADlib at http://madlib.apache.org. A gppkg package for MADlib is available on the Pivotal network at http://network.pivotal.io.

    The Apache OpenNLP toolkit provides advanced machine learning tools for tokenizing, recognizing, and tagging natural language text that you can enable for GPText indexing and searching. See Natural Language Processing with GPText Indexes for more information.


    Administering GPText

    GPText administration includes security considerations, monitoring Solr index statistics, managing and monitoring ZooKeeper, and troubleshooting.

    Viewing the Cluster Configuration

    GPText deploys Apache ZooKeeper and Apache Solr nodes on hosts in your Greenplum Database network. Each node is a JVM server process listening for requests from other nodes. Use the gptext-state config command to list the host and port for each ZooKeeper and Solr node and the memory configuration for Solr nodes.

    $ gptext-state configs
    20181112:12:38:26:018080 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Cluster Configurations.
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------------------------------
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-JVM Min|Max   Xms1024M|Xmx2048M
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Node information
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Node Name         Port    Solr Dir
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18983   18983   /data/gptext/solr0
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18984   18984   /data/gptext/solr1
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18983   18983   /data/gptext/solr0
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18984   18984   /data/gptext/solr1
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Zookeeper information
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Port   Zookeeper Dir
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-mdw    2189   /data/zoo/zoo0
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   2189   /data/zoo/zoo0
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   2189   /data/zoo/zoo0
    20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Done.

    You don't need these details to use the GPText functions and utilities, but the information can be useful for monitoring and troubleshooting the cluster. For example, you can access the Solr Admin UI by browsing to the URL http://: with the host and port of any Solr node. See Using the Solr Administration Interface (https://lucene.apache.org/solr/guide/7_3/using-the-solr-administration-user-interface.html#using-the-solr-administration-user-interface) for information about the Solr Admin UI.

    Changing GPText Server Configuration Parameters

    Configuration parameters used with GPText are built into GPText with default values. You set new values for the parameters in a Greenplum Database session using the SET command, the same way you set Greenplum Database session parameters. When you enter the SET command, GPText updates the value in ZooKeeper so that the change persists between database sessions.

    With Greenplum Database 4.x and 5.x, a one-time Greenplum Database configuration change is needed so that Greenplum Database allows you to set and display GPText configuration parameters. Until you have performed this step, any attempt to set a GPText parameter results in an "Unrecognized configuration parameter" error. You must declare a custom variable class for GPText.

    The custom_variable_classes configuration parameter is removed in Greenplum Database 6. You can set custom variables in a database session without error, so this step is not needed for Greenplum Database 6.

    As the gpadmin user, enter the following commands in a shell:

    $ gpconfig -c custom_variable_classes -v 'gptext'
    $ gpstop -u

    Once this step is completed, you can view and set GPText configuration parameters in psql.

    To view GPText configuration parameters, you first need to fetch them from ZooKeeper into your Greenplum Database session by executing the gptext.version() UDF.

    =# SELECT gptext.version();
                           version
    ------------------------------------------------------
     Greenplum Text Analytics 3.2.0
    (1 row)


    Then you can use the SHOW command to display values of the parameters, for example:

    =# SHOW gptext.idx_num_shards;
     gptext.idx_num_shards
    -----------------------
     0
    (1 row)

    See GPText Configuration Parameters for a complete list of configuration parameters.

    GPText uses the current values of the configuration parameters when you create a new index, so changing a configuration parameter affects new indexes, but does not affect existing indexes.

    Change the values of GPText configuration variables using the SET command in a session with a database that contains the GPText schema. The following example sets values for three configuration parameters in a psql session:

    =# set gptext.idx_buffer_size=10485760;
    SET
    =# set gptext.idx_delim='|';
    SET
    =# set gptext.extension_factor=5;
    SET

    You can view the new value of a configuration parameter that you have set using the SHOW command:

    =# show gptext.idx_delim;
     gptext.idx_delim
    ------------------
     |
    (1 row)

    Security and GPText Indexes

    GPText security is based on Greenplum Database security. Your privileges to execute GPText functions depend on your privileges for the database table that is the source for the index. For example, if you have SELECT privileges for a table in the Greenplum Database database, then you have SELECT privileges for an index generated from that table.

    Executing GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.

    ZooKeeper Administration

    Apache ZooKeeper enables coordination between the Apache Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. In ZooKeeper, a node (called a znode) can contain data, like a file, and can have child znodes, like a directory. ZooKeeper replicates data between multiple instances deployed as a cluster to provide a highly available, fault-tolerant service. Both Solr and GPText store configuration files and share status by writing data to ZooKeeper znodes. GPText stores information in the /gptext znode. The configuration files for a GPText index are in the /gptext/configs/ znode.

    The number of ZooKeeper instances in the cluster determines how many ZooKeeper node failures the cluster can tolerate and still remain active. The service remains available as long as a majority of the nodes in the cluster are running and able to communicate with each other. To tolerate a failure of n nodes, the cluster must have 2n + 1 nodes. A cluster of five nodes, for example, can tolerate two failed nodes.
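The sizing rule can be checked with a short calculation. The following Python sketch is only an illustration of the 2n + 1 arithmetic; the function names `nodes_required` and `failures_tolerated` are hypothetical and are not part of GPText or ZooKeeper.

```python
def nodes_required(tolerated_failures: int) -> int:
    """Smallest cluster size that survives the given number of failed nodes."""
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures must be non-negative")
    return 2 * tolerated_failures + 1


def failures_tolerated(cluster_size: int) -> int:
    """Number of nodes that can fail while a majority of the cluster remains."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return (cluster_size - 1) // 2


for size in (3, 4, 5, 7):
    print(size, "nodes tolerate", failures_tolerated(size), "failure(s)")
```

Under this rule a four-node cluster tolerates one failure, the same as a three-node cluster, which is why odd cluster sizes are the usual choice.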

    ZooKeeper is very fast for read requests because it stores data in memory. If ZooKeeper begins to swap memory to disk, Solr and GPText performance will suffer and they could experience failures, so it is critical to allocate sufficient memory to the ZooKeeper Java processes. To avoid ZooKeeper instances competing with Greenplum Database segments for memory, you should deploy the ZooKeeper instances and Greenplum Database segments on different hosts. The ZooKeeper and Greenplum Database hosts must be on the same network and accessible with passwordless SSH by the gpadmin user. You can use the Greenplum Database gpssh-exkeys utility to share SSH keys between ZooKeeper and Greenplum Database hosts.

    You must start the ZooKeeper cluster before you start GPText. When you start GPText, the Solr nodes each load the replicas for indexes they manage. With large numbers of indexes, shards, and replicas, starting up the cluster can generate a very high, atypical load on ZooKeeper. It can take a long time to get all indexes loaded and some ZooKeeper requests may time out waiting for responses. Using the gptext-start --slow_start option starts Solr nodes one at a time, providing a more ordered start-up and limiting the number of concurrent ZooKeeper requests.


    The GPText command-line utility zkManager can be used to monitor the ZooKeeper cluster. If the ZooKeeper cluster is bound to GPText, you can also start and stop the cluster using zkManager.

    Checking ZooKeeper Status

    Use the zkManager utility from the command line to check the ZooKeeper cluster status. The utility lists the hosts, ports, latency, and follower/leader mode for each ZooKeeper instance. If a node is down, its mode is listed as Down.

    To check the ZooKeeper cluster status, run the zkManager state command.

    $ zkManager state
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper state process.
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Host   port   Latency min/avg/max   Mode
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2189   0/0/22                follower
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2190   0/0/29                leader
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2188   0/0/27                follower
    20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Done.

    In a database session, you can use the gptext.zookeeper_hosts() function to list the ZooKeeper hosts.

    =# SELECT * FROM gptext.zookeeper_hosts();
      host  | port
    --------+------
     gpdb51 | 2188
     gpdb51 | 2189
     gpdb51 | 2190
    (3 rows)

    Starting and Stopping the ZooKeeper Cluster

    If the ZooKeeper cluster was installed by the GPText installer, the zkManager utility can start or stop the ZooKeeper cluster. To start the cluster, run the zkManager start command.

    $ zkManager start
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper start process
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Starting Zookeeper:
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
    20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
    20171016:16:14:48:017845 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
    20171016:16:14:53:017845 zkManager:gpdb:gpadmin-[INFO]:-Done.

    To stop ZooKeeper, run the zkManager stop command.

    $ zkManager stop
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper stop process.
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Stop Zookeeper:
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
    20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
    20171016:16:14:09:016499 zkManager:gpdb:gpadmin-[INFO]:-Done.

    See the zkManager reference for more information.

    Checking SolrCloud Status

    You can check the status of the SolrCloud cluster and indexes by running the gptext-state utility from the command line.


    To check the state of the GPText nodes and each index, run the gptext-state utility with the -D (--details) option. Example:

    $ gptext-state -D
    20180615:16:09:24:031986 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
    20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
    20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster status...
    20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Current GPText Version: 3.0.0
    20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-All nodes are up and running.
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Index state details.
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-database   index name                state
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.twitter.message      Green
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.wikipedia.articles   Green
    20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Done.

    This command reports the status of the GPText nodes and the status of each GPText index.

    Run gptext-state list to view just the indexes.

    The gptext-state healthcheck command checks the GPText configuration files, the index status, required disk space, user privileges, and index and database consistency. By default, the required disk space check passes if at least 20% of disk space is free. You can set a different disk free threshold using the --disk_free option. For example:

    [gpadmin@gpdb-sandbox ~]$ gptext-state healthcheck --disk_free=25
    20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
    20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText config files ...
    20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
    20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText index status ...
    20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
    20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required disk space ...
    20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
    20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required user privileges ...
    20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
    20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for indexes and database consistency ...
    20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
    20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Done.

    See the gptext-state utility reference for additional options.
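To make the disk free threshold concrete, the following Python sketch applies the same kind of percentage test to a single file system. It is not how gptext-state healthcheck is implemented; the function names and the choice of the root file system are assumptions made for the illustration.

```python
import shutil


def free_percent(total_bytes: int, free_bytes: int) -> float:
    """Percentage of the file system that is free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def passes_disk_check(total_bytes: int, free_bytes: int,
                      threshold_percent: float = 20) -> bool:
    """True when at least threshold_percent of the file system is free."""
    return free_percent(total_bytes, free_bytes) >= threshold_percent


# Example: check the root file system against a 25% threshold.
usage = shutil.disk_usage("/")
print(passes_disk_check(usage.total, usage.free, threshold_percent=25))
```

Raising the threshold, as with --disk_free=25 in the example above, makes the check stricter: a volume with 22% free space passes the default check but fails a 25% check.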

    Recovering GPText Nodes

    Use the gptext-recover utility to recover down GPText nodes, for example after a failed Greenplum Database segment host is recovered.

    With no arguments, the gptext-recover utility discovers down GPText nodes and restarts them.

    With the -f (or --force) option, if a GPText node cannot be restarted and no shards are down, the node is deleted and created again on the same host. Missing replicas are added and the failed node and failed replicas are removed. If the index is in a red state, gptext-recover -f prints a message and exits.

    The -H (--new_hosts) option allows recreating down GPText nodes on new hosts that replace failed hosts. The down GPText nodes are deleted and recreated on the new hosts. The argument to the -H option is a comma-separated list of the new hosts that are to replace the failed hosts. The number of new hosts must match the number of failed hosts. If shards are down, the utility advises reindexing. If only some replicas are down, it recreates the replicas on the new hosts and updates gptext.conf.

    The -r option recovers replicas, but does not attempt to recover any down nodes.
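The host-count requirement for the -H option can be checked before running the utility. The Python sketch below only builds the comma-separated argument string; the helper name `new_hosts_argument` and the host names are hypothetical, and the sketch does not invoke gptext-recover.

```python
def new_hosts_argument(failed_hosts, new_hosts):
    """Return the comma-separated value for -H, one new host per failed host."""
    if len(new_hosts) != len(failed_hosts):
        raise ValueError(
            "expected %d new host(s), got %d" % (len(failed_hosts), len(new_hosts))
        )
    return ",".join(new_hosts)


# Hypothetical host names, for illustration only.
argument = new_hosts_argument(["sdw3", "sdw4"], ["sdw5", "sdw6"])
print("gptext-recover -H " + argument)
```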

    Note: Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:

    Java 1.8

    Python 2.6

    The Linux lsof utility

    Viewing Solr Index Statistics

    You can view Solr index statistics by running the gptext-state utility from the command line.


    To list all GPText indexes, enter the following command at the command line:

    gptext-state list

    A command line that retrieves all statistics for an index:

    gptext-state --index demo.wikipedia.articles

    A command line that retrieves the number of documents in an index:

    gptext-state --index demo.wikipedia.articles --stats_columns=num_docs

    A command line that retrieves num_docs, index size, and the date and time last_modified:

    gptext-state --index demo.wikipedia.articles --stats_columns num_docs,size,last_modified

    Backing Up and Restoring GPText Indexes

    With the gptext-backup management utility, you can back up a GPText index so that, if needed, you can quickly recover from a failure. The backup can be restored to the same GPText system or to another system with the same number of Greenplum Database segments.

    The gptext-backup management utility backs up an index and its configuration files to either a shared file system, which must be mounted on and writable by each host in the Greenplum Database cluster, or to local storage on the Greenplum Database master and segment hosts.

    Backing Up to a Shared File System

    To back up to a shared file system, use the -p (--path) command-line option to specify the location of a directory on the mounted file system and the -n (--name) option to provide a name for the backup. Specify the index to back up with the -i (--index) option.

    $gptext-backup-i-p--n

    The gptext-backup utilitythenchecksthat:

    theGPTextclusterisup

    thesharedfilesystemisvalid

    thebackupnamespecifiedwiththe -n optiondoesnotalreadyexistinthedirectoryspecifiedwiththe -p option

    Theutilitycreatesthenewdirectoryandthensavesonecopyofeachindexshardtothatdirectory,alongwiththeindex’sconfigurationfilesfromZooKeeper.

    Tosavetheconfigurationfilesonly,withnodata,addthe -c ( --backup_conf )command-lineoption.

    To restore an index from a shared file system, use the gptext-restore management utility. The GPText system you restore to must be on a Greenplum Database cluster with the same number of segments. The database and schema for the index must be present.

    The -i (--index) option specifies the name of the GPText index that will be restored. If the index exists, you must first drop it with the gptext.drop_index() user-defined function.

    The -p (--path) option specifies the location of the directory containing the backup files, that is, the directory that gptext-backup created on the shared file system.

    $gptext-restore-i-p

    You can add the -c option to restore only the configuration files to ZooKeeper and create an empty GPText index, without restoring any saved index data.

    Backing Up to Local Storage


    To back up to local storage on the Greenplum Database cluster, add the local keyword to the gptext-backup command line.

    A local GPText backup has a unique name constructed by appending a timestamp to the index name. You do not use the -n option with local backups.

    $gptext-backuplocal-i

    On the master host, in the master data directory by default, the backup utility saves a JSON file with backup metadata and a directory containing the index’s configuration files from ZooKeeper.

    The utility backs up each index shard on the Greenplum Database segment host with the GPText node that manages the shard’s leader replica. By default, the shard backup files are saved in a segment data directory.

    The gptext-backup command output reports the locations of all backup files.

    You can add the -p (--path) option to the gptext-backup command to specify a local directory where the backup will be saved. The directory must be present on every Greenplum Database host and must be writable by the gpadmin user.

    $gptext-backuplocal-i-p

    The backup files will be saved in the specified directory on each host instead of in the Greenplum Database master and segment data directories.

    To restore a backup saved to local storage, add the local keyword to the gptext-restore command line and specify the path to the backup directory on the master host.

    $gptext-restorelocal-p

    The path given with the -p option is the full path to the directory that the gptext-backup command created on the master host, including the timestamp, for example $MASTER_DATA_DIRECTORY/demo.twitter.message_2018-05-08T15:32:21.397779.

    See gptext-backup for syntax and examples for running gptext-backup. See gptext-restore for syntax and examples for running gptext-restore.
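When scripting around local backups, it can help to split a backup directory name back into its index name and timestamp. The following is a minimal sketch, assuming the directory name has the form of the example above (index name, an underscore, then a timestamp that contains no underscore). The function names and the /data/master/gpseg-1 path are hypothetical and are not part of GPText.

```shell
#!/bin/sh
# Split a local GPText backup directory path into index name and timestamp.
# Assumption: the directory name is "<index name>_<timestamp>" and the
# timestamp contains no underscore, as in the example in this guide.
# The path must not end with a trailing slash.

backup_index_name() {
    name=${1##*/}                  # strip leading directories
    printf '%s\n' "${name%_*}"     # drop the text from the last underscore on
}

backup_timestamp() {
    name=${1##*/}
    printf '%s\n' "${name##*_}"    # keep only the text after the last underscore
}

# Hypothetical master data directory, for illustration only.
dir='/data/master/gpseg-1/demo.twitter.message_2018-05-08T15:32:21.397779'
backup_index_name "$dir"    # prints demo.twitter.message
backup_timestamp "$dir"     # prints 2018-05-08T15:32:21.397779
```

Splitting at the last underscore keeps index names that themselves contain underscores intact.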

    Expanding the GPText Cluster

    The gptext-expand management utility adds GPText nodes to the cluster. There are two ways to add nodes:

    - Add GPText nodes to existing hosts in the cluster. This option increases the number of GPText nodes on each host.
    - Add GPText nodes to new hosts added by using the Greenplum Database gpexpand management utility to expand the Greenplum Database system.

    Adding GPText Nodes to Existing Segment Hosts

    To add nodes to existing segment hosts, run the gptext-expand utility with a command like the following:

    gptext-expand -e -p /data1/nodes,/data2/nodes

    This example adds two GPText nodes to each host.

    The -e (--existing) option specifies that nodes are to be added to existing hosts.

    The -p (--expand_paths) option provides a list of directories where the new nodes’ data directories are to be created. These should be the same directories that contain the Greenplum Database segment data directories and existing GPText data directories. The number of directories in the list is the number of new nodes that are added.

    A directory can be repeated in the directory list multiple times to increase the number of new GPText nodes to create. For example, if there is currently one GPText node per host in the /data1/nodes directory, you could add three nodes with a command like the following:

    gptext-expand -e -p /data1/nodes,/data2/nodes,/data2/nodes

    This adds one node to the /data1/nodes directory and two nodes to the /data2/nodes directory so there are two GPText nodes in each directory.
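Because each entry in the directory list creates one node, it can be useful to check a list before passing it to gptext-expand. The following sketch only counts entries in a comma-separated list. It does not call gptext-expand, the function name is hypothetical, and it assumes directory names contain no spaces or commas.

```shell
#!/bin/sh
# Summarize a comma-separated directory list as it would be given to the
# -p (--expand_paths) option: one line per directory with the number of
# new nodes it would receive, followed by the total.
# Assumption: directory names contain no whitespace or commas.
summarize_expand_paths() {
    printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -c |
        awk '{ printf "%s: %d\n", $2, $1; total += $1 }
             END { printf "total: %d\n", total }'
}

summarize_expand_paths /data1/nodes,/data2/nodes,/data2/nodes
# prints:
# /data1/nodes: 1
# /data2/nodes: 2
# total: 3
```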

    Adding GPText nodes affects new indexes, but not existing indexes. Replicas for new indexes will be distributed across all of the nodes, including both old nodes and the newly created nodes. Replicas for indexes that existed before running gptext-expand are not automatically moved. Rebalancing existing replicas requires reindexing.

    Adding GPText Nodes to New Hosts

    Check that the following GPText prerequisites are installed on each new host added to the Greenplum Database cluster:

    - Java 1.8
    - Python 2.6 or greater
    - Linux lsof utility

    New hosts must be reachable by all hosts in the GPText cluster, including existing hosts and the new hosts you are adding.

    After expanding the Greenplum Database cluster with the gpexpand management utility, call gptext-expand with the -H (--new_hosts) option and a list of the new hosts on which to install GPText:

    gptext-expand -H newhost1,newhost2

    The gptext-expand utility installs GPText binaries on the new hosts and then creates new GPText nodes on the new hosts.

    Expanding a Greenplum Database cluster increases the number of segments, so the number of GPText index shards for existing indexes must be increased to equal the new number of segments. This requires reindexing all existing documents. Newly created indexes will automatically be distributed among the new shards.

    Troubleshooting

    GPText errors are of the following types:

    - Solr errors
    - gptext errors

    Most of the Solr errors are self-explanatory.

    gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.

    Monitoring Logs

    You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:

    segment-directory/pg_log

    Solr logs reside in:

    /solr/logs

    Determining Segment Status with gptext-state

    Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Management Utilities Reference.


    GPText High Availability

    The GPText high availability feature ensures that you can continue working with GPText indexes as long as each shard in the index has at least one working replica.

    A GPText index has one shard for each Greenplum segment, so there is a one-to-one correspondence between Greenplum segments and GPText index shards. The shard managed by a Greenplum segment is an index of the documents that are managed by that segment.

    The GPText high availability mechanism is to maintain multiple copies, or replicas, of the shard. The ZooKeeper service that manages SolrCloud chooses a GPText instance (SolrCloud node) for each replica to ensure even distribution and high availability. For each shard, one replica is elected leader and the Greenplum segment associated with the shard operates on this leader replica. The GPText instance managing the leader replica may or may not be on another Greenplum host, so indexing and searching operations are passed over the Greenplum cluster’s interconnect network. SolrCloud replicates changes made to the leader replica to the remaining replicas.

    The following figure illustrates the relationships between Greenplum segments and GPText index shards and replicas. The leader replica for each shard is shown in green and the followers are gray.

    The number of replicas to create for each shard, the replication factor, is a SolrCloud property. By default, GPText starts SolrCloud with a replication factor of three. The replication factor for each individual index is the value of the SolrCloud replication factor when the index is created. Changing the replication factor does not alter the replication factor for existing indexes.

    Greenplum Segment or Host Failure

    If a Greenplum primary segment fails and its mirror is activated, GPText functions and utilities continue to access the leader replica. No intervention is needed.

    If a host in the cluster fails, both Greenplum and GPText are affected. Mirrors for the Greenplum primary segments located on the failed host are activated on other hosts. SolrCloud elects a new leader replica for affected shards. Because Greenplum segment mirrors and GPText shard replicas are distributed throughout the cluster, a single host failure should not prevent the cluster from continuing to operate. The performance of database queries and indexing operations will be affected until the failed host is recovered and the cluster is brought back into balance.

    ZooKeeper Cluster Availability

    SolrCloud is dependent on a working, available ZooKeeper cluster. For ZooKeeper to be active, a majority of the ZooKeeper cluster nodes must be up and able to communicate with each other. A ZooKeeper cluster with three nodes can continue to operate if one of the nodes fails, since two is a majority of three. To tolerate two failed nodes, the cluster must have at least five nodes so that the number of working nodes remaining after the failure is a majority. To tolerate n node failures, then, a ZooKeeper cluster must have 2n+1 nodes. This is why ZooKeeper clusters usually have an odd number of nodes.

    The best practice for a high-availability GPText cluster is a ZooKeeper cluster with five or seven nodes so that the cluster can tolerate two or three failed nodes.
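The majority rule can be checked with a few lines of integer arithmetic. The sketch below uses hypothetical function names and only illustrates the 2n+1 relationship described above. It does not query a running ZooKeeper cluster.

```shell
#!/bin/sh
# Majority arithmetic for a ZooKeeper cluster (illustration only).

# Nodes needed to tolerate N failed nodes: 2N + 1.
zk_nodes_required() { echo $(( 2 * $1 + 1 )); }

# Smallest majority of an N-node cluster: floor(N / 2) + 1.
zk_quorum_size() { echo $(( $1 / 2 + 1 )); }

# Failed nodes an N-node cluster can tolerate: floor((N - 1) / 2).
zk_failures_tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5 6 7; do
    echo "$n nodes: majority is $(zk_quorum_size "$n"), tolerates $(zk_failures_tolerated "$n") failed node(s)"
done
```

The loop shows that a four-node cluster tolerates one failed node, the same as a three-node cluster, and a six-node cluster tolerates two, the same as a five-node cluster. An even node count adds a host without adding fault tolerance.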


    Managing GPText Cluster Health

    GPText document indexing and searching services remain available as long as each shard of an index has at least one working replica. To ensure availability in the event of a failure, it is important to monitor the status of the cluster and ensure that all of the index shard replicas are healthy. You can monitor the SolrCloud cluster and indexes using the SolrCloud Dashboard or using GPText functions and management utilities. Access the SolrCloud Dashboard with a web browser on any GPText instance with a URL such as http://sdw3:18983/solr. (The port numbers for GPText instances are set with the GPTEXT_PORT_BASE parameter in the installation parameters file at installation time.)

    Refer to the Apache SolrCloud documentation for help using the SolrCloud Dashboard.

    Monitoring the Cluster with GPText

    The GPText gptext-state management utility allows you to query the state of the GPText cluster and indexes. You can also use gptext.index_status() to view the status of all indexes or a specified index.

    To see the GPText cluster state, run the gptext-state command-line utility with the -d option to specify a database that has the GPText schema installed.

    gptext-state -d mydb

    The utility reports any GPText nodes that are down and lists the status of every GPText index. For each index, the database name, index name, and status are reported. The status column contains “Green”, “Yellow”, or “Red”:

    - Green: all replicas for all shards are healthy
    - Yellow: all shards have at least one healthy replica but at least one replica is down
    - Red: no replicas are available for at least one index shard
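The three status values follow directly from per-shard replica counts. The sketch below restates the rules in code. The input format (shard name, healthy replica count, total replica count, one shard per line) and the function name are hypothetical and are not the output format of gptext-state.

```shell
#!/bin/sh
# Derive an index status from per-shard replica counts (illustration only).
# Hypothetical input on stdin, one shard per line:
#   <shard name> <healthy replicas> <total replicas>
index_color() {
    awk '
        NF < 3   { next }          # skip blank or malformed lines
        $2 == 0  { red = 1 }       # a shard with no working replica
        $2 <  $3 { yellow = 1 }    # at least one replica is down
        END {
            if (red)         print "Red"
            else if (yellow) print "Yellow"
            else             print "Green"
        }'
}

printf 'shard0 3 3\nshard1 2 3\n' | index_color    # prints Yellow
```

Red takes precedence over Yellow because a shard with no working replica makes part of the index unavailable, whereas a Yellow index is still fully searchable.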

    To see the distribution of index shards and replicas in the GPText cluster, execute this SQL statement.

    SELECT index_name, shard_name, replica_name, node_name
    FROM gptext.index_summary()
    ORDER BY node_name;

    To list all GPText indexes, run the gptext-state list command.

    gptext-state list -d mydb

    The gptext-state healthcheck command checks the health of the cluster. The -f flag specifies the percentage of available disk space required to report a healthy cluster. The default is 10.

    gptext-state healthcheck -f 20 -d mydb

    See gptext-state in the Management Utilities reference for help with additional gptext-state options.

    The gptext.index_status() user-defined function reports the status of all GPText indexes or a specified index.

    SELECT * FROM gptext.index_status();

    Specify an index name to report only the status of that index.

    SELECT * FROM gptext.index_status('demo.twitter.message');

    Adding and Dropping Replicas

    The gptext-replica utility adds or drops a replica of a single index shard. Use the gptext.add_replica() and gptext.delete_replica() user-defined functions to perform the same tasks from within the database.

    If a replica of a shard fails, use gptext-replica to add a new replica and then drop the failed replica to bring the index back to “Green” status.

    gptext-replica add -i mydb.public.messages -s shard3

    Here is the equivalent, using the gptext.add_replica() function:


    SELECT * FROM gptext.add_replica('mydb.public.messages', 'shard3');

    ZooKeeper determines where the replica will be located, but you can also specify the node where the replica is created:

    gptext-replica add -i mydb.public.messages -s shard3 -n sdw3

    In the gptext.add_replica() function, add the node name as a third argument.

    To drop a replica, run gptext-replica drop or call gptext.delete_replica() with the name of the index, the name of the shard, and the name of the replica. You can find the name of the replica by calling gptext.index_status(index_name). The name is in the format core_noden, where n is a number, for example core_node4. With gptext-replica drop, the optional -o flag specifies that the replica is to be deleted only if it is down.

    gptext-replica drop -i mydb.public.messages -s shard3 -r core_node4 -o

    Here is the equivalent of the above command using the gptext.delete_replica() user-defined function.

    SELECT * FROM gptext.delete_replica('mydb.public.messages', 'shard3', 'core_node4', true);


    GPText Best Practices

    Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.

    Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.

    This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.

    Indexing Large Numbers of Documents

    Indexing documents consumes data in Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit() user-defined function.

    You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:

    $gptext-configedit-fsolrconfig.xml-i..

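    The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the time between automatic commits. For example, the following settings perform an automatic soft commit every 100,000 documents or 10 minutes (600,000 milliseconds).

    <autoSoftCommit>
        <maxDocs>100000</maxDocs>
        <maxTime>600000</maxTime>
    </autoSoftCommit>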

    Indexing Very Large Documents

    Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.

    See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.

    Determining the Number of GPText Nodes to Deploy

    A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.

    The maximum recommended number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes. Use the JAVA_OPTS installation parameter to set memory size for GPText nodes.

    A single GPText node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most GPText installations, a single GPText node per host is sufficient.

    If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a GPText node loses capacity and performance. Therefore, no GPText node should have more than 32GB of memory.

    For example, if you have 48GB memory available for GPText per host, you should deploy two GPText nodes with 24GB memory. If you have 128GB available, you should deploy at least four JVMs, and more if garbage collection becomes a problem.
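The sizing rule above reduces to a ceiling division. The following sketch computes the smallest number of nodes per host that keeps each node at or below 32GB. The function names are hypothetical, and the result is only a lower bound: as noted above, more nodes may be preferable if garbage collection becomes a problem.

```shell
#!/bin/sh
# Minimum GPText nodes per host so that no node exceeds a memory cap.
# Arguments: memory available to GPText on the host (GB), optional cap (GB).
# The 32 GB default follows the limit stated in this guide.
gptext_min_nodes() {
    mem_gb=$1
    max_gb=${2:-32}
    echo $(( (mem_gb + max_gb - 1) / max_gb ))    # ceiling division
}

# Memory per node (GB, rounded down) when using the minimum node count.
gptext_mem_per_node() {
    nodes=$(gptext_min_nodes "$1" "${2:-32}")
    echo $(( $1 / nodes ))
}

for mem in 24 48 128; do
    echo "$mem GB per host: $(gptext_min_nodes "$mem") node(s) of about $(gptext_mem_per_node "$mem") GB"
done
```

For 48GB the sketch gives two nodes of 24GB, and for 128GB it gives four nodes, matching the examples above.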


    Configure Maximum JVM Heap Size

    Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.

    Monitor performance at startup and during index creation and increase the JVM size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit, to monitor Java heap usage. If garbage collections are occurring too frequently and freeing too little memory, JVM heap should be increased.

    The JVM size is initially configured during GPText installation by setting the JAVA_OPTS parameter in the installation configuration file. After installation, use the gptext-config jvm command to increase the JVM heap size. For example, this gptext-config jvm command sets the JVM maximum heap option to 4GB:

    $ gptext-config jvm -o "-Xmx=4096M"

    Manage Indexing and Search Loads

    With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.

    Terms Queries and Out of Memory Errors

    The gptext.terms() function retrieves terms vectors from documents that match a query. An out of memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS) and concurrent queries.

    If you experience out of memory errors with gptext.terms() you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out of memory errors, but performance of terms queries can be affected.

    See GPText Configuration Parameters for help setting GPText configuration parameters.

    Configure File System Caching for ZooKeeper

    Good Solr performance is dependent on fast response for ZooKeeper requests. ZooKeeper performs best when its database is cached so it does not have to go to disk for lookups. If you find that ZooKeeper JVMs have frequent disk accesses, look for ways to improve file caching or move ZooKeeper disks to faster storage.

    The ZooKeeper zkClientTimeout parameter is the time a client is allowed to not talk to ZooKeeper before having its session expired.


    Troubleshooting Hadoop Connection Problems

    This section describes Hadoop-related problems and potential solutions to these issues.

    DataNode Access Errors

    You may experience Hadoop access errors with GPText if any DataNodes in the Hadoop cluster reside in a multi-homed network. GPText uses an external IP address to access the HDFS NameNode. GPText encounters an error when the NameNode provides an internal IP address for a DataNode. In this situation, additional configuration is required to configure GPText to perform its own DNS resolution of DataNode hostnames.

    Perform the following procedure to explicitly configure DNS resolution of DataNode hostnames:

    1. Locate a local copy of the Hadoop authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_conf:

    $ cd /home/gpadmin/auths/hdfs_conf
    $ ls
    core-site.xml  hdfs-site.xml  user.txt

    2. Open hdfs-site.xml in the editor of your choice. For example:

    $ vi hdfs-site.xml

    3. Add the following property block to the file, and then save the file and exit:

    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
    </property>

    This property allows GPText hosts to perform their own DNS resolution of HDFS DataNode hostnames.

    4. Re-upload the modified configuration to ZooKeeper. For example, if the hdfs_conf directory includes the authentication configuration files for a Hadoop cluster with hdfs_bill_auth:

    $ cd ..
    $ gptext-external upload -t hdfs -c hdfs_bill_auth -p hdfs_conf

    5. Determine the hostname-to-IP address mapping for all DataNodes, and add the associated entries into the /etc/hosts file on all GPText client hosts.

    Kerberos-Related Errors

    The following problems are specific to Hadoop clusters secured with Kerberos.

    Clock Skew

    A login attempt to a Hadoop cluster secured with Kerberos will fail if clock skew between GPText client hosts and the Kerberos KDC host is too great. In this situation, you may see the following error in the Solr log:

    java.io.IOException caused by a KrbException noting “Clock skew too great”

    To resolve this situation, ensure that the clocks on the Kerberos KDC host and GPText client hosts are synchronized.
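A quick way to reason about whether two clocks are close enough is to compare their epoch timestamps against the permitted skew. The sketch below uses a hypothetical helper and a default of 300 seconds, which is the MIT Kerberos default for the clockskew setting. The limit in a particular deployment may differ, and how the KDC host's time is obtained is left outside the sketch.

```shell
#!/bin/sh
# Succeeds when two epoch timestamps (seconds) differ by no more than the
# allowed skew. The third argument is optional and defaults to 300 seconds,
# the MIT Kerberos default for clockskew (an assumption for your site).
clock_skew_ok() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))          # absolute value
    fi
    [ "$diff" -le "${3:-300}" ]
}

local_now=$(date +%s)
kdc_now=$(( local_now - 420 ))     # pretend the KDC clock is 7 minutes behind
if clock_skew_ok "$local_now" "$kdc_now"; then
    echo "clock skew OK"
else
    echo "clock skew too great"    # this branch runs: 420 > 300
fi
```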

    Timeout Errors

    A login attempt to a Hadoop cluster secured with Kerberos may fail with timeout errors when the kdc and admin_server settings in the krb5.conf file are specified with a hostname, and the GPText client hosts cannot resolve the hostname. In this situation, you may see one of the following errors in the Solr log:


    - org.apache.solr.common.SolrException: Failed to login HDFS message caused by a java.io.IOException specifying javax.security.auth.login.LoginException: Receive timed out
    - java.nio.channels.UnresolvedAddressException with SocketIOWithTimeout referenced in the stack trace

    In this situation, you may choose either of the following:

    - Update the Kerberos krb5.conf file to specify the kdc and admin_server settings using IP addresses.
    - Update all GPText hosts to perform their own DNS resolution of the Kerberos KDC server.

    If you choose to update the krb5.conf file:

    1. Locate a local copy of the Hadoop Kerberos authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_kerb_conf:

    $ cd /home/gpadmin/auths/hdfs_kerb_conf
    $ ls
    core-site.xml  hdfs-site.xml  keytab  krb5.conf  user.txt

    2. Open krb5.conf in the editor of your choice. For example:

    $ vi krb5.conf

    3. Replace the KERBEROS block attributes with their equivalent IP addresses and then save the file and exit. For example:

    [realms]KERBEROS={kdc=admin_server=}

    4. Re-upload the modified configuration to ZooKeeper. For example, if the directory named hdfs_kerb_conf includes the authentication configuration files for a Hadoop cluster defined with the hdfs_kerb_auth:

    $ cd ..
    $ gptext-external upload -t hdfs -c hdfs_kerb_auth -p hdfs_kerb_conf

    Alternatively, if you choose to configure the GPText hosts to perform their own DNS resolution of the Kerberos KDC server, add an entry for the KDC hostname-to-IP address mapping to the /etc/hosts file on all GPText client hosts.


    Working With GPText Indexes

    Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create GPText indexes and add documents from Greenplum Database tables to them, and how to maintain and customize indexes for your own applications.

    For help indexing and searching documents stored outside of Greenplum Database see Working With GPText External Indexes.

    Setting Up the Sample Database

    The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.

    1. Log in to the Greenplum Database master as the gpadmin user and create the demo database.

    $ createdb demo

    2. Open an interactive shell for executing queries in the demo database.

    $ psql demo

    3. Create the articles table in the wikipedia schema with the following statements.

    CREATE SCHEMA wikipedia;
    CREATE TABLE wikipedia.articles (
        id int8 primary key,
        date_time timestamptz,
        title text,
        content text,
        refs text
    ) DISTRIBUTED BY (id);

    4. Create the message table in the twitter schema with the following statements.

    CREATE SCHEMA twitter;
    CREATE TABLE twitter.message (
        id bigint,
        message_id bigint,
        spam boolean,
        created_at timestamp without time zone,
        source text,
        retweeted boolean,
        favorited boolean,
        truncated boolean,
        in_reply_to_screen_name text,
        in_reply_to_user_id bigint,
        author_id bigint,
        author_name text,
        author_screen_name text,
        author_lang text,
        author_url text,
        author_description text,
        author_listed_count integer,
        author_statuses_count integer,
        author_followers_count integer,
        author_friends_count integer,
        author_created_at timestamp without time zone,
        author_location text,
        author_verified boolean,
        message_url text,
        message_text text
    )
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (created_at)
    ( START (DATE '2011-08-01') INCLUSIVE
      END (DATE '2011-12-01') EXCLUSIVE
      EVERY (INTERVAL '1 month') );
    CREATE INDEX id_idx ON twitter.message USING btree (id);

    5. Create the store.products table with these statements.


    CREATE SCHEMA store;
    CREATE TABLE store.products (
        id bigint,
        title text,
        category varchar(32),
        brand varchar(32),
        price float
    ) DISTRIBUTED BY (id);

    6. Download test data for the three tables here. Right-click the link, save the file, and then copy it to the gpadmin user’s home directory.

    7. Extract the data files with this tar command.

    $ tar xvfz gptext-demo-data.tgz

    8. Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.

    \COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;

    The articles table now contains text from 23 Wikipedia articles.

    9. Load the twitter data into the twitter.message table using the following psql \COPY metacommand.

    \COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;

    The message table now contains 1730 tweets from August to October, 2011.

    10. Load the products data into the store.products table with the following psql \COPY metacommand.

    \COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;

    The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.

    Setting Up the GPText Command-line Environment

    To work with GPText indexes, you must first set up your environment and add the GPText schema to the database containing the documents (Greenplum Database data) you want to index.

    Tosettheenvironment,loginasthe gpadmin userandsourcetheGreenplumDatabaseandGPTextenvironmentscripts.TheGreenplumDatabaseenvironme