
GPText: Greenplum Parallel Statistical Text Analysis Framework

Kun Li, University of Florida, E403 CSE Building, Gainesville, FL 32611 USA, [email protected]

Christan Grant, University of Florida, E457 CSE Building, Gainesville, FL 32611 USA, [email protected]

Daisy Zhe Wang, University of Florida, E456 CSE Building, Gainesville, FL 32611 USA, [email protected]

Sunny Khatri, Greenplum, 1900 S Norfolk St #125, San Mateo, CA 94403 USA, [email protected]

George Chitouras, Greenplum, 1900 S Norfolk St #125, San Mateo, CA 94403 USA, [email protected]

ABSTRACT
Many companies keep large amounts of text data inside of relational databases. Several challenges exist in using state-of-the-art systems to perform analysis on such datasets. First, an expensive big data transfer cost must be paid up front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this paper, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADLib, an open source library for scalable in-database analytics which can be installed on PostgreSQL and Greenplum. In addition, we developed and contributed a linear-chain conditional random field (CRF) module to MADLib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an eDiscovery application built on the GPText framework.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems, parallel databases; I.2.7 [Artificial Intelligence]: Natural Language Processing, text analysis

General Terms
Design, Performance, Management

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DanaC'13, June 23, 2013, New York, NY, USA
Copyright 2013 ACM 978-1-4503-2202-7/13/6 ...$15.00.

Keywords
RDBMS, Massively parallel processing, Text analytics

1. INTRODUCTION
Text analytics has gained much attention in the big data research

community due to the large amounts of text data generated every day in organizations such as companies, governments and hospitals, in the form of emails, electronic notes and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. A good understanding of this unstructured text data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes.

Traditional business intelligence pulls content from databases into other massive data warehouses to analyze the data. The typical “data movement process” involves moving information from the database for analysis using external tools and storing the final product back into the database. This movement process is time consuming and prohibitive for interactive analytics. Minimizing the movement of data is a huge incentive for businesses and researchers. One way to achieve this is for the datastore to be in the same location as the analytic engine.

While Hadoop has become a popular platform to perform large-scale data analytics, newer parallel processing relational databases can also leverage more nodes and cores to handle large-scale datasets. The Greenplum database, built upon the open source database PostgreSQL, is a parallel database that adopts a shared-nothing massively parallel processing (MPP) architecture. Database researchers and vendors are capitalizing on the increase in database cores and nodes and investing in open source data analytics ventures such as the MADLib project [3, 7]. MADLib is an open source library for scalable in-database analytics on Greenplum and PostgreSQL. It provides parallel implementations of many machine learning algorithms.

In this paper, we motivate in-database text analytics by presenting GPText, a powerful and scalable text analysis framework developed on the Greenplum MPP database. GPText inherits scalable indexing, keyword search, and faceted search functionalities from an effective integration of the Solr [6] search engine. GPText uses the CRF package, which was contributed to the MADLib open-source library. We show that we can use SQL and user-defined aggregates to implement conditional random field (CRF)


methods for information extraction in parallel. The experiments show sublinear improvement in runtime for both CRF learning and inference with a linear increase in the number of cores. As far as we know, GPText is the first toolkit for statistical text analysis in relational database management systems. Finally, we describe the needs and requirements of eDiscovery applications and show that GPText is an effective platform for developing such sophisticated text analysis applications.

2. RELATED WORK
Researchers have created systems for large scale text analytics,

including GATE, PurpleSox and SystemT [4, 2, 8]. While both GATE and SystemT use a rule-based approach for information extraction (IE), PurpleSox uses statistical IE models. However, none of the above systems is built natively for an MPP framework.

The parallel CRF in-database implementation follows the MAD methodology [7]. In a similar vein, researchers have shown that most machine learning algorithms can be expressed through a unified RDBMS architecture [5].

There are several implementations of conditional random fields, but only a few large-scale implementations for NLP tasks. One example is PCRFs [11], which are implemented over massively parallel processing systems supporting the Message Passing Interface (MPI), such as Cray XT3, SGI Altix, and IBM SP. However, PCRFs are not implemented over RDBMSs.

3. GREENPLUM TEXT ANALYTICS
GPText runs on the Greenplum database (GP), which is a shared-nothing

massively parallel processing (MPP) database. The Greenplum database adopts the most widely used master-slave parallel computing model, where one master node orchestrates multiple slaves to work until the task is completed, and there is no direct communication between slaves. As shown in Figure 1, it is a collection of PostgreSQL instances including one master instance and multiple slave instances (segments). The master node accepts SQL queries from clients, then divides the workload and sends sub-tasks to the segments. Besides harnessing the power of a collection of computer nodes, each node can also be configured with multiple segments to fully utilize its multicore processor. To provide high availability, GP provides the options to deploy a redundant standby master and mirroring segments in case of master or primary segment failure.

The embarrassingly parallel processing capability powered by the Greenplum MPP framework lays the cornerstone that enables GPText to process production-sized text data. Besides the underlying MPP framework, GPText also inherits the features that a traditional database system provides, for example online expansion, data recovery and performance monitoring, to name a few. On top of the underlying MPP framework, there are two building blocks, MADLib and Solr, as illustrated in Figure 1, which distinguish GPText from many of the existing text analysis tools. MADLib makes GPText capable of sophisticated text data analysis tasks, such as part-of-speech tagging, named entity recognition, document classification and topic modeling, with a vast amount of parallelism. Solr is a reliable and scalable text search platform from the Apache Lucene project, and it has been widely deployed in web servers. Its major features include powerful full-text search, faceted search, and near-real-time indexing. As shown in Figure 1, GPText uses Solr to create distributed indexes. Each primary segment is associated with one and only one Solr instance, where the index of the data in the primary segment is stored for the purpose of load balancing. GPText has all the features that Solr has, since Solr is integrated seamlessly into GPText.

Figure 1: The GPText architecture over Greenplum database.

The authors submitted the CRF statistical package for text analysis to MADLib (more details in Section 4).

3.1 In-database document representation
In GPText, a document can be represented as a vector of counts

against a token dictionary, which is a vector of the unique tokens in the dataset. For efficient storage and memory usage, GPText uses a sparse vector representation for each document instead of a naive vector representation. The following is an example with two different vector representations of a document. The dictionary contains all the unique terms (i.e., 1-grams) that exist in the corpus.

Dictionary: {am, before, being, bothered, corpus, document, in, is, me, never, now, one, really, second, the, third, this, until}.
Document: {i, am, second, document, in, the, corpus}.
Naive vector representation: {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1}.
Sparse vector representation: {(1,3,1,1,1,1,6,1,1):(1,0,1,1,1,1,0,1,1)}.

GPText adopts run-length encoding to compress the naive vector representation using pairs of count-value vectors. The first array in the sparse representation contains counts for the corresponding values in the second array. Although not apparent in this example, the advantage of the sparse vector representation is dramatic for real world documents, where most of the elements in the vector are 0.
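As a minimal sketch, such a run-length-encoded vector can be stored with MADLib's sparse vector (svec) type; the table and column names below are hypothetical, and the exact literal syntax may differ across MADLib versions.

-- Hypothetical table holding one run-length-encoded term vector per document.
CREATE TABLE corpus_vectors (
    doc_id   integer,
    term_vec madlib.svec    -- MADLib sparse vector type
);

-- The '{counts}:{values}' literal mirrors the count/value pairs shown above.
INSERT INTO corpus_vectors
VALUES (2, '{1,3,1,1,1,1,6,1,1}:{1,0,1,1,1,1,0,1,1}'::madlib.svec);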

3.2 ML-based Advanced Text Analysis
GPText relies on multiple machine learning modules in MADLib

to perform statistical text analysis. Three of the commonly used modules are k-means for document clustering, multinomial naive Bayes (multinomialNB) for document classification, and latent Dirichlet allocation (LDA) for topic modeling, which serves dimensionality reduction and feature extraction.


Figure 2: The MADLib CRF overall system architecture.

Performing k-means, multinomialNB, or LDA in GPText follows the same pipeline:

1. Create a Solr index for the documents.
2. Configure and populate the Solr index.
3. Create a terms table for each document.
4. Create a dictionary of the unique terms across documents.
5. Construct a term vector using tf-idf for each document.
6. Run the MADLib k-means/multinomialNB/LDA algorithm (a minimal call sketch follows below).
The following section details the implementation of the CRF

learning and inference modules that we developed for GPText applications as part of MADLib.
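Before turning to CRF, here is a minimal sketch of step 6 above using MADLib's k-means++ function; the table and column names are hypothetical and the exact argument list depends on the MADLib version.

-- Cluster tf-idf term vectors into 10 clusters with MADLib k-means++.
-- 'corpus_tfidf' and 'tfidf_vec' are illustrative names.
SELECT *
FROM madlib.kmeanspp(
    'corpus_tfidf',   -- source table with one tf-idf vector per document
    'tfidf_vec',      -- column holding the document vector
    10                -- number of clusters k
);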

4. CRF FOR IE OVER MPP DATABASES
A conditional random field (CRF) is a type of discriminative

undirected probabilistic graphical model. Linear-chain CRFs are special CRFs which assume that the next state depends only on the current state. Linear-chain CRFs achieve state-of-the-art accuracy in many real world natural language processing (NLP) tasks such as part-of-speech tagging (POS) and named entity recognition (NER).

4.1 Implementation Overview
Figure 2 illustrates the detailed implementation of the CRF

module developed for IE tasks in GPText. The top box shows the pipeline of the training phase. The bottom box shows the pipeline for the inference phase. We use declarative SQL statements to extract all features from text. Any feature used in state-of-the-art packages can be extracted using a single SQL clause; all of the common features described in the literature can be extracted with one SQL statement each. The extracted features are stored in a relation for either single-state or two-state features. For inference, we use user-defined functions to calculate the maximum-a-posteriori (MAP) configuration and probability for each document. Inference over all documents is embarrassingly parallel, since the inference over one document is independent of other documents. For learning, the computation of the gradient and log-likelihood over all training documents dominates the overall training time. Fortunately, it can run in parallel, since the overall gradient and log-likelihood are simply the summation of the gradient and log-likelihood for each document. This parallelization can be expressed with user-defined aggregates, a parallel programming paradigm supported in parallel databases.
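The feature relation mentioned above can be pictured roughly as follows; this schema is illustrative and the actual MADLib tables differ in detail.

-- Illustrative layout for extracted features: single-state features use
-- a previous label of -1, two-state (edge) features carry both labels.
CREATE TABLE featuretbl (
    doc_id    integer,
    start_pos integer,     -- token position within the document
    feature   text,        -- e.g. 'E.', 'R_' || name, or a word feature
    labels    integer[]    -- ARRAY[previous_label, label]
);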

4.2 Feature Extraction Using SQLs

SELECT doc2.pos, doc2.doc_id, 'E.',
       ARRAY[doc1.label, doc2.label]
FROM segmenttbl doc1, segmenttbl doc2
WHERE doc1.doc_id = doc2.doc_id AND
      doc1.pos + 1 = doc2.pos

Figure 3: Query for extracting edge features

SELECT start_pos, doc_id, 'R_' || r.name,
       ARRAY[-1, label]
FROM regextbl r, segmenttbl s
WHERE s.seg_text ~ r.pattern

Figure 4: Query for extracting regex features

Text feature extraction is a step in most statistical text analysis methods. We are able to implement all seven types of features used in POS and NER using exactly seven SQL statements. These features include:
Dictionary: does this token exist in a dictionary?
Regex: does this token match a regular expression?
Edge: is the label of a token correlated with the label of a previous token?
Word: does this token appear in the training data?
Unknown: does this token appear in the training data below a certain threshold?
Start/End: is this token first/last in the token sequence?

There are many advantages to extracting features with SQL. The SQL statements hide a lot of the complexity present in the actual operation. It turns out that each type of feature can be extracted using exactly one SQL statement, making the feature extraction code extremely succinct. Secondly, SQL statements are naturally parallel due to the set semantics supported by relational DBMSs. For example, we compute features for each distinct token and avoid re-computing the features for repeated tokens.

In Figure 3 and Figure 4 we show how to extract edge and regex features, respectively. Figure 3 extracts adjacent labels from sentences and stores them in an array. Figure 4 shows a query that selects all the sentences that satisfy any of the regular expressions present in the table 'regextbl'.
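For comparison, the Word feature can be pulled out in the same style as Figures 3 and 4. This is only a sketch: the 'wordtbl' table of distinct training tokens and the 'W_' prefix are illustrative names, not the actual MADLib identifiers.

-- Emit a word feature for every segment whose token was seen in training.
SELECT s.start_pos, s.doc_id, 'W_' || s.seg_text,
       ARRAY[-1, s.label]
FROM segmenttbl s, wordtbl w
WHERE s.seg_text = w.token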

4.3 Parallel Linear-chain CRF Training

Algorithm 1 CRF training(z_{1:M})
Input: z_{1:M} (document set), Convergence(), Initialization(), Transition(), Finalization()
Output: Coefficients w ∈ R^N
Initialization/Precondition: iteration = 0
1: Initialization(state)
2: repeat
3:   state ← LoadState()
4:   for all m ∈ 1..M do
5:     state ← Transition(state, z_m)
6:   state ← Finalization(state)
7:   WriteState(state)
8: until Convergence(state, iteration)
9: return state.w


Programming Model.
In Algorithm 1 we show the parallel CRF training strategy. The

algorithm is expressed as a user-defined aggregate. User-defined aggregates are composed of three parts: a transition function (Algorithm 2), a merge function¹, and a finalization function (Algorithm 3). In the following we describe these functions.

In line 1 of Algorithm 1, the Initialization function creates a 'state' object in the database. This object contains the coefficient (w), gradient (∇) and log-likelihood (L) variables. This state is loaded (line 3) and saved (line 7) between iterations. We compute the gradient and log-likelihood of each segment in parallel (line 4), much like a map function. Then line 6 computes the new coefficients, much like a reduce function.
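In Greenplum/PostgreSQL terms, this wiring corresponds roughly to a user-defined aggregate declaration like the one below; the function names, state type, and argument type are illustrative placeholders, not the actual MADLib identifiers.

-- Sketch of the CRF training aggregate: the transition function consumes
-- one document per call, the merge function (PREFUNC) combines partial
-- states from different segments, and the final function runs L-BFGS.
CREATE AGGREGATE crf_train_step(double precision[]) (
    STYPE     = double precision[],      -- state: coefficients, gradient, log-likelihood
    SFUNC     = crf_transition_lbfgs,    -- Algorithm 2
    PREFUNC   = crf_merge_states,        -- sums gradients and log-likelihoods
    FINALFUNC = crf_finalization_lbfgs,  -- Algorithm 3
    INITCOND  = '{}'
);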

Transition strategies.
Algorithm 2 contains the logic of computing the gradient and

log-likelihood for each tuple using the forward-backward algorithm. The overall gradient and log-likelihood are simply the summation of the gradient and log-likelihood of each tuple. This algorithm is invoked in parallel over many segments, and the results of these invocations are combined using the merge function.

Algorithm 2 transition-lbfgs(state, z_m)
Input: state (transition state), z_m (a document), Gradient()
Output: state
1: {state.∇, state.L} ← Gradient(state, z_m)
2: state.num_rows ← state.num_rows + 1
3: return state

Finalization strategy.
The finalization function invokes the L-BFGS convex solver to get a new coefficient vector. To avoid overfitting, we penalize the log-likelihood with a spherical Gaussian weight prior.

Algorithm 3 finalization-lbfgs(state)
Input: state, LBFGS() (convex optimization solver)
Output: state
1: {state.∇, state.L} ← penalty(state.∇, state.L)
2: instance ← LBFGS.init(state)
3: instance.lbfgs()  (invoke the L-BFGS solver)
4: return instance.state

Limited-memory BFGS (L-BFGS), a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, is a leading method for large-scale unconstrained convex optimization. We translate an in-memory Java implementation [10] into a C++ in-database implementation. Before each iteration of L-BFGS optimization, we initialize the L-BFGS solver with the current state object. At the end of each iteration, we write the updated variables back to the database state for the next iteration.
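For reference, the spherical Gaussian prior applied in the penalty step (line 1 of Algorithm 3) amounts to the standard L2-regularized objective; σ is a tuning parameter not specified here:

\ell(w) \;=\; \sum_{m=1}^{M} \log p(y_m \mid x_m; w) \;-\; \frac{\lVert w \rVert^2}{2\sigma^2},
\qquad
\nabla \ell(w) \;=\; \sum_{m=1}^{M} \nabla_w \log p(y_m \mid x_m; w) \;-\; \frac{w}{\sigma^2}.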

4.4 Parallel Linear-chain CRF Inference
The Viterbi algorithm is applied to find the top-k most likely

labelings of a document for CRF models.

¹The merge function takes two states and sums their gradients and log-likelihoods.

Figure 5: Linear-chain CRF training scalability

Figure 6: Linear-chain CRF inference scalability

CRF inference is embarrassingly parallel, since the inference over one document is independent of other documents. We chose to implement a SQL clause to drive the Viterbi inference over all documents, where each function call labels one document. In parallel databases such as Greenplum, Viterbi can be run in parallel over different subsets of the documents on a single-node or multi-node cluster, and each node can have multiple cores.
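A sketch of such a driving query is shown below; the vcrf_label function name and the table layout are illustrative stand-ins for the per-document Viterbi UDF and its inputs.

-- Label every document in one pass; the planner spreads the per-document
-- Viterbi calls across segments, so documents are labeled in parallel.
SELECT doc_id,
       vcrf_label(token_ids, feature_ids, weights) AS label_sequence
FROM doc_feature_tbl;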

5. EXPERIMENTS AND RESULTS
In order to evaluate the performance and scalability of the linear-

chain CRF learning and inference on Greenplum, we conduct experiments on various data sizes on a 32-core machine with a 2TB hard drive and 64GB of memory. We use the CoNLL-2000 dataset containing 8,936 tagged sentences for learning. This dataset is labeled with 45 POS tags. To evaluate the inference performance, we extracted 1.2 million sentences from the New York Times dataset. In Figure 5 and Figure 6 we show that runtime improves sublinearly with an increase in the number of segments. Our POS implementation achieves 0.9715 accuracy, which is consistent with the state of the art [9].

6. GPTEXT APPLICATION
With the support of the Greenplum MPP data processing framework,

an efficient Solr indexing/search engine, and parallelized statistical text analysis modules, GPText positions itself as a great platform for many applications that need to apply scalable text analytics methods of varying sophistication over unstructured data in databases. One such application is eDiscovery.

eDiscovery has become increasingly important in legal processes. Large enterprises keep terabytes of text data such as emails and internal documents in their databases. Traditional civil litigation often involves reviews of large amounts of documents on both the plaintiff's and the defendant's side. eDiscovery provides tools to pre-filter documents for review, offering a faster, less expensive and more accurate solution.


Figure 7: GPText application

Traditionally, simple keyword search is used for such document retrieval tasks in eDiscovery. Recently, predictive coding based on more sophisticated statistical text analysis methods has been gaining more and more attention in civil litigation, since it provides higher precision and recall in retrieval. In 2012, judges in several cases [1] approved the use of predictive coding under the Federal Rules of Civil Procedure. We developed a prototype eDiscovery tool on GPText using the Enron dataset. Figure 7 is a snapshot of the eDiscovery application. In the top pane, the keyword 'illegal' is specified and Solr is used to retrieve all relevant emails that contain the keyword; they are displayed in the bottom pane. As shown in the middle pane on the right, the tool also supports topic discovery in the email corpus using LDA and classification using k-means. K-means uses LDA and CRF to reduce the dimensionality of the features and speed up the k-means classification. CRF is also used to extract named entities. The discovered topics are displayed in the Results panel. The tool also provides visualization of aggregate information and faceted search over the email corpus.

7. CONCLUSION
We introduce GPText, a parallel statistical text analysis

framework over an MPP database. Through its seamless integration with Solr and MADLib, GPText is a framework with a powerful search engine and advanced statistical text analysis capabilities. We implemented parallel CRF inference and training modules for IE. The functionality and scalability provided by GPText make it a great tool for sophisticated text analytics applications such as eDiscovery. As future work, we plan to support learning-to-rank and active learning algorithms in GPText.

Acknowledgments
Christan Grant is funded by NSF under Grant No. DGE-0802270. This work was supported by a gift from EMC Greenplum.

8. REFERENCES
[1] L. Aguiar and J. A. Friedman. Predictive coding: What it is and what you need to know about it. http://newsandinsight.thomsonreuters.com/Legal/Insight/2013.
[2] P. Bohannon, S. Merugu, C. Yu, V. Agarwal, P. DeRose, A. Iyer, A. Jain, V. Kakade, M. Muralidharan, R. Ramakrishnan, and W. Shen. Purple SOX extraction management system. SIGMOD Rec., 37(4):21–27, Mar. 2009.
[3] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009.
[4] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). 2011.
[5] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 325–336, New York, NY, USA, 2012. ACM.
[6] The Apache Software Foundation. Apache Solr. http://lucene.apache.org/solr.
[7] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library: or MAD skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012.
[8] Y. Li, F. R. Reiss, and L. Chiticariu. SystemT: a declarative information extraction system. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, HLT '11, pages 109–114, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[9] C. D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part I, CICLing '11, pages 171–189, Berlin, Heidelberg, 2011. Springer-Verlag.
[10] J. Morales and J. Nocedal. Automatic preconditioning by limited memory quasi-Newton updating. SIAM Journal on Optimization, 10(4):1079–1096, 2000.
[11] H. Phan and M. Le Nguyen. FlexCRFs: Flexible conditional random fields, 2004.