THE STAT PROJECT
Requirement Analysis
Milestone 1 Report
To design a framework, how many variations do we need to protect against? How many
functionalities do we need to provide to support all these variations?
QUESTIONS
Variations for importing datasets (File sources)
Variations for importing datasets (File formats)
Variations for importing datasets (Schemas)
Even if we consider only datasets in XML, each dataset may have its own schema.
Reuters dataset example
Simplified approach
One approach: high-level reader classes
- ReutersReader
- RCV1Reader
Once written, they can be shared by the community.
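The reader-class idea can be sketched as follows. This is a minimal illustration, not the STAT API: the `DatasetReader` interface, the `TabSeparatedReader` class, and the tab-separated line format are all assumptions made for the example. A real `ReutersReader` would parse the actual Reuters markup behind the same kind of interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract: one reader per dataset hides that dataset's format and schema.
interface DatasetReader {
    // Each document is returned as a map from field name to string value.
    List<Map<String, String>> read(Reader source);
}

// Toy stand-in for a dataset-specific reader such as ReutersReader.
// Assumed format (not the real Reuters format): one document per line, "title<TAB>body".
class TabSeparatedReader implements DatasetReader {
    @Override
    public List<Map<String, String>> read(Reader source) {
        List<Map<String, String>> documents = new ArrayList<>();
        try (BufferedReader lines = new BufferedReader(source)) {
            String line;
            while ((line = lines.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\t", 2);
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("title", parts[0]);
                fields.put("body", parts.length > 1 ? parts[1] : "");
                documents.add(fields);
            }
        } catch (IOException e) {
            // Wrap the checked exception so callers are not forced to handle it.
            throw new UncheckedIOException(e);
        }
        return documents;
    }
}
```

A caller would pass any character source, for example `new TabSeparatedReader().read(new StringReader(text))`, and never see the underlying file layout.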
Observation: for the sake of comparison, researchers usually work with a few well-known datasets (e.g., Reuters, RCV-1).
Able to persist and read back memory objects
Able to visualize memory objects
STAT (brief) Domain Model
Note: We omit the text on connectors for brevity. Some connections are not drawn because of space limitations.
STAT framework sample code (conceptual)
Domain Concept: RawCorpus
A collection of RawDocument objects that supports collection operations:
- Add a new RawDocument element
- Remove an existing RawDocument element
- Access elements in the collection
- …
Domain Concept: RawCorpus
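import java.util.List;

abstract class RawCorpus {
    // Backing collection of unprocessed documents.
    List<RawDocument> rawDocuments;

    // Element access by position; subclasses supply the behavior.
    abstract RawDocument getDocument(int index);
    abstract void setDocument(int index, RawDocument doc);
    abstract void removeDocument(int index);
}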
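One possible concrete subclass is sketched below, assuming a list-backed store. `InMemoryRawCorpus`, `NewsDocument`, and the `size` and `addDocument` methods are hypothetical names added for illustration. The abstract class is restated in a slightly different form so that the block compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal placeholder so the sketch compiles on its own.
abstract class RawDocument {
}

// Hypothetical document type used only for this example.
class NewsDocument extends RawDocument {
    final String title;

    NewsDocument(String title) {
        this.title = title;
    }
}

// Restated contract; size() and addDocument() are assumptions, not from the slides.
abstract class RawCorpus {
    abstract int size();
    abstract void addDocument(RawDocument doc);
    abstract RawDocument getDocument(int index);
    abstract void setDocument(int index, RawDocument doc);
    abstract void removeDocument(int index);
}

// Hypothetical list-backed implementation of the collection operations.
class InMemoryRawCorpus extends RawCorpus {
    private final List<RawDocument> rawDocuments = new ArrayList<>();

    @Override
    int size() {
        return rawDocuments.size();
    }

    @Override
    void addDocument(RawDocument doc) {
        rawDocuments.add(doc);
    }

    @Override
    RawDocument getDocument(int index) {
        return rawDocuments.get(index);
    }

    @Override
    void setDocument(int index, RawDocument doc) {
        rawDocuments.set(index, doc);
    }

    @Override
    void removeDocument(int index) {
        // Calls List.remove(int), which removes by position.
        rawDocuments.remove(index);
    }
}
```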
Domain Concept: RawDocument
An object with one or more string fields, serving as a non-processed, in-memory representation of a document unit
- Like Java beans, with getters and setters
- All fields must be of string type, even for numbers
Domain Concept: RawDocument
class MyRawDocument extends RawDocument {
    String title;
    String author;
    String body;
    String date;
    String numOfClicks;  // numeric values are also kept as strings
    String topicType;
    …
}

abstract class RawDocument {
    public RawDocument() {}
}
Domain Concept: Processor
An object that processes a RawCorpus and produces a Corpus.
- Linguistic: Tokenizer, Stemmer, StopRemover, PosTagger, …
- Machine learning: feature-specific, document-specific
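A chain of linguistic processors might look like the sketch below. It is deliberately simplified: it works on the text of a single document instead of a whole RawCorpus, and `TokenStep`, `WhitespaceTokenizer`, and `ProcessorChain` are hypothetical names. Only `StopRemover` is a name taken from the list above, and its behavior here is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical step interface: each linguistic processor transforms a token list.
interface TokenStep {
    List<String> apply(List<String> tokens);
}

// Splits lower-cased text on runs of whitespace.
class WhitespaceTokenizer {
    List<String> tokenize(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(normalized.split("\\s+")));
    }
}

// Drops every token that appears in the configured stop-word set.
class StopRemover implements TokenStep {
    private final Set<String> stopWords;

    StopRemover(String... stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    @Override
    public List<String> apply(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}

// Tokenizes the raw text, then applies each step in order.
class ProcessorChain {
    private final WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    private final List<TokenStep> steps;

    ProcessorChain(TokenStep... steps) {
        this.steps = Arrays.asList(steps);
    }

    List<String> process(String rawText) {
        List<String> tokens = tokenizer.tokenize(rawText);
        for (TokenStep step : steps) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }
}
```

Keeping each step behind one small interface lets a Stemmer or PosTagger be added to the chain without changing the code that runs it.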
Domain Concept: Corpus
An object representing a collection of Documents for use by the machine learning side of the framework. This object provides a notion of splits, which are commonly used (e.g., train, test).
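The notion of splits could be modeled as named groups of documents, as in this sketch. The method names `add` and `getSplit`, and the `tokens` and `label` fields on `Document`, are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical processed document: tokens plus a target label.
class Document {
    final List<String> tokens;
    final String label;

    Document(List<String> tokens, String label) {
        this.tokens = tokens;
        this.label = label;
    }
}

// Documents grouped by split name, e.g. "train" and "test".
class Corpus {
    private final Map<String, List<Document>> splits = new LinkedHashMap<>();

    void add(String splitName, Document doc) {
        splits.computeIfAbsent(splitName, name -> new ArrayList<>()).add(doc);
    }

    // Returns a read-only view of the split; an unknown split is empty.
    List<Document> getSplit(String splitName) {
        List<Document> docs = splits.get(splitName);
        if (docs == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(docs);
    }

    // Total number of documents across all splits.
    int size() {
        int total = 0;
        for (List<Document> docs : splits.values()) {
            total += docs.size();
        }
        return total;
    }
}
```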
Domain Concept: Trainer
A representation of a machine learning algorithm, which can learn from a Corpus and produce a Model.
Domain Concept: Model
An object that a machine learning algorithm (i.e., a Trainer) creates to store the parameters that are "learned" from the data (i.e., a Corpus).
Domain Concept: Classifier
An object that maps Documents to target values (label, number, probability). It takes a Corpus and a Model as inputs, and produces a Prediction associated with the Corpus according to the Model.
Domain Concept: Prediction
A collection of target values (label, number, probability) associated with a Corpus, i.e., with a collection of Documents.
Domain Concept: Evaluator
An object used for comparing a Prediction against its associated Corpus and generating an Evaluation.
Domain Concept: Evaluation
A summarized representation of the evaluation result produced by an Evaluator.
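The way Trainer, Model, Classifier, Prediction, Evaluator, and Evaluation fit together can be sketched end to end. The learning algorithm here is a majority-class baseline, chosen only because it is short; it is not an algorithm the STAT project prescribes. All fields and method signatures are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Document {
    final String text;
    final String label;

    Document(String text, String label) {
        this.text = text;
        this.label = label;
    }
}

class Corpus {
    final List<Document> documents = new ArrayList<>();

    Corpus add(String text, String label) {
        documents.add(new Document(text, label));
        return this;
    }
}

// The "learned" parameter of the baseline: the most frequent training label.
class Model {
    final String majorityLabel;

    Model(String majorityLabel) {
        this.majorityLabel = majorityLabel;
    }
}

class Trainer {
    Model train(Corpus corpus) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Document doc : corpus.documents) {
            counts.merge(doc.label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        // Strict comparison: on a tie, the label seen first wins.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return new Model(best);
    }
}

// One predicted label per document, in corpus order.
class Prediction {
    final List<String> labels = new ArrayList<>();
}

class Classifier {
    Prediction classify(Corpus corpus, Model model) {
        Prediction prediction = new Prediction();
        for (int i = 0; i < corpus.documents.size(); i++) {
            prediction.labels.add(model.majorityLabel);
        }
        return prediction;
    }
}

class Evaluation {
    final double accuracy;

    Evaluation(double accuracy) {
        this.accuracy = accuracy;
    }
}

class Evaluator {
    // Compares predicted labels with the gold labels stored in the corpus.
    Evaluation evaluate(Prediction prediction, Corpus corpus) {
        int total = corpus.documents.size();
        int correct = 0;
        for (int i = 0; i < total; i++) {
            if (corpus.documents.get(i).label.equals(prediction.labels.get(i))) {
                correct++;
            }
        }
        return new Evaluation(total == 0 ? 0.0 : (double) correct / total);
    }
}
```

A typical run trains on one Corpus, classifies a second one with the resulting Model, and passes the Prediction together with that second Corpus to the Evaluator.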
THE STAT PROJECT
Thanks
[Figure: STAT (brief) Domain Model. Elements: Reader, RawCorpus, Processor, Corpus, Vocabulary, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Writer. Note: We omit the text on connectors for brevity. Some connections are not drawn because of space limitations.]
[Figure: STAT Domain Model. Elements: Reader, RawDocument, RawCorpus, Processor, Document, Corpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation. Note: We omit the text above lines for brevity.]