SHARING COMMON
FUNCTIONALITIES ESSNET
ESSnet SCFE DELIVERABLE D4-2
GRAPHAN – Graphical data analyses service requirements
analys
Project acronym:
SCFE
Project title:
“Sharing common functionalities in the ESS”
Name(s), title(s) and organization of the author(s):
Rudi Seljak
Zvone Klun
Simon Pelicon
Tomaž Špeh
Statistical Office of the Republic of Slovenia
Tel: +386 1 241 64 00
e-mail: [email protected]
This document is licensed under a Creative Commons License: Attribution-ShareAlike 4.0 International
TITRE DU DOCUMENT
1. Introduction
1.1. Purpose
1.2. References
2. Service description
2.1. Business Function Identification
2.1.1. Service name
2.1.2. Service version
2.1.3. Business Process - GSBPM
2.1.4. Service description
2.1.5. Purpose - Business Goals
Figure 1: Graphical analysis service scope
2.2. Outcomes
2.3. Input, Output metadata
2.3.1. GSIM objects
2.3.1.1. Input
2.3.1.2. Output
2.4. Description of the selected methods
2.4.1. Method 1
2.4.2. Method 2
2.4.3. Method 3
2.5. Pseudo code
3. Specific requirements
3.1. User interfaces
3.2. Use case diagrams
3.3. Activity diagram
3.4. Class diagram
3.5. Design constraints
3.6. State diagram - rest service interface
1. INTRODUCTION
The objective of WP4 (Identification of re-usable services and analysis of requirements) was to identify
services that are candidates for re-use in the ESS and to analyse the functional and technical
requirements of one service for re-use in at least three ESS members. This document contains the complete
list of requirements, taking into account the similarities and differences in requirements among the three
ESS members.
The theoretical framework from the first task of WP4 was applied in practice to the selected statistical
service GRAPHAN (Graphical data analysis service). The main criteria for selecting the service were its
potential for wide use across statistical domains and organisations and the existence of a firm
theoretical framework for its methodological description. The selection of the service was discussed and
agreed by the ESSnet partners, TF SERV and Eurostat.
The functional and technical requirements of the selected statistical service were analysed on the basis
of a mutually agreed methodology. The main principle was that the analyses should be based on the practical
experience of statistical organisations. When the harmonised methodology was set up, the available
methodological documents were studied and the results of completed projects and studies were taken into
account.
The Netherlands (CBS), Hungary (HCSO) and Croatia (Croatian Bureau of Statistics) participated in the
process of analysing the functional and technical requirements. Their proposals and comments were taken
into consideration in the final document.
1.1. PURPOSE
The purpose of this document is to give a detailed description of the requirements for the Graphical data
analysis service. It sets out the purpose of the system and the complete specification for its
development. It also explains system constraints, interfaces and potential interactions with external
services. The document is primarily intended to be proposed to statistical institutions for review and
coordination of requirements, and to serve the development team as a reference for developing the first
version of the system. It includes the pseudo code for developing the service methods and the REST API
specification for developing the REST service.
1.2. REFERENCES
Generic Statistical Business Process Model (GSBPM)
The GSBPM provides a reference framework for classifying and understanding the statistical production
activities of an NSI. It covers the entire production cycle of official statistics, including their evaluation and
gathering of user needs, the design and build aspects, and the collection, production and dissemination of
statistics. The GSBPM is a common reference framework for all NSIs and it is widely used within them. It is
the key instrument used to identify and define services.
Generic Statistical Information Model (GSIM)
The GSIM provides a reference framework and conceptual information objects for statistics. One of the
key aspects of the GSIM is that it provides a common language to describe statistical information,
therefore enabling sharing and modernisation. The GSIM is used to describe the conceptual inputs and
outputs of statistical services.
Common Statistical Production Architecture (CSPA)
The CSPA provides a framework, principles and guidelines for developing statistical services. The aim of
the CSPA is to foster international collaboration in developing and sharing interoperable, reusable
statistical services. The CSPA is based on the principles of Service Oriented Architecture (SOA) and builds
on the GSBPM and the GSIM in order to define the statistical context for the SOA approach.
ESS Enterprise Architecture Reference Framework (ESS EARF)
The ESS EARF provides a series of artefacts that support and guide the implementation of Vision 2020.
The ESS EARF provides a Capability Model and a series of application building blocks for the ESS, as well
as related architectural design principles. The ESS uses the EARF for governance of programmes and
projects, ensuring that the deliverables of these are aligned with the EARF artefacts.
ESS Statistical Production Reference Architecture (SPRA)
The SPRA expands the Information System Architecture of the ESS EARF. It provides principles, examples
and guidance on how the different application building blocks interrelate and what services they support.
The SPRA can be used to guide the identification and definition of statistical services; priority should be
given to starting from the business architecture domain and the GSBPM.
2. SERVICE DESCRIPTION
2.1. BUSINESS FUNCTION IDENTIFICATION
2.1.1. SERVICE NAME
GRAPHAN - Graphical data analysis service
2.1.2. SERVICE VERSION
Version 1.0 - Initial release
2.1.3. BUSINESS PROCESS - GSBPM
GSBPM 5.0 – Sub-process 5.3 [Review and validate]
GSBPM 5.0 – Sub-process 6.2 [Validate outputs]
2.1.4. SERVICE DESCRIPTION
The aim of this service is to provide a tool for graphical representation of the data that gives
statisticians detailed insight into the data distribution and helps them detect deviations and suspicious
values or patterns. The procedure is mainly intended for detecting irregularities at the macro level, so
it can be classified as part of the so-called macro editing stage of the process. The tool only detects
suspicious and potentially erroneous values. Other tools should be used for data correction.
On the general level the service should provide two basic types of analyses: cross-sectional and
longitudinal. The cross-sectional analysis explores the data of only one survey instance, mostly exploring
the univariate or multivariate data distribution. On the other hand, the data set that is explored with the
longitudinal analysis consists of the data from several survey instances. This analysis hence focuses on the
longitudinal aspect of the data distribution, aiming at detecting the irregularities in the temporal data
distribution.
2.1.5. PURPOSE - BUSINESS GOALS
The graphical analysis as an activity can be placed in different stages of the statistical process. It can for
instance be used at the very beginning of the data processing cycle, exploring the raw incoming data, or at
the very end of the processing, when the final aggregated data are verified before the tabulation and
dissemination activities. The main goal is to provide visual representation of the data, mostly on the
aggregated level, but it can also be used for visual representation of the microdata for the selected unit
(e.g. to explore its movement through time) or for visual representation of both levels (micro and macro)
together (e.g. to compare temporal movement of the microdata of the selected unit and the temporal
movement of the aggregate).
The basis for the graphical analyses is the incoming set of microdata. The visual representation should be
enabled on the level of the whole data set or on the level of the selected statistical domain.
Although this activity is classified as one of the “macro editing activities”, it does not aim to validate
the aggregates directly (e.g. with macro validation rules or, more specifically, VTL validation rules).
Another service should be defined and specified for this purpose.
Although this activity is well suited to detecting outlying values in the data distribution visually, it
does not aim to list the outlying values explicitly by means of a calculation procedure. Another tool,
incorporating the different outlier detection methods, should be used for that purpose.
FIGURE 1: GRAPHICAL ANALYSIS SERVICE SCOPE
2.2. OUTCOMES
The outcome of the graphical data analysis service is a business function that provides survey
statisticians with a flexible and user-friendly tool for quick insight into the data distribution and,
consequently, quick detection of possible irregularities in the data at the micro and the macro level. The
service should be a “methods-based service”, meaning that it should encompass several different methods
for graphical visualisation. The tool should be open in the sense that new methods can easily be added.
2.3. INPUT, OUTPUT METADATA
2.3.1. GSIM OBJECTS
2.3.1.1. INPUT
The service will use the following inputs:
● Input microdata dataset to be analysed
● Structural metadata (name and description of the table, variables, etc.)
● Processing metadata (processing rules)
2.3.1.2. OUTPUT
The service should provide the following outputs:
● Machine readable and presentable charts and/or other (required) images
● Output tables with accompanying results (e.g. analysed aggregates, correlation coefficients,
regression coefficients, etc.)
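The inputs and outputs listed above could be carried by two simple containers, sketched here in Python. This is only an illustration under stated assumptions: the class names, field names and the `undescribed_variables` helper are hypothetical and are not prescribed by this specification or by GSIM.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AnalysisRequest:
    """Inputs of one service call (all names are illustrative assumptions)."""
    microdata: list[dict[str, Any]]        # input microdata, one dict per unit
    structural_metadata: dict[str, str]    # variable name -> description
    processing_metadata: dict[str, Any]    # method and its parameters


@dataclass
class AnalysisResult:
    """Outputs of one service call (all names are illustrative assumptions)."""
    chart: bytes                           # rendered image, e.g. PNG bytes
    tables: dict[str, list[dict[str, Any]]] = field(default_factory=dict)


def undescribed_variables(req: AnalysisRequest) -> list[str]:
    """Return variables named in the processing metadata that the
    structural metadata does not describe."""
    keys = ("VAR1", "VAR2", "DOM_VAR", "VAR_REF", "VAR_ID", "W")
    used = [req.processing_metadata[k] for k in keys
            if req.processing_metadata.get(k) is not None]
    return [v for v in used if v not in req.structural_metadata]
```

A check of this kind could run before any method is executed, so that a request referring to an unknown variable is rejected early.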
2.4. DESCRIPTION OF THE SELECTED METHODS
2.4.1. METHOD 1
Notation: M1
Title: Scatter plot for selected variables
Type of analysis: Cross-sectional
Description: A scatter plot for two selected variables is plotted. The scatter plot can be plotted for the
variables’ values in the entire input dataset or only for the selected domain determined by the selected
categorical variable and its unique value. The range of values for which the scatter plot is plotted can
additionally be limited by the given logical expression.
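The data preparation behind method M1 can be sketched in Python as follows. The sketch makes several assumptions that are not part of this specification: the function names are hypothetical, the microdata are represented as a list of dictionaries, the logical expression is passed as a callable, and the sample covariance (divisor n - 1) is used. Plotting itself is left out; the returned points are what a scatter plot would display.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


def scatter_points(rows: list[Row], var1: str, var2: str,
                   log_cond: Optional[Callable[[Row], bool]] = None,
                   dom_var: Optional[str] = None,
                   dom_cat: Any = None) -> list[tuple[float, float]]:
    """Apply the optional logical condition and domain filter,
    then return the (VAR1, VAR2) pairs to be plotted."""
    if log_cond is not None:
        rows = [r for r in rows if log_cond(r)]
    if dom_var is not None:
        rows = [r for r in rows if r[dom_var] == dom_cat]
    return [(r[var1], r[var2]) for r in rows]


def covariance_matrix(points: list[tuple[float, float]]) -> list[list[float]]:
    """Sample covariance matrix [[Var1, Cov12], [Cov21, Var2]] (assumed divisor n - 1)."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two units are needed")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return [[var_x, cov_xy], [cov_xy, var_y]]
```

For example, `scatter_points(rows, "x", "y", dom_var="region", dom_cat="A")` keeps only the units of region A before the points are handed to a plotting library.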
2.4.2. METHOD 2
Notation: M2
Title: Bar chart of values of statistics in the selected domain categories
Type of analysis: Cross-sectional
Description: For the selected statistic and the selected domain (determined by the categorical variable),
a bar chart is created in which the value of the statistic is presented for each category of the domain
variable. The values of the selected statistic are calculated from the input dataset on the basis of the
provided process metadata. The range of values for which the bar chart is plotted can additionally be
limited by the given logical expression.
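The per-category calculation behind method M2 can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name and the list-of-dictionaries data layout are hypothetical, only TOTAL, AVERAGE and RATIO OF TOTALS are covered, and a missing weight is treated as a weight of 1 for every unit.

```python
from collections import defaultdict
from typing import Any, Optional

Row = dict[str, Any]


def domain_statistics(rows: list[Row], stat_type: str, dom_var: str,
                      var1: str, var2: Optional[str] = None,
                      weight: Optional[str] = None) -> dict[Any, float]:
    """Return {category: value of the statistic}, i.e. the bar heights."""
    groups: dict[Any, list[Row]] = defaultdict(list)
    for r in rows:
        groups[r[dom_var]].append(r)

    result: dict[Any, float] = {}
    for category, units in groups.items():
        # Assumption: without a weight variable every unit has weight 1.
        w = [u[weight] if weight is not None else 1.0 for u in units]
        total1 = sum(wi * u[var1] for wi, u in zip(w, units))
        if stat_type == "TOTAL":
            result[category] = total1
        elif stat_type == "AVERAGE":
            result[category] = total1 / sum(w)
        elif stat_type == "RATIO OF TOTALS":
            # Raises ZeroDivisionError if the total of VAR2 is zero.
            total2 = sum(wi * u[var2] for wi, u in zip(w, units))
            result[category] = total1 / total2
        else:
            raise ValueError(f"unsupported STAT_TYPE: {stat_type}")
    return result
```

The returned mapping corresponds to the DOM_CAT and STAT_VAL columns of the output table and can be passed directly to a bar chart routine.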
2.4.3. METHOD 3
Notation: M3
Title: Line chart of values of statistics with and without selected unit
Type of analysis: Longitudinal
Description: For the selected statistic, selected time period, selected domain (determined by the
categorical variable) and selected unit, a line chart is created in which, for each survey reference
period (inside the given upper and lower time limits), the value of the statistic is presented both with
and without the selected unit. The values of the selected statistic are calculated from the input dataset
on the basis of the provided process metadata. The range of values for which the line chart is plotted can
additionally be limited by the given logical expression.
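The comparison behind method M3 can be sketched in Python for the statistic TOTAL. The sketch rests on assumptions that are not part of this specification: the function name and data layout are hypothetical, reference periods are values that sort in chronological order (e.g. strings such as "2020Q1"), and a missing weight is treated as a weight of 1.

```python
from typing import Any, Optional

Row = dict[str, Any]


def totals_with_and_without(rows: list[Row], var1: str, var_ref: str,
                            date_s: Any, date_e: Any,
                            var_id: str, id_unit: Any,
                            weight: Optional[str] = None
                            ) -> list[tuple[Any, float, float]]:
    """Return (period, total with unit, total without unit) per reference period."""

    def total(units: list[Row]) -> float:
        # Assumption: without a weight variable every unit has weight 1.
        return sum((u[weight] if weight is not None else 1.0) * u[var1]
                   for u in units)

    periods = sorted({r[var_ref] for r in rows if date_s <= r[var_ref] <= date_e})
    table = []
    for t in periods:
        # The full dataset is filtered afresh for every period.
        units_t = [r for r in rows if r[var_ref] == t]
        units_t_without = [r for r in units_t if r[var_id] != id_unit]
        table.append((t, total(units_t), total(units_t_without)))
    return table
```

Each tuple supplies one point for each of the two lines in the chart; a large gap between the two values in a period indicates that the selected unit dominates the aggregate there.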
2.5. PSEUDO CODE
OBTAIN the respective process metadata
READ the input data set
IF METHOD=M1 THEN
    OBTAIN the process metadata for method M1
    ● VAR1: Variable 1
    ● VAR2: Variable 2
    ● LOG_COND: Logical condition to reduce the dataset to be analysed (optional)
    ● DOM_VAR: Domain variable (optional)
    ● DOM_CAT: Value of the domain variable (category) that determines the domain "cell" where the
      analyses will be performed (optional)
    IF (LOG_COND Is Not Null) THEN
        DATASET → DATASET (Where LOG_COND)
    END IF
    IF (DOM_VAR Is Not Null) THEN
        DATASET → DATASET (Where DOM_VAR=DOM_CAT)
    END IF
    PLOT scatter plot from DATASET
    CREATE output image of scatter plot
    COMPUTE covariance matrix
    CREATE output table COVMAT
                VAR1      VAR2
        VAR1    Var1      Cov1,2
        VAR2    Cov2,1    Var2
    CREATE output table OUTTABLE (IDENT, VAR1, VAR2)
END IF
IF METHOD=M2 THEN
    OBTAIN the process metadata for method M2
    ● STAT_TYPE: Type of the statistical aggregate to be presented; select from the following list: TOTAL,
      AVERAGE, MEDIAN, RATIO OF TOTALS, CHAINED INDEX
    ● IF STAT_TYPE IN ("TOTAL", "AVERAGE", "MEDIAN", "CHAINED INDEX") THEN OBTAIN
        ■ VAR1: Variable 1
      ELSE IF STAT_TYPE="RATIO OF TOTALS" THEN OBTAIN
        ■ VAR1: Variable 1
        ■ VAR2: Variable 2
    ● LOG_COND: Logical condition to reduce the dataset to be analysed (optional)
    ● DOM_VAR: Domain variable
    ● W: Weight (optional)
    IF NOT (LOG_COND Is Null) THEN
        DATASET → DATASET (Where LOG_COND)
    END IF
    CALCULATE values of the statistics in the DATASET for each domain category
    (n: number of units of the category in the (reduced) DATASET
     x_i, y_i: values of VAR1 and VAR2 for unit i
     w_i: weight W of unit i)
    IF STAT_TYPE="TOTAL" and W is Null THEN
        FOR each category from DOM_VAR
            STAT_VAL = $\sum_{i=1}^{n} x_i$
        END FOR
    END IF
    IF STAT_TYPE="TOTAL" and W is Not Null THEN
        FOR each category from DOM_VAR
            STAT_VAL = $\sum_{i=1}^{n} w_i x_i$
        END FOR
    END IF
    IF STAT_TYPE="AVERAGE" and W is Null THEN
        FOR each category from DOM_VAR
            STAT_VAL = $\frac{1}{n} \sum_{i=1}^{n} x_i$
        END FOR
    END IF
    IF STAT_TYPE="AVERAGE" and W is Not Null THEN
        FOR each category from DOM_VAR
            STAT_VAL = $\sum_{i=1}^{n} w_i x_i \,/\, \sum_{i=1}^{n} w_i$
        END FOR
    END IF
    IF STAT_TYPE="RATIO OF TOTALS" and W is Null THEN
        FOR each category from DOM_VAR
            STAT_VAL = $\sum_{i=1}^{n} x_i \,/\, \sum_{i=1}^{n} y_i$
        END FOR
    END IF
    IF STAT_TYPE="RATIO OF TOTALS" and W is Not Null THEN
        FOR each category from DOM_VAR
            STAT_VAL = $\sum_{i=1}^{n} w_i x_i \,/\, \sum_{i=1}^{n} w_i y_i$
        END FOR
    END IF
    CREATE output table OUTTABLE (DOM, DOM_CAT, STAT_VAL)
    PLOT bar chart from OUTTABLE
    CREATE output image of bar chart
END IF
IF METHOD=M3 THEN
    OBTAIN the process metadata for method M3
    ● STAT_TYPE: Type of the statistical aggregate to be presented; select from the following list: TOTAL,
      AVERAGE, RATIO OF TOTALS, CHAINED INDEX
    ● IF STAT_TYPE IN ("TOTAL", "AVERAGE", "CHAINED INDEX") THEN OBTAIN
        ■ VAR1: Variable 1
      ELSE IF STAT_TYPE="RATIO OF TOTALS" THEN OBTAIN
        ■ VAR1: Variable 1
        ■ VAR2: Variable 2
    ● LOG_COND: Logical condition to reduce the dataset to be analysed (optional)
    ● DOM_VAR: Domain variable
    ● DOM_CAT: Value of the domain variable (category) that determines the domain "cell" where the
      analyses will be performed (optional)
    ● VAR_REF: Date variable, which provides the reference period of the survey
    ● DATE_S: Value of the variable VAR_REF that represents the starting reference period of the time
      series to be presented
    ● DATE_E: Value of the variable VAR_REF that represents the ending reference period of the time series
      to be presented
    ● VAR_ID: Variable that represents the unique (in one reference period) identifier of the units
    ● ID_UNIT: Identification of the selected unit
    ● W: Weight (optional)
    IF NOT (LOG_COND Is Null) THEN
        DATASET → DATASET (Where LOG_COND)
    END IF
    IF NOT (DOM_VAR Is Null) THEN
        DATASET → DATASET (Where DOM_VAR=DOM_CAT)
    END IF
    In each iteration below, two subsets are taken from DATASET, which itself stays unchanged:
        DATASET_T = DATASET (Where VAR_REF=t)
        DATASET_T0 = DATASET (Where VAR_REF=t AND VAR_ID<>ID_UNIT)
    STAT_WITH is computed on DATASET_T and STAT_WITHOUT on DATASET_T0
    (n: number of units in the respective subset
     x_i, y_i: values of VAR1 and VAR2 for unit i
     w_i: weight W of unit i)
    IF STAT_TYPE="TOTAL" THEN
        FOR each value t of the variable VAR_REF (where VAR_REF>=DATE_S and VAR_REF<=DATE_E)
            IF W is Null THEN
                STAT_WITH = $\sum_{i=1}^{n} x_i$ on DATASET_T
                STAT_WITHOUT = $\sum_{i=1}^{n} x_i$ on DATASET_T0
            END IF
            IF W is Not Null THEN
                STAT_WITH = $\sum_{i=1}^{n} w_i x_i$ on DATASET_T
                STAT_WITHOUT = $\sum_{i=1}^{n} w_i x_i$ on DATASET_T0
            END IF
        END FOR
    END IF
    IF STAT_TYPE="AVERAGE" THEN
        FOR each value t of the variable VAR_REF (where VAR_REF>=DATE_S and VAR_REF<=DATE_E)
            STAT_WITH = $\frac{1}{n} \sum_{i=1}^{n} x_i$ on DATASET_T
            STAT_WITHOUT = $\frac{1}{n} \sum_{i=1}^{n} x_i$ on DATASET_T0
        END FOR
    END IF
    IF STAT_TYPE="RATIO OF TOTALS" THEN
        FOR each value t of the variable VAR_REF (where VAR_REF>=DATE_S and VAR_REF<=DATE_E)
            STAT_WITH = $\sum_{i=1}^{n} x_i \,/\, \sum_{i=1}^{n} y_i$ on DATASET_T
            STAT_WITHOUT = $\sum_{i=1}^{n} x_i \,/\, \sum_{i=1}^{n} y_i$ on DATASET_T0
        END FOR
    END IF
    CREATE output table OUTTABLE (REF_PERIOD, DOM, DOM_CAT, STAT_WITH, STAT_WITHOUT)
    PLOT line chart with two lines (STAT_WITH and STAT_WITHOUT) from OUTTABLE
    CREATE output image of line chart
END IF
3. SPECIFIC REQUIREMENTS
This section contains all of the functional and quality requirements of the system. It gives a detailed
description of the system and all its features.
3.1. USER INTERFACES
● Login and select Microdata & Method:
● Parameters for selected method:
● Example of result:
3.2. USE CASE DIAGRAMS
3.3. ACTIVITY DIAGRAM
3.4. CLASS DIAGRAM
3.5. DESIGN CONSTRAINTS
The graphical analysis service is designed as a passive and stateless service following the principles of a
REST design pattern. Key principles governing the service architecture design are as follows.
• Service loose coupling: Architectural design excludes direct interaction between services to minimize
dependencies.
• Service abstraction and autonomy: The service is a standalone component and internal logic is hidden
from the outside world. Services have control over the logic they encapsulate.
• Service statelessness: Services are designed to be scalable by separating them from their state data
whenever possible. This reduces the resources consumed by a service, because the actual state data
management is delegated to an external component or to an architectural extension. With lower resource
consumption, the service can handle more requests reliably.
• Service granularity: The service is designed to perform its function with an optimal scope and on the
right granular level. The service has no functions in scope beyond the execution of graphical analyses.
• Service reusability: The design and documentation of the service promote shareability and reusability,
in line with the CSPA principles.
The envisioned high level target architecture is presented in the diagram below.
The main underlying architectural principles for this software are object-orientation and REST (and I18N
principles), for these reasons:
● Object-orientation means thinking in business objects, actors, “nouns”. This is also known as
“domain-driven design”. It mimics real-world thinking and is most typically the main underlying
principle of the programming language in use, thus easy to implement.
● REST is an architecture principle that reduces service interfaces (as in e.g. web client – backend
server communication) to CRUD (create / read / update / delete) operations on business objects,
actors, “nouns”.
● I18N (internationalization / localization) typically means preparing the software to work with
different UI languages.
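The reduction of a service interface to CRUD operations on a business object can be sketched in Python as follows. This is only an illustration under assumptions: the `ChartStore` class, its method names and the `/charts` routes mentioned in the comments are hypothetical and are not taken from the REST API specification. The in-memory dictionary stands in for the external component to which a stateless service would delegate its state data.

```python
import itertools
from typing import Any


class ChartStore:
    """In-memory stand-in for the external state component (hypothetical)."""

    def __init__(self) -> None:
        self._charts: dict[int, dict[str, Any]] = {}
        self._ids = itertools.count(1)

    def create(self, spec: dict[str, Any]) -> int:
        """Create: would correspond to e.g. POST /charts (assumed route)."""
        chart_id = next(self._ids)
        self._charts[chart_id] = dict(spec)
        return chart_id

    def read(self, chart_id: int) -> dict[str, Any]:
        """Read: would correspond to e.g. GET /charts/{id} (assumed route)."""
        return dict(self._charts[chart_id])

    def update(self, chart_id: int, changes: dict[str, Any]) -> None:
        """Update: would correspond to e.g. PUT /charts/{id} (assumed route)."""
        self._charts[chart_id].update(changes)

    def delete(self, chart_id: int) -> None:
        """Delete: would correspond to e.g. DELETE /charts/{id} (assumed route)."""
        del self._charts[chart_id]
```

Because every operation names the chart it acts on, any service instance holding a reference to the same store can handle any request, which is what statelessness asks for.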
To apply these principles, actual software requirements specifications are composed by using commonly
used UML diagrams plus UI mockups. For each aspect of the system (e.g. for each use case) artefacts are
presented below.
3.6. STATE DIAGRAM - REST SERVICE INTERFACE
CRUD chart operation
Create new chart
Start/Stop chart analysis
Show chart or chart table data
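The operations listed above (create, start/stop, show) suggest a small life cycle for a chart resource, which can be sketched in Python as a transition table. The state names, the action names and the rule that results can be shown only after the analysis has finished are all assumptions made for illustration; they are not taken from the state diagrams of this specification.

```python
from enum import Enum


class ChartState(Enum):
    # Hypothetical states of a chart resource.
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    FINISHED = "finished"


# Allowed (state, action) -> next state transitions (assumed).
TRANSITIONS = {
    (ChartState.CREATED, "start"): ChartState.RUNNING,
    (ChartState.STOPPED, "start"): ChartState.RUNNING,
    (ChartState.RUNNING, "stop"): ChartState.STOPPED,
    (ChartState.RUNNING, "finish"): ChartState.FINISHED,
}


def next_state(state: ChartState, action: str) -> ChartState:
    """Return the state reached by an action, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} is not allowed in state {state.value!r}"
        ) from None


def can_show(state: ChartState) -> bool:
    """Assumption: the chart and its table data exist only once finished."""
    return state is ChartState.FINISHED
```

Keeping the transitions in one table makes it straightforward to reject invalid requests, such as stopping an analysis that was never started, with a clear error.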