Talend Data Management Platform Online Training

Platform For Enterprise Data Management Roadshow Booklet …hippocampus.blueoceanmi.com/elibrary/wp-content/uploads/2016/06/... · ... ’INTRODUCTION’TO’TALEND’ADMINISTRATION’CENTER’



Exercise Book Last Updated

Date                TIS Version
12 July 2010        4.02
5 January 2011      4.1.1
31 March 2011       4.1.2
28 June 2011        4.1.2
06 March 2012       5.0.2
21 February 2013    5.2.1
29 April 2014       5.4.1
07 March 2016       6.1.1

Any total or partial reproduction without the consent of the author or beneficiary, devisee or legatee is not allowed (law of 11 March 1957, par. 1 of article 40). Representation or reproduction, by any means, would be considered an infringement of copyright under articles 425 et seq. of the Penal Code. The law of 11 March 1957, par. 2 and 3 of article 41, allows the creation of copies and reproductions exclusively for the private use of the copier and not for collective use on the one hand, while on the other it allows analysts to use short quotes for purposes of illustration.


Table of Contents

DEMO 1: INTRODUCTION TO TALEND DATA MANAGEMENT
DEMO 2: INTRODUCTION TO TALEND ADMINISTRATION CENTER
EXERCISE 1: CREATING A JOBLET
EXERCISE 2: DEDUPLICATION/MATCHING PROCESS
EXERCISE 3: ACTIVITY MONITORING CONSOLE (AMC)
DEMO 3: MORE ADMINISTRATION CENTER
EXERCISE 4: PARALLELIZE PROCESSES FOR EFFICIENCY
EXERCISE 5: DATA MASKING
EXERCISE 6: PARSING XML DOCUMENTS
EXERCISE 7: DISTANT RUN


Demo 1: Introduction to Talend Data Management

Demo 2: Introduction to Talend Administration Center

• Users, Projects, Authorizations, Servers/Virtual Servers

• Job Conductor: Tasks and Triggers

Exercise 1: Creating a Joblet

Objectives:

• Navigate through the Talend Data Integration Studio repository.

• Create a set of reusable components and the links between them for reuse in other jobs.

Prerequisites:

• Familiarity with Talend Data Integration Studio.

• Job Exercise0 exists.

• MySQL database and tables exist and are populated (CIF.customer and CRM.cust).

Description:

Joblets are a method of reusing parts of your Talend processes. They let you centralize future modifications to these common functions. Joblets can either be created from scratch or extracted from an existing job. In the following example, a key operation will be extracted from an existing job and reused with a different data source.

Step 1: Open and run job Exercise0

• Open Exercise0, which is under “Job Designs” in the repository. Notice that a single field is extracted from an existing database connection and transformed. The names in this field look like “Mr. Leland Crenshaw” and will be transformed into the format “Crenshaw, Leland Mr.”. The rows will also be sorted alphabetically by last name. Right-click on the cust input table and choose “Data Previewer”. Note the format of the Name column; this is the format of the input data. Run Exercise0.


• After Exercise0 is run, preview the data in the output file and notice how the name formatting and sorting transformation worked.
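The transformation performed by Exercise0 can be pictured in a few lines of plain Python. This is only an illustrative sketch, not the code Talend generates: the function names are made up, every name is assumed to have exactly three space-separated parts (title, first name, last name), and the second sample name is invented.

```python
def reformat_name(full_name: str) -> str:
    """Turn 'Mr. Leland Crenshaw' into 'Crenshaw, Leland Mr.'.

    Assumes exactly three space-separated parts: title, first name, last name.
    """
    title, first, last = full_name.split()
    return f"{last}, {first} {title}"


def transform(names):
    """Reformat every name, then sort; the last name now leads each string."""
    return sorted(reformat_name(n) for n in names)


if __name__ == "__main__":
    # "Ms. Ada Baker" is an invented sample value.
    for row in transform(["Mr. Leland Crenshaw", "Ms. Ada Baker"]):
        print(row)
```

Sorting after reformatting works here because the reformatted string starts with the last name.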


Step 2: Refactor key steps as a Joblet

• Extract the key components of this transformation and refactor them as an independent Joblet. To do this, hold down the <ctrl> key and left-click to highlight the 3rd through 5th components only (tExtractDelimitedFields, tSortRow, tMap). Alternatively, you can “rubber-band” around this group of three components. When all three of them have the dots around them, right-click in the box and choose the bottom sub-menu item “Refactor to Joblet”. Name your new joblet “NameXform”.

• When prompted whether you want to get the schema of the target component, click on the “No” button.

• After clicking on No, you will see the following: the new joblet NameXform appears in Joblet Designs, and the Palette has new joblet components.


Step 3: Make a new database connection (CIF)

• Open the Metadata portion of the Repository. Right-click on “Db Connection” and choose “Create connection” from the pop-up menu. In the next panel, name the new connection “CIF” and click Next. Enter the following connection parameters into the respective fields from the top down (some may already be preset):

Screenshot callouts: New Joblet is created; New Joblet is referenced in Job.


• From the Repository, open Metadata->DB Connections and highlight the CIF connection. Use the right mouse button to select the Retrieve Schema option.

• Click “Next” on the first screen and then choose the tables that you want to retrieve.

• The last screen will show you the schemas for the tables and columns.


Step 4: Use the new joblet as a single component in a new job

• Right click on Job Designs from within the repository and create a new job named EX1_UseNewJoblet.

• Drag and drop the CIF.customer input component as Input to the design workspace.

• Drag and drop a tMap component from the Palette (Processing category) to the right of the “customer” input component.

• Drag and drop the new joblet NameXform from the palette (Joblets) to the right of the tMap component.

• Drag and drop a tFileOutputDelimited component to the right of the NameXform component.

• Highlight the customer input table and use the right mouse button to create a mapping to the tMap component.


• Double click on the tMap component to bring up the tMap editor.

• Use the green + button at the top right-hand side to add a new output. Call it “nameIn”.

• Click on the green + at the bottom right-hand side of the editor to add a new column. Call it nameIn.

• Map Name from the left-hand side (input) to nameIn on the right-hand side (output).


• Click on the OK button in the tMap editor to apply the changes.

• Map the main connection from the NameXform joblet component to the output file component.

• Run and observe the results of your job and joblet.

• You can observe the output of the transformation by highlighting and right-clicking the output file component and choosing Data Viewer from the sub-menu.


THIS CONCLUDES EXERCISE 1


Exercise 2: Deduplication/Matching Process

Objectives:

• Demonstrate the ability within the Talend Data Management Platform to use the match analysis within the Talend Profiler to find duplicates within your data.

• Demonstrate the ability to use the match analysis you created in the Talend Profiler to create data quality jobs that allow you to act upon the duplicate data.

Description:

This exercise explores the deduplication/matching capabilities of the Talend Platform for Data Management, focusing on the following:

• Data Profiler – allows you to create a match analysis to discover duplicate records in your data.

• Data Integration – allows you to use the results of the match analysis and create DQ jobs to do the following:

    o Write duplicate data to the Data Stewardship Console
    o Auto-survive duplicate records

Step 1: Creating a New Matching Analysis

• Switch to the Profiling Perspective.

• Create a new Matching analysis to check for duplicate records and call it accts_dups.

• Select the data source using the ‘Select Data’ button.


• Select ‘metadata’ under FileDelimited connection->accounts_for_dups and select all of the columns.

• Press the OK button and the selected data will be displayed. As you can see, the first two records are duplicates.

• Select the Blocking Key. The Blocking Key is used to group the data before searching for duplicate records, so that only records within the same block are compared with each other. Assign ‘state’ as the Blocking Key by clicking on ‘Select Blocking Key’ and then clicking on the ‘state’ column.

• Scroll down and set the Algorithm for state to ‘exact’ and press the ‘Chart’ button to see the blocking distribution of the data based on state.
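The effect of an exact blocking key on ‘state’ can be sketched in plain Python. This is an illustration of the idea only, not Talend's implementation; the function names and the sample rows are invented, and only the column name ‘state’ comes from the exercise.

```python
from collections import defaultdict
from itertools import combinations


def block_by(records, key):
    """Group records by the exact value of the blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only the record pairs that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Invented sample rows.
SAMPLE = [
    {"acctName": "Acme Corp", "state": "CA"},
    {"acctName": "Acme Corporation", "state": "CA"},
    {"acctName": "Birch Ltd", "state": "NY"},
    {"acctName": "Cedar Inc", "state": "CA"},
]

if __name__ == "__main__":
    blocks = block_by(SAMPLE, "state")
    # Block sizes are what a blocking distribution chart summarizes.
    print({state: len(rows) for state, rows in blocks.items()})
    print(sum(1 for _ in candidate_pairs(blocks)), "pairs to compare")
```

With four rows, comparing everything against everything would need six comparisons; blocking on state reduces that to the three pairs inside the CA block.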


• Select the matching keys by clicking on the ‘Select Matching Key’ button and then clicking the columns that you want to use for matching. In this exercise, you will use acctName, addr, city, state and zip.

• Scroll down and configure the Matching Function and Confidence Weight for the matching keys and press the Chart button to see the results.

• This shows that there are 690 unique records, 75 duplicate records with a group size of 2, 14 duplicate records with a group size of 3.
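One way to think about matching keys and confidence weights is as a weighted average of per-column similarities. The sketch below assumes that model and swaps in Python's difflib.SequenceMatcher as the similarity function; it is not one of the Profiler's matching functions, and the sample records and weights are invented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not a Talend matching function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, weights):
    """Weighted average of the per-key similarities (assumed scoring model)."""
    total = sum(weights.values())
    weighted = sum(w * similarity(rec_a[k], rec_b[k]) for k, w in weights.items())
    return weighted / total


# Invented sample records using the matching keys named in the exercise.
A = {"acctName": "Acme Corp", "addr": "1 Main St", "city": "Fresno", "state": "CA", "zip": "93650"}
B = {"acctName": "Acme Corporation", "addr": "1 Main St", "city": "Fresno", "state": "CA", "zip": "93650"}

EQUAL_WEIGHTS = {"acctName": 1, "addr": 1, "city": 1, "state": 1, "zip": 1}
HEAVY_NAME = dict(EQUAL_WEIGHTS, acctName=10)

if __name__ == "__main__":
    print(match_score(A, B, EQUAL_WEIGHTS))
    print(match_score(A, B, HEAVY_NAME))
```

Under this model, raising the weight of a key on which two records differ pulls their overall score down, which changes how many records clear the match threshold.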


• Scroll up to see the detailed results in the data. Based on the color coding, the first two records are unique records and the next two records are duplicate records.

• Scroll to the right to see the additional columns added by the matching process.

GID: Unique group identifier.

GRP_SIZE: If this is equal to 1, then it is a unique record; else it is part of a group and tells how many duplicate records are in that group.

MASTER: The first record of every group will be the master record and this is signified by a value of ‘true’; else it has a value of ‘false’. This is used when comparing the records and is reflected in the SCORE column.

SCORE: This shows how closely the current record is related to the MASTER record and is based on a high score of 1.0.

GRP_QUALITY: This is the lowest score of the group and is used when determining which ‘bucket’ to put the duplicates in (discussed later).
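The relationship between these columns can be sketched as follows. This is a simplified illustration with an invented function name and invented scores; in particular, it attaches GRP_QUALITY to every row of the group, which may differ from how the tool lays out its output.

```python
def annotate_group(gid, scores_vs_master):
    """Build the match columns for one group.

    scores_vs_master holds the score of each record against the group's
    first record (the master), so the first entry is 1.0.
    """
    size = len(scores_vs_master)
    quality = min(scores_vs_master)  # lowest score in the group
    return [
        {
            "GID": gid,
            "GRP_SIZE": size,
            "MASTER": index == 0,  # the first record of the group is the master
            "SCORE": score,
            "GRP_QUALITY": quality,
        }
        for index, score in enumerate(scores_vs_master)
    ]


if __name__ == "__main__":
    for row in annotate_group("g1", [1.0, 0.93, 0.88]):  # invented scores
        print(row)
```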


Step 2: Running the match analysis

• Click on the run button to execute the analysis. This will switch to the results tab. The data is split into three buckets:

Unique Records: this is based on the ‘Match Threshold’, which is set to .85 by default. This threshold controls the tightness/looseness of your matches; a record that does not reach it against any other record stays unique.

Matched Records: this is based on the ‘Confident Match Threshold’, which is set to .90 by default. If the GRP_QUALITY is >= ‘Confident Match Threshold’, then it will be a matched record.

Suspect Records: this is also based on the ‘Confident Match Threshold’. If the GRP_QUALITY is < ‘Confident Match Threshold’, then it will be a suspect record.
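The bucket rule can be written down compactly. The sketch below is an assumption-laden illustration: the function name is invented, and it treats a group size of 1 as the marker of a unique record, with the .90 default taken from the text above.

```python
def bucket(grp_size: int, grp_quality: float, confident_threshold: float = 0.90) -> str:
    """Assign a record to one of the three result buckets."""
    if grp_size == 1:
        return "unique"  # no other record reached the match threshold
    if grp_quality >= confident_threshold:
        return "matched"
    return "suspect"


if __name__ == "__main__":
    print(bucket(1, 1.0))
    print(bucket(2, 0.93))
    print(bucket(3, 0.88))
    # Raising the confident threshold moves borderline groups to "suspect".
    print(bucket(2, 0.93, confident_threshold=0.95))
```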

Step 3: Changing the parameters

• Click on the ‘Analysis Settings’ tab and change the ‘Confidence Weight’ for acctName to 10 and press ‘Chart’ to see the results.


• Change the ‘Match Threshold’ to 0.80 and press ‘Chart’ to see the results.

• Change the ‘Confident match threshold’ to 0.95 and press run to see the results.

Step 4: Exporting the Match Rule

• Click on the export button to export the match rule so that you can use it later in a data integration job.


• Save the rule as ‘match_rule’ and click on ‘Finish’.

• The new rule now appears in the repository.


Step 5: Create a matching job

• Switch to the Integration perspective

• Create a new job named EX5_Matching.

• Add the following components to the job designer and configure them based on the examples below.

• Drag AccountDataForDeDuplication from the repository under Metadata->File delimited.

• Configure the tGenKey component. This component is needed because you used a blocking key. If you are using a single blocking key and it is set up as ‘Exact’, then you do not need to add the tGenKey component; you can just add the blocking key in the tMatchGroup component. Click on the Import button to import the match rule that you created in the profiler.


• Click on ‘match_rule’ and the state blocking key will be added. Press OK.

• The tGenKey component should look like this.

• Configure the tMatchGroup component by double clicking on it. Press the import button to import the match rule you created in the Profiler.


• Select ‘match_rule’ and press OK.

• Add the blocking key by clicking on the green plus sign in the Blocking Definition window and then selecting T_GEN_KEY.


• Click on the Refresh button to display the matching results. These results should match the results that you saw with the profiler. Press OK to save and close the Configuration Wizard.

• Configure the tLogRow component. The tLogRow is being used to display the results and is useful when testing and debugging. This will be changed later to write the results to different outputs. Click on ‘Table’ Mode.


• Run the job to see the results. Double click on the Run tab after the job has finished executing. Scroll up to the top of the results and scroll to the right. The columns GID, GRP_SIZE, MASTER and SCORE are added by the tMatchGroup component and were explained in the Profiler steps. If the GRP_SIZE is not equal to 1, then that record is part of a group (i.e., a duplicate).

• Scroll to the left to see the duplicate records.


Step 6: Modify the match job as follows:

• Write the unique records to a flat file.

• Use the auto-survivorship capability to automatically merge the ‘confident’ duplicates.

• Write the ‘suspect’ duplicates to the Data Stewardship Console so a user can manually merge the duplicates.

• Due to time constraints, the match job has already been created, but the steps will be described below. The final job is EX5_Matching_Final.

• Configure the tMatchGroup component to write to multiple outputs. By default, it writes to one output, which requires you to use a tMap to distinguish between unique records and duplicate records. You also do not have the capability to distinguish between ‘confident’ duplicates and ‘suspect’ duplicates if using one output. To change this, click on the tMatchGroup Component tab and then click on Advanced Settings. Select the ‘Separate output’ option.


• Right-click on the tMatchGroup and select Row. The multiple outputs are listed. In the next steps, you will define where the data is written for each type.

• Drag a tFileOutputDelimited to the design window and map the ‘Uniques’ row to it.

• The ‘Matches’ are the duplicate records that are considered to be confident matches and this is based on the ‘Confident match threshold’ found in the Advanced Settings for the tMatchGroup component. This value defaults to .90. For the ‘Matches’, you want to set up a process to automatically merge the records based on conditions. The tRuleSurvivorship component is used for this. Drag a tRuleSurvivorship component to the design window and map the Matches row to it.


• Configure the tRuleSurvivorship component. You need to add a rule for every column that you want to write to the output. If you do not add a rule, the output value will be null. The Rule name is arbitrary and does not need to start with a number. When you are done adding the rules, you need to click on the ‘Generate rule and survivorship flow’ button.

Step 2: Add partitioning to the match job

• Duplicate job EX5_NoPartition and call it EX5_Partition.

• Add the tPartitioner and tCollector components to the job.

• Drag the following components to the design window and connect as follows: tLogRow, tMap, tFileOutputDelimited (‘Survivors’ and ‘REFTable’). The tLogRow component is not required, but is helpful for debugging purposes and to see the results of the tRuleSurvivorship component.


• Configure the tMap for ‘Survivors’. If the ‘SURVIVOR’ column from the source is ‘true’, then it is a surviving record.

• Configure the tMap for ‘REFTable’. A select condition is set for ‘Survivors’, so you can use the ‘Catch output reject’ option to write the non-surviving records. These are the original records.

Configure the tPartitioner component.

• Press OK and then click Yes when asked to propagate the changes.
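The pairing of a filtered output with a reject-catching output amounts to routing each row to exactly one of two lists. The sketch below illustrates that idea in plain Python; the function name and the sample rows are invented, and only the SURVIVOR column and the output names come from the exercise.

```python
def route(rows, accept):
    """Send rows that satisfy the filter to one list and the rejects to another."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if accept(row) else rejected).append(row)
    return accepted, rejected


# Invented sample rows carrying a SURVIVOR flag.
ROWS = [
    {"acctName": "Acme Corp", "SURVIVOR": False},
    {"acctName": "Acme Corporation", "SURVIVOR": False},
    {"acctName": "Acme Corporation", "SURVIVOR": True},
]

if __name__ == "__main__":
    survivors, ref_table = route(ROWS, lambda row: row["SURVIVOR"])
    print(len(survivors), "surviving record(s)")
    print(len(ref_table), "original record(s)")
```

Because every row lands in exactly one list, the two outputs together always account for all input rows.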


• Drag the following components to the design window and connect as follows: tMap, tStewardshipTaskOutput.

• Configure the tMap component. The target columns ‘istarget’, ‘weights’ and ‘taskname’ will need to be added manually.

• Press OK and then click Yes when asked to propagate the changes.


• Configure the tStewardshipTaskOutput component. This component is used to write the data to the Data Stewardship Console.

• Scroll down in the tStewardshipTaskOutput component to add the columns that need to appear in the Data Stewardship Console.

Run the job with the “Exec Time” option checked and notice the execution time.

• Run the job and verify the results.



• Check the results for the tRuleSurvivorship process. Double click on the Run tab to see the results of the tLogRow, which shows the data after it has been processed by the tRuleSurvivorship component.

• In the highlighted results below, the first two records are the original data and the third record is the surviving record. Note that the recNum is null. This is because recNum was not included in the rules for the tRuleSurvivorship component.

• Scroll to the right to see the remaining columns for the data. The columns ‘SURVIVOR’ and ‘CONFLICT’ are added by the tRuleSurvivorship component. If SURVIVOR equals ‘true’, then it is a surviving record; else it is an original record.
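The behaviour described above, where a column without a rule comes out as null in the surviving record, can be sketched like this. The rule names "longest" and "most_common" are illustrative stand-ins and not the component's own rule functions; the function name and sample rows are invented.

```python
from collections import Counter

# Illustrative stand-in rules; not the component's actual rule set.
RULES = {
    "longest": lambda values: max(values, key=len),
    "most_common": lambda values: Counter(values).most_common(1)[0][0],
}


def survive(group, column_rules, all_columns):
    """Merge one group of duplicates into a single surviving record."""
    survivor = {}
    for column in all_columns:
        rule = column_rules.get(column)
        if rule is None:
            survivor[column] = None  # no rule defined, so no value survives
        else:
            survivor[column] = RULES[rule]([row[column] for row in group])
    survivor["SURVIVOR"] = True
    return survivor


# Invented duplicate group.
GROUP = [
    {"recNum": 1, "acctName": "Acme Corp", "city": "Fresno"},
    {"recNum": 2, "acctName": "Acme Corporation", "city": "Fresno"},
]

if __name__ == "__main__":
    rules = {"acctName": "longest", "city": "most_common"}  # no rule for recNum
    print(survive(GROUP, rules, ["recNum", "acctName", "city"]))
```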


• Use the Data Previewer to see the unique records.

• View the records in the Data Stewardship Console (DSC). The DSC is installed as part of the Platform for Data Management installation and is accessed using a web browser. Firefox is used for these exercises. The TAC (Talend Administration Center) must be running for the DSC to be accessible. You can install the TAC as a service or it can be started using a script found at the following location. This is in the Talend installation directory. Double click on the start_tac.bat script to start the TAC service if it is not already running.


• Access the DSC from the web browser using this URL: http://localhost:8080/org.talend.datastewardship. Enter ‘user’ for Login and Password.

• Click on the Login button to access the DSC. The duplicate records are displayed in the DSC.

• Double click on an entry to show the duplicate records.


• You can select the target values (surviving data) in multiple ways:

    o Click on the source value
    o Click on the Select All arrows on a source
    o Use Auto Suggest, which will pick values based on either common values or a trusted source

• Click the Save button and then click on ‘Save and Close’.

• This will put the record in a ‘Resolved’ status.

Step 7: Create a job to unload the resolved data from the Data Stewardship Console.

• Due to time constraints, the job has already been created. The steps will be described below.


• The job is called EX5_DSC_Unload

• Configure the tStewardshipTaskInput component.

• Configure the tMap component to map only the needed fields.

• Connect the tMap component to the tLogRow component and run the job. View the results in the Run tab.


Exercise 3: Activity Monitoring Console (AMC)

Objectives:

• Develop performance sensors that provide precise metrics about the execution of a job.

• Learn how to configure and interpret AMC results.

Prerequisites:

• Familiarity with Talend Data Management Platform.

• AMC has been installed and configured to interact with the proper database within the local copy of MySQL.

Description:

• Talend Activity Monitoring Console helps Talend product administrators or users to achieve enhanced resource management and improved process performance through a convenient graphical interface and a supervising tool.

• Talend Activity Monitoring Console provides detailed monitoring capabilities that can be used to consolidate the collected log information, understand the underlying component and job interaction, prevent faults that could be unexpectedly generated and support system management decisions.

Step 1: Activate monitoring

• Duplicate the Exercise2_Matching_Final job by right clicking it in the repository and naming the new job EX3_AMCjob. Double click the new job to open it.

• Activate “Absolute” (default) monitoring on the Source_data link by checking the “Monitor this connection” box that resides under the “Advanced Settings” tab:


• Activate the monitoring for Confident_Matches and Uniques links.

• Activate the monitoring “Relative” to “Source_data” on the Uncertain_Matches link, and activate the threshold values. Name the thresholds, assign their upper and lower limits and pick their warning colors by clicking the ellipsis button to pop up the color picker.

• You can choose any color for any range, but in this exercise OK=green, WRN=yellow and KO=red.

• Execute the job a few times.
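Relative monitoring with thresholds can be pictured as computing a link's row count as a percentage of the reference link and looking up the range it falls into. In the sketch below the threshold names and colors follow the exercise, but the numeric limits, the row counts and the function names are invented for illustration.

```python
# Hypothetical limits, in percent of the reference link's row count.
THRESHOLDS = [
    ("OK", 0.0, 5.0, "green"),
    ("WRN", 5.0, 15.0, "yellow"),
    ("KO", 15.0, 100.0, "red"),
]


def relative_percent(link_rows: int, reference_rows: int) -> float:
    """Row count of the monitored link as a percentage of the reference link."""
    if reference_rows == 0:
        return 0.0
    return 100.0 * link_rows / reference_rows


def classify(percent: float):
    """Return the (name, color) of the threshold range containing percent."""
    for name, lower, upper, color in THRESHOLDS:
        if lower <= percent < upper:
            return name, color
    # Anything at or beyond the last upper limit counts as the last range.
    return THRESHOLDS[-1][0], THRESHOLDS[-1][3]


if __name__ == "__main__":
    # Invented counts: uncertain matches versus source rows read.
    print(classify(relative_percent(30, 1000)))
    print(classify(relative_percent(100, 1000)))
```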


• Switch to the AMC Perspective, or open it via the highlighted icon, to see how many times your job was executed, that it ran with no errors, and that the run times varied and were graphed.

• Switch to the Meter Log tab and click on the Threshold Charts to see the threshold chart for the number of uncertain matches compared to the number of source data records read.

THIS CONCLUDES EXERCISE 3


Demo 3: More Administration Center

• Contexts and versioning


Exercise 4: Parallelize Processes for Efficiency

Objectives:

• Demonstrate the ability within Talend Data Integration Studio to run suitable processes in parallel to increase efficiency.

Description:

• tParallelize displays as a component on the design workspace. However, its usage is slightly different from that of typical components.

• The tParallelize component itself does not process data or data flows, but helps you to parallelize and synchronize the execution of numerous subjobs in your main Job.

Step 1: Open a serialized job

• Open the latest version of “EX0_SyncTables”.

• Run the Job with the “Exec time” option enabled:

• Note the job execution time. During creation of this exercise book, the runtime was 86,389 milliseconds; this varies according to machine, environment, other applications running and other factors.


Step 2: Parallelize the job and observe a marked decrease in execution time

• Duplicate the EX0_SyncTables job. Name the new job “EX4_Parallelize” and open it.

• Add the tParallelize component from the Orchestration category of the Palette. Connect it to the two subjobs with “Parallelize” links by right-clicking, choosing Parallelize and landing the line on the two input components:

• Also make sure to remove any previous onSubjobOK connections.

• Run the job with the “Exec time” mode activated and record the difference compared to the previous run:

• The “customerscdc” table was loaded at the same time as the “customers_huge” table, so it was not necessary to wait for the “customers_huge” table to load completely before starting work on it.

• In this environment, for this run of the example, we saved approximately 0.90 seconds of loading time, corresponding to the “customerscdc” table (this varies depending on the machine).
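The difference between running the two subjobs one after the other and running them side by side can be sketched with Python's standard thread pool. This is an analogy only, not what tParallelize does internally: the load function just sleeps, the durations are invented, and only the two table names come from the exercise.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_table(name, seconds):
    """Stand-in for a subjob; sleeps instead of loading real data."""
    time.sleep(seconds)
    return name


def run_serial(jobs):
    """Run each subjob only after the previous one has finished."""
    return [load_table(name, seconds) for name, seconds in jobs]


def run_parallel(jobs):
    """Start all subjobs at once and wait for every one to finish."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(load_table, name, seconds) for name, seconds in jobs]
        return [future.result() for future in futures]


def timed(runner, jobs):
    """Return the runner's result together with the elapsed seconds."""
    start = time.perf_counter()
    result = runner(jobs)
    return result, time.perf_counter() - start


# Invented durations.
JOBS = [("customers_huge", 0.05), ("customerscdc", 0.02)]

if __name__ == "__main__":
    print("serial:  ", timed(run_serial, JOBS)[1])
    print("parallel:", timed(run_parallel, JOBS)[1])
```

In the parallel run the total time is governed by the longest subjob rather than the sum of both, which is why the saving corresponds roughly to the shorter load.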

THIS CONCLUDES EXERCISE 4


Exercise 5: Data Masking

Objectives:

• Understand how the Data Masking component can be used to mask your sensitive data.

Description:

• The Data Masking component allows you to mask your sensitive data such as credit card information, SSN, addresses and email addresses.
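To make the idea of masking concrete, here is a small plain-Python sketch of two common masking styles. It does not reproduce the functions offered by the Data Masking component; the function names, the mask character and the sample values are all invented.

```python
def mask_keep_last(value: str, keep: int = 4, mask_char: str = "X") -> str:
    """Mask every letter or digit except the last `keep`, leaving separators."""
    remaining = max(sum(ch.isalnum() for ch in value) - keep, 0)
    masked = []
    for ch in value:
        if ch.isalnum() and remaining > 0:
            masked.append(mask_char)
            remaining -= 1
        else:
            masked.append(ch)
    return "".join(masked)


def mask_email_local(email: str, mask_char: str = "X") -> str:
    """Mask the part before '@' and keep the domain readable."""
    local, separator, domain = email.partition("@")
    return mask_char * len(local) + separator + domain


if __name__ == "__main__":
    # Invented sample values.
    print(mask_keep_last("123-45-6789"))
    print(mask_keep_last("4111 1111 1111 1111"))
    print(mask_email_local("jane.doe@example.com"))
```

Keeping separators and the last few characters preserves the shape of the data, which is often enough for testing while hiding the sensitive part.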

Step 1: Create a connection for the source data.

• Open the Metadata portion of the Repository. Right-click on “File Excel Connection” and choose “Create connection” from the pop-up menu. In the next panel, name the new connection “masking_data” and click Next.

• Select the file location and “Sheet1” and click on Next.

• Click on ‘Set heading row as column names’ and press the Refresh Preview button. Click the Next button.


• Change the type for SSN from Date to String. Click the Finish button to create the connection.

Step 2: Create a job to mask the data.

• Create a new job called EX5_Data_Masking. Drag the following components into the job and connect them. The ‘masking_data’ component is the Excel connection that you created in the step above.

• Configure the tDataMasking component as follows.


• Run the job and verify the results.

THIS CONCLUDES EXERCISE 5

Exercise 6: Parsing XML documents

Objectives:

• Understand how to use the Talend Data Mapper (TDM) to parse XML documents.

• Understand how to create a TDM map to map data from an XML document to a target (flat file).

• Understand how to create a data integration job that calls the TDM map so that you can execute it in production.

Description:

• Talend Data Mapper is used to parse complex data such as XML, JSON, EDI and COBOL documents.

Step 1: Create a structure for the XML document.

• Switch to the Mapping perspective.

• Select the Structure folder, right click and select New > Structure.

• Select the option to Import a structure definition and click on the Next button.

• Select XML Sample Document and click on the Next button.

• Use the Browse button on the Local file line to select the XML document.

• Leave the structure name as Input. Press Next and then Finish.

• The Input structure will be opened in the data mapper window.

• Click on an element to highlight the corresponding data.

Step 2: Create a flat file structure for the output.

• You will create a flat file structure to map the cashitem elements from the XML document.

• Select the Structure folder, right click and select New > Structure.

• Select the option to Create a new structure where you manually enter elements and click on the Next button.

• Name the structure flatfile_output.

• Select Flat Files as the representation.

• This will create the structure and open it in the Data Mapper.

• Right click in the structure window and select the option to create a new element. Name it Root.

• Select the Root element, right click and select the option to create a new element. Name it Record.

• Since the input data is looping, you will need to create a looping element for the output. The Record element will be the looping element. To make this a looping element, change the value for Occurs Max to -1.

• The looping will be represented on the element by ‘(0:*)’.

• Select the Record element, right click, select new element and add a new element called Currency_Code.

• Do the same thing to create the remaining elements. Save the structure.

Step 3: Create a map.

• Select the Map folder, perform a right mouse click and then select the option to create a new map.

• Select Standard Map and click on the Next button.

• Name the map ‘xml2flatfile’ and click on the Finish button.

• This will open a new map in the data mapper window. Drag and drop the input structure and output structure to the map.

• Map the elements under cashitem to the output.

• You can test this by performing a right click on any element on the target and selecting Test Run.

• Here is the result. This is the flat file representation. By default, it is positional with no line breaks.

• For easier viewing, you can select the XML representation.

• To make the flat file delimited, expand the flat file structure and then expand Representations and double click on Flat.

• Select Output Delimited Header and Output as delimited and click on OK. You will need to save the structure.

• Verify the changes by performing another Test Run on the output.
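The difference between the default positional output and the delimited output with a header can be sketched in plain Python. This is only an illustration of the two flat file representations, not of how the Data Mapper generates them. The sample XML, the root element name, the Amount field and the column widths are all hypothetical; only cashitem and Currency_Code come from the exercise.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; only `cashitem` and `Currency_Code`
# are names taken from the exercise.
SAMPLE_XML = """
<cashreport>
  <cashitem><Currency_Code>USD</Currency_Code><Amount>100.50</Amount></cashitem>
  <cashitem><Currency_Code>EUR</Currency_Code><Amount>75.00</Amount></cashitem>
</cashreport>
"""

# (field name, width in characters); the widths are assumptions.
FIELDS = [("Currency_Code", 3), ("Amount", 10)]


def read_records(xml_text: str) -> list:
    """Return one dict per looping `cashitem` element."""
    root = ET.fromstring(xml_text.strip())
    return [
        {name: (item.findtext(name) or "") for name, _ in FIELDS}
        for item in root.iter("cashitem")
    ]


def to_positional(records: list) -> str:
    """Fixed-width fields written back to back, with no line breaks."""
    return "".join(
        rec[name].ljust(width)[:width]
        for rec in records
        for name, width in FIELDS
    )


def to_delimited(records: list, delimiter: str = ",") -> str:
    """One line per record, preceded by a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[name for name, _ in FIELDS],
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = read_records(SAMPLE_XML)
print(repr(to_positional(records)))
print(to_delimited(records))
```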

Step 4: Create a DI job to call the map.

• Switch to the Integration perspective and create a new job called EX6_xml2flatfile. Drag the following components to the job and connect them.

• Configure the tFileInputRaw component to reference the XML file.

• Select the tHMap component and click on the button on the Map Path to select the map you created.

• Select map xml2flatfile.

• Verify that the component tab looks like this.

• Run the job and verify the output.

Step 5: Create a DI job and map to write to a delimited output file using a DI connector.

• Create a File delimited connection to the file that you created in the step above. Name it cashitems.

• Create a new job called EX6_xml2flatfile_DI and add the following components and connect them.

• Configure the tFileInputRaw component to reference the XML file.

• Select the tHMap component and click on the Open Map Editor button.

• Choose ‘Select an existing hierarchical mapper structure’ for your input structure and click on the Next button.

• Select the input structure that you created earlier for the input XML and click the Next button.

• Select ‘Generate a hierarchical mapper structure based on the schema’, click on the Next button and then click on Finish. This will create a map and switch to the mapping perspective.

• Map the elements.

• Since this map was created from a DI job, the output type is ‘Map’. You cannot execute a test run against a map output type. Try to execute a test run and you will receive this message.

• If you want to execute a test run, you will need to add a representation to the output structure. Add an XML representation to the structure.

• Change the output representation in the map by clicking on ‘Map’ for the output structure and then selecting XML.

• After selecting XML, you should see XML beside the Output.

• Execute a test run.

• Before running the DI job, you need to switch the representation back to Map and save the map.

• Switch back to the Integration perspective and run the DI job. Verify the results.

THIS CONCLUDES EXERCISE 6

Exercise 7: Distant Run

Objectives:

• Simulate running Talend jobs on a remote server and observe what services are required to do so.

Description:

• Distant run is a Talend Data Integration function that greatly simplifies testing in other environments by automating deployment and execution.

Step 1: Configure Talend Data Integration Studio for a list of remote servers

• From the main file menu, choose Window > Preferences and type “remote” in the search box at the top left corner of the panel.

• Click on the green plus “+” sign at the top right of the panel and enter the settings for one or more servers where the Talend jobs will be executed. Click in a field to edit it, but be sure to change only the name and Host name fields. For simulation purposes, use the loopback address 127.0.0.1 as the Host name.

• Open job Exercise0 and click on the Run tab.

• Click on the “Target Exec” tab to the left and select the Remote server from the drop-down list.

• Run your job and observe the console output messages.

Step 2: Start the Remote Server binary and re-run the job

• The Talend solution is designed to be scalable, modular and flexible as you can see above. Just take a moment to visualize the requirements at each of the sites, but bear in mind we have collapsed all of them down to your local machine.

• At each of the execution sites, the Talend Job Server needs to be running in order to run jobs on those servers. Typically, this licensed service is installed on each machine and started by a .bat or script file. On Windows, and in this training, the .bat file to run is located at C:\<TIS_Program_Path>\jobserver\start_jobserver.bat. Find and run your Talend job server batch file.
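If you want to confirm that a service is accepting connections on the loopback address before re-running the job, a small TCP check such as the following Python sketch can help. It is independent of Talend and only tests whether something is listening. The port number 8000 is a placeholder assumption; use the port shown in your own remote server settings.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# 8000 is a placeholder; replace it with the port configured for your server.
print(is_listening("127.0.0.1", 8000))
```

A result of False before the batch file is started and True afterwards would be consistent with the job failing in Step 1 and succeeding once the job server is running.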

• Rerun the Exercise0 job and observe the results.

THIS CONCLUDES EXERCISE 7