
Working With Different Kinds of Data


Integrating Data from Multiple Sources

2015-02-26

David Loshin

Knowledge Integrity, Inc.

[email protected]

© 2015 Knowledge Integrity, Inc [email protected] (301) 754-6350

Ingesting Data from Multiple Sources

• Continuously streamed data sources may influence business performance analytics:

– Influence customer satisfaction

– Expose opportunities for revenue generation

– Identify brand risk

– Flag fraud and abuse

– Improve customer profiling and customer experience


Challenges

• Entity identifiability

• Limited or no data governance

• Editorial bias

• Absence of metadata


Entity Identifiability

• Recognizing and resolving identities is challenging even for static, complete data sets

• Entity identifiability becomes more challenging when merging static and streamed information:

– Entity attribute identification

– Entity recognition

– Identity resolution

– Linkage across data sets
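The identity resolution step can be sketched in a few lines of Python. This is a minimal illustration, not a production matcher: the record layout, the `same_entity` rule, and the 0.85 threshold are assumptions made for the example, and the sample records are made up. It normalizes each shared attribute (case, punctuation, token order) and averages a string similarity score from the standard library's `difflib`.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower())
    return " ".join(sorted(kept.split()))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized attribute values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Hypothetical rule: average similarity over attributes both records populate."""
    shared = [k for k in rec_a if k in rec_b and rec_a[k] and rec_b[k]]
    if not shared:
        return False  # nothing to compare, so no evidence of a match
    score = sum(similarity(rec_a[k], rec_b[k]) for k in shared) / len(shared)
    return score >= threshold


# Made-up sample records: one from a static source, one from a stream.
static_record = {"name": "Smith, John", "city": "Springfield"}
streamed_record = {"name": "john smith", "city": "SPRINGFIELD", "handle": "jsmith"}
print(same_entity(static_record, streamed_record))
```

Sorting tokens handles "Last, First" versus "First Last" ordering. Real sources usually need more than this, such as nickname tables, address standardization, and per-attribute weights.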


Is this the same guy?

Limited or No Data Governance

• Little or no knowledge of:

– Defined data quality criteria

– Edits or controls

– Chain of accountability

• Limited shared definitions

– Typically tabular data dictionaries with nondescript definitions

• Harvested data has no discernible lineage

– Completely devoid of context or production chain


Editorial Bias

• Creating data sets for external consumption involves editorial decisions and biases

• Choices are made about:

– The physical structure of the data values

– Which data elements are included

– Which are excluded from the final artifact


Selection criteria

Absence of Metadata

• Numerous data sources have little or no metadata at all:

– Dynamically harvested tabular data

– Scraped data

– Human-generated content

– Automata-generated content

– Unstructured data artifacts

– Other data artifacts (graphics, images, video, audio, etc.)


Example: Healthcare Provider Data

• NPPES: Provider First Line Business Mailing Address

• Definition:

– “provider’s first line business mailing address”

• Open Payments: Recipient_Primary_Business_Street_Address_Line_1

• Definition:

– “The first line of the primary practice/business street address of the physician or teaching hospital (covered recipient) receiving the payment or other transfer of value.”


• Is “provider” the same as “recipient”?

• Are these conformant data elements?

• Actually, it turns out that the Open Payments data element is sourced from the NPPES data set!

Preparing to Integrate

• Infer the source data sets' metadata

• Determine if the data element inventories are structurally conformable

• Determine if the data element inventories are semantically conformable


Inferring Metadata Using Profiling

• Analysis of data sets, records, data elements, and data values to:

– Infer data element types and sizes

– Identify reference value domains

– Make educated guesses about intent/meaning
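As a rough illustration of these profiling steps, the sketch below infers a type and maximum size for each column, counts missing values, and builds a value frequency table that can hint at a reference value domain. The function names, the `domain_limit` parameter, and the "candidate domain" heuristic are assumptions made for this example; commercial profiling tools apply far richer rules.

```python
from collections import Counter


def infer_type(values):
    """Guess the narrowest type that fits every non-empty string value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for name, cast in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue  # at least one value does not fit; try the next type
    return "string"


def profile(rows, domain_limit=10):
    """Profile a list of dict records whose values are strings."""
    result = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col, "") for row in rows]
        counts = Counter(v for v in values if v != "")
        result[col] = {
            "type": infer_type(values),
            "max_length": max((len(v) for v in values), default=0),
            "nulls": sum(1 for v in values if v == ""),
            "value_counts": dict(counts.most_common()),
            # Heuristic: few distinct values, at least one of them repeated.
            "candidate_domain": 0 < len(counts) <= domain_limit
            and len(counts) < sum(counts.values()),
        }
    return result


# Made-up sample rows for illustration.
rows = [
    {"zip": "20901", "state": "MD", "amount": "12.50"},
    {"zip": "10001", "state": "NY", "amount": "7"},
    {"zip": "", "state": "MD", "amount": "3.25"},
]
for column, stats in profile(rows).items():
    print(column, stats)
```

The inferred types are only educated guesses: a postal code column profiles as "integer" here even though it is better treated as a string, which is why a steward should review the inferred metadata.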


Figure: Profiling. Attributes profiled: First, Last, Street, City, State. Value frequency counts shown:

Value Count

A 12000

I 10000

L 7655

X 3208

N 120

M 8

Conformable Data Elements

• Data elements are conformable if they:

– Share the same data element concept

– Share the same value domain

– Share the same definition and semantics


• These two data elements are conformable only if their definitions are the same!

CountryOfOrigin: 2-character ISO 3166 Country Code

CountryOfManufacture: 2-character ISO 3166 Country Code
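The three conformability criteria can be expressed as a simple check. In this sketch the `DataElement` class, the concept labels, and both definitions are hypothetical, written only to show how two elements can share a value domain and still fail on semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: str       # data element concept
    value_domain: str  # permitted values
    definition: str    # business definition


def _norm(text: str) -> str:
    """Compare text ignoring case and extra whitespace."""
    return " ".join(text.casefold().split())


def conformability(a: DataElement, b: DataElement) -> dict:
    """Report each of the three criteria separately."""
    return {
        "concept": _norm(a.concept) == _norm(b.concept),
        "value_domain": _norm(a.value_domain) == _norm(b.value_domain),
        "definition": _norm(a.definition) == _norm(b.definition),
    }


def is_conformable(a: DataElement, b: DataElement) -> bool:
    return all(conformability(a, b).values())


# Hypothetical concepts and definitions, for illustration only.
origin = DataElement(
    "CountryOfOrigin", "country", "2-character ISO 3166 country code",
    "Country where the product's materials originate",
)
manufacture = DataElement(
    "CountryOfManufacture", "country", "2-character ISO 3166 country code",
    "Country where the product was assembled",
)
print(conformability(origin, manufacture))
```

Exact text comparison of definitions is deliberately strict. In practice a data steward decides whether two differently worded definitions mean the same thing.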

Using Metadata to Test Conformability

• Inferred structural metadata provides the first cut at determining whether two data elements are conformable

• Introduce internal governance and management around external metadata:

– Use a metadata repository to capture inferred metadata

– Define policies for identification, assessment, documentation, and use of external data sources

– Institute stewardship for each external data source for process management, validation, and maintenance

• Select a metadata tool that provides:

– Enterprise-wide metadata visibility

– Integration with data assessment tools

– Historical lineage for metadata capture

– Collaboration among data consumers
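To make the "first cut" based on inferred structural metadata concrete, the sketch below compares two inferred column profiles by type and by overlap of observed values. The profile dictionary shape, the `structural_match` function, and the 0.5 overlap threshold are assumptions for illustration, not features of any particular metadata tool.

```python
def structural_match(profile_a: dict, profile_b: dict, min_overlap: float = 0.5) -> dict:
    """First-cut structural comparison of two inferred column profiles.

    Each profile is a hypothetical dict with "type", "max_length", and "values".
    """
    same_type = profile_a["type"] == profile_b["type"]
    values_a, values_b = set(profile_a["values"]), set(profile_b["values"])
    union = values_a | values_b
    # Jaccard overlap of the observed value sets.
    overlap = len(values_a & values_b) / len(union) if union else 0.0
    return {
        "same_type": same_type,
        "max_length": (profile_a["max_length"], profile_b["max_length"]),
        "value_overlap": overlap,
        "candidate": same_type and overlap >= min_overlap,
    }


# Made-up inferred profiles from two external sources.
source_a = {"type": "string", "max_length": 2, "values": ["US", "CA", "MX"]}
source_b = {"type": "string", "max_length": 2, "values": ["US", "CA", "DE"]}
print(structural_match(source_a, source_b))
```

A structural match only nominates a pair of elements for review. Semantic conformability still depends on comparing definitions.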


Questions & Suggestions

• www.knowledge-integrity.com

• www.dataqualitybook.com

• www.decisionworx.com

• If you have questions, comments, or suggestions, please contact me

David Loshin

301-754-6350

[email protected]


EMBARCADERO TECHNOLOGIES

Joy Ruff, Product Marketing Manager | ER/[email protected]

ER/Studio Team Server Overview


Keeping pace with the rapid growth of data, change and compliance

• Evolving Database Ecosystems

• Volume, Velocity, Variety

• Agile Development Cycles

• Maximizing IT Infrastructure

• Compliance

• Limited Resources

Database Professionals Need the Right Tools


Share Models & Metadata with Business & IT


Team Server

ER Repository

Modeling Teams

• Business Analysts

• Executives

• App and DB Developers

• Data Stewards

• DBAs


• Powerful enterprise glossary & metadata collaboration

• Integrate key business terms and definitions with business systems

• View, store, and manage a single source of business definitions

• Attach business policies to daily workflows with contextual alerts and tips


The Power of Unlimited Involvement

• Use business terms to easily locate and relate information assets

• Maintain enterprise glossaries, terms, and underlying metadata in a central interface

• Enable a consistent flow of information and collaboration around data management


Figure labels: Contributors (Business, Architecture, IT); Definition, Structure, Deployment; Syndication, Collaboration, Integration; Consumers (Executive, Analyst, Developer)


Benefit of Relating Metadata to Models

• Expand the depth of information by accessing the underlying framework


• Models and terms integrate seamlessly with one another


The Primary Resource for Data Information


• Manage a single source of business definitions in an enterprise glossary

• Avoid the issue of information stagnation

• Improve productivity and accuracy in data analysis, application, BI and ETL development


Data Source Registry


Unified Glossary and Terms


Empowering the Organization


© 2014 Embarcadero Technologies, Inc. Embarcadero, the Embarcadero Technologies logos, and all other Embarcadero Technologies product or service names are trademarks or registered trademarks of Embarcadero Technologies, Inc. All other trademarks are property of their respective owners. | 102714

Embarcadero Technologies has been committed to developing industry-leading tools in the database management and architecture space for over 20 years. Our ER/Studio Team Server Core environment is the next step on that journey, offering modeling and metadata collaboration and management. Your IT and business users gain visibility into existing data assets at a deeper level, enabling their leverage as the critical decision-making assets they can and should be across the enterprise. If you found Portal to be useful, you’re going to love Team Server Core.

The added functionality of Team Server Core includes unlimited web user read/write access, so that all stakeholders in the company can contribute to and access the critical models, metadata, and the enterprise data dictionary (glossary). Security and data rights management have also been enhanced, so you can have complete confidence that your data is protected and shared with the right people at the right times and in the right formats.

Product features and definitions (Team Server Core compared with Portal):

• Inline Definitions: Integrate enterprise business definitions with data management tools and internal web assets into daily workflows

• Privacy and Security Alerts: Adhere to industry regulations and business standards regarding security and privacy by alerting users who view or modify sensitive data within integrated data management tools

• Semantic Mapping: Develop applications and analyses faster by using business terms to easily find data elements

• Mapped Data Source Registry: Generate information maps by relating data models with their data sources and creating a single searchable registry of all available data sources to store information in one place

• Centralized Reporting: Create and share integrated reports using standard templates and a reporting wizard for ad hoc reports

• Team Collaboration: Apply enterprise collaboration capabilities to capture and use corporate knowledge to reduce time identifying and correcting expensive data quality issues

• Model Sharing: Distribute and view models across the organization, and set permissions for visibility of objects

• Enterprise Glossary: View, classify, relate and centrally store authoritative business definitions in an extensible enterprise glossary

• Custom Extensions: Enhance comprehension of business terms and data elements with custom extensions

• Unlimited Access to Metadata: View, share, and update the enterprise glossaries, business terms, and custom attributes, via the web interface, for any business or IT user

• Limit the level of confusion by centralizing glossaries, terms, and object relationships

• Discuss and add to the development of models and metadata

• Track and gain insight into who and what information has changed in the environment


Team Collaboration


The Right Tools are Everything

Discover the Benefits of the Ultimate Cross-Platform Database Tools


Thank you!

• Learn more about the ER/Studio product family: http://www.embarcadero.com/data-modeling

• Team Server Hosted Trial: http://www.embarcadero.com/products/er-studio/team-server-hosted-trial

• To arrange a demo, please contact Embarcadero Sales: [email protected], (888) 233-2224
