Data Migration in Schemaless NoSQL Databases

CS828-1501C-01

ThienSi (TS) Le

Colorado Technical University

Professor: Dr. Kathleen Hargiss

Phase 5: Individual Project

Data Migration in Schemaless NoSQL Databases

March 15, 2015

Version 1.0 Page 1

Abstract

This short research paper, the Phase 5 Individual Project for the course CS828-1501C-01 Advanced Topics in Database Systems, discusses the concepts of NoSQL databases such as Cassandra, MongoDB, Neo4J, and Riak. These databases adopt an aggregate data model that supports application-oriented aggregates, embrace schemaless data, run on clusters in a distributed network, and often trade data consistency for other useful properties. The paper describes the concepts associated with NoSQL’s schemalessness and then focuses on data migration, especially on how to ensure that the data stored in the database still matches the implicit schema embedded in the applications after that implicit schema has changed. The discussion, which also covers the general principles of conducting data migration and the test strategy in NoSQL databases, consists of four main sections:

A. The concept of NoSQL databases

This section discusses the “noDefinition” of NoSQL databases through their distinct characteristics, briefly compares NoSQL with traditional relational databases, and reviews the recent emergence of NoSQL databases in Internet-centric services.

B. Aggregate Data Models

This section covers the aggregate data model and discusses some of its pros and cons.

C. Schemalessness and Implicit Schema

This section describes the central concepts of the schemaless database and the implicit schema in NoSQL databases.

D. Data Migration in NoSQL database with implicit schema

This section discusses data migration with an implicit schema in depth. It covers the principles, strategy, and test options of data migration when the application code contains the implicit schema, and it includes two demonstration examples.

A list of the references used in this individual project is provided at the end of the document.

Data Migration in Schemaless NoSQL Databases

In the modern era of data and information, several novel standards that have emerged in computing, automation, and technology have produced enormous amounts of electronic data. Corporations, governments, and the academic community, in both the public and private sectors, have turned to database management systems (DBMS) to help them operate enterprises and conduct business locally and globally in a very competitive market. According to Bloomberg Businessweek (2011), many Fortune 500 companies have used traditional relational DBMSs from one vendor or another to conduct and control their business. However, with the vast amount of electronic, nonuniform data and custom data fields generated by Web estates and services such as cloud computing, business intelligence, and science and technology, the NoSQL database, a schema-free (schemaless) database with an aggregate data model, has emerged as a solution for handling big data (Chen, Chiang and Storey, 2012). Data migration has become a primary issue for many companies with multiple types of applications in web services, e-commerce, business intelligence, e-government and politics, smart health, and security and public safety.

A. NoSQL Databases

NoSQL is an acronym for Not Only Structured Query Language (Hargiss, 2015).

1. What is the definition of a NoSQL database?

According to Sadalage and Fowler (2012), NoSQL databases have a few distinct

characteristics:

- They do not use SQL (Structured Query Language).

- They are usually open-source projects.

- Most of the NoSQL databases are driven by the enterprises’ need to run on

clusters.

- They are based on the needs of early 21st century Web estates.

- They support polyglot persistence; that is, different data stores are used in different circumstances.

- Perhaps most unusually, they operate without a schema (i.e., they are schema-free or schemaless and rely on an implicit schema).

The crude set of characteristics above does not amount to a definition, and there is no standard for NoSQL databases. Sadalage and Fowler (2012) therefore describe NoSQL as having no definition (a “noDefinition”).

2. NoSQL database versus the traditional relational DBMS

A NoSQL system is a non-relational data storage system that requires neither a relational schema nor joins, and it relaxes the ACID properties to some degree. NoSQL database management systems have recently emerged as an alternative to the traditional relational database management system (RDBMS) (Connolly and Begg, 2014) for several typical reasons:

a. The relational model is not a universal, be-all and end-all solution for every kind of complex data.

b. Other database languages and other data storage tools are available.

c. A NoSQL solution is more acceptable and suitable for a client’s advanced

internet-centric applications and services.

d. A NoSQL database provides more freedom, horizontal scalability, and flexibility.

3. The emergence of the NoSQL database

Sadalage and Fowler (2012) believe that the strictly structured tables of an RDBMS are no longer a good fit for the in-memory data structures of modern applications or for services with large data needs such as Facebook and Twitter. In addition, cloud-based services (e.g., Amazon S3), dynamically typed languages, and an open-source-driven community have driven the recent emergence of NoSQL DBMSs such as Cassandra, CouchDB, Neo4J, and HBase. The NoSQL database appears as a solution for a client’s advanced Web-based applications and services.

B. Aggregate Data Models

The NoSQL database offers developers and end users a friendly alternative to the traditional relational DBMS. It requires more programming but less database design. On the positive side, it offers a flexible schema or no schema at all, allows quicker and cheaper setup, scales massively (vertically or horizontally), and relaxes data consistency for higher performance and availability. On the negative side, it generally lacks a declarative query language, so more programming is required to obtain the needed information, and because it relaxes data consistency, there are fewer guarantees that the information retrieved is meaningful. In addition, where traditional relational databases struggle with big data, horizontal scaling, complex data formats, and sophisticated manageability, NoSQL databases employ map-reduce computation (Date, 2006). Map-reduce is a programming model that uses a parallel, distributed algorithm to process and generate large data sets on big clusters of servers and processors, with Mappers and Reducers. Note that the outputs of the Mappers and the Reducers can be stored as materialized views in cached memory (Sadalage et al., 2012).
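The map and reduce steps described above can be illustrated on a single machine. The following is a minimal sketch, not a distributed implementation: it assumes a hypothetical OrderLine type and uses a Java parallel stream in place of a cluster of Mappers and Reducers. Each order emits (product, price) pairs, and the prices are then summed per product.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class MapReduceSketch {

    // Hypothetical order line: a product name and the price paid for it.
    static final class OrderLine {
        final String product;
        final double price;

        OrderLine(String product, double price) {
            this.product = product;
            this.price = price;
        }

        String product() { return product; }
        double price() { return price; }
    }

    // Map step: every order (an aggregate) emits its lines as (product, price) pairs.
    // Reduce step: the emitted prices are summed per product key.
    static Map<String, Double> revenueByProduct(List<List<OrderLine>> orders) {
        return orders.parallelStream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(
                        OrderLine::product,
                        TreeMap::new,
                        Collectors.summingDouble(OrderLine::price)));
    }

    public static void main(String[] args) {
        List<List<OrderLine>> orders = Arrays.asList(
                Arrays.asList(new OrderLine("Database Course", 20.0)),
                Arrays.asList(new OrderLine("Database Course", 10.0),
                              new OrderLine("Research Course", 5.0)));
        System.out.println(revenueByProduct(orders));
    }
}
```

In a real NoSQL cluster the map step would run on the nodes that hold the aggregates and only the emitted pairs would travel over the network; the stream pipeline here only mirrors that division of work.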

1. The NoSQL databases’ aggregate data model

In contrast with a traditional relational database, which uses the strict entity-relationship model, NoSQL databases use an aggregate data model. Aggregate data is a complex structured record of nested data. The aggregate, a term taken from Evans (2004), is a collection of related objects treated as a unit of data. The aggregate data model is an aggregate-oriented data model specific to NoSQL solutions. It consists of four model categories: key-value, document, column-family, and graph (Sadalage et al., 2012). NoSQL databases usually use two primary aggregate data models: key-value, or the big hash table (e.g., Amazon S3, Voldemort, Scalaris), and schemaless (e.g., Cassandra, CouchDB, Neo4J).
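To make the idea of an aggregate concrete, the sketch below builds one customer aggregate as nested maps and stores it under a single key. It is an illustration only: the field names follow the customer/order/orderItems example used later in this paper, and a plain HashMap stands in for a key-value store, which is an assumption rather than any product’s API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AggregateSketch {

    // Builds one customer aggregate: the order and its items are nested
    // inside the customer record and travel with it as a unit.
    static Map<String, Object> customerAggregate(String customerId, String product, double price) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("product", product);
        item.put("price", price);

        List<Map<String, Object>> orderItems = new ArrayList<>();
        orderItems.add(item);

        Map<String, Object> order = new LinkedHashMap<>();
        order.put("orderItems", orderItems);

        Map<String, Object> customer = new LinkedHashMap<>();
        customer.put("customerid", customerId);
        customer.put("order", order);
        return customer;
    }

    public static void main(String[] args) {
        // A key-value store sees only an opaque value under a key; a HashMap stands in for it.
        Map<String, Map<String, Object>> store = new HashMap<>();
        store.put("CTU_online", customerAggregate("CTU_online", "Database Course", 2122.00));
        // The whole aggregate is read back in one lookup.
        System.out.println(store.get("CTU_online"));
    }
}
```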

2. Some Pros and Cons of the aggregate data models

These aggregate data models have pros and cons. The pros of the key-value model are that it is very fast, very scalable, simple, and able to distribute horizontally. Its con is that many data structures or objects cannot easily be modeled as key-value pairs. The pros of the schemaless model, on the other hand, are that its data model is richer than key-value pairs, it offers eventual consistency, many implementations are distributed, and it still provides excellent performance and scalability. Its con is that there are no ACID transactions or joins.

C. Schemalessness and Implicit Schema

A central theme of NoSQL databases is that they are schemaless. Schemalessness has a big impact on how a database’s structure changes. Users must control how data is stored so that they can access both old and new data.

1. Main concept of schemalessness in NoSQL databases

A NoSQL database is ignorant of the schema, that is, a defined structure such as tables, columns, and data types for storing data and its attributes. It therefore cannot use a schema to store and retrieve data efficiently, nor does it apply its own validation to the data to ensure that different applications do not manipulate the data inconsistently. However, a schemaless NoSQL database provides freedom and flexibility in data storage (Moniruzzaman and Hossain, 2013). Being schemaless, a NoSQL database allows users to store data casually. In advanced Internet-centric e-commerce services in the digital market, aggregate records legitimately contain nonuniform data, where each record may have a different set of fields. For example, a key-value store allows users to store any data they wish under a key. Users can easily store data and change what they store as they learn more about their project, and they can add new things as they discover them (Pankowski, 2002).

2. Implicit schema in NoSQL database

Because a NoSQL database is schema-free, users who want to access aggregate records or nonuniform data must write programs, such as scripts, that mostly rely on some form of implicit schema. The implicit schema is the set of assumptions about the data’s structure in the code that manipulates the data. A schemaless database thus shifts the schema from the database into the application code that accesses the data, which means that users need to dig into the application code to understand the data and its associated information (Sadalage et al., 2012). If the application code is well structured, users can deduce the implicit schema and find the useful data and its related information; otherwise, they may be stuck on data access. In other words, an implicit schema demands more programming skill but less design experience.
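A short sketch can show where an implicit schema lives. In the hypothetical reader below, nothing in the data store says that an order item has a text product and a numeric price; those assumptions exist only in the code. The class and method names are illustrative, and a plain Map stands in for a stored record.

```java
import java.util.HashMap;
import java.util.Map;

final class ImplicitSchemaSketch {

    // The implicit schema lives here: this code assumes that an order item has
    // a String "product" and a numeric "price". The database checks neither.
    static String describe(Map<String, Object> orderItem) {
        Object product = orderItem.get("product");
        Object price = orderItem.get("price");
        if (!(product instanceof String) || !(price instanceof Number)) {
            throw new IllegalArgumentException(
                    "order item does not match the implicit schema: " + orderItem);
        }
        return product + " @ " + ((Number) price).doubleValue();
    }

    public static void main(String[] args) {
        Map<String, Object> item = new HashMap<>();
        item.put("product", "Database Course");
        item.put("price", 2122.00);
        System.out.println(describe(item));
    }
}
```

Making the assumptions explicit in one place, as the check above does, is one way to keep the implicit schema deducible from well-structured code.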

3. A primary problem of data access with the implicit schema

Because the implicit schema lives in the application code, problems arise when multiple applications, developed by different developers, access the same database. To reduce these problems, users can encapsulate all database interaction within a single application and integrate it with other applications through Web services. Another approach is to delineate different areas of an aggregate for access by different applications.

D. Data Migration in NoSQL databases with implicit schema

In general, data migration is the process of transferring data between storage types, formats, databases, or computer systems. It is a key consideration in system implementation, database integration, upgrade, or consolidation, and it is usually performed programmatically so that the migration is automated (datamigrationpro.com, 2009).

1. Data migration with implicit schema in NoSQL databases

In a NoSQL database, schemalessness provides freedom and flexibility for data migration within an aggregate record. When developing with NoSQL databases, designers do not think about a schema but must consider other aspects, such as how keys are assigned and how the data inside a value object is structured in key-value stores, or the types of relationships in graph databases. Even though there is no fixed schema, the data is stored according to an implicit schema that is defined and contained in the application code. If the application code cannot parse the data from its database, a schema mismatch or data inconsistency occurs (cisco.com). Note that when a migration must access multiple aggregate records or change the aggregate boundaries, data migration with an implicit schema becomes as complex as it is in an RDBMS. It is even more complex when users do not understand the set of assumptions about the data’s structure in the application code that manipulates the aggregate records.

2. Principles of the data migration in NoSQL databases with implicit schema

The data migration process in NoSQL databases is similar to other data migration processes, apart from some minor changes in requirements that stem from the implicit schema. An efficient data migration has several primary phases, including data extraction, data loading, and data verification, and aims to minimize data loss and preserve consistency. Data cleansing is commonly performed to improve data quality. Following these principles (Katzoff, datamigrationpro.com, 2014), data migration in NoSQL databases with an implicit schema may consist of five phases (design, extraction, cleansing, loading, and verification) for applications of moderate to high complexity, so that the migrated data matches the requirements of the implicit schema. Three of the five phases are described below because they are essential:

- Data extraction: the process of retrieving data from homogeneous or heterogeneous, unstructured data sources for further data processing.

- Data loading: the part of the ETL (extract, transform, load) process that loads data into the final target database.

- Data verification: the process of checking different types of data for accuracy and inconsistencies after the data migration is done.

According to Katzoff (2014), an efficient data migration strategy may have ten steps, as shown below:

a. Planning – Identify the baseline and legal original.

b. Analysis and data discovery – Determine if metadata in the sources is sufficient for the target document process.

c. Tool selection

d. Master data management – Harmonize key-value pairs and workflow process.

e. Tool configuration

f. Data cleansing

g. Dry runs

h. Formal testing

i. Production execution

j. Post-production support

After data migration has been performed on a NoSQL database, several testing options help minimize migration errors. The de facto approach to testing data and content migration in a NoSQL database with an implicit schema is to select a random subset of the data and inspect it. Other options are pre-migration testing, formal design review, post-integration testing, user acceptance testing, and production testing.
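The sampling approach to verification can be sketched as follows. This is a minimal illustration under stated assumptions: source and target records are represented as maps from key to serialized value, and the method name, sample size, and seed are hypothetical choices rather than part of any migration tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Random;

final class SampleVerificationSketch {

    // Picks up to sampleSize random keys from the source and counts how many
    // migrated records are missing or differ from their source records.
    // A fixed seed keeps the sample repeatable between test runs.
    static int countMismatches(Map<String, String> source, Map<String, String> target,
                               int sampleSize, long seed) {
        List<String> keys = new ArrayList<>(source.keySet());
        Collections.sort(keys); // stable starting order before shuffling
        Collections.shuffle(keys, new Random(seed));
        int mismatches = 0;
        for (String key : keys.subList(0, Math.min(sampleSize, keys.size()))) {
            if (!Objects.equals(source.get(key), target.get(key))) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("order-1", "{\"price\": 2122.0}");
        source.put("order-2", "{\"price\": 2214.0}");
        Map<String, String> target = new HashMap<>(source);
        target.remove("order-2"); // simulate a record lost during migration
        System.out.println(countMismatches(source, target, 2, 42L));
    }
}
```

A sample only estimates the error rate; a zero count on a small sample does not prove that every record migrated correctly, which is why the other testing options above remain useful.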

3. Example 1 - MongoDB’s data migration

Data migration in a NoSQL database such as MongoDB shows that implicit schema changes do matter when there are deployed applications and existing production data. The example uses a document data store with a data model of customer, order, and orderItems.

A MongoDB document for this data model is shown below:

{
  "_id": "31415926AB47E98374D",
  "customerid": "CTU_online",
  "name": "CS828-1501C-01 Inc",
  "since": "01/04/2015",
  "order": {
    "orderid": "18319888",
    "orderdate": "01/04/2015",
    "orderItems": [{"product": "Database Course", "price": 2122.00}]
  }
}

The application code, which carries the implicit schema, writes this document structure to MongoDB as follows:

BasicDBObject orderItem = new BasicDBObject();
orderItem.put("product", productName);
orderItem.put("price", price);
orderItems.add(orderItem);

Code to read the document back from the MongoDB database is:

BasicDBObject item = (BasicDBObject) orderItem;
String productName = item.getString("product");
Double price = item.getDouble("price");

Changing the objects by adding a preferredShippingType attribute does not require any change in the database, because MongoDB does not care whether different documents follow the same schema. Only the application needs to be redeployed. The code must, however, ensure that documents without the preferredShippingType attribute can still be parsed.
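One way to keep old documents parseable is to fall back to a default when the new attribute is absent. The sketch below is illustrative only: a plain Map stands in for the MongoDB document so that the example runs without the driver, and the default value "standard" is a hypothetical choice, not something the original example specifies.

```java
import java.util.HashMap;
import java.util.Map;

final class OptionalAttributeSketch {

    // Hypothetical default applied to documents written before the attribute existed.
    static final String DEFAULT_SHIPPING_TYPE = "standard";

    // Returns the stored preferredShippingType, or the default for old documents.
    static String preferredShippingType(Map<String, Object> customer) {
        Object value = customer.get("preferredShippingType");
        return value instanceof String ? (String) value : DEFAULT_SHIPPING_TYPE;
    }

    public static void main(String[] args) {
        Map<String, Object> oldCustomer = new HashMap<>();
        oldCustomer.put("customerid", "CTU_online");

        Map<String, Object> newCustomer = new HashMap<>(oldCustomer);
        newCustomer.put("preferredShippingType", "express");

        System.out.println(preferredShippingType(oldCustomer));
        System.out.println(preferredShippingType(newCustomer));
    }
}
```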

Suppose discountedPrice is introduced and price is renamed to fullPrice. A developer renames the price attribute to fullPrice and adds the discountedPrice attribute, so that new documents look like this:

{
  "_id": "261003OPOELALKJDK",
  "customerid": "CTU_offline",
  "name": "RES860-1501C-01 Inc",
  "since": "01/04/2015",
  "order": {
    "orderid": "18319888",
    "orderdate": "03/21/2015",
    "orderItems": [{"product": "Research Course", "fullPrice": 2214.00, "discountedPrice": 2122.00}]
  }
}

Once the change is deployed, new customers and orders can be saved and read back properly. However, the price of the product in existing orders cannot be read, because the code now looks for fullPrice while the existing documents have only the price attribute.

4. Example 2 - Incremental migration

(Source: Chapter 12: Schema Migration from “NoSQL distilled: a brief guide to the

emerging world of polyglot persistence” by Sadalage & Fowler (2012))

Data migration with an implicit schema carries a risk of data loss, schema mismatch, and missing attributes in aggregate records. When the application code changes, the implicit schema changes with it. Consequently, old and new data may not have the same attributes. Before the implicit schema changes, developers can plan an incremental migration to ensure that the new code can still parse the old data. The code below reads an order item from Example 1 that may carry either the price or the fullPrice attribute:

BasicDBObject item = (BasicDBObject) orderItem;
String productName = item.getString("product");
// get() returns null when the attribute is absent, so old and new documents can both be read.
Double price = (Double) item.get("price");
if (price == null) {
    price = (Double) item.get("fullPrice");
}
Double discountedPrice = (Double) item.get("discountedPrice");

When writing the document back, the old attribute price is not saved:

BasicDBObject orderItem = new BasicDBObject();
orderItem.put("product", productName);
orderItem.put("fullPrice", price);
orderItem.put("discountedPrice", discountedPrice);
orderItems.add(orderItem);

With incremental migration, many versions of the object may exist on the application side, each able to translate an old schema into the new schema. When the object is saved back, it is saved in the new form. This gradual migration of data helps the application evolve faster.
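The read-translate-save cycle described above can be sketched without the MongoDB driver. In this minimal illustration a Map stands in for an order item and a second Map for the collection; the class and method names are hypothetical, and the rule that an old item without a discount is sold at full price is an assumption made only for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class IncrementalMigrationSketch {

    // Translates an order item from the old shape (price) to the new shape
    // (fullPrice, discountedPrice). Items already in the new shape keep their values.
    static Map<String, Object> upgrade(Map<String, Object> item) {
        Map<String, Object> upgraded = new LinkedHashMap<>(item);
        // The old attribute is always dropped so that it is not saved back.
        Object oldPrice = upgraded.remove("price");
        if (!upgraded.containsKey("fullPrice") && oldPrice != null) {
            upgraded.put("fullPrice", oldPrice);
        }
        // Assumption: an old item without a discount is sold at full price.
        if (!upgraded.containsKey("discountedPrice")) {
            upgraded.put("discountedPrice", upgraded.get("fullPrice"));
        }
        return upgraded;
    }

    // Lazy migration: an item is upgraded when it is read and saved back in the new shape.
    static Map<String, Object> readAndSaveBack(Map<String, Map<String, Object>> store, String key) {
        Map<String, Object> upgraded = upgrade(store.get(key));
        store.put(key, upgraded);
        return upgraded;
    }

    public static void main(String[] args) {
        Map<String, Object> oldItem = new HashMap<>();
        oldItem.put("product", "Database Course");
        oldItem.put("price", 2122.00);

        Map<String, Map<String, Object>> store = new HashMap<>();
        store.put("18319888", oldItem);
        System.out.println(readAndSaveBack(store, "18319888"));
    }
}
```

Because each item is rewritten only when it is touched, old and new shapes coexist in the store for a while, and the reading code has to accept both until every item has been read and saved at least once.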

Conclusion

This short research paper discussed the concepts of NoSQL databases, which adopt an aggregate data model that supports application-oriented aggregates, embrace schemaless data, run on clusters in a distributed network, and often trade data consistency for other useful properties. It focused on the concepts associated with NoSQL’s schemalessness and emphasized data migration in NoSQL databases with an implicit schema. The discussion, which also covered the general principles of conducting data migration and the test strategy in NoSQL databases, consisted of four main sections: (1) the concepts of NoSQL databases, (2) aggregate data models, (3) schemalessness and implicit schema, and (4) data migration in NoSQL databases with implicit schema. A final question is whether NoSQL databases, with their implicit schemas, will be able to handle Big Data in the data-driven era of the early 21st century.

REFERENCES

1. Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics:

From Big Data to Big Impact. MIS Quarterly, 36(4), 1165-1188.

2. Connolly, T. M., & Begg, C. E. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management. New Jersey, NJ: Pearson.

3. Date, C. J. (2006). The relational database dictionary: A comprehensive glossary of relational terms and concepts, with illustrative examples. O'Reilly Media, Inc. pp. 59–. ISBN 978-1-4493-9115-7.

4. Hargiss, K. (2015). Chat session 9 (Lecture) of NoSQL database. Information retrieved

from presentation slides.

5. McNurlin, B. C., Sprague, R. H., Jr., & Bui, T. (2009). Information Systems Management in Practice (8th ed.). Upper Saddle River: Pearson Prentice Hall.

6. Moniruzzaman, A. B. M., & Hossain, S. A. (2013). Nosql database: New era of databases for big data analytics-classification, characteristics and comparison. arXiv preprint arXiv:1307.0191.

7. Pankowski, T. (2002). PathLog: a Query Language for Schemaless Databases of

Partially Labeled Objects. Fundamenta Informaticae, 49(4), 369.

8. Sadalage, P. J., & Fowler, M. (2012). NoSQL distilled: a brief guide to the emerging

world of polyglot persistence. Pearson Education.

9. http://www.datamigrationpro.com/data-migration-articles/2009/11/30/how-to-implement-an-effective-data-migration-testing-strateg.html.

10. http://en.wikipedia.org/wiki/Data_migration.

11. https://msdn.microsoft.com/en-us/library/ms174467.aspx.

12. http://www.cisco.com/c/en/us/td/docs/security/ise/1-3/migration_guide/b_ise_MigrationGuide/b_ise_MigrationGuide12_chapter_011.html.

13. http://www.computerweekly.com/feature/An-ABC-guide-to-data-migration.

14. http://www.laserfiche.com/support/webhelp/Laserfiche/9.0/en-US/AdminGuide/Content/Basic_Principles_of_the_Migration_Proc.

15. http://www.webopedia.com/TERM/D/data_migration.html.

APPENDIX

CS828 Phase 5 Individual Project: Grade: A Score: 200 pt 3/16/2015 Current Grade Average: A (955/955)

ThienSi... Congratulations on a well written paper used to discuss the general principles of conducting data migration in NoSQL databases. You clearly presented thoughts as how to ensure the data stored in the databases matched with the “Implicit Schema” embedded in the applications when the “Implicit Schema” has experienced a change.... excellent work!

Proficient: The submitted work exceeds the project criteria requirements. It demonstrates a comprehensive understanding of course material and meets the course objectives with proficiency.

Dr. Kathleen Hargiss.
