Handling Large Amounts of Biological Data



Page 1: Handling Large Amounts of Biological Data
Page 2: Handling Large Amounts of Biological Data

Handling Large Amounts of Biological Data

Xiaobin Guan, Ph.D.
Senior Oracle DBA/Bioinformatician

National Institutes of Health

Session id:40364

Page 3: Handling Large Amounts of Biological Data

Introduction

Bioinformatics
In Silico
Large Database
DNA Sequence
Using CLOB
Using Partition Tables

Page 4: Handling Large Amounts of Biological Data

NISC Database Environment

NIH Intramural Sequencing Center
Established in 1997
A multi-disciplinary genomics facility
Large-scale DNA sequencing
Applied Biosystems (ABI) DNA Analyzers
Produce 10,000 DNA sequences per day

Page 5: Handling Large Amounts of Biological Data

NISC Pipeline

The Laboratory Information Management System (LIMS).

Move the sequencing data from each PC to a partition (/area1) on our main Unix Server.

A Perl script then validates the trace name and run folder name, and checks for duplicates. The data is then moved to another partition (/area2).

Phred is run on each trace file to get rid of the low quality bases at the beginning and end of each read.
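The trimming step can be pictured with a minimal Python sketch. It is not Phred's actual algorithm: it simply drops bases from each end of a read until it reaches one whose quality score meets a cutoff. The function name `trim_low_quality`, the cutoff of 20, and the sample read are assumptions made for illustration.

```python
def trim_low_quality(bases, quals, min_qual=20):
    """Trim low-quality bases from both ends of a read.

    bases: string of base calls; quals: list of per-base quality scores.
    min_qual is a hypothetical cutoff, not a value taken from the slides.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start = 0
    end = len(bases)
    # Advance past leading bases below the cutoff.
    while start < end and quals[start] < min_qual:
        start += 1
    # Retreat past trailing bases below the cutoff.
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return bases[start:end], quals[start:end]


read = "nnacgtacgtnn"
scores = [4, 6, 30, 35, 40, 38, 12, 33, 36, 31, 8, 5]
trimmed_bases, trimmed_quals = trim_low_quality(read, scores)
print(trimmed_bases)
```

Only the ends are trimmed; a low-scoring base in the middle of the read (the score of 12 here) is kept.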

Page 6: Handling Large Amounts of Biological Data

NISC Pipeline

Vector screening is then performed on each read, and the regions where vector sequence is found are masked out.

Contaminant checking uses BLAST to screen for contaminants. The information about contamination is then stored in the database.

A QC report is generated to show the quality and other information.
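As a rough illustration of what masking means in the vector screening step, the sketch below replaces every exact occurrence of a vector fragment with a mask character. Real vector screening relies on sequence alignment rather than exact string matching, so this is a simplification; `mask_vector`, the mask character `x`, and the sample sequences are hypothetical.

```python
def mask_vector(read, vector, mask_char="x"):
    """Replace every exact occurrence of `vector` in `read` with mask_char."""
    if not vector:
        return read
    masked = list(read)
    pos = read.find(vector)
    while pos != -1:
        # Overwrite the matched region in place so the read keeps its length.
        for i in range(pos, pos + len(vector)):
            masked[i] = mask_char
        pos = read.find(vector, pos + 1)
    return "".join(masked)


read = "ggatccacgtacgtggatcc"
masked_read = mask_vector(read, "ggatcc")
print(masked_read)
```

Masking rather than deleting keeps base positions aligned with the per-base quality scores.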

Page 7: Handling Large Amounts of Biological Data

Why CLOB?

To store DNA sequences
Combination of ‘ACGT’ character strings
The length can be more or less than 4KB

Page 8: Handling Large Amounts of Biological Data

LOBs vs. Long/Long Raw

                                  LONG, LONG RAW   LOBs
Number of LOB columns per table   1                Multiple
LOB capacity                      Up to 2 GB       Up to 4 GB
Data stored out-of-line           No               Yes
Object type support               No               Yes
Random piece-wise access          No               Yes

Page 9: Handling Large Amounts of Biological Data

A Simple Create Table Statement

CREATE TABLE dna_sequence1

(base_id NUMBER(6),

base_sequence CLOB)

TABLESPACE example;

Page 10: Handling Large Amounts of Biological Data

Specify the Segment Name, and LOB Storage

CREATE TABLE dna_sequence2

(base_id NUMBER(6),

base_sequence CLOB)

LOB (base_sequence) STORE AS

dna_seq_lob

(TABLESPACE lob_seg_ts)

TABLESPACE example;

Page 11: Handling Large Amounts of Biological Data

Specify the Index Name and Index Storage

CREATE TABLE dna_sequence3

(base_id NUMBER(6),

base_sequence CLOB)

LOB (base_sequence) STORE AS

dna_seq_lob1

(TABLESPACE lob_seg_ts

INDEX dna_seq_clob_idx (

TABLESPACE nisc_index))

TABLESPACE example;

Page 12: Handling Large Amounts of Biological Data

Check Segment and Index Name

SELECT table_name, column_name, segment_name, index_name

FROM user_lobs;

TABLE_NAME COLUMN_NAME SEGMENT_NAME INDEX_NAME

--------------- --------------- --------------------------- ------------------------

DNA_SEQUENCE1 BASE_SEQUENCE SYS_LOB0000040338C00002$$ SYS_IL0000040338C00002$$

DNA_SEQUENCE2 BASE_SEQUENCE DNA_SEQ_LOB SYS_IL0000040341C00002$$

DNA_SEQUENCE3 BASE_SEQUENCE DNA_SEQ_LOB1 DNA_SEQ_CLOB_IDX

Page 13: Handling Large Amounts of Biological Data

Query the Table

SELECT * FROM dna_sequence WHERE base_id = 20;

20 actcggtactgggacccatgtggtggatttctatccttgaagctgcacgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaatcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaatctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaactgcctgatagagtccacccgcaatagcatggatgccactaataaagcgatgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatgcgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagctaccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgacatctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgctggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcgccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaaccttcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgcactaacatctacgttaaccgtattaagtctgttgttgataacggttccggcagcctacttcactcgtttactaatgctgctggtgagcaaatcactgttacctgctctctgaactacaacattggtcagattgccctgtcgttctccaaagcgccggataaaagcactgagatcgcaattgagacggaaatcaatattgaagccggctctgagctgatcccgctgatcacca

Page 14: Handling Large Amounts of Biological Data

In-line or Out-of-line Storage

In-line: ENABLE STORAGE IN ROW
Out-of-line: DISABLE STORAGE IN ROW
Tablespaces

Page 15: Handling Large Amounts of Biological Data

CLOB Usage

Table structure

– This table contains two CLOB columns
  BASECALLS stores DNA sequences
  BASEQUALS stores the quality score of each sequence

– The length of both fields varies from a few hundred to about 6,000 characters

Page 16: Handling Large Amounts of Biological Data

Test Protocol

Create tablespaces
– Four for the 4 tables, and two for LOB storage

Create four test tables
– T1, in-line, one tablespace
– T2, in-line, two tablespaces
– T3, out-of-line, one tablespace
– T4, out-of-line, two tablespaces

Page 17: Handling Large Amounts of Biological Data

Test Table 1 (T1)

CREATE TABLE T1 (
  CALL_ID   NUMBER(10) NOT NULL,
  TRACE_ID  NUMBER(10) NOT NULL,
  BASECALLS CLOB NOT NULL,
  BASEQUALS CLOB)
TABLESPACE "TEST_CALL1"
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW);

Page 18: Handling Large Amounts of Biological Data

Test Table 2 (T2)

CREATE TABLE T2 (
  CALL_ID   NUMBER(10) NOT NULL,
  TRACE_ID  NUMBER(10) NOT NULL,
  BASECALLS CLOB NOT NULL,
  BASEQUALS CLOB)
TABLESPACE "TEST_CALL2"
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW);

Page 19: Handling Large Amounts of Biological Data

Test Table 3 (T3)

CREATE TABLE T3 (
  CALL_ID   NUMBER(10) NOT NULL,
  TRACE_ID  NUMBER(10) NOT NULL,
  BASECALLS CLOB NOT NULL,
  BASEQUALS CLOB)
TABLESPACE "TEST_CALL3"
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW);

Page 20: Handling Large Amounts of Biological Data

Test Table 4 (T4)

CREATE TABLE T4 (
  CALL_ID   NUMBER(10) NOT NULL,
  TRACE_ID  NUMBER(10) NOT NULL,
  BASECALLS CLOB NOT NULL,
  BASEQUALS CLOB)
TABLESPACE "TEST_CALL4"
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW);

Page 21: Handling Large Amounts of Biological Data

Results

In-line/out-of-line                      IN-LINE              OUT-OF-LINE
Tablespace usage                         One TS   Two TS      One TS   Two TS
Table name                               T1       T2          T3       T4
Initial space used (MB)                  6        7 (2+5)     6        7 (2+5)
Space used after 10000 row insert (MB)   46       47 (42+5)   162      163 (2+161)
Total insert time (sec)                  10       11          47       48
Ranking                                  1        2           3        4
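The figures in the results table can be turned into per-row numbers with a little arithmetic. The sketch below uses only the values reported on the slide; the derived quantities (growth, KB per row, rows per second) and the convention 1 MB = 1024 KB are additions for illustration.

```python
rows_inserted = 10_000

# Values copied from the results table (space in MB, time in seconds).
results = {
    "T1": {"initial_mb": 6, "final_mb": 46, "seconds": 10},
    "T2": {"initial_mb": 7, "final_mb": 47, "seconds": 11},
    "T3": {"initial_mb": 6, "final_mb": 162, "seconds": 47},
    "T4": {"initial_mb": 7, "final_mb": 163, "seconds": 48},
}


def summarize(r):
    """Return (growth in MB, KB added per row, rows inserted per second)."""
    growth_mb = r["final_mb"] - r["initial_mb"]
    kb_per_row = growth_mb * 1024 / rows_inserted
    rows_per_sec = rows_inserted / r["seconds"]
    return growth_mb, kb_per_row, rows_per_sec


summary = {name: summarize(r) for name, r in results.items()}
for name, (growth_mb, kb_per_row, rows_per_sec) in summary.items():
    print(f"{name}: +{growth_mb} MB, {kb_per_row:.1f} KB/row, {rows_per_sec:.0f} rows/s")
```

The in-line tables grow by 40 MB and the out-of-line tables by 156 MB, roughly 3.9 times as much, while the insert takes about 4.7 times longer. The out-of-line growth of about 16 KB per row would be consistent with each of the two LOBs occupying at least one 8 KB chunk, but the slides do not state the chunk size, so that reading is an assumption.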

Page 22: Handling Large Amounts of Biological Data

DBMS_LOB Package

Page 23: Handling Large Amounts of Biological Data

Functions/Procedures to Read or Return LOB Values

Subprogram       F/P   Description
COMPARE()        F     Compares the value of two LOBs
GETCHUNKSIZE()   F     Gets the chunk size used when reading and writing. This only works on internal LOBs and does not apply to external LOBs (BFILEs).
GETLENGTH()      F     Gets the length of the LOB value
INSTR()          F     Returns the matching position of the nth occurrence of the pattern in the LOB
READ()           P     Reads data from the LOB starting at the specified offset
SUBSTR()         F     Returns part of the LOB value starting at the specified offset
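To make the "nth occurrence" wording for INSTR() concrete, here is a small Python emulation of its positional behaviour, following the DBMS_LOB.INSTR(lob_loc, pattern, offset, nth) argument order with 1-based positions and 0 for "not found". It is a sketch only: `lob_instr` is a hypothetical helper, invalid arguments simply return 0 here, and the handling of overlapping matches is an assumption.

```python
def lob_instr(lob, pattern, offset=1, nth=1):
    """Return the 1-based position of the nth match of pattern, or 0 if absent."""
    if offset < 1 or nth < 1 or not pattern:
        return 0  # simplification: this sketch does not model NULL results
    start = offset - 1  # convert the 1-based offset to a Python index
    found = -1
    for _ in range(nth):
        found = lob.find(pattern, start)
        if found == -1:
            return 0
        # Assumption: the next search resumes one character after this match starts.
        start = found + 1
    return found + 1


sequence = "acgtacgt"
print(lob_instr(sequence, "acg"))        # first occurrence
print(lob_instr(sequence, "acg", 1, 2))  # second occurrence
```

A typical use on sequence data would be locating a short motif, such as a restriction site, inside a stored read.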

Page 24: Handling Large Amounts of Biological Data

Functions/Procedures to Write LOB Values

Subprogram           F/P   Description
APPEND()             P     Appends the LOB value to another LOB
COPY()               P     Copies all or part of a LOB to another LOB
ERASE()              P     Erases part of a LOB, starting at a specified offset
LOADFROMFILE()       P     Loads BFILE data into an internal LOB
LOADCLOBFROMFILE()   P     Loads character data from a file into a LOB
LOADBLOBFROMFILE()   P     Loads binary data from a file into a LOB
TRIM()               P     Trims the LOB value to the specified shorter length
WRITE()              P     Writes data to the LOB at a specified offset
WRITEAPPEND()        P     Writes data to the end of the LOB

Page 25: Handling Large Amounts of Biological Data

Functions/Procedures for BFILEs

Subprogram       F/P   Description
FILECLOSE()      P     Closes the file. Use CLOSE() instead.
FILECLOSEALL()   P     Closes all previously opened files
FILEEXISTS()     F     Checks if the file exists on the server
FILEGETNAME()    P     Gets the directory alias and file name
FILEISOPEN()     F     Checks if the file was opened using the input BFILE locators. Use ISOPEN() instead.
FILEOPEN()       P     Opens a file. Use OPEN() instead.

Page 26: Handling Large Amounts of Biological Data

Call Functions in SQL

SELECT dbms_lob.getlength(base_sequence) FROM dna_sequence1;

DBMS_LOB.GETLENGTH(BASE_SEQUENCE)
---------------------------------
                              878
                             1269
                              893
                              872
                              961
                              807
                              806
                              808
                              833
                              837

10 rows selected.

Page 27: Handling Large Amounts of Biological Data

Call procedures in PL/SQL

DECLARE
  v_dna_seq    CLOB;
  v_seq_amt    BINARY_INTEGER := 10;
  v_seq_buffer VARCHAR2(10);
BEGIN
  v_dna_seq := 'atctcgagtagctgaagctccaatgntggtggaattcacgagttgctt';
  DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1, v_seq_buffer);
  DBMS_OUTPUT.PUT_LINE('The first 10 bases for this DNA sequence are: ' || v_seq_buffer);
END;
/

The first 10 bases for this DNA sequence are: atctcgagta

PL/SQL procedure successfully completed.

Page 28: Handling Large Amounts of Biological Data

Substr vs. dbms_lob.substr

Substr(the_string, from_character, number_of_characters);

Dbms_lob.substr(the_string, number_of_characters, from_character).

Page 29: Handling Large Amounts of Biological Data

Substr vs. dbms_lob.substr

CREATE TABLE substring (str VARCHAR2(20), lob CLOB);
INSERT INTO substring VALUES ('Oracle10G', 'Oracle10G');

SELECT substr (str, 7, 3), dbms_lob.substr(lob, 7, 3) lob FROM substring;

SUB LOB
--- ----------
10G acle10G
10G acle10G

SELECT substr (str, 7, 3), dbms_lob.substr(lob, 3, 7) lob FROM substring;

SUB LOB
--- ----------
10G 10G
10G 10G
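The swapped argument order is easy to get wrong, so here is a small Python emulation of the two calls using 1-based positions. The helper names `sql_substr` and `dbms_lob_substr` are hypothetical; they mimic only the slicing behaviour shown in the queries, not the full semantics of the Oracle functions.

```python
def sql_substr(the_string, from_character, number_of_characters):
    """Mimic SUBSTR(string, start, length) with a 1-based start."""
    begin = from_character - 1
    return the_string[begin:begin + number_of_characters]


def dbms_lob_substr(the_string, number_of_characters, from_character=1):
    """Mimic DBMS_LOB.SUBSTR(lob, amount, offset): amount comes BEFORE offset."""
    begin = from_character - 1
    return the_string[begin:begin + number_of_characters]


value = "Oracle10G"
print(sql_substr(value, 7, 3))       # 3 characters starting at position 7
print(dbms_lob_substr(value, 7, 3))  # 7 characters starting at position 3
print(dbms_lob_substr(value, 3, 7))  # 3 characters starting at position 7
```

Passing (7, 3) to the LOB version asks for 7 characters from position 3, which is why the first query returns `acle10G` instead of `10G`.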

Page 30: Handling Large Amounts of Biological Data

Lob Usage Limitation

A LOB column cannot be used:

In an ORDER BY or GROUP BY clause, or in an aggregate function.

In a SELECT... DISTINCT or SELECT... UNIQUE statement, or in a join.

In ANALYZE... COMPUTE or ANALYZE... ESTIMATE statements.

As a primary key column.

In a SELECT through a database link (ORA-22992: cannot use LOB locators selected from remote tables).

Page 31: Handling Large Amounts of Biological Data

Partitioning and Its Usage Scenarios at NISC

Page 32: Handling Large Amounts of Biological Data

Partition Method

Range Partitioning, introduced in Oracle 8.
Hash Partitioning, introduced in 8i.
List Partitioning, introduced in 9i release 1.
Composite Partitioning. The range-hash partition was introduced in 8i, and the range-list partition was introduced in 9i release 2.

This is a good example of how Oracle adds functionality with each new release.

Page 33: Handling Large Amounts of Biological Data

Benefit of Partitioning

The time for each operation can be significantly reduced because each segment is smaller.
Improve query performance. The I/O will be balanced among disks.
Reduce downtime.
Part of the table can be put into read-only mode.
Easy to implement.

Page 34: Handling Large Amounts of Biological Data

When to Partition

When the table becomes large. 2 GB is considered a general guideline.

When data is mostly appended, meaning new data goes into a new partition.

Page 35: Handling Large Amounts of Biological Data

Work with Range Partition

Create table with range partitioning.
Convert a non-partitioned table to a partitioned table.
Merge/split partition.
Tablespace usage with partition.
Maintain range partition.

Page 36: Handling Large Amounts of Biological Data

Partitioning Usage Examples

Create tablespace
Create table
Add partition
Drop partition
Exchange partition
Move partition
Merge partition
Split partition
Truncate partition
Rename partition

Page 37: Handling Large Amounts of Biological Data

Create Partitioned Table

CREATE TABLE dna_sequence (
  base_id NUMBER(6),
  base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob2
TABLESPACE example
PARTITION BY RANGE (BASE_ID) (
  partition dna_sequence1 values less than (100) tablespace dna_sequence_p1,
  partition dna_sequence2 values less than (200) tablespace dna_sequence_p2,
  partition dna_sequence3 values less than (300) tablespace dna_sequence_p3);
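How a row finds its partition under VALUES LESS THAN can be sketched in Python with a binary search over the upper bounds. The bounds and partition names come from the statement above; the function `partition_for` and the use of `ValueError` are illustrative assumptions.

```python
from bisect import bisect_right

# Upper bounds and names taken from the CREATE TABLE statement above.
UPPER_BOUNDS = [100, 200, 300]
PARTITIONS = ["dna_sequence1", "dna_sequence2", "dna_sequence3"]


def partition_for(base_id):
    """Return the first partition whose exclusive upper bound is above base_id."""
    idx = bisect_right(UPPER_BOUNDS, base_id)
    if idx == len(UPPER_BOUNDS):
        # No partition covers this key; Oracle rejects such an insert with ORA-14400.
        raise ValueError(f"base_id {base_id} does not map to any partition")
    return PARTITIONS[idx]


for key in (20, 99, 100, 299):
    print(key, partition_for(key))
```

Because the bound is exclusive, a base_id of 100 lands in dna_sequence2, and a base_id of 300 or more has no home until another partition is added.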

Page 38: Handling Large Amounts of Biological Data

Query the Partitioned Table

SELECT table_name, partition_name, tablespace_name, high_value

FROM user_tab_partitions

ORDER BY partition_name;

TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE

---------------- -------------------- -------------------- ----------

DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100

DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200

DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300

Page 39: Handling Large Amounts of Biological Data

Add Partition

ALTER TABLE dna_sequence

ADD PARTITION dna_sequence4 VALUES LESS THAN (400)

TABLESPACE dna_sequence_p1;

TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE

--------------- ----------------- -------------------- ----------

DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100

DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200

DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300

DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 400

Page 40: Handling Large Amounts of Biological Data

Drop Partition

ALTER TABLE dna_sequence DROP PARTITION dna_sequence4;

Run partition.sql;

TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE

---------------- ------------------- -------------------- ---------

DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100

DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200

DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300

Page 41: Handling Large Amounts of Biological Data

Exchange Partition

CREATE TABLE dna_sep03

AS SELECT *

FROM dna_sequence

WHERE 1=2;

ALTER TABLE dna_sequence

EXCHANGE PARTITION dna_sequence3 WITH TABLE dna_sep03;

Page 42: Handling Large Amounts of Biological Data

Move Partition

ALTER TABLE dna_sequence

MOVE PARTITION dna_sequence4 TABLESPACE dna_sequence_p2 NOLOGGING;

Page 43: Handling Large Amounts of Biological Data

Split Partition

ALTER TABLE dna_sequence
SPLIT PARTITION dna_sequence4 AT (350)
INTO (PARTITION dna_sequence4 TABLESPACE dna_sequence_p1,
      PARTITION dna_sequence5 TABLESPACE dna_sequence_p2)
PARALLEL ( DEGREE 5 );

TABLE_NAME        PARTITION_NAME       TABLESPACE_NAME      HIGH_VALUE
----------------- -------------------- -------------------- ----------
DNA_SEQUENCE      DNA_SEQUENCE1        DNA_SEQUENCE_P1      100
DNA_SEQUENCE      DNA_SEQUENCE2        DNA_SEQUENCE_P2      200
DNA_SEQUENCE      DNA_SEQUENCE3        DNA_SEQUENCE_P3      300
DNA_SEQUENCE      DNA_SEQUENCE4        DNA_SEQUENCE_P1      350
DNA_SEQUENCE      DNA_SEQUENCE5        DNA_SEQUENCE_P2      400
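The effect of the split on the partition bounds can be modelled in a few lines of Python. This sketch tracks only names and upper bounds, not tablespaces or data movement; `split_partition` is a hypothetical helper, and the starting list assumes the four-partition layout that exists after the Add Partition step.

```python
def split_partition(partitions, name, at, low_name, high_name):
    """Split one range partition at `at`.

    partitions: ordered list of (name, exclusive_upper_bound) tuples.
    Returns a new list in which `name` is replaced by two partitions.
    """
    result = []
    lower = None  # upper bound of the preceding partition, if any
    found = False
    for part_name, upper in partitions:
        if part_name == name:
            if (lower is not None and at <= lower) or at >= upper:
                raise ValueError("split point must fall inside the partition's range")
            result.append((low_name, at))      # keys below the split point
            result.append((high_name, upper))  # keys from the split point up
            found = True
        else:
            result.append((part_name, upper))
        lower = upper
    if not found:
        raise KeyError(name)
    return result


before = [
    ("dna_sequence1", 100),
    ("dna_sequence2", 200),
    ("dna_sequence3", 300),
    ("dna_sequence4", 400),
]
after = split_partition(before, "dna_sequence4", 350, "dna_sequence4", "dna_sequence5")
for part_name, upper in after:
    print(part_name, upper)
```

The resulting bounds (100, 200, 300, 350, 400) match the HIGH_VALUE column in the listing above.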

Page 44: Handling Large Amounts of Biological Data

Truncate Partition

ALTER TABLE dna_sequence

TRUNCATE PARTITION dna_sequence4 DROP STORAGE;

Page 45: Handling Large Amounts of Biological Data

Rename Partition/Table

Rename partition
– ALTER TABLE dna_sequence RENAME PARTITION dna_sequence4 TO dna_sequence5;

Rename table
– ALTER TABLE dna_sequence RENAME TO dna_seq;
– RENAME dna_seq TO dna_sequence;

Page 46: Handling Large Amounts of Biological Data

Conclusion

With proper use of Oracle features such as CLOBs and partitioned tables, it becomes much easier to manage a database containing large amounts of biological data.

Page 47: Handling Large Amounts of Biological Data

Major Benefits using CLOB and Partitioning at NISC

Space savings: proper use of CLOB
Better performance: put big tables into smaller segments
Better maintenance: easier backup and recovery; less downtime

Page 48: Handling Large Amounts of Biological Data

Questions & Answers

Page 49: Handling Large Amounts of Biological Data

Reminder – please complete the OracleWorld online session survey

Thank you.

Xiaobin Guan, Ph.D.
NISC/[email protected]