Upload
elijah
View
32
Download
4
Embed Size (px)
DESCRIPTION
Session id:40364. Handling Large Amounts of Biological Data. Xiaobin Guan, Ph.D. Senior Oracle DBA/Bioinformatician National Institutes of Health. Introduction. Bioinformatics In Silico Large Database DNA Sequence Using CLOB Using Partition Tables. NISC Database Environment. - PowerPoint PPT Presentation
Citation preview
Handling Large Amounts of Handling Large Amounts of Biological DataBiological Data
Xiaobin Guan, Ph.D.Senior Oracle DBA/Bioinformatician
National Institutes of Health
Session id:40364
Introduction
Bioinformatics In Silico Large Database DNA Sequence Using CLOB Using Partition Tables
NISC Database Environment
NIH Intramural Sequencing Center Established in 1997 A multi-disciplinary genomics facility Large-scale DNA sequencing Applied Biosystems (ABI) DNA Analyzers Produce 10,000 DNA sequences per day
NISC Pipeline
The Laboratory Information Management System (LIMS).
Move the sequencing data from each PC to a partition (/area1) on our main Unix Server.
A Perl script is then running to validate the trace name and run folder name, and also check for duplicates. Then, moved to another partition (/area2).
Phred is run on each trace file to get rid of the low quality bases at the beginning and end of each read.
NISC Pipeline
Vector Screening is then performed on each read, and masked out where the vector is.
Contaminant Checking is to use BLAST to screen any contaminants. The information about contamination is then stored in the database.
QC Report is generated to show the quality and other information.
Why CLOB?
To store DNA sequences Combination of ‘ACGT’ character strings The length can be more or less than 4KB
LOBs vs. Long/Long Raw
LONG, LONG RAW
LOBs
Number of LOB columns per table
1 Multiple
LOB Capacity Up to 2 GB Up to 4 GB
Data stored out-of-line No Yes
Object type support No Yes
Random piece-wise access No Yes
A Simple Create Table Statement
CREATE TABLE dna_sequence1
(base_id NUMBER(6),
base_sequence CLOB)
TABLESPACE example;
Specify the Segment Name, and LOB Storage
CREATE TABLE dna_sequence2
(base_id NUMBER(6),
base_sequence CLOB)
LOB (base_sequence) STORE AS
dna_seq_lob
(TABLESPACE lob_seg_ts)
TABLESPACE example;
Specify the Index Name and Index Storage
CREATE TABLE dna_sequence3
(base_id NUMBER(6),
base_sequence CLOB)
LOB (base_sequence) STORE AS
dna_seq_lob1
(TABLESPACE lob_seg_ts
INDEX dna_seq_clob_idx (
TABLESPACE nisc_index))
TABLESPACE example;
Check Segment and Index Name
SELECT table_name, column_name, segment_name, index_name
FROM user_lobs;
TABLE_NAME COLUMN_NAME SEGMENT_NAME INDEX_NAME
--------------- --------------- --------------------------- ------------------------
DNA_SEQUENCE1 BASE_SEQUENCE SYS_LOB0000040338C00002$$ SYS_IL0000040338C00002$$
DNA_SEQUENCE2 BASE_SEQUENCE DNA_SEQ_LOB SYS_IL0000040341C00002$$
DNA_SEQUENCE3 BASE_SEQUENCE DNA_SEQ_LOB1 DNA_SEQ_CLOB_IDX
Query the Table
SELECT * FROM dna_sequence WHERE base_id = 20;
20 actcggtactgggacccatgtggtggatttctatccttgaagctgcacgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaatcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaatctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaactgcctgatagagtccacccgcaatagcatggatgccactaataaagcgatgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatgcgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagctaccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgacatctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgctggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcgccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaaccttcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgcactaacatctacgttaaccgtattaagtctgttgttgataacggttccggcagcctacttcactcgtttactaatgctgctggtgagcaaatcactgttacctgctctctgaactacaacattggtcagattgccctgtcgttctccaaagcgccggataaaagcactgagatcgcaattgagacggaaatcaatattgaagccggctctgagctgatcccgctgatcacca
In-line or Out-of-line Storage
In-line Out-of-line Enable storage in row Disable storage in row Tablespaces
CLOB Usage
Table structure
– This table contains two CLOB columns BASECALLS stores DNA sequences BASEQUALS stores the quality score of each
sequence
– The length of both fields varies between a few hundred to up to 6 thousand characters
Test Protocol
Create tablespaces– Four for 4 tables, and two for LOB storage
Create four test tables– T1, in-line, one tablespace– T2, in-line, two tablespaces– T3, out-of-line, one tablespace– T4, out-of-line, two tablespaces
Test Table 1 (T1)
CREATE TABLE T1 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL1" LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL1" ENABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL1" ENABLE STORAGE IN ROW);
Test Table 2 (T2)
CREATE TABLE T2 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL2" LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL_LOB1" ENABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL_LOB1" ENABLE STORAGE IN ROW);
Test Table 3 (T3)
CREATE TABLE T3 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL3" LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL3" DISABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL3" DISABLE STORAGE IN ROW);
Test Table 4 (T4)
CREATE TABLE T4 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL4" LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL_LOB2" DISABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL_LOB2" DISABLE STORAGE IN ROW);
Results
In-line/out-of-line IN-LINE OUT-OF-LINE
Tablespace usage One TS
Two TS
One TS
Two TS
Table name T1 T2 T3 T4
Initial space used (MB) 6 7(2+5) 6 7(2+5)
Space used after 10000 row insert (MB)
46 47(42+5)
162 163(2+161)
Total insert time (sec) 10 11 47 48
Ranking 1 2 3 4
DBMS_LOB Package
Functions/Procedures to Read or Return LOB Values
Subprogram F/P Description
COMPARE() F Compares the value of two LOBs
GETCHUNKSIZE()
F Gets the chunk size used when reading and writing. This only works on internal LOBs and does not apply to external LOBs (BFILEs).
GETLENGTH() F Gets the length of the LOB value
INSTR() F Returns the matching position of the nth occurrence of the pattern in the LOB
READ() P Reads data from the LOB starting at the specified offset
SUBSTR() F Returns part of the LOB value starting at the specified offset
Functions/Procedures to Write LOB Values
Subprogram F/P Description
APPEND() P Appends the LOB value to another LOB
COPY() P Copies all or part of a LOB to another LOB
ERASE() P Erases part of a LOB, starting at a specified offset
LOADFROMFILE() P Load BFILE data into an internal LOB
LOADCLOBFROMFILE()
P Load character data from a file into a LOB
LOADBLOBFROMFILE()
P Load binary data from a file into a LOB
TRIM() P Trims the LOB value to the specified shorter length
WRITE() P Writes data to the LOB at a specified offset
WRITEAPPEND() P Writes data to the end of the LOB
Functions/Procedures for BFILEs
Subprogram F/P Description
FILECLOSE() P Closes the file. Use CLOSE() instead.
FILECLOSEALL()
P Closes all previously opened files
FILEEXISTS() F Checks if the file exists on the server
FILEGETNAME()
P Gets the directory alias and file name
FILEISOPEN() F Checks if the file was opened using the input BFILE locators. Use ISOPEN() instead.
FILEOPEN() P Opens a file. Use OPEN() instead.
Call Functions in SQL
SELECT dbms_lob.getlength(base_sequence) FROM dna_sequence1DBMS_LOB.GETLENGTH(BASE_SEQUENCE)--------------------------------- 878 1269 893 872 961 807 806 808 833 83710 rows selected.
Call procedures in PL/SQLDECLARE v_dna_seq CLOB; v_seq_amt BINARY_INTEGER :=10; v_seq_buffer VARCHAR2(10);BEGIN v_dna_seq :=
'atctcgagtagctgaagctccaatgntggtggaattcacgagttgctt';
DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1, v_seq_buffer);
DBMS_OUTPUT.PUT_LINE('The first 10 bases for this DNA sequence are: ' || v_seq_buffer);
END;/The first 10 bases for this DNA sequence are:
atctcgagtaPL/SQL procedure successfully completed.
Substr vs. dbms_lob.substr
Substr(the_string, from_character, number_of_characters);
Dbms_lob.substr(the_string, number_of_characters, from_character).
Substr vs. dbms_lob.substrCREATE table substring (str varchar2(20), lob clob);INSERT INTO substring VALUES ('Oracle10G', 'Oracle10G');SELECT substr (str, 7, 3), dbms_lob.substr(lob, 7, 3) lob FROM substring;[email protected]> SUB LOB--- ----------10G acle10G10G acle10GSELECT substr (str, 7, 3), dbms_lob.substr(lob, 3, 7) lob FROM substring;[email protected]> SUB LOB--- ----------10G 10G10G 10G
Lob Usage Limitation
Not in the ORDER BY, or GROUP BY or in an aggregate function.
Not in a SELECT... DISTINCT or SELECT... UNIQUE statement or in a join.
Not in ANALYZE... COMPUTE or ANALYZE... ESTIMATE statements.
Not as a primary key column. Not select a LOB column through dblink. ORA-
22992: cannot use LOB locators selected from remote tables.
Partitioning and Its
Usage Scenarios at NISC
Partition Method
Range Partitioning, introduced in Oracle 8. Hash Partitioning, introduced in 8i. List Partitioning, introduced in 9i release 1. Composite Partitioning. The range-hash
partition was introduced in 8i, and the range-list partition was introduced in 9i release 2.
This is a good example how Oracle adds functionalities to the new release.
Benefit of Partitioning
The amount of time for each operation can be significantly reduced because of the small segment.
Improve query performance. The I/O will be balanced among disks.
Reduce the downtime. Part of the table can be put to read only
mode. Easy to implement.
When to Partition
When table becomes large. 2GB is considered as a general guideline.
When the data is kind of adding on, meaning new data will go to the new partition.
Work with Range Partition
Create table with range partitioning. Convert a non-partition table to a partition
table. Merge/split partition. Tablespace usage with partition. Maintain range partition.
Partitioning Usage Examples
Create tablespace Create table Add partition Drop partition Exchange partition Move partition Merge partition Split partition Truncate partition Rename partition
Create Partitioned Table
CREATE TABLE dna_sequence (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob2 TABLESPACE examplePARTITION BY RANGE (BASE_ID) (partition dna_sequence1 values less than (100)
tablespace dna_sequence_p1, partition dna_sequence2 values less than (200)
tablespace dna_sequence_p2, partition dna_sequence3 values less than (300)
tablespace dna_sequence_p3);
Query the Partitioned Table
SELECT table_name, partition_name, tablespace_name, high_value
FROM user_tab_partitions
ORDER BY partition_name;
TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE
---------------- -------------------- -------------------- ----------
DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100
DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200
DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300
Add Partition
ALTER TABLE dna_sequence
ADD PARTITION dna_sequence4 VALUES LESS THAN (400)
TABLESPACE dna_sequence_p1;
TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE
--------------- ----------------- -------------------- ----------
DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100
DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200
DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300
DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 400
Drop Partition
ALTER TABLE dna_sequence DROP PARTITION dna_sequence4;
Run partition.sql;
TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE
---------------- ------------------- -------------------- ---------
DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100
DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200
DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300
Exchange Partition
CREATE TABLE dna_sep03
AS SELECT *
FROM dna_sequence
WHERE 1=2;
ALTER TABLE dna_sequence
EXCHANGE PARTITION dna_sequence3 WITH TABLE dna_sep03;
Move Partition
ALTER TABLE dna_sequence
MOVE PARTITION dna_sequence4 TABLESPACE dna_sequence_p2 NOLOGGING;
Split Partition
ALTER TABLE dna_sequence SPLIT PARTITION dna_sequence4 AT (350) INTO ( PARTITION dna_sequence4 TABLESPACE dna_sequence_p1, PARTITION dna_sequence5 TABLESPACE dna_sequence_p2) PARALLEL ( DEGREE 5 );
TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE----------------- -------------------- -------------------- ----------DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 350DNA_SEQUENCE DNA_SEQUENCE5 DNA_SEQUENCE_P2 400
Truncate Partition
ALTER TABLE dna_sequence
TRUNCATE PARTITION dna_sequence4 DROP STORAGE;
Rename Partition/Table
Rename partition– ALTER TABLE dna_sequence
RENAME PARTITION dna_sequence4 TO dna_sequence5;
Rename table– ALTER TABLE dna_sequence
RENAME TO dna_seq;– RENAME dna_seq TO dna_sequence;
Conclusion
By proper use of the Oracle features such as CLOB, and partitioning table, it becomes a lot easier to manage the database containing large amounts of biological data.
Major Benefits using CLOB and Partitioning at NISC
Space Savings: Proper use of CLOB Better performance: Put big tables into
smaller segments Better Maintenance: Easier backup and
recovery; Less down time
AQ&Q U E S T I O N SQ U E S T I O N S
A N S W E R SA N S W E R S
Reminder – please complete the OracleWorld online session survey
Thank you.
Xiaobin Guan, Ph.D.NISC/[email protected]