Getting data: The first step in data analysis

  • Slide 1
  • The first step in data analysis
  • Slide 2
  • Learning objective: use SAS/BASE to connect to third-party relational database software and extract the data needed for program evaluation, research using administrative data, and operational reports (e.g. routine surveillance). Outline: (1) What is a relational database? (2) Contact your DBA to learn how to connect to your database(s). (3) How to write queries using PROC SQL. SHRUG, 2014-05-02
  • Slide 3
  • What is a relational database? A set of tables, each made up of rows and columns. Trade names of relational databases (RDB): Oracle, Teradata, SQL Server, DB2, Access. An RDB is software designed to retain large amounts of data, either as a transactional DB or as a reporting/warehousing DB.
  • Slide 4
  • What is a relational database? A transactional DB is designed to increase speed for front-end users and has complex table and table-join structures. A warehousing DB is designed for efficient storage and retrieval for reporting and has simpler table designs and table-join structures. Queries for either design use the same syntax (code); queries for warehouses will be simpler to write.
  • Slide 5
  • What is a relational database? Why use relational databases? Relational databases use a concept called normalization. Normalization reduces the amount of redundant data and allows data to be updated with less error. There are degrees of normalization: first degree, second degree, third degree, and higher degrees.
  • Slide 6
  • First degree normalization: each row pertains to a single entity (a patient, an encounter, a physician); each column pertains to a characteristic of the entity (e.g. date of birth, sex, date of encounter, etc.).
      Table 1: Subjects with demographic information
      ID    FirstName  Gender  BirthCity  BirthCountry
      0001  John       M       Moncton    Canada
      0002  Devbani    F       Kolkata    India
  • Slide 7
  • Violation of first degree normalization
      Table 1: Subjects with improper 1NF
      SubjID  FirstName  Gender  BirthCity    BirthCountry
      0001    John       43      Moncton      New Brunswick
      0002    Raha       F       West Bengal  India
    What impact does violating first degree normalization have on your query if you want all patients born in Canada? If you want all male patients?
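A minimal sketch of the two questions on this slide, using the improper table as toy data. The dataset name `work.subjects_bad` and the column lengths are assumptions made for illustration; only the rows come from the slide.

```sas
/* Toy copy of the improper Table 1 (dataset name is hypothetical) */
data work.subjects_bad;
  length SubjID $4 FirstName $10 Gender $2 BirthCity $12 BirthCountry $15;
  infile datalines dlm=',';
  input SubjID $ FirstName $ Gender $ BirthCity $ BirthCountry $;
  datalines;
0001,John,43,Moncton,New Brunswick
0002,Raha,F,West Bengal,India
;
run;

proc sql;
  /* John was born in Canada, but his country column holds a province */
  select SubjID, FirstName
  from work.subjects_bad
  where BirthCountry = 'Canada';

  /* John is male, but his gender column holds 43 */
  select SubjID, FirstName
  from work.subjects_bad
  where Gender = 'M';
quit;
```

Neither query finds John: when a column holds values that do not describe the characteristic it is named for, a straightforward WHERE clause silently misses rows.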
  • Slide 8
  • Second degree normalization: Table 2 has employer information about the rows in Table 1.
      Table 2: Business addresses
      Name     City     Prov  PostalCode
      John     Halifax  NS    B3K 6R8
      Devbani  Halifax  NS    B3H 2Y9
    The table above has some redundant information: the name is repeated from Table 1, and the province is embedded in the postal code. A better design uses two or even three tables.
  • Slide 9
  • Second degree normalization
      Table 2: Revised with 2NF
      SubjID  PostalCode
      0001    B3K 6R8
      0002    B3H 2Y9
      Table 3: Creating a secondary table for 2NF
      PostalCode  City     Prov
      B3K 6R8     Halifax  NS
      B3H 2Y9     Halifax  NS
  • Slide 10
  • Second degree normalization: Table 2 no longer contains the name; it is replaced with the subject ID. To get the subject's name, we link Table 2 to the table in the first example using the SUBJID/ID column. We get the province and city by linking Tables 2 and 3 using the POSTALCODE column. SUBJID is a primary key in Tables 1 and 2. POSTALCODE is a foreign key in Table 2, but a primary key in Table 3.
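The linking described above can be sketched with PROC SQL on toy copies of the three tables. The dataset names (`work.subjects`, `work.business`, `work.postal`) are assumptions for illustration; the rows are the ones shown on the slides.

```sas
data work.subjects;   /* Table 1, key and name only */
  length ID $4 FirstName $10;
  infile datalines dlm=',';
  input ID $ FirstName $;
  datalines;
0001,John
0002,Devbani
;
run;

data work.business;   /* Table 2, revised with 2NF */
  length SubjID $4 PostalCode $7;
  infile datalines dlm=',';
  input SubjID $ PostalCode $;
  datalines;
0001,B3K 6R8
0002,B3H 2Y9
;
run;

data work.postal;     /* Table 3 */
  length PostalCode $7 City $10 Prov $2;
  infile datalines dlm=',';
  input PostalCode $ City $ Prov $;
  datalines;
B3K 6R8,Halifax,NS
B3H 2Y9,Halifax,NS
;
run;

proc sql;
  /* name comes from Table 1, city and province from Table 3 */
  select s.ID, s.FirstName, p.City, p.Prov, b.PostalCode
  from work.subjects as s
       inner join work.business as b on s.ID = b.SubjID
       inner join work.postal   as p on b.PostalCode = p.PostalCode;
quit;
```

The result rebuilds the original wide business-address table without storing the name or the city and province more than once.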
  • Slide 11
  • Primary/Foreign Keys. Primary key: a column or combination of columns that uniquely identifies each row in the table; e.g. a patient medical record needs at least 3 columns to identify a unique record: patient ID, date of encounter, and provider ID. Foreign key: a column or combination of columns that is used to link data between two tables.
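One way to check whether a combination of columns really behaves as a primary key is to count rows per key value. This is a sketch on invented encounter data; the dataset and column names are hypothetical.

```sas
data work.encounters;   /* invented rows for illustration */
  length PatientID $4 ProviderID $4;
  infile datalines dlm=',';
  input PatientID $ EncounterDate :yymmdd10. ProviderID $;
  format EncounterDate yymmdd10.;
  datalines;
0001,2014-01-15,P001
0001,2014-01-15,P002
0001,2014-02-20,P001
0002,2014-01-15,P001
;
run;

proc sql;
  /* any row returned means the three columns do not identify a unique record */
  select PatientID, EncounterDate, ProviderID, count(*) as n_rows
  from work.encounters
  group by PatientID, EncounterDate, ProviderID
  having count(*) > 1;
quit;
```

With these rows the query returns nothing, so the three-column key is unique. Dropping ProviderID from the SELECT and GROUP BY lists would flag patient 0001 on 2014-01-15, which is why patient ID and date alone are not enough here.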
  • Slide 12
  • Questions about 2NF? Can you see the advantage of splitting the data into different tables? Share examples of your data where normalization is used. Higher degrees of normalization work similarly to the examples above: you have to go through more tables to link to the data that you need.
  • Slide 13
  • Getting access to data: what do you need from the DBA? Explain to the DBA that you need to query data but have no need to write to the database; this helps them determine where you belong on a user matrix. The DBA or IT installs the necessary software on your machine. Google has lots of information on SAS Connect; see also the SAS Connect documentation.
  • Slide 14
  • How SAS authenticates: the user name is provided by DBA/IT. In this example the password is held in the macro variable DBPASS. The %put statement has Oracle print any messages to the SAS log. This is an example of pass-through code.
      proc sql;
        connect to oracle (user = password="&dbpass" path = prod);
        %put &sqlxmsg;
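For context, a complete pass-through step might look like the sketch below. It assumes SAS/ACCESS Interface to Oracle is available and that macro variables `DBUSER` and `DBPASS` were set earlier; `DBUSER`, the output dataset name, and the inner query are illustrative assumptions, while the alias `myconn` and the table `csprod.participant` appear elsewhere in this deck.

```sas
/* Assumes SAS/ACCESS Interface to Oracle; DBUSER is a hypothetical macro variable */
proc sql;
  connect to oracle as myconn (user=&dbuser password="&dbpass" path=prod);
  %put &sqlxmsg;   /* write any message returned by Oracle to the SAS log */

  /* the text inside the parentheses is sent to Oracle and must be Oracle SQL */
  create table work.sex_counts as
  select *
  from connection to myconn
    (
      select sex_cd, count(*) as n
      from csprod.participant
      group by sex_cd
    );

  disconnect from myconn;
quit;
```

Only the summarized rows travel back to SAS, which is the main reason to let the database do the work.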
  • Slide 15
  • Using a LIBNAME to connect: recall that the previous slide showed the pass-through facility in SAS, where most of the query is done on the database. You can use a libname statement to connect instead of pass-through. The advantage of this method is that you are programming in SAS (using SAS functions and formats); SAS determines which program (SAS or RDB) will handle statements more efficiently.
  • Slide 16
  • Using a LIBNAME to connect: example using a libname statement:
      libname onco odbc dsn='Oncolog' schema=dbo;
    (1) onco: the name of the library. (2) odbc: tells SAS that you are using an ODBC engine. (3) dsn: use the name of the database that was used to set up the ODBC connection. NOTE: the schema option is not always required.
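Once the libref is assigned, database tables can be used like SAS datasets. The sketch below assumes SAS/ACCESS Interface to ODBC and the `Oncolog` DSN from the slide; the table `onco.person` and the column `persontype` are taken from the Task 3 code later in the deck, and the query itself is only an illustration.

```sas
/* Assumes SAS/ACCESS Interface to ODBC and a working DSN named Oncolog */
libname onco odbc dsn='Oncolog' schema=dbo;

/* list the tables and views visible through the libref */
proc contents data=onco._all_ nods;
run;

/* SAS functions such as LOWCASE can be applied directly to database columns */
proc sql;
  select count(*) as n_patients
  from onco.person
  where lowcase(persontype) = 'patient';
quit;

libname onco clear;
```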
  • Slide 17
  • Seeing your data - Views: once the view is created, you can find it in the EXPLORER tab in SAS and use it as a normal dataset.
  • Slide 18
  • Seeing your data - Views: using the view columns in SAS EXPLORER.
  • Slide 19
  • Seeing your data - Views: double-click on a table to see the data. NOTE: columns that identify personal information have been removed from this screenshot.
  • Slide 20
  • Other ways to view data: you may have software from the RDB vendor: TOAD (for Oracle), SQL Developer (for Oracle), SQL Server, Teradata. All vendors may have some limited-function development software that allows viewing data, viewing the type of a column (char, num, date, etc.), and writing SQL queries.
  • Slide 21
  • Sample view from SQL Developer
  • Slide 22
  • Syntax: Single table - 1 of 2
      PROC SQL:
        proc sql;
          create table new-table as
          select column-1, column-2, etc.
          from source-table
          where condition;
        quit;
      DATA step:
        data new-table;
          set source-table (keep=column-list where=(condition));
        run;
    Example: create a dataset (table) with men aged 50 to 74. Assume the source table is called demographics and contains the variables subjectID, age and sex.
  • Slide 23
  • Syntax: Single table - 2 of 2
      PROC SQL:
        proc sql;
          create table men5074 as
          select subjectID, age
          from work.demographics
          where sex='M' and age between 50 and 74;
        quit;
      DATA step:
        data men5074 (drop=sex);
          set work.demographics (keep=subjectid sex age where=(sex='M' and 50
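A runnable sketch of the men-aged-50-to-74 example, done both ways on invented data. The toy rows and the output names `men5074_sql` and `men5074_ds` are assumptions (separate names are used so the two results can be compared), and the DATA step range condition is written as one equivalent of BETWEEN 50 AND 74.

```sas
data work.demographics;   /* invented rows for illustration */
  length subjectID $4 sex $1;
  input subjectID $ age sex $;
  datalines;
0001 49 M
0002 50 M
0003 62 F
0004 74 M
0005 75 M
;
run;

proc sql;
  create table work.men5074_sql as
  select subjectID, age
  from work.demographics
  where sex = 'M' and age between 50 and 74;
quit;

data work.men5074_ds (drop=sex);
  set work.demographics (keep=subjectID sex age
                         where=(sex = 'M' and 50 <= age <= 74));
run;

/* confirm that both approaches produce the same table */
proc compare base=work.men5074_sql compare=work.men5074_ds;
run;
```

Both outputs keep subjects 0002 and 0004; BETWEEN is inclusive at both ends, so ages 50 and 74 qualify.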
  • Parsing the code - 2 of 3
      (select ptc.gender, count(*)
       from (select participant_id, sex_cd,
                    case when sex_cd=222 then 'F' else 'M' end as gender
             from csprod.participant
             where trunc(birth_dt) between to_date('19520601','YYYYMMDD')
                                       and to_date('19530531','YYYYMMDD')
               and sex_cd 240
               and del_dt is null) ptc
    The outer select list (ptc.gender, count(*)): put these columns in the SAS dataset part60. The inner select, aliased ptc: create a temporary table called ptc. Table PTC contains the columns listed from the PARTICIPANT table, with the restrictions shown in the WHERE clause.
  • Slide 53
  • Parsing the code - 3 of 3
       inner join (select participant_id
                   from csprod.participant_program
                   where program_id=1
                     and program_status_cd=263
                     and del_dt is null) pp
         on ptc.participant_id = pp.participant_id
       group by ptc.gender
      );
      disconnect from myconn;
      quit;
    The inner select, aliased pp: create temporary table PP from PARTICIPANT_PROGRAM with the restrictions defined in the WHERE clause.
  • Slide 54
  • Joins: for joining two or more tables. This example shows an inner join: we want participants, and the number of males and females, participating in the CRC screening program who were age 60 as of May 31, 2013. (Venn diagram: circles PTC and PP overlapping in area C.) Area C is the result of the inner join. Temporary table PTC: a subset of csprod.participant. Temporary table PP: a subset of csprod.participant_program.
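The overlap in the diagram can be reproduced on a small scale. This sketch uses invented stand-ins for the two temporary tables; the dataset names and rows are hypothetical.

```sas
data work.ptc;   /* stand-in for the PTC subset of csprod.participant */
  input participant_id gender $;
  datalines;
1 F
2 M
3 M
4 F
;
run;

data work.pp;    /* stand-in for the PP subset of csprod.participant_program */
  input participant_id;
  datalines;
2
3
4
5
;
run;

proc sql;
  /* only participant_id values present in both tables survive (area C) */
  create table work.part60_demo as
  select ptc.gender, count(*) as n
  from work.ptc as ptc
       inner join work.pp as pp
       on ptc.participant_id = pp.participant_id
  group by ptc.gender;
quit;
```

Participants 2, 3 and 4 appear in both tables, so the result has two rows: F with n=1 and M with n=2. Participant 1 (only in PTC) and participant 5 (only in PP) are dropped.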
  • Slide 55
  • Task 2 - Results: What will be the query result? What's the table/dataset name? How many rows? How many columns? What are the columns called?
  • Slide 56
  • Task 2 - Results
  • Slide 57
  • Task 3 - Patients with kidney cancer. REQUEST: find the number of patients with invasive kidney cancer (ICD-O-3 = C64.9) diagnosed between 2008 and 2010. Break down counts by age and sex. Interested in age under 60 and age 60 and older. BACKGROUND: remove any patients who were deleted; remove any tumors that were deleted; diagnoses are in a table called oldiagnostic; sex is in a table called olpatient; birth date is in a table called person.
  • Slide 58
  • Task 3 - Map
      onco.oldiagnostic: personser deleted; diagnosticser deleted; dxstate=NS; substr(icdohistocode,6,1)='3'; year(dateinitial..) in (2008, 2009, 2010)
      onco.olpatient: personser; olsex
      onco.person: personser; persontype=patient; datepart(dateofbirth)
  • Slide 59
  • Task 3 Code (1 of 5)
      proc sql feedback;
        create table onco_coh as
        select a.*, b.olsex, f.birth_dt,
               floor(yrdif(f.birth_dt, a.initdx_dt, 'act/act')) as ageatdx
        from
  • Slide 60
  • Task 3 Code (2 of 5)
      /*** get cases ***/
      (select o.personser, o.diagnosticser,
              datepart(o.DateInitialDiagnosis) as initdx_dt format=date9.,
              o.icdositecode
       from onco.oldiagnostic o
       where o.icdositecode in ('C64.9')
         /*** only invasive cancers ***/
         and substr(o.icdohistocode,6,1)='3'
         and year(o.dateinitialdiagnosis) between 2008 and 2010
         and o.dxstate='NS'
  • Slide 61
  • Task 3 Code (3 of 5)
         /*** patient not deleted ***/
         and o.personser not in
             (SELECT ps1.Personser
              FROM onco.OlPatientSup ps1
              WHERE ps1.PersonSer = o.PersonSer
                and ps1.identifier = 'CCRPatientReportingStatu'
                AND ps1.String IN ('04','05')
                and ps1.FieldSeq = 0)
  • Slide 62
  • Task 3 Code (4 of 5)
         /*** diagnosis not deleted ***/
         and o.diagnosticser not in
             (SELECT ds1.diagnosticser
              FROM onco.OLdiagnosticsup ds1
              WHERE ds1.PersonSer = o.PersonSer
                and o.diagnosticser = ds1.diagnosticser
                and ds1.identifier = 'CCRPrimaryReportingStatu'
                AND ds1.String IN ('04','05')
                and ds1.FieldSeq = 0)) a
  • Slide 63
  • Task 3 Code (5 of 5)
      /*** get patient's sex ***/
      left join
        (select personser, olsex
         from onco.olpatient) b
        on a.personser=b.personser
      /*** get birth date ***/
      left join
        (select personser, datepart(DateOfBirth) as birth_dt format=date9.
         from onco.person
         where lowcase(persontype)='patient') f
        on a.personser=f.personser
      ;
      quit;
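The "not deleted" filters in parts 3 and 4 are correlated sub-queries: the inner query refers to a column of the outer row. A minimal sketch on invented stand-in tables follows; the dataset names and rows are hypothetical, and treating status strings '04' and '05' as "deleted" is inferred from the comments in the Task 3 code.

```sas
data work.diag;     /* stand-in for onco.oldiagnostic */
  input personser diagnosticser;
  datalines;
101 1
102 2
103 3
;
run;

data work.patsup;   /* stand-in for onco.OlPatientSup */
  input personser string $;
  datalines;
102 04
103 01
;
run;

proc sql;
  select o.personser, o.diagnosticser
  from work.diag as o
  where o.personser not in
        (select s.personser
         from work.patsup as s
         where s.personser = o.personser    /* correlation with the outer row */
           and s.string in ('04', '05'));   /* statuses assumed to mean deleted */
quit;
```

Persons 101 and 103 are returned. Person 102 is excluded because the inner query finds a matching row with status 04.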
  • Slide 64
  • Task 3 - Results
                 Age at diagnosis
      Sex    Under 60  60 and older  Total
      M      128       247           375
      F       76       147           223
      Total  204       394           598
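The slides do not show how the table above was tabulated from onco_coh. One possible approach, sketched here on invented rows, is a user-defined format that groups ageatdx, followed by PROC FREQ. The dataset `work.onco_coh_demo`, its values, the format name `agegrp`, and the assumption that olsex holds M/F are all hypothetical.

```sas
data work.onco_coh_demo;   /* stand-in for onco_coh; values are invented */
  input olsex $ ageatdx;
  datalines;
M 45
M 61
M 72
F 59
F 60
;
run;

proc format;
  value agegrp low -< 60  = 'Under 60'
               60  - high = '60 and older';
run;

proc freq data=work.onco_coh_demo;
  /* the format groups ages into the two bands before counting */
  tables olsex * ageatdx / norow nocol nopercent;
  format ageatdx agegrp.;
run;
```

Applying the format in PROC FREQ avoids creating a separate age-group variable; the cross-tabulation includes row and column totals by default.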
  • Slide 65
  • Self-join; correlated sub-query; outer FROM and WHERE; UNION
  • Slide 66
  • What is the sound of one table joining?
      /* select candidates for babes becoming mothers */
      proc sql;
        create table Candidates as
        select B1.BrthDate
             , B1.BirthID
             , B2.DLMBDate
             , B2.ContctID
        from SASDM.DelnBrth as B1 /* babes */
           , SASDM.DelnBrth as B2 /* mums */
        where B1.BrthDate = B2.DLMBDate;
      NOTE: Table WORK.CANDIDATES created, with 855040 rows and 4 columns.
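A self-join reads the same table twice under two aliases. The sketch below mirrors the slide's query on invented rows; the dataset `work.births`, its values, and the reading of DLMBDate as the mother's date of birth are assumptions.

```sas
data work.births;   /* stand-in for SASDM.DelnBrth; values are invented */
  input BirthID ContctID BrthDate :yymmdd10. DLMBDate :yymmdd10.;
  format BrthDate DLMBDate yymmdd10.;
  datalines;
1 10 1985-03-02 1960-07-14
2 11 2010-06-20 1985-03-02
3 12 2011-01-05 1979-11-30
;
run;

proc sql;
  create table work.candidates_demo as
  select B1.BrthDate, B1.BirthID, B2.DLMBDate, B2.ContctID
  from work.births as B1,   /* babes */
       work.births as B2    /* mums  */
  where B1.BrthDate = B2.DLMBDate;
quit;
```

One candidate pair results: the baby in birth 1 shares a date with the mother recorded on birth 2. Matching on date alone pairs every baby with every mother born the same day, which is consistent with the large row count in the log note on the slide.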
  • Slide 67
  • Correlated Sub-query. OB/Research has data in a clinical ultrasound db and a maternal serum screening db. Objective: find all mothers with abnormal screening and see if the ultrasound indicated risk for restricted growth (small baby).
  • Slide 68
  • Correlated Sub-Query
      create table Work.WithAtlee as /* VP data only available after 2003 not 2000 */
      select One18.*     /* 18-wk US */
           , M.MO365     /* perinatal data */
           , M.Wgt4Age
           /* 45 lines omitted here */
  • Slide 69
  • , M.DLPrvNND
    , M.DLPrvFTD /* no such variable as IUGR in */
    , M.MotherID in /* previous pregnancy - back link */
        ( select Prev.MotherID
          from SASDM.DelnBrth as Prev
          where Prev.MotherID = M.MotherID
            and Prev.Wgt4Age in ( 1, 2) /* pick