Upload
cordelia-daniels
View
214
Download
1
Embed Size (px)
Citation preview
Web-Based Data Collection and Analysis
Andy Leone
June 2014
Course Overview Tools for extracting and analyzing text
• SEC Filings• Websites• PDF Files
Software• Perl – ActivePerl• Perl “Front End” Komodo• MySQL
Why Perl? There Are Alternatives to Perl (e.g.,
Python, Ruby, R) Perl – Practical Extraction and Reporting
Language It is a mature language with considerable
support.• Countless packages (add-ons)• About the best there is when it comes to “Regular
Expressions.”
The big picture- What I do
EdgarSQL
PerlSAS
Excel
Pass-through
ODBC ODSProc Export
WRDS
SAS Connect
SEC FTP Files can be obtained from the SEC’s FTP site:
ftp.sec.gov
Note that the files are not made available on the FTP site until 24 hours after they are filed. It is possible to get real-time feeds.
You are supposed to do large downloads after 9:00PM Eastern.
Index Files
Index files are archived each quarter and the most recent one us updated every day: /edgar/full-index/2006/QTR1/company.zip The zip file contains a file “company.idx”
Index File, continued
Description: Master Index of EDGAR Dissemination Feed by Company NameLast Data Received: September 6, 2006Comments: [email protected] FTP: ftp://ftp.sec.gov/edgar/ Company Name Form Type CIK Date Filed File Name---------------------------------------------------------------------------------------------------------------------------------------------033 ASSET MANAGEMENT LLC / 13F-HR 1114831 2006-08-11 edgar/data/1114831/0001110550-06-000042.txt 1 800 CONTACTS INC 10-Q 1050122 2006-08-10 edgar/data/1050122/0001104659-06-053544.txt
-First 10 lines - description then data--Company Name-Form Type-CIK Number-Filing Date-Directory and File Name
Data Files SGML format All files contain basic header information
tagged by <SEC-HEADER> This is followed by each file submitted
with the filing. Example-
AMREP Corp. Submitted a 10-K/A (Text Version)
Challenge
Lack of uniformity Another Example:
Smart-tek Solutions Inc. (Text)
Perl
Practical Extracting and Reporting Language A powerful scripting language for working
with unformatted text.
Perl The key feature of Perl is Regular Expressions.
It is relatively easy to search through text files and match words or word patterns.
For example, to identify firm with an internal control deficiency when wording differs.
• “We have identified a material weakness”• “A material weakness exists”• “A control deficiency exists”• “We identified a control deficiency”• “No material weakness exists” (Don’t want to match this
one)
Expectations Using PERL can be very helpful.
More refined measures (e.g., modified audit opinions).
Large samples relative to hand collection. But..
Extracting data from free-form text can be messy. Especially when formatting varies so much from file to file. A lot of trial and error.
It would take an infinite amount of time to get success rates up to 99-100%.
Need to accept 90-95% and live with some “noise”
Example: Audit Opinions How do you find the opinion?
Look for something like “We have audited.” But PWC has to start with “In our opinion.”
How do you find the end of the opinion? Look for a date (e.g., March 15, 2005) a the start of a line But some dates in the opinion happen to fall at start of line
(e.g., for year ended December 31, 2005) Then require the date to be followed by spaces. But some opinions are dated with “except for.” Foreign auditors might date the report in day- month-year
order (15 March, 2005). Some firms have multiple opinions. Sometimes there are typos.
When is PERL most useful? Extracting specific text and looking for
keywords and phrases Audit Opinions Internal Controls Footnotes
Not great for unstructured financial information. For example, use of proceeds in IPO prospectus.
Sometimes you still need to code the data by hand.
What I do Download all index files and write them to a
MYSQL database (via PERL). Download filings I work with fairly often (e.g.,
10-K, 10-Q, 8-K) Create header header tables in MYSQL for
each filing type. Create project-specific tables as needed. Note: MYSQL allows for easy access from SAS,
STATA, excel, etc. via ODBC.
MYSQL, PERL and SAS
DBI Module This makes it really easy to work with
data via SQL or statistics software (SAS, STATA).
SAS Example:proc sql;connect to odbc (datasrc=Compustat user=me password=xxxx);create table myrestatesas select * from connection to odbc (select gvkey, file, sic, hlink, company_name,to_CFO, TO_CFO_Date,cik, irreg,to_CEO, to_CEO_Date
from 8kdata.restate_sample_jul2006 a);quit;
Software Installation Checklist ActivePerl 5.10
Let’s go to the Package Manager and install a couple of modules:
• DBD::mysqlPP• HTML::Format
Komodo Edit 4.4 MySQL 5.0 MySQL GUI Tools SAS 9.1.3 (or higher).