17
Web-Based Data Collection and Analysis Andy Leone June 2014

Web-Based Data Collection and Analysis Andy Leone June 2014

Embed Size (px)

Citation preview

Page 1: Web-Based Data Collection and Analysis Andy Leone June 2014

Web-Based Data Collection and Analysis

Andy Leone

June 2014

Page 2: Web-Based Data Collection and Analysis Andy Leone June 2014

Course Overview Tools for extracting and analyzing text

• SEC Filings• Websites• PDF Files

Software• Perl – ActivePerl• Perl “Front End” Komodo• MySQL

Page 3: Web-Based Data Collection and Analysis Andy Leone June 2014

Why Perl? There Are Alternatives to Perl (e.g.,

Python, Ruby, R) Perl – Practical Extraction and Reporting

Language It is a mature language with considerable

support.• Countless packages (add-ons)• About the best there is when it comes to “Regular

Expressions.”

Page 4: Web-Based Data Collection and Analysis Andy Leone June 2014

The big picture- What I do

EdgarSQL

PerlSAS

Excel

Pass-through

ODBC ODSProc Export

WRDS

SAS Connect

Page 5: Web-Based Data Collection and Analysis Andy Leone June 2014

SEC FTP Files can be obtained from the SEC’s FTP site:

ftp.sec.gov

Note that the files are not made available on the FTP site until 24 hours after they are filed. It is possible to get real-time feeds.

You are supposed to do large downloads after 9:00PM Eastern.

Page 6: Web-Based Data Collection and Analysis Andy Leone June 2014

Index Files

Index files are archived each quarter and the most recent one us updated every day: /edgar/full-index/2006/QTR1/company.zip The zip file contains a file “company.idx”

Page 7: Web-Based Data Collection and Analysis Andy Leone June 2014

Index File, continued

Description: Master Index of EDGAR Dissemination Feed by Company NameLast Data Received: September 6, 2006Comments: [email protected] FTP: ftp://ftp.sec.gov/edgar/ Company Name Form Type CIK Date Filed File Name---------------------------------------------------------------------------------------------------------------------------------------------033 ASSET MANAGEMENT LLC / 13F-HR 1114831 2006-08-11 edgar/data/1114831/0001110550-06-000042.txt 1 800 CONTACTS INC 10-Q 1050122 2006-08-10 edgar/data/1050122/0001104659-06-053544.txt

-First 10 lines - description then data--Company Name-Form Type-CIK Number-Filing Date-Directory and File Name

Page 8: Web-Based Data Collection and Analysis Andy Leone June 2014

Data Files SGML format All files contain basic header information

tagged by <SEC-HEADER> This is followed by each file submitted

with the filing. Example-

AMREP Corp. Submitted a 10-K/A (Text Version)

Page 9: Web-Based Data Collection and Analysis Andy Leone June 2014

Challenge

Lack of uniformity Another Example:

Smart-tek Solutions Inc. (Text)

Page 10: Web-Based Data Collection and Analysis Andy Leone June 2014

Perl

Practical Extracting and Reporting Language A powerful scripting language for working

with unformatted text.

Page 11: Web-Based Data Collection and Analysis Andy Leone June 2014

Perl The key feature of Perl is Regular Expressions.

It is relatively easy to search through text files and match words or word patterns.

For example, to identify firm with an internal control deficiency when wording differs.

• “We have identified a material weakness”• “A material weakness exists”• “A control deficiency exists”• “We identified a control deficiency”• “No material weakness exists” (Don’t want to match this

one)

Page 12: Web-Based Data Collection and Analysis Andy Leone June 2014

Expectations Using PERL can be very helpful.

More refined measures (e.g., modified audit opinions).

Large samples relative to hand collection. But..

Extracting data from free-form text can be messy. Especially when formatting varies so much from file to file. A lot of trial and error.

It would take an infinite amount of time to get success rates up to 99-100%.

Need to accept 90-95% and live with some “noise”

Page 13: Web-Based Data Collection and Analysis Andy Leone June 2014

Example: Audit Opinions How do you find the opinion?

Look for something like “We have audited.” But PWC has to start with “In our opinion.”

How do you find the end of the opinion? Look for a date (e.g., March 15, 2005) a the start of a line But some dates in the opinion happen to fall at start of line

(e.g., for year ended December 31, 2005) Then require the date to be followed by spaces. But some opinions are dated with “except for.” Foreign auditors might date the report in day- month-year

order (15 March, 2005). Some firms have multiple opinions. Sometimes there are typos.

Page 14: Web-Based Data Collection and Analysis Andy Leone June 2014

When is PERL most useful? Extracting specific text and looking for

keywords and phrases Audit Opinions Internal Controls Footnotes

Not great for unstructured financial information. For example, use of proceeds in IPO prospectus.

Sometimes you still need to code the data by hand.

Page 15: Web-Based Data Collection and Analysis Andy Leone June 2014

What I do Download all index files and write them to a

MYSQL database (via PERL). Download filings I work with fairly often (e.g.,

10-K, 10-Q, 8-K) Create header header tables in MYSQL for

each filing type. Create project-specific tables as needed. Note: MYSQL allows for easy access from SAS,

STATA, excel, etc. via ODBC.

Page 16: Web-Based Data Collection and Analysis Andy Leone June 2014

MYSQL, PERL and SAS

DBI Module This makes it really easy to work with

data via SQL or statistics software (SAS, STATA).

SAS Example:proc sql;connect to odbc (datasrc=Compustat user=me password=xxxx);create table myrestatesas select * from connection to odbc (select gvkey, file, sic, hlink, company_name,to_CFO, TO_CFO_Date,cik, irreg,to_CEO, to_CEO_Date

from 8kdata.restate_sample_jul2006 a);quit;

Page 17: Web-Based Data Collection and Analysis Andy Leone June 2014

Software Installation Checklist ActivePerl 5.10

Let’s go to the Package Manager and install a couple of modules:

• DBD::mysqlPP• HTML::Format

Komodo Edit 4.4 MySQL 5.0 MySQL GUI Tools SAS 9.1.3 (or higher).