
2015 SLA Conference - Open Data - Emadi


Page 1: 2015 SLA Conference - Open Data - Emadi

Slide 1

Letters From the Front

Lori Emadi Head, Taxonomy & Metadata

June 14, 2015

Page 2: 2015 SLA Conference - Open Data - Emadi

Slide 2

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Page 3: 2015 SLA Conference - Open Data - Emadi

Slide 3

Background

• RAND acquired hundreds of thousands of classified and unclassified documents from U.S. Army units returning from Iraq and Afghanistan

• Files were copied as is from hundreds of hard drives – documents are varied in content, naming, format, and structure (or lack of)

• Available tools were not suitable for working with such a large and diverse corpus of documents

• RAND lacked a good way to search and make use of these data

• “Letters From The Front” project was approved as an FY 2013 R&D effort

• My project team developed a document indexing, search, and visualization capability called Hermes from an open source tool set that is scalable and extensible

Page 4: 2015 SLA Conference - Open Data - Emadi

Slide 4

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Page 5: 2015 SLA Conference - Open Data - Emadi

Slide 5

What We Learned About This Very Large Data Collection

Units and Agencies (of 195, 29 submitted data)

Assistant Secretary of the Army, Acquisition, Logistics, and Technology - Center for Army Lessons Learned - Center of Military History - Stryker Center for Lessons Learned - United States Army Combined Arms Support Command and Fort Lee - US Army Corps of Engineers - US Army G-8 - US Army Communications Life Cycle Management Command - 101st Air Assault - 10th Mountain Division (Light) and Fort Drum - 10th Special Forces Group - 16th Engineer Brigade - 1st Cavalry Division - United States Army Europe and Seventh Army - 1st Infantry Division - 3d Infantry Division (Mechanized) and Fort Stewart - 3rd Armored Cavalry Regiment - 42nd Infantry Division - I CORPS - III ARMY - III CORPS and FT Hood - United States Army Special Forces Command (Airborne) - 75th Exploitation Task Force - Multi-National Corps-Iraq - Multi-National Force-Iraq - Multi-National Security Transition Command-Iraq/Commander, NATO Training Mission-Iraq - Office of Security Cooperation-Afghanistan - United States Army Military Police School - United States Military Academy

Diverse content and file types

SITREPS - FRAGOS - SIGACTS - INTSUMS - SPOT reports - AAR - WARNO - BDA - BDR - CONPLAN - Order of Battle - OPLAN - OPORD - Deployment orders - Daily Personnel Status - Military vehicle status - Alert Roster - Service Support Order - Mission Analysis Briefs - Decision Briefs - Mission Concept Briefs - Backbriefs - Balcony Briefs - Debriefs - …

23 Army commands, units, and Army support activities

4 DOD, Joint, and Other Government activities

2 Military Academies

Emails & Email collections (~63,000 files), PowerPoint (~171,000), PDF (~64,000), Excel (~84,000), Word and other text (~400,000), images and video (~300,000), …

Page 6: 2015 SLA Conference - Open Data - Emadi

Slide 6

Some interestingly named data folders …

SIRs & IRs\1ID\BEFORE WE GOT ORGANIZED\

1ID_G3\G3 Operations\CHOPS\G3 OPS – Battle Captains\STUFF THAT BOB JUST SAVED\

1ID_G3\G3 Operations\FRAGOSs G3 OPS\RFI Section\SSG xxxxxxxx\DA BUCKSTER FOLDER\MILITARY RELATED CRAP\WORK RELATED CRAP\

1stCAV\SJA\EXSUMS\Dan’s Super Duper FRAGO Folder\

1stCAV\G3\EOD LNO\Im Thuper Thanks Fer Athkin\

1stCAV\SJA\EXSUMS\DEAR GOD, I HOPE WE DON’T NEED THESE MONTHS\APRIL 2005\...

MNSTCI\NIPR\MNCI FOLDERS\NIPS\C2\C2_SECURITY\ I.Think.This.Is.The.Template.That.You.Need.To.Use.For.Submitting.Anything.On.Me.But.I.Could.Be.Wrong.About.That.So.I.Will.Ask.About.It.Tomorrow\

Page 7: 2015 SLA Conference - Open Data - Emadi

Slide 7

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Page 8: 2015 SLA Conference - Open Data - Emadi

Slide 8

Search Scenarios

• “What survey data exists on local (Iraqi, Afghan) population attitudes, beliefs, and information consumption patterns?”

• “[for x operation] We need to find out which brigades were deployed, when they were deployed, and who their brigade commanders were.”

• “There are a number of interesting distinctions that may be tractable, e.g., variations in communications of different entities and whether/how they coordinate…”

• “[for intelligence gathering] It takes multiple data points to create a profile. The data should be stored separately and combined to make the profile. Then you can run it across everything and look for relationships with any of the data points.”

Page 9: 2015 SLA Conference - Open Data - Emadi

Slide 9

Reconstructing Lost Events

• Not a scenario: Missing records on the 81st BCT (Washington State ARNG) and 82nd AB Division and their operations in Afghanistan and Iraq. (Seattle Times article, July 13, 2013)

• These units did not contribute their hard drives to RAND but Hermes found a few thousand documents retained by other units who interacted with the 81st BCT or 82nd Airborne
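A search of this kind can be sketched as a single query that matches any of several name variants for a unit. This is a minimal illustration only: the function name `any_phrase`, the field name `text`, and the list of variants are assumptions, not Hermes's actual schema or search code. It relies only on the OR operator and field grouping of the Lucene query parser that Solr uses.

```python
def any_phrase(field: str, phrases: list[str]) -> str:
    """Build field:("a" OR "b" ...) so a document matching any phrase is returned.

    The field name and the phrases are supplied by the caller; nothing here
    reflects the real Hermes index.
    """
    quoted = []
    for phrase in phrases:
        # Escape backslashes and double quotes so each phrase stays one quoted unit.
        safe = phrase.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{safe}"')
    return f"{field}:({' OR '.join(quoted)})"


# Hypothetical name variants for the two units mentioned on the slide.
query = any_phrase("text", ["81st BCT", "82nd Airborne", "82nd AB Division"])
print(query)
```

Grouping the variants under one field keeps the query short and makes it easy to extend the list as new spellings of a unit name turn up in the corpus.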

Page 10: 2015 SLA Conference - Open Data - Emadi

Slide 10

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Page 11: 2015 SLA Conference - Open Data - Emadi

Slide 11

There Are Many Available Tools For Dealing With Masses of Textual Data…

Page 12: 2015 SLA Conference - Open Data - Emadi

Slide 12

The Solr Suite Provides Needed Speed, Scalability and Extensibility, and Is Open Source…

• “Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.”

• (“Apache Solr,” at http://lucene.apache.org/Solr/)

…And Solr Is An Emerging Standard for Searching Large Text Databases…

Page 13: 2015 SLA Conference - Open Data - Emadi

Slide 13

Application Framework

Search Human Interface

Modular and flexible by design, this architecture can be customized for other RAND efforts

Page 14: 2015 SLA Conference - Open Data - Emadi

Slide 14

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Page 15: 2015 SLA Conference - Open Data - Emadi

Slide 15

…Example search results in a similar out-of-box setting…


Columbia University Library Catalog (CLIO)

Page 16: 2015 SLA Conference - Open Data - Emadi

Slide 16

Metadata fields

• Customizable

• Implemented during processing
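As a rough illustration of metadata being assigned during processing, the sketch below derives a few fields from a file's original folder path. The field names (`source_folder`, `folder_depth`, `file_name`, `file_type`) and the sample file name are hypothetical; the slides do not list the fields Hermes actually uses.

```python
from pathlib import PureWindowsPath


def path_metadata(raw_path: str) -> dict:
    """Derive illustrative metadata fields from a Windows-style relative path."""
    path = PureWindowsPath(raw_path)
    parts = path.parts
    return {
        # The top-level folder often names the contributing unit (assumption).
        "source_folder": parts[0] if len(parts) > 1 else "",
        # Number of folders above the file itself.
        "folder_depth": len(parts) - 1,
        "file_name": path.name,
        # Lower-cased extension without the dot, usable as a facet value.
        "file_type": path.suffix.lower().lstrip("."),
    }


# Hypothetical file inside one of the folder trees shown earlier in the deck.
print(path_metadata(r"1stCAV\G3\EOD LNO\report.PPT"))
```

Deriving fields from the path preserves whatever context the original folder structure carried, even when the file itself has no usable internal metadata.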

Page 17: 2015 SLA Conference - Open Data - Emadi

Slide 17

Page 18: 2015 SLA Conference - Open Data - Emadi

Slide 18

Query Parser Syntax

• Fields: field name followed by a colon ":" then term. title:"The Right Way" AND text:go

• Wildcard Searches: single character wildcard "?"; multiple character wildcard "*"

• Regular Expression Searches: /[mb]oat/

• Fuzzy Searches: use the tilde, "~": roam~ roam~1

• Proximity Searches: use the tilde, "~": "jakarta apache"~10

• Range Searches: use range queries with date and non-date fields: mod_date:[20020101 TO 20030101] title:{Aida TO Carmen} Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.

• Boosting a Term: use the caret, "^", with a boost factor at the end of the term: jakarta^4 apache "jakarta apache"^4 "Apache Lucene"

• Boolean Operators: AND, "+", OR, NOT and "-"
  "jakarta apache" OR Jakarta
  "jakarta apache" AND "Apache Lucene"
  +jakarta lucene
  "jakarta apache" NOT "Apache Lucene"
  NOT "jakarta apache"
  "jakarta apache" -"Apache Lucene"

• Grouping: use parentheses to group clauses to form sub queries. (jakarta OR apache) AND website

• Field Grouping: use parentheses to group multiple clauses: title:(+return +"pink panther")

• Escaping Special Characters: to escape use the \ before the character. Ex: to search for (1+1):2 use the query: \(1\+1\)\:2
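The escaping rule is easy to get wrong by hand, so a small helper is a common companion to this syntax. The sketch below is an assumption-labeled illustration, not part of Hermes: `escape_term` and `field_query` are hypothetical names, and the character set follows the special characters listed in the Lucene classic query parser documentation.

```python
import re

# Characters the Lucene query parser treats as syntax. Escaping "&" and "|"
# individually also covers the two-character operators "&&" and "||".
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_term(text: str) -> str:
    """Put a backslash before every query-syntax character in text."""
    return _SPECIAL.sub(r"\\\1", text)


def field_query(field: str, value: str) -> str:
    """Build field:"value" as a phrase query with the value escaped."""
    return f'{field}:"{escape_term(value)}"'


print(escape_term("(1+1):2"))                 # \(1\+1\)\:2
print(field_query("title", "The Right Way"))  # title:"The Right Way"
```

Escaping user-supplied text before it is placed in a query keeps characters such as ":" or "(" in file and folder names from being read as operators.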

Page 19: 2015 SLA Conference - Open Data - Emadi

Slide 19

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Page 20: 2015 SLA Conference - Open Data - Emadi

Slide 20

Next Steps

• Security around collection, user authentication

• Natural Language Processing (NLP)

• Extracting content attachments in emails (while keeping the attachments in place)

• Additional visualization options

• Enhanced logging and tracking

• Ability for users to rank content, add or edit metadata content

• Enhance user interface

Page 21: 2015 SLA Conference - Open Data - Emadi

Slide 21