Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts

Invited talk at the TRUST Women’s Institute for Summer Enrichment (WISE), Cornell University, Jun 16, 2014. Providing infrastructure for text mining research on a big data repository like HathiTrust raises access and security challenges when the bulk of the repository is protected by copyright.

Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts

Women’s Institute for Summer Enrichment, Cornell University, Jun 16, 2014

Beth Plale

Professor, School of Informatics and Computing; Director, Data To Insight Center

Indiana University

HATHI TRUST RESEARCH CENTER

• Who are the Players? HathiTrust, Google, Authors Guild

• The Object of Attention: 11 M books from university libraries

• Rulings around copyright

• HTRC, or why I care

• Is security of HTRC Data Capsule good enough?

The Players

Books Digitization Project (2007)

Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …

[Diagram labels: digitize; digitized books; legal action]

Mar 2011: A New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues.

Nov 2013: The same judge issued a ruling saying that Google's use of the works was a "fair use" under copyright law.

Google/Authors Guild

• June 2014: The 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use.

Highlights 2014 ruling

• With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".

Source: Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014

• The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.

• Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.

Does Authors Guild Represent All Authors?

• The Authors Guild members are overwhelmingly trade-book authors; the books scanned by the HathiTrust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.

• The Authors Alliance: new organization representing authors who are primarily concerned with being read.

Source: Cory Doctorow, "Court finds full-book scanning is fair use", 3:00 pm Sat, Jun 14, 2014

Highlight 2014 Ruling

• Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.

Source: Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014

HTRC, or why I care: HathiTrust digital library is “big data”, and text mining is the new library catalog

Similar model, different ends

HTRC goes beyond “full text searchable database”

[Diagram labels: search; $$; Scholarly search; Scholarly mining]

#HTRC    @HathiTrust  

HathiTrust

• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
    – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia

http://www.hathitrust.org

→ Distinguished from http://www.hathitrust.org/htrc

Content of HathiTrust

• Books and journals
    – Plus pilots around images, audio, born-digital

• Digitization sources
    – Google (96.8%, 10,162,104)
    – Internet Archive (2.9%, 301,972)
    – Local (0.3%, 31,840)
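As a quick arithmetic check, the percentages on the slide follow from the volume counts. The short Python sketch below recomputes them; the variable names are illustrative and only the three counts come from the slide.

```python
# Volume counts per digitization source, as listed on the slide.
counts = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total = sum(counts.values())

# Share of each source, in percent, rounded to one decimal place.
shares = {source: round(100 * n / total, 1) for source, n in counts.items()}

print(total)   # 10495916
print(shares)  # {'Google': 96.8, 'Internet Archive': 2.9, 'Local': 0.3}
```

The three counts sum to roughly 10.5 million volumes, which is in line with the "11 M books" figure used elsewhere in the talk.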

Content Sources

Content distribution

360,000 volumes in Spanish

Motivation for HTRC

→ The HathiTrust repository is massive in scale: a latent goldmine for text-based research

→ The restricted nature of parts of HathiTrust content suggests a need for new forms of access that preserve the intimate nature of interaction with texts while honoring restrictions on access

→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
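The "computation moves to the data" idea can be illustrated with a toy Python sketch. Everything here is hypothetical (the `ProtectedCorpus` class, its `run` method, and the sample texts are not HTRC code): the caller hands an analysis function to the repository and gets back only derived data, never the raw text.

```python
from collections import Counter
from typing import Callable, Dict


class ProtectedCorpus:
    """Toy stand-in for a repository whose raw text stays on the server side."""

    def __init__(self, volumes: Dict[str, str]) -> None:
        self._volumes = volumes  # raw text is never returned directly

    def run(self, analysis: Callable[[str], Counter]) -> Dict[str, Counter]:
        # The analysis travels to the data; only its output travels back.
        return {vol_id: analysis(text) for vol_id, text in self._volumes.items()}


def word_counts(text: str) -> Counter:
    """A derived, non-readable view of a text: lowercase word frequencies."""
    return Counter(text.lower().split())


# Hypothetical sample volumes, for illustration only.
corpus = ProtectedCorpus({"vol-1": "the whale the sea", "vol-2": "the house"})
results = corpus.run(word_counts)
print(results["vol-1"]["the"])  # 2
```

Note that this sketch does not stop an analysis function from returning the text itself. Enforcing that restriction is the hard part, and it is the problem the security discussion later in the talk is concerned with.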

HathiTrust Research Center

• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.

• HTRC Executive Committee
    – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
    – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
    – Robert McDonald, Indiana University Libraries
    – Beth Sandore Namachchivaya, University of Illinois Library
    – John Unsworth, CIO, Dean of Library, Brandeis University

   

HTRC system

[Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]

Text mining at scale: quick tutorial on topic modeling of texts

Topic Modeling

• Can answer more complex or nuanced questions
    – What are the primary themes of an author?
    – What are the primary themes of a research domain?
    – When did a new topic enter a research domain?

• Provides more data than word counts
    – 100s of topics can be extracted.
    – Underlying data (topics, volume, and page) is available
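To make the idea concrete, here is a minimal topic-modeling sketch in pure Python. It assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, which is one common technique; the talk does not say which algorithm or tool HTRC uses. The function name, parameters, and the four toy documents are all illustrative.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assign = []                                        # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Most frequent words per topic, and the topic mixture of each document.
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    doc_mix = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return top_words, doc_mix


# Toy documents, invented for illustration.
docs = [
    "letters household house letters family".split(),
    "illustrations volume literature volume edition".split(),
    "house family letters household".split(),
    "literature edition illustrations volume".split(),
]
top_words, doc_mix = lda_gibbs(docs, n_topics=2)
print(top_words)
print(doc_mix)
```

The per-document mixtures in `doc_mix` are the kind of underlying data the slide refers to: each volume (or page) gets a weight for every topic, which is richer than a plain word count.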

Themes for Authors

Two topics with identical centralities (e.g., Dickens) but separate themes

More strongly focused on book (illustrations, volume, literature)

More strongly focused on author himself (letters, household, house)

Ted Underwood, Univ of Illinois

Digging into philosophy of science

Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts

Colin Allen, IU

The How

• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’

• Set contains lots of uninteresting books, e.g., college course catalogs

• Apply topic modeling on 86-volume subset

• Using IPython Notebook
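The selection steps above can be sketched in a few lines of Python. This is only an illustration: the four keywords come from the slide, but the sample volumes, the function names, and the stop-phrase filter are assumptions, and the talk does not say how the 86-volume subset was actually chosen.

```python
# Keywords from the slide; everything else in this sketch is hypothetical.
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")


def select_workset(volumes, keywords=KEYWORDS):
    """Return ids of volumes whose text mentions any keyword (case-insensitive)."""
    return [
        vol_id
        for vol_id, text in volumes.items()
        if any(keyword in text.lower() for keyword in keywords)
    ]


def drop_uninteresting(volumes, workset, stop_phrases=("course catalog",)):
    """Remove volumes that match a stop phrase, e.g. college course catalogs."""
    return [
        vol_id
        for vol_id in workset
        if not any(phrase in volumes[vol_id].lower() for phrase in stop_phrases)
    ]


# Invented sample volumes, for illustration only.
volumes = {
    "vol-a": "Darwin and Romanes on the minds of animals",
    "vol-b": "College course catalog: Comparative Psychology 101",
    "vol-c": "A treatise on the construction of bridges",
}

workset = select_workset(volumes)
subset = drop_uninteresting(volumes, workset)
print(workset)  # ['vol-a', 'vol-b']
print(subset)   # ['vol-a']
```

A plain keyword match is cheap but noisy, which is why a second filtering pass is needed before topic modeling.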

Of the set of topics, choose ‘16’ as best

Volumes most similar to topic 16
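One simple way to list the volumes most similar to a topic is to rank them by that topic's weight in each volume's topic mixture. The Python sketch below assumes that approach; the slide does not state which similarity measure was used. The volume ids and weights are invented for illustration.

```python
def most_similar_volumes(doc_topic, topic, n=3):
    """Rank volumes by the weight they give to one topic, highest first."""
    ranked = sorted(
        doc_topic.items(),
        key=lambda item: item[1].get(topic, 0.0),
        reverse=True,
    )
    return [(vol_id, weights.get(topic, 0.0)) for vol_id, weights in ranked[:n]]


# Hypothetical topic weights per volume (topic id -> weight).
doc_topic = {
    "vol-a": {16: 0.62, 3: 0.10},
    "vol-b": {16: 0.05, 7: 0.80},
    "vol-c": {16: 0.41},
    "vol-d": {2: 0.90},
}

print(most_similar_volumes(doc_topic, topic=16, n=2))
# [('vol-a', 0.62), ('vol-c', 0.41)]
```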

Copyright: A Reality

Full text download is limited by both size and by copyright

HTRC solution to fully flexible text mining research on entire HT digital repository: HTRC Data Capsule

Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan

Questions driving HTRC Data Capsule

• Non-consumptive use: can framework provide safe handling of large amounts of protected data?

• Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?

• Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?

HTRC Data Capsules

• Trusts text mining researcher to not deliberately leak repository data

• Prevents malware acting on user’s behalf from leaking data.

• V1.0 limits analysis to running within single VM

HTRC Data Capsule Architectural Components

[Diagram components: Researcher; SSH; Research results; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; Registry (services, worksets)]

Upon run, Secure Capsule controls I/O behind scenes

HTRC Data Capsule interaction

Researcher requests new VM of type X

Image instance is created

Researcher installs tools onto VM through window on her desktop.

Final location of results is registry

[Diagram: steps numbered 1) to 4); labels: Registry (services, worksets); setup]
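The interaction can be read as a small sequence of states. The Python sketch below is one hedged interpretation, not the HTRC implementation: the `Phase` and `Capsule` names, the method names, and the ordering of the steps are assumptions drawn from the slide text. The only points taken from the slides are that a VM is requested, an instance is created, the researcher installs tools, and results end up in the registry.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class Phase(Enum):
    REQUESTED = auto()  # researcher has asked for a VM of some type
    SETUP = auto()      # instance exists; researcher installs tools
    DONE = auto()       # analysis has run; results are in the registry


@dataclass
class Capsule:
    """Hypothetical model of one capsule's lifecycle."""

    vm_type: str
    phase: Phase = Phase.REQUESTED
    tools: List[str] = field(default_factory=list)

    def create_instance(self) -> None:
        self._require(Phase.REQUESTED)
        self.phase = Phase.SETUP

    def install(self, tool: str) -> None:
        self._require(Phase.SETUP)
        self.tools.append(tool)

    def run(self, analysis, protected_texts, registry) -> None:
        self._require(Phase.SETUP)
        # In this sketch the registry is the only place results are written.
        registry.append(analysis(protected_texts))
        self.phase = Phase.DONE

    def _require(self, phase: Phase) -> None:
        if self.phase is not phase:
            raise RuntimeError(f"expected {phase.name}, got {self.phase.name}")


registry = []
capsule = Capsule(vm_type="X")
capsule.create_instance()
capsule.install("tokenizer")
capsule.run(lambda texts: sum(len(t.split()) for t in texts), ["a b c", "d e"], registry)
print(registry)       # [5]
print(capsule.phase)  # Phase.DONE
```

Modeling the phases explicitly makes the key restriction easy to state: tools can only be installed before the run, and the run has a single sanctioned output channel.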

HTRC secure data capsule: view from researcher desktop

Thanks to our sponsors

HTRC goes beyond “full text searchable database”. Security has to be a top concern.

[Diagram label: scholarly research]
