terry-reese
OSUL and Digital Humanities
Dealing with Data Problems
◦ While the Library licenses the content via a content provider, access to the underlying data for aggregated research is and isn’t supported.
◦ In this case, access to content is limited through both our subscriptions and newspaper publishers themselves.
◦ For this project, many of the sources David and Patrick were interested in working with required licensing fees of roughly $25,000-50,000 per newspaper.
Big “little” data
We worry a lot about big research data in the library and how this information will be preserved and made accessible into the future.
◦ But equally concerning is big “little” data.
Big “little” data has very specific problems:
1. Acquisition of the data can be really difficult
2. Storage tends to be inefficient and difficult
3. It’s incredibly hard to move around
4. For purposes of aggregation, it limits the types of tools that can be used for evaluation
5. When the data is closed, finding undocumented inconsistencies is hard
Sample Data Set
NewsPaper Processing tool
Data processing methodology
Created two data sets:
1. First data set focused on any digital object (excluding classifieds) that included references to public housing
2. Second data set focused on any digital object (excluding classifieds) that included public housing and four agreed-upon synonyms for public housing
One benefit of the resources we used was that there was very little article duplication across them (i.e., very little reliance on the Associated Press), so little filtering was needed to account for duplicate data across newspapers.
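The selection of the two data sets can be sketched roughly as follows. This is an illustration in Python, not the C# tooling used in the project. The DigitalObject record, its field names, and the synonym list are assumptions: the slides do not name the four agreed-upon synonyms, so the ones below are placeholders.

```python
import re
from dataclasses import dataclass


@dataclass
class DigitalObject:
    # Hypothetical record layout; the real export format is not described in the slides.
    object_id: str
    object_type: str  # e.g. "article", "editorial", "classified"
    text: str         # OCR text of the object


PRIMARY_TERM = "public housing"
# Placeholder synonyms only: the four agreed-upon terms are not listed in the slides.
HYPOTHETICAL_SYNONYMS = ["housing project", "low-rent housing"]


def build_pattern(terms):
    # Allow any whitespace between words, since OCR text often breaks a phrase across lines.
    alternatives = [r"\s+".join(re.escape(w) for w in term.split()) for term in terms]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def select(objects, terms):
    # Keep every non-classified object whose text mentions at least one of the terms.
    pattern = build_pattern(terms)
    return [o for o in objects
            if o.object_type != "classified" and pattern.search(o.text)]


objects = [
    DigitalObject("a", "article", "New public\nhousing opens downtown"),
    DigitalObject("b", "classified", "Public housing unit for rent"),
    DigitalObject("c", "article", "The housing project expanded"),
    DigitalObject("d", "article", "City budget approved"),
]

data_set_1 = select(objects, [PRIMARY_TERM])
data_set_2 = select(objects, [PRIMARY_TERM] + HYPOTHETICAL_SYNONYMS)
```

With the sample objects above, the first set keeps only the article that mentions public housing, and the second set also picks up the article that uses one of the placeholder synonyms. The classified is excluded from both.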
Data processing methodology
From these sets, I wrote a suite of tools in C# that measured:
1) Presence of positive terms
2) Presence of negative terms
3) Neutral terms
4) Frequency of negative and positive terms
5) Proximity of positive and negative terms, used to provide weight
These tools used stemming so that different forms of a word were captured.
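A minimal sketch of these measurements, in Python rather than the original C#, might look like the following. Everything specific here is an assumption: the term lists are invented for illustration, the stemmer is a crude suffix stripper standing in for whatever stemming the tools actually used, "neutral" is taken to mean any token in neither list, and proximity is weighted as 1/distance to the nearest mention of "public housing" within a fixed window.

```python
import re

# Hypothetical term lists; the lexicons used in the project are not given in the slides.
POSITIVE_TERMS = {"improve", "safe", "clean"}
NEGATIVE_TERMS = {"crime", "fail", "slum"}


def stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        long_enough = len(word) - len(suffix) >= 3
        if word.endswith(suffix) and long_enough and not word.endswith("ss"):
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


POSITIVE_STEMS = {stem(t) for t in POSITIVE_TERMS}
NEGATIVE_STEMS = {stem(t) for t in NEGATIVE_TERMS}


def score(text, window=10):
    tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    # Positions where the phrase "public housing" starts (after stemming).
    anchors = [i for i in range(len(tokens) - 1)
               if tokens[i] == "public" and tokens[i + 1] == "hous"]
    result = {"positive": 0, "negative": 0, "neutral": 0, "weighted": 0.0}
    for i, token in enumerate(tokens):
        if token in POSITIVE_STEMS:
            sign = 1
            result["positive"] += 1
        elif token in NEGATIVE_STEMS:
            sign = -1
            result["negative"] += 1
        else:
            result["neutral"] += 1
            continue
        if anchors:
            distance = min(abs(i - a) for a in anchors)
            if distance <= window:
                # Closer terms count for more; the 1/distance weighting is an assumption.
                result["weighted"] += sign / max(distance, 1)
    return result


sample = score("Public housing improved safety but crimes rose in the slums.")
```

A positive "weighted" value suggests that favorable terms sit closer to the phrase than unfavorable ones, which is one way a "positive over negative" figure per article could be derived.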
One thing this work highlighted, however, was the limitation imposed by data quality. These resources are OCR’ed representations of a particular newspaper article, classified, etc., and OCR quality varies significantly across the titles. A secondary research project that I’ve begun uses these data sets to test the OCR quality of the set by using word frequency to map unique words across a digital object.
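One way such a word-frequency check could work is sketched below in Python. The slides do not describe the actual method, so this is only an assumption: it treats an unusually high share of unique or once-only words in an object as a hint that OCR errors have split real words into many garbled variants.

```python
import re
from collections import Counter


def unique_word_profile(text):
    # Tokenize on runs of letters; digits inside a word split it, as OCR noise often does.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": total,
        "unique": len(counts),
        # Share of tokens that are distinct words.
        "unique_ratio": len(counts) / total if total else 0.0,
        # Share of distinct words that occur exactly once.
        "singleton_ratio": singletons / len(counts) if counts else 0.0,
    }


clean = unique_word_profile("the housing plan the housing board the plan")
noisy = unique_word_profile("tbe housiug plan the h0using board tlie plau")
```

In this toy comparison the noisy text has a much higher unique-word ratio than the clean text. On real data the ratios would need to be compared across objects of similar length, since short texts naturally have a higher share of unique words.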
Just Public Housing: Cleveland
[Chart: Cleveland Call Post, "More Positive" and "More Negative" series, 1930-2000]
Just Public Housing: Cleveland
[Chart: "Article Content: Positive Over Negative", 1930-2000]
Extended Terms: Cleveland
[Chart: Cleveland Call Post, "More Positive" and "More Negative" series, 1930-2000]
Extended Terms: Cleveland
[Chart: "Article Content: Positive Over Negative", 1930-2000]
Public Housing vs Extended Terms
[Two charts: "Article Content: Positive Over Negative", 1930-2000]
Public Housing vs Extended Terms: NY
[Two charts: "Article Content: Positive Over Negative", 1910-2000]
Data processing methodology
Potential additional areas of inquiry:
• Representation of public housing in:
• Letters to the editor
• Editorials
• Featured articles