Corpus of Computer-Mediated Communication in Hindi (CO3H)

Ritesh Kumar, JNU, New Delhi


Computer-mediated Communication

• Computer-mediated communication (the written communication).

• Synchronous/Asynchronous

• 1-way vs. 2-way

• Persistence

• Anonymity/Privacy/Filtering/Quoting

• Message Format

• Buffer Size

Computer-mediated Communication

• Situation Factors

–One-to-one/one-to-many/many-to-many and related factors.

–Social/Personal Details of the participants

–Professional/Social/Aesthetic (Purpose)

–Topic/Theme

–Debate/Game/Casual/Formal (Activity)

–Language/Writing System.

Corpus

• A systematic, structured, careful and large collection of texts (mainly, electronic texts) in one (monolingual corpus) or several (multilingual corpus) languages.

• Project-related corpus vs. corpus for general use.

• Corpus of raw data vs. annotated corpus.

Hindi CMC Corpus

• First CMC corpus of Hindi.

• Chief data sources include

–Blogs written in Hindi

–Web portals in Hindi (8 portals listed on the 'pitara' toolbar)

–Log files of IRC.

Hindi CMC Corpus

–Personal chats from Gmail (with the consent of the users).

–Hindi group discussions on Google groups and Yahoo groups

–Personal emails in Hindi

–Hindi conversations on Facebook (and hopefully on other social networking sites)

Constraints and Sampling

• Huge amount of data and limited time.

• Only blogs listed on chittajagat.in are included in the corpus.

• Only blog entries with 3 or more comments are considered (the same applies to entries on the web portals).

• The initial plan was to collect blog/portal entries from 01/01/07 to 31/12/10, but that now looks like too much.
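The sampling rules above can be sketched as a simple filter. This is only an illustration: the `Entry` record, its field names and the example URLs are hypothetical, and the date window is the initially planned one, which the slide itself says may be narrowed.

```python
from dataclasses import dataclass
from datetime import date

# Values taken from the sampling rules on this slide.
MIN_COMMENTS = 3
WINDOW_START = date(2007, 1, 1)
WINDOW_END = date(2010, 12, 31)


@dataclass
class Entry:
    """Hypothetical record for one blog or portal entry."""
    url: str
    posted_on: date
    comment_count: int


def is_eligible(entry: Entry) -> bool:
    """Return True if the entry falls in the window and has enough comments."""
    in_window = WINDOW_START <= entry.posted_on <= WINDOW_END
    return in_window and entry.comment_count >= MIN_COMMENTS


entries = [
    Entry("http://example.org/a", date(2009, 5, 2), 4),
    Entry("http://example.org/b", date(2009, 5, 3), 1),   # too few comments
    Entry("http://example.org/c", date(2011, 1, 15), 9),  # outside the window
]
selected = [e.url for e in entries if is_eligible(e)]
print(selected)
```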

Constraints and Sampling

• Chatting in Hindi in the public domain is rare.

• I found only two channels on DALnet with some amount of Hindi chatting.

• Log files have been saved since 28/04/10; collection will continue until 30/04/11.

• Log files are not available for every day because of practical constraints.

Constraints and Sampling

• It is very difficult to get people to agree to share their personal emails and chats for the corpus.

• The pages on Google groups, Yahoo groups and Facebook are arranged and presented in such a way that it is very difficult to convert them into properly formatted text files.

• A group discussion must have received at least 2 replies to be included in the corpus; for the other sources, anything in Hindi written between 01/01/07 and 31/12/10 is included.

Storing the Data: Public Chats

• Chat log files are created automatically by the system in plain text (.txt) format, arranged according to the date.

• These files then go through an initial cleaning up process to remove the noise.

• The second step is to separate the conversations in Hindi from those in other languages.
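One possible way to approach the language-sorting step is to look at the script of each line. The sketch below assumes that Hindi is written in Devanagari (Unicode block U+0900 to U+097F); romanised Hindi would not be detected this way, and the function names and the 0.5 threshold are illustrative assumptions, not part of the project.

```python
def devanagari_ratio(text: str) -> float:
    """Share of letters in `text` that belong to the Devanagari block.

    Only characters for which str.isalpha() is true are counted, so
    combining vowel signs, digits and punctuation are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    deva = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return deva / len(letters)


def hindi_lines(lines, threshold=0.5):
    """Keep lines whose letters are at least `threshold` Devanagari."""
    return [ln for ln in lines if devanagari_ratio(ln) >= threshold]


sample = ["kya haal hai", "सब ठीक है", "ok"]
kept = hindi_lines(sample)
print(kept)
```

Note that the first sample line is Hindi in Roman script and is dropped, which shows the limit of a script-based check.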

Storing the Data: Others

• Stored in plain text (.txt) format with UTF-8 encoding.

• The pre-processing removes all the HTML tags, links and other contents that are irrelevant.

• Finally, the file containing only the text is stored.

• Details such as the names of the writers (of the blogs/comments/portals/groups), the date and time, and hyperlinks are retained, while those in private chats, emails and social networking data are anonymised.
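The tag-removal part of the pre-processing could look roughly like the sketch below, which uses Python's standard `html.parser` module. It is a minimal illustration under assumptions: the class and function names are hypothetical, it keeps only visible text (dropping script and style contents), and it does not cover the project's own rules about which links or details are retained.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<p>नमस्ते <a href="http://example.org">दुनिया</a></p>'
    "<script>var x = 1;</script></body></html>"
)
clean = html_to_text(page)
print(clean)
```

The resulting string can then be saved with `pathlib.Path.write_text(clean, encoding="utf-8")` to match the UTF-8 plain-text storage described above.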

Arranging the Data: Blogs

• Blog entries from one blog are grouped together in one folder.

• It is the easiest way to arrange and it also reflects the way data is collected.

• Each folder is identified by a number which refers to an entry in the metadata file.

• The file name of each blog entry reflects the number of comments it has and the date on which the entry was first uploaded on the blog.
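The slide does not give the exact file-name pattern, so the sketch below uses a hypothetical one, `c<comments>_<YYYY-MM-DD>.txt`, purely to illustrate how the comment count and upload date could be encoded in a name and read back from it.

```python
import re
from datetime import date

# Hypothetical pattern: c<comment count>_<ISO upload date>.txt
_PATTERN = re.compile(r"^c(\d+)_(\d{4}-\d{2}-\d{2})\.txt$")


def entry_filename(comment_count: int, posted_on: date) -> str:
    """Build a file name from the comment count and the upload date."""
    return f"c{comment_count:03d}_{posted_on.isoformat()}.txt"


def parse_entry_filename(name: str):
    """Recover (comment_count, posted_on) from a file name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)), date.fromisoformat(match.group(2))


name = entry_filename(12, date(2009, 3, 14))
print(name)
print(parse_entry_filename(name))
```

Under this assumed pattern, two entries from the same blog with the same date and comment count would collide, so a real scheme would need an extra distinguishing element.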

Arranging the Data: Portals and Groups

• The arrangement is a little different from that of the blogs, since there are far fewer portals/groups in Hindi than blogs.

• Data from one portal/group is grouped in one folder.

• The data is further sub-grouped according to the date on which it was retrieved.

• There is one further subgrouping based on the kind (theme) of article/entry.

• The same method of file naming as for blogs is followed.
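The nesting described above (portal or group, then retrieval date, then theme) can be sketched as a path-building helper. The root folder, the numeric portal id, the theme label and the file name in this example are all hypothetical placeholders.

```python
from datetime import date
from pathlib import PurePosixPath


def portal_entry_path(root: str, portal_id: int, retrieved_on: date,
                      theme: str, filename: str) -> PurePosixPath:
    """Compose <root>/<portal>/<retrieval date>/<theme>/<file name>."""
    return (
        PurePosixPath(root)
        / f"{portal_id:03d}"
        / retrieved_on.isoformat()
        / theme
        / filename
    )


path = portal_entry_path(
    "corpus/portals", 2, date(2010, 11, 5), "film-review", "c007_2010-11-01.txt"
)
print(path)
```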

Arranging the Data: Others

• Emails/Chats from one person (anonymised by an id) are stored in one folder and each email conversation/chat transcript is named using the same convention as in other media.

• Conversations from social networking sites are arranged in a similar way.

Preparing the Metadata: Chat

• No metadata is prepared manually for public chats, as this is done automatically by ‘Chatzilla’.

• Moreover, there is not much variety here.

• The data is collected from only two channels (India and Bharat) on the ‘DALnet’ network.

Preparing the Metadata: Blogs

• For blogs, however, a basic metadata file is manually prepared along with the data collection.

• It consists of the address of the blog, a reference to the folder in which the data from that blog is stored, the name and contact (if any) of the writer of the blog, the name of the blog, the date of retrieval, and the number of entries from that particular blog.

• Number of words will be included in this file later on after the noise has been removed.
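The fields listed above could be kept in a simple CSV file. The following is a minimal sketch, assuming one row per blog; the column names, the example values and the choice of CSV are illustrative assumptions, since the slides do not state the actual file format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BlogMetadata:
    """One row of a hypothetical blog metadata file."""
    folder_id: int          # number identifying the blog's folder
    blog_url: str
    blog_name: str
    writer_name: str
    writer_contact: str     # empty string if no contact is available
    retrieved_on: str       # date of retrieval, e.g. "2010-11-05"
    entry_count: int
    word_count: str = ""    # to be filled in after noise removal


def write_metadata(rows, handle) -> None:
    """Write the rows as CSV with a header line."""
    names = [f.name for f in fields(BlogMetadata)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


buffer = io.StringIO()
write_metadata(
    [BlogMetadata(1, "http://example.org/blog", "Example Blog",
                  "A. Writer", "", "2010-11-05", 25)],
    buffer,
)
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["blog_name"], records[0]["entry_count"])
```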

Preparing the Metadata: Portals and Groups

• The metadata for web portals/groups is similar to that for blogs but less extensive.

• It includes the link to the portal/groups and the date of retrieval.

• It will be revised so as to include the links to the individual entries that have been collected.

• It will also include the name of the author.

Preparing the Metadata: Others

• The metadata for personal communication (in emails/chats/social networking) mainly consists of details of the participants, such as age, gender, relationship between the participants, educational qualification, native language, etc.

• These details are linked to an anonymised id which will identify the participants everywhere.

• The names of the participants will be stored in a separate file which will not be generally accessible.
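The split between anonymised ids and a restricted name file can be sketched as follows. This is an assumption-laden illustration: the `Anonymiser` class, the `P0001`-style id format and the example names are hypothetical, and the real project may assign ids differently.

```python
import itertools


class Anonymiser:
    """Assign a stable id to each participant name."""

    def __init__(self, prefix: str = "P"):
        self._prefix = prefix
        self._ids = {}
        self._counter = itertools.count(1)

    def id_for(self, name: str) -> str:
        """Return the id for `name`, creating one on first use."""
        key = name.strip().lower()
        if key not in self._ids:
            self._ids[key] = f"{self._prefix}{next(self._counter):04d}"
        return self._ids[key]

    def key_table(self) -> dict:
        """Name-to-id mapping; this is what would go in the restricted file."""
        return dict(self._ids)


anon = Anonymiser()
first = anon.id_for("Asha")
second = anon.id_for("Ravi")
again = anon.id_for(" asha ")

# Public metadata is keyed by id only; names never appear here.
profiles = {first: {"gender": "F", "native_language": "Hindi"}}
print(first, second, again)
```

Only the output of `key_table()` links names to ids, so keeping that mapping in a separate, access-controlled file matches the arrangement described above.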

Progress So Far

Media | Entries/Transcripts Collected | Immediate Target

Blogs | 100 blog sites, totaling more than 2500 blog entries | At least 5000 blog entries

Web Portals | 2 web portals, totaling around 200 entries | At least 8 different web portals

Groups | 5 Google groups, totaling around 500 discussions | At least 200 Google groups

E-mails | More than 20,000 e-mails, but only a few hundred of them are in Hindi | At least 5000 Hindi e-mails

Public Chats | Public chats of around 200 days | Till 30th April, 2011

Private Chats | More than 1500 private chat transcripts | At least 2500 chat transcripts

A Snapshot of the Data: Chat

● As I have mentioned earlier, chats are mostly not in Hindi.

● A few sentences are written in Hindi and some exchanges take place in Hindi, but there are very few full-fledged conversations as such.

● Almost like a spoken group discussion among friends, acquaintances and strangers.

● Short utterances, plenty of gestures (in words), generally casual and friendly.

● Sample 1

A Snapshot of the Data: Blogs

• Blogging in Hindi has really exploded in the last few years.

• Lakhs of blog entries in Hindi (more than 6 lakh entries on more than 15,000 blog sites).

• They are largely characterised by formal, written Hindi.

• Entries are mainly on some socio-political issues or the writer’s personal creations like poetry, stories, etc.

A Snapshot of the Data: Blogs

• Most of the blogs are read and commented upon by a closed circle of followers, friends and acquaintances.

• Comments are characterised by mostly formal approval or mild suggestions and requests.

• There are very few direct attacks and straightforward counter-arguments.

• Sample 1

• Sample 2

A Snapshot of the Data: Portals

• Web portals are similar to blogs in using formal Hindi.

• However the comments received by the entries on portals sometimes go beyond just token appreciation as in blogs.

• They could be a very direct and venomous attack without any attempts at ‘face saving’, probably because portals are considered more organisational and less personal than blogs.

• Sample 1 (on a controversial social issue)

• Sample 2 (a movie review).

Issues and Concerns

• The ethical/copyright issues.

• Moving towards a more balanced corpus is also one of the prime concerns.

• Does not look like a very comprehensive corpus.

• Availability of CMC in Hindi.

Thank you!