[Userinvolvement] REQUEST - help on compiling overview of social media corpora

Tue May 9 13:23:52 CEST 2017

Dear Jakob, dear all,

Thank you for the very useful overview of social media resources. 

We have filled in the missing information on the Greek social media resources listed, directly in the spreadsheet (row 10).

I have also asked some colleagues if they are aware of any resources of the types described in your mail; I'll forward any answers I get.

Best regards,

Maria

Maria Gavrilidou

ILSP/R.C. 'Athena'

Epidavrou & Artemidos 6

GR-15125 Marousi

Athens

Greece

Tel.: +30 210 6875441

Email: maria at ilsp.athena-innovation.gr <mailto:maria at ilsp.athena-innovation.gr>  

URL: www.ilsp.gr <http://www.ilsp.gr/> 

From: userinvolvement-bounces at lists.clarin.eu [mailto:userinvolvement-bounces at lists.clarin.eu] On Behalf Of Lenardic, Jakob
Sent: Wednesday, April 26, 2017 10:51 AM
To: userinvolvement at lists.clarin.eu
Subject: [Userinvolvement] REQUEST - help on compiling overview of social media corpora

Dear all,

Darja and I have been working on an overview of corpora containing data from social media platforms (e.g. Twitter, Facebook, blogs, fora) available in CLARIN member countries. We are doing this in light of the forthcoming CLARIN-PLUS workshop on social media data that will be held on 18 and 19 May in Kaunas, Lithuania <https://www.clarin.eu/event/2017/clarin-plus-workshop-creation-and-use-social-media-resources>.

We are interested in identifying three types of resources:

	1) corpora of social media data that can be used for various kinds of linguistic analyses, such as the Finnish Suomi 24 Corpus <http://metashare.csc.fi/repository/browse/the-suomi-24-corpus-2016h2/eb323320f44d11e6b70e005056be118e30dc4e74e4654a4a8b3e8789ef31c0d0/>,

	2) smaller, specialized datasets for particular NLP tasks, such as the CMC training corpus Janes-Tag 1.2 <https://www.clarin.si/repository/xmlui/handle/11356/1085>, and

	3) NLP tools adapted or developed for (noisy) social media language, such as csmtiser <https://github.com/clarinsi/csmtiser>, which is a tool for text normalisation via character-level machine translation developed by CLARIN.SI members.

In terms of the metadata, we are looking for the following information:

*	Language(s)
*	Size (in tokens)
*	Period (from-to)
*	Annotation & tools
*	Availability
*	License
*	Key publication

The results of our preliminary investigation can be seen in the Google spreadsheet <https://docs.google.com/spreadsheets/d/1sbTvCTjmkXFjVfA2kOUoj1NRDm48R7UiLmabIjHLMRQ/edit?usp=sharing>. As you can see, we haven't been able to find relevant corpora/datasets/tools for Bulgaria, Denmark, Lithuania, Latvia, Portugal and Hungary. For several of the corpora/datasets/tools that we have identified, some metadata are incomplete. Finally, there may be corpora/datasets/tools that we are not yet aware of, and we would be grateful to learn about them.

For this reason, I would like to invite you to fill in the missing data on behalf of your consortium in the spreadsheet, or to send me the missing information by email if that's easier for you. I look forward to your contributions by 8 May.

Best,

Jakob
