[Userinvolvement] REQUEST – help on compiling overview of social media corpora

Mon May 1 11:21:17 CEST 2017

Dear Jakob and Darja,

Would you like me to send out a few tweets or FB posts asking whether people know of any such resources? I could then add the sources to your spreadsheet.

Best,
Karolina

> On 26 Apr 2017, at 09:51, Lenardič, Jakob <Jakob.Lenardic at ff.uni-lj.si> wrote:
> 
> Dear all,
> 
> Darja and I have been working on an overview of corpora containing data from social media platforms (e.g. Twitter, Facebook, blogs and fora) available in CLARIN member countries. We are doing this in light of the forthcoming CLARIN-PLUS workshop on social media data, which will be held on 18 and 19 May in Kaunas, Lithuania <https://www.clarin.eu/event/2017/clarin-plus-workshop-creation-and-use-social-media-resources>.
> 
> We are interested in identifying three types of resources:
> 1) corpora of social media data that can be used for various kinds of linguistic analyses, such as the Finnish Suomi 24 Corpus <http://metashare.csc.fi/repository/browse/the-suomi-24-corpus-2016h2/eb323320f44d11e6b70e005056be118e30dc4e74e4654a4a8b3e8789ef31c0d0/>,
> 
> 2) smaller, specialized datasets for particular NLP tasks, such as the CMC training corpus Janes-Tag 1.2 <https://www.clarin.si/repository/xmlui/handle/11356/1085>, and
> 
> 3) NLP tools adapted or developed for (noisy) social media language, such as csmtiser <https://github.com/clarinsi/csmtiser>, which is a tool for text normalisation via character-level machine translation developed by CLARIN.SI members.
> 
> In terms of the metadata, we are looking for the following information:
> Language(s)
> Size (in tokens)
> Period (from-to)
> Annotation & tools
> Availability
> License
> Key publication
> 
> The results of our preliminary investigation can be seen in the Google spreadsheet <https://docs.google.com/spreadsheets/d/1sbTvCTjmkXFjVfA2kOUoj1NRDm48R7UiLmabIjHLMRQ/edit?usp=sharing>. As you can see, we haven’t been able to find relevant corpora/datasets/tools for Bulgaria, Denmark, Lithuania, Latvia, Portugal and Hungary. For several of the corpora/datasets/tools that we have identified, some metadata are incomplete. Finally, there may be corpora/datasets/tools that we are not yet aware of, and we would be grateful to learn about them.
> 
> For this reason, I would kindly like to invite you to fill in the missing data on behalf of your consortium in the spreadsheet, or send me the missing information by email if that’s easier for you. I am looking forward to your contributions by 8 May.
> 
> Best,
> Jakob
> _______________________________________________
> Userinvolvement mailing list
> Userinvolvement at lists.clarin.eu <mailto:Userinvolvement at lists.clarin.eu>
> https://lists.clarin.eu/cgi-bin/mailman/listinfo/userinvolvement <https://lists.clarin.eu/cgi-bin/mailman/listinfo/userinvolvement>

Karolina Badzmierowska
CLARIN ERIC Communications Officer | Utrecht University | Drift 10, 3512 BS Utrecht, The Netherlands | Room 2.05 | Working days: Mon-Tue
