[Userinvolvement] REQUEST – help on compiling overview of social media corpora

Lenardič, Jakob Jakob.Lenardic at ff.uni-lj.si
Wed Apr 26 09:51:23 CEST 2017


Dear all,


Darja and I have been working on an overview of corpora containing data from social media platforms (e.g. Twitter, Facebook, blogs and fora) available in CLARIN member countries. We are doing this in light of the forthcoming CLARIN-PLUS workshop on social media data, which will be held on 18 and 19 May in Kaunas, Lithuania<https://www.clarin.eu/event/2017/clarin-plus-workshop-creation-and-use-social-media-resources>.


We are interested in identifying three types of resources:

1) corpora of social media data that can be used for various kinds of linguistic analyses, such as the Finnish Suomi 24 Corpus<http://metashare.csc.fi/repository/browse/the-suomi-24-corpus-2016h2/eb323320f44d11e6b70e005056be118e30dc4e74e4654a4a8b3e8789ef31c0d0/>,

2) smaller, specialized datasets for particular NLP tasks, such as the CMC training corpus Janes-Tag 1.2<https://www.clarin.si/repository/xmlui/handle/11356/1085>, and

3) NLP tools adapted or developed for (noisy) social media language, such as csmtiser<https://github.com/clarinsi/csmtiser>, which is a tool for text normalisation via character-level machine translation developed by CLARIN.SI members.

In terms of the metadata, we are looking for the following information:

  *   Language(s)
  *   Size (in tokens)
  *   Period (from-to)
  *   Annotation & tools
  *   Availability
  *   License
  *   Key publication

The results of our preliminary investigation can be seen in the Google spreadsheet<https://docs.google.com/spreadsheets/d/1sbTvCTjmkXFjVfA2kOUoj1NRDm48R7UiLmabIjHLMRQ/edit?usp=sharing>. As you can see, we haven’t been able to find relevant corpora/datasets/tools for Bulgaria, Denmark, Lithuania, Latvia, Portugal and Hungary. For several of the corpora/datasets/tools that we have identified, some metadata are incomplete. Finally, there may be corpora/datasets/tools that we are not yet aware of and would be grateful to learn about.


For this reason, I would like to invite you to fill in the missing data on behalf of your consortium in the spreadsheet, or to send me the missing information by email if that’s easier for you. I am looking forward to your contributions by 8 May.


Best,

Jakob