[Userinvolvement] Manually annotated training corpora - CLARIN resource families

Martin Wynne martin.wynne at bodleian.ox.ac.uk
Tue Dec 4 13:59:45 CET 2018


Hi,

I've added the BNC Sampler, which was manually post-edited to correct 
the automatically assigned part-of-speech tags. Perhaps we should add 
another column for 'Annotation notes' so that we can include information 
like this?

Best,
Martin

On 04/12/2018 12:22, Koenraad De Smedt wrote:
> Hi,
>
> Almost no treebanks are fully manually annotated, but a lot of 
> treebanks are semi-manually annotated. Machine parses are often 
> corrected as needed by annotators. In other cases machine parses are 
> manually disambiguated. I am going to assume that those semi-manually 
> constructed treebanks, which are indeed mentioned as training corpora, 
> are also of interest for the current survey.
>
> Best,
> Koenraad
>
>> On 4 Dec 2018, at 10:59, Pavel Stranak <stranak at ufal.mff.cuni.cz> wrote:
>>
>> Hi Jakob,
>>
>> I am not sure I understand the "training corpus" concept, but if you 
>> mean any manually annotated resource (which by definition can be used 
>> for supervised training), then the list is missing at the very least 
>> all the treebanks.
>>
>> -Pavel
>>
>>
>>
>>> On 3 Dec 2018, at 19:06, Lenardič, Jakob 
>>> <Jakob.Lenardic at ff.uni-lj.si> wrote:
>>>
>>> Dear all,
>>>
>>> as part of the CLARIN Resource Families initiative, we are 
>>> conducting a survey of *manually-annotated training* corpora. We have 
>>> prepared the preliminary results based on the VLO and the national 
>>> CLARIN repositories:
>>>
>>> https://docs.google.com/spreadsheets/d/1A12KnLUboHu-SPRY5HfvpkuV6clhN_HFmp7IU_jqC9I/edit?usp=sharing
>>> We would appreciate it if you could add any resources and info that 
>>> we have missed and correct any mistakes we have made. Note that we 
>>> are looking for corpora that have been designed specifically for 
>>> training language tools, such as PoS-taggers, Named-Entity 
>>> recognizers, dependency parsers, etc. Comments and suggestions by 
>>> email are welcome too. We are collecting feedback until December 20, 
>>> after which we will prepare the report.
>>>
>>> Best,
>>> Jakob
>>>
>>>
>>> asist. Jakob Lenardič
>>> Oddelek za prevajalstvo / Department of Translation
>>> Filozofska fakulteta / Faculty of Arts
>>> Univerza v Ljubljani <http://www.uni-lj.si/>
>>> Aškerčeva cesta 2, SI-1000 Ljubljana, Slovenija / Slovenia
>>> T.: 241-1143
>>> Jakob.Lenardic at ff.uni-lj.si, www.ff.uni-lj.si <http://www.ff.uni-lj.si/>
>>>
>>> _______________________________________________
>>> Userinvolvement mailing list
>>> Userinvolvement at lists.clarin.eu <mailto:Userinvolvement at lists.clarin.eu>
>>> https://lists.clarin.eu/cgi-bin/mailman/listinfo/userinvolvement
>>
>
