[Tf-curation] curation-related request about the Nederlands Instituut voor Beeld en Geluid Academia collectie

Wed Mar 7 10:59:37 CET 2018

Dear Dieter,

I had a brief look at this. I am a bit hesitant to have them removed from the VLO.

Though it is true that these records lead to a lot of Dutch-language values for the facet ‘Subject’, I believe that ‘Subject’ is not suited to being a facet with a restricted number of values. Currently there are more than 51k possible values, and it is unlikely that this number will decrease. Taking out the Academia collection will not really solve this problem. One should see this facet as a facet for string search in the values of a limited number of fields, not as a facet with restricted values or a small number of values.

I see many values in languages other than English, including German, French, Spanish, Portuguese, Chinese (or Japanese?), as well as codes that are incomprehensible without a legend (e.g. sh85091588). Some of the values are complete phrases or even sentences, so perhaps it is better to reconsider this facet.

One possible approach, which I think might partly solve the problem you mention, is to mark every field from which the Subject facet is derived with the language it is in, and to use that marking so that users can search for strings in a particular language. For the Academia records this could be done fully automatically (since all are in Dutch), or it could be added at the curation level. How difficult it is to do this automatically for metadata from other origins I do not know.
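
As a rough illustration of the automatic variant, here is a minimal sketch. It assumes that the source fields are plain XML elements and that the language is recorded with the standard xml:lang attribute. The element name "subject", the sample record, and the function name tag_language are hypothetical and are not taken from the actual Academia records or their CMDI profile.

```python
import xml.etree.ElementTree as ET

# Clark-notation key under which ElementTree stores the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record; real Academia records will look different.
SAMPLE = (
    "<record>"
    "<subject>politiek</subject>"
    '<subject xml:lang="en">politics</subject>'
    "<title>Journaal</title>"
    "</record>"
)


def tag_language(xml_text, element_names, lang="nl"):
    """Add xml:lang=lang to every element whose local name is in
    element_names, leaving elements that already carry xml:lang alone."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop a "{namespace}" prefix, if any, to get the local name.
        local_name = elem.tag.split("}")[-1]
        if local_name in element_names and XML_LANG not in elem.attrib:
            elem.set(XML_LANG, lang)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tag_language(SAMPLE, {"subject"}, "nl"))
```

Existing xml:lang values are left untouched on purpose, so a value that was already marked as being in another language would not be overwritten. A search interface could then filter the Subject values on this attribute.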

As to access: if I understood correctly, access to these resources will soon be given to all academic organisations in the Netherlands, free of charge. It is still limited to organisations in the Netherlands, though. But that should not lead to exclusion from the VLO. The VLO contains many descriptions of resources with limitations on access. Having these metadata in the VLO might lead researchers from outside the Netherlands to these resources, and they can arrange access in an ad-hoc way, if that is crucial to their research. If these metadata were not in the VLO, these data might not be found at all. After all, we want all researchers to find their data via one entry point: the CLARIN VLO.

Jan

From: Dieter Van Uytvanck [mailto:dieter at clarin.eu]
Sent: Tuesday 30 January 2018 12:15
To: Odijk, J.E.J.M. (Jan); tf-curation at lists.clarin.eu
Subject: curation-related request about the Nederlands Instituut voor Beeld en Geluid Academia collectie

Dear Jan,

Recently during a meeting on the quality of the metadata in the VLO the Beeld en Geluid academia collection<https://vlo.clarin.eu/?fqType=collection:or&fq=collection:Nederlands+Instituut+voor+Beeld+en+Geluid+Academia+collectie> was brought up as a source of problematic metadata, mainly because:

  *   the metadata per record is rather sparse and purely in Dutch (example<https://vlo.clarin.eu/data/clarin/results/cmdi/Nederands_Instituut_voor_Beeld_en_Geluid_OAI_PMH_repository/oai_beeldengeluid_nl_Expressie_3844948.xml>), leading to many "noise" entries for the facet Subject
  *   access to these resources is limited (to a subset<https://www.academia.nl/licentiehouders> of the Dutch academic organisations)

Since I realize it was not trivial to create all this metadata during the CLARIN-NL project, I was wondering what your opinion on this is. Could it be an option to remove this collection from the VLO?

best regards,

--

Dieter Van Uytvanck

Technical Director CLARIN ERIC

www.clarin.eu<http://www.clarin.eu> | tel. +31-(0)850091363 | skype: dietervu.mpi