[Tf-curation] l/a synthesis

Durco, Matej Matej.Durco at oeaw.ac.at
Wed Feb 3 17:28:50 CET 2016


Dear Krister,

thank you a lot for the extensive response, this is really very helpful!

In my view, your clarifications regarding NC/ACA and derivative data should definitely find their way into the public information about L/A [1].

The hint that the three main categories imply certain subcategories (ACA => ID;BY;NORED) is also very helpful.
I just wonder whether we want to make it *explicit* in the VLO (i.e. also add the ID, BY and NORED attributes to every resource with the ACA tag), or just explain it (somewhere under [1]) and leave it implicit in the VLO.

In the list you proposed to map a few "restricted ..." values with "*" (or other), which seems a bit counterintuitive,
but I guess this has to do with the special meaning of "RES"?

The next steps:
We are currently turning Krister's mapping into a normalization map as used by the VLO.
We will apply it on our Minerva VLO instance first and let you inspect the new mappings, probably on Monday.
We will also tentatively try to map from the dc-concepts (dcterms:rights, dcterms:accessRights, dcterms:license) to see if we can get better coverage (the profile coverage analysis [2] suggests so).
After a validation and comment period of a few days, we would apply the mapping in the main VLO instance (and roll it out with version 3.4).

There is one more thing we would like feedback on, especially from CLIC: the labels and definitions for the L/A-related facets.
But I will save that for a separate email.

Thank you for all the input so far.

Best,
Matej

[1] https://www.clarin.eu/content/license-categories
[2] https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1


-----Original Message-----
From: Krister Lindén [mailto:krister.linden at helsinki.fi]
Sent: Sunday, 31 January 2016 22:26
To: Durco, Matej <Matej.Durco at oeaw.ac.at>; Penny Labropoulou <penny at ilsp.gr>; 'Twan Goosen' <twan.goosen at mpi.nl>; 'Sander Maijers' <sander at clarin.eu>
Cc: 'Menzo Windhouwer2' <menzo.windhouwer at meertens.knaw.nl>; 'Thomas Eckart' <teckart at informatik.uni-leipzig.de>; Ostojic, Davor <Davor.Ostojic at oeaw.ac.at>; tf-curation at lists.clarin.eu; Sugimoto, Go <Go.Sugimoto at oeaw.ac.at>; 'Dieter Van Uytvanck' <dieter at clarin.eu>
Subject: Part I: Re: AW: AW: [Tf-curation] License/Availability was WG: Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac

Dear all,

I could probably have responded in one long email, but I chose to answer in three parts. As a result, I end up confirming here some of the things that Penny said in the next message, while also adding some explanations. Here goes.

Part I:

On 21.1.2016 17:04, Durco, Matej wrote:
> ACA vs. NC
> as you rightly commented in the gsheet
>
> 1)
>
> * ad: PUB/ACA/RES yes, the goal is to have one of these 3 categories
> assigned to each record/resource. ... the way it is solved in 
> META-SHARE profiles brought me to the idea to decompose to license 
> categories. Namely in META-SHARE profile the licenceInfo is quite 
> rich: There is a licence-element and the repeated restrictionsOfUse 
> element, so each record has more than enough information to correctly 
> map to both the main categories and the optional ones. (see example 
> Helsinki corpus [1]) Therefore I believe we can (and have to) be 
> conservative in the mapping and can avoid adding uncertain 
> information: Correct me if I am wrong but "non-commercial use" can be 
> safely mapped to the License category "NC",

Yes it can. There is currently a legal debate going on about what non-commercial means exactly, because it is a fuzzy concept, but whenever that discussion arrives at a conclusion, which may be as soon as the new directive for some kind of research exception emerges, that is the definition we will adopt.

> however mapping it to the main category PUB ACA or RES is problematic 
> without more information. Long story short, even though we cannot map 
> each individual value in the normalization map to one of the main 
> categories, in the end each record/resource (that provides the
> appropriate information) will be assigned to one of the three.

Agree.

> * ad: atomic vs. combined License Categories We want to try with 
> decomposition, i.e. atomic categories as separate facet values (BY, 
> SA, ...)

Good.

> * ad: indication of license/availability information being unavailable 
> in the dev-instance we use a placeholder "[missing value]" (actually 
> in all facets), but it needs to be decided if we want to expose this 
> in the main vlo, especially given the many records falling into this 
> category. we cannot say "non standard license", because we don't know. 
> we can only say "unspecified" or synonyms thereof, "unspecified" being 
> sometimes used as value itself.

Unspecified is OK.
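
The conservative mapping discussed above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory normalization map (the two entries are taken from the examples in this thread; the real VLO map is an XML resource whose format is not reproduced here), and the function names are illustrative only. Individual values map only to the categories they safely imply, a main category emerges at record level, and records without any recognized value fall back to "Unspecified".

```python
# Hypothetical normalization map: raw metadata value -> atomic category tags.
# Only the two examples mentioned in this thread are listed; this is not
# the actual content of the VLO LicenseAvailabilityMap.
NORMALIZATION_MAP = {
    "non-commercial use": {"NC"},        # safe: says nothing about PUB/ACA/RES
    "clarin_aca-nc": {"ACA", "NC"},      # CLARIN-compliant licence label
}

MAIN_CATEGORIES = {"PUB", "ACA", "RES"}
UNSPECIFIED = "Unspecified"


def normalize_record(raw_values):
    """Collect the atomic categories for all raw values of one record."""
    tags = set()
    for value in raw_values:
        # Unknown values contribute nothing rather than a guessed category.
        tags |= NORMALIZATION_MAP.get(value.strip().lower(), set())
    return tags if tags else {UNSPECIFIED}


def main_categories(tags):
    """Return the main categories (PUB/ACA/RES) present among the tags."""
    return tags & MAIN_CATEGORIES


record = normalize_record(["non-commercial use", "CLARIN_ACA-NC"])
print(sorted(record), sorted(main_categories(record)))
```

A record that only carries "non-commercial use" yields NC without any main category, which matches the point that single values cannot always be mapped to PUB, ACA or RES.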

> 2) ad: C-* facets It's actually the opposite, these are special facets 
> exposing the values individual concepts that contribute to the actual 
> availability/license facets. (concept-facet mapping) The overview of 
> these concepts, incl. definition (copied from the source) and links to 
> CCR are in the trac-wiki [1]. These C-* facets are exactly meant to be 
> able to identify, where the individual values in the availability 
> facet came from. Identifying the underlying concepts is more-or-less 
> the closest we can get you (easily), as VLO does not keep the 
> information from which actual profile/element given value comes from. 
> (this is also in response to 2nd point of 3) ). However one can get 
> this information in the detail-view (looking into full metadata 
> record). Not super convenient, but well. And as said the ProfileName 
> and DataProvider facets help you identify the provider and profile in 
> question.

OK.

> ad: licence type vs. license type Yes, indeed there is both a "licence 
> type" and a "license type" concept. they come from ISOcat still 
> (DC-3800 and DC-5439 resp.) I added a snapshot from SMC-browser to the 
> wiki page [2] showing where these two come from (in which profiles 
> they are used and what was the context of these profiles.
> (also attaching the snapshot) It is 3 and 5 profiles using these, I 
> guess it would be possible in this case to ask the authors (with the 
> help from the CCR and CMDI team) to merge these two and correct the 
> profiles accordingly.

Good idea.

> Two more points from my side: AFAI understood there is a conflict in 
> the understanding of PUB/ACA/RES in CLARIN and in META-SHARE, in 
> META-SHARE everything beyond CC-0 being of availability:restrictedUse. 
> Is that correct? The example above [1] delivers also the CLARIN 
> compliant licence (CLARIN_ACA-NC), but I doubt that this is the case 
> for all META-SHARE records. So in my understanding we need to 
> disregard the availability information in resourceInfo-profile and 
> just regard the licence and restrictionsOfUse. Would you agree?

I seem to remember that META-SHARE made a point of declaring everything except CC0 restricted, which may be true from a legal point of view, although I don't think they used this for any particular purpose as all the regular CC licenses then also fall into the META-SHARE restricted category.

In CLARIN, the RES category was intended to be used for resources "restricted to individual use" typically containing personal data preventing them from being opened to a broader category of users. This is often referred to only as "restricted use" due to the RES acronym and therefore misinterpreted in view of the META-SHARE terminology.

> The next question that is not clear to me: - Is NC equivalent with 
> ACA? Because then we have a problem with CC-NC?

No. NC is not equivalent with ACA.

In its basic form, ACA means "resources available for educational, teaching and research purposes" including commercial research, so we need NC to specify that an ACA resource is available only for non-commercial purposes.

In addition, ACA implies ID, i.e. "A user needs to be authenticated or identified.", and BY, as that is required by law in most EU countries anyway. (This is why there is CC0 to explicitly say that we don't care about attribution.)

Authentication implies more than self-identification for collecting usage statistics, so someone needs to verify the identity. For this we need an affiliation to some community that can authenticate the user. 
We currently offer two flavors of affiliation: EDU and META. If nothing else is mentioned, EDU is assumed (which is the pure ACA), but if META is mentioned (by saying ACA+META), we also acknowledge that the META community, which includes industrial partners, may do the authentication. How they do it is up to them.

In contrast to ACA resources, we may also have resources available for any purpose that still require self-identification for collecting usage statistics, e.g. the IP address may be collected, or some email address, or whatever means of identification the distributor of the resource chooses. This does not restrict access to the resource to a particular community, so we can put such resources in the category PUB+ID.

In order to be able to control the ID for authentication, ACA also implies NORED. If the resource could be distributed freely to other researchers, automated authentication could not be implemented and would also not make sense.

More generally the following implications hold:

  ACA => ID;BY;NORED
  ACA;META => ID;BY
  RES => ID;BY;NORED

> I hope I did not add more confusion.

I hope my answers clarified some parts.

--
Krister

