[Tf-curation] l/a synthesis

Ondřej Košarko kosarko at ufal.mff.cuni.cz
Wed Feb 10 14:29:55 CET 2016


Hello everyone,

I am getting slightly concerned with the various mentions that the license
labeling should be coming from the producers (centers) themselves. Is that
really so or am I misreading the discussion?

As you've been writing all along, the labels are just an
interpretation/summary of the actual conditions, so what happens if two
centers choose to label the same license differently? Wouldn't that be
confusing in the results? Or are the incoming labels meant to be used only
for "not well known licenses"?
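
To make the concern concrete, here is a minimal sketch of how conflicting labels could be detected once labels arrive from several centers. The record structure (center, license name, tag set) and the example license names are assumptions made for illustration; they are not how the VLO stores anything.

```python
from collections import defaultdict


def find_label_conflicts(records):
    """Return licenses that different centers labeled with different tag sets.

    `records` is an iterable of (center, license_name, tags) tuples.
    This structure is hypothetical and only serves the illustration.
    """
    labels_by_license = defaultdict(dict)
    for center, license_name, tags in records:
        labels_by_license[license_name][center] = frozenset(tags)

    conflicts = {}
    for license_name, by_center in labels_by_license.items():
        # More than one distinct tag set means the centers disagree.
        if len(set(by_center.values())) > 1:
            conflicts[license_name] = by_center
    return conflicts


# Made-up example data: two centers, two licenses.
records = [
    ("Center A", "Example Research Licence", {"ACA", "ID", "BY", "NORED"}),
    ("Center B", "Example Research Licence", {"PUB", "BY"}),
    ("Center A", "CC-BY", {"PUB", "BY"}),
    ("Center B", "CC-BY", {"PUB", "BY"}),
]
conflicts = find_label_conflicts(records)
```

With the made-up data above, only "Example Research Licence" would be reported, since both centers agree on CC-BY.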

Regards,
Ondrej

2016-02-05 16:09 GMT+01:00 Krister Lindén <krister.linden at helsinki.fi>:

> Since we are interpreting what others have written, we need to take a
> conservative view not to give the impression that the data is more freely
> available than it is. We are currently in the process of interpreting the
> usage conditions in the metadata by going from "unspecified" to something
> slightly more informative. We therefore interpret the resources to be as
> freely accessible as we safely can. If the metadata provider is unhappy
> with how narrowly his legal metadata is interpreted in the VLO, he can set
> the elements more exactly in his CMDI metadata.
>
> (Non-CLARIN Centers have the same opportunity. For a CLARIN Center, CLARIN
> ERIC has added responsibility for auditing the quality of the relationship
> between the licenses and the CMDI metadata.)
>
> Penny's advice to be careful is good but it actually works in the other
> direction based on the principle that one cannot give more rights than one
> has. Without additional info, "for research" can be narrowly approximated
> by at least +LRT without saying whether the data is available for all
> (PUB), for a trusted community (ACA) or upon personal request to the owner
> (RES). However, using only +LRT would likely keep the resource in the
> "unspecified" main category, whereas using ACA is a rather safe bet stating
> that the data is available for research while also assuming that the
> downloader needs to be identified to access the data, which in most cases
> is unfortunately still true for research data that is not explicitly
> licensed with one of the open or public licenses.
>
> If we have additional knowledge that +ID is not required by the license of
> the resource, then Penny's suggestion for PUB+research can be narrowly
> approximated with tags saying that at least PUB+LRT is safe to assume.
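
The conservative reading described above could be sketched roughly as follows. The function name and the boolean flag are hypothetical; only the tag names (ACA, PUB, LRT) and the two outcomes come from the message.

```python
def approximate_research_tags(id_known_not_required=False):
    """Conservatively approximate a bare "for research" statement.

    Hypothetical helper: without further knowledge, ACA is taken as the
    safe bet; if the license is known not to require identification,
    PUB+LRT is assumed instead.
    """
    if id_known_not_required:
        return {"PUB", "LRT"}
    return {"ACA"}


default_tags = approximate_research_tags()
open_tags = approximate_research_tags(id_known_not_required=True)
```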
>
> Regards,
> Krister
>
>
> On 5.2.2016 11:19, Penny Labropoulou wrote:
>
>> Dear Matej and Krister,
>> If indeed the VLO harvests only from CLARIN centres, and following
>> Krister's explanations, ok, let's have the tags explicit.
>> But I have the feeling that the VLO also harvests from other sources, and
>> these may not all include CMDI metadata or, even more, a licensing category
>> or even a licence (which should be imposed at least for new data!); if this
>> is the case, then we are actually interpreting the providers' metadata,
>> often just free text statements, in which case we should be more careful, I
>> think. If the providers simply state "for research" and we interpret that
>> as ACA (as done in our excel), then the ID tag may be more than what the
>> original providers ask for; if asked, they might have gone for PUB
>> +research. In any case, the users are directed through the VLO to where the
>> resource itself is made available, and there, the users will have to accept
>> whatever licensing conditions the provider asks for and it's up to the
>> source distributor to enforce it. If the source distributor is a CLARIN
>> centre, the ID will be imposed by our own policy, and it's clearly stated,
>> as Krister says, in the agreement template.
>>
>> I would like to see the ID being used as a way of facilitating access to
>> resources for researchers in a trusted federation such as CLARIN, rather
>> than a way of discouraging access to resources.
>> As said, if the VLO harvests only from CLARIN centres, just disregard all
>> the above.
>> Best,
>> Penny
>>
>> -----Original Message-----
>> From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>> Sent: Friday, February 05, 2016 3:55 AM
>> To: Durco, Matej <Matej.Durco at oeaw.ac.at>; Penny Labropoulou <
>> penny at ilsp.gr>; 'Twan Goosen' <twan.goosen at mpi.nl>; 'Sander Maijers' <
>> sander at clarin.eu>
>> Cc: 'Menzo Windhouwer2' <menzo.windhouwer at meertens.knaw.nl>; 'Thomas
>> Eckart' <teckart at informatik.uni-leipzig.de>; Ostojic, Davor <
>> Davor.Ostojic at oeaw.ac.at>; tf-curation at lists.clarin.eu; Sugimoto, Go <
>> Go.Sugimoto at oeaw.ac.at>; 'Dieter Van Uytvanck' <dieter at clarin.eu>
>> Subject: Re: l/a synthesis
>>
>> Dear Matej,
>>
>> Regarding the explicitness of tags: In the current Agreement templates
>> for a normal ACA resource, the federated login, attribution and no
>> redistribution conditions are made explicit. It would therefore be better
>> to reflect this in the tags in the VLO for this automated update.
>>
>> Not having the tags explicit may cause liability for CLARIN in some
>> cases as it encourages unintended usage, whereas being slightly too strict
>> in the labeling will have no legal implications. People will only be
>> pleasantly surprised that some resources are more widely useable than they
>> imagined.
>>
>> When CLARIN Centers provide their own licenses with tags already marked
>> as CMDI components, they will be responsible for the labeling of their own
>> licenses. If they do not e.g. require login for their particular brand of
>> licenses "for (teaching, education and) research-purpose", they may leave
>> out the ID tag, but if the ID tag is only a non-explicit assumption via the
>> guidelines, the Centers will not even be able to leave it out, as it should
>> always be implicitly read into the tag set.
>> (Note that the assessment of the license labeling should be part of the
>> regular CLARIN Center assessment procedure.)
>>
>> The "other" tag is there to draw the attention of the user to peculiar
>> but relevant usage conditions similar to "only to be used on Tuesdays"
>> or the like. We can't have a tag for everything, but an asterisk is an
>> indication that this license has conditions out of the ordinary. We are
>> aware that recognizing what is out of the ordinary may be non-trivial.
>>
>> Regards,
>> Krister
>>
>>
>> On 3.2.2016 18:28, Durco, Matej wrote:
>>
>>> Dear Krister,
>>>
>>> thank you a lot for the extensive response, this is really very helpful!
>>>
>>> In my view, your clarifications regarding NC/ACA and derivative data
>>> should definitely find their way into the public information about L/A [1].
>>>
>>> The hint that the three main categories imply certain subcategories (ACA
>>> => ID;BY;NORED) is also very helpful.
>>> I just wonder, if we want to make it *explicit* in the VLO (i.e. add for
>>> every resource with ACA tag, also the ID, BY and NORED attributes), or just
>>> explain it (somewhere under [1]) and leave that implicit in VLO.
>>>
>>> In the list you proposed to map a few "restricted ..." values with "*"
>>> (or other), which seems a bit counterintuitive, but I guess this has to
>>> do with the special meaning of "RES"... ?
>>>
>>> The next steps:
>>> We are currently processing Krister's mapping into a normalization map as
>>> used by the VLO.
>>> We will apply it on our Minerva VLO instance first and let you inspect
>>> the new mappings, probably on Monday.
>>> We will also tentatively try to map from the dc-concepts
>>> (dcterms:rights, dcterms:accessRights, dcterms:license), to see if we
>>> can get better coverage (the profile coverage analysis [2] suggests so).
>>> After a validation and comment period of a few days, we would apply the
>>> mapping in the main VLO instance (and roll out with version 3.4).
>>>
>>> There is one more thing we would like to have feedback on, especially
>>> from CLIC: the labels and definitions for the l/a related facets.
>>> But I will save that for a separate email.
>>>
>>> Thank you for all the input so far.
>>>
>>> Best,
>>> Matej
>>>
>>> [1] https://www.clarin.eu/content/license-categories
>>> [2] https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1
>>>
>>>
>>> -----Original Message-----
>>> From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>>> Sent: Sunday, 31 January 2016 22:26
>>> To: Durco, Matej <Matej.Durco at oeaw.ac.at>; Penny Labropoulou
>>> <penny at ilsp.gr>; 'Twan Goosen' <twan.goosen at mpi.nl>; 'Sander Maijers'
>>> <sander at clarin.eu>
>>> Cc: 'Menzo Windhouwer2' <menzo.windhouwer at meertens.knaw.nl>; 'Thomas
>>> Eckart' <teckart at informatik.uni-leipzig.de>; Ostojic, Davor
>>> <Davor.Ostojic at oeaw.ac.at>; tf-curation at lists.clarin.eu; Sugimoto, Go
>>> <Go.Sugimoto at oeaw.ac.at>; 'Dieter Van Uytvanck' <dieter at clarin.eu>
>>> Subject: Part I: Re: AW: AW: [Tf-curation] License/Availability was
>>> WG: Re: LicenseAvailabilityMap.xml in
>>> vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>
>>> Dear all,
>>>
>>> It seems I could have responded in one long email, but I chose to answer
>>> in three different parts and therefore I end up confirming some of the
>>> things here that Penny said in the next message, while also adding some
>>> explanations, but here goes.
>>>
>>> Part I:
>>>
>>> On 21.1.2016 17:04, Durco, Matej wrote:
>>>
>>>> ACA vs. NC
>>>> as you rightly commented in the gsheet
>>>>
>>>> 1)
>>>>
>>>> * ad: PUB/ACA/RES yes, the goal is to have one of these 3 categories
>>>> assigned to each record/resource. ... the way it is solved in
>>>> META-SHARE profiles brought me to the idea to decompose to license
>>>> categories. Namely in META-SHARE profile the licenceInfo is quite
>>>> rich: There is a licence-element and the repeated restrictionsOfUse
>>>> element, so each record has more than enough information to correctly
>>>> map to both the main categories and the optional ones. (see example
>>>> Helsinki corpus [1]) Therefore I believe we can (and have to) be
>>>> conservative in the mapping and can avoid adding uncertain
>>>> information: Correct me if I am wrong but "non-commercial use" can be
>>>> safely mapped to the License category "NC",
>>>>
>>>
>>> Yes it can. There is currently a legal debate going on about what
>>> non-commercial means exactly, because it is a fuzzy concept, but whenever
>>> that discussion arrives at a conclusion, which may be as soon as the new
>>> directive for some kind of research exception emerges, that is the
>>> definition we will adopt.
>>>
>>>> however mapping it to the main category PUB ACA or RES is problematic
>>>> without more information. Long story short, even though we cannot map
>>>> each individual value in the normalization map to one of the main
>>>> categories, in the end every record/resource (that provides the
>>>> appropriate information) will be assigned to one of the three.
>>>>
>>>
>>> Agree.
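
The point agreed on here, that single values such as "non-commercial use" map only to an atomic tag while the main category is settled per record, might look like this in a rough sketch. The map contents, the function name, and the fallback to "unspecified" when no single main category emerges are assumptions for illustration; this is not the actual VLO normalization map.

```python
MAIN_CATEGORIES = ("PUB", "ACA", "RES")

# Hypothetical normalization map: raw metadata value -> set of tags.
# Only the "non-commercial use" -> NC entry is taken from the discussion.
NORMALIZATION_MAP = {
    "non-commercial use": {"NC"},
    "CLARIN_ACA-NC": {"ACA", "NC"},  # assumed decomposition
    "attribution": {"BY"},           # assumed value
}


def categorize_record(raw_values):
    """Combine all values of one record into (main_category, atomic_tags)."""
    tags = set()
    for value in raw_values:
        # Unknown values contribute nothing rather than a guess.
        tags |= NORMALIZATION_MAP.get(value, set())
    main = [c for c in MAIN_CATEGORIES if c in tags]
    # Assumption: zero or several main categories both end up "unspecified".
    main_category = main[0] if len(main) == 1 else "unspecified"
    return main_category, tags - set(MAIN_CATEGORIES)


only_nc = categorize_record(["non-commercial use"])
full = categorize_record(["CLARIN_ACA-NC", "non-commercial use"])
```

A record carrying only "non-commercial use" stays in "unspecified" with the NC tag, while a record that also names a license with a known main category is assigned to it.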
>>>
>>>> * ad: atomic vs. combined License Categories We want to try with
>>>> decomposition, i.e. atomic categories as separate facet values (BY,
>>>> SA, ...)
>>>>
>>>
>>> Good.
>>>
>>>> * ad: indication of license/availability information being
>>>> unavailable in the dev-instance we use a placeholder "[missing
>>>> value]" (actually in all facets), but it needs to be decided if we
>>>> want to expose this in the main vlo, especially given the many
>>>> records falling into this category. we cannot say "non standard
>>>> license", because we don't know.
>>>> we can only say "unspecified" or synonyms thereof, "unspecified"
>>>> being sometimes used as value itself.
>>>>
>>>
>>> Unspecified is OK.
>>>
>>>> 2) ad: C-* facets It's actually the opposite, these are special
>>>> facets exposing the values of the individual concepts that contribute
>>>> to the actual availability/license facets. (concept-facet mapping) The
>>>> overview of these concepts, incl. definition (copied from the source)
>>>> and links to CCR are in the trac-wiki [1]. These C-* facets are
>>>> exactly meant to be able to identify where the individual values in
>>>> the availability facet came from. Identifying the underlying concepts
>>>> is more-or-less the closest we can get you (easily), as VLO does not
>>>> keep the information about which actual profile/element a given value
>>>> comes from.
>>>> (this is also in response to 2nd point of 3) ). However one can get
>>>> this information in the detail-view (looking into full metadata
>>>> record). Not super convenient, but well. And as said the ProfileName
>>>> and DataProvider facets help you identify the provider and profile in
>>>> question.
>>>>
>>>
>>> OK.
>>>
>>>> ad: licence type vs. license type Yes, indeed there is both a
>>>> "licence type" and a "license type" concept. They still come from
>>>> ISOcat
>>>> (DC-3800 and DC-5439 resp.). I added a snapshot from SMC-browser to
>>>> the wiki page [2] showing where these two come from (in which
>>>> profiles they are used and what the context of these profiles was).
>>>> (also attaching the snapshot) It is 3 and 5 profiles using these, I
>>>> guess it would be possible in this case to ask the authors (with the
>>>> help from the CCR and CMDI team) to merge these two and correct the
>>>> profiles accordingly.
>>>>
>>>
>>> Good idea.
>>>
>>>> Two more points from my side: As far as I understood, there is a conflict in
>>>> the understanding of PUB/ACA/RES in CLARIN and in META-SHARE, in
>>>> META-SHARE everything beyond CC-0 being of availability:restrictedUse.
>>>> Is that correct? The example above [1] delivers also the CLARIN
>>>> compliant licence (CLARIN_ACA-NC), but I doubt that this is the case
>>>> for all META-SHARE records. So in my understanding we need to
>>>> disregard the availability information in resourceInfo-profile and
>>>> just regard the licence and restrictionsOfUse. Would you agree?
>>>>
>>>
>>> I seem to remember that META-SHARE made a point of declaring everything
>>> except CC0 restricted, which may be true from a legal point of view,
>>> although I don't think they used this for any particular purpose as all the
>>> regular CC licenses then also fall into the META-SHARE restricted category.
>>>
>>> In CLARIN, the RES category was intended to be used for resources
>>> "restricted to individual use" typically containing personal data
>>> preventing them from being opened to a broader category of users. This is
>>> often referred to only as "restricted use" due to the RES acronym and
>>> therefore misinterpreted in view of the META-SHARE terminology.
>>>
>>>> The next question that is not clear to me: - Is NC equivalent with
>>>> ACA? Because then we have a problem with CC-NC?
>>>>
>>>
>>> No. NC is not equivalent with ACA.
>>>
>>> In its basic form, ACA means "resources available for educational,
>>> teaching and research purposes" including commercial research, so we need
>>> NC to specify that an ACA resource is available only for non-commercial
>>> purposes.
>>>
>>> In addition, ACA implies ID i.e. "A user needs to be authenticated or
>>> identified." and BY as that is required by law in most EU countries
>>> anyway. (This is why there is CC0 to explicitly say that we don't care
>>> about attribution.)
>>>
>>> Authentication implies more than self-identification for collecting
>>> usage statistics, so someone needs to verify the identity. For this we need
>>> an affiliation to some community that can authenticate the user.
>>> We currently offer two flavors of affiliation: EDU and META. If nothing
>>> else is mentioned, EDU is assumed (which is the pure ACA), but if META is
>>> mentioned (by saying ACA+META), we also acknowledge that the META
>>> community, which includes industrial partners, may do the authentication.
>>> How they do it is up to them.
>>>
>>> In contrast to ACA resources, we may also have resources available for
>>> any purpose that still require self-identification for collecting usage
>>> statistics, e.g. the IP address may be collected or some email address or
>>> whatever means of identification the distributor of the resource chooses.
>>> This does not restrict access to the resource to a particular community, so
>>> we can therefore put such resources in the category PUB+ID.
>>>
>>> In order to be able to control the ID for authentication, ACA also
>>> implies NORED. If the resource could be distributed freely to other
>>> researchers, automated authentication could not be implemented and would
>>> also not make sense.
>>>
>>> More generally the following implications hold:
>>>
>>>     ACA => ID;BY;NORED
>>>     ACA;META => ID;BY
>>>     RES => ID;BY;NORED
>>>
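
As a minimal sketch, the three implications listed above could be made explicit on a tag set like this. The function name is hypothetical, and tags are assumed to be plain strings in a set; the rules themselves are the ones stated in the message (ACA with META does not imply NORED).

```python
def expand_tags(tags):
    """Add the subcategory tags implied by the main categories.

    Implements: ACA => ID;BY;NORED, ACA;META => ID;BY, RES => ID;BY;NORED.
    """
    tags = set(tags)
    expanded = set(tags)
    if "ACA" in tags:
        expanded |= {"ID", "BY"}
        # With META, redistribution is not implicitly forbidden.
        if "META" not in tags:
            expanded.add("NORED")
    if "RES" in tags:
        expanded |= {"ID", "BY", "NORED"}
    return expanded


aca = expand_tags({"ACA"})
aca_meta = expand_tags({"ACA", "META"})
pub = expand_tags({"PUB"})
```

PUB implies nothing here, so a PUB resource keeps exactly the tags it was given.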
>>>> I hope I did not add more confusion.
>>>>
>>>
>>> I hope my answers clarified some parts.
>>>
>>> --
>>> Krister
>>>
>>>
>>