[Tf-curation] License/Availability was WG: Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac

Twan Goosen twan.goosen at mpi.nl
Tue Dec 1 14:10:30 CET 2015


Hi everyone,

I agree with Krister's assessment/proposal and would like to note that 
this is fully in line with what is currently possible with the VLO or at 
least scheduled to be implemented in the (somewhat) short term.

With respect to sorting on the basis of licence category, I think it would 
make sense to go for a 'boosting' approach that is always applied. What 
this would come down to is that, all other things being equal, PUB level 
records will be shown first, then ACA, then RES, then undefined. Search 
term matching (especially in title or description) should have a higher 
weight, but I think the licence category can weigh in at least as heavily 
as hierarchy level.
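
A minimal sketch of such an always-applied boost, for illustration only. The `Record` type, the boost values and the `text_weight` factor are assumptions; only the order PUB, ACA, RES, undefined comes from the discussion, and how the VLO's search backend would actually express the boost is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical boost values: only the order PUB > ACA > RES > undefined
# comes from the discussion; the numbers are made up for illustration.
LICENCE_BOOST = {"PUB": 3.0, "ACA": 2.0, "RES": 1.0}


@dataclass
class Record:
    name: str
    text_score: float  # relevance from search term matching, assumed given
    licence_category: Optional[str] = None  # None means undefined


def ranking_score(record: Record, text_weight: float = 10.0) -> float:
    """Combine text relevance with a small, always-applied licence boost."""
    boost = LICENCE_BOOST.get(record.licence_category, 0.0)
    return text_weight * record.text_score + boost


def rank(records: list[Record]) -> list[Record]:
    """Return records ordered from highest to lowest combined score."""
    return sorted(records, key=ranking_score, reverse=True)


results = rank([
    Record("a", 1.0, "RES"),
    Record("b", 1.0, "PUB"),
    Record("c", 1.0, None),
    Record("d", 2.0, None),
])
print([r.name for r in results])
```

With these made-up numbers the better text match "d" still comes first despite its undefined licence, while the three equally relevant records are ordered PUB, RES, undefined.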

Since our plan is to also make these levels clearly visible in the 
search results, I don't think that an absolute, explicit sorting is 
needed, especially when there's also the option to filter at this basic 
level. From a technical point of view, however, this would be trivial, 
so if there is a consensus in favour of such a feature I'm fine with it, 
although I think we should always be careful not to make the user 
interface more complex than necessary.

Advanced search is already possible with the current feature set, 
provided the values are of good quality (and I'm confident that they 
will be soon :)). For example, one can search for "license:(GPL or 
LGPL)"[1] or "license:CC*"[2] or even "license:(CC*  -*NC*)"[3]. 
Admittedly, some (more) user guidance would be desirable - a first step 
would be to add one or more licence-related examples to the advanced 
search syntax documentation. An advanced search form (with 
field-specific search term suggestions) is one of the features we are 
considering, and would make this even easier.
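
As a small illustration, the example queries can be turned into search links like the ones in the footnotes. This is a sketch using only the Python standard library; the helper name `vlo_search_url` is hypothetical, and the base URL is taken from the footnotes.

```python
from urllib.parse import quote_plus

# Base URL as used in the footnote links.
VLO_SEARCH = "https://vlo.clarin.eu/search"


def vlo_search_url(query: str) -> str:
    """Build a search link, leaving ':' and '*' readable in the URL."""
    return f"{VLO_SEARCH}?q={quote_plus(query, safe=':*')}"


for example in ("license:(GPL or LGPL)", "license:CC*", "license:(CC*  -*NC*)"):
    print(vlo_search_url(example))
```

Spaces become `+` and parentheses are percent-encoded, which is how the three footnote links are written.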

Best,
Twan

P.S. I will add a 'licence' alias for the 'license' field

[1] https://vlo.clarin.eu/search?q=license:%28GPL+or+LGPL%29
[2] https://vlo.clarin.eu/search?q=license:CC*
[3] https://vlo.clarin.eu/search?q=license:%28CC*++-*NC*%29


On 01/12/15 00:01, Krister Lindén wrote:
> CLIC is sorely aware of the lack of agreement on licenses in general 
> (let alone on sorting them).
>
> Normally there will be only a few resources matching a particular real 
> end-user resource query, so most users will normally be satisfied with 
> perusing the full search result.
>
> However, providing the capability to sort the resources according to 
> their main CLARIN license category (PUB/ACA/RES/undefined) will give a 
> rough general preference to open resources and will take most users as 
> far as they care to go along this particular line of investigation.
>
> If we in addition provide the capability to exclude certain laundry 
> tags and the capability to provide part of a license name as a filter, 
> we already cover most foreseeable simple searches.
>
> There will always be those who want more, but for those there can be 
> some advanced search to be defined based on user feedback.
>
> -- 
> Krister
>
> On 30.11.2015 23:58, Sander Maijers wrote:
>> Hi all,
>>
>> Making license type ordinal in the sense of permissiveness favors one
>> interpretation of what users care about in a license, and the
>> interpretation is bound to be arbitrary to some degree. That we all
>> knew, but is everyone aware of the lack of consensus on such an order
>> and categorization even in primary sources?
>>
>> E.g., see the different categorizations in highly referenced sources
>> such as:
>> 1. https://opensource.org/licenses/category
>>
>> 2.
>> https://www.iprhelpdesk.eu/sites/default/files/newsdocuments/Intellectual%20Property%20Rights%20Management%20in%20Software%20Developments_updated.pdf
>>
>> 3.
>> https://en.wikipedia.org/wiki/License_compatibility#Compatibility_of_FOSS_licenses
>>
>> 4. Academic take:
>> http://jleo.oxfordjournals.org/content/21/1/20.full.pdf+html:
>>
>>     We will consider three classes of licenses: unrestrictive [e.g., the
>>     Berkeley Software Definition (BSD) license], restrictive [e.g.,
>>     lesser general public license (LGPL)], and highly restrictive
>>     [general public license (GPL)]. (See below for a more complete
>>     discussion of these licenses.)
>>
>>
>> In conclusion, a resource license as encoded in metadata ought to be an
>> enumerated/sum type and it is a matter of search implementation how to
>> rank, unify and filter its levels.
>>
>> Best,
>> Sander
>> -- 
>> *Sent as system administrator and engineer for CLARIN*
>> /Centre Registry & Service Provider Federation/ @
>> {centres,infra}.clarin.eu <http://clarin.eu>;
>> /software engineering tools/ @ {svn,trac}.clarin.eu <http://clarin.eu>;
>> /identity and access management/ @ {user,idp}.clarin.eu 
>> <http://clarin.eu>
>> /usage statistics and service monitoring/ @ stats.clarin.eu
>> <http://stats.clarin.eu>
>>
>> Max Planck Institute for Psycholinguistics <https://tla.mpi.nl/>,
>> software developer
>> personal Skype: sander.maijers | work address: Wundtlaan 1, 6525 XD,
>> Nijmegen (NL)
>>
>>
>>
>> On Mon, Nov 30, 2015 at 1:02 PM, Krister Lindén
>> <krister.linden at helsinki.fi <mailto:krister.linden at helsinki.fi>> wrote:
>>
>>     Quick answer to Penny's question about PUB/ACA/RES is that they
>>     should also be facets that you can restrict, i.e. choose to get data
>>     that is neither ACA nor RES leaving only open or public data with
>>     varying licenses.
>>
>>     Despite our good efforts to collect data, people will be happy to
>>     find a resource at all. I do not really think that they have the
>>     luxury of choosing e.g. whether they want a treebank for a
>>     particular language with a CC-BY and not an MIT license, or vice
>>     versa, but they may wish to say that they want a treebank with an
>>     open or public license, if available.
>>
>>     It is this final "if available", that has gotten me thinking that we
>>     should probably also let legal metadata provide a sorting order,
>>     because other criteria will be more important, i.e. if a favorite
>>     license is not on offer, I may settle for a
>>     slightly-more-difficult-to-manage license, as the legal status of
>>     the resource is more like a price tag, i.e. it will cost me more
>>     effort to deal with a restricted resource than an open one, but if I
>>     need an English speech data set, I will not settle for some Russian
>>     text data simply because the license is more interesting.
>>
>>     This said, if it is not too much of an effort, we could of course
>>     provide the option to also write the name (or part of a name) of a
>>     license as a search criterion. After all, that can be implemented as
>>     rather straightforward string matching in the license name field.
>>
>>     --
>>     Krister
>>
>>     On 30.11.2015 13:11, Penny Labropoulou wrote:
>>
>>         Hi Matej, Krister and all
>>
>>         Some thoughts on the topics raised:
>>         - license & availability are indeed too close semantically and
>>         that's where the confusion comes from; moreover, for the
>>         normalization, the values are taken from different attributes,
>>         which brings about the contradictory outcomes we noticed in
>>         Wroclaw.
>>         - Now, if I understand correctly the new approach, both facets
>>         will be replaced by the License Categories, is that it? If yes,
>>         I think this would improve the situation and we need to check
>>         the new mappings. In this case, my only question to Krister and
>>         the CLIC, is whether the PUB/ACA/RES should be treated at the
>>         same level as the other tags.
>>         - in this scenario, I agree in principle with Krister's email
>>         about the sorting of the resources when shown to the user - but
>>         I don't know what other sortings (apart from alphabetical
>>         ordering on the resource name) you have also implemented on the
>>         new VLO; it would also be nice to state somewhere that this is
>>         the ordering of the resources, or to allow the user to decide
>>         on the sorting, perhaps?
>>         - the only problem I have with keeping only license categories
>>         in the facets, is that we "lose" information of resources that
>>         are licensed with a standard license, e.g. CC, GNU etc. Could we
>>         have a second facet for these? Problems with this: (a) confusion
>>         between the two facets (will the user understand that
>>         "attribution" will give him more results than CC-BY?) and (b)
>>         mapping of resources without a standard license value... I'm
>>         just putting up the question without having a definite answer.
>>
>>         And to Matej's question: sorry, I haven't done anything yet :-(.
>>         Do you have a deadline for the normalization issues? I can look
>>         at it closer in the next couple of weeks, taking into account
>>         the current discussion outcomes. And, if possible, it would be
>>         nice to know the attributes these concepts come from and the
>>         combinations thereof (i.e. if the same resource has two or more
>>         licensing-related attributes and/or values, get the combinations
>>         thereof).
>>
>>         Best,
>>         Penny
>>
>>
>>         -----Original Message-----
>>         From: Krister Lindén [mailto:krister.linden at helsinki.fi
>>         <mailto:krister.linden at helsinki.fi>]
>>         Sent: Friday, November 27, 2015 5:55 PM
>>         To: Durco, Matej <Matej.Durco at oeaw.ac.at
>>         <mailto:Matej.Durco at oeaw.ac.at>>; penny at ilsp.gr
>>         <mailto:penny at ilsp.gr>; Menzo Windhouwer2
>>         <menzo.windhouwer at meertens.knaw.nl
>>         <mailto:menzo.windhouwer at meertens.knaw.nl>>; Twan Goosen
>>         <twan.goosen at mpi.nl <mailto:twan.goosen at mpi.nl>>; Thomas Eckart
>>         <teckart at informatik.uni-leipzig.de
>>         <mailto:teckart at informatik.uni-leipzig.de>>; Ostojic, Davor
>>         <Davor.Ostojic at oeaw.ac.at <mailto:Davor.Ostojic at oeaw.ac.at>>
>>         Cc: tf-curation at lists.clarin.eu
>>         <mailto:tf-curation at lists.clarin.eu>
>>         Subject: Re: License/Availability was WG: Re:
>>         LicenseAvailabilityMap.xml in
>>         vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>
>>         Dear all,
>>
>>         One thing we discussed afterwards in CLIC, was the best approach
>>         to utilize the legal metadata in a query. We have no illusions
>>         that the legal metadata would be the primary criterion people
>>         use to select their data. Other criteria such as name of the
>>         resource, languages covered, data type and usage purpose are
>>         probably more crucial, but if users have a choice, they probably
>>         look for resources that have a clearly defined legal status and
>>         have as few restrictions as possible.
>>
>>         "As few restrictions as possible" implies a sorting order.
>>         However, a user may dislike some restrictions for practical
>>         purposes. Therefore, it would make sense to let the user check
>>         the tags he would like to filter out and if none are checked,
>>         none are filtered, i.e. all resources conforming to the primary
>>         criteria are shown. In addition, the user should be able to
>>         filter out resources whose legal status is undefined, because
>>         the user would not know how to legally use the resource, even if
>>         it exists. If displayed within a sorting order of "as few
>>         restrictions as possible", resources with undefined legal status
>>         should be at the end of the list.
>>
>>         Regards,
>>         Krister
>>
>>         On 27.11.2015 16:08, Durco, Matej wrote:
>>
>>             Dear all,
>>
>>             I only very late found out that there was a follow-up on the
>>             License issue right after the conference (see email below).
>>
>>             Penny, were you able to proceed on that?
>>
>>             Meanwhile we experimented quite a bit and compiled
>>             information, so here is our current take on this from our
>>             (TF Curation / ACDH-OEAW) side:
>>
>>             We put down an overview (and would like to collect more
>>             findings and decisions there as we go along) in clarin-trac
>>             [1].
>>
>>             Main points:
>>
>>             1. Some of the concepts are linked to both facets (not
>>             necessarily bad, but a hint that we don’t have a clear
>>             distinction).
>>
>>             2. There is a normalisation file employed, which is however
>>             incomplete (new unmapped values exist, some of which are
>>             obviously in completely the wrong place, like size in kB).
>>
>>             3. With the current concept mapping we cover only some
>>             60.000 out of 800.000 records !!!
>>
>>             Regarding 2: the Normalization
>>
>>             The current normalization uses the 3-4 values distinction:
>>             Free; Free for academic use; Restricted; Upon request (in
>>             line with PUB/ACA/RES – laundry tags).
>>
>>             This sounds easy, but as far as I could gather, it is
>>             problematic (in many ways).
>>
>>             In Wroclaw, we discussed an alternative approach with
>>             Krister:
>>
>>             We could try to map to the license categories as they are
>>             defined [2] by the Legal Issues Committee and also available
>>             in the License Category Calculator [3]. That way we would
>>             avoid the problematic reduction while still keeping the
>>             “laundry-tag” approach, and we would be in sync with the
>>             Legal committee recommendations. Also, each of these atomic
>>             tags is well defined and most of them are widely used on
>>             the web.
>>
>>             We could employ the decomposition approach here, in line
>>             with what we try to adopt for resourceType and other facets.
>>             That means we wouldn’t have facet values [“PUB”, “PUB+BY”,
>>             “PUB+BY+SA”] but rather [“PUB”, “BY”, “SA”].
>>
>>             Allowing multiple possible values for the facet in each
>>             record, in combination with the (already implemented)
>>             multi-select feature in VLO, should cover all use cases and
>>             be more ergonomic (e.g. if I am interested only in the
>>             Non-Commercial clause, I need to select only one facet value
>>             and don’t have to search for all the combinations that
>>             contain NC).
>>
>>             There is already a normalisation map used in production
>> <https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml>
>>             [4] (committed 2015-04-23), but there are new values that
>>             are not mapped yet. The normalisation map as gsheet
>> <https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ>
>>             [5] contains the already existing mappings (see the
>>             normalisation map above) plus the newly encountered values
>>             that are not yet normalized. The values come from elements
>>             annotated with concepts linked to one of the two facets
>>             License/Availability.
>>
>>             If we agree on the decomposition approach, this list would
>>             need to be reviewed completely, but it’s just around 240
>>             entries.
>>
>>             And ad 3: Missing values
>>
>>             Here we have 3 possible situations:
>>
>>             1. The profile does not have any information about
>>             licensing/availability (worst case).
>>
>>             2. The profile has information about L/A, but it is not
>>             linked to a concept, or the concept is not in the facet
>>             mapping.
>>
>>             3. The profile is well defined, with linking to one of the
>>             concepts in the facet mapping, but the information is simply
>>             not filled in in the record.
>>
>>             We prepared a profile/facet coverage list
>> <https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1>
>>             [6] with special consideration of the availability and
>>             licensing facets. In particular, the individual concepts
>>             contributing to the facet are also plotted (see the c-*
>>             columns).
>>
>>             If you want to investigate this issue further, I strongly
>>             recommend our experimental instance of the VLO on Minerva
>>             <https://minerva.arz.oeaw.ac.at/vlo/> [7].
>>
>>             It features normalized and unnormalized facets, explicit
>>             [missing values], profileID and name as facets, a data
>>             provider facet showing the actual data provider, multi-value
>>             selection and also special facets for the concepts
>>             contributing to the availability facet (i.e. every concept
>>             is plotted as a separate facet; these are marked with the
>>             prefix c-).
>>
>>             With all this you can all too easily see that the biggest
>>             contributor to missing values in the availability facet is
>>             Meertens [8] (playing the blame game ;). And you can equally
>>             easily see what the respective profiles are (just open the
>>             profile Name facet).
>>
>>             So much for our findings so far. We would love to hear what
>>             you think; perhaps we c/should arrange a telco to discuss
>>             how to go on with this.
>>
>>             Best,
>>
>>             Matej
>>
>>             [1]
>> https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>
>>             [2] https://www.clarin.eu/content/license-categories
>>
>>             [3]
>> https://www.clarin.eu/content/clarin-license-category-calculator
>>
>>             [4]
>> https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>
>>             [5]
>> https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ
>>
>>             [6]
>> https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1
>>
>>             [7] https://minerva.arz.oeaw.ac.at/vlo/
>>
>>             [8]
>> http://minerva.arz.oeaw.ac.at/vlo/search?fq=dataProvider:Meertens_Institute_Metadata_Repository&fq=availability:%5Bmissing+value%5D
>>
>>
>>
>>             -------- Forwarded Message --------
>>             Subject: Re: LicenseAvailabilityMap.xml in
>>             vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>             Date: Sat, 17 Oct 2015 10:34:04 +0200
>>             From: Twan Goosen <twan.goosen at mpi.nl>
>>             To: Penny Labropoulou <penny at ilsp.gr>
>>             CC: Thomas Eckart <teckart at informatik.uni-leipzig.de>,
>>             Matej Durco <xnrn at gmx.net>
>>
>>
>>
>>             That would be great. To get more information on the mapping
>>             from the values in resourceInfo records to VLO facets, you
>>             can enter the profile id 'clarin.eu:cr1:p_1361876010571' in
>>             the input box of the "check profile" form at
>> <https://lux17.mpi.nl/isocat/clarin/vlo/mapping/index.html>.
>>
>>             This will give you quite a lot of information, but the
>>             relevant sections would be
>>
>>                   Facet: availability
>>                        Matched CMD Element ConceptLink:
>> http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>
>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>                   xpath accepted
>>                        Matched CMD Element ConceptLink:
>> http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>
>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>                   xpath accepted
>>
>>             and
>>
>>                   Facet: license
>>                        Matched CMD Element ConceptLink:
>> http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>
>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>                   xpath accepted
>>                        Matched CMD Element ConceptLink:
>> http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>
>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>                   xpath accepted
>>
>>
>>             So the two fields 'licence' and 'restrictionsOfUse' are
>>             mapped to both facets (via the concepts 'availability' and
>>             'license'). By looking at the mapping file we can see why
>>             this results in the three different availability levels we
>>             are now getting in the VLO (at least in the case of
>> <http://catalog-clarin.esc.rzg.mpg.de/vlo/search?q=perso&fq=country:Finland>):
>>             - license 'CLARIN_ACA-NC' maps to 'Free for academic use'
>>             - restriction 'attribution' maps to 'Free'
>>             - restriction 'noRedistribution' maps to 'Restricted'
>>
>>             The next step is to decide what the desired mapping (logic)
>>             would be.
>>
>>             Best,
>>             Twan
>>
>>             On 16/10/15 22:14, Penny Labropoulou wrote:
>>
>>                   No problem! Glad to do it - it was more or less on our
>>                   agenda for CLIC, so I'll have a look and let you know
>>                   of the outcomes.
>>
>>                   Best,
>>
>>                   Penny
>>
>>                   On 16 October 2015 at 16:06, Twan Goosen
>>                   <twan.goosen at mpi.nl> wrote:
>>
>>                       Thanks for your offer to look through this mapping!
>>                       I will also send you a link to Menzo's mapping tool.
>>
>> https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>
>>
>>
>>
>>     _______________________________________________
>>     Tf-curation mailing list
>>     Tf-curation at lists.clarin.eu <mailto:Tf-curation at lists.clarin.eu>
>>     https://lists.clarin.eu/cgi-bin/mailman/listinfo/tf-curation
>>
>>


