[Tf-curation] License/Availability was WG: Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac

Penny Labropoulou penny at ilsp.gr
Tue Jan 19 12:51:31 CET 2016


Hi all!
I've had a quick look at the various documents and emails sent and would like to add to the discussion with the following comments:

1) I welcome the distinction of the facets (if I understand well Matej's email) into the three: (a) licence category tags (either altogether or broken into broad & fine-grained tags); (b) licence (when clearly stated) & (c) the "original availability" statement of the provider. But it's important to make clear to the user the semantics of these three and how and when these are to be used and how they interact. Some points that have come to my mind:
	- PUB/ACA/RES distinction: it would be nice if we could classify all resources (or at least those that have some kind of access/licensing statement) into this; Krister, do you think this is doable? From the mapping google spreadsheet, I got the feeling it's not always possible. 
	- Some of the licence categories come with specific tags, e.g. RES must also have the tag ID (am I right, Krister?); if so, these should be specified in the facet, too.
	- Matej, what I didn't get is whether these tags will all be different facet values, e.g. PUB, BY, NC, or whether there will also be combinations thereof, e.g. PUB, PUB-BY, PUB-BY-NC, already in the list of values to be shown on the facet.
	- How will resources that have no clear access statement, or that cannot be mapped to the normalized values, be shown? Although I do not expect anyone to ask for "resources that do not require attribution" or that come with a licence other than CC-BY or LGPL, as a user, I would like to know that there are 100 resources with CC-BY and 500 more with some "unspecified licence" or "non standard licence" or "no licence at all" ...
2) In the current VLO-dev link (https://minerva.arz.oeaw.ac.at/vlo/search?0), I had difficulty understanding how these interact. If I understand well, C-... are the ones that will replace the existing facets? Please, note that in this there are two C-LICENSE_TYPE & C-LICENCE_TYPE, and I didn't get what the difference between them is. Also, for C-LICENSE, if you stick to "clearly identified licences", then values such as "proprietary", "other", "special license" etc. should not be included in this. If I've misunderstood the columns, could you provide some more explanations?
3) As regards the mapping of the values, 
	- I've had a quick look at the Google sheet and I have noted the obviously erroneous ones and pointed out some confusing cases (in two separate columns); I will give it a more thorough check within the week, but you can all see some problematic issues already.
	- One thing that has already been noted in previous discussions, though, is that it's important to know the attribute from which these values come. For instance, "non-commercial use" on its own doesn't help in classifying a resource: if it comes from the element "conditionsOfUse" (as used in the MetaShare schema), one should look at its accompanying conditions of use in order to come to a conclusion; if it's a statement from an element such as "rights", it might mean that the resource is free for non-commercial use. Is there a way of knowing the attribute of the values?
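
To illustrate why the source attribute matters, here is a minimal Python sketch. The element names "conditionsOfUse" and "rights" are the ones mentioned above; the rule table, the function name `classify` and the returned labels are purely hypothetical and are not taken from the actual VLO mapping.

```python
from typing import Optional

# (source element, raw value) -> tentative interpretation.
# Illustrative entries only, not the real normalization map.
RULES = {
    ("rights", "non-commercial use"): "free for non-commercial use",
    ("conditionsOfUse", "non-commercial use"): "check accompanying conditions of use",
}


def classify(element: Optional[str], value: str) -> str:
    """Interpret a raw value, taking the element it came from into account."""
    if element is None:
        # Without the source element the value cannot be interpreted safely.
        return "ambiguous: source element unknown"
    return RULES.get((element, value.strip().lower()), "unmapped")
```

The same string thus ends up with a different reading per element, and a value whose origin is unknown is flagged rather than guessed.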

Best,
Penny

-----Original Message-----
From: Twan Goosen [mailto:twan.goosen at mpi.nl] 
Sent: Tuesday, January 19, 2016 11:29 AM
To: Krister Lindén <krister.linden at helsinki.fi>; Sander Maijers <sander at clarin.eu>
Cc: Durco, Matej <Matej.Durco at oeaw.ac.at>; Penny Labropoulou <penny at ilsp.gr>; Menzo Windhouwer2 <menzo.windhouwer at meertens.knaw.nl>; Thomas Eckart <teckart at informatik.uni-leipzig.de>; Ostojic, Davor <Davor.Ostojic at oeaw.ac.at>; tf-curation at lists.clarin.eu; Sugimoto, Go <Go.Sugimoto at oeaw.ac.at>; Dieter Van Uytvanck <dieter at clarin.eu>
Subject: Re: AW: [Tf-curation] License/Availability was WG: Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac

(I'm putting Dieter in the loop because I know that he's interested in this as well)

On 19/01/16 10:28, Twan Goosen wrote:
> Hi Krister, all,
>
> On 18/01/16 15:40, Krister Lindén wrote:
>> [..] conceptual division of language resources into three main access 
>> categories, i.e. resources
>> - which are publicly or openly available (PUB),
>> - which are available for research or academic use (ACA), or
>> - which are restricted to individual use (RES).
>> Note that PUB resources are available to everyone, whereas ACA 
>> resources normally imply a trusted community, and RES resources 
>> typically contain personal data requiring an individual access 
>> permission.
> Thanks a lot for those definitions and the elaboration. After a recent 
> discussion I was more or less convinced that we should have a "licence 
> category" facet as the main filter mechanism in the VLO that includes 
> these values. However from your description I have to conclude that 
> these categories only indicate availability rather than type of 
> licence (even though these two are not completely orthogonal).
> Also, do I understand correctly that it is feasible to map all resources 
> to one of these categories/levels, as long as they provide sufficient 
> licence and/or access information? I think this would be preferable 
> for use in the VLO, since a coarse initial distinction is more useful 
> than a really fine-grained one. And I don't think that, for instance, 
> the standard CC restrictions make a resource less public in the eyes of 
> our users.
>
> I do have to say that I'm still a bit confused about the _exact_ 
> relations between licence (type), accessibility and availability.
> "Public" resources do not have to be directly openly accessible (in 
> some cases only upon request), while "restricted" resources can be 
> freely accessible, at least technically. How do such apparent 
> contradictions get reflected in the availability category for a given 
> resource? It would be nice if we had the description of an underlying 
> "algorithm" or at least some formulation of the heuristic on which our 
> mapping is based. Most likely it exists, but I have not been paying 
> attention - in which case my apologies :)
>
> In any case, from the user's perspective something like "availability" 
> (however we define it exactly) does indeed seem to be the most useful 
> primary dimension to filter by. I think we will have to reconsider the 
> decision to switch from availability to licence category, if the rest 
> of the VLO crowd agrees.
>> In hindsight, the acronyms could have been chosen differently, 
>> especially when compared with the current ORCID categories, but the 
>> acronyms were chosen before the ORCID community appeared. Now the 
>> CLARIN license category acronyms have been around for so long that 
>> they are simply three-letter acronyms.
> I agree that it should be fine to use them, especially if we can 
> provide a description or definition for each of them simultaneously 
> (which can be done in the VLO).
>
> Best,
> Twan
>> On 18.1.2016 15:37, Sander Maijers wrote:
>>> Hi Krister,
>>>
>>> Requests for access to trac.clarin.eu <http://trac.clarin.eu> can be 
>>> directed to trac at clarin.eu <mailto:trac at clarin.eu>. As it happens I 
>>> handle those requests.
>>> I've checked and you do have access already. You can log in using 
>>> your e-mail address krister.linden at helsinki.fi 
>>> <mailto:krister.linden at helsinki.fi> and your password for 
>>> www.clarin.eu <http://www.clarin.eu>. So your CLARIN account. :)
>>>
>>> Best,
>>> Sander
>>> --
>>> *Sent as system administrator and engineer for CLARIN* /Centre 
>>> Registry & Service Provider Federation/ @ {centres,infra}.clarin.eu 
>>> <http://clarin.eu>; /software engineering tools/ @ 
>>> {svn,trac}.clarin.eu <http://clarin.eu>; /identity and access 
>>> management/ @ {user,idp}.clarin.eu <http://clarin.eu> /usage 
>>> statistics and service monitoring/ @ stats.clarin.eu 
>>> <http://stats.clarin.eu>
>>>
>>> Max Planck Institute for Psycholinguistics <https://tla.mpi.nl/>, 
>>> software developer personal Skype: sander.maijers | work address: 
>>> Wundtlaan 1, 6525 XD, Nijmegen (NL)
>>>
>>>
>>>
>>> On Mon, Jan 18, 2016 at 11:35 AM, Krister Lindén 
>>> <krister.linden at helsinki.fi <mailto:krister.linden at helsinki.fi>> wrote:
>>>
>>>     Dear Matej,
>>>
>>>     Thanks for the prompting. You provide a link to a spreadsheet [5] in
>>>     which there seem to be only approx. 240 lines, many of which are
>>>     already correctly mapped, so an overhaul of them by me with some
>>>     keen-eyed checking by Penny should be doable within the next few days.
>>>
>>>     (Your tentative tags REG and REQ are already covered by ID and RES.
>>>     Most of the other rather broad AVAILABILITY strings seem to map
>>>     nicely to existing tags as well, as suggested by the current mapping.)
>>>
>>>     For the majority (90%) of resources that do not have any
>>>     availability record or license, the idea was that CLIC could
>>>     contribute with each country taking care of its own resources.
>>>
>>>     (Note. I could not enter the trac page [6] as it seems to require a
>>>     login and password.)
>>>
>>>     If anyone needs reassurance that PUB/ACA/RES is a viable division of
>>>     access criteria for resources, you can have a look at ORCID, where
>>>     they have adopted a similar tripartite view of the world of how
>>>     researchers share their data (= papers and resources). ORCID even
>>>     uses the same color coding as we do for their categories: public
>>>     [=green], trusted [=yellow], and private [=red]. ORCID calls their
>>>     categories "Privacy Settings" for their researcher data:
>>>     http://support.orcid.org/knowledgebase/articles/124518-orcid-privacy-settings
>>>
>>>     This is the same idea as availability, but "Privacy settings" are
>>>     the categories seen from the rights holder's perspective instead of
>>>     the end-user's point of view.
>>>
>>>     Regards,
>>>     Krister
>>>
>>>     On 17.1.2016 16:33, Durco, Matej wrote:
>>>
>>>         Dear all,
>>>
>>>         time passes by ...
>>>         We would like to follow up on our discussion in November and
>>>         December regarding availability and license information in the VLO.
>>>
>>>         We would like to try the following setup:
>>>         1. introduce a License Category (search) facet using as values
>>>         the license category tags as defined by CLIC [1]
>>>         2. restrict the License field [2] to only feature URLs or clear
>>>         identifiers of licenses
>>>         3. (optionally) use the Availability field (or whatever we want to
>>>         call it) as a catch-all with the original information (as found in
>>>         the metadata records) that is not covered by the previous two.
>>>
>>>         Point 1 means that we would preserve the PUB/ACA/RES distinction, but
>>>         also add the more fine-grained distinction (NC; SA; BY; ...).
>>>         Not for the next 3.4 release, but relatively soon, it will be
>>>         possible to select multiple values within one facet (already
>>>         possible at our dev instance [3]), which I believe would be the
>>>         cleanest solution, allowing the user to easily restrict the
>>>         search to whichever aspect (or a combination thereof). Until
>>>         then we still have to see if we need to divide the License
>>>         Categories into two facets: the main distinction + the
>>>         fine-grained categories. We will let you test and give feedback on
>>>         both possibilities.
>>>
>>>         Now, we would need your help!
>>>         We updated the map (as it is being used now [4])
>>>         a) with the fine-grained distinctions
>>>         b) with new previously unmapped values.
>>>
>>>         Here is the working spreadsheet [5]
>>>         column C contains the currently valid mapping, column D the
>>>         tentative new one
>>>         The mapping is a bit more complex:
>>>         - You can map one value to multiple values using semicolons (e.g.
>>>         PUB;BY;NC;SA), still in one cell
>>>         - You can use two dashes ("--") to indicate that a value
>>>         should be disregarded for the facet (for obviously erroneous values)
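
The cell syntax described above could be parsed roughly like this. It is only a Python sketch: the function name `parse_mapping_cell` is made up, and returning `None` for "--" is an assumption about how a disregarded value might be signalled, not something the VLO importer is known to do.

```python
from typing import List, Optional


def parse_mapping_cell(cell: str) -> Optional[List[str]]:
    """Split one spreadsheet cell into its tags.

    Returns None when the cell is "--", i.e. the original value
    should be disregarded for the facet.
    """
    cell = cell.strip()
    if cell == "--":
        return None
    # Several tags share one cell, separated by semicolons.
    return [tag.strip() for tag in cell.split(";") if tag.strip()]


print(parse_mapping_cell("PUB;BY;NC;SA"))  # ['PUB', 'BY', 'NC', 'SA']
print(parse_mapping_cell("--"))            # None
```

Keeping "disregard" (`None`) apart from "no tags yet" (an empty list) would make it possible to tell deliberately dropped values from ones that are simply still unmapped.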
>>>
>>>         I also took the freedom to tentatively introduce two more tags:
>>>         REG := after registration
>>>         REQ := upon request
>>>
>>>         If you propose a change, please, comment on it (with your name)
>>>         in one of the next columns
>>>
>>>         We put together some information regarding the two facets and
>>>         the mapping in CLARIN trac [6]
>>>         Also, on our dev instance of VLO [3] we added additional facets
>>>         allowing to better explore the details of the mapping.
>>>         There are facets:
>>>         - AVAILABILITYORIG := the values as they were found in the records
>>>         - AVAILABILITY := the mapped/normalized values (according to the
>>>         currently used normalization map)
>>>         - C-* := each concept contributing to the AVAILABILITY facet
>>>         listed separately.
>>>
>>>         Example:
>>>         filter by "Availability:restricted" [7]
>>>         gives you 5438 records.
>>>         In the AVAILABILITYORIG facet you get the original values listed
>>>         (next to "restricted" itself, HZSK-RES, available-restrictedUse,
>>>         etc.)
>>>         and these same values further broken down by the concept
>>>         (underlying the CMD element from which the values come) in
>>>         the facets C-* ...
>>>         There you can further restrict, e.g. by "c-license:CC-BY-NC-ND" [8]
>>>         The VLO-dev at minerva instance [3] features also extra facets
>>>         PROFILE NAME (, PROFILE ID) and DATA PROVIDER (with the actual
>>>         provider listed), so once you filtered [8], you can nicely see
>>>         which profiles and data providers contribute given values.
>>>         (In the case of [8], it is LINDAT, CLARIN_PL and Language Bank
>>>         of Finland with metashare profiles (data and resourceInfo) )
>>>         Obviously you can also use it the other way round and restrict
>>>         by provider or profile first and then see all the values
>>>         contributed by those.
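
Judging from links [7] and [8], each facet restriction is passed as its own `fq=facet:value` query parameter. Assuming that pattern holds in general (I have not checked it against the VLO code), such URLs could be put together as in this Python sketch; `filter_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Search page of the dev instance, as used in links [7] and [8].
BASE = "http://minerva.arz.oeaw.ac.at/vlo/search"


def filter_url(*filters: str) -> str:
    """Build a search URL with one fq parameter per "facet:value" filter."""
    # safe=":" keeps the colon between facet name and value readable.
    query = urlencode([("fq", f) for f in filters], safe=":")
    return f"{BASE}?{query}"


print(filter_url("c-license:CC-BY-NC-ND", "availability:Restricted"))
```

With the two filters shown, this reproduces the address given as [8].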
>>>
>>>
>>>         Then there is what I see as the more dramatic issue of too
>>>         many records not providing any licensing/availability
>>>         information (around 90%!), but I will save that for a
>>>         separate email before this one gets too lengthy.
>>>
>>>         I am sorry I did not come back to you earlier, but we would be
>>>         very grateful if we could have your input/feedback in the next
>>>         days, as we are approaching the VLO-3.4 milestone (and also
>>>         simply to make progress for better user experience in resource
>>>         discovery)
>>>
>>>         Best,
>>>         Matej
>>>
>>>
>>>         [1] https://www.clarin.eu/content/clarin-license-category-calculator
>>>         [2] "field" means a facet that is only visible in the detail
>>>         view of the record and is not displayed as search facet
>>>         [3] https://minerva.arz.oeaw.ac.at/vlo/?0
>>>         [4] https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>         [5] https://docs.google.com/spreadsheets/d/1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ/edit#gid=0
>>>         [6] https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>>         [7] http://minerva.arz.oeaw.ac.at/vlo/search?fq=availability:Restricted
>>>         [8] http://minerva.arz.oeaw.ac.at/vlo/search?fq=c-license:CC-BY-NC-ND&fq=availability:Restricted
>>>
>>>
>>>
>>>         -----Original Message-----
>>>         From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>>>         Sent: Tuesday, December 1, 2015 00:01
>>>         To: Sander Maijers <sander at clarin.eu>
>>>         Cc: Penny Labropoulou <penny at ilsp.gr>;
>>>         Durco, Matej <Matej.Durco at oeaw.ac.at>; Menzo Windhouwer2
>>>         <menzo.windhouwer at meertens.knaw.nl>; Twan Goosen
>>>         <twan.goosen at mpi.nl>; Thomas Eckart
>>>         <teckart at informatik.uni-leipzig.de>; Ostojic, Davor
>>>         <Davor.Ostojic at oeaw.ac.at>;
>>>         tf-curation at lists.clarin.eu
>>>         Subject: Re: [Tf-curation] License/Availability was WG: Re:
>>>         LicenseAvailabilityMap.xml in
>>>         vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>
>>>         CLIC is sorely aware of the lack of agreement on licenses in
>>>         general (let alone on sorting them).
>>>
>>>         Normally there will be only a few resources matching a
>>>         particular real end-user resource query, so most users will
>>>         normally be satisfied with perusing the full search result.
>>>
>>>         However, providing the capability to sort the resources
>>>         according to their main CLARIN license category
>>>         (PUB/ACA/RES/undefined) will give a rough general preference to
>>>         open resources and will take most users as far as they care to
>>>         go along this particular line of investigation.
>>>
>>>         If we in addition provide the capability to exclude certain
>>>         laundry tags and the capability to provide part of a license
>>>         name as a filter, we already cover most foreseeable simple searches.
>>>
>>>         There will always be those who want more, but for those there
>>>         can be some advanced search to be defined based on user feedback.
>>>
>>>         --
>>>         Krister
>>>
>>>         On 30.11.2015 23:58, Sander Maijers wrote:
>>>
>>>             Hi all,
>>>
>>>             Making license type ordinal in the sense of permissiveness favors one
>>>             interpretation of what users care about in a license, and the
>>>             interpretation is bound to be arbitrary to some degree. That we all
>>>             knew, but is everyone aware of the lack of consensus on such an order
>>>             and categorization even in primary sources?
>>>
>>>             E.g., see the different categorizations in highly referenced
>>>             sources such as:
>>>             1. https://opensource.org/licenses/category
>>>
>>>             2. https://www.iprhelpdesk.eu/sites/default/files/newsdocuments/Intellectual%20Property%20Rights%20Management%20in%20Software%20Developments_updated.pdf
>>>
>>>             3. https://en.wikipedia.org/wiki/License_compatibility#Compatibility_of_FOSS_licenses
>>>
>>>             4. Academic take:
>>>             http://jleo.oxfordjournals.org/content/21/1/20.full.pdf+html:
>>>
>>>                   We will consider three classes of licenses: unrestrictive [e.g., the
>>>                   Berkeley Software Definition (BSD) license], restrictive [e.g.,
>>>                   lesser general public license (LGPL)], and highly restrictive
>>>                   [general public license (GPL)]. (See below for a more complete
>>>                   discussion of these licenses.)
>>>
>>>
>>>             In conclusion, a resource license as encoded in metadata ought to be
>>>             an enumerated/sum type and it is a matter of search implementation how
>>>             to rank, unify and filter its levels.
>>>
>>>             Best,
>>>             Sander
>>>
>>>
>>>
>>>             On Mon, Nov 30, 2015 at 1:02 PM, Krister Lindén
>>>             <krister.linden at helsinki.fi> wrote:
>>>
>>>                   Quick answer to Penny's question about PUB/ACA/RES is that they
>>>                   should also be facets that you can restrict, i.e. choose to get data
>>>                   that is neither ACA nor RES leaving only open or public data with
>>>                   varying licenses.
>>>
>>>                   Despite our good efforts to collect data, people will be happy to
>>>                   find a resource at all. I do not really think that they have the
>>>                   luxury of choosing e.g. whether they want a treebank for a
>>>                   particular language with a CC-BY and not an MIT license, or vice
>>>                   versa, but they may wish to say that they want a treebank with an
>>>                   open or public license, if available.
>>>
>>>                   It is this final "if available", that has gotten me thinking that we
>>>                   should probably also let legal metadata provide a sorting order,
>>>                   because other criteria will be more important, i.e. if a favorite
>>>                   license is not on offer, I may settle for a
>>>                   slightly-more-difficult-to-manage license, as the legal status of
>>>                   the resource is more like a price tag, i.e. it will cost me more
>>>                   effort to deal with a restricted resource than an open one, but if I
>>>                   need an English speech data set, I will not settle for some Russian
>>>                   text data simply because the license is more interesting.
>>>
>>>                   This said, if it is not too much of an effort, we could of course
>>>                   provide the option to also write the name (or part of a name) of a
>>>                   license as a search criterion. After all, that can be implemented as
>>>                   rather straightforward string matching in the license name field.
>>>
>>>                   --
>>>                   Krister
>>>
>>>                   On 30.11.2015 13:11, Penny Labropoulou wrote:
>>>
>>>                       Hi Matej, Krister and all
>>>
>>>                       Some thoughts on the topics raised:
>>>                       - license & availability are indeed too close semantically and
>>>                       that's where the confusion comes from; moreover, for the
>>>                       normalization, the values are taken from different attributes,
>>>                       which brings about the contradictory outcomes we noticed in Wroclaw.
>>>                       - Now, if I understand the new approach correctly, both facets
>>>                       will be replaced by the License Categories, is that it? If yes,
>>>                       I think this would improve the situation and we need to check
>>>                       the new mappings. In this case, my only question to Krister and
>>>                       the CLIC is whether the PUB/ACA/RES should be treated at the
>>>                       same level as the other tags.
>>>                       - in this scenario, I agree in principle with Krister's email
>>>                       about the sorting of the resources when shown to the user - but
>>>                       I don't know what other sortings (apart from alphabetical
>>>                       ordering on the resource name) you have also implemented on the
>>>                       new VLO; it would also be nice to state somewhere that this is
>>>                       the ordering of the resources or allow the user to decide on the
>>>                       sorting, perhaps?
>>>                       - the only problem I have with keeping only license categories
>>>                       in the facets is that we "lose" information on resources that
>>>                       are licensed with a standard license, e.g. CC, GNU etc. Could we
>>>                       have a second facet for these? Problems with this: (a) confusion
>>>                       between the two facets (will the user understand that
>>>                       "attribution" will give him more results than CC-BY?) and (b)
>>>                       mapping of resources without a standard license value... I'm
>>>                       just putting up the question without having a definite answer.
>>>
>>>                       And to Matej's question: sorry, I haven't done anything yet :-(.
>>>                       Do you have a deadline for the normalization issues? I can look
>>>                       at it closer in the next couple of weeks, taking into account
>>>                       the current discussion outcomes. And, if possible, it would be
>>>                       nice to know the attributes these concepts come from and the
>>>                       combinations thereof (i.e. if the same resource has two or more
>>>                       licensing-related attributes and/or values, get the combinations
>>>                       thereof).
>>>
>>>                       Best,
>>>                       Penny
>>>
>>>
>>>                       -----Original Message-----
>>>                       From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>>>                       Sent: Friday, November 27, 2015 5:55 PM
>>>                       To: Durco, Matej <Matej.Durco at oeaw.ac.at>; penny at ilsp.gr;
>>>                       Menzo Windhouwer2 <menzo.windhouwer at meertens.knaw.nl>;
>>>                       Twan Goosen <twan.goosen at mpi.nl>;
>>>                       Thomas Eckart <teckart at informatik.uni-leipzig.de>;
>>>                       Ostojic, Davor <Davor.Ostojic at oeaw.ac.at>
>>>                       Cc: tf-curation at lists.clarin.eu
>>>                       Subject: Re: License/Availability was WG: Re:
>>>                       LicenseAvailabilityMap.xml in
>>>                       vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>
>>>                       Dear all,
>>>
>>>                       One thing we discussed afterwards in CLIC was the best approach
>>>                       to utilizing the legal metadata in a query. We have no illusions
>>>                       that the legal metadata would be the primary criterion people
>>>                       use to select their data. Other criteria such as name of the
>>>                       resource, languages covered, data type and usage purpose are
>>>                       probably more crucial, but if users have a choice, they probably
>>>                       look for resources that have a clearly defined legal status and
>>>                       have as few restrictions as possible.
>>>
>>>                       "As few restrictions as possible" implies a sorting order.
>>>                       However, a user may dislike some restrictions for practical
>>>                       purposes. Therefore, it would make sense to let the user check
>>>                       the tags he would like to filter out and if none are checked,
>>>                       none are filtered, i.e. all resources conforming to the primary
>>>                       criteria are shown. In addition, the user should be able to
>>>                       filter out resources whose legal status is undefined, because
>>>                       the user would not know how to legally use the resource, even if
>>>                       it exists. If displayed within a sorting order of "as few
>>>                       restrictions as possible", resources with undefined legal status
>>>                       should be at the end of the list.
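
The filtering and ordering described in this paragraph could be sketched as follows in Python. The ranking PUB before ACA before RES, and counting the remaining tags as "number of restrictions", are assumptions made only for illustration (any such order is arbitrary to some degree); the names `select` and `CATEGORY_RANK` are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

# A resource is (name, tags); tags is None when the legal status is undefined.
Resource = Tuple[str, Optional[Set[str]]]

# Assumed ranking of the main categories, most open first.
CATEGORY_RANK = {"PUB": 0, "ACA": 1, "RES": 2}


def select(resources: Iterable[Resource],
           excluded: Set[str] = frozenset(),
           hide_undefined: bool = False) -> List[Resource]:
    """Drop resources carrying an excluded tag, then sort by openness."""
    kept = []
    for name, tags in resources:
        if tags is None:
            if not hide_undefined:
                kept.append((name, tags))
        elif not (tags & excluded):
            # With no tags excluded, nothing is filtered out.
            kept.append((name, tags))

    def sort_key(resource: Resource) -> Tuple[int, int, int]:
        _, tags = resource
        if tags is None:
            return (1, 0, 0)  # undefined legal status goes to the end
        rank = min((CATEGORY_RANK[t] for t in tags if t in CATEGORY_RANK),
                   default=len(CATEGORY_RANK))
        extra = len(tags - set(CATEGORY_RANK))  # additional restriction tags
        return (0, rank, extra)

    return sorted(kept, key=sort_key)
```

For example, `select(resources, excluded={"NC"}, hide_undefined=True)` would leave only resources without the NC tag and with a defined legal status, most open first.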
>>>
>>>                       Regards,
>>>                       Krister
>>>
>>>                       On 27.11.2015 16:08, Durco, Matej wrote:
>>>
>>>                           Dear all,
>>>
>>>                           I found out only very late that there was a follow-up on the
>>>                           License issue right after the conference (see email below).
>>>
>>>                           Penny, were you able to proceed on that?
>>>
>>>                           Meanwhile we have experimented quite a bit and compiled
>>>                           information, so here is our current take on this from our
>>>                           (TF Curation / ACDH-OEAW) side:
>>>
>>>                           We put down an overview (and would like to
>>>             collect there
>>>                           more findings
>>>                           and decisions as we go along) in 
>>> clarin-trac [1]
>>>
>>>                           Main points:
>>>
>>>                           1. Some of the concepts are linked to both facets (not
>>>                           necessarily bad, but a hint that we don’t have a clear
>>>                           distinction).
>>>
>>>                           2. There is a normalisation file employed, which is however
>>>                           incomplete (new unmapped values exist, some of which are
>>>                           obviously in completely the wrong place, like size in kB).
>>>
>>>                           3. With the current concept mapping we cover only some 60,000
>>>                           out of 800,000 records!
>>>
>>>                           Regarding 2: the Normalization
>>>
>>>                           The current normalization uses the 3-4 values
>>>             distinction:
>>>                           Free; Free
>>>                           for academic use; Restricted; Upon request 
>>> (in
>>>             line with
>>>                           PUB/ACA/RES –
>>>                           laundry tags)
>>>
>>>                           This sounds easy, but as far as I could
>>>             gather, it is
>>>                           problematic (in
>>>                           many ways).
>>>
>>>                           In Wroclaw, we discussed with Krister an
>>>             alternative approach:
>>>
>>>                           We could try to map to the license categories
>>>             as they are
>>>                           defined [2] by
>>>                           the Legal Issues Committee and available also
>>>             in the License
>>>                           Category
>>>                           Calculator [3]. By that we would avoid the
>>>             problematic
>>>                           reduction, still
>>>                           keeping the “laundry-tag” approach. And we
>>>             would be in sync
>>>                           with the
>>>                           Legal committee recommendations. Also, each of these atomic
>>>                           tags is well defined and most of them are broadly used on
>>>                           the web.
>>>
>>>                           We could employ here the decomposition
>>>             approach, in line
>>>                           with what we
>>>                           try to adopt for resourceType and other
>>>             facets, that means,
>>>                           we wouldn’t
>>>                           have facet values: [ “PUB”, “PUB+BY”,
>>>             “PUB+BY+SA”] but
>>>                           rather [“PUB”,
>>>                           “BY”, “SA”].
>>>
>>>                           Allowing multiple possible values for the facet in each
>>>                           record, in combination with the (already implemented)
>>>                           multi-select feature in VLO, should cover all use cases
>>>                           and be more ergonomic (e.g. if I am interested only in the
>>>                           Non-Commercial clause, I need to select only one facet value
>>>                           and don’t have to search for all the combinations that
>>>                           contain NC).
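A small sketch of the decomposition idea, in Python. The helper names `decompose` and `select`, the record ids and the sample values are made up for illustration, and treating a multi-select as "the record must carry every selected tag" is an assumption about the intended behaviour, not a description of how the VLO implements it.

```python
from typing import Dict, List, Set


def decompose(category: str) -> List[str]:
    """Split a combined value such as 'PUB+BY+SA' into atomic tags."""
    return [tag.strip() for tag in category.split("+") if tag.strip()]


def select(records: Dict[str, List[str]], wanted: Set[str]) -> List[str]:
    """Return ids of records whose tag list contains every wanted tag."""
    return sorted(rid for rid, tags in records.items() if wanted <= set(tags))


# Hypothetical records carrying combined values before decomposition.
raw = {"rec1": "PUB", "rec2": "PUB+BY", "rec3": "PUB+BY+NC", "rec4": "ACA+NC"}

# After decomposition each record holds several atomic facet values.
records = {rid: decompose(value) for rid, value in raw.items()}
```

Here `select(records, {"NC"})` finds both records with a Non-Commercial clause through a single facet value, without enumerating "PUB+BY+NC" and "ACA+NC" separately.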
>>>
>>>                           There is already a normalisation map used in production
>>>                           <https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml>
>>>                           [4] (committed 2015-04-23), but there are new values that
>>>                           are not mapped yet. The normalisation map as a gsheet
>>>                           <https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ>
>>>                           [5] contains the already existing mappings (see normalisation
>>>                           map above) plus newly encountered values that are not yet
>>>                           normalized. The values come from elements annotated with
>>>                           concepts linked to one of the two facets License/Availability.
>>>
>>>                           If we agree on the decomposition approach,
>>>             this list would
>>>                           need to be
>>>                           reviewed completely, but it’s just around 240
>>>             entries.
>>>
>>>                           And ad 3. Missing values
>>>
>>>                           Here we have 3 possible situations:
>>>
>>>                           1. The profile does not have any information about
>>>                           licensing/availability (worst case).
>>>
>>>                           2. The profile has information about L/A, but it is not
>>>                           linked to a concept, or the concept is not in the facet
>>>                           mapping.
>>>
>>>                           3. The profile is well defined, with a link to one of the
>>>                           concepts in the facet mapping, but the information is simply
>>>                           not filled in in the record.
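The three situations can be told apart with three yes/no checks per record. The function below is only a sketch of that decision order; its name, its parameters and the returned labels are hypothetical and do not correspond to anything in the VLO code base.

```python
def missing_value_case(has_la_element: bool,
                       concept_in_mapping: bool,
                       value_filled: bool) -> str:
    """Classify why a record shows no licensing/availability (L/A) facet value."""
    if not has_la_element:
        # Situation 1: the profile offers no place for L/A information.
        return "1: profile has no L/A element"
    if not concept_in_mapping:
        # Situation 2: the element exists but no mapped concept points to it.
        return "2: element not linked to a mapped concept"
    if not value_filled:
        # Situation 3: everything is wired up, but the record leaves it empty.
        return "3: element mapped but empty in the record"
    return "ok: value present"
```

Only situation 3 needs action from the metadata provider at record level; situations 1 and 2 concern the profile or the facet mapping.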
>>>
>>>                           We prepared a list, profile/facet coverage
>>>                           <https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1>
>>>                           [6], with special consideration of the availability and
>>>                           licensing facets. In particular, the individual concepts
>>>                           contributing to the facet are also plotted (see the c-*
>>>                           columns).
>>>
>>>                           If you want to further investigate this 
>>> issue,
>>>             I strongly
>>>                           recommend our
>>>                           experimental instance of the VLO on 
>>> Minerva <https://minerva.arz.oeaw.ac.at/vlo/> [7].
>>>
>>>                           It features normalized and unnormalized
>>>             facets, explicit
>>>                           [missing
>>>                           values], profileID and name as facets, data
>>>             provider facet
>>>                           showing the
>>>                           actual data provider, multi-value selection
>>>             and also special
>>>                           facets for
>>>                           the concepts contributing to facet
>>>             availability (i.e. every
>>>                           concept is
>>>                           plotted as a separate facet; these are marked
>>>             with prefix
>>>             c-)
>>>
>>>                           With all this you can only too easily see that the biggest
>>>                           contributor to missing values in the availability facet is
>>>                           Meertens [8] (playing the blame game ;) ). And you can
>>>                           equally easily see what the respective profiles are (just
>>>                           open the profile Name facet).
>>>
>>>                           So much for our findings so far. We would love to hear what
>>>                           you think; perhaps we could/should arrange a telco to discuss
>>>                           how to proceed.
>>>
>>>                           Best,
>>>
>>>                           Matej
>>>
>>>                           [1] https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>>
>>>                           [2] https://www.clarin.eu/content/license-categories
>>>
>>>                           [3] https://www.clarin.eu/content/clarin-license-category-calculator
>>>
>>>                           [4] https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>>                           [5] https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ
>>>
>>>                           [6] https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1
>>>
>>>                           [7] https://minerva.arz.oeaw.ac.at/vlo/
>>>
>>>                           [8] http://minerva.arz.oeaw.ac.at/vlo/search?fq=dataProvider:Meertens_Institute_Metadata_Repository&fq=availability:%5Bmissing+value%5D
>>>
>>>
>>>
>>>                           -------- Forwarded Message --------
>>>
>>>                           *Subject: * Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>                           *Date: * Sat, 17 Oct 2015 10:34:04 +0200
>>>                           *From: * Twan Goosen <twan.goosen at mpi.nl>
>>>                           *To: * Penny Labropoulou <penny at ilsp.gr>
>>>                           *CC: * Thomas Eckart <teckart at informatik.uni-leipzig.de>, Matej Durco <xnrn at gmx.net>
>>>
>>>
>>>
>>>                           That would be great. To get more information
>>>             on the mapping
>>>                           from the
>>>                           values in resourceInfo records to VLO facets,
>>>             you can enter
>>>                           the profile
>>>                           id 'clarin.eu:cr1:p_1361876010571' in the
>>>             input box of the
>>>                           "check
>>>                           profile" form at
>>>
>>>                           <https://lux17.mpi.nl/isocat/clarin/vlo/mapping/index.html>.
>>>
>>>                           This will give you quite a lot of information
>>>             but the
>>>                           relevant sections
>>>                           would be
>>>
>>>                                 Facet: availability
>>>                                      Matched CMD Element ConceptLink:
>>>                                      http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>>                                      /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>>                                 xpath accepted
>>>                                      Matched CMD Element ConceptLink:
>>>                                      http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>>                                      /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>>                                 xpath accepted
>>>
>>>                           and
>>>
>>>                                 Facet: license
>>>                                      Matched CMD Element ConceptLink:
>>>                                      http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>>                                      /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>>                                 xpath accepted
>>>                                      Matched CMD Element ConceptLink:
>>>                                      http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>>                                      /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>>                                 xpath accepted
>>>
>>>
>>>                           So the two fields 'licence' and 'restrictionsOfUse' are
>>>                           mapped to both facets (via the concepts 'availability' and
>>>                           'license'). By looking at the mapping file we are able to
>>>                           see why this results in the three different availability
>>>                           levels we are now getting in the VLO (at least in the case of
>>>                           <http://catalog-clarin.esc.rzg.mpg.de/vlo/search?q=perso&fq=country:Finland>):
>>>
>>>
>>>                           - license 'CLARIN_ACA-NC' maps to 'Free for
>>>             academic use'
>>>                           - restriction 'attribution' maps to 'Free'
>>>                           - restriction 'noRedistribution' maps to
>>>             'Restricted'
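To make the effect concrete, here is a minimal Python sketch that applies the three mappings listed above to one record. The dictionaries contain only those three entries, not the full LicenseAvailabilityMap.xml; the function names are hypothetical; and the `ORDER` list used to collapse several levels into one is purely an assumed ranking, shown as one possible answer to the open question of the desired mapping logic.

```python
# The three mappings quoted above; the real map contains many more entries.
LICENCE_TO_AVAILABILITY = {"CLARIN_ACA-NC": "Free for academic use"}
RESTRICTION_TO_AVAILABILITY = {
    "attribution": "Free",
    "noRedistribution": "Restricted",
}

# Assumed ranking from least to most restrictive, for illustration only.
ORDER = ["Free", "Free for academic use", "Restricted"]


def availability_levels(licences, restrictions):
    """Collect every availability level that a record's values map to."""
    levels = set()
    for value in licences:
        if value in LICENCE_TO_AVAILABILITY:
            levels.add(LICENCE_TO_AVAILABILITY[value])
    for value in restrictions:
        if value in RESTRICTION_TO_AVAILABILITY:
            levels.add(RESTRICTION_TO_AVAILABILITY[value])
    return levels


def most_restrictive(levels):
    """One possible collapse rule: keep only the most restrictive level."""
    return max(levels, key=ORDER.index) if levels else None
```

A record with licence 'CLARIN_ACA-NC' and restrictions 'attribution' and 'noRedistribution' ends up with all three levels at once, which is what shows up in the facet; `most_restrictive` would reduce that to 'Restricted' under the assumed ranking.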
>>>
>>>                           The next step is to decide what would be the
>>>             desired mapping
>>>                           (logic).
>>>
>>>                           Best,
>>>                           Twan
>>>
>>>                           On 16/10/15 22:14, Penny Labropoulou wrote:
>>>
>>>                                 No problem! Glad to do it - it was more
>>>             or less on our
>>>                           agenda for
>>>                                 CLIC, so I'll have a look and let you
>>>             know of the
>>>                           outcomes.
>>>
>>>                                 Best,
>>>
>>>                                 Penny
>>>
>>>                                 On 16 October 2015 at 16:06, Twan Goosen
>>>                                 <twan.goosen at mpi.nl> wrote:
>>>
>>>                                     Thanks for your offer to look
>>>             through this mapping!
>>>                                     I will also send you a link to
>>>             Menzo's mapping tool.
>>>
>>>
>>>                                     https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>                   Tf-curation mailing list
>>>             Tf-curation at lists.clarin.eu
>>> https://lists.clarin.eu/cgi-bin/mailman/listinfo/tf-curation