[Tf-curation] License/Availability was WG: Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac

Twan Goosen twan.goosen at mpi.nl
Tue Jan 19 10:29:29 CET 2016


(I'm putting Dieter in the loop because I know that he's interested in 
this as well)

On 19/01/16 10:28, Twan Goosen wrote:
> Hi Krister, all,
>
> On 18/01/16 15:40, Krister Lindén wrote:
>> [..] conceptual division of language resources into three main access 
>> categories, i.e. resources
>> - which are publicly or openly available (PUB),
>> - which are available for research or academic use (ACA), or
>> - which are restricted to individual use (RES).
>> Note that PUB resources are available to everyone, whereas ACA 
>> resources normally imply a trusted community, and RES resources 
>> typically contain personal data requiring an individual access 
>> permission.
> Thanks a lot for those definitions and the elaboration. After a recent 
> discussion I was more or less convinced that we should have a "licence 
> category" facet as the main filter mechanism in the VLO that includes 
> these values. However from your description I have to conclude that 
> these categories only indicate availability rather than type of 
> licence (even though these two are not completely orthogonal).
> Also, do I understand correctly that it is feasible to map all resources 
> to one of these categories/levels, as long as they provide sufficient 
> licence and/or access information? I think this would be preferable 
> for use in the VLO, since a coarse initial distinction is more useful 
> than a really fine-grained one. And I don't think that, for instance, 
> the standard CC restrictions make a resource less public in the eyes of 
> our users.
>
> I do have to say that I'm still a bit confused about the _exact_ 
> relations between licence (type), accessibility and availability. 
> "Public" resources do not have to be directly openly accessible (in 
> some cases only upon request), while "restricted" resources can be 
> freely accessible, at least technically. How do such apparent 
> contradictions get reflected in the availability category for a given 
> resource? It would be nice if we had the description of an underlying 
> "algorithm" or at least some formulation of the heuristic on which our 
> mapping is based. Most likely it exists, but I have not been paying 
> attention - in which case my apologies :)
>
> In any case, from the user's perspective something like "availability" 
> (however we define it exactly) does indeed seem to be the most useful 
> primary dimension to filter by. I think we will have to reconsider the 
> decision to switch from availability to licence category, if the rest 
> of the VLO crowd agrees.
>> In hindsight, the acronyms could have been chosen differently, 
>> especially when compared with the current ORCID categories, but the 
>> acronyms were chosen before the ORCID community appeared. Now the 
>> CLARIN license category acronyms have been around for so long that 
>> they are simply three-letter acronyms.
> I agree that it should be fine to use them, especially if we can 
> provide a description or definition for each of them simultaneously 
> (which can be done in the VLO).
>
> Best,
> Twan
>> On 18.1.2016 15:37, Sander Maijers wrote:
>>> Hi Krister,
>>>
>>> Requests for access to trac.clarin.eu <http://trac.clarin.eu> can be
>>> directed to trac at clarin.eu <mailto:trac at clarin.eu>. As it happens I
>>> handle those requests.
>>> I've checked and you do have access already. You can log in using your
>>> e-mail address krister.linden at helsinki.fi
>>> <mailto:krister.linden at helsinki.fi> and your password for www.clarin.eu
>>> <http://www.clarin.eu>. So your CLARIN account. :)
>>>
>>> Best,
>>> Sander
>>> -- 
>>> *Sent as system administrator and engineer for CLARIN*
>>> /Centre Registry & Service Provider Federation/ @
>>> {centres,infra}.clarin.eu <http://clarin.eu>;
>>> /software engineering tools/ @ {svn,trac}.clarin.eu <http://clarin.eu>;
>>> /identity and access management/ @ {user,idp}.clarin.eu 
>>> <http://clarin.eu>
>>> /usage statistics and service monitoring/ @ stats.clarin.eu
>>> <http://stats.clarin.eu>
>>>
>>> Max Planck Institute for Psycholinguistics <https://tla.mpi.nl/>,
>>> software developer
>>> personal Skype: sander.maijers | work address: Wundtlaan 1, 6525 XD,
>>> Nijmegen (NL)
>>>
>>>
>>>
>>> On Mon, Jan 18, 2016 at 11:35 AM, Krister Lindén
>>> <krister.linden at helsinki.fi <mailto:krister.linden at helsinki.fi>> wrote:
>>>
>>>     Dear Matej,
>>>
>>>     Thanks for the prompting. You provide a link to a spreadsheet [5] in
>>>     which there seem to be only approx. 240 lines, many of which are
>>>     already correctly mapped, so an overhaul of them by me with some
>>>     keen-eyed checking by Penny should be doable within the next few days.
>>>
>>>     (Your tentative tags REG and REQ are already covered by ID and RES.
>>>     Most of the other rather broad AVAILABILITY strings seem to map
>>>     nicely to existing tags as well, as suggested by the current mapping.)
>>>
>>>     For the majority (90%) of resources that do not have any
>>>     availability record or license, the idea was that CLIC could
>>>     contribute with each country taking care of its own resources.
>>>
>>>     (Note. I could not enter the trac page [6] as it seems to require a
>>>     login and password.)
>>>
>>>     If anyone needs reassurance that PUB/ACA/RES is a viable division of
>>>     access criteria for resources, you can have a look at ORCID, where
>>>     they have adopted a similar tripartite view of the world of how
>>>     researchers share their data (= papers and resources). ORCID even
>>>     uses the same color coding as we do for their categories: public
>>>     [=green], trusted [=yellow], and private [=red]. ORCID calls their
>>>     categories "Privacy Settings" for their researcher data:
>>> http://support.orcid.org/knowledgebase/articles/124518-orcid-privacy-settings
>>>
>>>     This is the same idea as availability, but "Privacy settings" are
>>>     the categories seen from the rights holder's perspective instead of
>>>     the end-user's point of view.
>>>
>>>     Regards,
>>>     Krister
>>>
>>>     On 17.1.2016 16:33, Durco, Matej wrote:
>>>
>>>         Dear all,
>>>
>>>         time passes by ...
>>>         We would like to follow up on our discussion in November and
>>>         December regarding availability and license information in the VLO.
>>>
>>>         We would like to try the following setup:
>>>         1. introduce a License Category (search) facet using as values
>>>         the license category tags as defined by CLIC [1]
>>>         2. restrict the License field [2] to only feature URLs or clear
>>>         identifiers of licenses
>>>         3. (optionally) use the Availability field (or whatever we want to
>>>         call it) as a catch-all for the original information (as found in
>>>         the metadata records) that is not covered by the previous two.
>>>
>>>         1. means that we would preserve the PUB/ACA/RES distinction, but
>>>         also add the more fine-grained distinction (NC;SA;BY;...)
>>>         Not for the next 3.4 release but relatively soon it will be
>>>         possible to select multiple values within one facet (already
>>>         possible at our dev-instance [3]), which I believe would be the
>>>         cleanest solution, allowing the user to easily restrict the
>>>         search to whichever aspect (or a combination thereof). Until
>>>         then we still have to see if we need to divide the License
>>>         Categories into two facets: the main distinction + the
>>>         fine-grained categories. We will let you test and give feedback
>>>         on both possibilities.
>>>
>>>         Now, we would need your help!
>>>         We updated the map (as it is being used now [4])
>>>         a) with the fine-grained distinctions
>>>         b) with new previously unmapped values.
>>>
>>>         Here is the working spreadsheet [5]
>>>         column C contains the currently valid mapping, column D the
>>>         tentative new one
>>>         The mapping is a bit more complex:
>>>         - You can map one value to multiple values using semicolons (e.g.
>>>         PUB;BY;NC;SA), still in one cell
>>>         - You can use two dashes ("--") to indicate that this value
>>>         should be disregarded for the facet (for obviously erroneous values)
>>>
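A minimal sketch of how one cell of the tentative mapping column could be read under the two rules above (semicolon-separated tags, "--" to disregard the value). The function name parse_mapping_cell is hypothetical and is not taken from the VLO code base; the actual importer may well handle this differently.

```python
def parse_mapping_cell(cell: str) -> list[str]:
    """Split one spreadsheet cell into a list of tags.

    Hypothetical helper: ';' separates multiple tags, and a cell that
    consists only of '--' means the original value is disregarded.
    """
    value = cell.strip()
    if value == "--":
        return []
    # Ignore empty fragments such as a trailing semicolon.
    return [tag.strip() for tag in value.split(";") if tag.strip()]


print(parse_mapping_cell("PUB;BY;NC;SA"))
print(parse_mapping_cell("--"))
```

Returning an empty list for "--" keeps the disregarded values out of the facet without needing a special marker further down the line.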
>>>         I also took the freedom to tentatively introduce two more tags:
>>>         REG := after registration
>>>         REQ := upon request
>>>
>>>         If you propose a change, please, comment on it (with your name)
>>>         in one of the next columns
>>>
>>>         We put together some information regarding the two facets and
>>>         the mapping in CLARIN trac [6]
>>>         Also, on our dev instance of VLO [3] we added additional facets
>>>         allowing to better explore the details of the mapping.
>>>         There are facets:
>>>         - AVAILABILITYORIG := the values as they were found in the records
>>>         - AVAILABILITY := the mapped/normalized values (according to the
>>>         currently used normalization map)
>>>         - C-* := each concept contributing to the AVAILABILITY facet
>>>         listed separately.
>>>
>>>         Example:
>>>         filter by "Availability:restricted" [7]
>>>         gives you 5438 records.
>>>         In the AVAILABILITYORIG facet you get listed the original values
>>>         (next to "restricted" itself, HZSK-RES, available-restrictedUse,
>>>         etc.)
>>>         and these same values further broken down by the concept
>>>         (underlying the CMD element from which the values come) in
>>>         the facets C-* ...
>>>         There you can further restrict, e.g. by "c-license:CC-BY-NC-ND" [8]
>>>         The VLO-dev at minerva instance [3] features also extra facets
>>>         PROFILE NAME (, PROFILE ID) and DATA PROVIDER (with the actual
>>>         provider listed), so once you filtered [8], you can nicely see
>>>         which profiles and data providers contribute given values.
>>>         (In the case of [8], it is LINDAT, CLARIN_PL and Language Bank
>>>         of Finland with metashare profiles (data and resourceInfo) )
>>>         Obviously you can also use it the other way round and restrict
>>>         by provider or profile first and then see all the values
>>>         contributed by those.
>>>
>>>
>>>         Then there is what I see as the more dramatic issue of too
>>>         many records not providing any licensing/availability
>>>         information (around 90%!), but I will save that for a
>>>         separate email before this one gets too lengthy.
>>>
>>>         I am sorry I did not come back to you earlier, but we would be
>>>         very grateful if we could have your input/feedback in the next
>>>         days, as we are approaching the VLO-3.4 milestone (and also
>>>         simply to make progress for better user experience in resource
>>>         discovery)
>>>
>>>         Best,
>>>         Matej
>>>
>>>
>>>         [1] https://www.clarin.eu/content/clarin-license-category-calculator
>>>         [2] "field" means a facet that is only visible in the detail
>>>         view of the record and is not displayed as search facet
>>>         [3] https://minerva.arz.oeaw.ac.at/vlo/?0
>>>         [4] https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>         [5] https://docs.google.com/spreadsheets/d/1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ/edit#gid=0
>>>         [6] https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>>         [7] http://minerva.arz.oeaw.ac.at/vlo/search?fq=availability:Restricted
>>>         [8] http://minerva.arz.oeaw.ac.at/vlo/search?fq=c-license:CC-BY-NC-ND&fq=availability:Restricted
>>>
>>>
>>>         -----Original Message-----
>>>         From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>>>         Sent: Tuesday, 01 December 2015 00:01
>>>         To: Sander Maijers <sander at clarin.eu>
>>>         Cc: Penny Labropoulou <penny at ilsp.gr>;
>>>         Durco, Matej <Matej.Durco at oeaw.ac.at>; Menzo Windhouwer2
>>>         <menzo.windhouwer at meertens.knaw.nl>; Twan Goosen
>>>         <twan.goosen at mpi.nl>; Thomas Eckart
>>>         <teckart at informatik.uni-leipzig.de>; Ostojic, Davor
>>>         <Davor.Ostojic at oeaw.ac.at>;
>>>         tf-curation at lists.clarin.eu
>>>         Subject: Re: [Tf-curation] License/Availability was WG: Re:
>>>         LicenseAvailabilityMap.xml in
>>>         vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>
>>>         CLIC is sorely aware of the lack of agreement on licenses in
>>>         general (let alone on sorting them).
>>>
>>>         Normally there will be only a few resources matching a
>>>         particular real end-user resource query, so most users will
>>>         normally be satisfied with perusing the full search result.
>>>
>>>         However, providing the capability to sort the resources
>>>         according to their main CLARIN license category
>>>         (PUB/ACA/RES/undefined) will give a rough general preference to
>>>         open resources and will take most users as far as they care to
>>>         go along this particular line of investigation.
>>>
>>>         If we in addition provide the capability to exclude certain
>>>         laundry tags and the capability to provide part of a license
>>>         name as a filter, we already cover most foreseeable simple searches.
>>>
>>>         There will always be those who want more, but for those there
>>>         can be some advanced search to be defined based on user feedback.
>>>
>>>         --
>>>         Krister
>>>
>>>         On 30.11.2015 23:58, Sander Maijers wrote:
>>>
>>>             Hi all,
>>>
>>>             Making license type ordinal in the sense of permissiveness
>>>             favors one interpretation of what users care about in a
>>>             license, and the interpretation is bound to be arbitrary to
>>>             some degree. That we all knew, but is everyone aware of the
>>>             lack of consensus on such an order and categorization even
>>>             in primary sources?
>>>
>>>             E.g., see the different categorizations in highly referenced
>>>             sources such as:
>>>             1. https://opensource.org/licenses/category
>>>
>>>             2.
>>> https://www.iprhelpdesk.eu/sites/default/files/newsdocuments/Intellectual%20Property%20Rights%20Management%20in%20Software%20Developments_updated.pdf
>>>
>>>             3.
>>> https://en.wikipedia.org/wiki/License_compatibility#Compatibility_of_FOSS_licenses
>>>
>>>             4. Academic take:
>>> http://jleo.oxfordjournals.org/content/21/1/20.full.pdf+html:
>>>
>>>                   We will consider three classes of licenses:
>>>                   unrestrictive [e.g., the Berkeley Software Definition
>>>                   (BSD) license], restrictive [e.g., lesser general
>>>                   public license (LGPL)], and highly restrictive
>>>                   [general public license (GPL)]. (See below for a more
>>>                   complete discussion of these licenses.)
>>>
>>>
>>>             In conclusion, a resource license as encoded in metadata
>>>             ought to be an enumerated/sum type and it is a matter of
>>>             search implementation how to rank, unify and filter its
>>>             levels.
>>>
>>>             Best,
>>>             Sander
>>>
>>>
>>>
>>>             On Mon, Nov 30, 2015 at 1:02 PM, Krister Lindén
>>>             <krister.linden at helsinki.fi> wrote:
>>>
>>>                   Quick answer to Penny's question about PUB/ACA/RES is
>>>                   that they should also be facets that you can restrict,
>>>                   i.e. choose to get data that is neither ACA nor RES,
>>>                   leaving only open or public data with varying licenses.
>>>
>>>                   Despite our good efforts to collect data, people will
>>>                   be happy to find a resource at all. I do not really
>>>                   think that they have the luxury of choosing e.g.
>>>                   whether they want a treebank for a particular language
>>>                   with a CC-BY and not an MIT license, or vice versa, but
>>>                   they may wish to say that they want a treebank with an
>>>                   open or public license, if available.
>>>
>>>                   It is this final "if available" that has gotten me
>>>                   thinking that we should probably also let legal
>>>                   metadata provide a sorting order, because other
>>>                   criteria will be more important, i.e. if a favorite
>>>                   license is not on offer, I may settle for a
>>>                   slightly-more-difficult-to-manage license, as the legal
>>>                   status of the resource is more like a price tag, i.e.
>>>                   it will cost me more effort to deal with a restricted
>>>                   resource than an open one, but if I need an English
>>>                   speech data set, I will not settle for some Russian
>>>                   text data simply because the license is more
>>>                   interesting.
>>>
>>>                   This said, if it is not too much of an effort, we
>>>                   could of course provide the option to also write the
>>>                   name (or part of a name) of a license as a search
>>>                   criterion. After all, that can be implemented as
>>>                   rather straightforward string matching in the license
>>>                   name field.
>>>
>>>                   --
>>>                   Krister
>>>
>>>                   On 30.11.2015 13:11, Penny Labropoulou wrote:
>>>
>>>                       Hi Matej, Krister and all
>>>
>>>                       Some thoughts on the topics raised:
>>>                       - license & availability are indeed too close
>>>                       semantically and that's where the confusion comes
>>>                       from; moreover, for the normalization, the values
>>>                       are taken from different attributes, which brings
>>>                       about the contradicting outcomes we noticed in
>>>                       Wroclaw.
>>>                       - Now, if I understand the new approach correctly,
>>>                       both facets will be replaced by the License
>>>                       Categories, is that it? If yes, I think this would
>>>                       improve the situation and we need to check the new
>>>                       mappings. In this case, my only question to Krister
>>>                       and the CLIC is whether the PUB/ACA/RES should be
>>>                       treated at the same level as the other tags.
>>>                       - in this scenario, I agree in principle with
>>>                       Krister's email about the sorting of the resources
>>>                       when shown to the user - but I don't know what other
>>>                       sortings (apart from alphabetical ordering on the
>>>                       resource name) you have also implemented on the new
>>>                       VLO; it would also be nice to state somewhere that
>>>                       this is the ordering of the resources or allow the
>>>                       user to decide on the sorting, perhaps?
>>>                       - the only problem I have with keeping only license
>>>                       categories in the facets is that we "lose"
>>>                       information of resources that are licensed with a
>>>                       standard license, e.g. CC, GNU etc. Could we have a
>>>                       second facet for these? Problems with this: (a)
>>>                       confusion between the two facets (will the user
>>>                       understand that "attribution" will give him more
>>>                       results than CC-BY?) and (b) mapping of resources
>>>                       without a standard license value... I'm just putting
>>>                       up the question without having a definite answer.
>>>
>>>                       And to Matej's question: sorry, I haven't done
>>>                       anything yet :-(. Do you have a deadline for the
>>>                       normalization issues? I can look at it closer in the
>>>                       next couple of weeks, taking into account the
>>>                       current discussion outcomes. And, if possible, it
>>>                       would be nice to know the attributes these concepts
>>>                       come from and the combinations thereof (i.e. if the
>>>                       same resource has two or more licensing-related
>>>                       attributes and/or values, get the combinations
>>>                       thereof).
>>>
>>>                       Best,
>>>                       Penny
>>>
>>>
>>>                       -----Original Message-----
>>>                       From: Krister Lindén
>>>                       [mailto:krister.linden at helsinki.fi]
>>>                       Sent: Friday, November 27, 2015 5:55 PM
>>>                       To: Durco, Matej <Matej.Durco at oeaw.ac.at>;
>>>                       penny at ilsp.gr; Menzo Windhouwer2
>>>                       <menzo.windhouwer at meertens.knaw.nl>; Twan Goosen
>>>                       <twan.goosen at mpi.nl>; Thomas Eckart
>>>                       <teckart at informatik.uni-leipzig.de>; Ostojic, Davor
>>>                       <Davor.Ostojic at oeaw.ac.at>
>>>                       Cc: tf-curation at lists.clarin.eu
>>>                       Subject: Re: License/Availability was WG: Re:
>>>                       LicenseAvailabilityMap.xml in
>>>                       vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>
>>>                       Dear all,
>>>
>>>                       One thing we discussed afterwards in CLIC was the
>>>                       best approach to utilize the legal metadata in a
>>>                       query. We have no illusions that the legal metadata
>>>                       would be the primary criterion people use to select
>>>                       their data. Other criteria such as name of the
>>>                       resource, languages covered, data type and usage
>>>                       purpose are probably more crucial, but if users have
>>>                       a choice, they probably look for resources that have
>>>                       a clearly defined legal status and have as few
>>>                       restrictions as possible.
>>>
>>>                       "As few restrictions as possible" implies a sorting
>>>                       order. However, a user may dislike some restrictions
>>>                       for practical purposes. Therefore, it would make
>>>                       sense to let the user check the tags he would like
>>>                       to filter out and if none are checked, none are
>>>                       filtered, i.e. all resources conforming to the
>>>                       primary criteria are shown. In addition, the user
>>>                       should be able to filter out resources whose legal
>>>                       status is undefined, because the user would not know
>>>                       how to legally use the resource, even if it exists.
>>>                       If displayed within a sorting order of "as few
>>>                       restrictions as possible", resources with undefined
>>>                       legal status should be at the end of the list.
>>>
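The filtering and ordering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the rank PUB < ACA < RES < undefined is one possible reading of "as few restrictions as possible", and the names RANK, rank and filter_and_sort are hypothetical, not part of the VLO.

```python
# Assumed rank: lower sorts first; records without a main category
# (undefined legal status) sort last.
RANK = {"PUB": 0, "ACA": 1, "RES": 2}
UNDEFINED_RANK = len(RANK)


def rank(tags):
    """Rank a record by its main category; no main category means undefined."""
    return min((RANK[t] for t in tags if t in RANK), default=UNDEFINED_RANK)


def filter_and_sort(records, excluded=(), drop_undefined=False):
    """records is a list of (name, set_of_tags) pairs.

    Records carrying any excluded tag are dropped; if nothing is
    excluded, nothing is filtered. Undefined records are dropped only
    when drop_undefined is set.
    """
    excluded = set(excluded)
    kept = [
        (name, tags)
        for name, tags in records
        if not (tags & excluded)
        and not (drop_undefined and rank(tags) == UNDEFINED_RANK)
    ]
    # sorted() is stable, so records of equal rank keep their input order.
    return sorted(kept, key=lambda item: rank(item[1]))


# Hypothetical example records.
example = [
    ("d", set()),
    ("c", {"RES"}),
    ("a", {"PUB", "BY"}),
    ("b", {"ACA", "NC"}),
]
print([name for name, _ in filter_and_sort(example, excluded={"NC"})])
```

Because the sort is stable, the legal ordering only acts as a tie-breaker within whatever order the primary search criteria already produced.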
>>>                       Regards,
>>>                       Krister
>>>
>>>                       On 27.11.2015 16:08, Durco, Matej wrote:
>>>
>>>                           Dear all,
>>>
>>>                           I only very late found out that there was a
>>>                           follow-up on the License issue right after the
>>>                           conference (see email below).
>>>
>>>                           Penny, were you able to proceed on that?
>>>
>>>                           Meanwhile we experimented quite a bit and
>>>                           compiled information, so here is our current
>>>                           take on this from our (TF Curation / ACDH-OEAW)
>>>                           side:
>>>
>>>                           We put down an overview (and would like to
>>>                           collect there more findings and decisions as we
>>>                           go along) in clarin-trac [1]
>>>
>>>                           Main points:
>>>
>>>                           1. Some of the concepts are linked to both
>>>                           facets (not necessarily bad, but a hint that we
>>>                           don’t have a clear distinction)
>>>
>>>                           2. There is a normalisation file employed, which
>>>                           is however incomplete (new unmapped values
>>>                           exist, some of which are however obviously in
>>>                           the completely wrong place (like size in kB))
>>>
>>>                           3. With the current concept mapping we cover
>>>                           only some 60.000 out of 800.000 records !!!
>>>
>>>                           Regarding 2: the Normalization
>>>
>>>                           The current normalization uses the 3-4 values
>>>                           distinction: Free; Free for academic use;
>>>                           Restricted; Upon request (in line with
>>>                           PUB/ACA/RES – laundry tags)
>>>
>>>                           This sounds easy, but as far as I could gather,
>>>                           it is problematic (in many ways).
>>>
>>>                           In Wroclaw, we discussed with Krister an
>>>                           alternative approach:
>>>
>>>                           We could try to map to the license categories
>>>                           as they are defined [2] by the Legal Issues
>>>                           Committee and available also in the License
>>>                           Category Calculator [3]. By that we would avoid
>>>                           the problematic reduction, still keeping the
>>>                           “laundry-tag” approach. And we would be in sync
>>>                           with the Legal committee recommendations. Also
>>>                           each of these atomic tags is well defined and
>>>                           most of them are broadly used on the web.
>>>
>>>                           We could employ here the decomposition
>>>                           approach, in line with what we try to adopt for
>>>                           resourceType and other facets, that means, we
>>>                           wouldn’t have facet values: [“PUB”, “PUB+BY”,
>>>                           “PUB+BY+SA”] but rather [“PUB”, “BY”, “SA”].
>>>
>>>                           Allowing multiple values for the facet in
>>>                           each record, in combination with the (already
>>>                           implemented) multi-select feature in VLO,
>>>                           should cover all use cases and be more
>>>                           ergonomic (e.g. if I am interested only in the
>>>                           Non-Commercial clause, I need to select only
>>>                           one facet value and don’t have to search for
>>>                           all the combinations that contain NC).
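
An illustrative sketch of the decomposition approach described above. This is not code from the VLO; the record ids, the sample values and the function names `decompose` and `select` are assumptions made up for the example. It splits combined values such as "PUB+BY+SA" into atomic tags and then selects records by a single tag:

```python
def decompose(licence_category: str) -> set[str]:
    """Split a combined category such as 'PUB+BY+SA' into atomic tags."""
    return {tag.strip() for tag in licence_category.split("+") if tag.strip()}


# Hypothetical records; identifiers and values are illustrative only.
records = {
    "rec1": "PUB",
    "rec2": "PUB+BY",
    "rec3": "PUB+BY+SA",
    "rec4": "ACA+NC",
}

# Multi-valued facet: each record carries a set of atomic tags.
facet_values = {rec_id: decompose(value) for rec_id, value in records.items()}


def select(tag: str) -> list[str]:
    """Return the ids of all records whose facet contains the given tag."""
    return sorted(rec_id for rec_id, tags in facet_values.items() if tag in tags)
```

With atomic tags, a single call such as `select("NC")` finds every record with a Non-Commercial clause, without enumerating the combined values that happen to contain it.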
>>>
>>>                           There is already a normalisation map used in
>>>             production
>>>
>>> <https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml> 
>>>
>>>                           [4] (committed 2015-04-23). But there are new
>>>                           values that are not mapped yet. Normalisation
>>>                           map as gsheet
>>>
>>> <https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ> 
>>>
>>>                           [5] with the already existing mappings (see
>>>                           normalisation map above) plus newly
>>>                           encountered values that are not yet
>>>                           normalized. The values come from elements
>>>                           annotated with concepts linked to one of the
>>>                           two facets License/Availability.
>>>
>>>                           If we agree on the decomposition approach,
>>>             this list would
>>>                           need to be
>>>                           reviewed completely, but it’s just around 240
>>>             entries.
>>>
>>>                           And ad 3. Missing values
>>>
>>>                           Here we have 3 possible situations:
>>>
>>>                           1. Profile does not have any information about
>>>                           licensing/availability (worst case)
>>>
>>>                           2. Profile has information about L/A, but is
>>>                           not linked to a concept, or the concept is
>>>                           not in the facet mapping
>>>
>>>                           3. Profile is well defined, with linking to
>>>                           one of the concepts in the facet mapping, but
>>>                           the information is simply not filled in the
>>>                           record.
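
The three situations above could be told apart roughly as follows. This is only a sketch under assumed data structures, not how the VLO importer works: `profile_elements` maps each licensing/availability element of a profile to its concept link (or `None`), `facet_concepts` is the set of concepts in the facet mapping, and `record_values` holds the values found in one record. All of these names are hypothetical.

```python
from enum import Enum


class MissingCause(Enum):
    NO_ELEMENT = 1   # situation 1: profile has no L/A information at all
    NOT_MAPPED = 2   # situation 2: element unlinked, or concept not in the facet mapping
    NOT_FILLED = 3   # situation 3: mapped element exists but the record leaves it empty


def missing_cause(profile_elements, facet_concepts, record_values):
    """Return why the facet value is missing, or None if a value is present."""
    if not profile_elements:
        return MissingCause.NO_ELEMENT
    # Keep only the elements whose concept is part of the facet mapping.
    mapped = [name for name, concept in profile_elements.items()
              if concept in facet_concepts]
    if not mapped:
        return MissingCause.NOT_MAPPED
    if not any(record_values.get(name) for name in mapped):
        return MissingCause.NOT_FILLED
    return None
```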
>>>
>>>                           We prepared a list, profile/facet coverage
>>>
>>> <https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1> 
>>>
>>>                           [6], with special consideration of the
>>>                           availability and licensing facets. In
>>>                           particular, the individual concepts
>>>                           contributing to the facet are also plotted
>>>                           (see the c-* columns).
>>>
>>>                           If you want to further investigate this issue,
>>>                           I strongly recommend our experimental
>>>                           instance of the VLO on Minerva
>>> <https://minerva.arz.oeaw.ac.at/vlo/> [7].
>>>
>>>                           It features normalized and unnormalized
>>>             facets, explicit
>>>                           [missing
>>>                           values], profileID and name as facets, data
>>>             provider facet
>>>                           showing the
>>>                           actual data provider, multi-value selection
>>>             and also special
>>>                           facets for
>>>                           the concepts contributing to facet
>>>             availability (i.e. every
>>>                           concept is
>>>                           plotted as a separate facet; these are marked
>>>             with prefix
>>>             c-)
>>>
>>>                           With all this you can only too easily see that
>>>                           the biggest contributor to missing values in
>>>                           the availability facet is Meertens [8]
>>>                           (playing the blame game ;) ). And you can
>>>                           equally easily see which are the respective
>>>                           profiles (just open the profile Name facet).
>>>
>>>                           So much for our findings so far. We would
>>>                           love to hear what you think; perhaps we
>>>                           could/should arrange a telco to discuss how
>>>                           to go on with this.
>>>
>>>                           Best,
>>>
>>>                           Matej
>>>
>>>                           [1]
>>> https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>>
>>>                           [2]
>>> https://www.clarin.eu/content/license-categories
>>>
>>>                           [3]
>>> https://www.clarin.eu/content/clarin-license-category-calculator
>>>
>>>                           [4]
>>> https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>>                           [5]
>>> https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ
>>>
>>>                           [6]
>>> https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1
>>>
>>>                           [7]
>>> https://minerva.arz.oeaw.ac.at/vlo/
>>>
>>>                           [8]
>>> http://minerva.arz.oeaw.ac.at/vlo/search?fq=dataProvider:Meertens_Institute_Metadata_Repository&fq=availability:%5Bmissing+value%5D
>>>
>>>
>>>
>>>                           -------- Forwarded Message --------
>>>
>>>                           *Subject: * Re: LicenseAvailabilityMap.xml in
>>>                           vlo/trunk/vlo-commons/src/main/resources –
>>>                           CLARIN Trac
>>>
>>>                           *Date: * Sat, 17 Oct 2015 10:34:04 +0200
>>>
>>>                           *From: * Twan Goosen <twan.goosen at mpi.nl>
>>>
>>>                           *To: * Penny Labropoulou <penny at ilsp.gr>
>>>
>>>                           *CC: * Thomas Eckart
>>>                           <teckart at informatik.uni-leipzig.de>,
>>>                           Matej Durco <xnrn at gmx.net>
>>>
>>>
>>>
>>>                           That would be great. To get more information
>>>             on the mapping
>>>                           from the
>>>                           values in resourceInfo records to VLO facets,
>>>             you can enter
>>>                           the profile
>>>                           id 'clarin.eu:cr1:p_1361876010571' in the
>>>             input box of the
>>>                           "check
>>>                           profile" form at
>>>
>>> <https://lux17.mpi.nl/isocat/clarin/vlo/mapping/index.html>.
>>>
>>>                           This will give you quite a lot of information
>>>             but the
>>>                           relevant sections
>>>                           would be
>>>
>>>                                 Facet: availability
>>>                                      Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>>                                 xpath accepted
>>>                                      Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>>                                 xpath accepted
>>>
>>>                           and
>>>
>>>                                 Facet: license
>>>                                      Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>>                                 xpath accepted
>>>                                      Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>>                                 xpath accepted
>>>
>>>
>>>                           So the two fields 'licence' and
>>>                           'restrictionsOfUse' are mapped to both facets
>>>                           (via the concepts 'availability' and
>>>                           'license'). By looking at the mapping file we
>>>                           are able to see why this results in the three
>>>                           different availability levels we are now
>>>                           getting in the VLO (at least in the case of
>>>
>>> <http://catalog-clarin.esc.rzg.mpg.de/vlo/search?q=perso&fq=country:Finland>):
>>>
>>>
>>>                           - license 'CLARIN_ACA-NC' maps to 'Free for
>>>             academic use'
>>>                           - restriction 'attribution' maps to 'Free'
>>>                           - restriction 'noRedistribution' maps to
>>>             'Restricted'
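
As a rough illustration of why one record ends up with three availability levels: the sketch below uses only the three mappings quoted above (the real LicenseAvailabilityMap.xml contains many more entries), and the function name `availability_levels` is made up for the example. It is not the VLO's actual import logic.

```python
# Only the three mappings quoted in the message; values without an
# entry here are simply skipped in this sketch.
AVAILABILITY_MAP = {
    "CLARIN_ACA-NC": "Free for academic use",
    "attribution": "Free",
    "noRedistribution": "Restricted",
}


def availability_levels(licence_values, restriction_values):
    """Collect the distinct availability levels for one record.

    Both fields feed the same facet, so every value is looked up in
    the same map and each resulting level is kept once.
    """
    levels = []
    for value in list(licence_values) + list(restriction_values):
        level = AVAILABILITY_MAP.get(value)
        if level is not None and level not in levels:
            levels.append(level)
    return levels


record_levels = availability_levels(
    ["CLARIN_ACA-NC"], ["attribution", "noRedistribution"]
)
```

Because each source value is normalised independently, a single licence plus two restrictions yields three separate facet values for the same record.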
>>>
>>>                           The next step is to decide what would be the
>>>             desired mapping
>>>                           (logic).
>>>
>>>                           Best,
>>>                           Twan
>>>
>>>                           On 16/10/15 22:14, Penny Labropoulou wrote:
>>>
>>>                                 No problem! Glad to do it - it was more
>>>             or less on our
>>>                           agenda for
>>>                                 CLIC, so I'll have a look and let you
>>>             know of the
>>>                           outcomes.
>>>
>>>                                 Best,
>>>
>>>                                 Penny
>>>
>>>                                 On 16 October 2015 at 16:06, Twan Goosen
>>>                                 <twan.goosen at mpi.nl> wrote:
>>>
>>>                                     Thanks for your offer to look
>>>             through this mapping!
>>>                                     I will also send you a link to
>>>             Menzo's mapping tool.
>>>
>>>
>>> https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>                   Tf-curation mailing list
>>>             Tf-curation at lists.clarin.eu
>>> https://lists.clarin.eu/cgi-bin/mailman/listinfo/tf-curation
>>>
>>>
>>>
>>
>


