[Tf-curation] License/Availability was WG: Re: LicenseAvailabilityMap.xml in vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
Twan Goosen
twan.goosen at mpi.nl
Tue Jan 19 10:29:29 CET 2016
(I'm putting Dieter in the loop because I know that he's interested in
this as well)
On 19/01/16 10:28, Twan Goosen wrote:
> Hi Krister, all,
>
> On 18/01/16 15:40, Krister Lindén wrote:
>> [..] conceptual division of language resources into three main access
>> categories, i.e. resources
>> - which are publicly or openly available (PUB),
>> - which are available for research or academic use (ACA), or
>> - which are restricted to individual use (RES).
>> Note that PUB resources are available to everyone, whereas ACA
>> resources normally imply a trusted community, and RES resources
>> typically contain personal data requiring an individual access
>> permission.
> Thanks a lot for those definitions and the elaboration. After a recent
> discussion I was more or less convinced that we should have a "licence
> category" facet as the main filter mechanism in the VLO that includes
> these values. However from your description I have to conclude that
> these categories only indicate availability rather than type of
> licence (even though these two are not completely orthogonal).
> Also, do I understand correctly that it is feasible to map all resources
> to one of these categories/levels, as long as they provide sufficient
> licence and/or access information? I think this would be preferable
> for use in the VLO, since a coarse initial distinction is more useful
> than a really fine grained one. And I don't think that, for instance,
> the standard CC restrictions make a resource less public in the eyes of
> our users.
>
> I do have to say that I'm still a bit confused about the _exact_
> relations between licence (type), accessibility and availability.
> "Public" resources do not have to be directly openly accessible (in
> some cases only upon request), while "restricted" resources can be
> freely accessible, at least technically. How do such apparent
> contradictions get reflected in the availability category for a given
> resource? It would be nice if we had the description of an underlying
> "algorithm" or at least some formulation of the heuristic on which our
> mapping is based. Most likely it exists, but I have not been paying
> attention - in which case my apologies :)
>
> In any case, from the user's perspective something like "availability"
> (however we define it exactly) does indeed seem to be the most useful
> primary dimension to filter by. I think we will have to reconsider the
> decision to switch from availability to licence category, if the rest
> of the VLO crowd agrees.
>> In hindsight, the acronyms could have been chosen differently
>> especially when compared with the current ORCID categories, but the
>> acronyms were chosen before the ORCID community appeared. Now the
>> CLARIN license category acronyms have been around for so long that
>> they are simply three-letter acronyms.
> I agree that it should be fine to use them, especially if we can
> provide a description or definition for each of them simultaneously
> (which can be done in the VLO).
>
> Best,
> Twan
>> On 18.1.2016 15:37, Sander Maijers wrote:
>>> Hi Krister,
>>>
>>> Requests for access to trac.clarin.eu can be
>>> directed to trac at clarin.eu. As it happens I
>>> handle those requests.
>>> I've checked and you do have access already. You can log in using your
>>> e-mail address krister.linden at helsinki.fi
>>> and your password for www.clarin.eu. So, your CLARIN account. :)
>>>
>>> Best,
>>> Sander
>>> --
>>> *Sent as system administrator and engineer for CLARIN*
>>> /Centre Registry & Service Provider Federation/ @
>>> {centres,infra}.clarin.eu <http://clarin.eu>;
>>> /software engineering tools/ @ {svn,trac}.clarin.eu <http://clarin.eu>;
>>> /identity and access management/ @ {user,idp}.clarin.eu
>>> <http://clarin.eu>
>>> /usage statistics and service monitoring/ @ stats.clarin.eu
>>> <http://stats.clarin.eu>
>>>
>>> Max Planck Institute for Psycholinguistics <https://tla.mpi.nl/>,
>>> software developer
>>> personal Skype: sander.maijers | work address: Wundtlaan 1, 6525 XD,
>>> Nijmegen (NL)
>>>
>>>
>>>
>>> On Mon, Jan 18, 2016 at 11:35 AM, Krister Lindén
>>> <krister.linden at helsinki.fi> wrote:
>>>
>>> Dear Matej,
>>>
>>> Thanks for the prompting. You provide a link to a spreadsheet [5] in
>>> which there seem to be only approx. 240 lines, many of which are
>>> already correctly mapped, so an overhaul of them by me with some
>>> keen-eyed checking by Penny should be doable within the next few
>>> days.
>>>
>>> (Your tentative tags REG and REQ are already covered by ID and RES.
>>> Most of the other rather broad AVAILABILITY strings seem to map
>>> nicely to existing tags as well as suggested by the current
>>> mapping.)
>>>
>>> For the majority (90%) of resources that do not have any
>>> availability record or license, the idea was that CLIC could
>>> contribute with each country taking care of its own resources.
>>>
>>> (Note. I could not enter the trac page [6] as it seems to require a
>>> login and password.)
>>>
>>> If anyone needs reassurance that PUB/ACA/RES is a viable division of
>>> access criteria for resources, you can have a look at ORCID, where
>>> they have adopted a similar tripartite view of the world of how
>>> researchers share their data (= papers and resources). ORCID even
>>> uses the same color coding as we do for their categories: public
>>> [=green], trusted [=yellow], and private [=red]. ORCID calls their
>>> categories "Privacy Settings" for their researcher data:
>>> http://support.orcid.org/knowledgebase/articles/124518-orcid-privacy-settings
>>>
>>> This is the same idea as availability, but "Privacy settings" are
>>> the categories seen from the rights holder's perspective instead of
>>> the end-user's point of view.
>>>
>>> Regards,
>>> Krister
>>>
>>> On 17.1.2016 16:33, Durco, Matej wrote:
>>>
>>> Dear all,
>>>
>>> time passes by ...
>>> We would like to follow-up on our discussion in November,
>>> December regarding availability and license information in
>>> the VLO.
>>>
>>> We would like to try the following setup:
>>> 1. introduce a License Category (search) facet using as values
>>> the license category tags as defined by CLIC [1]
>>> 2. restrict the License field [2] to only feature URLs or clear
>>> identifiers of licenses
>>> 3. (optionally) use the Availability field (or whatever we want to
>>> call it) as a catch-all with the original information (as found in
>>> the metadata records) that is not covered by the previous two.
>>>
>>> 1. means that we would preserve the PUB/ACA/RES distinction, but
>>> also add the more fine-grained distinction (NC;SA;BY;...)
>>> Not for the next 3.4 release but relatively soon it will be
>>> possible to select multiple values within one facet (already
>>> possible at our dev instance [3]), which I believe would be the
>>> cleanest solution, allowing the user to easily restrict the
>>> search to whichever aspect (or a combination thereof). Until
>>> then we still have to see if we need to divide the License
>>> Categories into two facets: the main distinction + the fine-grained
>>> categories. We will let you test and give feedback on both
>>> possibilities.
>>>
>>> Now, we would need your help!
>>> We updated the map (as it is being used now [4])
>>> a) with the fine-grained distinctions
>>> b) with new previously unmapped values.
>>>
>>> Here is the working spreadsheet [5]
>>> column C contains the currently valid mapping, column D the
>>> tentative new one
>>> The mapping is a bit more complex:
>>> - You can map one value to multiple values using semicolons (e.g.
>>> PUB;BY;NC;SA), still in one cell
>>> - You can use two dashes ("--") to indicate that this value
>>> should be disregarded for the facet (for obviously erroneous
>>> values)
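
The cell syntax described above can be sketched in Python roughly as follows. This is only an illustration of the two rules (semicolon-separated tags, "--" for values to disregard), not the code the VLO actually uses; the function name parse_mapping_cell is hypothetical, and treating an empty cell like "--" is an assumption.

```python
from typing import List


def parse_mapping_cell(cell: str) -> List[str]:
    """Split one spreadsheet cell into facet tags.

    Hypothetical helper: "--" marks a value to be disregarded for the facet.
    An empty cell is treated the same way (an assumption, not from the thread).
    """
    cell = cell.strip()
    if cell == "--" or not cell:
        return []
    # One original value may map to several tags, separated by semicolons.
    return [tag.strip() for tag in cell.split(";") if tag.strip()]


if __name__ == "__main__":
    print(parse_mapping_cell("PUB;BY;NC;SA"))
    print(parse_mapping_cell("--"))
```

Returning an empty list for "--" lets the caller skip the value without a special case.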
>>>
>>> I also took the liberty to tentatively introduce two more tags:
>>> REG := after registration
>>> REQ := upon request
>>>
>>> If you propose a change, please, comment on it (with your name)
>>> in one of the next columns
>>>
>>> We put together some information regarding the two facets and
>>> the mapping in CLARIN trac [6]
>>> Also, on our dev instance of VLO [3] we added additional facets
>>> allowing to better explore the details of the mapping.
>>> There are facets:
>>> - AVAILABILITYORIG := the values as they were found in the
>>> records
>>> - AVAILABILITY := the mapped/normalized values (according to the
>>> currently used normalization map)
>>> - C-* := each concept contributing to the AVAILABILITY facet
>>> listed separately.
>>>
>>> Example:
>>> filter by "Availability:restricted" [7]
>>> gives you 5438 records.
>>> In the AVAILABILITYORIG facet you get the original values listed
>>> (next to "restricted" itself, HZSK-RES, available-restrictedUse,
>>> etc.)
>>> and these same values further broken down by the concept
>>> (underlying the CMD element from which the values come) in
>>> the facets C-* ...
>>> There you can further restrict, e.g. by
>>> "c-license:CC-BY-NC-ND" [8]
>>> The VLO-dev at minerva instance [3] features also extra facets
>>> PROFILE NAME (, PROFILE ID) and DATA PROVIDER (with the actual
>>> provider listed), so once you filtered [8], you can nicely see
>>> which profiles and data providers contribute given values.
>>> (In the case of [8], it is LINDAT, CLARIN_PL and Language Bank
>>> of Finland with metashare profiles (data and resourceInfo) )
>>> Obviously you can also use it the other way round and restrict
>>> by provider or profile first and then see all the values
>>> contributed by those.
>>>
>>>
>>> Then there is what I see as the more dramatic issue of too
>>> many records not providing any licensing/availability
>>> information (around 90% !), but I would spare that for a
>>> separate email before this one gets too lengthy.
>>>
>>> I am sorry I did not come back to you earlier, but we would be
>>> very grateful if we could have your input/feedback in the next
>>> days, as we are approaching the VLO-3.4 milestone (and also
>>> simply to make progress for better user experience in resource
>>> discovery)
>>>
>>> Best,
>>> Matej
>>>
>>>
>>> [1]
>>> https://www.clarin.eu/content/clarin-license-category-calculator
>>> [2] "field" means a facet that is only visible in the detail
>>> view of the record and is not displayed as search facet
>>> [3] https://minerva.arz.oeaw.ac.at/vlo/?0
>>> [4]
>>> https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>> [5]
>>> https://docs.google.com/spreadsheets/d/1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ/edit#gid=0
>>>
>>> [6]
>>> https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>>
>>> [7]
>>> http://minerva.arz.oeaw.ac.at/vlo/search?fq=availability:Restricted
>>> [8]
>>> http://minerva.arz.oeaw.ac.at/vlo/search?fq=c-license:CC-BY-NC-ND&fq=availability:Restricted
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>>> Sent: Tuesday, 1 December 2015 00:01
>>> To: Sander Maijers <sander at clarin.eu>
>>> Cc: Penny Labropoulou <penny at ilsp.gr>;
>>> Durco, Matej <Matej.Durco at oeaw.ac.at>; Menzo Windhouwer2
>>> <menzo.windhouwer at meertens.knaw.nl>; Twan Goosen
>>> <twan.goosen at mpi.nl>; Thomas Eckart
>>> <teckart at informatik.uni-leipzig.de>; Ostojic, Davor
>>> <Davor.Ostojic at oeaw.ac.at>;
>>> tf-curation at lists.clarin.eu
>>> Subject: Re: [Tf-curation] License/Availability was WG: Re:
>>> LicenseAvailabilityMap.xml in
>>> vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>>
>>> CLIC is sorely aware of the lack of agreement on licenses in
>>> general (let alone on sorting them).
>>>
>>> Normally there will be only a few resources matching a
>>> particular real end-user resource query, so most users will
>>> normally be satisfied with perusing the full search result.
>>>
>>> However, providing the capability to sort the resources
>>> according to their main CLARIN license category
>>> (PUB/ACA/RES/undefined) will give a rough general preference to
>>> open resources and will take most users as far as they care to
>>> go along this particular line of investigation.
>>>
>>> If we in addition provide the capability to exclude certain
>>> laundry tags and the capability to provide part of a license
>>> name as a filter, we already cover most foreseeable simple
>>> searches.
>>>
>>> There will always be those who want more, but for those there
>>> can be some advanced search to be defined based on user
>>> feedback.
>>>
>>> --
>>> Krister
>>>
>>> On 30.11.2015 23:58, Sander Maijers wrote:
>>>
>>> Hi all,
>>>
>>> Making license type ordinal in the sense of permissiveness favors one
>>> interpretation of what users care about in a license, and the
>>> interpretation is bound to be arbitrary to some degree. That we all
>>> knew, but is everyone aware of the lack of consensus on such an order
>>> and categorization even in primary sources?
>>>
>>> E.g., see the different categorizations in highly referenced sources
>>> such as:
>>> 1. https://opensource.org/licenses/category
>>>
>>> 2.
>>> https://www.iprhelpdesk.eu/sites/default/files/newsdocuments/Intellectual%20Property%20Rights%20Management%20in%20Software%20Developments_updated.pdf
>>>
>>> 3.
>>> https://en.wikipedia.org/wiki/License_compatibility#Compatibility_of_FOSS_licenses
>>>
>>> 4. Academic take:
>>> http://jleo.oxfordjournals.org/content/21/1/20.full.pdf+html:
>>>
>>> We will consider three classes of licenses: unrestrictive [e.g., the
>>> Berkeley Software Definition (BSD) license], restrictive [e.g.,
>>> lesser general public license (LGPL)], and highly restrictive
>>> [general public license (GPL)]. (See below for a more complete
>>> discussion of these licenses.)
>>>
>>>
>>> In conclusion, a resource license as encoded in metadata ought to be
>>> an enumerated/sum type and it is a matter of search implementation how
>>> to rank, unify and filter its levels.
>>>
>>> Best,
>>> Sander
>>>
>>>
>>>
>>> On Mon, Nov 30, 2015 at 1:02 PM, Krister Lindén
>>> <krister.linden at helsinki.fi> wrote:
>>>
>>> Quick answer to Penny's question about PUB/ACA/RES is that they
>>> should also be facets that you can restrict, i.e. choose to get data
>>> that is neither ACA nor RES leaving only open or public data with
>>> varying licenses.
>>>
>>> Despite our good efforts to collect data, people will be happy to
>>> find a resource at all. I do not really think that they have the
>>> luxury of choosing e.g. whether they want a treebank for a
>>> particular language with a CC-BY and not an MIT license, or vice
>>> versa, but they may wish to say that they want a treebank with an
>>> open or public license, if available.
>>>
>>> It is this final "if available", that has gotten me thinking that we
>>> should probably also let legal metadata provide a sorting order,
>>> because other criteria will be more important, i.e. if a favorite
>>> license is not on offer, I may settle for a
>>> slightly-more-difficult-to-manage license, as the legal status of
>>> the resource is more like a price tag, i.e. it will cost me more
>>> effort to deal with a restricted resource than an open one, but if I
>>> need an English speech data set, I will not settle for some Russian
>>> text data simply because the license is more interesting.
>>>
>>> This said, if it is not too much of an effort, we could of course
>>> provide the option to also write the name (or part of a name) of a
>>> license as a search criterion. After all, that can be implemented as
>>> rather straightforward string matching in the license name field.
>>>
>>> --
>>> Krister
>>>
>>> On 30.11.2015 13:11, Penny Labropoulou wrote:
>>>
>>> Hi Matej, Krister and all
>>>
>>> Some thoughts on the topics raised:
>>> - license & availability are indeed too close semantically and
>>> that's where the confusion comes from; moreover, for the
>>> normalization, the values are taken from different attributes,
>>> which brings about the contradicting outcomes we noticed in Wroclaw.
>>> - Now, if I understand the new approach correctly, both facets
>>> will be replaced by the License Categories, is that it? If yes,
>>> I think this would improve the situation and we need to check
>>> the new mappings. In this case, my only question to Krister and
>>> the CLIC is whether PUB/ACA/RES should be treated at the
>>> same level as the other tags.
>>> - in this scenario, I agree in principle with Krister's email
>>> about the sorting of the resources when shown to the user - but
>>> I don't know what other sortings (apart from alphabetical
>>> ordering on the resource name) you have also implemented on the
>>> new VLO; it would also be nice to state somewhere that this is
>>> the ordering of the resources or to allow the user to decide on the
>>> sorting, perhaps?
>>> - the only problem I have with keeping only license categories
>>> in the facets is that we "lose" information on resources that
>>> are licensed with a standard license, e.g. CC, GNU etc. Could we
>>> have a second facet for these? Problems with this: (a) confusion
>>> between the two facets (will the user understand that
>>> "attribution" will give him more results than CC-BY?) and (b)
>>> mapping of resources without a standard license value... I'm
>>> just putting up the question without having a definite answer.
>>>
>>> And to Matej's question: sorry, I haven't done anything yet :-(.
>>> Do you have a deadline for the normalization issues? I can look
>>> at it closer in the next couple of weeks, taking into account
>>> the current discussion outcomes. And, if possible, it would be
>>> nice to know the attributes these concepts come from and the
>>> combinations thereof (i.e. if the same resource has two or more
>>> licensing-related attributes and/or values, get the combinations
>>> thereof).
>>>
>>> Best,
>>> Penny
>>>
>>>
>>> -----Original Message-----
>>> From: Krister Lindén [mailto:krister.linden at helsinki.fi]
>>> Sent: Friday, November 27, 2015 5:55 PM
>>> To: Durco, Matej <Matej.Durco at oeaw.ac.at>; penny at ilsp.gr;
>>> Menzo Windhouwer2 <menzo.windhouwer at meertens.knaw.nl>;
>>> Twan Goosen <twan.goosen at mpi.nl>;
>>> Thomas Eckart <teckart at informatik.uni-leipzig.de>;
>>> Ostojic, Davor <Davor.Ostojic at oeaw.ac.at>
>>> Cc: tf-curation at lists.clarin.eu
>>> Subject: Re: License/Availability was WG: Re:
>>> LicenseAvailabilityMap.xml in
>>> vlo/trunk/vlo-commons/src/main/resources –
>>> CLARIN Trac
>>>
>>> Dear all,
>>>
>>> One thing we discussed afterwards in CLIC, was the best approach
>>> to utilize the legal metadata in a query. We have no illusions
>>> that the legal metadata would be the primary criterion people
>>> use to select their data. Other criteria such as name of the
>>> resource, languages covered, data type and usage purpose are
>>> probably more crucial, but if users have a choice, they probably
>>> look for resources that have a clearly defined legal status and
>>> have as few restrictions as possible.
>>>
>>> "As few restrictions as possible" implies a sorting order.
>>> However, a user may dislike some restrictions for practical
>>> purposes. Therefore, it would make sense to let the user check
>>> the tags he would like to filter out and if none are checked,
>>> none are filtered, i.e. all resources conforming to the primary
>>> criteria are shown. In addition, the user should be able to
>>> filter out resources whose legal status is undefined, because
>>> the user would not know how to legally use the resource, even if
>>> it exists. If displayed within a sorting order of "as few
>>> restrictions as possible", resources with undefined legal status
>>> should be at the end of the list.
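
The filtering and ordering described above could be sketched in Python like this. It is a rough illustration under stated assumptions, not a proposal for the VLO implementation: the function name filter_and_sort and the (name, tags) data shape are hypothetical, "undefined legal status" is modelled as None, and "as few restrictions as possible" is approximated simply by counting restriction tags.

```python
from typing import Iterable, List, Optional, Tuple

# (resource name, restriction tags); None stands for an undefined legal status.
Resource = Tuple[str, Optional[List[str]]]


def filter_and_sort(
    resources: Iterable[Resource],
    excluded_tags: Iterable[str] = (),
    hide_undefined: bool = False,
) -> List[Resource]:
    """Drop resources carrying an excluded tag, then order by restriction count.

    Assumption: fewer tags means fewer restrictions. Resources with an
    undefined status are kept (unless hidden) and always sorted last.
    """
    excluded = set(excluded_tags)
    kept = []
    for name, tags in resources:
        if tags is None:
            if not hide_undefined:
                kept.append((name, tags))
        elif not excluded.intersection(tags):
            kept.append((name, tags))
    # Sort key: defined before undefined, then by number of restriction tags.
    return sorted(kept, key=lambda r: (r[1] is None, len(r[1] or [])))


if __name__ == "__main__":
    demo = [("A", []), ("B", ["BY", "NC"]), ("C", None), ("D", ["BY"])]
    print(filter_and_sort(demo))
    print(filter_and_sort(demo, excluded_tags=["NC"]))
```

With no tags excluded nothing is filtered, which matches the "if none are checked, none are filtered" behaviour.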
>>>
>>> Regards,
>>> Krister
>>>
>>> On 27.11.2015 16:08, Durco, Matej wrote:
>>>
>>> Dear all,
>>>
>>> I found out only very late that there was a follow-up on the License
>>> issue right after the conference (see email below).
>>>
>>> Penny, were you able to proceed on that?
>>>
>>> Meanwhile we experimented quite a bit and compiled information, so
>>> here is our current take on this from our (TF Curation /
>>> ACDH-OEAW) side:
>>>
>>> We put down an overview (and would like to collect there more findings
>>> and decisions as we go along) in clarin-trac [1]
>>>
>>> Main points:
>>>
>>> 1. Some of the concepts are linked to both facets (not necessarily bad,
>>> but a hint that we don’t have a clear distinction)
>>>
>>> 2. There is a normalisation file employed, which is however incomplete
>>> (new unmapped values exist, some of which are obviously in the
>>> completely wrong place (like size in kB))
>>>
>>> 3. With the current concept-mapping we cover only some 60.000 out
>>> of 800.000 records !!!
>>>
>>> Regarding 2: the Normalization
>>>
>>> The current normalization uses the 3-4 values distinction: Free; Free
>>> for academic use; Restricted; Upon request (in line with PUB/ACA/RES –
>>> laundry tags)
>>>
>>> This sounds easy, but as far as I could gather, it is problematic (in
>>> many ways).
>>>
>>> In Wroclaw, we discussed with Krister an alternative approach:
>>>
>>> We could try to map to the license categories as they are defined [2] by
>>> the Legal Issues Committee and available also in the License Category
>>> Calculator [3]. By that we would avoid the problematic reduction, still
>>> keeping the “laundry-tag” approach. And we would be in sync with the
>>> Legal committee recommendations. Also each of these atomic tags is well
>>> defined and most of them are widely used on the web.
>>>
>>> We could employ here the decomposition approach, in line with what we
>>> try to adopt for resourceType and other facets, that means, we wouldn’t
>>> have facet values: [“PUB”, “PUB+BY”, “PUB+BY+SA”] but rather [“PUB”,
>>> “BY”, “SA”].
>>>
>>> Allowing multiple possible values for the facet in each record, in
>>> combination with the (already implemented) multi-select feature in VLO,
>>> should cover all use cases and be more ergonomic (e.g. if I am
>>> interested only in the Non-Commercial clause, I need to select only one
>>> facet value and don’t have to search for all the combinations that
>>> contain NC.)
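
The decomposition idea can be illustrated with a short Python sketch. It assumes combined values are written with "+" as in the example above; the record ids, the function names decompose and select, and the choice to treat multi-select as "all selected tags must be present" are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Set


def decompose(combined: str) -> Set[str]:
    """Split a combined value such as "PUB+BY+SA" into atomic tags."""
    return {part.strip() for part in combined.split("+") if part.strip()}


def select(index: Dict[str, Set[str]], required: Iterable[str]) -> List[str]:
    """Return ids of records whose tag set contains every required tag.

    Assumption: multi-select is conjunctive (AND); the thread does not say.
    """
    wanted = set(required)
    return sorted(rid for rid, tags in index.items() if wanted <= tags)


if __name__ == "__main__":
    # Hypothetical records, each with one combined facet value.
    records = {"rec1": "PUB", "rec2": "PUB+BY", "rec3": "PUB+BY+SA", "rec4": "ACA+NC"}
    index = {rid: decompose(value) for rid, value in records.items()}
    # Selecting the single atomic value NC finds every combination containing it.
    print(select(index, ["NC"]))
    print(select(index, ["PUB", "SA"]))
```

With atomic tags stored per record, a user interested in one clause selects one facet value instead of hunting for every combined string that contains it.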
>>>
>>> There is already a normalisation map used in production
>>>
>>> <https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml>
>>>
>>> [4] (committed 2015-04-23). But there are new values that are not mapped
>>> yet. Normalisation map as gsheet
>>>
>>> <https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ>
>>>
>>> [5] with already existing mappings (see normalisation map above) + new
>>> values encountered not yet normalized; values come from elements
>>> annotated with concepts linked to one of the two facets
>>> License/Availability.
>>>
>>> If we agree on the decomposition approach, this list would need to be
>>> reviewed completely, but it’s just around 240 entries.
>>>
>>> And ad 3. Missing values
>>>
>>> Here we have 3 possible situations:
>>>
>>> 1. Profile does not have any information about licensing/availability
>>> (worst case)
>>>
>>> 2. Profile has information about L/A, but is not linked to a concept, or
>>> the concept is not in the facet mapping
>>>
>>> 3. Profile is well defined, with linking to one of the concepts in the
>>> facet mapping, but the information is simply not filled in the record.
>>>
>>> We prepared a list of profile/facet coverage
>>>
>>> <https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1>
>>>
>>> [6] with special consideration of the availability and licensing facets.
>>> In particular, the individual concepts contributing to the facet are
>>> also plotted (see the c-* columns).
>>>
>>> If you want to further investigate this issue, I strongly recommend our
>>> experimental instance of the VLO on Minerva
>>> <https://minerva.arz.oeaw.ac.at/vlo/> [7].
>>>
>>> It features normalized and unnormalized facets, explicit [missing
>>> values], profileID and name as facets, a data provider facet showing the
>>> actual data provider, multi-value selection and also special facets for
>>> the concepts contributing to the availability facet (i.e. every concept
>>> is plotted as a separate facet; these are marked with the prefix c-)
>>>
>>> With all this you can only too easily see that the biggest contributor to
>>> missing values in the availability facet is Meertens [8] (playing the
>>> blame game ;) And you can equally easily see what the respective
>>> profiles are (just open the profile Name facet).
>>>
>>> So much for our findings until now. We would love to hear from you what
>>> you think; perhaps we c/should arrange a telco to discuss how to go
>>> on about this.
>>>
>>> Best,
>>>
>>> Matej
>>>
>>> [1]
>>> https://trac.clarin.eu/wiki/Taskforces/Curation/ValueNormalization/License
>>>
>>> [2]
>>> https://www.clarin.eu/content/license-categories
>>>
>>> [3]
>>> https://www.clarin.eu/content/clarin-license-category-calculator
>>>
>>> [4]
>>> https://github.com/clarin-eric/VLO/blob/master/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>> [5]
>>> https://drive.google.com/open?id=1Pf8Jk_P7RaA-7-dj8fcLOKNH5DjprraFEWXgQ3FvVtQ
>>>
>>> [6]
>>> https://docs.google.com/spreadsheets/d/1eeOr0ShOWxdY8BLzp62LDyfGgHo0gZ95Myw0qauzLxU/edit#gid=0&vpid=A1
>>>
>>> [7] https://minerva.arz.oeaw.ac.at/vlo/
>>>
>>> [8]
>>> http://minerva.arz.oeaw.ac.at/vlo/search?fq=dataProvider:Meertens_Institute_Metadata_Repository&fq=availability:%5Bmissing+value%5D
>>>
>>>
>>> -------- Forwarded Message --------
>>> Subject: Re: LicenseAvailabilityMap.xml in
>>> vlo/trunk/vlo-commons/src/main/resources – CLARIN Trac
>>> Date: Sat, 17 Oct 2015 10:34:04 +0200
>>> From: Twan Goosen <twan.goosen at mpi.nl>
>>> To: Penny Labropoulou <penny at ilsp.gr>
>>> CC: Thomas Eckart <teckart at informatik.uni-leipzig.de>,
>>> Matej Durco <xnrn at gmx.net>
>>>
>>>
>>>
>>> That would be great. To get more information on the mapping from the
>>> values in resourceInfo records to VLO facets, you can enter the profile
>>> id 'clarin.eu:cr1:p_1361876010571' in the input box of the "check
>>> profile" form at
>>>
>>> <https://lux17.mpi.nl/isocat/clarin/vlo/mapping/index.html>.
>>>
>>> This will give you quite a lot of information
>>> but the
>>> relevant sections
>>> would be
>>>
>>> Facet: availability
>>> Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>> xpath accepted
>>> Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>> xpath accepted
>>>
>>> and
>>>
>>> Facet: license
>>> Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2457_45bbaa1a-7002-2ecd-ab9d-57a189f694a6
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:licence/text()
>>> xpath accepted
>>> Matched CMD Element ConceptLink:
>>> http://hdl.handle.net/11459/CCR_C-2453_1f0c3ea5-7966-ae11-d3c6-448424d4e6e8
>>> /c:CMD/c:Components/c:resourceInfo/c:distributionInfo/c:licenceInfo/c:restrictionsOfUse/text()
>>> xpath accepted
>>>
>>>
>>> So the two fields 'licence' and 'restrictionsOfUse' are mapped to both
>>> facets (via the concepts 'availability' and 'license'). By looking at
>>> the mapping file we are able to see why this results in the three
>>> different availability levels we are now getting in the VLO (at least in
>>> the case of
>>>
>>> <http://catalog-clarin.esc.rzg.mpg.de/vlo/search?q=perso&fq=country:Finland>):
>>>
>>> - license 'CLARIN_ACA-NC' maps to 'Free for
>>> academic use'
>>> - restriction 'attribution' maps to 'Free'
>>> - restriction 'noRedistribution' maps to
>>> 'Restricted'
>>>
>>> The next step is to decide what would be the
>>> desired mapping
>>> (logic).
>>>
>>> Best,
>>> Twan
>>>
>>> On 16/10/15 22:14, Penny Labropoulou wrote:
>>>
>>> No problem! Glad to do it - it was more or less on our agenda for
>>> CLIC, so I'll have a look and let you know of the outcomes.
>>>
>>> Best,
>>>
>>> Penny
>>>
>>> On 16 October 2015 at 16:06, Twan Goosen
>>> <twan.goosen at mpi.nl> wrote:
>>>
>>> Thanks for your offer to look through this mapping!
>>> I will also send you a link to Menzo's mapping tool.
>>>
>>>
>>> https://trac.clarin.eu/browser/vlo/trunk/vlo-commons/src/main/resources/LicenseAvailabilityMap.xml
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Tf-curation mailing list
>>> Tf-curation at lists.clarin.eu
>>> https://lists.clarin.eu/cgi-bin/mailman/listinfo/tf-curation
>>>
>>>
>>>
>>
>