[Standards] food for thought and further steps
Piotr Banski
banski at ids-mannheim.de
Fri Jan 29 04:27:39 CET 2021
Dear all,
I would like to share a reflection on a certain inadequacy of our
current system, and on what seems to me the only sensible way to fix
that inadequacy (perhaps you will identify some other paths). The
conclusion I'm about to draw has, I think, already been voiced as a
potential future goal. It is striking that it now seems inevitable,
and no longer distant at all.
First, let me deal with the terminology: when we say "recommendations",
we sometimes mean the entire deliverable (= collection of recommended,
allowed, and maybe also deprecated formats), and sometimes the scope of
the term is narrow and it means "formats actually recommended by the
given centre as optimal for data deposition". In this message, I will
try to stick to "list of formats" for our deliverable, and reserve
"recommendations" for the narrow sense (expressing a centre's attitude
towards the given format).
Secondly, a brief note on the placement of the authority that a list of
formats can come from: the various lists that have been produced for
CLARIN so far have, pretty uniformly, I believe, assumed a top-down
approach: an authority directly or indirectly involved with the BoD
would establish a list of standards/formats and communicate this list to
the centres (or, more often, just announce it). That was definitely the
right move for a project that was in the process of constituting itself,
but, in time, and given the nature of CLARIN, the authority should
shift "downwards". It may well be that our list of formats is the first
one produced from the bottom-up perspective: the perspective that we
take as we go through the lists published by the individual centres and
pool that information into the "KPI spreadsheet".
Eventually, the reflection that I'm sharing here has led me full
circle, to a (let's call it) 'mature' top-down perspective: one that
originates in gathering the bottom-up views, structuring them with the
benefit of a global outlook, and feeding the result of that
restructuring back to the centres.
Next, let's look at the granularity of the recommendations that we want
to first pool and then share, and certain inadequacies of our first
release. It is at this point obvious to us that a format can be, minimally:
* recommended,
* allowed (tolerated), or
* discouraged.
(I say "minimally", because a little analytical task is hiding here, but
I'd rather ignore it while trying to make this message reasonably
brief...). Let's consider "being recommended", "being tolerated" and
"being discouraged" atomic properties. Now, what are they properties of?
They can be:
a. properties of a format
b. properties of a relation between a format and a domain of application
(metadata, documentation, annotated data, etc.)
c. properties of a relation between a centre and a format+domain pairing.
I mentioned an inadequacy of our current release. I think it lies in the
fact that, at the moment, we are only able to deal with (a.), while we
should be dealing with (c.), or at least with (b.).
Let me first make sure that we understand a-c in the same way: (a.)
means that the format in question (let's say, PDF/A) is either
recommended, tolerated, or discouraged 'globally' across CLARIN. We know
that that is inadequate, because, minimally, we need to see PDF/A in
relation to the domain that it is used in: if the domain is
"documentation", then it is recommended, but if it is "metadata" or
"annotated data", then it is not recommended. These statements become
possible if option (b.) above is assumed.
As far as (c.) is concerned, this is where the bottom-up approach
becomes highly relevant: one centre may recommend a format such as "TEI
for spoken language" for data deposition in the domain "annotated data",
while another centre, with a different research profile, may merely
tolerate it, or may even decide to discourage it, because it cannot
transduce it into something it can handle.
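To make (a.)-(c.) concrete, here is a minimal sketch in Python. The
centre names and the exact status values are invented for illustration;
only the PDF/A and "TEI for spoken language" examples come from the
paragraphs above.

    # (a.) a property of the format alone: one global verdict across CLARIN
    status_by_format = {
        "PDF/A": "recommended",                  # illustrative value
    }

    # (b.) a property of a (format, domain) pair
    status_by_format_and_domain = {
        ("PDF/A", "documentation"):  "recommended",
        ("PDF/A", "metadata"):       "discouraged",   # "not recommended" above;
        ("PDF/A", "annotated data"): "discouraged",   # the exact level is a guess
    }

    # (c.) a property of a (centre, format, domain) triple -- the bottom-up view
    status_by_centre_format_domain = {
        ("Centre X", "TEI for spoken language", "annotated data"): "recommended",
        ("Centre Y", "TEI for spoken language", "annotated data"): "tolerated",
    }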
In release 0.1, we only make statements at the level of (a.) above: if a
format is listed by a centre, in whichever domain, we assign it a "1"
and then count the 1s across columns. We don't have a principled way of
distinguishing between, e.g., recommended vs. tolerated, because that
distinction is not systematically represented in our sources, and a "1"
is a "1" -- it's atomic. Next, we don't have a principled way of making
global statements about the recommendation of a format within a given
domain, because centres don't use a uniform division into domains.
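In other words, the current aggregation boils down to something like
the sketch below (centre names and numbers purely illustrative):

    # Release 0.1, level (a.): a format gets a "1" for each centre that
    # lists it, in whichever domain and with whatever attitude, and the
    # 1s are then summed across the centre columns.
    listed = {                              # rows: formats, columns: centres
        "PDF/A": {"Centre X": 1, "Centre Y": 1, "Centre Z": 1},
        "TEI":   {"Centre X": 1, "Centre Y": 0, "Centre Z": 1},
    }
    totals = {fmt: sum(cols.values()) for fmt, cols in listed.items()}
    print(totals)                           # {'PDF/A': 3, 'TEI': 2}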
Now, how can we cope with that? One way is to introduce finer
distinctions in the spreadsheet, and then make a second pass across all
the lists published by the individual centres and categorize the
individual recommendations. But that, apart from person-hours, means a
massive amount of interpretation, while we should be striving towards
minimizing the amount of work put into data extraction and minimizing
the area where our own interpretation determines the outcome (in other
words: maximizing objectivity). That leads me to what I earlier dubbed
the 'mature' top-down approach: let us ourselves define the _structure_
of the information that we want from the particular centres, and have
them fill in the values in predefined, time-stamped frames, so that we
can process them uniformly and notice when they change. If that works,
the information would be harvestable with a minimal amount of
interpretation. The same information should
serve as the basis for the transformation into HTML, for rendering on
the centre's page.
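To illustrate what I have in mind, here is a rough sketch of such a
frame; the field names are only my guess at a possible skeleton, not a
proposal for the actual schema:

    from dataclasses import dataclass, asdict
    from datetime import date
    import json

    @dataclass
    class FormatStatement:
        centre: str           # which centre makes the statement
        format: str           # format name (a MIME type field could be added)
        domain: str           # value from an agreed taxonomy of domains
        recommendation: str   # value from an agreed taxonomy of properties
        valid_from: date      # time stamp, so that changes can be noticed

    statements = [
        FormatStatement("Centre X", "TEI for spoken language",
                        "annotated data", "recommended", date(2021, 1, 29)),
        FormatStatement("Centre Y", "TEI for spoken language",
                        "annotated data", "tolerated", date(2021, 1, 29)),
    ]

    # Harvesting then reduces to reading such records, and the same data
    # could feed the transformation into HTML for the centre's page.
    print(json.dumps([{**asdict(s), "valid_from": s.valid_from.isoformat()}
                      for s in statements], indent=2))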
If we decide on that, then we need to, minimally, establish the following:
i. taxonomy of properties (or 'kinds of recommendation'), minimally:
recommended, tolerated (acceptable), discouraged -- would that be
enough and easy to grasp? (or should we rather follow HZSK and its
four-way distinction?) -- a rough sketch of i and ii follows after
this list
ii. taxonomy of domains (metadata, documentation, etc.)
iii. a reliable list of formats with MIME types (not sure about file
extensions) -- drawn from those that the Switchboard uses, perhaps?
All three of the above we need one way or another, so pursuing this is
not a waste of time. We would also need:
iv. the structural skeleton for the information, and
v. ways to visualize it in the SIS and on the centres' pages.
But (iv) and (v) can safely wait (they are basically a technological issue).
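Just to show how little is needed for (i) and (ii), here is a rough
sketch; the labels are placeholders, and a fourth level would be added
if we follow HZSK:

    from enum import Enum

    class Recommendation(Enum):       # point (i): kinds of recommendation
        RECOMMENDED = "recommended"
        TOLERATED = "tolerated"       # a.k.a. acceptable
        DISCOURAGED = "discouraged"
        # a fourth member would go here with the HZSK-style distinction

    class Domain(Enum):               # point (ii): domains of application
        METADATA = "metadata"
        DOCUMENTATION = "documentation"
        ANNOTATED_DATA = "annotated data"
        # ... to be completed once we agree on the full taxonomy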
If no objections get voiced on the mailing list, then I would gladly add
(i-iii) to the agenda of the next meeting -- because we need to
establish these anyway.
Looking forward to your potential reactions to the above!
With best wishes,
Piotr
--
Piotr Bański, Ph.D.
Senior Researcher,
Leibniz-Institut für Deutsche Sprache,
R5 6-13
68161 Mannheim, Germany