[Standards] food for thought and further steps
Piotr Banski
banski at ids-mannheim.de
Fri Jan 29 04:27:39 CET 2021
Dear all,
I would like to share a reflection on a certain inadequacy of our
current system, and on what seems to me the only sensible way to fix
that inadequacy (perhaps you will identify some other paths). The
conclusion I'm about to draw has, I think, already been voiced as a
potential future goal. It is striking that it now seems inevitable,
and no longer distant at all.
First, let me deal with the terminology: when we say "recommendations",
we sometimes mean the entire deliverable (= collection of recommended,
allowed, and maybe also deprecated formats), and sometimes the scope of
the term is narrow and it means "formats actually recommended by the
given centre as optimal for data deposition". In this message, I will
try to stick to "list of formats" for our deliverable, and reserve
"recommendations" for the narrow sense (expressing a centre's attitude
towards the given format).
Secondly, a brief note on the placement of the authority that a list of
formats can come from: the various lists that have been produced for
CLARIN so far have, pretty uniformly, I believe, assumed a top-down
approach: an authority directly or indirectly involved with the BoD
would establish a list of standards/formats and communicate this list to
the centres (or, more often, just announce it). That was definitely the
right move for a project that was in the process of constituting itself,
but, in time, and given the nature of CLARIN, the authority should
shift "downwards". It may well be that our list of formats is the first
one produced from the bottom-up perspective: the perspective that we
take as we go through the lists published by the individual centres and
pool that information into the "KPI spreadsheet".
Eventually, the reflection that I'm sharing here has led me full
circle, to a (let's call it) 'mature' top-down perspective: one that
originates in gathering the bottom-up views, structuring them with the
benefit of a global outlook, and feeding the result of that
restructuring back to the centres.
Next, let's look at the granularity of the recommendations that we want
to first pool and then share, and certain inadequacies of our first
release. It is at this point obvious to us that a format can be, minimally:
* recommended,
* allowed (tolerated), or
* discouraged.
(I say "minimally", because a little analytical task is hiding here, but
I'd rather ignore it while trying to make this message reasonably
brief...). Let's consider "being recommended", "being tolerated" and
"being discouraged" atomic properties. Now, what are they properties of?
They can be:
a. properties of a format
b. properties of a relation between a format and a domain of application
(metadata, documentation, annotated data, etc.)
c. properties of a relation between a centre and a format+domain pairing.
I mentioned an inadequacy of our current release. I think it lies in the
fact that, at the moment, we are only able to deal with (a.), while we
should be dealing with (c.), or at least with (b.).
Let me first make sure that we understand a-c in the same way: (a.)
means that the format in question (let's say, PDF/A) is either
recommended, tolerated, or discouraged 'globally' across CLARIN. We know
that that is inadequate, because, minimally, we need to see PDF/A in
relation to the domain that it is used in: if the domain is
"documentation", then it is recommended, but if it is "metadata" or
"annotated data", then it is not recommended. These statements become
possible if option (b.) above is assumed.
As far as (c.) is concerned, this is where the bottom-up approach
becomes highly relevant: one centre may recommend a format such as "TEI
for spoken language" for data deposition in the domain "annotated data",
while another centre, with a different research profile, may merely
tolerate it, or may even decide to discourage it, because it cannot
transduce it into something it can handle.
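To make (a.)-(c.) concrete, here is a minimal sketch in Python. The
centre names and the exact status values are invented for illustration;
only the PDF/A and "TEI for spoken language" examples come from the
paragraphs above.

    # (a.) a property of the format alone: one global verdict across CLARIN
    status_by_format = {
        "PDF/A": "recommended",                  # illustrative value
    }

    # (b.) a property of a (format, domain) pair
    status_by_format_and_domain = {
        ("PDF/A", "documentation"):  "recommended",
        ("PDF/A", "metadata"):       "discouraged",   # "not recommended" above;
        ("PDF/A", "annotated data"): "discouraged",   # the exact level is a guess
    }

    # (c.) a property of a (centre, format, domain) triple -- the bottom-up view
    status_by_centre_format_domain = {
        ("Centre X", "TEI for spoken language", "annotated data"): "recommended",
        ("Centre Y", "TEI for spoken language", "annotated data"): "tolerated",
    }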
In release 0.1, we only make statements at the level of (a.) above: if a
format is listed by a centre, in whichever domain, we assign it a "1"
and then count the 1s across columns. We don't have a principled way of
distinguishing between, e.g., recommended vs. tolerated, because that
distinction is not systematically represented in our sources, and a "1"
is a "1" -- it's atomic. Next, we don't have a principled way of making
global statements about the recommendation of a format within a given
domain, because centres don't use a uniform division into domains.
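In other words, the current aggregation boils down to something like
the sketch below (centre names and numbers purely illustrative):

    # Release 0.1, level (a.): a format gets a "1" for each centre that
    # lists it, in whichever domain and with whatever attitude, and the
    # 1s are then summed across the centre columns.
    listed = {                              # rows: formats, columns: centres
        "PDF/A": {"Centre X": 1, "Centre Y": 1, "Centre Z": 1},
        "TEI":   {"Centre X": 1, "Centre Y": 0, "Centre Z": 1},
    }
    totals = {fmt: sum(cols.values()) for fmt, cols in listed.items()}
    print(totals)                           # {'PDF/A': 3, 'TEI': 2}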
Now, how can we cope with that? One way is to introduce finer
distinctions in the spreadsheet, and then make a second pass across all
the lists published by the individual centres and categorize the
individual recommendations. But that, apart from person-hours, means a
massive amount of interpretation, while we should be striving towards
minimizing the amount of work put into data extraction and minimizing
the area where our own interpretation determines the outcome (in other
words: maximizing objectivity). That leads me to what I earlier dubbed
the 'mature' top-down approach: let us ourselves define the _structure_
of the information that we want from the particular centres, and have
them fill in the values in predefined, time-stamped frames, so that we
can process them uniformly and notice when they change. If that works,
the information would be harvestable with a minimal amount of
interpretation. The same information should
serve as the basis for the transformation into HTML, for rendering on
the centre's page.
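To illustrate what I have in mind, here is a rough sketch of such a
frame; the field names are only my guess at a possible skeleton, not a
proposal for the actual schema:

    from dataclasses import dataclass, asdict
    from datetime import date
    import json

    @dataclass
    class FormatStatement:
        centre: str           # which centre makes the statement
        format: str           # format name (a MIME type field could be added)
        domain: str           # value from an agreed taxonomy of domains
        recommendation: str   # value from an agreed taxonomy of properties
        valid_from: date      # time stamp, so that changes can be noticed

    statements = [
        FormatStatement("Centre X", "TEI for spoken language",
                        "annotated data", "recommended", date(2021, 1, 29)),
        FormatStatement("Centre Y", "TEI for spoken language",
                        "annotated data", "tolerated", date(2021, 1, 29)),
    ]

    # Harvesting then reduces to reading such records, and the same data
    # could feed the transformation into HTML for the centre's page.
    print(json.dumps([{**asdict(s), "valid_from": s.valid_from.isoformat()}
                      for s in statements], indent=2))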
If we decide on that, then we need to, minimally, establish the following:
i. taxonomy of properties (or 'kinds of recommendation'), minimally:
recommended, tolerated (acceptable), discouraged -- would that be
enough and easy to grasp? (or should we rather follow HZSK and its
four-way distinction?) -- a rough sketch of i and ii follows after
this list
ii. taxonomy of domains (metadata, documentation, etc.)
iii. a reliable list of formats with MIME types (not sure about file
extensions) -- drawn from those that the Switchboard uses, perhaps?
All three of the above we need one way or another, so pursuing this is
not a waste of time. We would also need:
iv. the structural skeleton for the information, and
v. ways to visualize it in the SIS and on the centres' pages.
But (iv) and (v) can safely wait (they are basically a technological issue).
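Just to show how little is needed for (i) and (ii), here is a rough
sketch; the labels are placeholders, and a fourth level would be added
if we follow HZSK:

    from enum import Enum

    class Recommendation(Enum):       # point (i): kinds of recommendation
        RECOMMENDED = "recommended"
        TOLERATED = "tolerated"       # a.k.a. acceptable
        DISCOURAGED = "discouraged"
        # a fourth member would go here with the HZSK-style distinction

    class Domain(Enum):               # point (ii): domains of application
        METADATA = "metadata"
        DOCUMENTATION = "documentation"
        ANNOTATED_DATA = "annotated data"
        # ... to be completed once we agree on the full taxonomy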
If no objections get voiced on the mailing list, then I would gladly add
(i-iii) to the agenda of the next meeting -- because we need to
establish these anyway.
Looking forward to your potential reactions to the above!
With best wishes,
Piotr
--
Piotr Bański, Ph.D.
Senior Researcher,
Leibniz-Institut für Deutsche Sprache,
R5 6-13
68161 Mannheim, Germany