[Standards] [Dev] [Teiweblicht] proposal: using a common mime type for TEI files

Piotr Bański banski at ids-mannheim.de
Fri Jul 8 09:41:45 CEST 2016


Dear Thomas and All,

[I'm a bit afraid that my message may bounce off the teiweblicht list, 
since I'm not a member, but let me try and count on the list 
administrators' "ok" for the attempt]

It's taken me a while to go through this thread together with its 
branches into the Net via the URLs quoted. I'm taken with the 
format-variant solution; it indeed seems the neatest of all for the 
purpose of processing. I've been dusting off some old TEI tickets in the 
meantime, to see how I could suggest a formalization of this on the TEI 
side, but this is of course a slightly different issue and task. A very 
new one, too, from what I can see, because indeed, apart from the quick 
round of applause after the introduction of the MIME type, nothing much 
followed, at least in the official channels.

I do not presume the authority to speak on behalf of the Standards 
Committee in this respect. In my own view, let me repeat, the solution 
seems brilliant. As far as the Standards Committee is concerned, and 
with apologies to those who have seen this message in the Standards 
mailing list, I would like to repeat my declaration of providing a 
proposal for a unified set of standards documents that could be 
advocated by CLARIN centres (and again I hasten to stress that this is 
not meant as a revolution, but rather as taking stock of the inventory 
and seeing what got obsolete beyond embarrassment and what has appeared 
on the scene in the time since the Short Guides and other proposals for 
standardization within CLARIN were created; Jan Odijk has in the 
meantime approached the issue from a slightly different but closely 
related angle, which gives me a good starting point, and I am happy to 
have received some important backchannel support for this initiative as 
well). Andreas Witt and I will present the proposal (we sometimes 
cautiously speak about it as a sketch) in Aix-en-Provence, and circulate 
it earlier among the Standards group and other interested parties. 
Naturally, we count on the support and advice of the Centres Committee 
in this endeavour.

Also, the process of standardization is not about putting a stamp of one 
committee or another over a proposal, but rather it consists in 
recognition and promotion of existing good practices, so I would say, 
let's put this idea into practice and see if it works (I guess we're all 
pretty optimistic about that), and I will be happy to document it as a 
working practice that can constitute the basis for standardization. And 
while I do that I'll keep the TEI part of my brain and life in sync with 
that -- this is a pretty fortunate moment to speak of this, given the 
approaching TEI meeting, because it gives an opportunity to seed the TEI 
Technical Council's consciousness with these ideas, and hope that they 
ripen enough by the next release cycle to get reflected in the TEI 
documents as well.

It seems like some exciting weeks may lie ahead. In the meantime, I wish 
everyone a good weekend (and some of us a good show on Sunday ;-)).

Best regards,

  Piotr Banski


On 08/07/16 09:04, Thomas Schmidt wrote:
> Dear all,
>
> in the absence of further input from the standards committee and 
> before we lose the momentum, I'd like to summarise our action plan 
> according to the discussion so far:
>
> (1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of 
> spoken language will be identified by the MIME type 
> text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1" 
> can be added to indicate whether (=1) or not (=0) the respective TEI 
> file is tokenized (i.e. has <w> markup).
> (1b) HZSK and myself will adapt the respective web services accordingly
>
> (2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be 
> identified by the MIME type text/tei+xml;format-variant=tei-dta. A 
> parameter "token=0/1" can be added to indicate whether (=1) or not 
> (=0) the respective TEI file is tokenized (i.e. has <w> markup).
> (2b) Bryan Jurish will adapt the respective web services at BBAW 
> accordingly
>
> (3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic 
> Transcriptions will be identified by the MIME type text/xml; 
> format-variant=exmaralda-exb
> (3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal 
> transcription files will be identified by the MIME type text/xml; 
> format-variant=folker-fln
> (3c) In WebLicht (in CLARIN in general?), Transcriber transcription 
> files will be identified by the MIME type text/xml; 
> format-variant=transcriber-trs
> (3d) HZSK and myself will adapt the respective web services accordingly
>
> (4a) It would have to be checked (note the passive, I don't know who 
> could be in charge of this) whether competing MIME types for these 
> file types are already registered somewhere. I know that WebLicht 
> already seems to have two variants of EXMARaLDA transcriptions. The 
> mechanisms specifying those would probably have to be deprecated. 
> Transcriber is also not unlikely to have been given some kind of 
> mimetype elsewhere in CLARIN.
> (4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and 
> PRAAT/TextGrid (the latter two being text, not XML formats). All three 
> of them are also likely to have been registered somewhere already, so 
> "someone" (again, I wouldn't know who) should check if mime types have 
> been defined for those.
>
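As a minimal sketch of how a consumer might read the type strings proposed in (1a)-(3c), the following Python snippet splits such a string into the base media type and its parameters. It relies only on the standard library's email.message.Message, which understands the Content-Type parameter syntax. The helper name parse_type is hypothetical, and the parameter is spelled "token" here as in the summary above (earlier messages in the thread spell it "tokenized").

```python
from email.message import Message


def parse_type(value):
    """Split a Content-Type style string into (media type, parameter dict)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_type() returns the lower-cased "type/subtype" part only.
    media_type = msg.get_content_type()
    # get_params() returns [(media_type, ''), (name, value), ...];
    # the first entry is the media type itself, so it is skipped here.
    params = dict(msg.get_params()[1:])
    return media_type, params


media_type, params = parse_type(
    "text/tei+xml;format-variant=tei-iso-spoken;token=1"
)
# A service could now dispatch on the variant and the tokenization flag.
is_spoken = params.get("format-variant") == "tei-iso-spoken"
is_tokenized = params.get("token") == "1"
```

Because the variant travels as a parameter, a tool that only understands the base type can simply ignore it, while a variant-aware tool can dispatch on it.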
> I guess that this is as good an answer as we can currently give to 
> address points 1-3 in Marie Hinrichs' list. @Marie: can you confirm 
> that this is sufficient for you, also to address point 4 in your list? 
> In my understanding, whatever works for WebLicht in this respect 
> should also be a suitable basis for a larger context (the SwitchBoard 
> in particular?).
>
> In my eyes, it remains crucial, however, that such standardization 
> "decisions" are centrally documented (including the information Tomaž 
> suggested). The CLARIN standards pages as they are now 
> (https://www.clarin.eu/content/standard-recommendations / 
> http://clarin.ids-mannheim.de/standards/index.xq are the ones I know) 
> are, IMHO, incomplete, inconsistent and outdated, and they certainly do 
> not provide accurate information on the mime types. Any input from the 
> standards committee on this question would therefore still be much 
> appreciated.
>
> Best,
>
> Thomas
>
>
>
>
> On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs 
> <marie.hinrichs at uni-tuebingen.de> wrote:
>
>     Hi All,
>
>     Thanks to all of you for all the work you’ve done so far to get
>     TEI processing integrated into WebLicht.
>
>     From WebLicht’s side, there are several places where some
>     work/coordination needs to happen:
>
>     1. TCF: agree on the textsource.type attribute and make sure that
>     the encoder services set it properly
>     2. Agree on type names (e.g. text/tei+xml or text/x-tei-dta-xml)
>     3. Make sure the CMDI for encoder and decoder services reflect
>     outcomes of 1 and 2
>     4. Add new mappings to WebLicht for TEI.
>
>     Steps 1-3 are being worked out here on the mailing list and
>     whichever solution/conventions you agree on are fine with us.
>
>     Step 4 requires some changes to the WebLicht code - in particular
>     to the component that we call the “profiler”. When a user uploads
>     a file, the profiler tries to figure out what it is and if any of
>     the WebLicht services can process it. The contentType of the
>     uploaded file, in combination with standard libraries for file
>     type recognition, is used for this. But sometimes more digging is
>     necessary, as in the case with tcf - which is recognized as xml,
>     but it needs a closer look to see if it is tcf.  The profiler will
>     have to be updated in a similar way to recognize TEI, and
>     hopefully there is even some straightforward way of distinguishing
>     between the DTA and the spoken variants. Finally, mappings need to
>     be established between the results of the profiler and the service
>     input types so that the right services are offered to the user for
>     selection.
>
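The "closer look" described for the profiler could be sketched roughly as follows, using only Python's standard library. This is an illustration under stated assumptions, not WebLicht's actual profiler: the function name profile is hypothetical, the TCF root element and namespace (D-Spin in http://www.dspin.de/data) and the label text/tcf+xml are assumed, and treating the presence of TEI <u> (utterance) elements as a sign of a spoken-language transcription is only a guessed heuristic. No attempt is made to recognize the DTA variant, since the thread does not say how that could be done.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TCF_NS = "http://www.dspin.de/data"  # assumption: namespace of the TCF root


def profile(xml_bytes):
    """Guess a type string for uploaded bytes; return None if not XML."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return None  # not well-formed XML, leave it to other detectors
    if root.tag == f"{{{TCF_NS}}}D-Spin":
        return "text/tcf+xml"  # assumed label for TCF
    if root.tag == f"{{{TEI_NS}}}TEI":
        # Guessed heuristic: utterance elements suggest a spoken transcript.
        if root.find(f".//{{{TEI_NS}}}u") is not None:
            return "text/tei+xml;format-variant=tei-iso-spoken"
        # Other TEI flavours (e.g. DTA) would need their own checks.
        return "text/tei+xml"
    return "text/xml"
```

The returned string could then serve as the key for the mappings between profiler results and service input types.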
>     Also note that WebLicht chains can be called from the command-line
>     or programmatically using WebLicht as a Service (WaaS) - see
>     instructions here:
>     https://weblicht.sfs.uni-tuebingen.de/WaaS/ This is useful for
>     larger inputs and avoids timeout issues that arise when using the
>     web interface.
>
>     Best Regards,
>     Marie
>
>
>>     On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si>
>>     wrote:
>>
>>     Hi,
>>
>>     as regards
>>
>>     > these format-related specifications (in this case: the name
>>     > and possible values of attributes which are used in addition
>>     > to a mime type) would need to be documented and made known at
>>     > a central place.
>>
>>     I'd say the documentation for each would need to be accompanied
>>     by its TEI schema, i.e. the TEI ODD file and the derived
>>     (probably) RelaxNG schema. Then it would be a simple matter to
>>     check if a document conforms to the mime type.
>>
>>     Best,
>>     Tomaž
>>
>>     Bryan Jurish wrote on 21/06/2016 at 14:22:
>>>     morning all,
>>>
>>>     sounds good to me.
>>>
>>>     @Marie: can you give an estimation of how well this might work
>>>     for WebLicht?
>>>
>>>     I'll add the "format-variant=tei-dta" parameter to the DTA
>>>     TEI<->TCF web service in the next few days, so we can see how
>>>     that at least works out.
>>>
>>>     marmosets,
>>>       Bryan
>>>
>>>     On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt
>>>     <thomas.schmidt at ids-mannheim.de> wrote:
>>>
>>>         Dear all,
>>>
>>>         revising my suggestions from the teiweblicht list according
>>>         to Bryan's proposal to use official mime-types plus
>>>         parameters (instead of x-extended custom mime types) would
>>>         mean that:
>>>
>>>         "text/x-tei-isospoken+xml" could become "text/tei+xml;
>>>         format-variant=tei-iso-spoken" (+ tokenized=0/1)
>>>         "text/x-tei-dta+xml" could become "text/tei+xml;
>>>         format-variant=tei-dta" (+ tokenized=0/1)
>>>         "text/x-exmaralda-exb+xml" could become "text/xml;
>>>         format-variant=exmaralda-exb"
>>>         ... and so forth (for other TEI or XML-based formats)
>>>
>>>         Wouldn't that be a Solomonic solution? What do the WebLicht
>>>         developers say? And independently of that, I think that Hanna
>>>         is right that these format-related specifications (in this
>>>         case: the name and possible values of attributes which are
>>>         used in addition to a mime type) would need to be documented
>>>         and made known at a central place. I guess it would be up to
>>>         the standards committee to decide on that?
>>>
>>>         Best regards,
>>>
>>>         Thomas
>>>
>>>
>>>
>>>
>>>
>>>         On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish
>>>         <jurish at bbaw.de> wrote:
>>>         > moin all,
>>>         >
>>>         > fwiw, I agree with Dieter that we need to differentiate
>>>         > between "proper" MIME types (i.e. standardized conventions
>>>         > registered with IANA) and CLARIN-internal (or
>>>         > WebLicht-internal) conventions. We have been using MIME
>>>         > types as the basis of the WebLicht textSource/@type
>>>         > attribute, analogous to the HTTP "Content-Type" header, cf.
>>>         > https://tools.ietf.org/html/rfc2045#section-5.1 . At the
>>>         > risk of repeating what I've already said on the
>>>         > tei-weblicht list, use of the Content-Type syntax allows us
>>>         > to have our cake and eat it too: we can go ahead and use
>>>         > "official" IANA-sanctioned "true" MIME types and specify
>>>         > variants ("dialects", "flavors") using parameters. The DTA
>>>         > TEI<->TCF converter is already doing this, setting
>>>         > textSource/@type to either "text/tei+xml; tokenized=0" or
>>>         > "text/tei+xml; tokenized=1", depending on the relevant
>>>         > properties of the input document.
>>>         >
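To illustrate the idea of deriving the parameter from properties of the input document, here is a small Python sketch that sets tokenized=1 when a TEI document contains <w> elements and tokenized=0 otherwise. It is not the DTA converter's actual code: the function name text_source_type is hypothetical, and the only property inspected is the presence of <w> in the TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def text_source_type(tei_xml):
    """Return a textSource/@type style value for a TEI document string."""
    root = ET.fromstring(tei_xml)
    # A document counts as tokenized if any descendant <w> element exists.
    has_w = root.find(f".//{{{TEI_NS}}}w") is not None
    return f"text/tei+xml; tokenized={int(has_w)}"
```

A format-variant parameter could be appended to the same string in the same way, since Content-Type parameters are simply separated by semicolons.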
>>>         > just my €0.02.
>>>         >
>>>         > marmosets,
>>>         >   Bryan
>>>         >
>>>         >
>>>         > On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck
>>>         > <dieter at clarin.eu> wrote:
>>>         >>
>>>         >> On 17/06/16 12:59, Sander Maijers wrote:
>>>         >> > After all, you would want a
>>>         >> > resource's metadata to be completely descriptive of
>>>         >> > such elementary aspects as internal structure and
>>>         >> > content of the TEI files, and not dependent on system
>>>         >> > configuration (served as custom media type x or y,
>>>         >> > as long as the server remains so configured).
>>>         >>
>>>         >> Hi Sander,
>>>         >>
>>>         >> Thank you for sharing your opinion.
>>>         >>
>>>         >> One side note: we are talking about detecting the
>>>         >> mimetype as indicated in the CMDI ResourceProxy
>>>         >> attribute, see:
>>>         >>
>>>         >> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>>>         >>
>>>         >> So for the scenario VLO -> LR switchboard -> processing
>>>         >> application
>>>         >>
>>>         >> the system configuration would not be relevant, since the
>>>         >> mimetype is explicitly mentioned in the metadata. The key
>>>         >> is to find agreement about a simple and light-weight way
>>>         >> of designating the variants of TEI.
>>>         >>
>>>         >> best,
>>>         >>
>>>         >> --
>>>         >> Dieter Van Uytvanck
>>>         >> Technical Director CLARIN ERIC
>>>         >> www.clarin.eu | tel. +31-(0)850091363 | skype:
>>>         >> dietervu.mpi
>>>         >> _______________________________________________
>>>         >> Teiweblicht mailing list
>>>         >> Teiweblicht at lists.informatik.uni-leipzig.de
>>>         >> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>>>         >>
>>>         >
>>>         >
>>>         >
>>>         > --
>>>         > ***************************************************
>>>         > Bryan Jurish
>>>         > Deutsches Textarchiv
>>>         > Digitales Wörterbuch der deutschen Sprache
>>>         > Berlin-Brandenburgische Akademie der Wissenschaften
>>>         >
>>>         > Jägerstr. 22/23
>>>         > 10117 Berlin
>>>         >
>>>         > Tel.: +49 (0)30 20370 539
>>>         > E-Mail: jurish at bbaw.de
>>>         > ***************************************************
>>>         >
>>>
>>>
>>>
>>>         --
>>>         Thomas Schmidt
>>>         IDS Mannheim
>>>         R5, 6-13
>>>         D-68161 Mannheim
>>>         Tel.: +49 (621) 1581-313
>>>         http://agd.ids-mannheim.de/index.shtml
>>>         http://www.exmaralda.org
>>>
>>>
>>>
>>>
>>>     -- 
>>>     ***************************************************
>>>     Bryan Jurish
>>>     Deutsches Textarchiv
>>>     Digitales Wörterbuch der deutschen Sprache
>>>     Berlin-Brandenburgische Akademie der Wissenschaften
>>>
>>>     Jägerstr. 22/23
>>>     10117 Berlin
>>>
>>>     Tel.: +49 (0)30 20370 539
>>>     E-Mail: jurish at bbaw.de
>>>     ***************************************************
>>
>
>
>
>
> -- 
> Thomas Schmidt
> IDS Mannheim
> R5, 6-13
> D-68161 Mannheim
> Tel.: +49 (621) 1581-313
> http://agd.ids-mannheim.de/index.shtml
> http://www.exmaralda.org


-- 
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany
