[Standards] [Dev] [Teiweblicht] proposal: using a common mime type for TEI files

Fri Jul 8 17:42:36 CEST 2016

Dear All,

I have summarized Thomas's proposal at 
https://trac.clarin.eu/wiki/MIME%20format%20variants

I'll also try to be the "someone" whom Thomas has hesitated to name. It 
will provide me with a good opportunity to look into the various corners 
of CLARIN's infrastructure that I could otherwise overlook.

Best regards,

   Piotr

On 08/07/16 10:25, Thomas Schmidt wrote:
> Sorry: as long as this discussion is the reference document, I should
> point out that I made a mistake:
>
>> A parameter "token=0/1" can be added to indicate whether (=1) or
>> not (=0) the respective TEI file is tokenized (i.e. has <w> markup)
> The name of the parameter as described by Bryan is "tokenized", not "token".
>
> - Thomas
>
>
>
> On Fri, Jul 8, 2016 at 9:04 AM, Thomas Schmidt
> <thomas.schmidt at ids-mannheim.de> wrote:
>> Dear all,
>>
>> in the absence of further input from the standards committee and before we
>> lose the momentum, I'd like to summarise our action plan according to the
>> discussion so far:
>>
>> (1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of spoken
>> language will be identified by the MIME type
>> text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1" can be
>> added to indicate whether (=1) or not (=0) the respective TEI file is
>> tokenized (i.e. has <w> markup).
>> (1b) HZSK and myself will adapt the respective web services accordingly
>>
>> (2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be identified by
>> the MIME type text/tei+xml;format-variant=tei-dta. A parameter "token=0/1"
>> can be added to indicate whether (=1) or not (=0) the respective TEI file is
>> tokenized (i.e. has <w> markup).
>> (2b) Bryan Jurish will adapt the respective web services at BBAW accordingly
>>
>> (3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions
>> will be identified by the MIME type text/xml; format-variant=exmaralda-exb
>> (3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal transcription
>> files will be identified by the MIME type text/xml;
>> format-variant=folker-fln
>> (3c) In WebLicht (in CLARIN in general?), Transcriber transcription files
>> will be identified by the MIME type text/xml; format-variant=transcriber-trs
>> (3d) HZSK and myself will adapt the respective web services accordingly
>>
>> (4a) It would have to be checked (note the passive, I don't know who could
>> be in charge of this) whether competing MIME types for these file types are
>> already registered somewhere. I know that WebLicht already seems to have two
>> variants of EXMARaLDA transcriptions. The mechanisms specifying those would
>> probably have to be deprecated. Transcriber is also not unlikely to have
>> been given some kind of mimetype elsewhere in CLARIN.
>> (4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and PRAAT/TextGrid
>> (the latter two being text, not XML formats). All three of them are also
>> likely to have been registered somewhere already, so "someone" (again, I
>> wouldn't know who) should check if mime types have been defined for those.
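For illustration, here is a minimal sketch of how a consumer could read the type strings from points (1a)-(3c). It uses Python's standard email.message module, which understands the Content-Type syntax. The function name parse_media_type is made up for this example, and the parameter is spelled "tokenized", following the correction made elsewhere in this thread.

```python
from email.message import Message
from typing import Dict, Tuple


def parse_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split a Content-Type style string into (base type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns [(base_type, ''), (name, value), ...]
    params = {name.lower(): val for name, val in msg.get_params()[1:]}
    return msg.get_content_type(), params


base, params = parse_media_type("text/tei+xml; format-variant=tei-dta; tokenized=1")
print(base)    # text/tei+xml
print(params)  # {'format-variant': 'tei-dta', 'tokenized': '1'}
```

The spellings with and without a space after the semicolon, both of which occur in the list above, parse to the same result.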
>>
>> I guess that this is as good an answer as we can currently give to address
>> points 1-3 in Marie Hinrichs's list. @Marie: can you confirm that this is
>> sufficient for you, also to address point 4 in your list? In my understanding,
>> whatever works for WebLicht in this respect should also be a suitable basis
>> for a larger context (the SwitchBoard in particular?).
>>
>> In my eyes, it remains crucial, however, that such standardization
>> "decisions" are centrally documented (including the information Tomaž
>> suggested). The CLARIN standards pages as they are now
>> (https://www.clarin.eu/content/standard-recommendations /
>> http://clarin.ids-mannheim.de/standards/index.xq are the ones I know) are,
>> IMHO, incomplete, inconsistent and outdated, and they certainly do not
>> provide accurate information on the mime types. Any input from the standards
>> committee on this question would therefore still be much appreciated.
>>
>> Best,
>>
>> Thomas
>>
>>
>>
>>
>> On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs
>> <marie.hinrichs at uni-tuebingen.de> wrote:
>>> Hi All,
>>>
>>> Thanks to all of you for all the work you’ve done so far to get TEI
>>> processing integrated into WebLicht.
>>>
>>> From WebLicht’s side, there are several places where some
>>> work/coordination needs to happen:
>>>
>>> 1. TCF: agree on the textsource.type attribute and make sure that the
>>> encoder services set it properly
>>> 2. Agree on type names (i.e. text/tei+xml or text/x-tei-dta-xml)
>>> 3. Make sure the CMDI for encoder and decoder services reflect outcomes of
>>> 1 and 2
>>> 4. Add new mappings to WebLicht for TEI.
>>>
>>> Steps 1-3 are being worked out here on the mailing list and whichever
>>> solution/conventions you agree on are fine with us.
>>>
>>> Step 4 requires some changes to the WebLicht code - in particular to the
>>> component that we call the “profiler”. When a user uploads a file, the
>>> profiler tries to figure out what it is and if any of the WebLicht services
>>> can process it. The contentType of the uploaded file, in combination with
>>> standard libraries for file type recognition, is used for this. But
>>> sometimes more digging is necessary, as in the case of TCF, which is
>>> recognized as XML but needs a closer look to confirm that it is TCF. The
>>> profiler will have to be updated in a similar way to recognize TEI, and
>>> hopefully there is even some straightforward way of distinguishing between
>>> the DTA and the spoken variants. Finally, mappings need to be established
>>> between the results of the profiler and the service input types so that the
>>> right services are offered to the user for selection.
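The "closer look" described above could be sketched roughly as follows. This is not WebLicht's profiler code; the function name sniff_tei is invented for the example. It assumes that a TEI file has a root element TEI or teiCorpus in the TEI P5 namespace, and that "tokenized" means at least one <w> element is present. It does not attempt to tell the DTA and spoken variants apart, since the thread leaves open how that could be done.

```python
import xml.etree.ElementTree as ET
from typing import Optional

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace


def sniff_tei(data: bytes) -> Optional[str]:
    """Return a media type string for TEI input, or None for anything else."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return None  # not well-formed XML
    if root.tag not in (f"{{{TEI_NS}}}TEI", f"{{{TEI_NS}}}teiCorpus"):
        return None  # XML, but not TEI
    # Assumption: a file counts as tokenized if it contains any <w> element.
    tokenized = root.find(f".//{{{TEI_NS}}}w") is not None
    return f"text/tei+xml; tokenized={int(tokenized)}"
```

Parsing the whole file is the simplest approach for a sketch; for large uploads, a streaming check of the root element would probably be preferable.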
>>>
>>> Also note that WebLicht chains can be called from the command-line or
>>> programmatically using WebLicht as a Service (WaaS) - see instructions here:
>>> https://weblicht.sfs.uni-tuebingen.de/WaaS/ This is useful for larger inputs
>>> and avoids timeout issues that arise when using the web interface.
>>>
>>> Best Regards,
>>> Marie
>>>
>>>
>>> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
>>>
>>> Hi,
>>>
>>> as regards
>>>
>>>> these format-related specifications (in this case: the name and possible
>>>> values of attributes which are used in addition to a mime type) would
>>>> need to be documented and made known at a central place.
>>> I'd say the documentation for each would need to be accompanied by its TEI
>>> schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema.
>>> Then it would be a simple matter to check if a document conforms to the mime
>>> type.
>>>
>>> Best,
>>> Tomaž
>>>
>>> Bryan Jurish je 21/06/2016 ob 14:22 napisal:
>>>
>>> morning all,
>>>
>>> sounds good to me.
>>>
>>> @Marie: can you give an estimation of how well this might work for
>>> WebLicht?
>>>
>>> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web
>>> service in the next few days, so we can see how that at least works out.
>>>
>>> marmosets,
>>>    Bryan
>>>
>>> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt
>>> <thomas.schmidt at ids-mannheim.de> wrote:
>>>> Dear all,
>>>>
>>>> revising my suggestions from the teiweblicht list according to Bryan's
>>>> proposal to use official mime-types plus parameters (instead of
>>>> x-extended custom mime types) would mean that:
>>>>
>>>> "text/x-tei-isospoken+xml" could become "text/tei+xml;
>>>> format-variant=tei-iso-spoken" (+ tokenized=0/1)
>>>> "text/x-tei-dta+xml" could become "text/tei+xml;
>>>> format-variant=tei-dta" (+ tokenized=0/1)
>>>> "text/x-exmaralda-exb+xml" could become "text/xml;
>>>> format-variant=exmaralda-exb"
>>>> ... and so forth (for other TEI- or XML-based formats)
>>>>
>>>> Wouldn't that be a solomonic solution? What do the WebLicht developers
>>>> say? And independently of that, I think that Hanna is right that these
>>>> format-related specifications (in this case: the name and possible
>>>> values of attributes which are used in addition to a mime type) would
>>>> need to be documented and made known at a central place. I guess it
>>>> would be up to the standards committee to decide on that?
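As a small sketch of the renaming proposed above, a service could translate the old x-extended names into the parameterized form with a lookup table. The table contains only the three pairs named in this message; the function name modernize is hypothetical, and attaching "tokenized" only to text/tei+xml types is an assumption based on the examples given.

```python
from typing import Optional

LEGACY_TO_PARAMETERIZED = {
    "text/x-tei-isospoken+xml": "text/tei+xml; format-variant=tei-iso-spoken",
    "text/x-tei-dta+xml": "text/tei+xml; format-variant=tei-dta",
    "text/x-exmaralda-exb+xml": "text/xml; format-variant=exmaralda-exb",
}


def modernize(media_type: str, tokenized: Optional[bool] = None) -> str:
    """Map a legacy x-type to its parameterized form; pass unknown types through."""
    base = LEGACY_TO_PARAMETERIZED.get(media_type.strip().lower(), media_type)
    # Assumption: the tokenized parameter only applies to TEI types.
    if tokenized is None or not base.startswith("text/tei+xml"):
        return base
    return f"{base}; tokenized={int(tokenized)}"


print(modernize("text/x-tei-dta+xml", tokenized=True))
# text/tei+xml; format-variant=tei-dta; tokenized=1
```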
>>>>
>>>> Best regards,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
>>>>> moin all,
>>>>>
>>>>> fwiw, I agree with Dieter that we need to differentiate between "proper"
>>>>> MIME types (i.e. standardized conventions registered with IANA) and
>>>>> CLARIN-internal (resp. WebLicht-internal) conventions.  We have been using
>>>>> MIME types as the basis of the WebLicht textSource/@type attribute,
>>>>> analogous to the HTTP "Content-Type" header, cf.
>>>>> https://tools.ietf.org/html/rfc2045#section-5.1 .  At the risk of repeating
>>>>> what I've already said on the tei-weblicht list, use of the Content-Type
>>>>> syntax allows us to have our cake and eat it too: we can go ahead and use
>>>>> "official" IANA-sanctioned "true" MIME types and specify variants
>>>>> ("dialects", "flavors") using parameters.  The DTA TEI<->TCF converter is
>>>>> already doing this, setting textSource/@type to either "text/tei+xml;
>>>>> tokenized=0" or "text/tei+xml; tokenized=1", depending on the relevant
>>>>> properties of the input document.
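One practical consequence of the parameter approach is that a consumer can match on the base type and look only at the parameters it cares about. The sketch below shows one possible matching rule, assumed for illustration and not taken from WebLicht: a document type satisfies a requirement if the base types are equal and every parameter named in the requirement has the same value in the document type. The names accepts and _split are hypothetical.

```python
from email.message import Message
from typing import Dict, Tuple


def _split(value: str) -> Tuple[str, Dict[str, str]]:
    """Parse a Content-Type style string into (base type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), {k.lower(): v for k, v in msg.get_params()[1:]}


def accepts(required: str, actual: str) -> bool:
    """True if `actual` has the same base type and every parameter `required` names."""
    req_type, req_params = _split(required)
    act_type, act_params = _split(actual)
    return req_type == act_type and all(
        act_params.get(name) == value for name, value in req_params.items()
    )


print(accepts("text/tei+xml", "text/tei+xml; tokenized=1"))               # True
print(accepts("text/tei+xml; tokenized=1", "text/tei+xml; tokenized=0"))  # False
```

Under this rule, a service that asks only for "text/tei+xml" keeps working when producers start adding parameters such as tokenized.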
>>>>>
>>>>> just my €0.02.
>>>>>
>>>>> marmosets,
>>>>>    Bryan
>>>>>
>>>>>
>>>>> On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu>
>>>>> wrote:
>>>>>> On 17/06/16 12:59, Sander Maijers wrote:
>>>>>>> After all, you would want a
>>>>>>> resource's metadata to be completely descriptive of such elementary
>>>>>>> aspects as internal structure and content of the TEI files, and not
>>>>>>> dependent on system configuration (served as custom media type x or y,
>>>>>>> as long as the server remains so configured).
>>>>>> Hi Sander,
>>>>>>
>>>>>> Thank you for sharing your opinion.
>>>>>>
>>>>>> One side note: we are talking about detecting the mimetype as indicated
>>>>>> in the CMDI ResourceProxy attribute, see:
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>>>>>>
>>>>>> So for the scenario VLO -> LR switchboard -> processing application
>>>>>>
>>>>>> the system configuration would not be relevant, since the mimetype is
>>>>>> explicitly mentioned in the metadata. The key is to find agreement about
>>>>>> a simple and light-weight way of designating the variants of TEI.
>>>>>>
>>>>>> best,
>>>>>>
>>>>>> --
>>>>>> Dieter Van Uytvanck
>>>>>> Technical Director CLARIN ERIC
>>>>>> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
>>>>>> _______________________________________________
>>>>>> Teiweblicht mailing list
>>>>>> Teiweblicht at lists.informatik.uni-leipzig.de
>>>>>> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ***************************************************
>>>>> Bryan Jurish
>>>>> Deutsches Textarchiv
>>>>> Digitales Wörterbuch der deutschen Sprache
>>>>> Berlin-Brandenburgische Akademie der Wissenschaften
>>>>>
>>>>> Jägerstr. 22/23
>>>>> 10117 Berlin
>>>>>
>>>>> Tel.:     +49 (0)30 20370 539
>>>>> E-Mail:   jurish at bbaw.de
>>>>> ***************************************************
>>>>>
>>>>
>>>>
>>>> --
>>>> Thomas Schmidt
>>>> IDS Mannheim
>>>> R5, 6-13
>>>> D-68161 Mannheim
>>>> Tel.: +49 (621) 1581-313
>>>> http://agd.ids-mannheim.de/index.shtml
>>>> http://www.exmaralda.org
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>

-- 
Piotr Bański, Ph.D.
Senior Researcher,
Institut für Deutsche Sprache,
R5 6-13
68-161 Mannheim, Germany