[Standards] [Dev] [Teiweblicht] proposal: using a common mime type for TEI files

Thomas Schmidt thomas.schmidt at ids-mannheim.de
Fri Jul 8 10:25:59 CEST 2016


Sorry: as long as this discussion is the reference document, I should
point out that I made a mistake:

> A parameter "token=0/1" can be added to indicate whether (=1) or
> not (=0) the respective TEI file is tokenized (i.e. has <w> markup)

The name of the parameter as described by Bryan is "tokenized", not "token".
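
For anyone adapting a service, here is a minimal sketch of how a consumer might read the corrected parameter from such a type string. It is only an illustration, not what WebLicht or the DTA converter actually does: the function names parse_media_type and is_tokenized are hypothetical, and the splitting is deliberately naive (it does not handle quoted parameter values that themselves contain a semicolon).

```python
def parse_media_type(value):
    """Split 'type/subtype; key=value; ...' into (media_type, params).

    Hypothetical helper: a naive split on ';' that is good enough for the
    simple parameter values discussed in this thread.
    """
    media_type, *raw_params = [part.strip() for part in value.split(";")]
    params = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing ';'
        key, _, val = raw.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return media_type.lower(), params


def is_tokenized(value):
    """Return True or False for tokenized=1/0, or None if the parameter is absent."""
    _, params = parse_media_type(value)
    flag = params.get("tokenized")
    return None if flag is None else flag == "1"


if __name__ == "__main__":
    example = "text/tei+xml;format-variant=tei-iso-spoken;tokenized=1"
    print(parse_media_type(example))
    print(is_tokenized(example))
```

Returning None when the parameter is missing keeps "not declared" distinct from "declared as not tokenized", which seems useful because the parameter is optional in the plan below.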

- Thomas



On Fri, Jul 8, 2016 at 9:04 AM, Thomas Schmidt
<thomas.schmidt at ids-mannheim.de> wrote:
> Dear all,
>
> in the absence of further input from the standards committee and before we
> lose the momentum, I'd like to summarise our action plan according to the
> discussion so far:
>
> (1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of spoken
> language will be identified by the MIME type
> text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1" can be
> added to indicate whether (=1) or not (=0) the respective TEI file is
> tokenized (i.e. has <w> markup).
> (1b) HZSK and myself will adapt the respective web services accordingly
>
> (2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be identified by
> the MIME type text/tei+xml;format-variant=tei-dta. A parameter "token=0/1"
> can be added to indicate whether (=1) or not (=0) the respective TEI file is
> tokenized (i.e. has <w> markup).
> (2b) Bryan Jurish will adapt the respective web services at BBAW accordingly
>
> (3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions
> will be identified by the MIME type text/xml; format-variant=exmaralda-exb
> (3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal transcription
> files will be identified by the MIME type text/xml;
> format-variant=folker-fln
> (3c) In WebLicht (in CLARIN in general?), Transcriber transcription files
> will be identified by the MIME type text/xml; format-variant=transcriber-trs
> (3d) HZSK and myself will adapt the respective web services accordingly
>
> (4a) It would have to be checked (note the passive, I don't know who could
> be in charge of this) whether competing MIME types for these file types are
> already registered somewhere. I know that WebLicht already seems to have two
> variants of EXMARaLDA transcriptions. The mechanisms specifying those would
> probably have to be deprecated. Transcriber is also not unlikely to have
> been given some kind of MIME type elsewhere in CLARIN.
> (4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and PRAAT/TextGrid
> (the latter two being text, not XML formats). All three of them are also
> likely to have been registered somewhere already, so "someone" (again, I
> wouldn't know who) should check if mime types have been defined for those.
>
> I guess that this is as good an answer as we can currently give to address
> points 1-3 in Marie Hinrichs's list. @Marie: can you confirm that this is
> sufficient for you, also to address point 4 in your list? In my understanding,
> whatever works for WebLicht in this respect should also be a suitable basis
> for a larger context (the SwitchBoard in particular?).
>
> In my eyes, it remains crucial, however, that such standardization
> "decisions" are centrally documented (including the information Tomaž
> suggested). The CLARIN standards pages as they are now
> (https://www.clarin.eu/content/standard-recommendations /
> http://clarin.ids-mannheim.de/standards/index.xq are the ones I know) are,
> IMHO, incomplete, inconsistent and outdated, and they certainly do not
> provide accurate information on the MIME types. Any input from the standards
> committee on this question would therefore still be much appreciated.
>
> Best,
>
> Thomas
>
>
>
>
> On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs
> <marie.hinrichs at uni-tuebingen.de> wrote:
>>
>> Hi All,
>>
>> Thanks to all of you for all the work you’ve done so far to get TEI
>> processing integrated into WebLicht.
>>
>> From WebLicht’s side, there are several places where some
>> work/coordination needs to happen:
>>
>> 1. TCF: agree on the textsource.type attribute and make sure that the
>> encoder services set it properly
>> 2. Agree on type names (e.g. text/tei+xml or text/x-tei-dta-xml)
>> 3. Make sure the CMDI for encoder and decoder services reflect outcomes of
>> 1 and 2
>> 4. Add new mappings to WebLicht for TEI.
>>
>> Steps 1-3 are being worked out here on the mailing list and whichever
>> solution/conventions you agree on are fine with us.
>>
>> Step 4 requires some changes to the WebLicht code - in particular to the
>> component that we call the “profiler”. When a user uploads a file, the
>> profiler tries to figure out what it is and if any of the WebLicht services
>> can process it. The contentType of the uploaded file, in combination with
>> standard libraries for file type recognition, is used for this. But
>> sometimes more digging is necessary, as in the case of TCF, which is
>> recognized as XML but needs a closer look to see whether it is TCF. The
>> profiler will have to be updated in a similar way to recognize TEI, and
>> hopefully there is even some straightforward way of distinguishing between
>> the DTA and the spoken variants. Finally, mappings need to be established
>> between the results of the profiler and the service input types so that the
>> right services are offered to the user for selection.
>>
>> Also note that WebLicht chains can be called from the command-line or
>> programmatically using WebLicht as a Service (WaaS) - see instructions here:
>> https://weblicht.sfs.uni-tuebingen.de/WaaS/ This is useful for larger inputs
>> and avoids timeout issues that arise when using the web interface.
>>
>> Best Regards,
>> Marie
>>
>>
>> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
>>
>> Hi,
>>
>> as regards
>>
>> > these format-related specifications (in this case: the name and possible
>> > values of attributes which are used in addition to a mime type) would
>> > need to be documented and made known at a central place.
>>
>> I'd say the documentation for each would need to be accompanied by its TEI
>> schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema.
>> Then it would be a simple matter to check if a document conforms to the mime
>> type.
>>
>> Best,
>> Tomaž
>>
>> On 21/06/2016 at 14:22, Bryan Jurish wrote:
>>
>> morning all,
>>
>> sounds good to me.
>>
>> @Marie: can you give an estimation of how well this might work for
>> WebLicht?
>>
>> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web
>> service in the next few days, so we can see how that at least works out.
>>
>> marmosets,
>>   Bryan
>>
>> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt
>> <thomas.schmidt at ids-mannheim.de> wrote:
>>>
>>> Dear all,
>>>
>>> revising my suggestions from the teiweblicht list according to Bryan's
>>> proposal to use official mime-types plus parameters (instead of
>>> x-extended custom mime types) would mean that:
>>>
>>> "text/x-tei-isospoken+xml" could become "text/tei+xml;
>>> format-variant=tei-iso-spoken" (+ tokenized=0/1)
>>> "text/x-tei-dta+xml" could become "text/tei+xml;
>>> format-variant=tei-dta" (+ tokenized=0/1)
>>> "text/x-exmaralda-exb+xml" could become "text/xml;
>>> format-variant=exmaralda-exb"
>>> ... and so forth (for other TEI or XML based formats)
>>>
>>> Wouldn't that be a Solomonic solution? What do the WebLicht developers
>>> say? And independently of that, I think that Hanna is right that these
>>> format-related specifications (in this case: the name and possible
>>> values of attributes which are used in addition to a mime type) would
>>> need to be documented and made known at a central place. I guess it
>>> would be up to the standards committee to decide on that?
>>>
>>> Best regards,
>>>
>>> Thomas
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
>>> > moin all,
>>> >
>>> > fwiw, I agree with Dieter that we need to differentiate between
>>> > "proper" MIME types (i.e. standardized conventions registered with
>>> > IANA) and CLARIN-internal (resp. WebLicht-internal) conventions. We
>>> > have been using MIME types as the basis of the WebLicht
>>> > textSource/@type attribute, analogous to the HTTP "Content-Type"
>>> > header, cf. https://tools.ietf.org/html/rfc2045#section-5.1 . At the
>>> > risk of repeating what I've already said on the tei-weblicht list,
>>> > use of the Content-Type syntax allows us to have our cake and eat it
>>> > too: we can go ahead and use "official" IANA-sanctioned "true" MIME
>>> > types and specify variants ("dialects", "flavors") using parameters.
>>> > The DTA TEI<->TCF converter is already doing this, setting
>>> > textSource/@type to either "text/tei+xml; tokenized=0" or
>>> > "text/tei+xml; tokenized=1", depending on the relevant properties of
>>> > the input document.
>>> >
>>> > just my €0.02.
>>> >
>>> > marmosets,
>>> >   Bryan
>>> >
>>> >
>>> > On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu>
>>> > wrote:
>>> >>
>>> >> On 17/06/16 12:59, Sander Maijers wrote:
>>> >> > After all, you would want a resource's metadata to be completely
>>> >> > descriptive of such elementary aspects as internal structure and
>>> >> > content of the TEI files, and not dependent on system configuration
>>> >> > (served as custom media type x or y, as long as the server remains
>>> >> > so configured).
>>> >>
>>> >> Hi Sander,
>>> >>
>>> >> Thank you for sharing your opinion.
>>> >>
>>> >> One side note: we are talking about detecting the mimetype as
>>> >> indicated in the CMDI ResourceProxy attribute, see:
>>> >>
>>> >>
>>> >>
>>> >> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>>> >>
>>> >> So for the scenario VLO -> LR switchboard -> processing application
>>> >>
>>> >> the system configuration would not be relevant, since the mimetype is
>>> >> explicitly mentioned in the metadata. The key is to find agreement
>>> >> about a simple and light-weight way of designating the variants of
>>> >> TEI.
>>> >>
>>> >> best,
>>> >>
>>> >> --
>>> >> Dieter Van Uytvanck
>>> >> Technical Director CLARIN ERIC
>>> >> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
>>> >> _______________________________________________
>>> >> Teiweblicht mailing list
>>> >> Teiweblicht at lists.informatik.uni-leipzig.de
>>> >> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > ***************************************************
>>> > Bryan Jurish
>>> > Deutsches Textarchiv
>>> > Digitales Wörterbuch der deutschen Sprache
>>> > Berlin-Brandenburgische Akademie der Wissenschaften
>>> >
>>> > Jägerstr. 22/23
>>> > 10117 Berlin
>>> >
>>> > Tel.:     +49 (0)30 20370 539
>>> > E-Mail:   jurish at bbaw.de
>>> > ***************************************************
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Thomas Schmidt
>>> IDS Mannheim
>>> R5, 6-13
>>> D-68161 Mannheim
>>> Tel.: +49 (621) 1581-313
>>> http://agd.ids-mannheim.de/index.shtml
>>> http://www.exmaralda.org
>>>
>>
>>
>>
>>
>>
>>
>
>
>



-- 
Thomas Schmidt
IDS Mannheim
R5, 6-13
D-68161 Mannheim
Tel.: +49 (621) 1581-313
http://agd.ids-mannheim.de/index.shtml
http://www.exmaralda.org

