[Standards] [Dev] [Teiweblicht] proposal: using a common mime type for TEI files

Thomas Schmidt thomas.schmidt at ids-mannheim.de
Fri Jul 8 09:04:55 CEST 2016


Dear all,

in the absence of further input from the standards committee and before we
lose the momentum, I'd like to summarise our action plan according to the
discussion so far:

(1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of spoken
language will be identified by the MIME type
text/tei+xml;format-variant=tei-iso-spoken.
A parameter "token=0/1" can be added to indicate whether (=1) or not (=0)
the respective TEI file is tokenized (i.e. has <w> markup).
(1b) HZSK and I will adapt the respective web services accordingly

(2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be identified
by the MIME type text/tei+xml;format-variant=tei-dta. A parameter
"token=0/1" can be added to indicate whether (=1) or not (=0) the
respective TEI file is tokenized (i.e. has <w> markup).
(2b) Bryan Jurish will adapt the respective web services at BBAW accordingly

(3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions
will be identified by the MIME type text/xml; format-variant=exmaralda-exb
(3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal transcription
files will be identified by the MIME type text/xml;
format-variant=folker-fln
(3c) In WebLicht (in CLARIN in general?), Transcriber transcription files
will be identified by the MIME type text/xml; format-variant=transcriber-trs
(3d) HZSK and I will adapt the respective web services accordingly

(4a) It would have to be checked (note the passive, I don't know who could
be in charge of this) whether competing MIME types for these file types are
already registered somewhere. I know that WebLicht already seems to have
two variants of EXMARaLDA transcriptions. The mechanisms specifying those
would probably have to be deprecated. Transcriber is also not unlikely to
have been given some kind of mimetype elsewhere in CLARIN.
(4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and PRAAT/TextGrid
(the latter two being text, not XML formats). All three of them are also
likely to have been registered somewhere already, so "someone" (again, I
wouldn't know who) should check if mime types have been defined for those.
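
As a minimal sketch of how a consuming service could read the identifiers
proposed in (1a)-(3c): the Python below splits such a string into its base
type and parameters and checks the format-variant against the values listed
above. The names parse_media_type, token_flag, is_known and KNOWN_VARIANTS
are hypothetical and not part of WebLicht. Since the thread uses both "token"
and "tokenized" for the parameter name, the sketch accepts either, which is
an assumption rather than an agreed convention.

```python
# Illustrative sketch only; all names here are hypothetical.

# format-variant values proposed in (1a)-(3c), mapped to their base type.
KNOWN_VARIANTS = {
    "tei-iso-spoken": "text/tei+xml",
    "tei-dta": "text/tei+xml",
    "exmaralda-exb": "text/xml",
    "folker-fln": "text/xml",
    "transcriber-trs": "text/xml",
}


def parse_media_type(value):
    """Split 'type/subtype; name=value; ...' into (base_type, params).

    Simplification: quoted parameter values containing ';' are not handled.
    """
    parts = [p.strip() for p in value.split(";")]
    base_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip().strip('"')
    return base_type, params


def token_flag(params):
    """Return True/False for token=1/0, or None if the parameter is absent.

    Assumption: 'tokenized' is accepted as an alias for 'token'.
    """
    raw = params.get("token", params.get("tokenized"))
    if raw is None:
        return None
    return raw == "1"


def is_known(value):
    """True if the format-variant is listed and matches its base type."""
    base_type, params = parse_media_type(value)
    variant = params.get("format-variant")
    return KNOWN_VARIANTS.get(variant) == base_type


if __name__ == "__main__":
    print(parse_media_type("text/tei+xml;format-variant=tei-dta;token=1"))
    print(is_known("text/xml; format-variant=exmaralda-exb"))
```

The parser tolerates the presence or absence of whitespace after the
semicolon, since both spellings occur in the list above.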

I guess that this is as good an answer as we can currently give to address
points 1-3 in Marie Hinrichs' list. @Marie: can you confirm that this is
sufficient for you, also to address point 4 in your list? In my
understanding, whatever works for WebLicht in this respect should also be a
suitable basis for a larger context (the SwitchBoard in particular?).
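
For point 4 (the profiler step Marie describes in her message quoted below),
a rough sketch of the "closer look" could be as follows. It assumes that a
TEI file has a root element TEI in the TEI namespace, and it follows the
definition in (1a)/(2a) that a file counts as tokenized if it has <w> markup.
It does not attempt to tell the DTA and spoken variants apart, which is still
open. The function name profile_xml and the fallback types are assumptions
made for illustration, not existing WebLicht code.

```python
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def profile_xml(data: bytes) -> str:
    """Guess a type string for uploaded bytes (hypothetical helper).

    Returns 'text/tei+xml; token=0' or 'text/tei+xml; token=1' for TEI,
    'text/xml' for other well-formed XML, and
    'application/octet-stream' if the data cannot be parsed as XML.
    """
    root_tag = None
    has_w = False
    try:
        # Stream over start tags so that large files need not be fully loaded.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            if root_tag is None:
                root_tag = elem.tag
                if root_tag != f"{{{TEI_NS}}}TEI":
                    return "text/xml"  # XML, but not a TEI root element
            elif elem.tag == f"{{{TEI_NS}}}w":
                has_w = True
                break  # one <w> is enough to call the file tokenized
    except ET.ParseError:
        return "application/octet-stream"
    if root_tag is None:
        return "application/octet-stream"
    return f"text/tei+xml; token={int(has_w)}"


if __name__ == "__main__":
    sample = (
        b'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        b"<text><body><p><w>Hallo</w></p></body></text></TEI>"
    )
    print(profile_xml(sample))
```

A format-variant parameter could be appended once a reliable way of
distinguishing the variants has been agreed on.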

In my eyes, it remains crucial, however, that such standardization
"decisions" are centrally documented (including the information Tomaž
suggested). The CLARIN standards pages as they are now (
https://www.clarin.eu/content/standard-recommendations /
http://clarin.ids-mannheim.de/standards/index.xq are the ones I know) are,
IMHO, incomplete, inconsistent and outdated, and they certainly do not
provide accurate information on the mime types. Any input from the standards
committee on this question would therefore still be much appreciated.

Best,

Thomas




On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs <
marie.hinrichs at uni-tuebingen.de> wrote:

> Hi All,
>
> Thanks to all of you for all the work you’ve done so far to get TEI
> processing integrated into WebLicht.
>
> From WebLicht’s side, there are several places where some
> work/coordination needs to happen:
>
> 1. TCF: agree on the textsource.type attribute and make sure that the
> encoder services set it properly
> 2. Agree on type names (i.e. text/tei+xml or text/x-tei-dta-xml)
> 3. Make sure the CMDI for encoder and decoder services reflect outcomes of
> 1 and 2
> 4. Add new mappings to WebLicht for TEI.
>
> Steps 1-3 are being worked out here on the mailing list and whichever
> solution/conventions you agree on are fine with us.
>
> Step 4 requires some changes to the WebLicht code - in particular to the
> component that we call the “profiler”. When a user uploads a file, the
> profiler tries to figure out what it is and if any of the WebLicht services
> can process it. The contentType of the uploaded file, in combination with
> standard libraries for file type recognition, is used for this. But
> sometimes more digging is necessary, as in the case with tcf - which is
> recognized as xml, but it needs a closer look to see if it is tcf.  The
> profiler will have to be updated in a similar way to recognize TEI, and
> hopefully there is even some straightforward way of distinguishing between
> the DTA and the spoken variants. Finally, mappings need to be established
> between the results of the profiler and the service input types so that the
> right services are offered to the user for selection.
>
> Also note that WebLicht chains can be called from the command-line or
> programmatically using WebLicht as a Service (WaaS) - see instructions
> here: https://weblicht.sfs.uni-tuebingen.de/WaaS/ This is useful for
> larger inputs and avoids timeout issues that arise when using the web
> interface.
>
> Best Regards,
> Marie
>
>
> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
>
> Hi,
>
> as regards
>
> > these format-related specifications (in this case: the name and possible
> > values of attributes which are used in addition to a mime type) would
> > need to be documented and made known at a central place.
> I'd say the documentation for each would need to be accompanied by its TEI
> schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema.
> Then it would be a simple matter to check if a document conforms to the
> mime type.
>
> Best,
> Tomaž
>
> On 21/06/2016 at 14:22, Bryan Jurish wrote:
>
> morning all,
>
> sounds good to me.
>
> @Marie: can you give an estimation of how well this might work for
> WebLicht?
>
> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web
> service in the next few days, so we can see how that at least works out.
>
> marmosets,
>   Bryan
>
> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt <
> thomas.schmidt at ids-mannheim.de> wrote:
>
>> Dear all,
>>
>> revising my suggestions from the teiweblicht list according to Bryan's
>> proposal to use official mime-types plus parameters (instead of
>> x-extended custom mime types) would mean that:
>>
>> "text/x-tei-isospoken+xml" could become "text/tei+xml;
>> format-variant=tei-iso-spoken" (+ tokenized=0/1)
>> "text/x-tei-dta+xml" could become "text/tei+xml;
>> format-variant=tei-dta" (+ tokenized=0/1)
>> "text/x-exmaralda-exb+xml" could become "text/xml;
>> format-variant=exmaralda-exb"
>> ... and so forth (for other TEI or XML-based formats)
>>
>> Wouldn't that be a Solomonic solution? What do the WebLicht developers
>> say? And independently of that, I think that Hanna is right that these
>> format-related specifications (in this case: the name and possible
>> values of attributes which are used in addition to a mime type) would
>> need to be documented and made known at a central place. I guess it
>> would be up to the standards committee to decide on that?
>>
>> Best regards,
>>
>> Thomas
>>
>>
>>
>>
>>
>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
>> > moin all,
>> >
>> > fwiw, I agree with Dieter that we need to differentiate between "proper"
>> > MIME types (i.e. standardized conventions registered with IANA) and
>> > CLARIN-internal (or WebLicht-internal) conventions.  We have been using
>> > MIME types as the basis of the WebLicht textSource/@type attribute,
>> > analogous to the HTTP "Content-Type" header, cf.
>> > https://tools.ietf.org/html/rfc2045#section-5.1 .  At the risk of
>> > repeating what I've already said on the tei-weblicht list, use of the
>> > Content-Type syntax allows us to have our cake and eat it too: we can go
>> > ahead and use "official" IANA-sanctioned "true" MIME types and specify
>> > variants ("dialects", "flavors") using parameters.  The DTA TEI<->TCF
>> > converter is already doing this, setting textSource/@type to either
>> > "text/tei+xml; tokenized=0" or "text/tei+xml; tokenized=1", depending on
>> > the relevant properties of the input document.
>> >
>> > just my €0.02.
>> >
>> > marmosets,
>> >   Bryan
>> >
>> >
>> > On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu>
>> > wrote:
>> >>
>> >> On 17/06/16 12:59, Sander Maijers wrote:
>> >> > After all, you would want a
>> >> > resource's metadata to be completely descriptive of such elementary
>> >> > aspects as internal structure and content of the TEI files, and not
>> >> > dependent on system configuration (served as custom media type x or y,
>> >> > as long as the server remains so configured).
>> >>
>> >> Hi Sander,
>> >>
>> >> Thank you for sharing your opinion.
>> >>
>> >> One side note: we are talking about detecting the mimetype as indicated
>> >> in the CMDI ResourceProxy attribute, see:
>> >>
>> >> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>> >>
>> >> So for the scenario VLO -> LR switchboard -> processing application
>> >>
>> >> the system configuration would not be relevant, since the mimetype is
>> >> explicitly mentioned in the metadata. The key is to find agreement about
>> >> a simple and light-weight way of designating the variants of TEI.
>> >>
>> >> best,
>> >>
>> >> --
>> >> Dieter Van Uytvanck
>> >> Technical Director CLARIN ERIC
>> >> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
>> >> _______________________________________________
>> >> Teiweblicht mailing list
>> >> Teiweblicht at lists.informatik.uni-leipzig.de
>> >> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>> >>
>> >
>> >
>> >
>> > --
>> > ***************************************************
>> > Bryan Jurish
>> > Deutsches Textarchiv
>> > Digitales Wörterbuch der deutschen Sprache
>> > Berlin-Brandenburgische Akademie der Wissenschaften
>> >
>> > Jägerstr. 22/23
>> > 10117 Berlin
>> >
>> > Tel.:     +49 (0)30 20370 539
>> > E-Mail:   jurish at bbaw.de
>> > ***************************************************
>> >
>> >
>>
>>
>>
>> --
>> Thomas Schmidt
>> IDS Mannheim
>> R5, 6-13
>> D-68161 Mannheim
>> Tel.: +49 (621) 1581-313
>> http://agd.ids-mannheim.de/index.shtml
>> http://www.exmaralda.org
>>
>>
>
>


-- 
Thomas Schmidt
IDS Mannheim
R5, 6-13
D-68161 Mannheim
Tel.: +49 (621) 1581-313
http://agd.ids-mannheim.de/index.shtml
http://www.exmaralda.org