[Dev] [Teiweblicht] proposal: using a common mime type for TEI files

Thomas Schmidt thomas.schmidt at ids-mannheim.de
Fri Jul 8 14:27:28 CEST 2016


Dear all,

sorry again, this time not about an error, but about a somewhat fundamental
oversight:
TCF, of course, then also needs a mime type which follows the same logic. I
suggest:

text/xml;format-variant=weblicht-tcf

I'll use that in the revision of our web services until/unless I hear
something to the contrary.
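
As an illustration only (a minimal sketch, not the code of any WebLicht
service): a consumer of such values could split them into the base type and
its parameters roughly as follows. The function name parse_media_type is made
up for this example; the parameter names format-variant and tokenized are the
ones proposed in this thread.

```python
def parse_media_type(value):
    """Split 'type/subtype; name=value; ...' into (base_type, params).

    Hypothetical helper for illustration; handles only the simple
    unquoted or double-quoted parameter values used in this thread.
    """
    parts = [p.strip() for p in value.split(";")]
    base_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing semicolon
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip().strip('"')
    return base_type, params


if __name__ == "__main__":
    for example in (
        "text/xml;format-variant=weblicht-tcf",
        "text/tei+xml; format-variant=tei-dta; tokenized=1",
    ):
        print(parse_media_type(example))
```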

Best,

Thomas




On Fri, Jul 8, 2016 at 10:25 AM, Thomas Schmidt <
thomas.schmidt at ids-mannheim.de> wrote:

> Sorry: as long as this discussion is the reference document, I should
> point out that I made a mistake:
>
> > A parameter "token=0/1" can be added to indicate whether (=1) or
> > not (=0) the respective TEI file is tokenized (i.e. has <w> markup)
>
> The name of the parameter as described by Bryan is "tokenized", not
> "token".
>
> - Thomas
>
>
>
> On Fri, Jul 8, 2016 at 9:04 AM, Thomas Schmidt
> <thomas.schmidt at ids-mannheim.de> wrote:
> > Dear all,
> >
> > in the absence of further input from the standards committee and before we
> > lose the momentum, I'd like to summarise our action plan according to the
> > discussion so far:
> >
> > (1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of spoken
> > language will be identified by the MIME type
> > text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1" can be
> > added to indicate whether (=1) or not (=0) the respective TEI file is
> > tokenized (i.e. has <w> markup).
> > (1b) HZSK and myself will adapt the respective web services accordingly
> >
> > (2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be identified by
> > the MIME type text/tei+xml;format-variant=tei-dta. A parameter "token=0/1"
> > can be added to indicate whether (=1) or not (=0) the respective TEI file is
> > tokenized (i.e. has <w> markup).
> > (2b) Bryan Jurish will adapt the respective web services at BBAW accordingly
> >
> > (3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions
> > will be identified by the MIME type text/xml; format-variant=exmaralda-exb
> > (3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal transcription
> > files will be identified by the MIME type text/xml; format-variant=folker-fln
> > (3c) In WebLicht (in CLARIN in general?), Transcriber transcription files
> > will be identified by the MIME type text/xml; format-variant=transcriber-trs
> > (3d) HZSK and myself will adapt the respective web services accordingly
> >
> > (4a) It would have to be checked (note the passive, I don't know who could
> > be in charge of this) whether competing MIME types for these file types are
> > already registered somewhere. I know that WebLicht already seems to have two
> > variants of EXMARaLDA transcriptions. The mechanisms specifying those would
> > probably have to be deprecated. Transcriber is also not unlikely to have
> > been given some kind of MIME type elsewhere in CLARIN.
> > (4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and PRAAT/TextGrid
> > (the latter two being text, not XML formats). All three of them are also
> > likely to have been registered somewhere already, so "someone" (again, I
> > wouldn't know who) should check if MIME types have been defined for those.
> >
> > I guess that this is as good an answer as we can currently give to address
> > points 1-3 in Marie Hinrichs' list. @Marie: can you confirm that this is
> > sufficient for you, also to address point 4 in your list? In my understanding,
> > whatever works for WebLicht in this respect should also be a suitable basis
> > for a larger context (the SwitchBoard in particular?).
> >
> > In my eyes, it remains crucial, however, that such standardization
> > "decisions" are centrally documented (including the information Tomaž
> > suggested). The CLARIN standards pages as they are now
> > (https://www.clarin.eu/content/standard-recommendations /
> > http://clarin.ids-mannheim.de/standards/index.xq are the ones I know) are,
> > IMHO, incomplete, inconsistent and outdated, and they certainly do not
> > provide accurate information on the mime types. Any input from the standards
> > committee on this question would therefore still be much appreciated.
> >
> > Best,
> >
> > Thomas
> >
> >
> >
> >
> > On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs
> > <marie.hinrichs at uni-tuebingen.de> wrote:
> >>
> >> Hi All,
> >>
> >> Thanks to all of you for all the work you’ve done so far to get TEI
> >> processing integrated into WebLicht.
> >>
> >> From WebLicht’s side, there are several places where some
> >> work/coordination needs to happen:
> >>
> >> 1. TCF: agree on the textsource.type attribute and make sure that the
> >> encoder services set it properly
> >> 2. Agree on type names (i.e. text/tei+xml or text/x-tei-dta-xml)
> >> 3. Make sure the CMDI for encoder and decoder services reflect outcomes of
> >> 1 and 2
> >> 4. Add new mappings to WebLicht for TEI.
> >>
> >> Steps 1-3 are being worked out here on the mailing list and whichever
> >> solution/conventions you agree on are fine with us.
> >>
> >> Step 4 requires some changes to the WebLicht code - in particular to the
> >> component that we call the “profiler”. When a user uploads a file, the
> >> profiler tries to figure out what it is and if any of the WebLicht services
> >> can process it. The contentType of the uploaded file, in combination with
> >> standard libraries for file type recognition, is used for this. But
> >> sometimes more digging is necessary, as in the case of TCF - which is
> >> recognized as XML, but it needs a closer look to see if it is TCF. The
> >> profiler will have to be updated in a similar way to recognize TEI, and
> >> hopefully there is even some straightforward way of distinguishing between
> >> the DTA and the spoken variants. Finally, mappings need to be established
> >> between the results of the profiler and the service input types so that the
> >> right services are offered to the user for selection.
> >>
> >> Also note that WebLicht chains can be called from the command-line or
> >> programmatically using WebLicht as a Service (WaaS) - see instructions here:
> >> https://weblicht.sfs.uni-tuebingen.de/WaaS/ This is useful for larger inputs
> >> and avoids timeout issues that arise when using the web interface.
> >>
> >> Best Regards,
> >> Marie
> >>
> >>
> >> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
> >>
> >> Hi,
> >>
> >> as regards
> >>
> >> > these format-related specifications (in this case: the name and possible
> >> > values of attributes which are used in addition to a mime type) would
> >> > need to be documented and made known at a central place.
> >>
> >> I'd say the documentation for each would need to be accompanied by its TEI
> >> schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema.
> >> Then it would be a simple matter to check if a document conforms to the mime
> >> type.
> >>
> >> Best,
> >> Tomaž
> >>
> >> On 21/06/2016 at 14:22, Bryan Jurish wrote:
> >>
> >> morning all,
> >>
> >> sounds good to me.
> >>
> >> @Marie: can you give an estimation of how well this might work for
> >> WebLicht?
> >>
> >> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web
> >> service in the next few days, so we can see how that at least works out.
> >>
> >> marmosets,
> >>   Bryan
> >>
> >> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt
> >> <thomas.schmidt at ids-mannheim.de> wrote:
> >>>
> >>> Dear all,
> >>>
> >>> revising my suggestions from the teiweblicht list according to Bryan's
> >>> proposal to use official mime-types plus parameters (instead of
> >>> x-extended custom mime types) would mean that:
> >>>
> >>> "text/x-tei-isospoken+xml" could become "text/tei+xml;
> >>> format-variant=tei-iso-spoken" (+ tokenized=0/1)
> >>> "text/x-tei-dta+xml" could become "text/tei+xml;
> >>> format-variant=tei-dta" (+ tokenized=0/1)
> >>> "text/x-exmaralda-exb+xml" could become "text/xml;
> >>> format-variant=exmaralda-exb"
> >>> ... and so forth (for other TEI or XML based formats)
> >>>
> >>> Wouldn't that be a solomonic solution? What do the WebLicht developers
> >>> say? And independently of that, I think that Hanna is right that these
> >>> format-related specifications (in this case: the name and possible
> >>> values of attributes which are used in addition to a mime type) would
> >>> need to be documented and made known at a central place. I guess it
> >>> would be up to the standards committee to decide on that?
> >>>
> >>> Best regards,
> >>>
> >>> Thomas
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
> >>> > moin all,
> >>> >
> >>> > fwiw, I agree with Dieter that we need to differentiate between "proper"
> >>> > MIME types (i.e. standardized conventions registered with IANA) and
> >>> > CLARIN-internal (resp. WebLicht-internal) conventions.  We have been using
> >>> > MIME types as the basis of the WebLicht textSource/@type attribute,
> >>> > analogous to the HTTP "Content-Type" header, cf.
> >>> > https://tools.ietf.org/html/rfc2045#section-5.1 .  At the risk of repeating
> >>> > what I've already said on the tei-weblicht list, use of the Content-Type
> >>> > syntax allows us to have our cake and eat it too: we can go ahead and use
> >>> > "official" IANA-sanctioned "true" MIME types and specify variants
> >>> > ("dialects", "flavors") using parameters.  The DTA TEI<->TCF converter is
> >>> > already doing this, setting textSource/@type to either "text/tei+xml;
> >>> > tokenized=0" or "text/tei+xml; tokenized=1", depending on the relevant
> >>> > properties of the input document.
> >>> >
> >>> > just my €0.02.
> >>> >
> >>> > marmosets,
> >>> >   Bryan
> >>> >
> >>> >
> >>> > On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu>
> >>> > wrote:
> >>> >>
> >>> >> On 17/06/16 12:59, Sander Maijers wrote:
> >>> >> > After all, you would want a
> >>> >> > resource's metadata to be completely descriptive of such elementary
> >>> >> > aspects as internal structure and content of the TEI files, and not
> >>> >> > dependent on system configuration (served as custom media type x or y,
> >>> >> > as long as the server remains so configured).
> >>> >>
> >>> >> Hi Sander,
> >>> >>
> >>> >> Thank you for sharing your opinion.
> >>> >>
> >>> >> One side note: we are talking about detecting the mimetype as indicated
> >>> >> in the CMDI ResourceProxy attribute, see:
> >>> >>
> >>> >> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
> >>> >>
> >>> >> So for the scenario VLO -> LR switchboard -> processing application
> >>> >>
> >>> >> the system configuration would not be relevant, since the mimetype is
> >>> >> explicitly mentioned in the metadata. The key is to find agreement about
> >>> >> a simple and light-weight way of designating the variants of TEI.
> >>> >>
> >>> >> best,
> >>> >>
> >>> >> --
> >>> >> Dieter Van Uytvanck
> >>> >> Technical Director CLARIN ERIC
> >>> >> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
> >>> >> _______________________________________________
> >>> >> Teiweblicht mailing list
> >>> >> Teiweblicht at lists.informatik.uni-leipzig.de
> >>> >> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > ***************************************************
> >>> > Bryan Jurish
> >>> > Deutsches Textarchiv
> >>> > Digitales Wörterbuch der deutschen Sprache
> >>> > Berlin-Brandenburgische Akademie der Wissenschaften
> >>> >
> >>> > Jägerstr. 22/23
> >>> > 10117 Berlin
> >>> >
> >>> > Tel.:     +49 (0)30 20370 539
> >>> > E-Mail:   jurish at bbaw.de
> >>> > ***************************************************
> >>>
> >>>
> >>>
> >>> --
> >>> Thomas Schmidt
> >>> IDS Mannheim
> >>> R5, 6-13
> >>> D-68161 Mannheim
> >>> Tel.: +49 (621) 1581-313
> >>> http://agd.ids-mannheim.de/index.shtml
> >>> http://www.exmaralda.org
> >>>



-- 
Thomas Schmidt
IDS Mannheim
R5, 6-13
D-68161 Mannheim
Tel.: +49 (621) 1581-313
http://agd.ids-mannheim.de/index.shtml
http://www.exmaralda.org