[Dev] [Teiweblicht] [Standards] proposal: using a common mime type for TEI files

Hanna Hedeland hanna.hedeland at uni-hamburg.de
Mon Jul 11 10:32:21 CEST 2016


Hi all,

good to see there finally seems to be a decision for WebLicht. I was also
wondering whether this decision has already been taken for the Language
Resource Switchboard [1], since when I upload a TEI file I get a drop-down
list of available (partly non-standard) mimetypes to choose from. Shouldn't
the same approach to describing file types/formats be possible for WebLicht
and the LR Switchboard (and for the rest of the CLARIN infrastructure)? And
shouldn't it be possible to collect the available tool/service metadata to
at least list the mimetypes currently in use (or has this already been done
for the list in the LR Switchboard)?
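
As a rough illustration of what such a listing could look like (a sketch
only: how the type strings are harvested from the tool/service metadata is
left out, and the names normalise and inventory are made up for this
example), one could normalise the declared types and count them:

```python
from collections import Counter


def normalise(media_type):
    """Lower-case the base type and parameter names, and sort the parameters.

    Parameter values are left untouched. Quoted values containing ';' are
    not handled; this is only meant for simple strings like the ones
    discussed in this thread.
    """
    base, *params = [part.strip() for part in media_type.split(";")]
    pairs = []
    for part in params:
        if not part:
            continue
        name, _, value = part.partition("=")
        pairs.append(f"{name.strip().lower()}={value.strip()}")
    return "; ".join([base.lower()] + sorted(pairs))


def inventory(declared_types):
    """Count how often each normalised type string occurs."""
    return Counter(normalise(t) for t in declared_types)


if __name__ == "__main__":
    # Hypothetical input; real values would come from the tool metadata.
    declared = [
        "text/tei+xml;format-variant=tei-dta",
        "Text/TEI+XML; Format-Variant=tei-dta",
        "text/xml; format-variant=exmaralda-exb",
    ]
    for media_type, count in sorted(inventory(declared).items()):
        print(count, media_type)
```

Spelling variants of the same type collapse into one entry this way, which
makes competing or non-standard names easier to spot.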

@Marie and Claus, could you please find out/decide whether a uniform
approach regarding mimetypes/format variants is possible for both services?

@Piotr, great, maybe you could then also have a look at the available
metadata of CLARIN tools/services and the (newer) entries in the Format
Registry [2], which does not really exist yet, for the inventory of
mimetypes?

Thanks and best regards,
Hanna

[1] http://weblicht.sfs.uni-tuebingen.de/clrs/
[2] https://trac.clarin.eu/wiki/FormatRegistry

-- 
Hanna Hedeland
Hamburger Zentrum für Sprachkorpora
Max-Brauer-Allee 60
D - 22765 Hamburg

Tel. + 49 40 42838 6893

2016-07-08 17:42 GMT+02:00 Piotr Bański <banski at ids-mannheim.de>:

> Dear All,
>
> I have summarized Thomas's proposal at
> https://trac.clarin.eu/wiki/MIME%20format%20variants
>
> I'll also try to be the "someone" whom Thomas has hesitated to name. It
> will provide me with a good opportunity to look into the various corners of
> CLARIN's infrastructure that I could otherwise overlook.
>
> Best regards,
>
>   Piotr
>
>
> On 08/07/16 10:25, Thomas Schmidt wrote:
>
>> Sorry: as long as this discussion is the reference document, I should
>> point out that I made a mistake:
>>
>> A parameter "token=0/1" can be added to indicate whether (=1) or
>>> not (=0) the respective TEI file is tokenized (i.e. has <w> markup)
>>>
>> The name of the parameter as described by Bryan is "tokenized", not
>> "token".
>>
>> - Thomas
>>
>>
>>
>> On Fri, Jul 8, 2016 at 9:04 AM, Thomas Schmidt
>> <thomas.schmidt at ids-mannheim.de> wrote:
>>
>>> Dear all,
>>>
>>> in the absence of further input from the standards committee and before we
>>> lose the momentum, I'd like to summarise our action plan according to the
>>> discussion so far:
>>>
>>> (1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of spoken
>>> language will be identified by the MIME type
>>> text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1" can be
>>> added to indicate whether (=1) or not (=0) the respective TEI file is
>>> tokenized (i.e. has <w> markup).
>>> (1b) HZSK and myself will adapt the respective web services accordingly
>>>
>>> (2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be identified
>>> by the MIME type text/tei+xml;format-variant=tei-dta. A parameter
>>> "token=0/1" can be added to indicate whether (=1) or not (=0) the
>>> respective TEI file is tokenized (i.e. has <w> markup).
>>> (2b) Bryan Jurish will adapt the respective web services at BBAW
>>> accordingly
>>>
>>> (3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions
>>> will be identified by the MIME type text/xml; format-variant=exmaralda-exb
>>> (3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal transcription
>>> files will be identified by the MIME type text/xml;
>>> format-variant=folker-fln
>>> (3c) In WebLicht (in CLARIN in general?), Transcriber transcription files
>>> will be identified by the MIME type text/xml; format-variant=transcriber-trs
>>> (3d) HZSK and myself will adapt the respective web services accordingly
>>>
>>> (4a) It would have to be checked (note the passive, I don't know who could
>>> be in charge of this) whether competing MIME types for these file types are
>>> already registered somewhere. I know that WebLicht already seems to have two
>>> variants of EXMARaLDA transcriptions. The mechanisms specifying those would
>>> probably have to be deprecated. Transcriber is also not unlikely to have
>>> been given some kind of mimetype elsewhere in CLARIN.
>>> (4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and PRAAT/TextGrid
>>> (the latter two being text, not XML formats). All three of them are also
>>> likely to have been registered somewhere already, so "someone" (again, I
>>> wouldn't know who) should check if mime types have been defined for those.
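
For illustration, a minimal sketch of how a consumer might split such a type
string into its base type and parameters. It assumes simple, unquoted
parameter values like the ones listed above, uses the parameter name
"tokenized" (the name given in the correction to this message) rather than
"token", and the function name parse_media_type is made up for this example:

```python
def parse_media_type(value):
    """Split 'type/subtype; name=value; ...' into (base_type, params).

    The base type and parameter names are lower-cased. Values are kept as
    written, apart from stripping surrounding whitespace and double quotes.
    Quoted values that themselves contain ';' are not supported.
    """
    parts = [part.strip() for part in value.split(";")]
    base_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip().strip('"')
    return base_type, params


if __name__ == "__main__":
    base, params = parse_media_type(
        "text/tei+xml;format-variant=tei-iso-spoken;tokenized=1"
    )
    print(base)                          # base type without parameters
    print(params.get("format-variant"))  # which TEI variant is declared
    print(params.get("tokenized"))       # "1" if <w> markup is declared
```

A service that only understands plain TEI can then look at the base type
alone and ignore parameters it does not know.
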
>>>
>>> I guess that this is as good an answer as we can currently give to address
>>> points 1-3 in Marie Hinrichs's list. @Marie: can you confirm that this is
>>> sufficient for you, also to address point 4 in your list? In my
>>> understanding, whatever works for WebLicht in this respect should also be a
>>> suitable basis for a larger context (the SwitchBoard in particular?).
>>>
>>> In my eyes, it remains crucial, however, that such standardization
>>> "decisions" are centrally documented (including the information Tomaž
>>> suggested). The CLARIN standards pages as they are now
>>> (https://www.clarin.eu/content/standard-recommendations and
>>> http://clarin.ids-mannheim.de/standards/index.xq are the ones I know) are,
>>> IMHO, incomplete, inconsistent and outdated, and they certainly do not
>>> provide accurate information on the mime types. Any input from the standards
>>> committee on this question would therefore still be much appreciated.
>>>
>>> Best,
>>>
>>> Thomas
>>>
>>>
>>>
>>>
>>> On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs
>>> <marie.hinrichs at uni-tuebingen.de> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Thanks to all of you for all the work you’ve done so far to get TEI
>>>> processing integrated into WebLicht.
>>>>
>>>> From WebLicht’s side, there are several places where some
>>>> work/coordination needs to happen:
>>>>
>>>> 1. TCF: agree on the textsource.type attribute and make sure that the
>>>> encoder services set it properly
>>>> 2. Agree on type names (e.g. text/tei+xml or text/x-tei-dta-xml)
>>>> 3. Make sure the CMDI for encoder and decoder services reflects the
>>>> outcomes of 1 and 2
>>>> 4. Add new mappings to WebLicht for TEI.
>>>>
>>>> Steps 1-3 are being worked out here on the mailing list and whichever
>>>> solution/conventions you agree on are fine with us.
>>>>
>>>> Step 4 requires some changes to the WebLicht code - in particular to the
>>>> component that we call the “profiler”. When a user uploads a file, the
>>>> profiler tries to figure out what it is and whether any of the WebLicht
>>>> services can process it. The contentType of the uploaded file, in
>>>> combination with standard libraries for file type recognition, is used for
>>>> this. But sometimes more digging is necessary, as in the case of TCF, which
>>>> is recognized as XML but needs a closer look to see whether it is TCF. The
>>>> profiler will have to be updated in a similar way to recognize TEI, and
>>>> hopefully there is even some straightforward way of distinguishing between
>>>> the DTA and the spoken variants. Finally, mappings need to be established
>>>> between the results of the profiler and the service input types so that the
>>>> right services are offered to the user for selection.
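
The "closer look" described above could, as a sketch, amount to inspecting
the root element of the uploaded XML. This is not the actual WebLicht
profiler: the function name profile, the TCF namespace and root element
name, and the returned strings for TCF and for unrecognised input are
assumptions made for the example. It deliberately does not try to tell the
DTA and spoken variants apart, since the thread does not say how that would
be done.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumption: TCF documents use a <D-Spin> root element in this namespace.
TCF_NS = "http://www.dspin.de/data"


def profile(xml_text):
    """Guess a type string for an uploaded document from its root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML; a real profiler would try other detectors.
        return "application/octet-stream"
    if root.tag == "{%s}TEI" % TEI_NS:
        # Tokenized means: at least one <w> element somewhere in the file.
        has_w = root.find(".//{%s}w" % TEI_NS) is not None
        return "text/tei+xml; tokenized=%d" % has_w
    if root.tag == "{%s}D-Spin" % TCF_NS:
        return "text/tcf+xml"  # assumed type string for TCF
    return "text/xml"


if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        "<text><body><p><w>Hallo</w></p></body></text></TEI>"
    )
    print(profile(sample))
```

The result of such a check could then be compared with the input types
declared by the services to decide which ones to offer.
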
>>>>
>>>> Also note that WebLicht chains can be called from the command-line or
>>>> programmatically using WebLicht as a Service (WaaS) - see instructions
>>>> here: https://weblicht.sfs.uni-tuebingen.de/WaaS/ This is useful for larger
>>>> inputs and avoids timeout issues that arise when using the web interface.
>>>>
>>>> Best Regards,
>>>> Marie
>>>>
>>>>
>>>> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
>>>>
>>>> Hi,
>>>>
>>>> as regards
>>>>
>>>> these format-related specifications (in this case: the name and possible
>>>>> values of attributes which are used in addition to a mime type) would
>>>>> need to be documented and made known at a central place.
>>>>>
>>>> I'd say the documentation for each would need to be accompanied by its TEI
>>>> schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema.
>>>> Then it would be a simple matter to check if a document conforms to the
>>>> mime type.
>>>>
>>>> Best,
>>>> Tomaž
>>>>
>>>> On 21/06/2016 at 14:22, Bryan Jurish wrote:
>>>>
>>>> morning all,
>>>>
>>>> sounds good to me.
>>>>
>>>> @Marie: can you give an estimation of how well this might work for
>>>> WebLicht?
>>>>
>>>> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web
>>>> service in the next few days, so we can see how that at least works out.
>>>>
>>>> marmosets,
>>>>    Bryan
>>>>
>>>> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt
>>>> <thomas.schmidt at ids-mannheim.de> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> revising my suggestions from the teiweblicht list according to Bryan's
>>>>> proposal to use official mime-types plus parameters (instead of
>>>>> x-extended custom mime types) would mean that:
>>>>>
>>>>> "text/x-tei-isospoken+xml" could become "text/tei+xml;
>>>>> format-variant=tei-iso-spoken" (+ tokenized=0/1)
>>>>> "text/x-tei-dta+xml" could become "text/tei+xml;
>>>>> format-variant=tei-dta" (+ tokenized=0/1)
>>>>> "text/x-exmaralda-exb+xml" could become "text/xml;
>>>>> format-variant=exmaralda-exb"
>>>>> ... and so forth (for other TEI or XML based formats)
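
The renaming listed above could be captured in a small lookup table, sketched
here under the assumption that services still emitting the x-extended names
need to be translated during a transition period. The table only contains
the three pairs named above; the function name modernise and its tokenized
argument are made up for this example.

```python
# Legacy x-extended names and their proposed replacements, as listed above.
LEGACY_TO_PARAMETERISED = {
    "text/x-tei-isospoken+xml": "text/tei+xml; format-variant=tei-iso-spoken",
    "text/x-tei-dta+xml": "text/tei+xml; format-variant=tei-dta",
    "text/x-exmaralda-exb+xml": "text/xml; format-variant=exmaralda-exb",
}


def modernise(media_type, tokenized=None):
    """Translate a legacy type name; unknown names are returned unchanged.

    If tokenized is True or False and the result is a TEI type, a
    'tokenized=1' or 'tokenized=0' parameter is appended.
    """
    key = media_type.strip().lower()
    result = LEGACY_TO_PARAMETERISED.get(key, media_type)
    if tokenized is not None and result.startswith("text/tei+xml"):
        result += "; tokenized=" + ("1" if tokenized else "0")
    return result


if __name__ == "__main__":
    print(modernise("text/x-tei-dta+xml", tokenized=True))
    print(modernise("text/x-exmaralda-exb+xml"))
```
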
>>>>>
>>>>> Wouldn't that be a solomonic solution? What do the WebLicht developers
>>>>> say? And independently of that, I think that Hanna is right that these
>>>>> format-related specifications (in this case: the name and possible
>>>>> values of attributes which are used in addition to a mime type) would
>>>>> need to be documented and made known at a central place. I guess it
>>>>> would be up to the standards committee to decide on that?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
>>>>>
>>>>>> moin all,
>>>>>>
>>>>>> fwiw, I agree with Dieter that we need to differentiate between "proper"
>>>>>> MIME types (i.e. standardized conventions registered with IANA) and
>>>>>> CLARIN-internal (resp. WebLicht-internal) conventions.  We have been using
>>>>>> MIME types as the basis of the WebLicht textSource/@type attribute,
>>>>>> analogous to the HTTP "Content-Type" header, cf.
>>>>>> https://tools.ietf.org/html/rfc2045#section-5.1 .  At the risk of repeating
>>>>>> what I've already said on the tei-weblicht list, use of the Content-Type
>>>>>> syntax allows us to have our cake and eat it too: we can go ahead and use
>>>>>> "official" IANA-sanctioned "true" MIME types and specify variants
>>>>>> ("dialects", "flavors") using parameters.  The DTA TEI<->TCF converter is
>>>>>> already doing this, setting textSource/@type to either "text/tei+xml;
>>>>>> tokenized=0" or "text/tei+xml; tokenized=1", depending on the relevant
>>>>>> properties of the input document.
>>>>>>
>>>>>> just my €0.02.
>>>>>>
>>>>>> marmosets,
>>>>>>    Bryan
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu>
>>>>>> wrote:
>>>>>>
>>>>>>> On 17/06/16 12:59, Sander Maijers wrote:
>>>>>>>
>>>>>>>> After all, you would want a
>>>>>>>> resource's metadata to be completely descriptive of such elementary
>>>>>>>> aspects as internal structure and content of the TEI files, and not
>>>>>>>> dependent on system configuration (served as custom media type x or y,
>>>>>>>> as long as the server remains so configured).
>>>>>>>>
>>>>>>> Hi Sander,
>>>>>>>
>>>>>>> Thank you for sharing your opinion.
>>>>>>>
>>>>>>> One side note: we are talking about detecting the mimetype as indicated
>>>>>>> in the CMDI ResourceProxy attribute, see:
>>>>>>>
>>>>>>> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>>>>>>>
>>>>>>> So for the scenario VLO -> LR switchboard -> processing application, the
>>>>>>> system configuration would not be relevant, since the mimetype is
>>>>>>> explicitly mentioned in the metadata. The key is to find agreement about
>>>>>>> a simple and light-weight way of designating the variants of TEI.
>>>>>>>
>>>>>>> best,
>>>>>>>
>>>>>>> --
>>>>>>> Dieter Van Uytvanck
>>>>>>> Technical Director CLARIN ERIC
>>>>>>> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
>>>>>>> _______________________________________________
>>>>>>> Teiweblicht mailing list
>>>>>>> Teiweblicht at lists.informatik.uni-leipzig.de
>>>>>>> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ***************************************************
>>>>>> Bryan Jurish
>>>>>> Deutsches Textarchiv
>>>>>> Digitales Wörterbuch der deutschen Sprache
>>>>>> Berlin-Brandenburgische Akademie der Wissenschaften
>>>>>>
>>>>>> Jägerstr. 22/23
>>>>>> 10117 Berlin
>>>>>>
>>>>>> Tel.:     +49 (0)30 20370 539
>>>>>> E-Mail:   jurish at bbaw.de
>>>>>> ***************************************************
>>>>>>
>>>>>
>>>>> --
>>>>> Thomas Schmidt
>>>>> IDS Mannheim
>>>>> R5, 6-13
>>>>> D-68161 Mannheim
>>>>> Tel.: +49 (621) 1581-313
>>>>> http://agd.ids-mannheim.de/index.shtml
>>>>> http://www.exmaralda.org
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>
> --
> Piotr Bański, Ph.D.
> Senior Researcher,
> Institut für Deutsche Sprache,
> R5 6-13
> 68-161 Mannheim, Germany
>
>


