[Dev] [Teiweblicht] [Standards] proposal: using a common mime type for TEI files

Bryan Jurish moocow.bovine at gmail.com
Mon Jul 11 10:34:00 CEST 2016


morning all,

fyi, I've now implemented the "format-variant=tei-dta" parameter as per
Thomas's suggestion for our tei/tcf codec at http://kaskade.dwds.de/tei-tcf/
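
For anyone consuming such values, here is a minimal sketch (Python, standard library only) of how the base type and its parameters might be separated. The helper name `split_media_type` is purely illustrative and is not part of the codec or of WebLicht:

```python
from email.message import Message


def split_media_type(value):
    """Split a Content-Type style string into (base type, parameter dict)."""
    msg = Message()
    msg["Content-Type"] = value
    base = msg.get_content_type()          # lower-cased "type/subtype"
    params = dict(msg.get_params()[1:])    # first entry is the base type itself
    return base, params


base, params = split_media_type("text/tei+xml; format-variant=tei-dta; tokenized=1")
print(base)                          # text/tei+xml
print(params.get("format-variant"))  # tei-dta
print(params.get("tokenized"))       # 1
```

Reusing the standard header parser means optional whitespace after the semicolon is handled the same way in both spellings that appear in this thread.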

marmosets,
  Bryan

On Mon, Jul 11, 2016 at 10:32 AM, Hanna Hedeland <
hanna.hedeland at uni-hamburg.de> wrote:

> Hi all,
>
> good to see there finally seems to be a decision for WebLicht. I was also
> wondering if this decision already has been taken for the Language Resource
> Switchboard [1], since when I upload a TEI file I get a drop-down list with
> available (partly non-standard) mimetypes to choose from. Shouldn't the
> same approach on describing file types/formats be possible for WebLicht and
> the LR Switchboard (and for the rest of the CLARIN infrastructure)? And
> shouldn't it be possible to collect the available tool/service metadata to
> at least list the mimetypes currently in use (or has this already been done
> for the list in the LR Switchboard)?
>
> @Marie and Claus, could you please find out/decide whether a uniform
> approach regarding mimetypes/format variants is possible for both services?
>
> @Piotr, great, maybe you could then also have a look at the available
> metadata of CLARIN tools/services and the (newer) entries in the not yet
> really existing Format Registry [2] for the inventory of mimetypes?
>
> Thanks and best regards,
> Hanna
>
> [1] http://weblicht.sfs.uni-tuebingen.de/clrs/
> [2] https://trac.clarin.eu/wiki/FormatRegistry
>
> --
> Hanna Hedeland
> Hamburger Zentrum für Sprachkorpora
> Max-Brauer-Allee 60
> D - 22765 Hamburg
>
> Tel. + 49 40 42838 6893
>
> 2016-07-08 17:42 GMT+02:00 Piotr Bański <banski at ids-mannheim.de>:
>
>> Dear All,
>>
>> I have summarized Thomas's proposal at
>> https://trac.clarin.eu/wiki/MIME%20format%20variants
>>
>> I'll also try to be the "someone" whom Thomas has hesitated to name. It
>> will provide me with a good opportunity to look into the various corners of
>> CLARIN's infrastructure that I could otherwise overlook.
>>
>> Best regards,
>>
>>   Piotr
>>
>>
>> On 08/07/16 10:25, Thomas Schmidt wrote:
>>
>>> Sorry: as long as this discussion is the reference document, I should
>>> point out that I made a mistake:
>>>
>>>> A parameter "token=0/1" can be added to indicate whether (=1) or
>>>> not (=0) the respective TEI file is tokenized (i.e. has <w> markup)
>>>>
>>> The name of the parameter as described by Bryan is "tokenized", not
>>> "token".
>>>
>>> - Thomas
>>>
>>>
>>>
>>> On Fri, Jul 8, 2016 at 9:04 AM, Thomas Schmidt
>>> <thomas.schmidt at ids-mannheim.de> wrote:
>>>
>>>> Dear all,
>>>>
>>>> in the absence of further input from the standards committee and
>>>> before we lose the momentum, I'd like to summarise our action plan
>>>> according to the discussion so far:
>>>>
>>>> (1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of
>>>> spoken language will be identified by the MIME type
>>>> text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1"
>>>> can be added to indicate whether (=1) or not (=0) the respective TEI
>>>> file is tokenized (i.e. has <w> markup).
>>>> (1b) HZSK and myself will adapt the respective web services accordingly
>>>>
>>>> (2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be
>>>> identified by the MIME type text/tei+xml;format-variant=tei-dta. A
>>>> parameter "token=0/1" can be added to indicate whether (=1) or not (=0)
>>>> the respective TEI file is tokenized (i.e. has <w> markup).
>>>> (2b) Bryan Jurish will adapt the respective web services at BBAW
>>>> accordingly
>>>>
>>>> (3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions
>>>> will be identified by the MIME type
>>>> text/xml; format-variant=exmaralda-exb
>>>> (3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal
>>>> transcription files will be identified by the MIME type
>>>> text/xml; format-variant=folker-fln
>>>> (3c) In WebLicht (in CLARIN in general?), Transcriber transcription
>>>> files will be identified by the MIME type
>>>> text/xml; format-variant=transcriber-trs
>>>> (3d) HZSK and myself will adapt the respective web services accordingly
>>>>
>>>> (4a) It would have to be checked (note the passive, I don't know who
>>>> could be in charge of this) whether competing MIME types for these
>>>> file types are already registered somewhere. I know that WebLicht
>>>> already seems to have two variants of EXMARaLDA transcriptions. The
>>>> mechanisms specifying those would probably have to be deprecated.
>>>> Transcriber is also not unlikely to have been given some kind of
>>>> mimetype elsewhere in CLARIN.
>>>> (4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and
>>>> PRAAT/TextGrid (the latter two being text, not XML formats). All three
>>>> of them are also likely to have been registered somewhere already, so
>>>> "someone" (again, I wouldn't know who) should check if mime types have
>>>> been defined for those.
>>>>
>>>> I guess that this is as good an answer as we can currently give to
>>>> address points 1-3 in Marie Hinrichs's list. @Marie: can you confirm
>>>> that this is sufficient for you, also to address point 4 in your list?
>>>> In my understanding, whatever works for WebLicht in this respect should
>>>> also be a suitable basis for a larger context (the SwitchBoard in
>>>> particular?).
>>>>
>>>> In my eyes, it remains crucial, however, that such standardization
>>>> "decisions" are centrally documented (including the information Tomaž
>>>> suggested). The CLARIN standards pages as they are now
>>>> (https://www.clarin.eu/content/standard-recommendations /
>>>> http://clarin.ids-mannheim.de/standards/index.xq are the ones I know)
>>>> are, IMHO, incomplete, inconsistent and outdated, and they certainly
>>>> do not provide accurate information on the mime types. Any input from
>>>> the standards committee on this question would therefore still be much
>>>> appreciated.
>>>>
>>>> Best,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs
>>>> <marie.hinrichs at uni-tuebingen.de> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Thanks to all of you for all the work you’ve done so far to get TEI
>>>>> processing integrated into WebLicht.
>>>>>
>>>>> From WebLicht’s side, there are several places where some
>>>>> work/coordination needs to happen:
>>>>>
>>>>> 1. TCF: agree on the textsource.type attribute and make sure that the
>>>>> encoder services set it properly
>>>>> 2. Agree on type names (e.g. text/tei+xml or text/x-tei-dta-xml)
>>>>> 3. Make sure the CMDI for encoder and decoder services reflects the
>>>>> outcomes of 1 and 2
>>>>> 4. Add new mappings to WebLicht for TEI.
>>>>>
>>>>> Steps 1-3 are being worked out here on the mailing list and whichever
>>>>> solution/conventions you agree on are fine with us.
>>>>>
>>>>> Step 4 requires some changes to the WebLicht code - in particular to
>>>>> the component that we call the “profiler”. When a user uploads a file,
>>>>> the profiler tries to figure out what it is and if any of the WebLicht
>>>>> services can process it. The contentType of the uploaded file, in
>>>>> combination with standard libraries for file type recognition, is used
>>>>> for this. But sometimes more digging is necessary, as in the case of
>>>>> tcf - which is recognized as xml, but needs a closer look to see if it
>>>>> is tcf. The profiler will have to be updated in a similar way to
>>>>> recognize TEI, and hopefully there is even some straightforward way of
>>>>> distinguishing between the DTA and the spoken variants. Finally,
>>>>> mappings need to be established between the results of the profiler
>>>>> and the service input types so that the right services are offered to
>>>>> the user for selection.
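
The "closer look" described above could be sketched roughly as follows. This is not the WebLicht profiler, only an illustration under stated assumptions: the function name `profile_xml` is invented, the TCF root namespace and the "text/tcf+xml" label are assumed rather than taken from this thread, and telling the DTA variant from the spoken variant is deliberately left open because no rule for it has been agreed here.

```python
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TCF_NS = "http://www.dspin.de/data"  # assumption: namespace of the TCF root element


def profile_xml(data: bytes) -> str:
    """Return a Content-Type style guess for an uploaded document."""
    try:
        root = ET.parse(io.BytesIO(data)).getroot()
    except ET.ParseError:
        return "application/octet-stream"  # not well-formed XML
    if root.tag.startswith("{" + TCF_NS + "}"):
        return "text/tcf+xml"              # assumed label for TCF
    if root.tag in ("{%s}TEI" % TEI_NS, "{%s}teiCorpus" % TEI_NS):
        # "tokenized" here simply means: at least one <w> element is present
        has_w = root.find(".//{%s}w" % TEI_NS) is not None
        return "text/tei+xml; tokenized=%d" % has_w
    return "text/xml"
```

Parsing the whole document keeps the sketch short; for large uploads a streaming parser that stops at the first `<w>` would be the more economical choice.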
>>>>>
>>>>> Also note that WebLicht chains can be called from the command-line or
>>>>> programmatically using WebLicht as a Service (WaaS) - see instructions
>>>>> here: https://weblicht.sfs.uni-tuebingen.de/WaaS/
>>>>> This is useful for larger inputs and avoids timeout issues that arise
>>>>> when using the web interface.
>>>>>
>>>>> Best Regards,
>>>>> Marie
>>>>>
>>>>>
>>>>> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> as regards
>>>>>
>>>>>> these format-related specifications (in this case: the name and
>>>>>> possible values of attributes which are used in addition to a mime
>>>>>> type) would need to be documented and made known at a central place.
>>>>>>
>>>>> I'd say the documentation for each would need to be accompanied by its
>>>>> TEI schema, i.e. the TEI ODD file and the derived (probably) RelaxNG
>>>>> schema. Then it would be a simple matter to check if a document
>>>>> conforms to the mime type.
>>>>>
>>>>> Best,
>>>>> Tomaž
>>>>>
>>>>> On 21/06/2016 at 14:22, Bryan Jurish wrote:
>>>>>
>>>>> morning all,
>>>>>
>>>>> sounds good to me.
>>>>>
>>>>> @Marie: can you give an estimate of how well this might work for
>>>>> WebLicht?
>>>>>
>>>>> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF
>>>>> web service in the next few days, so we can see how that at least
>>>>> works out.
>>>>>
>>>>> marmosets,
>>>>>    Bryan
>>>>>
>>>>> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt
>>>>> <thomas.schmidt at ids-mannheim.de> wrote:
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> revising my suggestions from the teiweblicht list according to Bryan's
>>>>>> proposal to use official mime-types plus parameters (instead of
>>>>>> x-extended custom mime types) would mean that:
>>>>>>
>>>>>> "text/x-tei-isospoken+xml" could become
>>>>>>   "text/tei+xml; format-variant=tei-iso-spoken" (+ tokenized=0/1)
>>>>>> "text/x-tei-dta+xml" could become
>>>>>>   "text/tei+xml; format-variant=tei-dta" (+ tokenized=0/1)
>>>>>> "text/x-exmaralda-exb+xml" could become
>>>>>>   "text/xml; format-variant=exmaralda-exb"
>>>>>> ... and so forth (for other TEI- or XML-based formats)
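
For services that still emit the x- types, the rewriting listed above might be sketched as a small lookup table. The names `LEGACY_TYPES` and `modernize` are hypothetical, and only the three pairs actually named in this thread are included:

```python
# Legacy x- types from the list above, mapped to (base type, format-variant).
LEGACY_TYPES = {
    "text/x-tei-isospoken+xml": ("text/tei+xml", "tei-iso-spoken"),
    "text/x-tei-dta+xml": ("text/tei+xml", "tei-dta"),
    "text/x-exmaralda-exb+xml": ("text/xml", "exmaralda-exb"),
}


def modernize(legacy, tokenized=None):
    """Rewrite a legacy x- type as base type plus format-variant parameter."""
    base, variant = LEGACY_TYPES[legacy]
    value = "%s; format-variant=%s" % (base, variant)
    # The tokenized parameter is only proposed for the TEI variants.
    if tokenized is not None and base == "text/tei+xml":
        value += "; tokenized=%d" % bool(tokenized)
    return value


print(modernize("text/x-tei-dta+xml", tokenized=True))
# text/tei+xml; format-variant=tei-dta; tokenized=1
```

An unknown legacy type raises a KeyError on purpose, so that unmapped formats surface early instead of being passed through silently.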
>>>>>>
>>>>>> Wouldn't that be a Solomonic solution? What do the WebLicht developers
>>>>>> say? And independently of that, I think that Hanna is right that these
>>>>>> format-related specifications (in this case: the name and possible
>>>>>> values of attributes which are used in addition to a mime type) would
>>>>>> need to be documented and made known at a central place. I guess it
>>>>>> would be up to the standards committee to decide on that?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
>>>>>>
>>>>>>> moin all,
>>>>>>>
>>>>>>> fwiw, I agree with Dieter that we need to differentiate between
>>>>>>> "proper" MIME types (i.e. standardized conventions registered with
>>>>>>> IANA) and CLARIN-internal (or WebLicht-internal) conventions. We
>>>>>>> have been using MIME types as the basis of the WebLicht
>>>>>>> textSource/@type attribute, analogous to the HTTP "Content-Type"
>>>>>>> header, cf. https://tools.ietf.org/html/rfc2045#section-5.1 . At the
>>>>>>> risk of repeating what I've already said on the tei-weblicht list,
>>>>>>> use of the Content-Type syntax allows us to have our cake and eat it
>>>>>>> too: we can go ahead and use "official" IANA-sanctioned "true" MIME
>>>>>>> types and specify variants ("dialects", "flavors") using parameters.
>>>>>>> The DTA TEI<->TCF converter is already doing this, setting
>>>>>>> textSource/@type to either "text/tei+xml; tokenized=0" or
>>>>>>> "text/tei+xml; tokenized=1", depending on the relevant properties of
>>>>>>> the input document.
>>>>>>>
>>>>>>> just my €0.02.
>>>>>>>
>>>>>>> marmosets,
>>>>>>>    Bryan
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu> wrote:
>>>>>>>
>>>>>>>> On 17/06/16 12:59, Sander Maijers wrote:
>>>>>>>>
>>>>>>>>> After all, you would want a resource's metadata to be completely
>>>>>>>>> descriptive of such elementary aspects as internal structure and
>>>>>>>>> content of the TEI files, and not dependent on system
>>>>>>>>> configuration (served as custom media type x or y, as long as the
>>>>>>>>> server remains so configured).
>>>>>>>>>
>>>>>>>> Hi Sander,
>>>>>>>>
>>>>>>>> Thank you for sharing your opinion.
>>>>>>>>
>>>>>>>> One side note: we are talking about detecting the mimetype as
>>>>>>>> indicated in the CMDI ResourceProxy attribute, see:
>>>>>>>>
>>>>>>>> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>>>>>>>>
>>>>>>>> So for the scenario VLO -> LR switchboard -> processing application
>>>>>>>> the system configuration would not be relevant, since the mimetype
>>>>>>>> is explicitly mentioned in the metadata. The key is to find
>>>>>>>> agreement about a simple and light-weight way of designating the
>>>>>>>> variants of TEI.
>>>>>>>>
>>>>>>>> best,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dieter Van Uytvanck
>>>>>>>> Technical Director CLARIN ERIC
>>>>>>>> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
>>>>>>>> _______________________________________________
>>>>>>>> Teiweblicht mailing list
>>>>>>>> Teiweblicht at lists.informatik.uni-leipzig.de
>>>>>>>> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ***************************************************
>>>>>>> Bryan Jurish
>>>>>>> Deutsches Textarchiv
>>>>>>> Digitales Wörterbuch der deutschen Sprache
>>>>>>> Berlin-Brandenburgische Akademie der Wissenschaften
>>>>>>>
>>>>>>> Jägerstr. 22/23
>>>>>>> 10117 Berlin
>>>>>>>
>>>>>>> Tel.:     +49 (0)30 20370 539
>>>>>>> E-Mail:   jurish at bbaw.de
>>>>>>> ***************************************************
>>>>>>
>>>>>> --
>>>>>> Thomas Schmidt
>>>>>> IDS Mannheim
>>>>>> R5, 6-13
>>>>>> D-68161 Mannheim
>>>>>> Tel.: +49 (621) 1581-313
>>>>>> http://agd.ids-mannheim.de/index.shtml
>>>>>> http://www.exmaralda.org
>>>>>>
>>>>>>
>>
>> --
>> Piotr Bański, Ph.D.
>> Senior Researcher,
>> Institut für Deutsche Sprache,
>> R5 6-13
>> 68-161 Mannheim, Germany


-- 
Bryan Jurish                           "There is *always* one more bug."
moocow.bovine at gmail.com         -Lubarsky's Law of Cybernetic Entomology