[Standards] [Dev] proposal: using a common mime type for TEI files

Windhouwer, Menzo Menzo.Windhouwer at mpi.nl
Fri Jun 17 09:52:55 CEST 2016


Hi, All,

Here is some inspiration for an alternative to a MIME-type-only approach.
In the web services efforts of CLARIN we designed a core model (see [1]),
based on the web services initiatives in Spain, Germany and the Netherlands
at that time. This model allowed one to define a so-called ParameterGroup,
which could describe an input/output file, e.g., a TEI or TCF document.
Within the ParameterGroup one could define the properties of the
input/output via Parameters, e.g., which layers are available in the TCF or
which TEI variant is used. Here is a fictitious example:

    <ParameterGroup>
        <Name>input</Name>
        <MIMEType>application/tei+xml</MIMEType>
        <Parameters>
            <Parameter>
                <Name>variant</Name>
                <Description>The variant of TEI used.</Description>
                <SemanticType>DTA</SemanticType>
            </Parameter>
        </Parameters>
    </ParameterGroup>


I’ve used the SemanticType here, but the core model actually allows
several levels to be used for profile matching, i.e., MIME type, data
type, data category [*] and semantic type (see section 5.1 of [1] for
more information).
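
To make the matching idea concrete, here is a minimal Python sketch of how
such a ParameterGroup could be checked against a document. It is an
illustration under stated assumptions, not the core model's actual matching
procedure: the function names read_profile and accepts, the "ISO-Spoken"
identifier, and the reading of SemanticType as a required value for the
"variant" parameter are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The fictitious ParameterGroup from this mail.
PARAMETER_GROUP = """
<ParameterGroup>
    <Name>input</Name>
    <MIMEType>application/tei+xml</MIMEType>
    <Parameters>
        <Parameter>
            <Name>variant</Name>
            <Description>The variant of TEI used.</Description>
            <SemanticType>DTA</SemanticType>
        </Parameter>
    </Parameters>
</ParameterGroup>
"""


def read_profile(xml_text):
    """Return (mime_type, {parameter name: semantic type}) for one ParameterGroup."""
    group = ET.fromstring(xml_text.strip())
    mime_type = group.findtext("MIMEType", default="").strip()
    params = {}
    for param in group.iter("Parameter"):
        name = param.findtext("Name", default="").strip()
        params[name] = param.findtext("SemanticType", default="").strip()
    return mime_type, params


def accepts(profile, mime_type, properties):
    """Two-level match: the MIME type first, then every declared parameter.

    Assumption: a document is described by its MIME type plus a dict of
    properties such as {"variant": "DTA"}.
    """
    wanted_mime, wanted_params = profile
    if mime_type != wanted_mime:
        return False
    return all(properties.get(name) == value
               for name, value in wanted_params.items())


profile = read_profile(PARAMETER_GROUP)
print(accepts(profile, "application/tei+xml", {"variant": "DTA"}))
print(accepts(profile, "application/tei+xml", {"variant": "ISO-Spoken"}))
```

With this kind of matching, both TEI variants could share one MIME type,
and a service would still be offered only documents of the variant it
declares.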

Maybe these (past) efforts/approaches can still serve as inspiration …

Best,

Menzo

[1] http://www.lrec-conf.org/proceedings/lrec2012/workshops/11.LREC2012%20Metadata%20Proceedings.pdf#page=48&pagemode=none
[*] in the context of the switch from ISOcat to the CCR this should maybe
be ConceptLink now
--
The Language Archive – tla.mpi.nl




On 16/06/16 20:35, "dev-bounces at lists.clarin.eu on behalf of Thomas
Schmidt" <dev-bounces at lists.clarin.eu on behalf of
thomas.schmidt at ids-mannheim.de> wrote:

>Dear all,
>
>yes, we've been discussing this on the teiweblicht mailing list (cc'ed
>here) with the aim of making WebLicht usable for TEI data. We're stuck
>with more or less the problem that Torsten describes. We have two
>TEI-based formats which we would like to consider at the moment. One
>is DTA's format for written texts, the other is the recently finalised
>ISO standard for transcriptions of spoken language. There is no way we
>can treat those as just two different "flavours" of the same format.
>Therefore, we would need to distinguish this at whatever the place is
>where WebLicht distinguishes file formats. If it is via the mime type,
>we would need a mime type extension like "text/x-tei-isospoken+xml"
>vs. "text/x-tei-dta+xml". If it is on some other level, we would have
>to know which and agree on a suitable set of TEI variant identifiers.
>I'm copying relevant parts of the mailing list exchange below for your
>information.
>
>Best regards,
>
>Thomas
>
>----------------
>On Tue, Apr 26, 2016 at 3:55 PM, Thomas Schmidt
><thomas.schmidt at ids-mannheim.de> wrote:
>> [...] it is obvious that no sufficiently specialized processing method
>> (e.g. individual WebLicht services) can handle "TEI" as a generic file type.
>> There are way too many degrees of freedom in the TEI guidelines, and
>> my understanding was that TEI is itself meant as just a framework in
>> which more specific data models/file formats can be defined (which
>> we've done...).
>>
>> The two TEI "dialects" we have so far (DTA and ISO/Spoken) should
>> therefore be handled as two separate file types, just as any other two
>> different inputs (say, RTF vs. OpenOfficeXML) are handled by WebLicht.
>> In the current scenario, both TEI variants will need a TCF
>> decoder/encoder anyway before anything meaningful can be done in
>> WebLicht, and I don't think it makes sense to attempt a single
>> decoder/encoder pair which handles both variants. So I would opt to
>> make the distinction between the TEI dialects on the same level where
>> other file types are distinguished. [...]
>----------------
>On Wed, Jun 1, 2016 at 3:40 PM, Bryan Jurish <jurish at bbaw.de> wrote:
>> do you have concrete suggestions for @type and/or its possible
>> (parameter=value) pairs?
>----------------
>On Thu, Jun 2, 2016 at 8:23 AM, Thomas Schmidt
><thomas.schmidt at ids-mannheim.de> wrote:
>> My understanding was that all services in WebLicht have to specify
>> their input and output formats via an appropriate mime type.
>> Concerning the TEI data, this would mean that two (possibly more,
>> eventually) types of TEI would have to be distinguished (instead of
>> just one, as is the case now). For example:
>>
>> text/x-tei-isospoken+xml
>> text/x-tei-dta+xml
>>
>> Since spoken language data will rarely come directly as TEI, converter
>> services for the most common tool formats would have to be prepended
>> (one from the HZSK is already available as a prototype). There already
>> seem to be two different flavours of EXMARaLDA; I couldn't find any
>> documentation on the difference between the two. Ultimately, it would
>> be good to be able to distinguish something like
>>
>> text/x-exmaralda-exb+xml (for EXMARaLDA basic transcriptions)
>> text/x-transcriber-trs+xml (for Transcriber files)
>> text/x-folker-flk+xml (for FOLKER transcriptions)
>> ... (and possibly more for ELAN and others)
>----------------
>On Thu, Jun 2, 2016 at 9:45 AM, Hanna Hedeland <hanna.hedeland at gmail.com>
>wrote:
>> [...] mimetypes are one option, [...], and I think what we need is
>> 1) a decision whether to specify TEI dialects via different mimetypes or
>> rather to use one TEI mimetype and an additional dialect parameter for the
>> dialects - the WebLicht team would know about the implications of these
>> options for the system, I can only imagine that the orchestration might
>> become more complicated if some mimetypes can only be understood with an
>> additional parameter, others on their own - on the other hand maybe further
>> mimetypes will have implications for the world outside WebLicht
>>
>> 2) some way of managing the inventory of used mimetypes or
>> mimetypes+parameters to ensure we all know which file formats are in use and
>> how they should be described (especially relevant for converters)
>>
>> I think the webservice developers will have really valuable input, but in
>> the end, maybe the WebLicht developers have to decide on this as they will
>> be implementing the chosen solution?
>
>On Thu, Jun 16, 2016 at 5:11 PM, Thorsten Trippel
><thorsten.trippel at uni-tuebingen.de> wrote:
>> Yes it was in this context where I heard this discussion. The TEI importer,
>> as far as I can tell, does not import generic TEI but only specific flavors.
>> If we send a TEI file to weblicht, the TEI tool will assume it is according
>> to this specific flavor, I did not test what happens if it is not. I am
>> afraid it is getting messy.... WebLicht looks at the type of file to suggest
>> matching webservices.  Maybe somebody else can provide more details, for
>> example the Berlin team... or Hamburg if they read along...
>>
>> Cheers
>> Thorsten
>>
>>
>> Am 16.06.16 um 17:06 schrieb Dieter Van Uytvanck:
>>>
>>> On 16/06/16 16:41, Thorsten Trippel wrote:
>>>>
>>>> Unless of course the tools really interpret all profiles or all TEI
>>>> flavors.
>>>
>>>
>>> Hi Thorsten,
>>>
>>> you are anticipating my next question - what web applications do we have
>>> that can process TEI files in general, independent from the different
>>> subvariants?
>>>
>>> At least WebLicht seems to have a TEI importer
>>> (http://wiki.tei-c.org/index.php/WebLicht#cite_note-1 - taken from the
>>> list at http://wiki.tei-c.org/index.php/Category:Analysis_tools). Do you
>>> know if it is generic, or if it expects a specific sub-variant?
>>>
>>> And would hope there are more out there...
>>>
>>> best,
>>>
>>
>>
>> --
>> ----------------------------------------------------------------------------
>> ///////// Dr. Thorsten Trippel   thorsten.trippel at uni-tuebingen.de
>>    //     Seminar für Sprachwissenschaft
>>   //  //  Eberhard-Karls-Universität Tübingen
>>  //  //   Office:  Wilhelmstr. 19 #2.17
>>     //    Phone:   +49 (0)7071-29-77352
>> ///////// Federal Republic of Germany
>> ----------------------------------------------------------------------------
>> _______________________________________________
>> Dev mailing list
>> Dev at lists.clarin.eu
>> https://lists.clarin.eu/cgi-bin/mailman/listinfo/dev
>
>
>
>-- 
>Thomas Schmidt
>IDS Mannheim
>R5, 6-13
>D-68161 Mannheim
>Tel.: +49 (621) 1581-313
>http://agd.ids-mannheim.de/index.shtml
>http://www.exmaralda.org


