[Dev] [Teiweblicht] proposal: using a common mime type for TEI files

Marie Hinrichs marie.hinrichs at uni-tuebingen.de
Thu Jun 23 10:56:35 CEST 2016


Hi All,

Thanks to all of you for all the work you’ve done so far to get TEI processing integrated into WebLicht.

From WebLicht’s side, there are several places where some work/coordination needs to happen:

1. TCF: agree on the textsource.type attribute and make sure that the encoder services set it properly
2. Agree on type names (e.g. text/tei+xml or text/x-tei-dta+xml)
3. Make sure the CMDI for encoder and decoder services reflect outcomes of 1 and 2
4. Add new mappings to WebLicht for TEI.

Steps 1-3 are being worked out here on the mailing list and whichever solution/conventions you agree on are fine with us.
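
To make the naming question in steps 1 and 2 concrete, here is a minimal sketch of how a consumer could read a parameterized type string of the kind proposed in the quoted thread below. It is only an illustration, not WebLicht code: the function names parse_type and normalize are hypothetical, and the legacy-to-new table simply restates Thomas's suggestions. Parsing relies on Python's standard email.message module, which understands the Content-Type syntax.

```python
from email.message import Message

# Legacy custom types and their parameterized replacements, copied from the
# suggestions in the quoted thread (not an agreed convention).
LEGACY_TYPES = {
    "text/x-tei-isospoken+xml": "text/tei+xml; format-variant=tei-iso-spoken",
    "text/x-tei-dta+xml": "text/tei+xml; format-variant=tei-dta",
    "text/x-exmaralda-exb+xml": "text/xml; format-variant=exmaralda-exb",
}


def normalize(value):
    """Rewrite a legacy custom type to its parameterized form, if known."""
    return LEGACY_TYPES.get(value.strip().lower(), value.strip())


def parse_type(value):
    """Split a Content-Type style string into (mime_type, parameters)."""
    msg = Message()
    msg["Content-Type"] = normalize(value)
    mime_type = msg.get_content_type()
    # get_params() returns the type itself as the first entry; skip it.
    params = dict(msg.get_params()[1:])
    return mime_type, params


if __name__ == "__main__":
    print(parse_type("text/tei+xml; format-variant=tei-dta; tokenized=1"))
    print(parse_type("text/x-tei-isospoken+xml"))
```

With this approach a service that only understands generic TEI can ignore the parameters, while one that needs a specific variant can check format-variant and tokenized.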

Step 4 requires some changes to the WebLicht code, in particular to the component that we call the “profiler”. When a user uploads a file, the profiler tries to figure out what it is and whether any of the WebLicht services can process it. It uses the contentType of the uploaded file in combination with standard libraries for file type recognition. Sometimes more digging is necessary, as with TCF: it is recognized as XML, but a closer look is needed to see that it is TCF. The profiler will have to be updated in a similar way to recognize TEI, and hopefully there is a straightforward way of distinguishing between the DTA and the spoken variants. Finally, mappings need to be established between the results of the profiler and the service input types so that the right services are offered to the user for selection.
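
As a rough illustration of the "closer look" described above (not the actual profiler code), the following sketch inspects only the root element of an uploaded XML file. The function name sniff_xml is hypothetical. The namespace and root element names are my assumptions about typical documents: TEI P5 files use the namespace http://www.tei-c.org/ns/1.0 with a TEI or teiCorpus root, and TCF files are assumed to have a D-Spin root in http://www.dspin.de/data. Telling the DTA and spoken variants apart would need further checks that are not attempted here.

```python
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"   # TEI P5 namespace
TCF_NS = "http://www.dspin.de/data"      # assumed namespace of the TCF root


def sniff_xml(data):
    """Guess a type name from the root element of an XML document (bytes)."""
    root_tag = None
    try:
        # Stop at the first "start" event so large files are not fully parsed.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            root_tag = elem.tag
            break
    except ET.ParseError:
        return "unknown"
    if root_tag is None:
        return "unknown"
    if root_tag in ("{%s}TEI" % TEI_NS, "{%s}teiCorpus" % TEI_NS):
        return "text/tei+xml"
    if root_tag == "{%s}D-Spin" % TCF_NS:
        return "text/tcf+xml"
    return "text/xml"


if __name__ == "__main__":
    sample = b'<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>'
    print(sniff_xml(sample))
```

The returned type name could then serve as the key for the mapping to service input types.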

Also note that WebLicht chains can be called from the command line or programmatically using WebLicht as a Service (WaaS); see the instructions at https://weblicht.sfs.uni-tuebingen.de/WaaS/ for details. This is useful for larger inputs and avoids the timeout issues that can arise when using the web interface.

Best Regards,
Marie


> On 21.06.2016, at 14:28, Tomaž Erjavec <Tomaz.Erjavec at ijs.si> wrote:
> 
> Hi,
> 
> as regards 
> > these format-related specifications (in this case: the name and possible
> > values of attributes which are used in addition to a mime type) would
> > need to be documented and made known at a central place. 
> I'd say the documentation for each would need to be accompanied by its TEI schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema. Then it would be a simple matter to check if a document conforms to the mime type.
> 
> Best,
> Tomaž
> 
> On 21/06/2016 at 14:22, Bryan Jurish wrote:
>> morning all,
>> 
>> sounds good to me.
>> 
>> @Marie: can you give an estimation of how well this might work for WebLicht?
>> 
>> I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web service in the next few days, so we can see how that at least works out.
>> 
>> marmosets,
>>   Bryan
>> 
>> On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt <thomas.schmidt at ids-mannheim.de> wrote:
>> Dear all,
>> 
>> revising my suggestions from the teiweblicht list according to Bryan's
>> proposal to use official mime-types plus parameters (instead of
>> x-extended custom mime types) would mean that:
>> 
>> "text/x-tei-isospoken+xml" could become "text/tei+xml;
>> format-variant=tei-iso-spoken" (+ tokenized=0/1)
>> "text/x-tei-dta+xml" could become "text/tei+xml;
>> format-variant=tei-dta" (+ tokenized=0/1)
>> "text/x-exmaralda-exb+xml" could become "text/xml; format-variant=exmaralda-exb"
>> ... and so forth (for other TEI or XML based formats)
>> 
>> Wouldn't that be a solomonic solution? What do the WebLicht developers
>> say? And independently of that, I think that Hanna is right that these
>> format-related specifications (in this case: the name and possible
>> values of attributes which are used in addition to a mime type) would
>> need to be documented and made known at a central place. I guess it
>> would be up to the standards committee to decide on that?
>> 
>> Best regards,
>> 
>> Thomas
>> 
>> 
>> 
>> 
>> 
>> On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <jurish at bbaw.de> wrote:
>> > moin all,
>> >
>> > fwiw, I agree with Dieter that we need to differentiate between "proper"
>> > MIME types (i.e. standardized conventions registered with IANA) and
>> > CLARIN-internal (resp. WebLicht-internal) conventions.  We have been using
>> > MIME types as the basis of the WebLicht textSource/@type attribute,
>> > analogous to the HTTP "Content-Type" header, cf.
>> > https://tools.ietf.org/html/rfc2045#section-5.1 .  At the risk of repeating
>> > what I've already said on the tei-weblicht list, use of the Content-Type
>> > syntax allows us to have our cake and eat it too: we can go ahead and use
>> > "official" IANA-sanctioned "true" MIME types and specify variants
>> > ("dialects", "flavors") using parameters.  The DTA TEI<->TCF converter is
>> > already doing this, setting textSource/@type to either "text/tei+xml;
>> > tokenized=0" or "text/tei+xml; tokenized=1", depending on the relevant
>> > properties of the input document.
>> >
>> > just my €0.02.
>> >
>> > marmosets,
>> >   Bryan
>> >
>> >
>> > On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <dieter at clarin.eu> wrote:
>> >>
>> >> On 17/06/16 12:59, Sander Maijers wrote:
>> >> > After all, you would want a
>> >> > resource's metadata to be completely descriptive of such elementary
>> >> > aspects as internal structure and content of the TEI files, and not
>> >> > dependent on system configuration (served as custom media type x or y,
>> >> > as long as the server remains so configured).
>> >>
>> >> Hi Sander,
>> >>
>> >> Thank you for sharing your opinion.
>> >>
>> >> One side note: we are talking about detecting the mimetype as indicated
>> >> in the CMDI ResourceProxy attribute, see:
>> >>
>> >>
>> >> https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy
>> >>
>> >> So for the scenario VLO -> LR switchboard -> processing application
>> >>
>> >> the system configuration would not be relevant, since the mimetype is
>> >> explicitly mentioned in the metadata. The key is to find agreement about
>> >> a simple and light-weight way of designating the variants of TEI.
>> >>
>> >> best,
>> >>
>> >> --
>> >> Dieter Van Uytvanck
>> >> Technical Director CLARIN ERIC
>> >> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
>> >> _______________________________________________
>> >> Teiweblicht mailing list
>> >> Teiweblicht at lists.informatik.uni-leipzig.de
>> >> http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht
>> >>
>> >
>> >
>> >
>> > --
>> > ***************************************************
>> > Bryan Jurish
>> > Deutsches Textarchiv
>> > Digitales Wörterbuch der deutschen Sprache
>> > Berlin-Brandenburgische Akademie der Wissenschaften
>> >
>> > Jägerstr. 22/23
>> > 10117 Berlin
>> >
>> > Tel.:     +49 (0)30 20370 539
>> > E-Mail:   jurish at bbaw.de
>> > ***************************************************
>> >
>> >
>> 
>> 
>> 
>> --
>> Thomas Schmidt
>> IDS Mannheim
>> R5, 6-13
>> D-68161 Mannheim
>> Tel.: +49 (621) 1581-313
>> http://agd.ids-mannheim.de/index.shtml
>> http://www.exmaralda.org
>> 
> 
