<div dir="ltr">Hi all,<br><br>good to see there finally seems to be a decision for WebLicht. I was also wondering whether this decision has already been taken for the Language Resource Switchboard [1], since when I upload a TEI file I get a drop-down list of available (partly non-standard) mimetypes to choose from. Shouldn't the same approach to describing file types/formats be possible for WebLicht and the LR Switchboard (and for the rest of the CLARIN infrastructure)? And shouldn't it be possible to collect the available tool/service metadata to at least list the mimetypes currently in use (or has this already been done for the list in the LR Switchboard)?<div><br></div><div>@Marie and Claus, could you please find out/decide whether a uniform approach regarding mimetypes/format variants is possible for both services?</div><div><br></div><div>@Piotr, great, maybe you could then also have a look at the available metadata of CLARIN tools/services and the (newer) entries in the not yet really existing Format Registry [2] for the inventory of mimetypes?</div><div><br></div><div>Thanks and best regards,</div><div>Hanna</div><div><br></div><div><div><div><div>[1] <a href="http://weblicht.sfs.uni-tuebingen.de/clrs/" target="_blank">http://weblicht.sfs.uni-tuebingen.de/clrs/</a></div></div></div></div><div>[2] <a href="https://trac.clarin.eu/wiki/FormatRegistry" target="_blank">https://trac.clarin.eu/wiki/FormatRegistry</a><br></div><div><br></div>-- <br><div data-smartmail="gmail_signature">Hanna Hedeland<br>Hamburger Zentrum für Sprachkorpora<br>Max-Brauer-Allee 60<br>D - 22765 Hamburg<br><br>Tel. 
<a href="tel:%2B%2049%2040%2042838%206893" value="+4940428386893" target="_blank">+ 49 40 42838 6893</a></div></div><div class="gmail_extra"><br><div class="gmail_quote">2016-07-08 17:42 GMT+02:00 Piotr Bański <span dir="ltr"><<a href="mailto:banski@ids-mannheim.de" target="_blank">banski@ids-mannheim.de</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear All,<br>

<br>

I have summarized Thomas's proposal at <a href="https://trac.clarin.eu/wiki/MIME%20format%20variants" rel="noreferrer" target="_blank">https://trac.clarin.eu/wiki/MIME%20format%20variants</a><br>

<br>

I'll also try to be the "someone" whom Thomas has hesitated to name. It will provide me with a good opportunity to look into the various corners of CLARIN's infrastructure that I could otherwise overlook.<br>

<br>

Best regards,<br>

<br>

  Piotr<div class="HOEnZb"><div class="h5"><br>

<br>

On 08/07/16 10:25, Thomas Schmidt wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Sorry: as long as this discussion is the reference document, I should<br>

point out that I made a mistake:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

A parameter "token=0/1" can be added to indicate whether (=1) or<br>

not (=0) the respective TEI file is tokenized (i.e. has &lt;w&gt; markup)<br>

</blockquote>

The name of the parameter as described by Bryan is "tokenized", not "token".<br>

<br>

- Thomas<br>

<br>

<br>

<br>

On Fri, Jul 8, 2016 at 9:04 AM, Thomas Schmidt<br>

<<a href="mailto:thomas.schmidt@ids-mannheim.de" target="_blank">thomas.schmidt@ids-mannheim.de</a>> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Dear all,<br>

<br>

in the absence of further input from the standards committee and before we<br>

lose the momentum, I'd like to summarise our action plan according to the<br>

discussion so far:<br>

<br>

(1a) In WebLicht (in CLARIN in general?) ISO/TEI transcriptions of spoken<br>

language will be identified by the MIME type<br>

text/tei+xml;format-variant=tei-iso-spoken. A parameter "token=0/1" can be<br>

added to indicate whether (=1) or not (=0) the respective TEI file is<br>

tokenized (i.e. has &lt;w&gt; markup).<br>

(1b) HZSK and myself will adapt the respective web services accordingly<br>

<br>

(2a) In WebLicht (in CLARIN in general?) DTA/TEI files will be identified by<br>

the MIME type text/tei+xml;format-variant=tei-dta. A parameter "token=0/1"<br>

can be added to indicate whether (=1) or not (=0) the respective TEI file is<br>

tokenized (i.e. has &lt;w&gt; markup).<br>

(2b) Bryan Jurish will adapt the respective web services at BBAW accordingly<br>

<br>

(3a) In WebLicht (in CLARIN in general?), EXMARaLDA Basic Transcriptions<br>

will be identified by the MIME type text/xml; format-variant=exmaralda-exb<br>

(3b) In WebLicht (in CLARIN in general?), FOLKER/OrthoNormal transcription<br>

files will be identified by the MIME type text/xml;<br>

format-variant=folker-fln<br>

(3c) In WebLicht (in CLARIN in general?), Transcriber transcription files<br>

will be identified by the MIME type text/xml; format-variant=transcriber-trs<br>

(3d) HZSK and myself will adapt the respective web services accordingly<br>

<br>

(4a) It would have to be checked (note the passive, I don't know who could<br>

be in charge of this) whether competing MIME types for these file types are<br>

already registered somewhere. I know that WebLicht already seems to have two<br>

variants of EXMARaLDA transcriptions. The mechanisms specifying those would<br>

probably have to be deprecated. Transcriber is also not unlikely to have<br>

been given some kind of mimetype elsewhere in CLARIN.<br>

(4b) Further relevant formats will be ELAN/EAF, CLAN/CHA and PRAAT/TextGrid<br>

(the latter two being text, not XML formats). All three of them are also<br>

likely to have been registered somewhere already, so "someone" (again, I<br>

wouldn't know who) should check if mime types have been defined for those.<br>
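<br>
To make points (1a) to (3c) concrete, the agreed identifiers could be assembled<br>
as in the following minimal Python sketch. The names BASE_TYPES and build_type<br>
are hypothetical and not part of WebLicht or any CLARIN service; the parameter<br>
name "tokenized" follows the correction quoted at the top of this thread, and<br>
attaching the flag only when it is known is an assumption.<br>
<pre>
```python
# Format variants from points (1a)-(3c), mapped to their base MIME types.
# The dictionary name and layout are illustrative assumptions.
BASE_TYPES = {
    "tei-iso-spoken": "text/tei+xml",
    "tei-dta": "text/tei+xml",
    "exmaralda-exb": "text/xml",
    "folker-fln": "text/xml",
    "transcriber-trs": "text/xml",
}


def build_type(variant, tokenized=None):
    """Return e.g. 'text/tei+xml; format-variant=tei-dta; tokenized=1'."""
    base = BASE_TYPES[variant]  # raises KeyError for variants not listed above
    value = f"{base}; format-variant={variant}"
    # Append the tokenization flag only when the caller knows it.
    if tokenized is not None:
        value += f"; tokenized={int(bool(tokenized))}"
    return value
```
</pre>
For example, build_type("tei-dta", tokenized=True) yields<br>
"text/tei+xml; format-variant=tei-dta; tokenized=1".<br>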

<br>

I guess that this is as good an answer as we can currently give to address<br>

points 1-3 in Marie Hinrichs's list. @Marie: can you confirm that this is<br>

sufficient for you, also to address point 4 in your list? In my understanding,<br>

whatever works for WebLicht in this respect should also be a suitable basis<br>

for a larger context (the SwitchBoard in particular?).<br>

<br>

In my eyes, it remains crucial, however, that such standardization<br>

"decisions" are centrally documented (including the information Tomaž<br>

suggested). The CLARIN standards pages as they are now<br>

(<a href="https://www.clarin.eu/content/standard-recommendations" rel="noreferrer" target="_blank">https://www.clarin.eu/content/standard-recommendations</a> /<br>

<a href="http://clarin.ids-mannheim.de/standards/index.xq" rel="noreferrer" target="_blank">http://clarin.ids-mannheim.de/standards/index.xq</a> are the ones I know) are,<br>

IMHO, incomplete, inconsistent and outdated, and they certainly do not<br>

provide accurate information on the mime types. Any input from the standards<br>

committee on this question would therefore still be much appreciated.<br>

<br>

Best,<br>

<br>

Thomas<br>

<br>

<br>

<br>

<br>

On Thu, Jun 23, 2016 at 10:56 AM, Marie Hinrichs<br>

<<a href="mailto:marie.hinrichs@uni-tuebingen.de" target="_blank">marie.hinrichs@uni-tuebingen.de</a>> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi All,<br>

<br>

Thanks to all of you for all the work you’ve done so far to get TEI<br>

processing integrated into WebLicht.<br>

<br>

 From WebLicht’s side, there are several places where some<br>

work/coordination needs to happen:<br>

<br>

1. TCF: agree on the textsource.type attribute and make sure that the<br>

encoder services set it properly<br>

2. Agree on type names (e.g. text/tei+xml or text/x-tei-dta-xml)<br>

3. Make sure the CMDI for encoder and decoder services reflect outcomes of<br>

1 and 2<br>

4. Add new mappings to WebLicht for TEI.<br>

<br>

Steps 1-3 are being worked out here on the mailing list and whichever<br>

solution/conventions you agree on are fine with us.<br>

<br>

Step 4 requires some changes to the WebLicht code - in particular to the<br>

component that we call the “profiler”. When a user uploads a file, the<br>

profiler tries to figure out what it is and if any of the WebLicht services<br>

can process it. The contentType of the uploaded file, in combination with<br>

standard libraries for file type recognition, is used for this. But<br>

sometimes more digging is necessary, as in the case with tcf - which is<br>

recognized as xml, but it needs a closer look to see if it is tcf.  The<br>

profiler will have to be updated in a similar way to recognize TEI, and<br>

hopefully there is even some straightforward way of distinguishing between<br>

the DTA and the spoken variants. Finally, mappings need to be established<br>

between the results of the profiler and the service input types so that the<br>

right services are offered to the user for selection.<br>
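<br>
A closer look of this kind could be sketched as follows. This is not the actual<br>
WebLicht profiler, only a minimal Python illustration under stated assumptions:<br>
the function name profile_xml is hypothetical, a file counts as TEI if its root<br>
element is TEI or teiCorpus in the TEI namespace, and it counts as tokenized if<br>
it contains at least one w element. Telling the DTA and spoken variants apart<br>
is left out because the thread does not say how that would be done.<br>
<pre>
```python
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI_ROOTS = (f"{{{TEI_NS}}}TEI", f"{{{TEI_NS}}}teiCorpus")


def profile_xml(data):
    """Guess a type string for XML bytes; return None if it is not TEI."""
    root_tag = None
    tokenized = False
    # Stream through start tags so large files need not be fully loaded.
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
        if root_tag is None:
            root_tag = elem.tag  # the first start event is the root element
            if root_tag not in TEI_ROOTS:
                return None
        elif elem.tag == f"{{{TEI_NS}}}w":
            tokenized = True  # one w element is enough to decide
            break
    if root_tag is None:
        return None
    return f"text/tei+xml; tokenized={int(tokenized)}"
```
</pre>
The returned string uses the same shape that the DTA converter is described as<br>
setting; malformed XML raises xml.etree.ElementTree.ParseError in this sketch.<br>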

<br>

Also note that WebLicht chains can be called from the command-line or<br>

programmatically using WebLicht as a Service (WaaS) - see instructions here:<br>

<a href="https://weblicht.sfs.uni-tuebingen.de/WaaS/" rel="noreferrer" target="_blank">https://weblicht.sfs.uni-tuebingen.de/WaaS/</a>. This is useful for larger inputs<br>

and avoids timeout issues that arise when using the web interface.<br>

<br>

Best Regards,<br>

Marie<br>

<br>

<br>

On 21.06.2016, at 14:28, Tomaž Erjavec <<a href="mailto:Tomaz.Erjavec@ijs.si" target="_blank">Tomaz.Erjavec@ijs.si</a>> wrote:<br>

<br>

Hi,<br>

<br>

as regards<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

these format-related specifications (in this case: the name and possible<br>

values of attributes which are used in addition to a mime type) would<br>

need to be documented and made known at a central place.<br>

</blockquote>

I'd say the documentation for each would need to be accompanied by its TEI<br>

schema, i.e. the TEI ODD file and the derived (probably) RelaxNG schema.<br>

Then it would be a simple matter to check if a document conforms to the mime<br>

type.<br>

<br>

Best,<br>

Tomaž<br>

<br>

On 21/06/2016 at 14:22, Bryan Jurish wrote:<br>

<br>

morning all,<br>

<br>

sounds good to me.<br>

<br>

@Marie: can you give an estimation of how well this might work for<br>

WebLicht?<br>

<br>

I'll add the "format-variant=tei-dta" parameter to the DTA TEI<->TCF web<br>

service in the next few days, so we can see how that at least works out.<br>

<br>

marmosets,<br>

   Bryan<br>

<br>

On Tue, Jun 21, 2016 at 12:32 PM, Thomas Schmidt<br>

<<a href="mailto:thomas.schmidt@ids-mannheim.de" target="_blank">thomas.schmidt@ids-mannheim.de</a>> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Dear all,<br>

<br>

revising my suggestions from the teiweblicht list according to Bryan's<br>

proposal to use official mime-types plus parameters (instead of<br>

x-extended custom mime types) would mean that:<br>

<br>

"text/x-tei-isospoken+xml" could become "text/tei+xml;<br>

format-variant=tei-iso-spoken" (+ tokenized=0/1)<br>

"text/x-tei-dta+xml" could become "text/tei+xml;<br>

format-variant=tei-dta" (+ tokenized=0/1)<br>

"text/x-exmaralda-exb+xml" could become "text/xml;<br>

format-variant=exmaralda-exb"<br>

... and so forth (for other TEI or XML-based formats)<br>
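<br>
The renaming above could be applied to existing metadata with a small lookup<br>
table, sketched here in Python. LEGACY_TO_PROPOSED and migrate_type are<br>
hypothetical names; only the three pairs listed above are included, and<br>
returning unknown values unchanged is an assumption.<br>
<pre>
```python
# Legacy x-extended types and the proposed replacements listed above.
LEGACY_TO_PROPOSED = {
    "text/x-tei-isospoken+xml": "text/tei+xml; format-variant=tei-iso-spoken",
    "text/x-tei-dta+xml": "text/tei+xml; format-variant=tei-dta",
    "text/x-exmaralda-exb+xml": "text/xml; format-variant=exmaralda-exb",
}


def migrate_type(value):
    """Return the proposed replacement for a legacy type, else the input."""
    # Lookup is case-insensitive and ignores surrounding whitespace.
    return LEGACY_TO_PROPOSED.get(value.strip().lower(), value)
```
</pre>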

<br>

Wouldn't that be a Solomonic solution? What do the WebLicht developers<br>

say? And independently of that, I think that Hanna is right that these<br>

format-related specifications (in this case: the name and possible<br>

values of attributes which are used in addition to a mime type) would<br>

need to be documented and made known at a central place. I guess it<br>

would be up to the standards committee to decide on that?<br>

<br>

Best regards,<br>

<br>

Thomas<br>

<br>

<br>

<br>

<br>

<br>

On Sat, Jun 18, 2016 at 10:56 AM, Bryan Jurish <<a href="mailto:jurish@bbaw.de" target="_blank">jurish@bbaw.de</a>> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

moin all,<br>

<br>

fwiw, I agree with Dieter that we need to differentiate between "proper"<br>

MIME types (i.e. standardized conventions registered with IANA) and<br>

CLARIN-internal (or WebLicht-internal) conventions. We have been using<br>

MIME types as the basis of the WebLicht textSource/@type attribute,<br>

analogous to the HTTP "Content-Type" header, cf.<br>

<a href="https://tools.ietf.org/html/rfc2045#section-5.1" rel="noreferrer" target="_blank">https://tools.ietf.org/html/rfc2045#section-5.1</a>. At the risk of<br>

repeating what I've already said on the tei-weblicht list, use of the<br>

Content-Type syntax allows us to have our cake and eat it too: we can go<br>

ahead and use "official" IANA-sanctioned "true" MIME types and specify<br>

variants ("dialects", "flavors") using parameters. The DTA TEI&lt;-&gt;TCF<br>

converter is already doing this, setting textSource/@type to either<br>

"text/tei+xml; tokenized=0" or "text/tei+xml; tokenized=1", depending on<br>

the relevant properties of the input document.<br>
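<br>
As a rough illustration of reading such a value, the following Python sketch<br>
splits a Content-Type style string into the media type and its parameters.<br>
parse_media_type is a hypothetical helper; it is deliberately simplified and<br>
ignores quoted strings containing semicolons and other details of the RFC 2045<br>
grammar.<br>
<pre>
```python
def parse_media_type(value):
    """Split 'type/subtype; key=value; ...' into (media_type, params)."""
    parts = [part.strip() for part in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing semicolon
        key, _sep, val = part.partition("=")
        # Parameter names are case-insensitive; values are kept as given.
        params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params
```
</pre>
For example, parse_media_type("text/tei+xml; tokenized=1") returns<br>
("text/tei+xml", {"tokenized": "1"}), so a consumer can match on the base type<br>
first and inspect the variant parameters only where it needs them.<br>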

<br>

just my €0.02.<br>

<br>

marmosets,<br>

   Bryan<br>

<br>

<br>

On Fri, Jun 17, 2016 at 1:43 PM, Dieter Van Uytvanck <<a href="mailto:dieter@clarin.eu" target="_blank">dieter@clarin.eu</a>><br>

wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 17/06/16 12:59, Sander Maijers wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

After all, you would want a resource's metadata to be completely<br>

descriptive of such elementary aspects as internal structure and content<br>

of the TEI files, and not dependent on system configuration (served as<br>

custom media type x or y, as long as the server remains so configured).<br>

</blockquote>

Hi Sander,<br>

<br>

Thank you for sharing your opinion.<br>

<br>

One side note: we are talking about detecting the mimetype as indicated<br>

in the CMDI ResourceProxy attribute, see:<br>

<br>

<br>

<br>

<a href="https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy" rel="noreferrer" target="_blank">https://www.clarin.eu/faq/how-can-i-specify-additional-details-about-resourceproxy</a><br>

<br>

So for the scenario VLO -> LR switchboard -> processing application<br>

<br>

the system configuration would not be relevant, since the mimetype is<br>

explicitly mentioned in the metadata. The key is to find agreement about<br>

a simple and light-weight way of designating the variants of TEI.<br>

<br>

best,<br>

<br>

--<br>

Dieter Van Uytvanck<br>

Technical Director CLARIN ERIC<br>

<a href="http://www.clarin.eu" rel="noreferrer" target="_blank">www.clarin.eu</a> | tel. <a href="tel:%2B31-%280%29850091363" value="+31850091363" target="_blank">+31-(0)850091363</a> | skype: dietervu.mpi<br>

_______________________________________________<br>

Teiweblicht mailing list<br>

<a href="mailto:Teiweblicht@lists.informatik.uni-leipzig.de" target="_blank">Teiweblicht@lists.informatik.uni-leipzig.de</a><br>

<a href="http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht" rel="noreferrer" target="_blank">http://lists.informatik.uni-leipzig.de/mailman/listinfo/teiweblicht</a><br>

<br>

</blockquote>

<br>

<br>

--<br>

***************************************************<br>

Bryan Jurish<br>

Deutsches Textarchiv<br>

Digitales Wörterbuch der deutschen Sprache<br>

Berlin-Brandenburgische Akademie der Wissenschaften<br>

<br>

Jägerstr. 22/23<br>

10117 Berlin<br>

<br>

Tel.:     <a href="tel:%2B49%20%280%2930%2020370%20539" value="+493020370539" target="_blank">+49 (0)30 20370 539</a><br>

E-Mail:   <a href="mailto:jurish@bbaw.de" target="_blank">jurish@bbaw.de</a><br>

***************************************************<br>

<br>


<br>

</blockquote>

<br>

<br>

--<br>

Thomas Schmidt<br>

IDS Mannheim<br>

R5, 6-13<br>

D-68161 Mannheim<br>

Tel.: <a href="tel:%2B49%20%28621%29%201581-313" value="+496211581313" target="_blank">+49 (621) 1581-313</a><br>

<a href="http://agd.ids-mannheim.de/index.shtml" rel="noreferrer" target="_blank">http://agd.ids-mannheim.de/index.shtml</a><br>

<a href="http://www.exmaralda.org" rel="noreferrer" target="_blank">http://www.exmaralda.org</a><br>

<br>

</blockquote>

<br>

<br>


<br>

<br>

<br>

</blockquote>

<br>

<br>


</blockquote>

<br>

<br>

</blockquote>

<br>

<br>

-- <br></div></div><span class="im HOEnZb">

Piotr Bański, Ph.D.<br>

Senior Researcher,<br>

Institut für Deutsche Sprache,<br>

R5 6-13<br>

68-161 Mannheim, Germany<br>

<br></span><div class="HOEnZb"><div class="h5">


</div></div></blockquote></div>

</div>