[Dev] proposal: using a common mime type for TEI files

Hanna Hedeland hanna.hedeland at uni-hamburg.de
Fri Jun 17 13:18:17 CEST 2016


Hi all,

great to see this issue being discussed on the EU level. As Thomas said,
we've been trying for some time to arrive at some solution for the WebLicht
scenario. I suppose the WebLicht team are also on one of the lists and can
benefit from the suggestions and the input :)

I think the standards/IANA aspect might be a valid argument against the
solution with individual non-(yet-)approved mime types, but on the other
hand I don't really see how an xml or even tei mime type can be informative
enough for web services relevant for processing linguistic resources.
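
To make the point concrete, here is a minimal sketch of what a matching
service could and could not read off a media type. The pattern and the
function name tei_variant are my own assumptions for illustration; only the
two "text/x-tei-...+xml" strings come from Thomas's mail, and none of this is
an agreed or registered naming scheme.

```python
import re

# Hypothetical pattern for the variant-carrying types mentioned in this thread,
# e.g. "text/x-tei-isospoken+xml". Not a registered or agreed naming scheme.
_TEI_VARIANT = re.compile(r"^text/x-tei-(?P<variant>[a-z0-9]+)\+xml$")


def tei_variant(media_type):
    """Return the TEI variant encoded in a media type, or None if it has none."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = media_type.split(";", 1)[0].strip().lower()
    match = _TEI_VARIANT.match(essence)
    return match.group("variant") if match else None


for mt in [
    "text/x-tei-isospoken+xml",
    "text/x-tei-dta+xml",
    "application/tei+xml",
    "application/xml; charset=utf-8",
]:
    print(mt, "->", tei_variant(mt))
```

With the generic XML or TEI types the function has nothing to return, which
is exactly the information a tool/data matching service would be missing.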

Either way, we really need to agree on a uniform and useful way of
describing and listing formats (not only TEI variants) in CLARIN, since
most of the relevant formats share a standard mimetype. And maybe here I
could just (once more) point to the Format Registry (
https://trac.clarin.eu/wiki/FormatRegistry) - or rather what the format
registry should be (or maybe I should ask where something replacing this
link is, and I just haven't heard about it). I guess if the declaration of
formats being used is done with CMDI, someone could theoretically harvest
all of this information and display it, and we'd find out about
inconsistencies, but there are of course other ways to achieve this. Has
there been any recent development in creating this kind of inventory?
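
As a rough sketch of the harvesting idea, assuming the format declarations
have already been extracted from the metadata into simple tuples: the records
below are made up for illustration (they are not real CLARIN data), and the
function name ambiguous_media_types is hypothetical.

```python
from collections import defaultdict

# Hypothetical harvested records: (centre, declared media type, format label).
# The values are invented for illustration; they are not real CLARIN data.
records = [
    ("centre-a", "application/xml", "TEI (ISO spoken)"),
    ("centre-b", "application/xml", "TEI (DTA)"),
    ("centre-c", "application/xml", "other XML format"),
    ("centre-d", "text/plain", "plain text"),
]


def ambiguous_media_types(records):
    """Map each media type to the labels declared for it, keeping only
    media types that are used for more than one format."""
    labels = defaultdict(set)
    for _centre, media_type, label in records:
        labels[media_type].add(label)
    return {mt: sorted(found) for mt, found in labels.items() if len(found) > 1}


print(ambiguous_media_types(records))
```

A listing like this would show at a glance which shared media types hide
several distinct formats, and where centres describe the same format
differently.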

Thanks and best regards,
Hanna

-- 
Hanna Hedeland
Hamburger Zentrum für Sprachkorpora
Max-Brauer-Allee 60
D - 22765 Hamburg

Tel. + 49 40 42838 6893

2016-06-17 13:00 GMT+02:00 Sander Maijers <sander at clarin.eu>:

> *not only the proper
> BTW, the same holds for suffixes.
>
>
> On Fri, Jun 17, 2016 at 12:59 PM, Sander Maijers <sander at clarin.eu> wrote:
> > Hi Dieter,
> >
> > Using custom media types can be done in a number of ways,
> > described in https://en.wikipedia.org/wiki/Media_type#Registration_trees
> > .
> > You stated the benefits of your solution well. Your solution has the
> > following costs:
> > - You'll have to either go through the IANA registration procedure for new
> > media types in the ‘Personal or Vanity’ tree, go through IETF
> > Standards Action to get a CLARIN-specific tree, or break the standards
> > and use custom media types outside of this process.
> > - Whatever you opt for in this context, no third-party tools (i.e.,
> > general, standards-compliant ones) will recognize the media type of a
> > centre's content retrieved via PID URLs anymore.
> >
> > I find Menzo's approach not the proper as well as most useful one
> > compared to media type based approaches. After all, you would want a
> > resource's metadata to be completely descriptive of such elementary
> > aspects as internal structure and content of the TEI files, and not
> > dependent on system configuration (served as custom media type x or y,
> > as long as the server remains so configured).
> >
> > Best,
> > Sander
> >
> >
> > On Fri, Jun 17, 2016 at 11:39 AM, Dieter Van Uytvanck <dieter at clarin.eu> wrote:
> >> On 16/06/16 20:35, Thomas Schmidt wrote:
> >>> Therefore, we would need to distinguish this at whatever the place is
> >>> where WebLicht distinguishes file formats. If it is via the mime type,
> >>> we would need a mime type extension like "text/x-tei-isospoken+xml"
> >>> vs. "text/x-tei-dta+xml". If it is on some other level, we would have
> >>> to know which and agree on a suitable set of TEI variant identifiers.
> >>> I'm copying relevant parts of the mailing list exchange below for your
> >>> information.
> >>
> >> Dear Thomas,
> >>
> >> Thank you for this very insightful summary of the discussions on this
> >> topic. Looking at all the suggestions made, I think having detailed
> >> mimetype extensions would be the most convenient for most parties
> >> involved:
> >>
> >> - It puts the responsibility of providing an exact data type for a file
> >> at the side of the metadata creator/resource provider. This is always
> >> better than relying on interpretation by a third-party tool.
> >>
> >> - It does not require changes to (CMDI) metadata profiles.
> >>
> >> - It makes it feasible for tool/data matching applications (WebLicht,
> >> Switchboard, ...) to provide a meaningful processing application.
> >>
> >> There are of course approaches on other levels too (like suggested by
> >> Bart and Menzo), and these could be used in addition to the extended TEI
> >> mimetypes:
> >>
> >> - Matching applications could still try to parse a TEI file (in absence
> >> of a detailed mime type) and make a guess about the sub-type, using
> >> @type where available. This is of course not trivial.
> >>
> >> - The ParameterGroup in the CMDI description can be added. But in many
> >> cases that requires metadata providers to change their profiles, which
> >> means quite a bit of additional work.
> >>
> >> I will join the TEI weblicht list, and try to gather a bit more concrete
> >> information in the upcoming time at
> >>
> >> https://trac.clarin.eu/wiki/TEI%20variants
> >>
> >> (feel free to edit along)
> >>
> >> When we have that additional information, we can try to come up with
> >> concrete recommendations.
> >>
> >> best regards,
> >> --
> >> Dieter Van Uytvanck
> >> Technical Director CLARIN ERIC
> >> www.clarin.eu | tel. +31-(0)850091363 | skype: dietervu.mpi
> >> _______________________________________________
> >> Dev mailing list
> >> Dev at lists.clarin.eu
> >> https://lists.clarin.eu/cgi-bin/mailman/listinfo/dev


