[Dev] CLARIN-FCS: clarification about FCS schema

Oliver Schonefeld schonefeld at ids-mannheim.de
Mon Oct 1 14:12:13 CEST 2012


[X-Posted to CLARIN-D developers]

Hi,

while building an SRU client for FCS, I revisited the current CLARIN-FCS
record schema [1].
I've got two issues with the current schema that I'd like to discuss
with interested developers:

1) [minor] The dataview type currently allows only three values
   ("kwic", "fulltext", "image"). Some endpoints, e.g. Meertens, also
   have a DataView for KML. However, "kml" is currently not within
   the set of allowed values, thus resulting in invalid XML.
   We have several options to deal with this:
   a) add "kml" to the list of allowed values (and do this every time a
      new dataview pops up, including bumping the version number of the
      schema)
   b) get rid of the predefined values and define the attribute value to
      be of type xs:NMTOKEN (or something similar)
   c) drop the @type attribute in favor of a proper @mime-type
      attribute. For our own types (e.g. kwic) we could define
      non-standard MIME types (cf. RFC 2045, RFC 4288), e.g.
      "application/x-clarin-fcs-kwic+xml"

      (SN: KML has an officially registered MIME type:
           "application/vnd.google-earth.kml+xml")

   BTW, I'd vote for solution c ...
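
   To make options b and c a bit more concrete, the attribute
   declarations might look roughly like the following. This is only a
   sketch: use="required" and the simplified pattern are assumptions
   and not taken from the current schema, and the pattern does not
   cover the full RFC 4288 grammar.

   ```xml
   <!-- Option b (sketch): any name token is accepted as dataview type -->
   <xs:attribute name="type" type="xs:NMTOKEN" use="required"/>

   <!-- Option c (sketch): replace @type with a @mime-type attribute;
        the pattern is a simplified "type/subtype" check (assumption) -->
   <xs:attribute name="mime-type" use="required">
     <xs:simpleType>
       <xs:restriction base="xs:string">
         <xs:pattern
           value="[A-Za-z0-9][A-Za-z0-9.+_\-]*/[A-Za-z0-9][A-Za-z0-9.+_\-]*"/>
       </xs:restriction>
     </xs:simpleType>
   </xs:attribute>
   ```

   With option c, both "application/x-clarin-fcs-kwic+xml" and
   "application/vnd.google-earth.kml+xml" would be accepted without
   touching the schema again.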


2) [major] "Resource" is currently defined semi-recursively:
   <xs:complexType name="ResourceType">
     <xs:sequence>
       <xs:element minOccurs="0"
             name="Resource" type="fcs:ResourceType"/>
       <xs:element maxOccurs="unbounded" minOccurs="0"
             name="DataView" type="fcs:DataViewType"/>
       <xs:element maxOccurs="unbounded" minOccurs="0"
             name="ResourceFragment" type="fcs:ResourceFragmentType"/>
     </xs:sequence>
     <xs:attribute name="pid" type="fcs:pidType" use="optional"/>
     <xs:attribute name="ref" type="fcs:refType" use="optional"/>
   </xs:complexType>
   Since maxOccurs defaults to 1 (not "unbounded"), the definition of
   the type in the XSD allows for structures where a Resource may have
   zero or one Resource as a child, thus forming structures like
   (namespaces and other elements omitted for brevity):
     <Resource ...>
       <Resource ...>
         <Resource ...>
           <Resource ...>
             <!-- ad infinitum -->
           </Resource>
         </Resource>
       </Resource>
     </Resource>
   However, it does not allow a Resource element with more than one
   Resource element as children, like:
     <Resource ...>
       <Resource ...>
       </Resource>
       <Resource ...>
       </Resource>
     </Resource>

   The first structure does not really make sense to me, while one
   could argue that the second could be used to produce a structured
   result in the form of a (sub-)corpus.
   My suggestion is to either drop the recursiveness or define it
   properly (including some real-world use cases showing why it is
   needed).

   BTW, I'd vote for dropping the recursiveness ...
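
   For illustration, dropping the recursiveness could be as simple as
   removing the nested Resource child from the type. The following is
   a sketch based only on the definition quoted above; everything else
   in the schema is assumed to stay as it is.

   ```xml
   <!-- Sketch: ResourceType without the nested Resource element -->
   <xs:complexType name="ResourceType">
     <xs:sequence>
       <xs:element maxOccurs="unbounded" minOccurs="0"
             name="DataView" type="fcs:DataViewType"/>
       <xs:element maxOccurs="unbounded" minOccurs="0"
             name="ResourceFragment" type="fcs:ResourceFragmentType"/>
     </xs:sequence>
     <xs:attribute name="pid" type="fcs:pidType" use="optional"/>
     <xs:attribute name="ref" type="fcs:refType" use="optional"/>
   </xs:complexType>
   ```

   A Resource would then only carry DataView and ResourceFragment
   children, so neither of the two nested structures shown above
   would validate.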

Comments, Ideas, Thoughts?

Best,
  Oliver

[1] http://trac.clarin.eu/browser/FederatedSearch/Resource.xsd
-- 
Oliver Schonefeld
Institut für Deutsche Sprache, Zentrale Forschung
R5, 6-13, D-68161 Mannheim
+49-(0)621-1581-451 | http://www.ids-mannheim.de


