[Dev] CLARIN-FCS: clarification about FCS schema
Oliver Schonefeld
schonefeld at ids-mannheim.de
Mon Oct 1 14:12:13 CEST 2012
[X-Posted to CLARIN-D developers]
Hi,
while building an SRU client for FCS, I revisited the current CLARIN-FCS
record schema [1].
I've got two issues with the current schema that I'd like to discuss
with interested developers:
1) [minor] The dataview type currently allows only three values
("kwic", "fulltext", "image"). Some endpoints, e.g. Meertens, also
have a DataView for KML. However, "kml" is currently not in the set
of allowed values, so such responses do not validate against the
schema.
We have several options to deal with this:
a) add "kml" to the list of allowed values (and do this every time a
new dataview pops up, including bumping the version number of the
schema)
b) get rid of the predefined values and define the attribute value to
be of type xs:NMTOKEN (or something similar)
c) drop the @type attribute in favor of a proper @mime-type
attribute. For our own types (e.g. kwic) we could define
non-standard MIME types (cf. RFC 2045, RFC 4288), e.g.
"application/x-clarin-fcs-kwic+xml"
(SN: KML has an officially registered MIME type:
"application/vnd.google-earth.kml+xml")
BTW, I'd vote for solution c ...
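As a rough sketch of what options b and c could look like (these
declarations are assumptions for illustration, not taken from the
current schema; the enclosing type and the DataView content are
omitted):

```xml
<!-- option b (sketch): keep @type, but drop the enumeration -->
<xs:attribute name="type" type="xs:NMTOKEN"/>

<!-- option c (sketch): replace @type with @mime-type. xs:string is an
     assumption here; xs:NMTOKEN would not work, because "/" and "+"
     are not allowed in a name token. -->
<xs:attribute name="mime-type" type="xs:string"/>

<!-- instances under option c, using the two MIME types named above -->
<DataView mime-type="application/x-clarin-fcs-kwic+xml"> ... </DataView>
<DataView mime-type="application/vnd.google-earth.kml+xml"> ... </DataView>
```

With either variant, a new dataview would no longer require a schema
change.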
2) [major] "Resource" is currently defined semi-recursively:
<xs:complexType name="ResourceType">
  <xs:sequence>
    <xs:element minOccurs="0"
                name="Resource" type="fcs:ResourceType"/>
    <xs:element maxOccurs="unbounded" minOccurs="0"
                name="DataView" type="fcs:DataViewType"/>
    <xs:element maxOccurs="unbounded" minOccurs="0"
                name="ResourceFragment" type="fcs:ResourceFragmentType"/>
  </xs:sequence>
  <xs:attribute name="pid" type="fcs:pidType" use="optional"/>
  <xs:attribute name="ref" type="fcs:refType" use="optional"/>
</xs:complexType>
Since maxOccurs defaults to 1 (not "unbounded"), the definition of
the type in the XSD allows for structures where a Resource may have
zero or one Resource as a child, thus forming structures like
(namespaces and other elements omitted for brevity):
<Resource ...>
  <Resource ...>
    <Resource ...>
      <Resource ...>
        <!-- ad infinitum -->
      </Resource>
    </Resource>
  </Resource>
</Resource>
However, it does not allow a Resource element with more than one
Resource element as a child, like:
<Resource ...>
  <Resource ...>
  </Resource>
  <Resource ...>
  </Resource>
</Resource>
The first structure does not really make sense to me, while one
could argue that the second could be used to produce a structured
result in the form of a (sub-)corpus.
My suggestion is either to drop the recursiveness or to define it
properly (including some real-world use cases showing why it is needed).
BTW, I'd vote for dropping the recursiveness ...
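To make the two alternatives concrete, here is a sketch of how the
type could be declared in each case. Both variants are only
illustrations derived from the definition quoted above, not a
proposal for the final schema text:

```xml
<!-- variant 1 (sketch): no recursion; a Resource contains only
     DataViews and ResourceFragments -->
<xs:complexType name="ResourceType">
  <xs:sequence>
    <xs:element maxOccurs="unbounded" minOccurs="0"
                name="DataView" type="fcs:DataViewType"/>
    <xs:element maxOccurs="unbounded" minOccurs="0"
                name="ResourceFragment" type="fcs:ResourceFragmentType"/>
  </xs:sequence>
  <xs:attribute name="pid" type="fcs:pidType" use="optional"/>
  <xs:attribute name="ref" type="fcs:refType" use="optional"/>
</xs:complexType>

<!-- variant 2 (sketch): keep the recursion, but let a Resource have
     any number of child Resources; only the changed element is shown -->
<xs:element maxOccurs="unbounded" minOccurs="0"
            name="Resource" type="fcs:ResourceType"/>
```

Variant 2 would permit the (sub-)corpus style nesting with several
sibling Resources, but it still needs a documented use case.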
Comments, Ideas, Thoughts?
Best,
Oliver
[1] http://trac.clarin.eu/browser/FederatedSearch/Resource.xsd
--
Oliver Schonefeld
Institut für Deutsche Sprache, Zentrale Forschung
R5, 6-13, D-68161 Mannheim
+49-(0)621-1581-451 | http://www.ids-mannheim.de