[Dev] [FCS] Experimental SRU/CQL endpoint for TextGrids digital library (Lucene based)
Oliver Schonefeld
schonefeld at ids-mannheim.de
Tue May 22 01:53:56 CEST 2012
Hi,
I've bolted an experimental SRU/CQL endpoint together, which exposes
most of TextGrid's Digital Library[1] to CLARIN FCS.
The search engine is based on Apache Lucene and it's pretty fast. The
index size is about 1.6GB, while the original XML data is ~2GB (the
difference is XML overhead and some skipped documents). Only author,
title and the full text are indexed; indexing takes about 6:30 minutes
on my desktop.
The hardest part was writing the parser for the TEI files and convincing
Lucene to do something that kind of resembles a KWIC view.
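For readers unfamiliar with the term, a KWIC (keyword in context) view lists each hit with a bit of text to its left and right. The following is only a minimal sketch of that idea using plain Python string matching; it is not the Lucene-based code behind this endpoint, and the function name `kwic` and the `width` parameter are made up for illustration.

```python
import re


def kwic(text, term, width=30):
    """Return (left, hit, right) triples for every match of `term` in `text`.

    Assumption: simple case-insensitive substring matching, not Lucene's
    analysis or highlighting.
    """
    rows = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        # Up to `width` characters of context on each side of the hit.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


if __name__ == "__main__":
    for left, hit, right in kwic("Das also war des Pudels Kern!", "pudels", width=12):
        print(f"{left:>12} [{hit}] {right}")
```

Unlike the endpoint described here, this sketch reports every occurrence separately, even when one sentence contains the term several times.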
Limitations:
- @pid and @ref contain dummy values[2]
- CQL to Lucene query translation sometimes yields unexpected results.
- sentence tokenization is sometimes strange; this is due to using
OpenNLP's sentence tokenizer ...
- if a sentence contains the search term multiple times, it will only
be listed once in the result set (won't fix)
- KWIC hit markings are ... sometimes .. strange
- only full-text searches are supported (e.g. you cannot limit search to
a specific author and/or work)
- the server will limit result sets to a maximum of 100 records
(actually, it's playing nice to your client. If you think your client
can handle it, append "x-unlimited-resultset=1" to the query
parameters in the URL ;)
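As a small illustration of the last point, here is a sketch that appends the "x-unlimited-resultset=1" parameter to an existing request URL using Python's standard library. The helper name `with_unlimited_results` is hypothetical; only the parameter name and value come from the description above.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def with_unlimited_results(url):
    """Return `url` with x-unlimited-resultset=1 added to its query string."""
    parts = urlsplit(url)
    # Keep the existing parameters, dropping any earlier copy of the flag.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "x-unlimited-resultset"
    ]
    params.append(("x-unlimited-resultset", "1"))
    # quote_via=quote encodes spaces as %20 rather than "+".
    return urlunsplit(parts._replace(query=urlencode(params, quote_via=quote)))


if __name__ == "__main__":
    print(with_unlimited_results(
        "http://clarin.ids-mannheim.de/digibibsru"
        "?operation=searchRetrieve&version=1.2&query=Mannheim"
    ))
```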
The source code is still kind of a work in progress, but is available
upon request. Actually, I started this to provide an easy example of how
to use the SRU-Server library -- however, it became "easy" for given
values of "easy" ;)
Some examples:
http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=%22Pudels%20Kern%22
http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=words%20any%20Golem%20%20Thora
http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=Nijmegen
... 1 hit ;)
http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=Mannheim
... 254 hits :P
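The example URLs above can also be built and evaluated from a script. The sketch below, in Python, assembles a searchRetrieve URL from a CQL query and reads the hit count from a response body. It assumes the endpoint returns a standard SRU 1.2 response with a numberOfRecords element in the http://www.loc.gov/zing/srw/ namespace; the function names `sru_url` and `hit_count` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

ENDPOINT = "http://clarin.ids-mannheim.de/digibibsru"
# Assumption: responses use the standard SRU 1.2 response namespace.
SRW_NS = "http://www.loc.gov/zing/srw/"


def sru_url(cql_query, endpoint=ENDPOINT):
    """Build a searchRetrieve URL for a CQL query, percent-encoding it."""
    params = [
        ("operation", "searchRetrieve"),
        ("version", "1.2"),
        ("query", cql_query),
    ]
    return endpoint + "?" + urlencode(params, quote_via=quote)


def hit_count(response_xml):
    """Return the numberOfRecords value of an SRU response, or None if absent."""
    root = ET.fromstring(response_xml)
    node = root.find(f".//{{{SRW_NS}}}numberOfRecords")
    if node is None or not node.text:
        return None
    return int(node.text)


if __name__ == "__main__":
    print(sru_url('"Pudels Kern"'))
```

Fetching the URL (for example with urllib.request.urlopen) and passing the response body to `hit_count` should then give the number of hits, provided the response follows the assumed format.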
Have fun,
Oliver
[1] http://www.textgrid.de/digitale-bibliothek.html, licensed CC-BY
[2] TextGrid will probably release a new version with PIDs to the
TextGridRep. Then this issue can be fixed.
--
Oliver Schonefeld
Institut für Deutsche Sprache, Zentrale Forschung
R5, 6-13, D-68161 Mannheim
+49-(0)621-1581-451 | http://www.ids-mannheim.de