[Dev] [FCS] Experimental SRU/CQL endpoint for TextGrids digital library (Lucene based)

Thomas Zastrow thomas.zastrow at uni-tuebingen.de
Tue May 22 14:00:19 CEST 2012


Hi Oli,

This is great! Let's give it an umlaut ... Searching for "Tübingen":

http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=T%C3%BCbingen

where "Tübingen" is automatically converted by the browser, brings up
276 results ;-)
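
For anyone calling the endpoint from a script rather than a browser, the same conversion can be done by hand. This is only a minimal sketch, assuming the server expects UTF-8 percent-encoding (which is what the URL above suggests); it uses nothing beyond Python's standard library:

```python
from urllib.parse import quote, unquote

term = "Tübingen"

# quote() percent-encodes using UTF-8 by default; safe="" means that
# no reserved characters are left unescaped.
encoded = quote(term, safe="")
print(encoded)

# Decoding the escaped form gives back the original search term.
assert unquote(encoded) == term
```

The "ü" becomes the two bytes C3 BC in UTF-8, which is where the "%C3%BC" in the URL comes from.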

Best,

Tom



Quoting Oliver Schonefeld <schonefeld at ids-mannheim.de>:

> Hi,
>
> I've bolted an experimental SRU/CQL endpoint together, which exposes
> most of TextGrid's Digital Library[1] to CLARIN FCS.
> The search engine is based on Apache Lucene and it's pretty fast. The
> index size is about 1.6GB while the original XML data has ~2GB (XML
> overhead and some documents are skipped). Only author, title and the
> full text are indexed; indexing takes about 6:30 minutes on my desktop.
> The hardest part was writing the parser for the TEI files and convincing
> Lucene to do something that kind of resembles a KWIC view.
>
> Limitations:
> - @pid and @ref contain dummy values[2]
> - CQL to Lucene query translation sometimes yields unexpected results.
> - sentence tokenization is sometimes strange; this is due to using
>   OpenNLP's sentence tokenizer ...
> - if a sentence contains the search term multiple times, it will only
>   be listed once in the result set (won't fix)
> - KWIC hit markings are ... sometimes .. strange
> - only full-text searches are supported (e.g. you cannot limit search to
>   a specific author and/or work)
> - the server will limit result sets to a maximum of 100 records
>   (actually, it's playing nice to your client. If you think your client
>    can handle it, append "x-unlimited-resultset=1" to the query
>    parameters in the URL ;)
>
> Source code is still kind of work-in-progress but is available upon
> request. Actually, I started this to provide an easy example of how to
> use the SRU-Server library -- however, it became "easy" for given values
> of "easy" ;)
>
> Some examples:
>
> http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=%22Pudels%20Kern%22
>
> http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=words%20any%20Golem%20%20Thora
>
> http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=Nijmegen
> ... 1 hit ;)
>
> http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=Mannheim
> ... 254 hits :P
>
> Have fun,
>  Oliver
>
> [1] http://www.textgrid.de/digitale-bibliothek.html, licensed CC-BY
> [2] TextGrid probably issues a new version with PIDs to the TextGridRep.
>     Then this issue can be fixed.
> --
> Oliver Schonefeld
> Institut für Deutsche Sprache, Zentrale Forschung
> R5, 6-13, D-68161 Mannheim
> +49-(0)621-1581-451 | http://www.ids-mannheim.de
> _______________________________________________
> Dev mailing list
> Dev at lists.clarin.eu
> https://lists.clarin.eu/cgi-bin/mailman/listinfo/dev
>
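
The example URLs in the quoted message all follow the same pattern, so they can be assembled programmatically. The following is a rough sketch, not part of Oliver's code: the helper name `build_search_url` and its `unlimited` argument are made up for illustration, while the base URL, the `operation`, `version` and `query` parameters and the `x-unlimited-resultset=1` switch are taken from the message above. It only builds the URL and does not contact the server.

```python
from urllib.parse import quote, urlencode

# Endpoint as given in the quoted message.
BASE_URL = "http://clarin.ids-mannheim.de/digibibsru"


def build_search_url(cql_query, unlimited=False):
    """Build a searchRetrieve URL (hypothetical helper, for illustration)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
    }
    if unlimited:
        # Lifts the 100-record cap described in the limitations list.
        params["x-unlimited-resultset"] = "1"
    # quote_via=quote encodes spaces as %20 (not "+"), as in the examples.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


phrase_url = build_search_url('"Pudels Kern"')
unlimited_url = build_search_url("Mannheim", unlimited=True)
print(phrase_url)
print(unlimited_url)
```

The first call reproduces the "Pudels Kern" example URL from the message; the second adds the switch for an uncapped result set, which is probably only sensible if the client can handle several hundred records.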



-- 
Dr. Thomas Zastrow
Seminar fuer Sprachwissenschaft
Universitaet Tuebingen

Wilhelmstr. 19
D-72074 Tuebingen

http://www.thomas-zastrow.de

Tel.: 07071/29-73968
Fax: 07071/29-5214


