[Tf-curation] PID for sub-corpus

Pavel Stranak stranak at ufal.mff.cuni.cz
Mon Feb 22 12:58:34 CET 2016

Hi Krister (hi all),

> On 22. 2. 2016, at 10:54, Krister Lindén <krister.linden at helsinki.fi> wrote:
> This solves the metadata problem but now we have a technical problem: what is the PID of the virtual corpus? Arguably, we could create one PID for each set of search conditions as they will produce the same virtual corpus view of the underlying real corpus each time. Is there a better solution? What do your PIDs for virtual corpus collections and federated search results point to?

I would definitely only assign a PID to data that others can retrieve and get exactly the same thing (i.e. not even shuffled differently). Otherwise it makes no sense for replicability, citation, or any other purpose of using PIDs that comes to my mind.

A persistent ID of a query might be a different thing, though. We do support creating PIDs for queries in various tools. There I would assume the underlying dataset and the meaning of the query to be unchanging, but the query results might not always be exactly the same (e.g. the database returning results in random order). This seems OK to me, but only if it is clear that it is a PID of a query, not of a dataset per se. We use shortref.org for that.
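
The distinction between a query PID and a dataset PID can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: it is not how shortref.org or any existing tool works, and the function names, the shape of the search conditions, and the corpus identifier "hdl:example/corpus" are all hypothetical. The query key depends only on the corpus and the search conditions, while the result digest changes as soon as the same rows come back in a different order.

```python
import hashlib
import json


def query_fingerprint(corpus_pid, conditions):
    """Stable key for a query: same corpus and same conditions give the same key.

    Hypothetical helper; 'conditions' is assumed to be a JSON-serialisable dict.
    """
    canonical = json.dumps(
        {"corpus": corpus_pid, "conditions": conditions},
        sort_keys=True,          # key order in the dict must not matter
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def result_fingerprint(rows):
    """Order-sensitive digest of the retrieved rows (a list of strings)."""
    digest = hashlib.sha256()
    for row in rows:
        data = row.encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


# Hypothetical corpus identifier and search conditions.
q1 = query_fingerprint("hdl:example/corpus", {"lang": "fi", "year": 2015})
q2 = query_fingerprint("hdl:example/corpus", {"year": 2015, "lang": "fi"})

# The same rows, returned in two different orders.
r1 = result_fingerprint(["sentence A", "sentence B"])
r2 = result_fingerprint(["sentence B", "sentence A"])

same_query = q1 == q2      # the query identity is stable
same_data = r1 == r2       # the retrieved data is not byte-for-byte identical
```

In this sketch the query key would be a candidate for a query PID, whereas only a result set whose digest stays the same on every retrieval would qualify for a dataset PID in the stricter sense described above.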

