PAN Search API

Overview

The PAN search API is a RESTful API to search and access the ClueWeb09 collection with the ChatNoir and Indri search engines.

Available Resources and Resource Methods

Authorization is required to access any resources.

The API lets you submit a search query and get back the matching ClueWeb09 documents in JSON format. The query is provided in the request body, which must be in JSON format. The content type of the response body is application/json.

These resources and methods on them are supported:

/proxy/chatnoir/batchquery

POST - Retrieve results for multiple ChatNoir queries.

/proxy/indri/batchquery

POST - Retrieve results for multiple Indri queries.

/proxy/clueweb/id

GET - Access a ClueWeb09 document.

We also provide Snippets through the regular ChatNoir API.

To start right away, look at the examples.

Authorization

To access any resource you have to authorize yourself with your PAN access token.

For search requests you can either include the token in your JSON query or set an Authorization: <token> header. Note that the PAN API does not use HTTP basic authentication; you have to set the Authorization header manually. See the usage examples for Indri and ChatNoir for details.
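The two ways of passing the token can be sketched as follows. This is a minimal Python 3 sketch using only the standard library, not an official client: the helper name build_search_request is hypothetical, and the URL and token are the sample values used in the examples on this page. The request is only built, not sent.

```python
import json
import urllib.request

# Sample values taken from the examples on this page.
CHATNOIR_URL = "http://webis15.medien.uni-weimar.de/proxy/chatnoir/batchquery.json"
TOKEN = "7eb96d7390b5f76d6fc4ffb175eaedac"


def build_search_request(url, query, token, token_in_body=False):
    """Build (but do not send) a POST request carrying the PAN token.

    Hypothetical helper: the token goes either into the JSON body
    ("token" field) or into a plain Authorization header.
    """
    body = dict(query)  # shallow copy, the caller's dict stays untouched
    headers = {"Content-Type": "application/json"}
    if token_in_body:
        body["token"] = token
    else:
        # The header value is the bare token, not HTTP basic auth.
        headers["Authorization"] = token
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


query = {
    "suspicious-docid": 123,
    "max-results": 5,
    "queries": [{"query-string": "franz liszt"}],
}
via_header = build_search_request(CHATNOIR_URL, query, TOKEN)
via_body = build_search_request(CHATNOIR_URL, query, TOKEN, token_in_body=True)
```

Either request object could then be passed to urllib.request.urlopen.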

ChatNoir

ChatNoir is a search engine for the ClueWeb09 collection. ChatNoir follows a different philosophy than Indri and should be faster for most queries. Implementation details can be found here. Also take a look at the API documentation and the ranking model.

Request Syntax Elements

Element           Required  Default  Definition
token             Yes       -        PAN access token. See authorization.
suspicious-docid  Yes       -        The suspicious document.
query-string      Yes       -        Query string.
max-results       No        100      How many results to retrieve. max-results is only required once in a batch query. Up to 1000 results are supported.
queries           Yes       -        List of chatnoir-query items for a batch query. Up to 50 queries are allowed.
a                 No        1        BM25F factor. The factor is multiplied with the BM25F weight for a result.
b                 No        0.1      Proximity factor. The factor is multiplied with the proximity score of a result.
c                 No        1        Pagerank factor. The factor is multiplied with the pagerank of a document.
readlevel         No        -        Filter by readability level. Valid levels: BASIC, INTERMEDIATE, EXPERT (only results with the specified level will be returned).
textlength        No        -        Minimum length (in words) of a document to be contained in the results.
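As an illustration of the limits listed above, the following Python 3 sketch assembles a ChatNoir batch query and rejects values outside the documented ranges. The function name build_chatnoir_batch is hypothetical. Placing readlevel inside each query item follows the example request on this page; placing textlength there as well is an assumption.

```python
VALID_READLEVELS = {"BASIC", "INTERMEDIATE", "EXPERT"}
MAX_QUERIES = 50    # queries per batch, as stated in the table
MAX_RESULTS = 1000  # upper bound for max-results


def build_chatnoir_batch(suspicious_docid, query_strings, max_results=100,
                         readlevel=None, textlength=None):
    """Return a dict that can be serialized as a ChatNoir batch request body."""
    if not 1 <= len(query_strings) <= MAX_QUERIES:
        raise ValueError("a batch query takes between 1 and 50 queries")
    if not 1 <= max_results <= MAX_RESULTS:
        raise ValueError("max-results must be between 1 and 1000")
    if readlevel is not None and readlevel not in VALID_READLEVELS:
        raise ValueError("readlevel must be BASIC, INTERMEDIATE or EXPERT")

    queries = []
    for query_string in query_strings:
        item = {"query-string": query_string}
        if readlevel is not None:
            item["readlevel"] = readlevel
        if textlength is not None:
            # Assumption: textlength is a per-query field like readlevel.
            item["textlength"] = textlength
        queries.append(item)

    # max-results is given once for the whole batch.
    return {
        "suspicious-docid": suspicious_docid,
        "max-results": max_results,
        "queries": queries,
    }


batch = build_chatnoir_batch(123, ["franz liszt"], max_results=2,
                             readlevel="EXPERT")
```

The token is left out of the body here on the assumption that it is sent in the Authorization header instead.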

ChatNoir Response Syntax

A query "franz liszt" with the reading level filter set to EXPERT:

{
  "token":"7eb96d7390b5f76d6fc4ffb175eaedac",
  "suspicious-docid": 123,
  "max-results": 2,
  "queries": [
    {
      "readlevel": "EXPERT",
      "query-string": "franz liszt"
    }
  ]
}

...returns these JSON results:

{
  "chatnoir-batch-results": [
    {
      "results": 2,
      "result-data": [
        {
          "syllables": 679,
          "characters": 2266,
          "words": 429,
          "sentences": 25,
          "readability": 9.778857,
          "rank": 1,
          "longid": 100006720158,
          "weight": 6.1409025,
          "proximity": 0,
          "url": "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/100006720158",
          "title": "Franz Liszt Academy of Music - Wikipedia, the free encyclopedia",
          "pagerank": 0.15,
          "bm25f": 6.1409025
        },
        {
          "syllables": 582,
          "characters": 2018,
          "words": 350,
          "sentences": 27,
          "readability": 9.08727,
          "rank": 2,
          "longid": 100010600053,
          "weight": 6.114362,
          "proximity": 0,
          "url": "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/100010600053",
          "title": "International Franz Liszt Piano Competition - Wikipedia, the free encyclopedia",
          "pagerank": 0.15,
          "bm25f": 6.114362
        }
      ],
      "chatnoir-query": {
        "query-string": "franz liszt",
        "max-results": 2,
        "readlevel": "EXPERT",
        "id": 0
      },
      "id": 0,
      "timestamp": 1359035294037,
      "runtime": 93
    }
  ],
  "runtime": 93
}

Response Syntax Elements

Element Definition
chatnoir-query Copy of the request extended by an id field. Useful for mapping a request to a response.
result-data JSON array filled with result elements.
runtime Query runtime in milliseconds.
timestamp POSIX time timestamp.
result Single search result.

ChatNoir Single Result Syntax Elements

Element Definition
docid ClueWeb09 document id.
longid Integer representation of the document id.
rank Ranking position.
url URL to access the document with the ClueWeb09 document access API.
title Contents of <title> tag from the document.
weight Assigned weight for the result from the ChatNoir search engine.
pagerank Pagerank score.
spamrank Spamrank score.
bm25f BM25F factor.
proximity Proximity factor.
readability Readability score.
characters Character count.
syllables Syllable count.
sentences Sentence count.
words Word count.
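A short Python 3 sketch of how these elements might be read from a response. The sample is an abbreviated copy of the response shown above, and the function name hits_by_query is hypothetical. It uses the id field to map each batch entry back to its query.

```python
import json

# Abbreviated copy of the sample response on this page (one query, two hits).
SAMPLE = """
{
  "chatnoir-batch-results": [
    {
      "results": 2,
      "result-data": [
        {"rank": 1, "longid": 100006720158, "weight": 6.1409025,
         "title": "Franz Liszt Academy of Music - Wikipedia, the free encyclopedia"},
        {"rank": 2, "longid": 100010600053, "weight": 6.114362,
         "title": "International Franz Liszt Piano Competition - Wikipedia, the free encyclopedia"}
      ],
      "chatnoir-query": {"query-string": "franz liszt", "max-results": 2,
                         "readlevel": "EXPERT", "id": 0},
      "id": 0,
      "timestamp": 1359035294037,
      "runtime": 93
    }
  ],
  "runtime": 93
}
"""


def hits_by_query(response_text):
    """Return {query id: (query string, [(rank, longid, title), ...])}."""
    response = json.loads(response_text)
    mapped = {}
    for batch in response["chatnoir-batch-results"]:
        # The echoed request carries the same id as the batch entry.
        query = batch["chatnoir-query"]
        hits = sorted(
            (hit["rank"], hit["longid"], hit["title"])
            for hit in batch["result-data"]
        )
        mapped[batch["id"]] = (query["query-string"], hits)
    return mapped


for query_id, (query_string, hits) in hits_by_query(SAMPLE).items():
    print(query_id, query_string)
    for rank, longid, title in hits:
        print("  %d. %s (%d)" % (rank, title, longid))
```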

Usage

How to retrieve results from ChatNoir for a given query?

A POST request creates a new search. The request body is a JSON object. A successful response has the HTTP status 200 OK and a response body with JSON-formatted output. If the response status code is 500 Internal Server Error or 400 Bad Request, an error occurred. In this case the error message is contained in the response body.
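The status handling described above could be wrapped as in the following Python 3 sketch. The names PanApiError, send, fake_ok and fake_bad_request are hypothetical, and the two stand-in openers exist only so that the behavior can be tried without network access; the error text they return is made up for illustration.

```python
import io
import json
import urllib.error
import urllib.request


class PanApiError(Exception):
    """Hypothetical exception carrying the HTTP status and error message."""

    def __init__(self, status, message):
        super().__init__("HTTP %d: %s" % (status, message))
        self.status = status
        self.message = message


def send(request, opener=urllib.request.urlopen):
    """Send a prepared request and return the decoded JSON response.

    urlopen raises HTTPError for 400 and 500 responses; the error
    message is then read from the response body.
    """
    try:
        with opener(request) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        message = err.read().decode("utf-8", errors="replace")
        raise PanApiError(err.code, message) from err


# Stand-in openers for an offline demonstration (no network access).
def fake_ok(request):
    return io.BytesIO(b'{"runtime": 93}')


def fake_bad_request(request):
    raise urllib.error.HTTPError(request.full_url, 400, "Bad Request",
                                 None, io.BytesIO(b"malformed query"))
```

With the default opener, send(request) performs the real HTTP call; passing fake_ok or fake_bad_request instead exercises both branches locally.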

Retrieve Results

Here is an example using curl to retrieve search results in JSON:

    $ curl -XPOST \
     -H'Content-Type: application/json' \
     -H'Authorization: 7eb96d7390b5f76d6fc4ffb175eaedac' \
     -d'{
       "max-results": 5,
       "suspicious-docid":123,
       "queries": [
         {
           "query-string": "franz liszt"
         },
         {
           "query-string": "pumpkin pie"
         }
      ]}' http://webis15.medien.uni-weimar.de/proxy/chatnoir/batchquery.json

Another example using Python and urllib2 for retrieving results:

#!/usr/bin/env python

import urllib2, urllib, sys

url = 'http://webis15.medien.uni-weimar.de/proxy/chatnoir/batchquery.json'

query = """
{
       "max-results": 5,
       "suspicious-docid":123,
       "queries": [
         {
           "query-string": "franz liszt"
         },
         {
           "query-string": "pumpkin pie"
         }
      ]
}
"""
    
request = urllib2.Request(url, query )
request.add_header("Content-Type","application/json")
request.add_header("Accept","application/json")
request.add_header("Authorization","7eb96d7390b5f76d6fc4ffb175eaedac")
request.get_method = lambda: 'POST'

try:
   response = urllib2.urlopen(request)
   print response.read()
except urllib2.HTTPError as e:
   error_message = e.read()
   print >> sys.stderr, error_message

The request body does not need to be URL-encoded. The results currently contain no whitespace. If you need pretty-printed results, you can format the output with tools like jq or python -m json.tool.
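If the response is already handled in Python, the same formatting can be done in-process with the standard json module. A minimal Python 3 sketch; the compact string below is a made-up stand-in for a real response body, and the function name pretty is hypothetical.

```python
import json


def pretty(response_text, indent=2):
    """Re-serialize compact JSON text with line breaks and indentation."""
    return json.dumps(json.loads(response_text), indent=indent)


# Stand-in for a compact response body as returned by the API.
compact = '{"chatnoir-batch-results":[],"runtime":93}'
print(pretty(compact))
```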

Indri

Indri is a search engine from the Lemur project, used here to search the ClueWeb09 collection. Indri has its own query language. Detailed documentation and a quick reference are available.

The PAN API proxy submits all incoming search requests to the Indri batch service, retrieves the results, and returns them as JSON. You can use all Indri Query Language features because the query string is sent to Indri unmodified. Additional output from Indri is captured and sent back along with the search results in the indri-output field. This includes Indri error messages, which help you recognize that Indri did not accept a query. The query number in the Indri output is equal to the id field in the PAN API's response.

Indri Request Syntax

An example request body for a batch query represented in JSON:

{
  "token":"7eb96d7390b5f76d6fc4ffb175eaedac",
  "suspicious-docid":123,
  "max-results": 5,
  "queries": [
    {
      "query-string": "#combine(first query)"
    },
    {
      "query-string": "#combine(second query)"
    }
  ]
}

Request Syntax Elements

Element           Required  Default  Definition
token             Yes       -        PAN access token. See authorization.
suspicious-docid  Yes       -        The suspicious document.
max-results       No        100      How many results to retrieve. max-results is only required once in a batch query. Up to 1000 results can be retrieved.
queries           Yes       -        List of indri-query items for a batch query. A batch query can contain up to 50 queries.
query-string      Yes       -        The Indri query string. Quick Reference.
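A minimal Python 3 sketch of assembling such a request body. The helper names combine and build_indri_batch are hypothetical; #combine is the Indri operator used in the examples on this page, and the limits are the ones stated in the table.

```python
import json

MAX_QUERIES = 50    # queries per batch, as stated in the table
MAX_RESULTS = 1000  # upper bound for max-results


def combine(terms):
    """Wrap whitespace-separated terms in Indri's #combine operator."""
    return "#combine(%s)" % " ".join(terms.split())


def build_indri_batch(suspicious_docid, query_strings, max_results=100):
    """Return a dict that can be serialized as an Indri batch request body."""
    if not 1 <= len(query_strings) <= MAX_QUERIES:
        raise ValueError("a batch query takes between 1 and 50 queries")
    if not 1 <= max_results <= MAX_RESULTS:
        raise ValueError("max-results must be between 1 and 1000")
    return {
        "suspicious-docid": suspicious_docid,
        "max-results": max_results,
        # Query strings are passed through as given.
        "queries": [{"query-string": q} for q in query_strings],
    }


batch = build_indri_batch(
    123, [combine("first query"), combine("second query")], max_results=5
)
print(json.dumps(batch, indent=2))
```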

Indri Response Syntax

An example for an Indri response body in JSON for a batch query:

{
  "indri-output": "Processing 2 queries. This may take a few minutes.
  process running, please wait  Processing is complete!
   Results: ClueWeb09 Category A - query file:     ",
  "indri-batch-results": [
    {
      "results": 2,
      "result-data": [
        {
          "docid": "clueweb09-en0008-27-34978",
          "longid": 82734978,
          "url": "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/82734978",
          "weight": -14.8703,
          "rank": 1
        },
        {
          "docid": "clueweb09-en0004-62-07486",
          "longid": 46207486,
          "url": "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/46207486",
          "weight": -14.8735,
          "rank": 2
        }
      ],
      "indri-query": {
        "query-string": "#combine(first query)",
        "max-results": 2,
        "id": 0
      },
      "id": 0,
      "timestamp": 1358983472708
    },
    {
      "results": 2,
      "result-data": [
        {
          "docid": "clueweb09-en0047-02-05830",
          "longid": 470205830,
          "url": "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/470205830",
          "weight": -3.32924,
          "rank": 1
        },
        {
          "docid": "clueweb09-en0075-82-14614",
          "longid": 758214614,
          "url": "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/758214614",
          "weight": -3.69035,
          "rank": 2
        }
      ],
      "indri-query": {
        "query-string": "#combine(second query)",
        "max-results": 2,
        "id": 1
      },
      "id": 1,
      "timestamp": 1358983472708
    }
  ],
  "runtime": 17645
}

Response Syntax Elements

Element Definition
indri-output Captured text of all non-result related Indri output. Useful for debugging failed queries.
indri-query Copy of the request extended by an id field. Useful for mapping a request to a response.
indri-batch-results JSON array filled with batch query results.
result-data JSON array filled with results for a single query.
runtime Query runtime in milliseconds.
timestamp POSIX time timestamp.

Single Result Item Elements

Element Definition
docid ClueWeb09 document id.
longid Integer representation of the document id.
rank Ranking position.
url URL to access the document with the ClueWeb09 document access API.
weight Assigned weight for the result from the Indri search engine.
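When debugging, it can help to list the queries that came back without results next to the captured Indri output. The Python 3 sketch below does this under the assumption that a rejected query appears in the response with an empty result-data array; the API description on this page does not confirm that. The function name empty_queries is hypothetical, and SAMPLE is a made-up response with a placeholder in indri-output.

```python
# Made-up response for illustration; indri-output is only a placeholder.
SAMPLE = {
    "indri-output": "(captured Indri output)",
    "indri-batch-results": [
        {
            "results": 1,
            "result-data": [
                {"docid": "clueweb09-en0008-27-34978", "longid": 82734978,
                 "weight": -14.8703, "rank": 1}
            ],
            "indri-query": {"query-string": "#combine(first query)", "id": 0},
            "id": 0,
        },
        {
            "results": 0,
            "result-data": [],
            "indri-query": {"query-string": "#combine(second query)", "id": 1},
            "id": 1,
        },
    ],
}


def empty_queries(response):
    """Return (id, query string) for every query without any result."""
    return [
        (batch["id"], batch["indri-query"]["query-string"])
        for batch in response["indri-batch-results"]
        if not batch["result-data"]
    ]


failed = empty_queries(SAMPLE)
if failed:
    # indri-output holds the non-result output captured from Indri.
    print("indri-output:", SAMPLE.get("indri-output", ""))
    for query_id, query_string in failed:
        print("no results for query %d: %s" % (query_id, query_string))
```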

Usage

How to retrieve results from Indri for a given query?

A POST request creates a new search. The request body is a JSON object. A successful response has the HTTP status 200 OK and a response body with JSON-formatted output. Errors are indicated by a 500 Internal Server Error or 400 Bad Request status code. In this case the error message is sent as the response body.

Retrieve Results

Here is an example using curl to retrieve search results in JSON:

    $ curl -XPOST \
     -H'Content-Type: application/json' \
     -H'Authorization: 7eb96d7390b5f76d6fc4ffb175eaedac' \
     -d'{
       "max-results": 5,
       "suspicious-docid": 123,
       "queries": [
         {
           "query-string": "#combine(first query)"
         },
         {
           "query-string": "#combine(second query)"
         }
      ]}' http://webis15.medien.uni-weimar.de/proxy/indri/batchquery.json

Another example using Python and urllib2 for retrieving results:

#!/usr/bin/env python

import urllib2, urllib, sys

url = 'http://webis15.medien.uni-weimar.de/proxy/indri/batchquery.json'

query = """
{
       "max-results": 5,
       "suspicious-docid": 123,
       "queries": [
         {
           "query-string": "#combine(first query)"
         },
         {
           "query-string": "#combine(second query)"
         }
      ]
}
"""
    
request = urllib2.Request(url, query )
request.add_header("Content-Type","application/json")
request.add_header("Authorization","7eb96d7390b5f76d6fc4ffb175eaedac")
request.get_method = lambda: 'POST'

try:
   response = urllib2.urlopen(request)
   print response.read()
except urllib2.HTTPError as e:
   error_message = e.read()
   print >> sys.stderr, error_message

The request body does not need to be URL-encoded. The results currently contain no whitespace. If you need pretty-printed results, you can format the output with tools like jq or python -m json.tool.

ClueWeb Document Access

ClueWeb09 documents are available through the resource /proxy/clueweb/id and can be retrieved with an authorized GET request by appending the longid of a document to the URL.

The response body is a JSON object that contains the document in plain text.

Example

Retrieve a document in plain text with the longid 100016506815 for the suspicious document 123 by using curl:

    $ curl -XGET \
     -H'Authorization: 7eb96d7390b5f76d6fc4ffb175eaedac' \
     -H'suspicious-docid: 123' \
     http://webis15.medien.uni-weimar.de/proxy/clueweb/id/100016506815

Or by using Python:

#!/usr/bin/env python

import urllib2, urllib, sys

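clueweb_url = 'http://webis15.medien.uni-weimar.de/proxy/clueweb/id'

# longid of the document to retrieve
longid = 100016506815

# The longid is appended to the resource URL as a path segment.
request = urllib2.Request(clueweb_url + '/' + str(longid))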
request.add_header("Accept","application/json")
request.add_header("Authorization","7eb96d7390b5f76d6fc4ffb175eaedac")
request.add_header("suspicious-docid", "123")
request.get_method = lambda: 'GET'

try:
   response = urllib2.urlopen(request)
   print response.read()
except urllib2.HTTPError as e:
   error_message = e.read()
   print >> sys.stderr, error_message

The JSON formatted response:

{
  "suspicious-docid": 123,
  "html": "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3...",
  "text": "Liszt School of Music Weimar - Wikipedia, the free ...",
  "token": "7eb96d7390b5f76d6fc4ffb175eaedac",
  "oracle" : "no source",
  "similarity" : "0.0",
  "plagiarized-text" : null
}
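The same request can be sketched in Python 3 with the standard library. The helper names build_document_request and document_text are hypothetical; the URL, headers, and field names are the ones shown in the examples above, and the sample body is an abbreviated copy of the response. The request is only built here, not sent.

```python
import json
import urllib.request

CLUEWEB_URL = "http://webis15.medien.uni-weimar.de/proxy/clueweb/id"


def build_document_request(longid, suspicious_docid, token):
    """Build (but do not send) a GET request for one ClueWeb09 document."""
    url = "%s/%d" % (CLUEWEB_URL, longid)  # longid is a path segment
    headers = {
        "Accept": "application/json",
        "Authorization": token,
        "suspicious-docid": str(suspicious_docid),
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def document_text(response_text):
    """Return the plain-text version of the document from a response body."""
    return json.loads(response_text)["text"]


request = build_document_request(
    100016506815, 123, "7eb96d7390b5f76d6fc4ffb175eaedac"
)

# Abbreviated copy of the sample response shown above.
sample_body = '{"suspicious-docid": 123, "text": "Liszt School of Music Weimar - Wikipedia, the free ..."}'
print(document_text(sample_body))
```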

Snippets

Snippets can be obtained for every ClueWeb09 document if at least one query term is given. Just use the regular ChatNoir Snippet Functionality.

The URL for a snippet request is a little bit different:

http://webis15.medien.uni-weimar.de/chatnoir/snippet

The following parameters are supported:

Parameter  Value   Default  Required  Description
id         number  -        Yes       The LongID of the WARC record.
query      string  -        Yes       The snippet is generated around the terms of the query.
length     number  500      No        Length of the snippet in characters (up to 500).

Neither an Authorization header nor a suspicious-docid is required.

Example

To request a snippet for the document 100016506815 and the query terms "franz liszt" only a GET request is required:

$ curl "http://webis15.medien.uni-weimar.de/chatnoir/snippet?id=100016506815&query=franz+liszt"

The response is JSON-formatted; query terms are currently highlighted using <strong> tags:

{
  "snippet": "<strong>Liszt</strong> School of Music Weimar\nFrom Wikipedia, the free encyclopedia\nThe
 <strong>Liszt</strong> School of Music Weimar (in German : Hochschule für Musik <strong>Franz</strong>
 <strong>Liszt</strong> Weimar) is a famous academy of music in Weimar , Germany .\nContents\n[ edit ]
 The Hochschule <strong>Franz</strong> <strong>Liszt</strong> , who spent a great deal of his life in
 Weimar, encouraged the founding of a school in 1835 for the education of musicians in orchestral instr",
  "length": 500,
  "query": "franz liszt",
  "id": "100016506815"
}
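A small Python 3 sketch for building the snippet URL and removing the highlighting again. The helper names snippet_url and strip_highlighting are hypothetical; the parameters and the 500-character limit come from the table above.

```python
import re
import urllib.parse

SNIPPET_URL = "http://webis15.medien.uni-weimar.de/chatnoir/snippet"


def snippet_url(longid, query, length=None):
    """Return the GET URL for a snippet request."""
    params = {"id": longid, "query": query}
    if length is not None:
        if not 1 <= length <= 500:
            raise ValueError("length must be between 1 and 500 characters")
        params["length"] = length
    # urlencode escapes the query terms (spaces become '+').
    return SNIPPET_URL + "?" + urllib.parse.urlencode(params)


def strip_highlighting(snippet):
    """Remove the <strong> tags that mark the query terms."""
    return re.sub(r"</?strong>", "", snippet)


print(snippet_url(100016506815, "franz liszt"))
print(strip_highlighting("<strong>Liszt</strong> School of Music Weimar"))
```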