home
about
services
contact

Where am I?

Deduplicating Search Results

June 08, 2014 at 12:34 PM | categories: XQuery, MarkLogic | View Comments

So you're writing a MarkLogic application, and this question comes up: How do we deduplicate search results?

In one sense MarkLogic will never return duplicates from a database lookup. A single XPath, cts:search, or search:search expression will always return unique nodes, as defined by the XPath is operator.

But your application might have its own, content-dependent definition of a duplicate. This might depend on just a subset of the XML content. For example you might be storing news articles pulled from different media sources: newspapers, magazines, blogs, etc. Often the same news story will appear in different sources, and sometimes the text will be identical or extremely close. When a user searches for the hot story of the day you want to have all the variations available, but the search results should roll them up together on the page. You can see something like this if you search Google News.

One good strategy is to avoid duplicates entirely, by ensuring that your documents have meaningful URIs. Construct the URI using the same information that determines whether or not a document is a duplicate. This way if content arrives that duplicates existing content, it turns out to have the same URI. Then you are free to update the database with the latest copy, ignore it, throw an error, or call for help. If every news story has a dateline and an author byline, we could construct document URIs based on the date, location, and byline: something like /news/2014/05/30/washington/jones.

But maybe that isn't a very good solution for our news application. Remember that we want to search for articles, but we only want one article per story. So we have to store all the duplicate articles, and we need query-time flexibility to display just one article per story.

Clearly we will need to generate a story-id for each story, one that remains the same no matter how different articles are presented. That might use a mechanism similar to the URI computation above, except that we would put the result in an element and it would not be unique. We could use the same facts we were going to use in the document URI:

<story-id>2014-05-30|washington|jones</story-id>

Once we have our application set up to generate story-id elements, we could try a brute-force approach. Search the database, then walk through the search results. Extract each story-id value and check it against a list of previously-seen story-id values. We could use a map for that. If the story-id has already been seen, ignore it. Otherwise put the story-id in the map and return the article.

(
  let $search-results := search:search(...)
  let $seen := map:map()
  for $article in $search-results
  let $story-id as xs:string := $article/story-id
  where not(map:contains($seen, $story-id))
  return (
    map:put($seen, $story-id, $story-id),
    $article))[$start to $stop]

But there are problems with this approach. Pagination is tricky because we don't know how many duplicates there will be. So we have to ask the database for a lot of results, maybe all of them at once, and then filter and paginate in user code. This gets more and more expensive as the result size increases, and trickier to manage as the user paginates through the results. If a search matches a million articles, we might have to retrieve and check all the matches before we can display any results. That's going to be slow, and probably limited by I/O speeds. Nowadays we could throw SSD at it, but even SSD has limits.

Another problem with the brute-force approach is that facets generated by the database will not match the deduplicated results. You might have a facet on author that shows 1000, but deduplicated filters out all but 100 of the articles.

So let's look at another approach. Instead of deduplicating after we search, let's deduplicate before we search. That might sound crazy, but we have a couple of powerful tools that make it possible: cts:value-co-occurrences and cts:document-query. The idea is to deduplicate based on the co-occurrence of story-id and document URI, without retrieving any documents. Then we query the database again, this time fetching only the non-duplicate documents that we want to return.

Each article is stored as a document with a unique document URI. We enable the document URI lexicon and we also create an element-range index on the element named story-id. As described above, we construct a story-id for every article as it arrives and add it to the XML. This story-id is our deduplication key: it uniquely identifies a story, and if multiple articles might have the same story-id value then they are treated as duplicates.

A deduplication key is application-specific, and might be anything. An application might even have multiple deduplication keys for different query types. However it's essential to have a deduplication key for every document that you want to query, even if only some documents will have duplicates. The technique we're going to use will only return documents that have a deduplication key. An article with no story-id simply won't show up in the co-occurrence results, so it won't show up in search results either.

Here's some code to illustrate the idea. Start with $query-original, which is the original user query as a cts:query item. We might generate that using search:parse or perhaps the xqysp library.

(: For each unique story-id there may be multiple article URIs.
 : This implementation always uses the first one.
 :)
let $query-dedup := cts:document-query(
  let $m := cts:values-co-occurrences(
    cts:element-reference(
      xs:QName('story-id'),
      'collation=http://marklogic.com/collation/codepoint'),
    cts:uri-reference(),
    'map')
  for $key in map:keys($m)
  return map:get($m, $key)[1])
(: The document-query alone would match the right articles,
 : but there would be no relevance ranking.
 : Using both queries eliminates duplicates and preserves ranking.
 :)
let $query-full := cts:and-query(($query-original, $query-dedup))
...

Now we can use $query-full with any API that uses cts:query items, such as cts:search. In order to match, an article will have to match $query-original and it will have to have one of the URIs that we selected from the co-occurrence map.

Instead of calling cts:search directly, we might want to use search:resolve. That function expects a cts:query XML element, not a cts:query item. So we need a little extra code to turn the cts:query item into an XML document and then extract its root element:

...
return search:resolve(
  document { $query-full }/*,
  $search-options,
  $pagination-start,
  $pagination-size)

Many search applications also provide facets. You can ask search:resolve for facets by providing the right search options, or you can call cts:values yourself. Note that since facets are not relevance-ranked, it might be a little faster to use $query-dedup instead of $query-full.

Speaking of performance, how fast is this? In my testing it added an O(n) component, linear with the number of keys in the cts:values-co-occurrences map. With a small map the overhead is low, and deduplicating 10,000 items only adds a few tens of milliseconds. But with hundreds of thousands of map items the profiler shows more and more time spent in the XQuery FLWOR expression that extracts the first document URI from each map item.

  let $m := cts:values-co-occurrences(
    cts:element-reference(
      xs:QName('story-id'),
      'collation=http://marklogic.com/collation/codepoint'),
    cts:uri-reference(),
    'map')
  for $key in map:keys($m)
  return map:get($m, $key)[1])

We can speed that up a little bit by trading the FLWOR for function mapping.

declare function local:get-first(
  $m as map:map,
  $key as xs:string)
as xs:string
{
  map:get($m, $key)[1]
};

let $m := cts:values-co-occurrences(
  cts:element-reference(
    xs:QName('story-id'),
    'collation=http://marklogic.com/collation/codepoint'),
  cts:uri-reference(),
  'map')
return local:get-first($m, map:keys($m))

However this is a minor optimization, and with large maps it will still be expensive to extract the non-duplicate URIs. It's both faster and more robust than the brute-force approach, but not as fast as native search.

Pragmatically, I would try to handle these performance characteristics in the application. Turn deduplication off by default, and only enable it as an option when a search returns fewer than 100,000 results. This would control the performance impact of the feature, providing its benefits without compromising overall performance.

It's also tempting to think about product enhancements. We could avoid some of this work if we could find a way to retrieve only the part of the map needed for the current search page, but this is not feasible with the current implementation of cts:values-co-occurrences. That function would have to return the co-occurrence map sorted by the score of each story-id. That's tricky because normally scores are calculated for documents, in this case articles.

One way to speed this up without changing MarkLogic Server could be to move some of the work into the forests. MarkLogic Server supports User-Defined Functions, which are C++ functions that run directly on range indexes. I haven't tried this approach myself, but in theory you could write a UDF that would deduplicate based on the story-id and URI co-occurrence. Then you could call this function with cts:aggregate. This would work best if you could partition your forests using the story-id value, so that any the duplicate values articles are guaranteed to be in the same forest. Used carefully this approach could be much faster, possibly allowing fast deduplication with millions of URIs.

For more on that idea, see the documentation for Tiered Storage and the UDF plugin tutorial. If you try it, please let me know how it works out.

blog comments powered by Disqus