Deduplicating Search Results

June 08, 2014 at 12:34 PM | categories: XQuery, MarkLogic

So you're writing a MarkLogic application, and this question comes up: How do we deduplicate search results?

In one sense MarkLogic will never return duplicates from a database lookup. A single XPath, cts:search, or search:search expression will always return unique nodes, as defined by the XPath is operator.

But your application might have its own, content-dependent definition of a duplicate. This might depend on just a subset of the XML content. For example you might be storing news articles pulled from different media sources: newspapers, magazines, blogs, etc. Often the same news story will appear in different sources, and sometimes the text will be identical or extremely close. When a user searches for the hot story of the day you want to have all the variations available, but the search results should roll them up together on the page. You can see something like this if you search Google News.

One good strategy is to avoid duplicates entirely, by ensuring that your documents have meaningful URIs. Construct the URI using the same information that determines whether or not a document is a duplicate. This way if content arrives that duplicates existing content, it turns out to have the same URI. Then you are free to update the database with the latest copy, ignore it, throw an error, or call for help. If every news story has a dateline and an author byline, we could construct document URIs based on the date, location, and byline: something like /news/2014/05/30/washington/jones.
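
As a sketch, the ingest step might build that URI along these lines. The dateline and byline element names are assumptions about the article markup; substitute whatever your articles actually carry.

(: Hypothetical ingest step: $article is the incoming news article element. :)
let $date as xs:string := $article/dateline/date/string()
let $location as xs:string := lower-case($article/dateline/location)
let $byline as xs:string := lower-case($article/byline/author)
let $uri := concat(
  '/news/', translate($date, '-', '/'), '/', $location, '/', $byline)
return xdmp:document-insert($uri, $article)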

But maybe that isn't a very good solution for our news application. Remember that we want to search for articles, but we only want one article per story. So we have to store all the duplicate articles, and we need query-time flexibility to display just one article per story.

Clearly we will need to generate a story-id for each story, one that remains the same no matter how different articles are presented. That might use a mechanism similar to the URI computation above, except that we would put the result in an element and it would not be unique. We could use the same facts we were going to use in the document URI:

<story-id>2014-05-30|washington|jones</story-id>
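
Generating that element at ingest time might look something like this sketch, reusing the same assumed fields and the $uri computed above.

(: Hypothetical sketch: attach a story-id element to the incoming article. :)
let $story-id := element story-id {
  string-join(
    ($article/dateline/date,
     lower-case($article/dateline/location),
     lower-case($article/byline/author)),
    '|') }
return xdmp:document-insert(
  $uri,
  element { node-name($article) } {
    $article/@*, $story-id, $article/node() })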

Once we have our application set up to generate story-id elements, we could try a brute-force approach. Search the database, then walk through the search results. Extract each story-id value and check it against a list of previously-seen story-id values. We could use a map for that. If the story-id has already been seen, ignore it. Otherwise put the story-id in the map and return the article.

(: Brute force: filter and paginate in XQuery, after the search runs.
 : Assumes the usual search library import. :)
(
  let $response := search:search(...)
  let $seen := map:map()
  for $result in $response/search:result
  let $article := doc($result/@uri)
  let $story-id as xs:string := $article//story-id/string()
  where not(map:contains($seen, $story-id))
  return (
    map:put($seen, $story-id, $story-id),
    $article))[$start to $stop]

But there are problems with this approach. Pagination is tricky because we don't know how many duplicates there will be. So we have to ask the database for a lot of results, maybe all of them at once, and then filter and paginate in user code. This gets more and more expensive as the result size increases, and trickier to manage as the user paginates through the results. If a search matches a million articles, we might have to retrieve and check all the matches before we can display any results. That's going to be slow, and probably limited by I/O speeds. Nowadays we could throw SSD at it, but even SSD has limits.

Another problem with the brute-force approach is that facets generated by the database will not match the deduplicated results. You might have a facet on author that shows 1000 articles, but deduplication filters out all but 100 of them.

So let's look at another approach. Instead of deduplicating after we search, let's deduplicate before we search. That might sound crazy, but we have a couple of powerful tools that make it possible: cts:value-co-occurrences and cts:document-query. The idea is to deduplicate based on the co-occurrence of story-id and document URI, without retrieving any documents. Then we query the database again, this time fetching only the non-duplicate documents that we want to return.

Each article is stored as a document with a unique document URI. We enable the document URI lexicon and we also create an element-range index on the element named story-id. As described above, we construct a story-id for every article as it arrives and add it to the XML. This story-id is our deduplication key: it uniquely identifies a story, and any articles that share the same story-id value are treated as duplicates.
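
The Admin UI handles that setup fine; if you prefer to script it, something like the following Admin API sketch should work. The database name news is an assumption.

import module namespace admin="http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Enable the URI lexicon and add a string range index on story-id. :)
let $config := admin:get-configuration()
let $db := xdmp:database('news')
let $config := admin:database-set-uri-lexicon($config, $db, fn:true())
let $config := admin:database-add-range-element-index(
  $config, $db,
  admin:database-range-element-index(
    'string', '', 'story-id',
    'http://marklogic.com/collation/codepoint', fn:false()))
return admin:save-configuration($config)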

A deduplication key is application-specific, and might be anything. An application might even have multiple deduplication keys for different query types. However it's essential to have a deduplication key for every document that you want to query, even if only some documents will have duplicates. The technique we're going to use will only return documents that have a deduplication key. An article with no story-id simply won't show up in the co-occurrence results, so it won't show up in search results either.

Here's some code to illustrate the idea. Start with $query-original, which is the original user query as a cts:query item. We might generate that using search:parse or perhaps the xqysp library.

(: For each unique story-id there may be multiple article URIs.
 : This implementation always uses the first one.
 :)
let $query-dedup := cts:document-query(
  let $m := cts:value-co-occurrences(
    cts:element-reference(
      xs:QName('story-id'),
      'collation=http://marklogic.com/collation/codepoint'),
    cts:uri-reference(),
    'map')
  for $key in map:keys($m)
  return map:get($m, $key)[1])
(: The document-query alone would match the right articles,
 : but there would be no relevance ranking.
 : Using both queries eliminates duplicates and preserves ranking.
 :)
let $query-full := cts:and-query(($query-original, $query-dedup))
...

Now we can use $query-full with any API that uses cts:query items, such as cts:search. In order to match, an article will have to match $query-original and it will have to have one of the URIs that we selected from the co-occurrence map.
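
For example, a relevance-ranked, deduplicated result page might be as simple as this.

(: Deduplicated search results, paginated with a simple predicate. :)
cts:search(collection(), $query-full)[$start to $stop]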

Instead of calling cts:search directly, we might want to use search:resolve. That function expects a cts:query XML element, not a cts:query item. So we need a little extra code to turn the cts:query item into an XML document and then extract its root element:

...
return search:resolve(
  document { $query-full }/*,
  $search-options,
  $pagination-start,
  $pagination-size)

Many search applications also provide facets. You can ask search:resolve for facets by providing the right search options, or you can call cts:values yourself. Facets are not relevance-ranked, but the counts should still be computed against $query-full rather than $query-dedup alone, so that they match the deduplicated search results.
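
As a sketch, an author facet over the deduplicated results might look like this, assuming an element-range index on author.

(: 'frequency-order' sorts the values by count; cts:frequency reads each count. :)
for $author in cts:values(
  cts:element-reference(xs:QName('author')),
  (), 'frequency-order', $query-full)
return concat($author, ' (', cts:frequency($author), ')')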

Speaking of performance, how fast is this? In my testing it added an O(n) component, linear with the number of keys in the cts:value-co-occurrences map. With a small map the overhead is low, and deduplicating 10,000 items only adds a few tens of milliseconds. But with hundreds of thousands of map items the profiler shows more and more time spent in the XQuery FLWOR expression that extracts the first document URI from each map item.

  let $m := cts:value-co-occurrences(
    cts:element-reference(
      xs:QName('story-id'),
      'collation=http://marklogic.com/collation/codepoint'),
    cts:uri-reference(),
    'map')
  for $key in map:keys($m)
  return map:get($m, $key)[1]

We can speed that up a little bit by trading the FLWOR for function mapping.

declare function local:get-first(
  $m as map:map,
  $key as xs:string)
as xs:string
{
  map:get($m, $key)[1]
};

let $m := cts:value-co-occurrences(
  cts:element-reference(
    xs:QName('story-id'),
    'collation=http://marklogic.com/collation/codepoint'),
  cts:uri-reference(),
  'map')
return local:get-first($m, map:keys($m))

However this is a minor optimization, and with large maps it will still be expensive to extract the non-duplicate URIs. It's both faster and more robust than the brute-force approach, but not as fast as native search.

Pragmatically, I would try to handle these performance characteristics in the application. Turn deduplication off by default, and only enable it as an option when a search returns fewer than 100,000 results. This would control the performance impact of the feature, providing its benefits without compromising overall performance.
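
One way to implement that guard is with xdmp:estimate, which resolves the match count from the indexes without fetching any documents. Here local:dedup-query is hypothetical shorthand for the $query-dedup construction shown earlier.

(: Sketch: skip deduplication when the candidate result set is too large. :)
let $estimate := xdmp:estimate(cts:search(collection(), $query-original))
let $query :=
  if ($estimate ge 100000) then $query-original
  else cts:and-query(($query-original, local:dedup-query()))
return cts:search(collection(), $query)[$start to $stop]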

It's also tempting to think about product enhancements. We could avoid some of this work if we could find a way to retrieve only the part of the map needed for the current search page, but this is not feasible with the current implementation of cts:value-co-occurrences. That function would have to return the co-occurrence map sorted by the score of each story-id. That's tricky because normally scores are calculated for documents, in this case articles.

One way to speed this up without changing MarkLogic Server could be to move some of the work into the forests. MarkLogic Server supports User-Defined Functions, which are C++ functions that run directly on range indexes. I haven't tried this approach myself, but in theory you could write a UDF that deduplicates based on the story-id and URI co-occurrence, and then call it with cts:aggregate. This would work best if you could partition your forests by story-id value, so that any duplicate articles are guaranteed to land in the same forest. Used carefully, this approach could be much faster, possibly allowing fast deduplication across millions of URIs.

For more on that idea, see the documentation for Tiered Storage and the UDF plugin tutorial. If you try it, please let me know how it works out.

Introduction to Multi-Statement Transactions

June 21, 2013 at 12:34 PM | categories: XQuery, MarkLogic

If you are an old hand with MarkLogic, you are used to writing update queries with implicit commits. Sometimes this means restructuring your code so that everything can happen in one commit, with no conflicting updates. In extreme cases you might even decide to run multiple transactions from one query, using xdmp:invoke or semicolons. Historically this meant giving up atomicity.

Multi-statement transactions, introduced in MarkLogic 6, promise a third way. We can write a transaction that spans multiple statements, with an explicit commit or rollback.

For most updates it's probably best to stick with the old ways and use implicit commits. But let's look at a concrete example of a time when multi-statement transactions are the right tool for the job.

Suppose you are using DLS (Document Library Services) to manage your document versioning. But you have a special case where you want to insert two discrete versions of the same document atomically. That may sound odd, but I ran into that exact problem recently.

First we need to discover that there is a problem. Let's bootstrap a test document with DLS.

import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
try {
  dls:document-delete('test', false(), false()) }
catch ($ex) {
  if ($ex/error:code ne 'DLS-UNMANAGED') then xdmp:rethrow()
  else if (empty(doc('test'))) then ()
  else xdmp:document-delete('test') }
;
import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-insert-and-manage('test', false(), <x id="x1"/>)

Now let's write some XQuery to insert two versions in one update, and see what happens.

import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-checkout-update-checkin(
  'test', <x id="x2"/>, "version two", true()),
dls:document-checkout-update-checkin(
  'test', <x id="x3"/>, "version three", true())

This throws an XDMP-CONFLICTINGUPDATES error, because these calls to DLS end up trying to update the same nodes twice in the same transaction. In implicit commit mode, aka "auto" mode, this is difficult to avoid. We could ask MarkLogic to extend DLS with a new function designed for this situation. But that is a long-term solution, and we need to move on with this implementation.

So what can we do? We might read up on xdmp:invoke, xdmp:eval, etc. If we are careful, we can write a top-level read-only query that invokes one or more update transactions.

(: Entry point - must be a read-only query. :)
xdmp:invoke(
  'update.xqy',
  (xs:QName('URI'), 'test',
   xs:QName('NEW'), <x id="x2"/>,
   xs:QName('NOTE'), "version two")),
xdmp:invoke(
  'update.xqy',
  (xs:QName('URI'), 'test',
   xs:QName('NEW'), <x id="x3"/>,
   xs:QName('NOTE'), "version three"))

This invokes a module called update.xqy, which would look like this:

(: update.xqy :)
import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

declare variable $NEW as node() external ;
declare variable $NOTE as xs:string external ;
declare variable $URI as xs:string external ;

dls:document-checkout-update-checkin(
  $URI, $NEW, $NOTE, true())

This works - at least, it doesn't throw XDMP-CONFLICTINGUPDATES. But we have lost atomicity. Each of the two updates runs as a different transaction. This opens up a potential race condition, where a second query updates the document in between our two transactions. That could break our application.

There are ways around this, but they get complicated quickly. They are also difficult to test, so we can never be confident that we have plugged all the potential holes in our process. It would be much more convenient if we could run multiple statements inside one transaction, with each statement able to see the database state of the previous statements.

We can do exactly that using a multi-statement transaction, or MST for short. Let's get our feet wet by looking at a very simple one.

declare option xdmp:transaction-mode "update";

xdmp:document-insert('temp', <one/>)
;

xdmp:document-insert('temp', <two/>),
xdmp:commit()

There are three important points to this query:

1. The option xdmp:transaction-mode="update" begins a multi-statement transaction.
2. The semicolon after the first xdmp:document-insert ends that statement and begins another.
3. The xdmp:commit ends the multi-statement transaction by committing all updates to the database.

This runs without error, and we can verify that doc('temp') contains <two/> after it runs. But how can we prove that all this takes place in a single transaction? Let's decorate the query with a few more function calls.

declare option xdmp:transaction-mode "update";

xdmp:get-transaction-mode(),
xdmp:transaction(),
doc('temp')/*,
xdmp:document-insert('temp', <one/>)
;

xdmp:get-transaction-mode(),
xdmp:transaction(),
doc('temp')/*,
xdmp:document-insert('temp', <two/>),
xdmp:commit()

This time we return some extra information within each statement: the transaction mode, the transaction id, and the contents of the test doc. The transaction ids will be different every time, but here is one example.

update
17378667561611037626
<two/>
update
17378667561611037626
<one/>

So the document temp started out with the old node <two/>, but after the first statement it changed to <one/>. Both statements see the same transaction mode and id.

Try changing the xdmp:transaction-mode declaration to auto, the default. You should see the mode change to auto, and two different transaction-ids. This tells us that in update mode we have a multi-statement transaction, and in auto mode we have a non-atomic sequence of two different transactions. Before MarkLogic 6, all update statements ran in auto mode.

Now let's apply what we've learned about MST to the original problem: inserting two different versions of a managed document in a single transaction.

import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

declare option xdmp:transaction-mode "update";

dls:document-checkout-update-checkin(
  'test', <x id="x2"/>, "version two", true())
;

import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-checkout-update-checkin(
  'test', <x id="x3"/>, "version three", true()),
xdmp:commit()

As above, this code uses three important features:

1. Set xdmp:transaction-mode="update" to begin the MST.
2. Use semicolons to end one statement and begin another.
3. Use xdmp:commit to end the MST and commit all updates.

To abort a multi-statement transaction, use xdmp:rollback.
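
Here is a minimal sketch of the rollback path, reusing the temp document from the example above.

declare option xdmp:transaction-mode "update";

xdmp:document-insert('temp', <three/>)
;

(: Discard the insert above; doc('temp') keeps whatever it held before. :)
xdmp:rollback()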

So now you have a new tool for situations where implicit commit is a little too awkward. Try not to overdo it, though. In most situations, the default xdmp:transaction-mode="auto" is still the best path.

External Variables (Code Review, Part II)

September 28, 2012 at 12:34 PM | categories: XQuery, MarkLogic

Remember when I talked about XQuery Code Review? The other day I was forwarding that link to a client, and noticed that I forgot to mention external variables. I talked about xdmp:eval and xdmp:value in the section titled Look_for_injection_paths, and mentioned that it's usually better to use xdmp:invoke or xdmp:unpath, which are less vulnerable to injection attacks.

But it can be convenient or even necessary to evaluate dynamic XQuery. That's what xdmp:eval and xdmp:value are there for, after all. I've even written tools like Presta to help you.

Used properly, dynamic queries can be made safe. The trick is to never let user data directly into your dynamic queries. Whenever you see xdmp:eval or xdmp:value in XQuery, ask yourself "Where did this query come from?" If any part of it came from user input, flag it for a rewrite.

(: WRONG - This code is vulnerable to an injection attack! :)
xdmp:eval(
  concat('doc("', xdmp:get-request-field('uri'), '")'))

Actually, there are at least two bugs in this code. There is a functional problem: what happens if the uri request field is fubar-"baz"? You might not expect a uri to include a quote, and maybe that will never legitimately happen in your application. But if that request-field does arrive, xdmp:eval will throw an error:

XDMP-UNEXPECTED: (err:XPST0003) Unexpected token syntax error

That's because you haven't properly escaped the uri in the dynamic XQuery. And you could escape it. You could even write a function to do that for you. But if you miss any of the various characters that need escaping, XDMP-UNEXPECTED will be there, waiting for you.

So far we've only talked about innocent mistakes. But what if someone out there is actively hostile? Let's say it's me. If I know that your web service expects a uri request-field, I might guess that your code looks something like the code above, and try an injection attack.

After a little trial and error, I might find that sending uri=x"),cts:uris(),(" returns a list of all the documents in your database, whether you want me to see them or not. Then I can send something like uri=x"),xdmp:document-delete("fubar. If that document exists, and security isn't tight... it's gone. Or maybe I will decide to try xdmp:forest-clear instead.

In SQL we use bind variables to solve both of these problems. Any user input binds to a variable inside the SQL, and the database driver takes care of escaping for us. We no longer have to worry about obscure syntax errors or injection attacks, as long as we remember to use bind variables for all externally-supplied parameters. In XQuery these are known as external variables.

(: Always use external variables for user-supplied data. :)
xdmp:eval(
  'declare variable $URI as xs:string external ;
   doc($URI)',
  (xs:QName('URI'), xdmp:get-request-field('uri')))

The syntax is a little odd: that second parameter is a sequence of alternating QNames and values. Because XQuery doesn't support nested sequences, you can't naively bind a sequence of values to a single external variable. Instead you can pass in XML or a map, or use a convention like comma-separated values (CSV).

(: Using XML to bind a sequence to an external variable. :)
xdmp:eval(
  'declare variable $URI-LIST as element(uri-list) external ;
   doc($URI-LIST/uri)',
  (xs:QName('URI-LIST'),
   element uri-list {
     for $uri in xdmp:get-request-field('uri')
     return element uri { $uri } }))

Even though these examples all use pure XQuery, this code review principle also applies to XCC code. If you see a Java or .NET program using AdHocQuery, check to make sure that all user input binds to variables.

Remember, the best time to fix a potential security problem is before the code goes live.

Let-free Style and Streaming

March 19, 2012 at 12:34 PM | categories: XQuery, MarkLogic

If you are familiar with Lisp or Scheme, you know that a function call can replace a variable binding, and function calls can also replace most loops. This is also true in XQuery.

In XQuery this leads to a style of coding that I call "let-free". In this style, there are no FLWOR expressions. Really this is "FLWOR-free", not "let-free", but that's too much of a mouthful for me.

But why would you write let-free code?

The answer is scalability - you knew it would be, right? This breaks out into concurrency and streaming. Let's talk about concurrency first. In the MarkLogic Server implementation of XQuery, every let is evaluated in sequence. However, other expressions are evaluated lazily with concurrency-friendly "future values". So a performance-critical single-threaded request can sometimes benefit from let-free style. You can see this technique in use in some of my code: the semantic library or the task-server forest rebalancer. Both of these projects try to benefit from multi-core CPUs.

The let-free style can also help with query scalability by allowing the results to stream, rather than buffering the entire result sequence. If you need to export large result sets, for example, this technique can help avoid XDMP-EXPNTREECACHEFULL errors. Those errors result when your query's working set is too large to fit in the expanded tree cache, a sort of scratch space for XML trees. But streaming results don't have to fit into the cache.

For example, let's suppose you need to list every document URI in the database. But you do not have the URI lexicon enabled, and you cannot reindex to create it.
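
One let-free way to write that query is to map a small helper function over every document in the database. This is a sketch; adjust the return type and error handling to taste.

declare function local:uri(
  $doc as document-node())
as xs:string
{
  xdmp:node-uri($doc)
};

local:uri(collection())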

Note that nested evaluations cannot stream, either. So even a let-free query may throw XDMP-EXPNTREECACHEFULL in cq or another development tool. To test this query, use an http module instead. This is ideal for web service implementations too.

In this example we used function mapping, a MarkLogic extension to XQuery 1.0. If a function takes a single argument but is called using a sequence, the evaluator simply maps the sequence to multiple function calls. This is somewhat faster than a FLWOR, and it can stream.

Besides using function mapping, let-free style can use XPath steps. However, this technique only works for sequences of nodes.
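
For example, the URI listing above can also be written as a single path expression, with the function call as the final step.

collection()/xdmp:node-uri(.)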

While these techniques are useful, they can make for code that is hard to read and tricky to debug. Function mapping is especially prone to errors that are difficult to diagnose. If a function signature specifies an argument without a quantifier or with the + quantifier, and the runtime argument is empty, the function will not be called at all. This is surprising, since normally the function would be called and would cause a strong typing error.
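
To see the difference, compare these two queries, which differ only in the xdmp:mapping option. Here doc('no-such-uri') simply stands in for any expression that returns the empty sequence.

declare option xdmp:mapping "true";
declare function local:uri(
  $doc as document-node())
as xs:string
{ xdmp:node-uri($doc) };

local:uri(doc('no-such-uri'))

and

declare option xdmp:mapping "false";
declare function local:uri(
  $doc as document-node())
as xs:string
{ xdmp:node-uri($doc) };

local:uri(doc('no-such-uri'))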

The first expression returns the empty sequence, while the second throws the expected strong typing error XDMP-AS. This behavior is annoying, but in some applications the benefits of function mapping outweigh this drawback. We can make debugging easier if we weaken the function signature to document-node()? so that the function will be called even when the argument is empty. If needed, we can include an explicit check for empty input too.

Another let-free trick is to use module variables. These act much like let bindings, but they can stream.
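
Continuing with the URI listing, a module variable version might look like this sketch.

declare variable $DOCS as document-node()* := collection();

$DOCS/xdmp:node-uri(.)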

This example is a bit contrived, since the module variable doesn't add anything. But if you find yourself struggling to refactor a let as a function call or an XPath step, consider using a module variable. Module variables are also excellent tools for avoiding repeated work, since the right-hand expression is evaluated lazily and is never evaluated more than once. If the evaluation does not use the module variable, then the right-hand expression is never evaluated. In contrast, the right-hand expression of a let is evaluated even when the return does not use its value.

As always, do not optimize code unless there is a problem to solve. There are also some situations where the let-free style isn't appropriate. Aside from making your code harder to read and more difficult to debug, let-free style simply doesn't work in situations where your FLWOR would have an order by clause. And after all, streaming won't work for that case anyway. The evaluator can't sort the result set without buffering it first.

Conditional Profiling for MarkLogic

December 14, 2011 at 03:16 PM | categories: XQuery, MarkLogic

Today I pushed cprof to GitHub. This XQuery library helps application developers who need to retrofit existing applications with profiling capabilities. Just replace all your existing calls to xdmp:eval, xdmp:invoke, xdmp:value, xdmp:xslt-eval, and xdmp:xslt-invoke with corresponding cprof: calls. Add a little logic around cprof:enable and cprof:report, and you are done.
