Yet another search parser - XQYSP
October 24, 2011 at 01:19 PM | categories: XQuery, MarkLogic | View CommentsXQYSP takes a slightly different approach than the Search API or the older lib-parser.xqy, both of which returned cts:query items. Instead, XQYSP returns an abstract syntax tree (AST) as XML. It is up to you, the caller, to transform that AST into a cts:query. That is a little more work for you, but adds a lot of flexibility at the same time. Most of the tasks that used to go into lib-parser-custom.xqy can now be implemented without changing the parser itself. To make it easier to get started, though, I have provided sample code to generate a query from an AST. I hope it is useful.
XQUT - Unit Testing in Pure XQuery
September 13, 2011 at 04:06 PM | categories: XQuery, MarkLogic | View CommentsI was working on a couple of pure XQuery projects that needed unit testing. While I could have integrated with JUnit or another existing framework, I really wanted something simple that I could run directly from cq. Hence XQUT.
XQUT will usually be invoked like this:
The cq app server should point to the code you are testing,
so that your test suite can import libraries. The eval root is
different: it is the location of the XQUT code, so that you only need
one copy of XQUT. The external variable SUITE
is an XML
test suite. A simple test suite might look like this:
The XML is fairly simple. Under the root suite
element we have one or more unit
elements, each
representing a test. The test XQuery can be defined as the lexical
value of the element, or as its expr
child. The result
can be defined by a result
attribute or element.
For more sophisticated tests, you can
add xsi:type
attributes and sequences
of result
elements. You can also use an
optional environment
element to import libraries, define
variables, and define namespace prefixes. If you
add setup
elements, these will be evaluated before any
tests. The test
suite for XQUT itself contains more examples.
Group By in XQuery 1.0 for MarkLogic Server
August 23, 2011 at 02:12 PM | categories: XQuery, MarkLogic | View CommentsXQuery 3.0 introduces new syntax for "group by". At this time, MarkLogic Server 4.2 is the latest release, and it doesn't have support for that syntax. So how can we implement "group by" when writing XQuery for MarkLogic?
Let's start with the W3C use cases. First, let's fetch the sample data and put it into MarkLogic. We can do that using cq. I'll leave out the schemas, since we don't need those. I also won't be exhaustive about optimizing every expression in these examples: suffice to say that there is room for even more improvement.
Sorry about the long block of code, but we need those documents. Paste that into cq, evaluate it, and you should get the empty sequence. That means your documents were inserted correctly: you can use the 'explore' link to check.
Now we can write some queries. Here is the first use case (Q1).
And the result should look like this:
We can't write XQuery 3.0 using XQuery 1.0 — but we can get the same result using an extra distinct-values step.
This code is a little awkward, though. Instead of looping through the records once, we have to perform a database lookup on each product name. Normally this would be an unavoidable cost, and perhaps a reason to look forward to XQuery 3.0. But MarkLogic gives us a way to cheat, and use an accumulator model to get the same result more quickly. I'm talking about maps.
This produces the same output, and will scale better the
distinct-values()
version would. Of course it is also less
portable. But database application developers often have to implement
non-portable optimizations, and the less portable code can be
segregated into its own library modules.
Now let's look at the next example (Q2).
Expected result:
Here is a solution using maps:
Again, this solution produces the same results. This time we had two elements in the grouping key, and the map key must be a string. So we had to use an old database trick and concatenate the two values with a known delimiter. Naturally we have to be careful in our choice of delimiter.
For the remaining queries, I'll skip the W3C examples and output XML. Here are my solutions. Again, these return the desired results, but could benefit from more optimization work.
This final use-case is kind of odd, because the sample code works if you simply comment out the "group by". In other words, the sample data only contains one group. But I reimplemented it anyway.
That's it. I hope this was worth your time.
XQuery Code Review
December 05, 2010 at 01:57 PM | categories: XQuery, MarkLogic | View CommentsCode review is a crucial part of good development methodology. It can be a pre-checkin activity, or an ongoing background task in a pair-programming environment. What should you look for when reviewing XQuery code?
I'm sure I've forgotten a few things, but hopefully this will get you started.
Use strong typing
You can catch many potential bugs this way. If a function's parameters and return value aren't strongly typed, there should be a comment explaining why that isn't practical. Try to narrow down the types, too: be as specific as possible. If a parameter or return type is node(), ask why it can't be element()? If it's element(), why can't it be element(foo) or element(*, typename) instead? Don't forget about quantifiers and schema-element().
Besides function variables and return types, this also applies to module variables, FLWOR variable bindings, and typeswitch variable bindings.
Look for injection paths
The xdmp:eval
and xdmp:value
functions
are especially vulnerable, and should be avoided as much as possible
in favor of xdmp:invoke
and xdmp:unpath
, respectively.
But it's important to check any user input. For example, is the user
entitled to see the contents of the document requested via
xdmp:get-request-field
? It's best if the document is
protected by a good security configuration with read permissions, but
if security checks are being done in application code, be especially
careful to guard against injection attacks.
Are there order-by clauses or expressions with gt or lt operators?
If so, could they be accelerated using range indexes? If range indexes do exist, are they working? Sometimes the FLWOR syntax will be subtly outside the boundaries of what MarkLogic Server can optimize, and sometimes the query collation won't match a string range index collation.
You can check using xdmp:query-trace
or xdmp:plan
. Often this makes a good unit test,
especially for performance-critical functionality.
Does the code rely on effective boolean values?
If you see 'if ($foo)then...', ask what the type of $foo is. If it isn't already a boolean, then the effective boolean value rules come into play. That could cause mysterious bugs or errors. Would it be better to write 'if(exists($foo))...' or perhaps 'if ($foo ne 0)...' instead? The core idea here is to write what you mean, and mean what you write.
This idea also applies to XPath predicates, where-clauses, and any other boolean context.
Review any XPath with '/text()' or '//text()' steps
When I see a query is full of text()
steps, my first
thought is that the developer may not really understand
atomization. Believe it or not, there's often some performance to be
gained by eliminating these extra steps in favor of atomization. But
more importantly, text()
can be dangerous. It doesn't
mean "give me the lexical string value of the context nodes". It does
mean "give me the text children, if any, of each of the context
nodes". So it can return nothing, or one text node, or one million
text nodes. These two code examples look similar, but behave in very
different ways. As a result, I try to avoid text()
as
much as possible. When it is needed, I leave comments describing why
string()
won't work, and what will happen in cases such
as empty sequence or multiple nodes.
let $n := <p>nodes <i>can</i> have multiple text children</p>
return concat('testing... ', $n)
=> testing... nodes can have multiple text children
let $n := <p>nodes <i>can</i> have multiple text children</p>
return concat('testing... ', $n/text())
=> XDMP-ARGTYPE: (err:XPTY0004) concat("testing... ", (text{"nodes "},
text{" have multiple text children"})) -- arg2 is not of type xs:anyAtomicType?
Review any XPath with '//' steps
Sometimes these are necessary, and sometimes they even benefit performance. But more often they are a sign of short-sighted laziness, or outright ignorance of the content. Rewriting these expressions as fully-qualified paths can improve code resiliency and also performance.
If the business logic really calls for descendant-or-self::node(), comment the code to explain why.
Review any function calls within XPath predicates
Function calls inside an XPath predicate can be horrible for performance, since the function must be called for every item in the predicate's input sequence. If the result of the function call is static, simply bind the result to a variable. This is also true of operators: even simple math operations like:
$list[$start to $start +$size]should be rewritten as
$list[$start to $stop]
If you have trouble seeing why this might be a problem, consider a list with 100 items. Now consider this expression:
$list[ xdmp:sleep(100) ]
Evaluation will cost 100-ms per item, or 10 seconds total. Every expression takes a finite amount of time to evaluate, and performance optimization is sometimes a matter of reducing the expression count. Also, most function calls will defeat any index lookups that might otherwise take place.
Review any XPath sub-expressions within XPath predicates
Here I don't mean [ foo ]
but rather
[ foo/bar ]
or [ foo[bar] ]
.
These sub-expressions can be
bad for performance. If the result of the sub-expression is static,
bind it to a variable.
Watch for nested if-then-else-if-then-else... sub-expressions
Can it be rewritten as a typeswitch? If so, this is often easier to read, and faster.
Look for opportunities to eliminated unnecessary expressions
For example, I frequently see code resembling this:
...
return if ($flag) then $result else ()
It would be clearer and more XQuery-like to write this as:
...
where $flag
return $result
Another common pattern is:
if ($foo/bar) then $foo/bar else ()
This can be re-written without changing the results, as:
$foo/bar
The trick is to understand that if there are no results for an XPath expression, then the result is (). A similar pattern is:
if ($foo eq $bar) then true() else false()
This can be written as:
($foo eq $bar)
The parenthesis aren't strictly necessary, but make it easier to read expressions like this:
let $baz := ($foo eq $bar)
...
Watch out for function mapping
Function mapping is a powerful friend and a terrible enemy. If the code is a 0.9-to-1.0 port, I recommend disabling function mapping in every module. If it's new 1.0 code, think about how function mapping might help or cause problems in different situations.
Disabling function mapping is easy. In the prologue, simply add:
declare option xdmp:mapping "false";
Your turn
As Mary might ask: comments, questions, or denunciations? Any tips of your own that you would like to share? Let's hear them.
My Name Here
November 29, 2010 at 10:38 AM | categories: home, MarkLogic | View CommentsThanks to all at MarkLogic, and to the customers and partners that I worked with there. If you want to keep in touch, I'm on LinkedIn.