<?xml version="1.0" encoding="UTF-8"?>
<feed
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:thr="http://purl.org/syndication/thread/1.0"
  xml:lang="en">
  <title type="text">Where am I?</title>
  <subtitle type="text">Performance, scalability, databases, and whatever comes up.</subtitle>

  <updated>2023-07-07T19:44:55Z</updated>
  <generator uri="http://blogofile.com/">Blogofile</generator>

  <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile" />
  <id>http://blakeley.com/blogofile/feed/atom/</id>
  <link rel="self" type="application/atom+xml" href="http://blakeley.com/blogofile/feed/atom/" />
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[Miso-Wasabi Dip]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2023/07/07/miso-wasabi-dip" />
    <id>http://blakeley.com/blogofile/2023/07/07/miso-wasabi-dip</id>
    <updated>2023-07-07T12:34:56Z</updated>
    <published>2023-07-07T12:34:56Z</published>
    <category scheme="http://blakeley.com/blogofile" term="food" />
    <category scheme="http://blakeley.com/blogofile" term="recipes" />
    <summary type="html"><![CDATA[Miso-Wasabi Dip]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2023/07/07/miso-wasabi-dip"><![CDATA[<p>Good with avocado slices or blanched vegetables.</p>
<ul>
<li>3-T sake</li>
<li>1-T white miso</li>
<li>2-tsp prepared wasabi (use more or less to taste)</li>
<li>1.5-tsp black toasted sesame seed</li>
</ul>
<p>Whisk the sake, miso, and wasabi, then add sesame seed.</p>
<p>Keep refrigerated.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[Solara]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2023/04/22/solara" />
    <id>http://blakeley.com/blogofile/2023/04/22/solara</id>
    <updated>2023-04-22T12:59:59Z</updated>
    <published>2023-04-22T12:59:59Z</published>
    <category scheme="http://blakeley.com/blogofile" term="solar" />
    <summary type="html"><![CDATA[Solara]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2023/04/22/solara"><![CDATA[<p>I've been working on a
<a href="https://github.com/mblakele/solara">simple web service</a>
to help maximize self-consumption
of solar power, with the help of 
<a href="https://www.emporiaenergy.com/how-the-vue-utility-connect-works">Emporia Energy's VUE Utility Connect</a>.
Today seems like a good day to release it: I hope it's useful!</p>
<p><a href="https://github.com/mblakele/solara">mblakele/solara on GitHub</a></p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[Sambal Olek]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2023/04/21/sambal-olek" />
    <id>http://blakeley.com/blogofile/2023/04/21/sambal-olek</id>
    <updated>2023-04-21T12:34:56Z</updated>
    <published>2023-04-21T12:34:56Z</published>
    <category scheme="http://blakeley.com/blogofile" term="food" />
    <category scheme="http://blakeley.com/blogofile" term="recipes" />
    <summary type="html"><![CDATA[Sambal Olek]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2023/04/21/sambal-olek"><![CDATA[<p>Remember the Great Huy Fong Shortage of 2022-23?
Here's my homemade version of Sambal Olek:</p>
<ul>
<li>8-oz fresh red chilis, Thai or Vietnamese long type</li>
<li>6-T (3-oz) rice vinegar</li>
<li>1-T salt</li>
<li>12-oz glass jar</li>
</ul>
<p>Remove the stems from the chilis and rinse them,
then transfer to a blender or food processor.
Add the vinegar and salt.
Blend smooth, adding 1-3 T water if needed.
Transfer to a small pan and simmer for 5-10 minutes.
Finally, transfer to the glass jar.</p>
<p>Keep refrigerated.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[Toum]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2015/12/14/toum" />
    <id>http://blakeley.com/blogofile/2015/12/14/toum</id>
    <updated>2015-12-14T12:34:56Z</updated>
    <published>2015-12-14T12:34:56Z</published>
    <category scheme="http://blakeley.com/blogofile" term="food" />
    <category scheme="http://blakeley.com/blogofile" term="recipes" />
    <summary type="html"><![CDATA[Toum]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2015/12/14/toum"><![CDATA[<p>Toum is traditional with Middle Eastern food, especially chicken. But
it's also great on pizza or pasta, or as a dip for vegetables.
Or use it to flavor rice dishes. Heck, try it with cardboard!</p>
<p>I first had toum (ثوم) at <a href="http://zankouchicken.com/">Zankou Chicken</a>,
where they call it "garlic sauce". Not knowing the name made it tricky
to find a recipe, but I persevered. Here's what I've settled on:</p>
<ul>
<li>3 heads of garlic, peeled (about 36-42 cloves)</li>
<li>2 tsp salt</li>
<li>2-3 Tbl lemon juice</li>
<li>9 oz vegetable oil (soybean works well)</li>
</ul>
<p>Peel the garlic and combine it with the salt and lemon juice in a
blender. Work it into a fine paste. Add the oil a little at a time,
building up a white, creamy emulsion. Use a little less lemon juice
for a thicker texture, like whipped butter. Use more lemon juice if,
like John Cleese, you like it runny.</p>
<p>Refrigerate for an hour or two before using, so the lemon juice can
work on the garlic. Makes about a pint of garlicky goodness. Keep it
refrigerated in a sealed container. Keeping toum in a condiment
squeeze tube can be a lot of fun: apply garlic anywhere, anytime, with
pinpoint precision.</p>
<p>You may be tempted to try olive oil instead of soybean. Don't bother:
it doesn't emulsify very well.  I think this is because of the
chemistry of the fatty acids involved, but I don't even play a chemist
on TV.  Avoid canola oil too, because it makes the toum taste funny.</p>
<p>Emulsions have a reputation for being finicky, but this one seems
pretty reliable for me. Just watch out for hot days or overworking the
blender. Heat breaks down emulsions pretty easily. So does freezing.
If you cook with toum, the heat will break down the emulsion.  Sure,
it'll break down into garlic and oil, and you might say you haven't
lost anything. But why did you bother getting out the blender and
making toum, if all you wanted was garlic and oil?</p>
<p>Some toum recipes call for extra filler: potato or breadcrumbs,
etc. Feel free to try that. You can also try mixing in herbs or
spices. Or just sprinkle whatever you feel like on top of your
plateful of yum.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[Deduplicating Search Results]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2014/06/08/deduplicating-search-results" />
    <id>http://blakeley.com/blogofile/2014/06/08/deduplicating-search-results</id>
    <updated>2014-06-08T12:34:56Z</updated>
    <published>2014-06-08T12:34:56Z</published>
    <category scheme="http://blakeley.com/blogofile" term="XQuery" />
    <category scheme="http://blakeley.com/blogofile" term="MarkLogic" />
    <summary type="html"><![CDATA[Deduplicating Search Results]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2014/06/08/deduplicating-search-results"><![CDATA[<p>So you're writing a MarkLogic application, and this question comes up:
How do we deduplicate search results?</p>
<p>In one sense MarkLogic will never return duplicates from a database lookup.
A single XPath, <code>cts:search</code>, or <code>search:search</code> expression will always
return unique nodes, as defined by the
<a href="https://www.w3.org/TR/xpath-functions/#func-is-same-node">XPath <code>is</code> operator</a>.</p>
<p>But your application might have its own, content-dependent definition
of a duplicate. This might depend on just a subset of the XML content.
For example you might be storing news articles pulled from different
media sources: newspapers, magazines, blogs, etc.
Often the same news story will appear in different sources,
and sometimes the text will be identical or extremely close.
When a user searches for the hot story of the day
you want to have all the variations available,
but the search results should roll them up together on the page.
You can see something like this if you search
<a href="https://news.google.com">Google News</a>.</p>
<p>One good strategy is to avoid duplicates entirely,
by ensuring that your documents have meaningful URIs.
Construct the URI using the same information
that determines whether or not a document is a duplicate.
This way if content arrives that duplicates existing content,
it turns out to have the same URI.
Then you are free to update the database with the latest copy,
ignore it, throw an error, or call for help.
If every news story has a dateline
and an author byline, we could construct document URIs
based on the date, location, and byline:
something like <code>/news/2014/05/30/washington/jones</code>.</p>
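<p>As a sketch, that URI construction could be a small helper function.
The <code>article</code>, <code>dateline</code>, and <code>byline</code>
element names here are hypothetical; substitute whatever your schema uses.</p>
<pre><code>(: Sketch: derive a stable document URI from article metadata. :)
declare function local:article-uri($article as element(article))
as xs:string
{
  string-join(
    ('/news',
     (: turn 2014-05-30 into 2014/05/30 :)
     translate(string($article/dateline/date), '-', '/'),
     lower-case($article/dateline/location),
     lower-case($article/byline/surname)),
    '/')
};
</code></pre>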
<p>But maybe that isn't a very good solution for our news application.
Remember that we want to search for articles,
but we only want one article per story.
So we have to store all the duplicate articles,
and we need query-time flexibility to display just one article per story.</p>
<p>Clearly we will need to generate a story-id for each story,
one that remains the same no matter how different articles
are presented. That might use a mechanism similar
to the URI computation above, except that we would put the result
in an element and it would not be unique.
We could use the same facts we were going to use in the document URI:</p>
<pre><code>&lt;story-id&gt;2014-05-30|washington|jones&lt;/story-id&gt;
</code></pre>
<p>Once we have our application set up to generate <code>story-id</code> elements,
we could try a brute-force approach.
Search the database, then walk through the search results.
Extract each <code>story-id</code> value and check it
against a list of previously-seen story-id values.
We could use a map for that.
If the <code>story-id</code> has already been seen, ignore it.
Otherwise put the <code>story-id</code> in the map and return the article.</p>
<pre><code>(
  let $search-results := search:search(...)
  let $seen := map:map()
  for $article in $search-results
  let $story-id as xs:string := $article/story-id
  where not(map:contains($seen, $story-id))
  return (
    map:put($seen, $story-id, $story-id),
    $article))[$start to $stop]
</code></pre>
<p>But there are problems with this approach. Pagination is tricky
because we don't know how many duplicates there will be.
So we have to ask the database for a lot of results,
maybe all of them at once, and then filter and paginate in user code.
This gets more and more expensive as the result size increases,
and trickier to manage as the user paginates through the results.
If a search matches a million articles, we might have to retrieve and check
all the matches before we can display any results.
That's going to be slow, and probably limited by I/O speeds.
Nowadays we could throw SSD at it, but even SSD has limits.</p>
<p>Another problem with the brute-force approach is that
facets generated by the database will not match the deduplicated results.
You might have a facet on author that shows 1000 matching articles,
but deduplication filters out all but 100 of them.</p>
<p>So let's look at another approach. Instead of deduplicating after we search,
let's deduplicate before we search. That might sound crazy,
but we have a couple of powerful tools that make it possible:
<a href="https://docs.marklogic.com/cts:value-co-occurrences"><code>cts:value-co-occurrences</code></a>
and <a href="https://docs.marklogic.com/cts:document-query"><code>cts:document-query</code></a>.
The idea is to deduplicate
based on the co-occurrence of <code>story-id</code> and document URI,
without retrieving any documents.
Then we query the database again,
this time fetching only the non-duplicate documents
that we want to return.</p>
<p>Each article is stored as a document with a unique document URI.
We enable the
<a href="https://docs.marklogic.com/guide/search-dev/lexicon#id_50782">document URI lexicon</a>
and we also create an element-range index on the element named <code>story-id</code>.
As described above, we construct a <code>story-id</code>
for every article as it arrives and add it to the XML.
This <code>story-id</code> is our deduplication key: it uniquely identifies a story,
and if multiple articles have the same <code>story-id</code> value,
they are treated as duplicates.</p>
<p>A deduplication key is application-specific, and might be anything.
An application might even have multiple deduplication keys
for different query types.
However it's essential to have a deduplication key for every document
that you want to query, even if only some documents will have duplicates.
The technique we're going to use will only return documents
that have a deduplication key.
An article with no <code>story-id</code> simply won't show up in the co-occurrence
results, so it won't show up in search results either.</p>
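<p>Before relying on that behavior in production, it may be worth checking
how many documents would silently drop out. A quick sketch:</p>
<pre><code>(: Sketch: estimate how many documents lack a story-id element. :)
xdmp:estimate(
  cts:search(
    collection(),
    cts:not-query(
      cts:element-query(xs:QName('story-id'), cts:and-query(())))))
</code></pre>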
<p>Here's some code to illustrate the idea. Start with <code>$query-original</code>,
which is the original user query as a cts:query item.
We might generate that using 
<a href="https://docs.marklogic.com/search:parse"><code>search:parse</code></a>
or perhaps the <a href="https://github.com/mblakele/xqysp">xqysp</a> library.</p>
<pre><code>(: For each unique story-id there may be multiple article URIs.
 : This implementation always uses the first one.
 :)
let $query-dedup := cts:document-query(
  let $m := cts:value-co-occurrences(
    cts:element-reference(
      xs:QName('story-id'),
      'collation=http://marklogic.com/collation/codepoint'),
    cts:uri-reference(),
    'map')
  for $key in map:keys($m)
  return map:get($m, $key)[1])
(: The document-query alone would match the right articles,
 : but there would be no relevance ranking.
 : Using both queries eliminates duplicates and preserves ranking.
 :)
let $query-full := cts:and-query(($query-original, $query-dedup))
...
</code></pre>
<p>Now we can use <code>$query-full</code> with any API that uses cts:query items,
such as <code>cts:search</code>. In order to match, an article will have to match
<code>$query-original</code> and it will have to have one of the URIs
that we selected from the co-occurrence map.</p>
<p>Instead of calling <code>cts:search</code> directly, we might want to use 
<a href="https://docs.marklogic.com/search:resolve"><code>search:resolve</code></a>.
That function expects a cts:query XML element, not a cts:query item.
So we need a little extra code to turn the cts:query item
into an XML document and then extract its root element:</p>
<pre><code>...
return search:resolve(
  document { $query-full }/*,
  $search-options,
  $pagination-start,
  $pagination-size)
</code></pre>
<p>Many search applications also provide facets. You can ask <code>search:resolve</code>
for facets by providing the right search options,
or you can call <a href="https://docs.marklogic.com/cts:values"><code>cts:values</code></a> yourself.
Note that since facets are not relevance-ranked,
it might be a little faster to use <code>$query-dedup</code> instead of <code>$query-full</code>.</p>
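<p>For example, a facet over a hypothetical <code>author</code> element
range index might be computed directly against the deduplicated set:</p>
<pre><code>(: Sketch: top-10 author facet over deduplicated articles only. :)
cts:values(
  cts:element-reference(xs:QName('author')),
  (),
  ('frequency-order', 'limit=10'),
  cts:and-query(($query-original, $query-dedup)))
</code></pre>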
<p>Speaking of performance, how fast is this? In my testing it added
an O(n) component, linear with the number of keys in the
<code>cts:value-co-occurrences</code> map. With a small map the overhead is low,
and deduplicating 10,000 items only adds a few tens of milliseconds.
But with hundreds of thousands of map items the profiler
shows more and more time spent in the XQuery FLWOR expression
that extracts the first document URI from each map item.</p>
<pre><code>  let $m := cts:values-co-occurrences(
    cts:element-reference(
      xs:QName('story-id'),
      'collation=http://marklogic.com/collation/codepoint'),
    cts:uri-reference(),
    'map')
  for $key in map:keys($m)
  return map:get($m, $key)[1])
</code></pre>
<p>We can speed that up a little bit by trading the FLWOR
for function mapping.</p>
<pre><code>declare function local:get-first(
  $m as map:map,
  $key as xs:string)
as xs:string
{
  map:get($m, $key)[1]
};

let $m := cts:value-co-occurrences(
  cts:element-reference(
    xs:QName('story-id'),
    'collation=http://marklogic.com/collation/codepoint'),
  cts:uri-reference(),
  'map')
return local:get-first($m, map:keys($m))
</code></pre>
<p>However this is a minor optimization, and with large maps
it will still be expensive to extract the non-duplicate URIs.
It's both faster and more robust than the brute-force approach,
but not as fast as native search.</p>
<p>Pragmatically, I would try to handle these performance characteristics
in the application. Turn deduplication off by default,
and only enable it as an option
when a search returns fewer than 100,000 results.
This would control the performance impact of the feature,
providing its benefits without compromising overall performance.</p>
<p>It's also tempting to think about product enhancements.
We could avoid some of this work if we could find a way
to retrieve only the part of the map needed for the current search page,
but this is not feasible with the current implementation
of <code>cts:values-co-occurrences</code>. That function would have to return
the co-occurrence map sorted by the score of each story-id.
That's tricky because normally scores are calculated for documents,
in this case articles.</p>
<p>One way to speed this up without changing MarkLogic Server
could be to move some of the work into the forests.
MarkLogic Server supports
<a href="https://docs.marklogic.com/guide/app-dev/aggregateUDFs">User-Defined Functions</a>,
which are C++ functions that run directly on range indexes.
I haven't tried this approach myself, but in theory you could write a UDF
that would deduplicate based on the <code>story-id</code> and URI co-occurrence.
Then you could call this function with
<a href="https://docs.marklogic.com/cts:aggregate"><code>cts:aggregate</code></a>.
This would work best if you could partition your forests
using the <code>story-id</code> value, so that articles with the same
<code>story-id</code> are guaranteed to be in the same forest.
Used carefully this approach could be much faster,
possibly allowing fast deduplication with millions of URIs.</p>
<p>For more on that idea, see the documentation for
<a href="https://docs.marklogic.com/guide/admin/tiered-storage">Tiered Storage</a>
and the
<a href="https://developer.marklogic.com/learn/a-mapreduce-aggregation-function">UDF plugin tutorial</a>.
If you try it, please let me know how it works out.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[Introduction to Multi-Statement Transactions]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2013/06/21/introduction-to-multi-statement-transactions" />
    <id>http://blakeley.com/blogofile/2013/06/21/introduction-to-multi-statement-transactions</id>
    <updated>2013-06-21T12:34:56Z</updated>
    <published>2013-06-21T12:34:56Z</published>
    <category scheme="http://blakeley.com/blogofile" term="XQuery" />
    <category scheme="http://blakeley.com/blogofile" term="MarkLogic" />
    <summary type="html"><![CDATA[Introduction to Multi-Statement Transactions]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2013/06/21/introduction-to-multi-statement-transactions"><![CDATA[<p>If you are an old hand with MarkLogic, you are used to writing
update queries with implicit commits. Sometimes this
means restructuring your code so that everything can happen in one commit,
with no conflicting updates. In extreme cases you might
even decide to run multiple transactions from one query,
using <code>xdmp:invoke</code> or semicolons.
Historically this meant giving up atomicity.</p>
<p>Multi-statement transactions, introduced in MarkLogic 6,
promise a third way. We can write a transaction that spans
multiple statements, with an explicit commit or rollback.</p>
<p>For most updates it's probably best to stick with the old ways
and use implicit commits. But let's look at a concrete example
of a time when multi-statement transactions are the right tool
for the job.</p>
<p>Suppose you are using DLS (Document Library Services)
to manage your document versioning. But you have a special case
where you want to insert two discrete versions of the same document
atomically. That may sound odd, but I ran into that exact problem recently.</p>
<p>First we need to discover that there is a problem.
Let's bootstrap a test document with DLS.</p>
<pre><code>import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
try {
  dls:document-delete('test', false(), false()) }
catch ($ex) {
  if ($ex/error:code ne 'DLS-UNMANAGED') then xdmp:rethrow()
  else if (empty(doc('test'))) then ()
  else xdmp:document-delete('test') }
;
import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-insert-and-manage('test', false(), &lt;x id="x1"/&gt;)
</code></pre>
<p>Now let's write some XQuery to insert two versions in one update,
and see what happens.</p>
<pre><code>import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-checkout-update-checkin(
  'test', &lt;x id="x2"/&gt;, "version two", true()),
dls:document-checkout-update-checkin(
  'test', &lt;x id="x3"/&gt;, "version three", true())
</code></pre>
<p>This throws an <code>XDMP-CONFLICTINGUPDATES</code> error, because these calls to DLS
end up trying to update the same nodes twice in the same transaction.
In implicit commit mode, aka "auto" mode, this is difficult to avoid.
We could ask MarkLogic to extend DLS with a new function
designed for this situation. But that is a long-term solution,
and we need to move on with this implementation.</p>
<p>So what can we do? We might read up on <code>xdmp:invoke</code>, <code>xdmp:eval</code>, etc.
If we are careful, we can write a top-level read-only query
that invokes one or more update transactions.</p>
<pre><code>(: Entry point - must be a read-only query. :)
xdmp:invoke(
  'update.xqy',
  (xs:QName('URI'), 'test',
   xs:QName('NEW'), &lt;x id="x2"/&gt;,
   xs:QName('NOTE'), "version two")),
xdmp:invoke(
  'update.xqy',
  (xs:QName('URI'), 'test',
   xs:QName('NEW'), &lt;x id="x3"/&gt;,
   xs:QName('NOTE'), "version three"))
</code></pre>
<p>This invokes a module called <code>update.xqy</code>, which would look like this:</p>
<pre><code>(: update.xqy :)
import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

declare variable $NEW as node() external ;
declare variable $NOTE as xs:string external ;
declare variable $URI as xs:string external ;

dls:document-checkout-update-checkin(
  $URI, $NEW, $NOTE, true())
</code></pre>
<p>This works - at least, it doesn't throw <code>XDMP-CONFLICTINGUPDATES</code>.
But we have lost atomicity. Each of the two updates runs
as a different transaction. This opens up a potential race
condition, where a second query updates the document
in between our two transactions. That could break our application.</p>
<p>There are ways around this, but they get complicated quickly.
They are also difficult to test, so we can never be confident
that we have plugged all the potential holes in our process.
It would be much more convenient if we could run multiple
statements inside one transaction, with each statement able
to see the database state of the previous statements.</p>
<p>We can do exactly that using a multi-statement transaction.
Let's get our feet wet by looking at a very simple MST.</p>
<pre><code>declare option xdmp:transaction-mode "update";

xdmp:document-insert('temp', &lt;one/&gt;)
;

xdmp:document-insert('temp', &lt;two/&gt;),
xdmp:commit()
</code></pre>
<p>There are three important points to this query:</p>
<ol>
<li>The option <code>xdmp:transaction-mode="update"</code>
  begins a multi-statement transaction.</li>
<li>The semicolon after the first <code>xdmp:document-insert</code>
  ends that statement and begins another.</li>
<li>The <code>xdmp:commit</code> ends the multi-statement transaction
  by committing all updates to the database.</li>
</ol>
<p>This runs without error, and we can verify that <code>doc('temp')</code>
contains <code>&lt;two/&gt;</code> after it runs.
But how can we prove that all this takes place in a single transaction?
Let's decorate the query with a few more function calls.</p>
<pre><code>declare option xdmp:transaction-mode "update";

xdmp:get-transaction-mode(),
xdmp:transaction(),
doc('temp')/*,
xdmp:document-insert('temp', &lt;one/&gt;)
;

xdmp:get-transaction-mode(),
xdmp:transaction(),
doc('temp')/*,
xdmp:document-insert('temp', &lt;two/&gt;),
xdmp:commit()
</code></pre>
<p>This time we return some extra information within each statement:
the transaction mode, the transaction id, and the contents of the test doc.
The transaction ids will be different every time, but here is one example.</p>
<pre><code>update
17378667561611037626
&lt;two/&gt;
update
17378667561611037626
&lt;one/&gt;
</code></pre>
<p>So the document <code>temp</code> started out with the old node <code>&lt;two/&gt;</code>,
but after the first statement it changed to <code>&lt;one/&gt;</code>.
Both statements see the same transaction mode and id.</p>
<p>Try changing the <code>xdmp:transaction-mode</code> declaration to <code>auto</code>, the default.
You should see the mode change to <code>auto</code>, and two different transaction-ids.
This tells us that in <code>update</code> mode we have a multi-statement transaction,
and in <code>auto</code> mode we have a non-atomic sequence of two different transactions.
Before MarkLogic 6, all update statements ran in <code>auto</code> mode.</p>
<p>Now let's apply what we've learned about MST to the original problem:
inserting two different versions of a managed document in a single transaction.</p>
<pre><code>import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";

declare option xdmp:transaction-mode "update";

dls:document-checkout-update-checkin(
  'test', &lt;x id="x2"/&gt;, "version two", true())
;

import module namespace dls="http://marklogic.com/xdmp/dls"
  at "/MarkLogic/dls.xqy";
dls:document-checkout-update-checkin(
  'test', &lt;x id="x3"/&gt;, "version three", true()),
xdmp:commit()
</code></pre>
<p>As above, this code uses three important features:</p>
<ol>
<li>Set <code>xdmp:transaction-mode="update"</code> to begin the MST.</li>
<li>Use semicolons to end one statement and begin another.</li>
<li>Use <code>xdmp:commit</code> to end the MST and commit all updates.</li>
</ol>
<p>To abort a multi-statement transaction, use <code>xdmp:rollback</code>.</p>
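<p>A minimal sketch of that, following the same pattern as the examples above:</p>
<pre><code>declare option xdmp:transaction-mode "update";

xdmp:document-insert('temp', &lt;one/&gt;)
;

(: Nothing is committed: the insert above is discarded. :)
xdmp:rollback()
</code></pre>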
<p>So now you have a new tool for situations where implicit commit
is a little too awkward. Try not to overdo it, though.
In most situations, the default <code>xdmp:transaction-mode="auto"</code>
is still the best path.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[External Variables (Code Review, Part II)]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2012/09/28/external-variables-(code-review,-part-ii)" />
    <id>http://blakeley.com/blogofile/2012/09/28/external-variables-(code-review,-part-ii)</id>
    <updated>2012-09-28T12:34:56Z</updated>
    <published>2012-09-28T12:34:56Z</published>
    <category scheme="http://blakeley.com/blogofile" term="XQuery" />
    <category scheme="http://blakeley.com/blogofile" term="MarkLogic" />
    <summary type="html"><![CDATA[External Variables (Code Review, Part II)]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2012/09/28/external-variables-(code-review,-part-ii)"><![CDATA[<p>Remember when I talked about <a href="/blogofile/archives/518">XQuery Code Review</a>?
The other day I was forwarding that link to a client,
and noticed that I forgot to mention external variables.
I talked about <code>xdmp:eval</code> and <code>xdmp:value</code>
in the section titled <em>Look_for_injection_paths</em>,
and mentioned that it's usually better to use <code>xdmp:invoke</code> or <code>xdmp:unpath</code>,
which are less vulnerable to injection attacks.</p>
<p>But it can be convenient or even necessary to evaluate dynamic XQuery.
That's what <code>xdmp:eval</code> and <code>xdmp:value</code> are there for, after all.
I've even written tools like <a href="https://github.com/mblakele/presta">Presta</a>
to help you.</p>
<p>Used properly, dynamic queries can be made safe.
The trick is to <strong>never</strong> let user data directly into your dynamic queries.
Whenever you see <code>xdmp:eval</code> or <code>xdmp:value</code> in XQuery,
ask yourself "Where did this query come from?"
If any part of it came from user input, flag it for a rewrite.</p>
<pre><code>(: WRONG - This code is vulnerable to an injection attack! :)
xdmp:eval(
  concat('doc("', xdmp:get-request-field('uri'), '")'))
</code></pre>
<p>Actually there are at least two bugs in this code.
There is a functional problem: what happens if the <code>uri</code> request field
is <code>fubar-"baz"</code>? You might not expect a uri to include a quote,
and maybe that will never legitimately happen in your application.
But if that request-field does arrive, <code>xdmp:eval</code> will throw an error:</p>
<pre><code>XDMP-UNEXPECTED: (err:XPST0003) Unexpected token syntax error
</code></pre>
<p>That's because you haven't properly escaped the uri in the dynamic XQuery.
And you could escape it. You could even write a function to do that for you.
But if you miss any of the various characters that need escaping,
<code>XDMP-UNEXPECTED</code> will be there, waiting for you.</p>
<p>So far we've only talked about innocent mistakes. But what if someone out there
is actively hostile? Let's say it's me. If I know that your web service
expects a <code>uri</code> request-field, I might guess that your code looks something like
the code above, and try an injection attack.</p>
<p>After a little trial and error, I might find that sending
<code>uri=x"),cts:uris(),("</code> returns a list of all the documents in your database,
whether you want me to see them or not. Then I can send something like
<code>uri=x"),xdmp:document-delete("fubar</code>. If that document exists,
and security isn't tight... it's gone. Or maybe I will decide to try
<code>xdmp:forest-clear</code> instead.</p>
<p>In SQL we use bind variables to solve both of these problems.
Any user input binds to a variable inside the SQL,
and the database driver takes care of escaping for us.
We no longer have to worry about obscure syntax errors or injection attacks,
as long as we remember to use variables for all externally-supplied parameters.
In XQuery these are known as external variables.</p>
<pre><code>(: Always use external variables for user-supplied data. :)
xdmp:eval(
  'declare variable $URI as xs:string external ;
   doc($URI)',
  (xs:QName('URI'), xdmp:get-request-field('uri')))
</code></pre>
<p>The syntax is a little odd: that second parameter is a sequence of
alternating QName and value. Because XQuery doesn't support nested sequences,
this means you can't naively bind a multi-item sequence as a single value.
Instead you can pass in XML or a map,
or use a convention like comma-separated values (CSV).</p>
<pre><code>(: Using XML to bind a sequence to an external variable. :)
xdmp:eval(
  'declare variable $URI-LIST as element(uri-list) external ;
   doc($URI-LIST/uri)',
  (xs:QName('URI-LIST'),
   element uri-list {
     for $uri in xdmp:get-request-field('uri')
     return element uri { $uri } }))
</code></pre>
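<p>A map works too. This sketch assumes MarkLogic's <code>map:map</code> type,
which can be bound as an external variable just like any other item.</p>
<pre><code>(: Using a map to bind a sequence to an external variable. :)
let $params := map:map()
let $_ := map:put($params, 'uris', xdmp:get-request-field('uri'))
return
  xdmp:eval(
    'declare variable $PARAMS as map:map external ;
     doc(map:get($PARAMS, "uris"))',
    (xs:QName('PARAMS'), $params))
</code></pre>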
<p>Even though these examples all use pure XQuery, this code review principle
also applies to XCC code. If you see a Java or .NET program using <code>AdHocQuery</code>,
check to make sure that all user input binds to variables.</p>
<p>Remember, the best time to fix a potential security problem
is <strong>before</strong> the code goes live.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[AlbumMixer v1.13]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2012/06/13/albummixer-v1.13" />
    <id>http://blakeley.com/blogofile/2012/06/13/albummixer-v1.13</id>
    <updated>2012-06-13T13:26:52Z</updated>
    <published>2012-06-13T13:26:52Z</published>
    <category scheme="http://blakeley.com/blogofile" term="iOS" />
    <summary type="html"><![CDATA[AlbumMixer v1.13]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2012/06/13/albummixer-v1.13"><![CDATA[<p><a href="https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=329552764">AlbumMixer v1.13</a>
fixes a minor bug where the player display would look odd
when tracks were missing some metadata.</p>
<p>If you see any problems with this release,
please use <code>Settings &gt; Report a Problem</code> from within the app.
I will also read comments posted here.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[rsyslog and MarkLogic]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2012/05/17/rsyslog-and-marklogic" />
    <id>http://blakeley.com/blogofile/2012/05/17/rsyslog-and-marklogic</id>
    <updated>2012-05-17T18:00:01Z</updated>
    <published>2012-05-17T18:00:01Z</published>
    <category scheme="http://blakeley.com/blogofile" term="MarkLogic" />
    <category scheme="http://blakeley.com/blogofile" term="Linux" />
    <summary type="html"><![CDATA[rsyslog and MarkLogic]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2012/05/17/rsyslog-and-marklogic"><![CDATA[<p>You probably know that MarkLogic Server logs important events
to the <code>ErrorLog.txt</code> file. By default it logs events at <code>INFO</code> or higher,
but many development and staging environments change the <code>file-log-level</code>
to <code>DEBUG</code>. These log levels are also available to the <code>xdmp:log</code> function,
and some of your XQuery code might use that for <code>printf</code>-style debugging.</p>
<p>You might even know that MarkLogic also sends important events
to the operating system. On Linux this means <code>syslog</code>, and important events
are those at <code>NOTICE</code> and higher by default.</p>
<p>But are you monitoring these events?</p>
<p>How can you set up your MarkLogic deployment so that it will automatically
alert you to errors, warnings, or other important events?</p>
<p>Most Linux deployments now use <code>rsyslog</code> as their system logging facility.
The <a href="http://www.rsyslog.com/doc/manual.html">full documentation</a> is available,
but this brief tutorial will show you how to set up email alerts for MarkLogic
using <code>rsyslog</code> version 4.2.6.</p>
<p>All configuration happens in <code>/etc/rsyslog.conf</code>.
Here is a sample of what we need for email alerts.
First, at the top of the file you should see several <code>ModLoad</code> declarations.
Check for <code>ommail</code> and add it if needed.</p>
<pre><code>$ModLoad ommail.so  # email support
</code></pre>
<p>Next, add a stanza for MarkLogic somewhere after the <code>ModLoad</code> declaration.</p>
<pre><code># MarkLogic
$template MarkLogicSubject,"Problem with MarkLogic on %hostname%"
$template MarkLogicBody,"rsyslog message from MarkLogic:\r\n[%timestamp%] %app-name% %pri-text%:%msg%"
$ActionMailSMTPServer 127.0.0.1
$ActionMailFrom your-address@your-domain
$ActionMailTo your-address@your-domain
$ActionMailSubject MarkLogicSubject
#$ActionExecOnlyOnceEveryInterval 3600
daemon.notice   :ommail:;MarkLogicBody
</code></pre>
<p>Be sure to replace both instances of <code>your-address@your-domain</code>
with an appropriate value. The <code>$ActionMailSMTPServer</code> must be smart enough
to deliver email to that address. I used a default <code>sendmail</code> configuration
on the local host, but you might choose to connect to a different host.</p>
<p>Note that I have commented out the <code>ActionExecOnlyOnceEveryInterval</code> option.
The author of <code>rsyslog</code>, <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>,
recommends setting this value to a reasonably high number of seconds
so that your email inbox is not flooded with messages.
However, the <code>rsyslog</code> documentation states that excess messages
are discarded, and I did not want to lose any important messages.
What I would really like to do is buffer messages for N seconds at a time,
and merge them together in one email.
But while <code>rsyslog</code> has many features, and does offer buffering,
it does not seem to know how to combine consecutive messages
into a single email.</p>
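<p>One partial workaround, sketched below using the same templates as the stanza above,
is to narrow the selector so that routine <code>NOTICE</code>-level events
(such as restarts) stay out of your inbox and only warnings or worse generate mail:</p>
<pre><code># Mail only warnings and above; NOTICE events still reach the normal logs.
daemon.warning  :ommail:;MarkLogicBody
</code></pre>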
<p>Getting back to what <code>rsyslog</code> <em>can</em> do,
you can customize the subject and body of the mail message.
With the configuration above, a restart of the server
might send you an email like this one:</p>
<pre><code>Subject: Problem with MarkLogic on myhostname.mydomain

rsyslog message from MarkLogic:
[May 17 23:58:36] MarkLogic daemon.notice&lt;29&gt;: Starting MarkLogic Server 5.0-3 i686 in /opt/MarkLogic with data in /var/opt/MarkLogic
</code></pre>
<p>When making any <code>rsyslog</code> changes, be sure to restart the service:</p>
<pre><code>sudo service rsyslog restart
</code></pre>
<p>At the same time, check your system log for any errors or typos.
This is usually <code>/var/log/messages</code> or <code>/var/log/syslog</code>.
The full documentation for <a href="http://www.rsyslog.com/doc/property_replacer.html">template substitution properties
</a> is online.
You can also read about a wealth of other options available in <code>rsyslog</code>.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://blakeley.com/blogofile</uri>
    </author>
    <title type="html"><![CDATA[AlbumMixer v1.12]]></title>
    <link rel="alternate" type="text/html" href="http://blakeley.com/blogofile/2012/05/16/albummixer-v1.12" />
    <id>http://blakeley.com/blogofile/2012/05/16/albummixer-v1.12</id>
    <updated>2012-05-16T10:40:00Z</updated>
    <published>2012-05-16T10:40:00Z</published>
    <category scheme="http://blakeley.com/blogofile" term="iOS" />
    <summary type="html"><![CDATA[AlbumMixer v1.12]]></summary>
    <content type="html" xml:base="http://blakeley.com/blogofile/2012/05/16/albummixer-v1.12"><![CDATA[<p><a href="https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=329552764">AlbumMixer v1.12</a>
fixes a minor bug where the player state would be wrong
when returning from background mode.</p>
<p>If you see any problems with this release,
please use <code>Settings &gt; Report a Problem</code> from within the app.
I will also read comments posted here.</p>]]></content>
  </entry>
</feed>
