Miso-Wasabi Dip

July 07, 2023 at 12:34 PM | categories: food, recipes | View Comments

Good with avocado slices or blanched vegetables.

  • 3-T sake
  • 1-T white miso
  • 2-tsp prepared wasabi (use more or less to taste)
  • 1.5-tsp toasted black sesame seeds

Whisk the sake, miso, and wasabi, then add sesame seed.

Keep refrigerated.

Read and Post Comments


April 22, 2023 at 12:59 PM | categories: solar | View Comments

I've been working on a simple web service to help maximize self-consumption of solar power, with the help of Emporia Energy's VUE Utility Connect. Today seems like a good day to release it: I hope it's useful!

mblakele/solara on GitHub


Sambal Oelek

April 21, 2023 at 12:34 PM | categories: food, recipes | View Comments

Remember the Great Huy Fong Shortage of 2022-23? Here's my homemade version of Sambal Oelek:

  • 8-oz fresh red chilis, Thai or Vietnamese long type
  • 6-T (3-oz) rice vinegar
  • 1-T salt
  • 12-oz glass jar

Remove the stems from the chilis and rinse them, then transfer to a blender or food processor. Add the vinegar and salt. Blend until smooth, adding 1-3 T water if needed. Transfer to a small pan and simmer for 5-10 minutes. Finally, transfer to the glass jar.

Keep refrigerated.



December 14, 2015 at 12:34 PM | categories: food, recipes | View Comments

Toum is traditional with Middle Eastern food, especially chicken. But it's also great on pizza or pasta, or as a dip for vegetables. Or use it to flavor rice dishes. Heck, try it with cardboard!

I first had toum (ثوم) at Zankou Chicken, where they call it "garlic sauce". Not knowing the name made it tricky to find a recipe, but I persevered. Here's what I've settled on:

  • 3 heads of garlic, peeled (about 36-42 cloves)
  • 2 tsp salt
  • 2-3 Tbl lemon juice
  • 9 oz vegetable oil (soybean works well)

Peel the garlic and combine it with the salt and lemon juice in a blender. Work it into a fine paste. Add the oil a little at a time, building up a white, creamy emulsion. Use a little less lemon juice for a thicker texture, like whipped butter. Use more lemon juice if, like John Cleese, you like it runny.

Refrigerate for an hour or two before using, so the lemon juice can work on the garlic. Makes about a pint of garlicky goodness. Keep it refrigerated in a sealed container. Keeping toum in a condiment squeeze tube can be a lot of fun: apply garlic anywhere, anytime, with pinpoint precision.

You may be tempted to try olive oil instead of soybean. Don't bother: it doesn't emulsify very well. I think this is because of the chemistry of the fatty acids involved, but I don't even play a chemist on TV. Avoid canola oil too, because it makes the toum taste funny.

Emulsions have a reputation for being finicky, but this one seems pretty reliable for me. Just watch out for hot days or overworking the blender. Heat breaks down emulsions pretty easily. So does freezing. If you cook with toum, the heat will break down the emulsion. Sure, it'll break down into garlic and oil, and you might say you haven't lost anything. But why did you bother getting out the blender and making toum, if all you wanted was garlic and oil?

Some toum recipes call for extra filler: potato or breadcrumbs, etc. Feel free to try that. You can also try mixing in herbs or spices. Or just sprinkle whatever you feel like on top of your plateful of yum.


Deduplicating Search Results

June 08, 2014 at 12:34 PM | categories: XQuery, MarkLogic | View Comments

So you're writing a MarkLogic application, and this question comes up: How do we deduplicate search results?

In one sense MarkLogic will never return duplicates from a database lookup. A single XPath, cts:search, or search:search expression will always return unique nodes, as defined by the XPath is operator.

But your application might have its own, content-dependent definition of a duplicate. This might depend on just a subset of the XML content. For example you might be storing news articles pulled from different media sources: newspapers, magazines, blogs, etc. Often the same news story will appear in different sources, and sometimes the text will be identical or extremely close. When a user searches for the hot story of the day you want to have all the variations available, but the search results should roll them up together on the page. You can see something like this if you search Google News.

One good strategy is to avoid duplicates entirely, by ensuring that your documents have meaningful URIs. Construct the URI using the same information that determines whether or not a document is a duplicate. This way if content arrives that duplicates existing content, it turns out to have the same URI. Then you are free to update the database with the latest copy, ignore it, throw an error, or call for help. If every news story has a dateline and an author byline, we could construct document URIs based on the date, location, and byline: something like /news/2014/05/30/washington/jones.
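As a rough sketch, a URI constructor along those lines might look like this (the article element layout here is entirely hypothetical):

```xquery
xquery version "1.0-ml";

(: Hypothetical sketch: build a document URI from the same facts
 : that define a duplicate. Element names are made up for illustration. :)
declare function local:article-uri(
  $article as element(article))
as xs:string
{
  string-join(
    ("/news",
     string($article/dateline/date),        (: e.g. "2014/05/30" :)
     lower-case($article/dateline/location),
     lower-case($article/byline/surname)),
    "/")
};
```

With that in place, any duplicate content naturally collides on the same URI at ingest time.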

But maybe that isn't a very good solution for our news application. Remember that we want to search for articles, but we only want one article per story. So we have to store all the duplicate articles, and we need query-time flexibility to display just one article per story.

Clearly we will need to generate a story-id for each story, one that remains the same no matter how different articles are presented. That might use a mechanism similar to the URI computation above, except that we would put the result in an element and it would not be unique. We could use the same facts we were going to use in the document URI:
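That computation might look something like this sketch, which wraps the same facts into a story-id element at ingest time (element names are hypothetical):

```xquery
xquery version "1.0-ml";

(: Hypothetical sketch: add a story-id element built from the same
 : facts that would have gone into the document URI. :)
declare function local:add-story-id(
  $article as element(article))
as element(article)
{
  element article {
    $article/@*,
    element story-id {
      string-join(
        (string($article/dateline/date),
         lower-case($article/dateline/location),
         lower-case($article/byline/surname)),
        "-") },
    $article/node() }
};
```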


Once we have our application set up to generate story-id elements, we could try a brute-force approach. Search the database, then walk through the search results. Extract each story-id value and check it against a list of previously-seen story-id values. We could use a map for that. If the story-id has already been seen, ignore it. Otherwise put the story-id in the map and return the article.

  (
    let $search-results := search:search(...)
    let $seen := map:map()
    for $article in $search-results
    let $story-id as xs:string := string($article/story-id)
    where not(map:contains($seen, $story-id))
    return (
      map:put($seen, $story-id, $story-id),
      $article)
  )[$start to $stop]

But there are problems with this approach. Pagination is tricky because we don't know how many duplicates there will be. So we have to ask the database for a lot of results, maybe all of them at once, and then filter and paginate in user code. This gets more and more expensive as the result size increases, and trickier to manage as the user paginates through the results. If a search matches a million articles, we might have to retrieve and check all the matches before we can display any results. That's going to be slow, and probably limited by I/O speeds. Nowadays we could throw SSD at it, but even SSD has limits.

Another problem with the brute-force approach is that facets generated by the database will not match the deduplicated results. You might have an author facet that shows 1000 articles, while deduplication filters out all but 100 of them.

So let's look at another approach. Instead of deduplicating after we search, let's deduplicate before we search. That might sound crazy, but we have a couple of powerful tools that make it possible: cts:value-co-occurrences and cts:document-query. The idea is to deduplicate based on the co-occurrence of story-id and document URI, without retrieving any documents. Then we query the database again, this time fetching only the non-duplicate documents that we want to return.

Each article is stored as a document with a unique document URI. We enable the document URI lexicon and we also create an element-range index on the element named story-id. As described above, we construct a story-id for every article as it arrives and add it to the XML. This story-id is our deduplication key: it uniquely identifies a story, and if multiple articles might have the same story-id value then they are treated as duplicates.

A deduplication key is application-specific, and might be anything. An application might even have multiple deduplication keys for different query types. However it's essential to have a deduplication key for every document that you want to query, even if only some documents will have duplicates. The technique we're going to use will only return documents that have a deduplication key. An article with no story-id simply won't show up in the co-occurrence results, so it won't show up in search results either.

Here's some code to illustrate the idea. Start with $query-original, which is the original user query as a cts:query item. We might generate that using search:parse or perhaps the xqysp library.

(: For each unique story-id there may be multiple article URIs.
 : This implementation always uses the first one.
 :)
let $query-dedup := cts:document-query(
  let $m := cts:value-co-occurrences(
    cts:element-reference(xs:QName("story-id")),
    cts:uri-reference(),
    "map",
    $query-original)
  for $key in map:keys($m)
  return map:get($m, $key)[1])
(: The document-query alone would match the right articles,
 : but there would be no relevance ranking.
 : Using both queries eliminates duplicates and preserves ranking.
 :)
let $query-full := cts:and-query(($query-original, $query-dedup))

Now we can use $query-full with any API that uses cts:query items, such as cts:search. In order to match, an article will have to match $query-original and it will have to have one of the URIs that we selected from the co-occurrence map.
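For example, a minimal paginated search might look like this sketch:

```xquery
(: Sketch: relevance-ranked, deduplicated search results,
 : paginated with a simple range predicate. :)
for $article in cts:search(collection(), $query-full)[$start to $stop]
return $article
```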

Instead of calling cts:search directly, we might want to use search:resolve. That function expects a cts:query XML element, not a cts:query item. So we need a little extra code to turn the cts:query item into an XML document and then extract its root element:

return search:resolve(
  document { $query-full }/*,
  $options)  (: $options is your search:options element :)

Many search applications also provide facets. You can ask search:resolve for facets by providing the right search options, or you can call cts:values yourself. Note that since facets are not relevance-ranked, it might be a little faster to use $query-dedup instead of $query-full.
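For example, an author facet might look something like this sketch (assuming an element-range index on a hypothetical author element):

```xquery
(: Sketch: compute an author facet over the deduplicated results.
 : Facets are not relevance-ranked, so the cheaper $query-dedup
 : is enough here. :)
for $value in cts:values(
  cts:element-reference(xs:QName("author")),
  (), "frequency-order",
  $query-dedup)
return element facet-value {
  attribute count { cts:frequency($value) },
  $value }
```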

Speaking of performance, how fast is this? In my testing it added an O(n) component, linear with the number of keys in the cts:value-co-occurrences map. With a small map the overhead is low, and deduplicating 10,000 items only adds a few tens of milliseconds. But with hundreds of thousands of map items the profiler shows more and more time spent in the XQuery FLWOR expression that extracts the first document URI from each map item.

  let $m := cts:value-co-occurrences(
    cts:element-reference(xs:QName("story-id")),
    cts:uri-reference(),
    "map",
    $query-original)
  for $key in map:keys($m)
  return map:get($m, $key)[1]

We can speed that up a little bit by trading the FLWOR for function mapping.

declare function local:get-first(
  $m as map:map,
  $key as xs:string)
as xs:string
{
  map:get($m, $key)[1]
};

let $m := cts:value-co-occurrences(
  cts:element-reference(xs:QName("story-id")),
  cts:uri-reference(),
  "map",
  $query-original)
return local:get-first($m, map:keys($m))

However this is a minor optimization, and with large maps it will still be expensive to extract the non-duplicate URIs. It's both faster and more robust than the brute-force approach, but not as fast as native search.

Pragmatically, I would try to handle these performance characteristics in the application. Turn deduplication off by default, and only enable it as an option when a search returns fewer than 100,000 results. This would control the performance impact of the feature, providing its benefits without compromising overall performance.
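A sketch of that guard, using xdmp:estimate for a fast match count before deciding (the threshold, the flag, and local:dedup-query are all made up; the latter stands in for the co-occurrence logic shown earlier):

```xquery
(: Sketch: gate deduplication on a fast match-count estimate. :)
let $threshold := 100000
let $estimate := xdmp:estimate(cts:search(collection(), $query-original))
let $query-final :=
  if ($dedup-requested and $estimate lt $threshold)
  then cts:and-query(
    ($query-original, local:dedup-query($query-original)))
  else $query-original
return cts:search(collection(), $query-final)[$start to $stop]
```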

It's also tempting to think about product enhancements. We could avoid some of this work if we could find a way to retrieve only the part of the map needed for the current search page, but this is not feasible with the current implementation of cts:value-co-occurrences. That function would have to return the co-occurrence map sorted by the score of each story-id. That's tricky because normally scores are calculated for documents, in this case articles.

One way to speed this up without changing MarkLogic Server could be to move some of the work into the forests. MarkLogic Server supports User-Defined Functions: C++ functions that run directly on range indexes. I haven't tried this approach myself, but in theory you could write a UDF to deduplicate based on the story-id and URI co-occurrence, then call it with cts:aggregate. This would work best if you could partition your forests by story-id value, so that articles with duplicate story-ids are guaranteed to be in the same forest. Used carefully, this approach could be much faster, possibly allowing fast deduplication across millions of URIs.

For more on that idea, see the documentation for Tiered Storage and the UDF plugin tutorial. If you try it, please let me know how it works out.
