Attn Browser and phone makers: Please, please support the new HTML5 input types (date, time, slider, color, etc.)
I’ve been quite involved in mobile web development recently and of course very concerned about page speed, page weight, and user experience. Forms and input controls are an essential part of engaging with others on the web. There are some frameworks and libraries that have produced some elegant input controls over the years, like jQuery UI and YUI, but they come with a price.
If I want to include a datepicker on my webpage, I have to include a lot of javascript code if the browser doesn’t natively support input type=“date”. In the mobile world, javascript is more expensive than on the desktop, both in download time and in parsing time. You can cache a 100KB javascript file, but that code will still be parsed on every page load, which can add a couple hundred milliseconds (200ms for me to parse jQuery and jQuery UI). So to have a mature datepicker control, I may take a pretty significant performance hit.
There are tricks that you can do, like the way jQuery Mobile never actually does a page load after the first page but rather uses ajax to insert a new page into the DOM. But I have a better idea: how about the browsers and phones just support the new HTML5 input types, and then I don’t have to do a thing! And it will be cross-browser compatible! And be a familiar interface to the user!
Opera and the iPhone are the only ones that support the new HTML5 inputs, and even they support only some of them. I have several mobile phones for testing: iPhone, Android, Windows Phone 7, Blackberry and feature phones. They all have strong points. But I do not understand the phone wars when THEY DON’T EVEN SUPPORT THE NEW HTML5 INPUT CONTROLS but rather spend their time and effort creating new themes and menus or whatever.
And while I’m at it, it turns out that only the iPhone can really do CSS3 transitions. The other phones may technically support them, but the processor is so weak that the transitions don’t look right. Seriously? What are you Android and Windows Phone developers thinking you’re doing? In all your dev meetings no one says: “Uh, why don’t we actually support HTML5?” You can make Live Tiles and the Metro Interface, but not a datepicker? You can make animated backgrounds and slick system menus but you can’t make CSS animate or flip look anything close to decent? We can do local storage, Web Workers, and geolocation, but not the input types? Is a slider really that hard?
There are frameworks, like Sencha and jQuery Mobile, that are “cross-browser” and “cross-phone” compatible, but that doesn’t mean you get the full experience across all devices. It means that they degrade elegantly on those devices that don’t have support for the features you want to use. And except for the iPhone, everything degrades to the point that I don’t even know why I need the framework anymore, because it’s just HTML pages with CSS and fade transitions between pages. But even the code to degrade has a cost, since it has to determine whether and how to degrade, and in my testing this results in lag, choppiness, and screen flicker.
So browser and phone makers: Please, please support the new HTML5 controls and if you’re feeling particularly benevolent, the CSS3 transitions. I can’t imagine it’s much harder than other things you are doing. Plus it makes it much easier on mobile web developers, and makes for a better experience for the user.
Until this happens, I really can’t recommend anything but the iPhone to people. It is so far ahead of the others in HTML5 that there’s no comparison. We are not talking about taste or preference in the way the iPhone handles web applications, or that it does them better. It’s the only one that can do certain things, like CSS3 transitions and the new HTML5 input controls. The others just don’t do them. And for anyone who thinks I’m a fanboy, go carry around an iPhone, an Android, and a Windows Phone 7 for a week each, then develop a web app to work on each, and then come back and tell me I’m wrong.
Using a Semantic Data System for Genealogy Data
Last spring at my brother’s wedding I had an opportunity to talk to my aunt who is getting a degree in Genealogy from Brigham Young University. Far from being a quaint little hobby of grandmas, genealogy is some serious business to professionals in the industry. My aunt told me about a project she had been working on where she had filled up four full-sized whiteboards completely with information in the effort to come to conclusions for her project. Naturally I asked about what computer applications she uses and what format the data is in. That’s when the rush of frustration was unleashed on the state of the technology for genealogy.
While there are some good sites that host indexes that you can search, there really was nothing that met her needs for compiling and coalescing the information and for capturing and presenting her conclusions. The amateur software was obviously not sufficient for what she needed, and there seemed to be no good data format for any of the data except pedigrees, which are actually the result of research, i.e. the conclusion, not the pieces of data that lead up to determining a pedigree.
I told my aunt that I imagined that genealogy data is pretty messy: often incomplete, inconsistent, and contradictory, and varying in level of detail, quality, certainty, and authority. Rather than trying to fit messy data into defined buckets or categories or pedigrees, what there really needs to be is something that embraces the messiness and just captures everything, and what the system produces is a probable profile of an individual. Individuals and families become more of a statistical probability of a convergence of loose data points, rather than a “Person” in the system. This may sound a little harsh and impersonal, but we often don’t have data sufficient to make an authoritative call on what a “Person” is based on the data available. So we have to switch to thinking of the data as revealing information about an individual with a non-definite amount of certainty.
My aunt of course lit up at this, and after I said I have experience in a system that can do just that, I was invested. I bought several books on genealogy for professionals and my understanding of them and the field increased tremendously. A professional genealogist is part historian, part biographer, part information analyst, part linguist, part author, part forensic expert, and other things too I’m sure. Evidence is paramount. If you can’t provide sources and correct citations for the information and conclusions you’ve drawn, then you have no conclusions that anyone can accept. The most common publishing avenue is in document form, as in reports, books, and articles. The sharing of the data itself is not where it needs to be, often being trapped in footnotes, in-document tables and charts, and of course whiteboards (or photos of whiteboards).
The RootsTech Developer Challenge
The RootsTech Developer Challenge was a good opportunity for me to turn into reality the ideas I have had on how to address the need for improved genealogy systems. The challenge is billed more for “apps” to enhance the public’s engagement with genealogy and to foster increased interest in the field. I did not think the problem or next step is better apps. I think the problem is how the data is handled. Even amazing apps aren’t going to be able to compensate for a weak system for handling the data. What I thought is that we need a fundamentally different way to handle genealogy data and embrace the messiness.
What we needed was a system that is designed to take in data on just about anything, from individuals to locations, from factoids of an individual’s life to full and complete pedigrees. We needed a system that did not seek the one correct data set on a person but rather had a high tolerance of uncertainty, contradiction, and even error. Some genealogy systems in the past sought to create a single pedigree for everyone, but what ended up happening is that people overwrote others’ previous entries and errors that were introduced were perpetuated, causing a gradual degradation of the quality of the data in the system. We needed a system that could take in anything and, with quality and certainty measures, tag the data points so users could know the quality of the data. And we need the users themselves to be able to set some of the quality and certainty measures.
We also need users to be able to upload the results of their work, research and conclusions, not just in report or final form, but even the fragments and bits that they have. Users need to be able to make these data public, keep them private, or share them with others that they choose. Users need to be able to incorporate these data in their searches, and they need to be able to pick and choose what kind of data (public, private or shared) when they search. They also need to be able to choose the quality and certainty level of data they want to include in their search to improve the signal to noise ratio of their results.
And why can’t the system go through the data and infer some results? If John Smith was divorced from Alice in 1834, why can’t the system infer that John Smith is male, was married before 1834, and that his wife was probably named Alice Smith? And why can’t the system infer that the John Smith in that divorce record is the same John Smith in another census record that says he lived in Pleasantdale, Maine and had two kids, or at least provide a level of probability? Why can’t the information from these two records then be combined when I search for “Smith divorced Maine”, so that I get a hit showing me all the combined information for that person with the quality and certainty levels displayed?
Why can’t users add their conclusions, assertions, and inferences, too? Why can’t Sally add that she found an obituary for John Smith in Pleasantdale in an old microfiched copy of a newspaper and it says he was survived by three children? And after all these derivative data points are added in the system, why can’t I trace each one of them back to their original sources, with contact information of the people who captured or entered the information?
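As a rough illustration of the last two paragraphs, here is a minimal Python sketch of inferred facts that keep a pointer back to the record they were derived from. The `Fact` class, the `derived_from` field, the predicate names, and the single inference rule are all assumptions made for this example; they are not taken from the system described in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str                            # where the fact came from
    derived_from: Optional["Fact"] = None  # set only on inferred facts


def infer_from_divorce(fact: Fact) -> List[Fact]:
    """Derive extra facts from a 'divorced-from' record (illustrative rule only)."""
    if fact.predicate != "divorced-from":
        return []
    return [
        Fact(fact.subject, "was-married-to", fact.obj, "inference", fact),
        Fact(fact.obj, "was-married-to", fact.subject, "inference", fact),
    ]


def original_source(fact: Fact) -> str:
    """Follow derived_from links back to the record that was actually entered."""
    while fact.derived_from is not None:
        fact = fact.derived_from
    return fact.source


# Hypothetical record keys, made up for the example.
divorce = Fact("john-smith-1", "divorced-from", "alice-1", "maine-divorce-records-1834")
inferred = infer_from_divorce(divorce)
print(original_source(inferred[0]))
```

Because every inferred fact carries its parent, any derived data point can be walked back to the original record, no matter how many inference steps sit in between.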
It might sound nice but pie-in-the-sky, but that’s what I did and that’s what I entered into the RootsTech Developer Challenge. Granted the application is ugly (think Geocities…beta) and I’m sure it has bugs, but everything I described above is included and it works and works fast (all subsecond with a 4GB database), with the exception of being able to include certainty levels in searches. I don’t expect it to win and I am kind of done with that challenge because I feel like I figured it out and now I want to move on with it, but I enjoyed the challenge.
How it Works
The data model was the single most important factor in getting this system to work. Relational data models with tables and columns are just insufficient because the data could be virtually anything and in virtually any structure, and that just doesn’t fit in tables and columns. XML is better because it allows any structure in a document, with any field and any values, but that still doesn’t make searching the data any easier. To search the data and make sense of the values you need to know ahead of time what fields are available. A Semantic Data Model provides a way for any data to be of any type and associated with any other thing, be it a value or an entity, so that’s the model I chose.
The system I built at its heart is a Semantic Data system with some modifications. Semantic data is in Subject – Predicate – Object form. The Subject is anything; often it is an id or key of some “thing”, but it can be anything. The Predicate is the “type” of relationship between the Subject and the Object. A Predicate can be for a value (“Eye color is”) or it can be for the relationship between two things (“is Father of”). The Object is either a value (“Blue”) or the id or key of some entity (“123-John-Smith”). The Predicate usually tells you if the Object is a value or another entity, which would probably be used elsewhere as a Subject in some other fact.
This kind of triple, S-P-O, is called a tuple and can be used to model virtually any kind of data. For genealogy, there may be a tuple like:
“123456789” “first-name” “John”
and another tuple
“123456789” “last-name” “Smith”
and another
“123456789” “married-to” “4443333222”
and another
“4443333222” “birthplace” “Chicago”
and another
“11111111” “father-of” “4443333222”
and another
“11111111” “last-name” “Grant”
etc.
So in the above example, we know that there is someone with a name of “John Smith” who married someone who was born in Chicago and whose father’s last name was Grant. Even without complete information, we can “walk the graph” and go from person to person using relationships defined in the data and view the data we have for each person at each step. Note too that the relationships can be bi-directional. We may know that A is the father of B or we may know that B is the daughter of A. Either way we can walk the relationship graph and infer the parent-child relationship even if that relationship was originally defined in only one direction. Better yet, the system can make the inference and add that inference as a new tuple into the system.
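Walking the graph and adding inverse relationships can be sketched in a few lines of Python. This is only an in-memory illustration using the example tuples; the `INVERSES` table and the “child-of” predicate are assumptions for the example, not predicates from the actual system.

```python
# Triples from the example: (subject, predicate, object).
triples = {
    ("123456789", "first-name", "John"),
    ("123456789", "last-name", "Smith"),
    ("123456789", "married-to", "4443333222"),
    ("4443333222", "birthplace", "Chicago"),
    ("11111111", "father-of", "4443333222"),
    ("11111111", "last-name", "Grant"),
}

# Hypothetical inverse rules; a real system would need many more.
INVERSES = {"father-of": "child-of", "married-to": "married-to"}


def add_inverses(facts):
    """Return the facts plus the inverse of every relationship we have a rule for."""
    inferred = {(o, INVERSES[p], s) for (s, p, o) in facts if p in INVERSES}
    return facts | inferred


def objects(facts, subject, predicate):
    """All objects recorded for a given subject and predicate."""
    return sorted(o for (s, p, o) in facts if s == subject and p == predicate)


graph = add_inverses(triples)

# Walk: John -> spouse -> spouse's father -> father's last name.
spouse = objects(graph, "123456789", "married-to")[0]
father = objects(graph, spouse, "child-of")[0]
father_last_name = objects(graph, father, "last-name")[0]
print(spouse, father, father_last_name)
```

The walk only works in this direction because the inferred “child-of” tuple was added; the original data records the relationship from the father’s side only.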
For a collaborative semantic genealogy data system, Subject – Predicate – Object is actually insufficient. Three other aspects need to be added: Time, Source, and Quality. If we know that an 1850 census says John Smith was 34, we want to capture that original data as-is, with no interpretation, in order to preserve the data integrity. So we would have:
“12345-page-4-line-2” (or whatever we decide the subject should be) “Age is” “34”
But that fact was not always true, only when the census worker captured the information. So we need to add Time, which in this case is “1850”, or as much as we know. So now we have Subject – Predicate – Object – Time (S-P-O-C):
“12345-page-4-line-2” “Age is” “34” “1850”
But what if that data comes from something written in someone’s Bible? We want to capture that too. So now we’ll have Subject – Predicate – Object – Time – Source (S-P-O-C-E):
“12345-page-4-line-2” “Age is” “34” “1850” “Alison-Grant-Family-Bible”
The source would be a key to some other data that would have detailed information about the source and contact information for whoever got the information from the source. But how reliable is this data? Is it for the right John Smith? Is the information legible in the Bible? Is it firsthand, secondhand, or more? We need a quality value, so now we have Subject – Predicate – Object – Time – Source – Quality (S-P-O-C-E-Quality):
“12345-page-4-line-2” “Age is” “34” “1850” “Alison-Grant-Family-Bible” “4”
This assumes a scale from 1 to 10, where 10 is absolutely certain. Now we have a model that can capture virtually any data, for anything, true at a particular time, of any quality, and noting the source. For sharing and privacy we can either add those into the tuples, or use system permission controls to secure the tuples themselves.
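To make the extended tuple concrete, here is a small Python sketch of the six fields and a quality filter. The field names `c` and `e` follow the S-P-O-C-E abbreviations; the name `q`, the `at_least` helper, and the second record (a made-up census scan) are assumptions added for illustration.

```python
from typing import List, NamedTuple


class Fact(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object
    c: str  # time at which the fact was recorded as true
    e: str  # key of the source
    q: int  # quality, from 1 (doubtful) to 10 (absolutely certain)


facts = [
    Fact("12345-page-4-line-2", "Age is", "34", "1850", "Alison-Grant-Family-Bible", 4),
    # Hypothetical second record that contradicts the first one.
    Fact("12345-page-4-line-2", "Age is", "36", "1850", "hypothetical-census-scan", 8),
]


def at_least(all_facts: List[Fact], min_quality: int) -> List[Fact]:
    """Keep only facts whose quality meets the caller's threshold."""
    return [f for f in all_facts if f.q >= min_quality]


best = at_least(facts, 7)
print(best)
```

Both contradictory records stay in the store; the threshold only decides which ones a particular search wants to see.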
Yeah But How Can This Be Fast?
We covered conceptually how to model the data to achieve the goals I listed at the beginning. Fortunately there are several products in the industry for Semantic or Graph Databases. I chose to use MarkLogic because it was familiar to me, I knew it was extremely fast and can scale to the petabyte level, and most importantly it has very advanced language tools. Other graph databases may be fast at linking different pieces of data together, but I also needed one that was fast when searching for first names that sound like “John”, or contain “John”, or are spelled similar to “John.” I needed something that could find word stem hits, so that when I search for “run” I get “ran”, “runs”, and “running” also. For this free text searching (which is almost entirely against the Object in the tuple) I need a powerful search system. So MarkLogic gave me both the free text searching and the fast data linking, plus security down to the user and tuple level, all of which can scale to petabytes. I suppose this system could be built on other platforms, but I’ll leave it to the reader as an exercise to prove out.
The implementation details and actual code used can be pretty complicated and lengthy, but for those interested, almost all the magic is in the prolific use of cts:queries in MarkLogic. For searches such as “Family of John Smith” there is a Predicate Resolution step which builds a list of all Predicate types that are part of a “family” type, which is determined by rules (“husband-to”, “brother-of” etc). Then this Predicate Set is sent as exact values to match in the P. The queries against Objects (O) are stemmed, case-insensitive, diacritic-insensitive, punctuation-insensitive, and whitespace-insensitive. I also created a Double Metaphone custom index on all the values for Objects of Predicate types of names and locations. So I took all unique values of O for those types and created a file for each type of the calculated Double Metaphone value, which MarkLogic provides an API for. Then the Object Resolution step takes the original value typed plus all the values from the Double Metaphone index (which are existing values in O) that have the same Double Metaphone value, and includes them all in the O set sent to the query. All linking and joins are done through the Subject (S). For tuples that have been linked “Same-As”, I query all Objects, get their subjects, and filter them to the ones that have Same-As matches within the result set. This is the Subject Resolution Step. And voilà, I have my results.
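The Object Resolution idea can be sketched outside MarkLogic as a phonetic index: group the existing values by a sound code, then expand the typed term to every stored value that shares its code. The Python below deliberately swaps in a simplified Soundex as a stand-in for Double Metaphone, which is considerably more involved; the function names and the sample values are assumptions for the example.

```python
from collections import defaultdict

# Letter groups for a simplified Soundex (a stand-in, not Double Metaphone).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Four-character phonetic key: first letter plus up to three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


def build_index(values):
    """Map each phonetic key to the set of stored values that produce it."""
    index = defaultdict(set)
    for value in values:
        index[soundex(value)].add(value)
    return index


def resolve_objects(term, index):
    """The typed term plus every stored value that sounds like it."""
    return sorted({term} | index.get(soundex(term), set()))


name_index = build_index(["John", "Jon", "Jahn", "Smith", "Smyth", "Grant"])
print(resolve_objects("Jon", name_index))
print(resolve_objects("Smith", name_index))
```

The expanded list is what would then be passed as the set of Object values to match, so the query itself can stay an exact-value lookup.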
Relevancy is not so good yet, and all the sharing and access control I’ve done via yet another field to the tuple: Sharing. But I’m rethinking that and considering using document permissions instead. Below are a couple of functions so people can see some of the code. It is most of the code in a free text search field that expects that you are going to include a first name and last name in your search.
What’s Left?
I don’t think I’ve answered what my aunt needs for her individual research, but this provides the engine and capability to build off of. There is lots of application interface coding left to do, too. Getting data for this has actually been pretty hard. Any genealogy data dumps I could get were on CDs from the 90s in Infobases format, with threatening EULAs about not using this data in another system. So I didn’t use that. Oddly there is a county in Maine that has a decent amount of census data online (trapped in HTML pages): Aroostook County. The State of Maine has several databases posted online (mostly in Access databases) like military records, divorces, court proceedings, and Revolutionary War Land Grants. I spent about half of my time getting the data and processing it into a format I could use, eventually in semantic form.
This system can be used as a way for researchers to enter, share, and publish their work and conclusions, and they can do so without any original data being in the system. They can make the Subjects individuals on the big genealogy sites, and they can make the sources point to those specific URLs elsewhere. This would then effectively be a big metadata system. The data need not be limited to name and date information, but can include photographs, scans, video, PDFs, and any other kind of data. But I really hope I can either get data or work with one of the big websites to incorporate these capabilities. It’s just time and money.
(: Free-text search: expects the phrase to include a first name and a last name. :)
declare function free-search($phrase, $sources, $start, $end) {
  (: Split the phrase into non-empty search terms. :)
  let $terms := fn:tokenize($phrase, " ")[. != ""]
  (: Narrow to subjects that match every term, then take the requested page. :)
  let $subjects := reduce((), $terms)[$start to $end]
  (: Look up subjects linked by Same-As. Note: $subject is not referenced again below. :)
  let $subject := get-all-same-as-subjects($subjects)
  (: Restrict the subjects using the source query. :)
  let $subjects :=
    cts:search(/t,
      cts:and-query((
        cts:element-value-query(xs:QName("s"), $subjects, "exact")
        ,
        get-cts-search-source-query($sources)
      ))
    )/s/text()
  let $subjects := fn:distinct-values($subjects)
  return
    <results>
    {
      for $s in $subjects
      (: First recorded family name and given name for this subject. :)
      let $family-name-hit := (/t[s = $s][p = $type:person-family-name]/o/text())[1]
      let $given-name-hit := (/t[s = $s][p = $type:person-given-name]/o/text())[1]
      let $name := fn:concat($given-name-hit, " ", $family-name-hit)
      (: Tuples whose object matches a search term, shown as highlights. :)
      let $query :=
        cts:and-query((
          cts:element-value-query(xs:QName("s"), $subjects, "exact")
          ,
          cts:element-word-query(xs:QName("o"), $terms)
          ,
          get-cts-search-source-query($sources)
        ))
      let $t := cts:search(/t, $query)
      return
        <hit>
          <s>{$s}</s>
          <name>{$name}</name>
          {
            for $at in $t
            return
              <highlight>
                {$at/p}
                {$at/o}
                {$at/c}
                {$at/e}
              </highlight>
          }
        </hit>
    }
    </results>
};
(: Recursively narrow the subject list, one search term at a time. :)
declare function reduce($subjects, $terms) {
  let $matched-subjects :=
    if (fn:empty($subjects))
    then
      (: No subjects yet: match the first term against all objects. :)
      cts:search(/t,
        cts:element-word-query(xs:QName("o"), $terms[1])
      )/s/text()
    else
      (: Otherwise only keep subjects that also match this term. :)
      cts:search(/t,
        cts:and-query((
          cts:element-value-query(xs:QName("s"), $subjects, "exact")
          ,
          cts:element-word-query(xs:QName("o"), $terms[1])
        ))
      )/s/text()
  return
    if (fn:count($terms) = 1)
    then fn:distinct-values($matched-subjects)
    else reduce(fn:distinct-values($matched-subjects), $terms[2 to fn:last()])
};
declare function get-all-same-as-subjects($subject) {
  get-same-as-subjects($subject, ())
};

(: Follow Same-As links in both directions until no new subjects turn up. :)
declare function get-same-as-subjects($check-subjects, $found-subjects) {
  let $other-subjects := fn:distinct-values(
    for $check-subject in $check-subjects
    return
      cts:search(/t,
        cts:and-query((
          cts:element-value-query(xs:QName("p"), $type:person-same-as, "exact")
          ,
          cts:or-query((
            cts:element-value-query(xs:QName("o"), $check-subject, "exact")
            ,
            cts:element-value-query(xs:QName("s"), $check-subject, "exact")
          ))
        ))
      )/(s|o)/text()
  )
  (: Drop any subject that has already been found. :)
  let $new-subjects :=
    for $other-subject in $other-subjects
    return
      if ($other-subject = $found-subjects)
      then ()
      else $other-subject
  return
    if ($new-subjects)
    then get-same-as-subjects($new-subjects, ($new-subjects, $found-subjects))
    else $found-subjects
};
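For readers who don’t know XQuery, the two core ideas in the functions above (narrowing subjects one term at a time, and following Same-As links until nothing new turns up) can be sketched in plain Python over an in-memory list of tuples. This is an approximation, not a port: it is iterative rather than recursive, it seeds the Same-As result with the starting subjects, it matches whole words only, and the sample data and the “same-as” predicate name are made up for the example.

```python
def reduce_subjects(triples, terms):
    """Keep only subjects that have a matching object word for every term."""
    subjects = None  # None means "no narrowing has happened yet"
    for term in terms:
        subjects = {
            s for (s, p, o) in triples
            if term.lower() in o.lower().split()
            and (subjects is None or s in subjects)
        }
    return subjects or set()


def same_as_closure(triples, start):
    """All subjects reachable from start through same-as links, either direction."""
    found = set(start)
    frontier = set(start)
    while frontier:
        linked = set()
        for (s, p, o) in triples:
            if p == "same-as":
                if s in frontier:
                    linked.add(o)
                if o in frontier:
                    linked.add(s)
        frontier = linked - found  # only subjects we have not seen before
        found |= frontier
    return found


# Hypothetical sample data.
sample = [
    ("p1", "given-name", "John"), ("p1", "family-name", "Smith"),
    ("p2", "given-name", "John"), ("p2", "family-name", "Grant"),
    ("p3", "family-name", "Smith"),
    ("p1", "same-as", "p4"), ("p5", "same-as", "p4"),
]

matches = reduce_subjects(sample, ["john", "smith"])
linked = same_as_closure(sample, matches)
print(sorted(matches), sorted(linked))
```

Here p1 is the only subject matching both terms, and the closure pulls in p4 and then p5, even though p5 is linked to p1 only indirectly.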
Code for translating content using Google Translate
Often I use machine translated content when I am developing so that I have content in various languages before I get the official translation. I’m not too concerned that the translation is correct, just as long as it is representative of content in that language. I’ve found that creating a website from the beginning in at least two different languages helps me avoid coding myself into a box and making assumptions that are only valid in a single-language website, particularly in my queries for content in the DB and in how the web page layout handles strings of different lengths.
I usually create an initial resource bundle of some sort in English and then use Google Translate to translate the strings in the bundle into various target languages. I automate this so that I just push a button or have the script execute in a trigger so that when I change or add to the resource bundle in English I can re-translate it easily.
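The re-translate-on-change step can be sketched as follows. This Python example is only an illustration of the workflow: `translate_bundle` and `fake_translate` are hypothetical names, the translator is passed in as a plain function so no real service is called, and the cache format is an assumption.

```python
def translate_bundle(bundle, translate, source_lang, target_lang, previous=None):
    """Translate every string in a bundle, reusing entries whose source text is unchanged."""
    previous = previous or {}
    result = {}
    for key, text in bundle.items():
        cached = previous.get(key)
        if cached and cached["source"] == text:
            result[key] = cached  # unchanged since last run, keep old translation
        else:
            result[key] = {
                "source": text,
                "target": translate(text, source_lang, target_lang),
            }
    return result


calls = []


def fake_translate(text, source_lang, target_lang):
    """Stand-in translator that records its calls and just tags the text."""
    calls.append(text)
    return "[" + target_lang + "] " + text


first = translate_bundle({"greeting": "Hello", "bye": "Goodbye"}, fake_translate, "en", "es")
second = translate_bundle({"greeting": "Hello", "bye": "See you"}, fake_translate, "en", "es", first)
print(second["bye"]["target"], len(calls))
```

On the second run only the changed string is sent to the translator, which matters when the bundle is large and the translation call is slow.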
Here is the type of script I use to make the call to Google Translate:
declare function local:get-google-translation($text, $source-lang, $target-lang) {
  (: Build the request URL with the URL-encoded text and the language codes. :)
  let $url := fn:concat("http://translate.google.com/translate_a/t?client=t&text=", xdmp:url-encode($text), "&hl=", $source-lang, "&tl=", $target-lang, "&multires=1&sc=1")
  let $response := xdmp:http-get($url,
    <options xmlns="xdmp:http-get">
      <format xmlns="xdmp:document-get">text</format>
    </options>
  )
  (: $response[2] is the response body; take the first quoted string in it. :)
  return fn:tokenize($response[2], '"')[2]
};
let $text := "Rome (CNN) -- Transcripts published Tuesday capture the dramatic conversations between port officials and a cruise ship captain, who a judge ruled can be held under house arrest while Italian authorities investigate his role in last week's disaster."
let $source-lang := "en"
let $target-lang := "es"
return local:get-google-translation($text, $source-lang, $target-lang)
=> ROMA (CNN) - Las transcripciones publicadas el martes la captura de las conversaciones entre los funcionarios del puerto espectacular y un capitán de barco de crucero , que puede ser un juez dictaminó bajo arresto domiciliario mientras las autoridades italianas investigar su papel en el desastre de la semana pasada .
Node.js is good for solving problems I don’t have
I have recently started programming with Node.js and I like how simple and easy it is to write HTTP server code with it. Just because it’s easy doesn’t mean it’s appropriate for my needs or that it’s ready for prime time. What I have noticed in learning and using Node is that it was created primarily as a response to a problem that I just don’t have, or in fact that most web applications shouldn’t have.
Node was created to provide an event-based web server programming model that makes better use of threads on the server, particularly when it comes to IO operations (like filesystem reads or database calls). So rather than a thread having to wait for an IO operation to finish before program execution continues, the thread registers a callback to be called when the IO operation is finished. This way the threads are able to serve more requests because they aren’t waiting for expensive operations to complete.
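The event-based model described here is not specific to Node. As a small illustration, the Python sketch below uses the standard asyncio module to start three simulated IO operations on a single thread and resume each one when it completes. The `fake_io` function and the 0.2 second delays are stand-ins for real database or filesystem calls.

```python
import asyncio
import time


async def fake_io(name, seconds):
    """Simulate an IO call; the event loop is free while this is waiting."""
    await asyncio.sleep(seconds)
    return name


async def handle_all():
    # All three operations wait at the same time on one thread.
    return await asyncio.gather(
        fake_io("db", 0.2),
        fake_io("file", 0.2),
        fake_io("service", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(handle_all())
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

The three waits overlap, so the total is close to the longest single wait rather than the sum of all three.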
Who has this problem? Whose web application performance bottleneck is that their threads are waiting for IO to complete? If this is your problem, then I don’t think you’ve got a very good web application implementation. Let me explain why.
HTML Generation is usually not the slowest part
Given that a web application is correct, available, and secure, users care most about speed. They don’t care about your hardware utilization or how many requests are handled per server, and they also don’t care how fast your web app is on average, they care about how fast it is for them. Page load time and time-until-usable are what users are concerned about.
Looking at cnn.com, there are 89 requests, totalling a little less than 1MB, which took 4.24s for it to load for me. Of those 89 requests, 3 were HTML requests from cnn.com (1 for the HTML page and 2 for weather). The HTML from cnn.com is about 30KB…out of 1MB! So if you want to speed up your site, where is the best place to focus? HTML from the server that makes up 3% of the total weight, 3% of the total number of requests, and 9% of the total page load time? Or would you focus on reducing the number of requests, the size of assets, and the caching of those assets?
Cnn.com’s HTML took 336 ms to get to me. Let’s say you made that 10x faster. You would have then reduced the total page time by 300 ms or about 7% of the total page load time and still get about 4 seconds for page load. You could have a 1 ms HTML response time and still have a slow site. The HTML generation and return time is usually not where the problem is for web application performance.
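The arithmetic is easy to check. The Python sketch below uses the numbers from this example (4.24 s total, 336 ms for the HTML, a 10x speedup) and assumes, as a simplification, that the HTML response time sits entirely on the critical path of the page load; the function name is made up.

```python
def page_time_saved(total_ms, html_ms, speedup):
    """Milliseconds saved, and the share of total load time, if HTML gets faster."""
    saved_ms = html_ms * (1 - 1 / speedup)
    return saved_ms, saved_ms / total_ms


saved_ms, fraction = page_time_saved(total_ms=4240, html_ms=336, speedup=10)
new_total_ms = 4240 - saved_ms

print(round(saved_ms), "ms saved,", round(fraction * 100, 1), "% of page load")
print(round(new_total_ms), "ms total after the speedup")
```

Even with the HTML response ten times faster, the page still takes a little over 3.9 seconds.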
Most of the assets on a web page are static (meaning they don’t change per request) so they can be served by a cache server (so the origin server isn’t hit) and by the browser (so not even the cache server is hit). The origin server can generate the HTML and serve up the static assets if needed, but it really shouldn’t do that very often because the browser cache and cache servers should be serving them. So what you really need is a content server that is geared toward HTML generation, whether it be static or dynamic. The origin server then generates dynamic but cacheable HTML (like templated but little-changing info pages) and handles dynamic but non-cacheable HTML (like search).
The content server should need to do hardly any IO. Why would an HTML content server need to write to the filesystem? Even if it does, why does the web visitor need to wait on the result of that file write operation before seeing the server response? If you really need to write to the filesystem, spawn a thread or offload that operation to something else that can queue up write operations. Your content server doesn’t need to do it; it just needs to invoke something else to do it.
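As a minimal sketch of “queue it and move on”, the Python below hands file writes to a background worker thread through a queue, so the request handler returns its HTML without waiting on the disk. The handler, the log format, and the file location are all assumptions for the example.

```python
import os
import queue
import tempfile
import threading

write_queue = queue.Queue()


def writer():
    """Background worker: perform queued file writes one at a time."""
    while True:
        item = write_queue.get()
        if item is None:  # sentinel used to stop the worker
            write_queue.task_done()
            break
        path, text = item
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(text)
        write_queue.task_done()


worker = threading.Thread(target=writer, daemon=True)
worker.start()


def handle_request(log_path, user):
    """Queue the write and return the response immediately."""
    write_queue.put((log_path, "visit by " + user + "\n"))
    return "<p>Hello, " + user + "</p>"


log_path = os.path.join(tempfile.mkdtemp(), "visits.log")
html = handle_request(log_path, "alice")

write_queue.join()  # only for the demo: wait so the file can be inspected
with open(log_path, encoding="utf-8") as handle:
    logged = handle.read()
print(html, repr(logged))
```

In a real server the final join would not be in the request path; it is there only so the example can show that the write did happen.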
If your content server is serving up dynamic content, what else can it be doing before it gets the data from the database? It’s primarily going to be formatting and creating presentation using the data from the DB, and if it has something it can be doing in the meantime I’m arguing it should be doing it. Something else can communicate with other services and cache HTML fragments or whatever. All the content server does is process content, so if it has to wait for the data, it waits.
But why would there be any IO for data that takes much time at all? If the data is so far removed from the presentation engine (the content server) that it blocks for any noticeable amount of time, you’ve got a problem with data retrieval. The answer isn’t to create a callback for when the data finally arrives from the DB; the answer is to fix the problem of data coming back so slowly from the DB.
Functional programming facilitates optimized and parallelized execution
One of the reasons I like functional programming is that the execution engine is able to parallelize function calls, because functions only operate on data coming into the function and only output a result. Functions don’t change properties or state on objects in memory. Since there’s no shared state or objects that can be accessed by two different processes, all operations are threadsafe. Better yet, with lazy evaluation like what MarkLogic does for many things, you can capture the result of a function call in a variable, but the execution engine doesn’t need to actually make the function call until you access something on that variable, which could be at any later point in your program. In fact, if you never access the variable, the execution engine may never actually call the function that returns the value for that variable. Order of execution becomes much less important because the functions have no side effects and can be executed whenever the execution engine decides. The execution of one function does not affect another, so you can execute them all at once, or whenever resources are available. With Node, you’d be writing code to do all that: optimizing the method calling yourself. Instead, use a functional language and let the execution engine do it for you.
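Python is not a functional language and does not parallelize or defer calls on its own, but it can illustrate the two properties this argument relies on. In the sketch below, a side-effect-free function gives the same results whether it runs sequentially or on a thread pool, and a generator stands in for lazy evaluation: the call inside it does not run until its value is requested. All names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """Pure function: depends only on its input and changes no shared state."""
    return len(text.split())


docs = ["a b c", "d e", "f"]

sequential = [word_count(d) for d in docs]
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(word_count, docs))  # same results, any execution order

calls = []


def expensive():
    """Records that it ran, so laziness can be observed."""
    calls.append(1)
    return 42


lazy = (expensive() for _ in range(1))  # nothing has been executed yet
calls_before = len(calls)
value = next(lazy)                      # the call happens only now
print(sequential == parallel, calls_before, value)
```

Because `word_count` touches no shared state, there is nothing to lock and no ordering to protect, which is what lets an engine schedule such calls however it likes.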
The problem I have is processing a lot of data quickly
I have megabytes and gigabytes of data to query and format for display on a web page. I need to be able to find a needle in a haystack and transform it into presentation quickly, for every request. First I need to get the speed down for just one user because that is as fast as I can go (unless another user were to cache it). Then I need that speed to remain fairly constant at scale, both with web traffic and amount of content. I am less concerned about how many requests each server can handle because I can scale horizontally if needed for both traffic and content size. With MarkLogic I have extremely fast access to the content I need. There’s no IO blocking to speak of. Even if there were, the execution engine will do some optimizing so parts of my code can execute in parallel. I spend time reducing query times, not coding callbacks for them.
Node enthusiasts are front-end coders not wanting to do server coding
I have used Javascript for over 15 years. I learned it before I learned Java. It’s really not too bad. I think what has happened in the web developer community is that some people who know front-end programming have gotten all excited that they can use their front-end skills to program the server. In fact, they think that they can even move a lot of processing that used to be done on the server up into the browser, using the programming languages and techniques they are used to, and all of a sudden it’s revolutionary and cutting edge. That’s a big reason CouchDB gained popularity: there was no need for server programming. With HTML5, some have the idea that we hardly even need a backend service at all, just something to persist some state once in a while.
So the Node community has tried to sell Node as solving a fundamental problem with server programming (blocking IO calls) but that’s really not the problem with web page speed or even server speeds, especially per request. I think the real reason is that they are mostly novices that want to use Javascript for the server side but they use the “blocking” argument to convince others. All the Node enthusiasts I know, some personally, are not very skilled server programmers but have pretty strong front-end skills. This revolution is more about front end coders not having to deal with the server side than any breakthroughs about how to do the server side. And the exuberance and arrogance from enthusiasts is meant to shame non-enthusiasts into thinking they’re old school, antiquated, or unable to learn new things, that this is the future and in a few years we’ll all be programming in Javascript and if you don’t get on board you’ll be out of a job (I heard this first-hand). Node has to be adopted, otherwise all these front-end coders will have to learn server programming.
But there are lots of things I like about Node, but not the community. I plan on using Node for easy HTTP server programming and for handling a large number of connections. But I need a Big Data server and a content server to generate dynamic and personalized HTML and to handle search. I’ll offload the HTML assets and cache as much as possible to cache servers, and I’ll optimize the front-end code to increase performance. Blocking calls, including IO, are just not one of my problems.
How to set system variables on a MarkLogic App Server
Sometimes you want to be able to set variables at the system level and have your code be able to retrieve those values at run time. For example, you might want to know what lane you are on (dev, test, prod, etc.) or which endpoint to call for a service, which depends on the box you are running on. MarkLogic doesn’t have a formal way of setting system variables but there is a little trick I learned today that mimics this pretty well.
Global Namespaces can be added at the Group or Application level in MarkLogic. Through the Admin Interface on port 8001 or through the API you can set a prefix and namespace URI which is accessible in the code. You set it on the Group and then all App Servers in the Group will be able to access it, or you can set it on the App Server. The App Server’s namespace will override an existing Group namespace.
So if I wanted to set the type of lane my code is running in, I could set a namespace at the Group level that has a prefix of “lane” and a URI of “prod”. The following code would get the value:
fn:namespace-uri-for-prefix("lane", <lane:blah/>)
=> prod
And if I wanted to set some endpoint, I could create a namespace on the Group with the prefix “endpoint” and URI “http://mysite:6005”:
fn:namespace-uri-for-prefix("endpoint", <endpoint:blah/>)
=> http://mysite:6005
Since these are global namespaces you don’t have to declare the namespace in the prolog, so you don’t need any more code than shown above.
Granted, this is not using global namespaces for their intended purpose, but it seems to work pretty well.
My Three Favorite New Features of MarkLogic 5
There are three new features in MarkLogic 5 that I am especially excited to see: better binary content handling, configuration importing and exporting, and retrieving the original URL of the request before URL rewriting. All of these save me development time and reduce the amount of code that I need to write.
Better binary content handling
MarkLogic has always been able to store binary files in the database, but if the files were too big or if you had too many files, your caches may have been adversely affected and your database merges may have taken longer than they needed to. In the past, when we had a lot of binary content that we wanted to serve off of a MarkLogic-powered website we would keep the binary files on the filesystem and just put the metadata file in the MarkLogic database. This worked fine, even streaming the files off the filesystem through MarkLogic, but we had to code the implementation and we always had to make sure the metadata files were in sync with the binary files on the filesystem. We don’t have to do this anymore with MarkLogic 5.
MarkLogic 5 introduces Rich Media Support which means that large binary files are handled differently than XML and text files under the covers in the server. There is a configurable threshold above which a binary file is considered “large” and is handled in a more efficient way. These large binary files are handled by MarkLogic as efficiently as if you saved them to the filesystem yourself. But you don’t need to use a special API or different functions than you would use for the XML and text files. You just insert the file using xdmp:document-insert() and MarkLogic will handle the rest.
Configuration importing and exporting
The Administration Interface on port 8001 provides a nice graphical, point-and-click interface for managing and configuring a MarkLogic installation. But for mature implementations, you’ll probably want a way to declare the settings for the servers, databases, forests, etc. and script the configuration changes. There are several good implementations that do this outside of MarkLogic, but now you can just export the settings of an installation and get the full configuration settings in an XML file. You can import this XML file into a separate machine to stand up an installation with the exact same settings. You can also check the configuration settings file into source control, make changes to it, and re-import the file back into the MarkLogic installation to effect those changes. As part of troubleshooting you can take a fresh export of the settings of an installation and compare those settings to the configuration settings file you had in source control to see if there were any inadvertent changes to the installation.
Getting the original URL of the request
This may seem to be a minor feature, but it is one that can save me code and complexity. It’s always been possible to get the request URL from within XQuery code by calling xdmp:get-request-url(). But this returns the URL after the URL rewriter has rewritten it. What if you wanted to get the URL before it was rewritten? In previous versions of MarkLogic you’d have to get the request URL (by calling xdmp:get-request-url()) in the URL rewriter itself and add the original URL as a parameter to the rewritten URL. For example,
fn:concat("/new/url?orig-url=", xdmp:get-request-url())
Then in subsequent code you’d get the original URL by getting the request field, like xdmp:get-request-field("orig-url"). That works but it can be a pain if you forget to add the URL as a parameter, or you make an error in the code to retrieve it. But now in MarkLogic 5 you can just call xdmp:get-original-url() which will return the URL as it was before the URL rewriter changed it. Less code I have to write. Less complexity. Fewer bugs.
MarkLogic is fast in terms of performance but also in terms of development time. I spent ten years in the Java world where time-to-market was extremely important, and it still is now. I have never been able to implement mature, high-performance, enterprise solutions faster on any other platform than on MarkLogic. The new features of MarkLogic 5 that excite me the most are the ones that reduce that time-to-market for me even more. Most if not all of these features are the results of customers lobbying for them, and MarkLogic has listened. I have been vocal about binary content handling and now it’s part of the server. I’m looking forward to this new version so I can continue to push the boundaries of delivering solutions for my customers in less time and with less risk.
Identifying entities in search phrases
TL;DR use regular expressions
From keywords to concepts
A lot of attention is given to finding relevant documents given a set of search terms, but what doesn’t get as much attention is the task of determining what the user means or intends before we start searching through documents for matches. In a recent article on TechCrunch, Nadav Gur talks about the future of finding information on the web and the rise of personal virtual assistants. He talks about the shift from keyword matching to concept matching and that the first step is to really understand what the user means with the words he typed in the search box (or spoke to the app).
If the user is already aware of the kinds of documents, scope of the corpus, and terminology used in the system then he can probably type words that are going to match what he means with how the system stores that meaning. Thesauruses help with synonyms, but they don’t go far enough in helping with understanding the user’s meaning.
Imagine if someone searches for “President of the United States” but all the documents have “President Barack Obama.” Obviously there may be many relevant documents missed because the form of the concept differs between the search phrase and the document data. Although it’s easy for us humans to see they are the same thing, the computer system needs more help to make the connection.
Several companies offer entity identification and extraction services which will identify that “President Barack Obama”, “President Obama”, “Mr. President”, and “The President of the United States” are all the same entity and which can enrich the content to use a consistent tag around the terms. So the problem of differing forms of the entity is solved, but what isn’t solved is going from what the user typed in to the entity tag we created. What we really want is to be highly tolerant of user-entered data while matching it to our predefined entities with high confidence and at high speed.
Bridging the gap between user and content
Depending on the type of system we’re dealing with, we may be more interested in entity type or particular entities. That is, are we more interested that “President Barack Obama” is an entity of type “person” or that it is the particular entity that represents Barack Obama? The problem I have been working on recently is identifying particular entities, not types, and the entities are all modeled semantically. Correctly identifying entities provides the on-ramp into the semantic world of relationships among the entities where I can walk the relationship graph. The hard part is getting from what the user typed to the semantic data because the user is going to type what makes sense to him, not necessarily using the terms you did internally.
Once we’re in the semantic world we can do all sorts of interesting things like show related data, compilations, provide disambiguation links, and ordered information. We can go way beyond simple document matching on terms the user types.
We want yes/no answers on identification, not relevancy
Imagine a user types:
“President Obama visits Republic of China with the VP”
As humans we can see two person entities (Barack Obama and Joe Biden) and a country entity (China). We also know which terms contributed to identifying the entities (“President Obama”, “the VP”, and “Republic of China”). Knowing this we don’t need to use the terms “President Obama” or “Republic of China” or “the VP” when we actually search content because we’ll use some canonical form of those entities instead. That really just leaves us with “visits” and “with” as content terms, and “with” is a pretty weak word so we’ll probably not want to use that. If we could pick out these entities in the search phrase itself we could get the best search results regardless of the form these terms originally took in the content. In fact, we don’t even necessarily have to already have done entity enrichment on the content as long as we know all the forms that these entities might take in the content.
Identifying entities in the search phrase is easy if the entire search phrase is just an entity: “President Obama” or “Barack Obama.” But when the terms for the entity are in the search phrase with other terms they’re harder to pick out. When I first tried to solve this I essentially searched entity data using permutations of the search terms, e.g. “President Obama”, “President Obama visits”, “Obama visits”, “China with the”, etc. Performance degrades significantly as the number of search terms increases because the number of permutations increases rapidly.
There’s also a more subtle problem with trying to search for entities given search terms: you get results based on relevancy instead of a discrete yes/no answer. It’s not that helpful to know that the most relevant match for “President Obama” is the canonical phrase “Barack Hussein Obama II”. Is it that entity or not? Fuzziness doesn’t really help here. We want identification, not relevancy, for entities.
Entity identification presupposes that you already know the entities you want to use (or perhaps types if that’s what you’re doing). Identifying entities that you have no information about isn’t very helpful. At a minimum you’d want to have canonical forms of the entity, if not semantic information on the entities. Regardless, you’re usually starting with what you already know and are looking in the search phrase to see if you can find terms that match those things you know. This means that you have to reverse the direction of the finding operation: rather than taking terms and searching content, you take the content and match it against the search terms.
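To put a rough number on that growth, here is a small Python sketch (Python rather than XQuery, purely for illustration). It assumes "permutations" means every contiguous run of words, which is what the examples above look like; the function name sub_phrases is made up for this sketch.

```python
def sub_phrases(phrase):
    """Return every contiguous run of words in the phrase."""
    words = phrase.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


phrase = "President Obama visits Republic of China with the VP"
candidates = sub_phrases(phrase)

# A phrase of n words yields n * (n + 1) / 2 candidates, each of which
# would need its own lookup against the entity data.
print(len(candidates))
```

With nine words that is already 45 lookups per search, and the count grows quadratically with phrase length. If word order were also allowed to vary, it would grow much faster.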
Regular Expressions to the rescue
MarkLogic fortunately has a very fast regular expression interpreter. In my experience, I can evaluate about 1000 regular expressions in about 100 ms. I have also found that very long and complex regular expressions don’t really run any slower than simple expressions, so it’s better to have a single, complex expression per entity that covers all possible forms rather than several different expressions per entity.
I generate the regular expressions from entities I already know about as a starting point and once in a while modify individual expressions by hand. I also have as an attribute of the expression a reference to the entity it represents. I prefer to have a single XML file with all the regular expressions so that it’s easier to see them all and easier to change and version.
For example, I might have Barack Obama modeled in my semantics datastore with the tuple subject as “barack_hussein_obama_II” and might have an XML file that looks like this:
This will match on “President Obama”, “Barack Obama” and “Barack Hussein Obama II” as well as other combinations.
But we can add more patterns to match on to catch what users might enter such as “President of the United States”:
Granted that single expression is getting insanely long which is why normally you’ll want to generate it. But let’s see how fast it runs.
On my laptop, running the above expression 10K times against the phrase “President of the United States” takes 0.14s. Not bad.
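As a rough illustration of generating one expression per entity from its known forms, here is a Python sketch. It is not MarkLogic code and not the regex file described above: the ENTITY_FORMS table, the list of forms, and the helper names are all assumptions made up for the example.

```python
import re

# Hypothetical entity table: the identifier and the listed forms are
# illustrative, not the contents of the regex file described above.
ENTITY_FORMS = {
    "barack_hussein_obama_II": [
        "President Obama",
        "Barack Obama",
        "Barack Hussein Obama II",
        "President of the United States",
    ],
}


def build_pattern(forms):
    """Combine every known form of one entity into a single compiled regex."""
    # Longest forms first so the alternation prefers the most specific form.
    ordered = sorted(forms, key=len, reverse=True)
    alternatives = [
        # Allow any run of whitespace between the words of a form.
        r"\s+".join(re.escape(word) for word in form.split())
        for form in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


PATTERNS = {entity: build_pattern(forms) for entity, forms in ENTITY_FORMS.items()}


def identify(phrase):
    """Return the ids of entities whose pattern occurs anywhere in the phrase."""
    return [entity for entity, pattern in PATTERNS.items() if pattern.search(phrase)]


print(identify("a visit by the president of the united states"))
```

Each entity gets a discrete yes/no answer: either its pattern occurs in the phrase or it doesn't, with no relevancy score involved.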
Getting the terms that matched
Knowing that something in the phrase “a visit by the President of the United States” matched a certain regular expression is helpful, but we ultimately want to know which terms in the phrase matched so that we can toss those out and use the remaining terms as search terms.
We’ll use fn:analyze-string and get the matches:
Now we can tokenize the search phrase and confidently pull out these terms if we wanted to.
Since fn:analyze-string runs so much slower than fn:matches, we’ll want to only run fn:analyze-string if fn:matches returns true. So putting it all together, we might have code that looks like this:
declare namespace s = "http://www.w3.org/2009/xpath-functions/analyze-string";

let $search-phrase := "a visit by the President of the United States"
let $regex-matches :=
  for $regex in /person-regex/regex
  return
    (: Run the cheap fn:matches test first; only analyze phrases that match. :)
    if (fn:matches($search-phrase, $regex, "i")) then
      let $s := $regex/@s/fn:string()
      return
        <entity type="person">
          <s>{$s}</s>
          <terms>{fn:analyze-string($search-phrase, $regex, "i")//s:match/fn:normalize-space(fn:string())}</terms>
        </entity>
    else ()
return $regex-matches

And now we know which entities match on which terms. We can add entities to the regex XML file and we can modify the regular expressions on an individual entity basis. And all this runs in a fraction of a second, depending on the number of regular expressions and (a little) on the length of the search phrase, all in real time for each user’s search. If multiple entities match, we’ll get those too, separated out.
What if you have a very large number of entities you want to match against, or if you want to do this entity identification even faster? For that you’re going to want to execute the regular expressions in parallel, which I’ll discuss next time.
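The same flow, including tossing out the matched terms to leave the real search terms, can be sketched outside MarkLogic. The Python below is only an analog under stated assumptions: both patterns and both entity identifiers are hypothetical, and Python's finditer does the test and the extraction in one pass, so the fn:matches guard has no direct counterpart here.

```python
import re

# Hypothetical patterns keyed by entity identifier; neither pattern comes
# from the regex XML file discussed in the article.
PATTERNS = {
    "barack_hussein_obama_II": re.compile(
        r"\b(?:(?:the\s+)?president\s+of\s+the\s+united\s+states"
        r"|president\s+obama|barack\s+obama)\b",
        re.IGNORECASE,
    ),
    "china": re.compile(r"\b(?:republic\s+of\s+china|china)\b", re.IGNORECASE),
}


def identify_entities(phrase):
    """Return (entities, leftover): matched terms per entity id, plus the
    words of the phrase that did not contribute to any entity."""
    entities = {}
    leftover = phrase
    for entity_id, pattern in PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(phrase)]
        if matches:
            entities[entity_id] = matches
            # Blank out the matched terms so only content terms remain.
            leftover = pattern.sub(" ", leftover)
    return entities, " ".join(leftover.split())


entities, leftover = identify_entities(
    "President Obama visits Republic of China with the VP"
)
print(entities)
print(leftover)
```

The leftover string is what would go on to the actual content search, alongside the canonical forms of the identified entities. "the VP" stays in the leftover here only because this sketch has no pattern for it.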