Archive for January, 2012

Using a Semantic Data System for Genealogy Data

January 24, 2012

Last spring at my brother’s wedding I had an opportunity to talk to my aunt who is getting a degree in Genealogy from Brigham Young University. Far from being a quaint little hobby of grandmas, genealogy is some serious business to professionals in the industry. My aunt told me about a project she had been working on where she had filled up four full-sized whiteboards completely with information in the effort to come to conclusions for her project. Naturally I asked about what computer applications she uses and what format the data is in. That’s when the rush of frustration was unleashed on the state of the technology for genealogy.

While there are some good sites that host indexes that you can search, there really was nothing that met her needs for compiling and coalescing the information and for capturing and presenting her conclusions. The amateur software was obviously not sufficient for what she needed, and there seemed to be no good data format for any of the data except pedigrees, which are actually the result of research, i.e. the conclusion, not the pieces of data that lead up to determining a pedigree.

I told my aunt that I imagined that genealogy data is pretty messy, that it’s often incomplete, inconsistent, contradictory, and at different levels of detail, quality, uncertainty, and authority. And rather than trying to fit messy data into defined buckets or categories or pedigrees, what there really needs to be is something that embraces the messiness and just captures everything, so that what the system produces is a probable profile of an individual. Individuals and families become more of a statistical probability of a convergence of loose data points, rather than a “Person” in the system. This may sound a little harsh and impersonal, but we often don’t have data sufficient to make an authoritative call on what a “Person” is based on the data available. So we have to switch to thinking of the data as revealing information about an individual with a non-definite amount of certainty.

My aunt of course lit up at this, and after I said I have experience in a system that can do just that, I was invested. I bought several books on genealogy for professionals and my understanding of them and the field increased tremendously. A professional genealogist is part historian, part biographer, part information analyst, part linguist, part author, part forensic expert, and other things too I’m sure. Evidence is paramount. If you can’t provide sources and correct citations for the information and conclusions you’ve drawn, then you have no conclusions that anyone can accept. The most common publishing avenue is in document form, as in reports, books, and articles. The sharing of the data itself is not where it needs to be, often being trapped in footnotes, in-document tables and charts, and of course whiteboards (or photos of whiteboards).

The RootsTech Developer Challenge

The RootsTech Developer Challenge was a good opportunity for me to make real the ideas I have had on how to address the need for improved genealogy systems. The challenge is billed more for “apps” to enhance the public’s engagement with genealogy and to foster increased interest in the field. I did not think the problem or next step is better apps. I think the problem is how the data is handled. Even amazing apps aren’t going to be able to compensate for a weak system for handling the data. What I thought is that we need a fundamentally different way to handle genealogy data and embrace the messiness.

What we needed was a system that is designed to take in data on just about anything, from individuals to locations, from factoids of an individual’s life to full and complete pedigrees. We needed a system that did not seek the one correct data set on a person but rather had a high tolerance of uncertainty, contradiction, and even error. Some genealogy systems in the past sought to create a single pedigree for everyone, but what ended up happening is that people overwrote others’ previous entries and the errors that were introduced were perpetuated, causing a gradual degradation of the quality of the data in the system. We needed a system that could take in anything and, with quality and certainty measures, tag the data points so users could know the quality of the data. And we need the users themselves to be able to set some of the quality and certainty measures.

We also need users to be able to upload the results of their work, research and conclusions, not just in report or final form, but even the fragments and bits that they have. Users need to be able to make these data public, keep them private, or share them with others that they choose. Users need to be able to incorporate these data in their searches, and they need to be able to pick and choose what kind of data (public, private or shared) when they search. They also need to be able to choose the quality and certainty level of data they want to include in their search to improve the signal to noise ratio of their results.

And why can’t the system go through the data and infer some results? If John Smith was divorced from Alice in 1834, why can’t the system infer that John Smith is male, was married before 1834, and that his wife was probably named Alice Smith? And why can’t the system infer that the John Smith in that divorce record is the same John Smith in another census record that says he lived in Pleasantdale, Maine and had two kids, or at least provide a level of probability? Why can’t the information from these two records then be combined when I search for “Smith divorced Maine” and get a hit, showing me all the combined information for that person with the quality and certainty levels displayed?

Why can’t users add their conclusions, assertions, and inferences, too? Why can’t Sally add that she found an obituary for John Smith in Pleasantdale in an old microfiched copy of a newspaper and it says he was survived by three children? And after all these derivative data points are added in the system, why can’t I trace each one of them back to their original sources, with contact information of the people who captured or entered the information?
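To make the divorce example concrete, here is a minimal Python sketch of one inference rule. It is only an illustration, not the code in my entry: the predicate names, the record id, and the `derived_from` field are all made up for this example. The point is that every inferred fact carries a pointer back to the record it came from, so it can be traced to its source.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    predicate: str
    value: str
    derived_from: str  # hypothetical field: id of the record the fact was inferred from


def infer_from_divorce(record_id, husband, family_name, wife_given_name, year):
    """Return facts that a divorce record suggests but does not state outright."""
    return [
        Fact(husband, "sex", "male", record_id),
        Fact(husband, "married-before", str(year), record_id),
        # "Probable" because the record only gives the wife's given name.
        Fact(husband, "wife-probable-name", f"{wife_given_name} {family_name}", record_id),
    ]


facts = infer_from_divorce("divorce-1834-17", "john-smith-1", "Smith", "Alice", 1834)
for fact in facts:
    print(fact)
```

Each derived fact is just another data point in the system, so a later search can include or exclude inferred facts the same way it would any other source.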

It might sound nice but pie-in-the-sky, but that’s what I did and that’s what I entered into the RootsTech Developer Challenge. Granted the application is ugly (think Geocities…beta) and I’m sure it has bugs, but everything I described above is included and it works and works fast (all subsecond with a 4GB database), with the exception of being able to include certainty levels in searches. I don’t expect it to win and I am kind of done with that challenge because I feel like I figured it out and now I want to move on with it, but I enjoyed the challenge.

How it Works

The data model was the single most important factor in getting this system to work. Relational data models with tables and columns are just insufficient because the data could be virtually anything and could be in virtually any structure, and that just doesn’t fit in tables and columns. XML is better because it can allow any structure in a document with any field and any values, but that still doesn’t make searching the data any easier. To search the data and make sense of the values you need to know ahead of time what fields are available. A Semantic Data Model provides a way for any data to be of any type and associated with any other thing, be it a value or an entity, so that’s the model I chose.

The system I built at its heart is a Semantic Data system with some modifications. Semantic data is in Subject – Predicate – Object form. The Subject is anything; often it is an id or key of some “thing” but it can be anything. The Predicate is the “type” of relationship between the Subject and the Object. A Predicate can be for a value (“Eye color is”) or it can be for the relationship between two things (“is Father of”). The Object is either a value (“Blue”) or the id or key of some entity (“123-John-Smith”). The Predicate usually tells you if the Object is a value or another entity, which would probably be used elsewhere as a Subject in some other fact.

This kind of triple, S-P-O, is called a tuple and can be used to model virtually any kind of data. For genealogy, there may be a tuple like:

“123456789” “first-name” “John”

and another tuple

“123456789” “last-name” “Smith”

and another

“123456789” “married-to” “4443333222”

and another

“4443333222” “birthplace” “Chicago”

and another

“11111111” “father-of” “4443333222”

and another

“11111111” “last-name” “Grant”

etc.

So in the above example, we know that there is someone with a name of “John Smith” who married someone who was born in Chicago and whose father’s last name was Grant. Even without complete information, we can “walk the graph” and go from person to person using relationships defined in the data and view the data we have for each person at each step. Note too that the relationships can be bi-directional. We may know that A is the father of B or we may know that B is the daughter of A. Either way we can walk the relationship graph and infer the parent-child relationship even if that relationship was originally defined in only one direction. Better yet, the system can make the inference and add that inference as a new tuple into the system.
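Here is a small Python sketch of walking that graph, using the example tuples above. It is an illustration only: the `INVERSES` table and the “child-of” predicate are assumptions made for this example, not names from my system.

```python
# The example tuples from above, as (subject, predicate, object).
triples = {
    ("123456789", "first-name", "John"),
    ("123456789", "last-name", "Smith"),
    ("123456789", "married-to", "4443333222"),
    ("4443333222", "birthplace", "Chicago"),
    ("11111111", "father-of", "4443333222"),
    ("11111111", "last-name", "Grant"),
}

# Hypothetical rules: predicate -> predicate that holds in the other direction.
INVERSES = {"father-of": "child-of", "married-to": "married-to"}


def with_inverses(facts):
    """Return the facts plus one inferred tuple for every invertible relationship."""
    inferred = {(o, INVERSES[p], s) for s, p, o in facts if p in INVERSES}
    return facts | inferred


def objects(facts, subject, predicate):
    """All objects recorded for a subject under one predicate, sorted."""
    return sorted(o for s, p, o in facts if s == subject and p == predicate)


graph = with_inverses(triples)
spouse = objects(graph, "123456789", "married-to")[0]
father = objects(graph, spouse, "child-of")[0]  # only reachable via the inferred tuple
print(objects(graph, spouse, "birthplace"))  # ['Chicago']
print(objects(graph, father, "last-name"))   # ['Grant']
```

Starting from John Smith, the walk goes to his spouse and then to her father, even though the data only ever said “father-of” in the other direction.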

For a collaborative semantic genealogy data system, Subject – Predicate – Object is actually insufficient. Three other aspects need to be added: Time, Source, and Quality. If we know that an 1850 census says John Smith was 34, we want to capture that original data as-is, with no interpretation, in order to preserve the data integrity. So we would have:

“12345-page-4-line-2” (or whatever we decide the subject should be) “Age is” “34”

But that fact was not always true; it was true only when the census worker captured the information. So we need to add Time, which in this case is “1850”, or as much as we know. So now we have Subject – Predicate – Object – Time (S-P-O-C):

“12345-page-4-line-2” “Age is” “34” “1850”

But what if that data is coming from something written in someone’s Bible? We want to capture that too, so now we’ll have Subject – Predicate – Object – Time – Source (S-P-O-C-E):

“12345-page-4-line-2” “Age is” “34” “1850” “Alison-Grant-Family-Bible”

Now the source would be a key to some other data that would have detailed information about the source and contact information for whoever got the information from the source. But how reliable is this data? Is it for the right John Smith? Is the information legible in the Bible? Is it firsthand, secondhand, or more? We need a quality value, so now we have Subject – Predicate – Object – Time – Source – Quality (S-P-O-C-E-Quality):

“12345-page-4-line-2” “Age is” “34” “1850” “Alison-Grant-Family-Bible” “4”

This assumes a scale from 1 to 10, where 10 is absolutely certain. Now we have a model that can capture virtually any data, for anything, true at a particular time, of any quality, and noting the source. For sharing and privacy we can either add those into the tuples, or use system permission controls to secure the tuples themselves.
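As a rough Python sketch of the extended tuple, assuming nothing more than the six fields just described: the field names, the second (contradictory) record, and its source key are invented for illustration. Both claims sit side by side in the store, and the quality value lets a search decide how much noise to let in.

```python
from typing import NamedTuple


class Tuple6(NamedTuple):
    subject: str
    predicate: str
    obj: str
    time: str
    source: str
    quality: int  # 1 to 10, where 10 is absolutely certain


store = [
    Tuple6("12345-page-4-line-2", "Age is", "34", "1850", "Alison-Grant-Family-Bible", 4),
    # Hypothetical second claim that contradicts the first; both are kept.
    Tuple6("12345-page-4-line-2", "Age is", "36", "1850", "hypothetical-census-sheet", 8),
]


def at_least(tuples, min_quality):
    """Keep only tuples whose quality meets the caller's threshold."""
    return [t for t in tuples if t.quality >= min_quality]


print([t.obj for t in at_least(store, 1)])  # ['34', '36']
print([t.obj for t in at_least(store, 5)])  # ['36']
```

Nothing is overwritten when a new claim arrives, which is the whole point: the contradiction stays visible and each claim keeps its own source and quality.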

Yeah But How Can This Be Fast?

We covered conceptually how to model the data to achieve the goals I listed at the beginning. Fortunately there are several products in the industry for Semantic or Graph Databases. I chose to use MarkLogic because it was familiar to me, I knew it was extremely fast and can scale to the petabyte level, and most importantly it has very advanced language tools. Other graph databases may be fast at linking different pieces of data together, but I also needed one that was fast when searching for first names that sound like “John”, or contain “John”, or are spelled similarly to “John.” I needed something that could find word stem hits, so that when I search for “run” I get “ran”, “runs”, and “running” also. For this free text searching (which is almost entirely against the Object in the tuple) I need a powerful search system. So MarkLogic gave me both the free text searching and the fast data linking, plus security down to the user and tuple level, all of which can scale to petabytes. I suppose this system could be built on other platforms, but I’ll leave it to the reader as an exercise to prove out.

The implementation details and the actual code used can be pretty complicated and lengthy, but for those interested almost all the magic is in the prolific use of cts:queries in MarkLogic. For searches such as “Family of John Smith” there is a Predicate Resolution step, which builds a list of all Predicate types that are part of a “family” type, as determined by rules (“husband-to”, “brother-of”, etc.). This Predicate Set is then sent as exact values to match in the P. The queries against Objects (O) are stemmed, case-insensitive, diacritic-insensitive, punctuation-insensitive, and whitespace-insensitive. I also created a custom Double Metaphone index on all the values for Objects of Predicate types of names and locations: I took all unique values of O for those types and created a file for each type holding the calculated Double Metaphone values, which MarkLogic provides an API for. The Object Resolution step then includes the original value typed plus all the values from the Double Metaphone index (which are existing values in O) that have the same Double Metaphone value, and all of these go into the O set for the query. All linking and joins are done through the Subject (S). For tuples that have been linked “Same-As”, I query all Objects, get their subjects and filter them to the ones that have Same-As matched within the result set. This is the Subject Resolution Step. And voilà, I have my results.
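The Object Resolution idea can be sketched in a few lines of Python. To be clear about the assumptions: this uses classic Soundex as a stand-in because it is short enough to write out in full, whereas my system uses Double Metaphone through MarkLogic. The function names and the sample names are hypothetical.

```python
from collections import defaultdict

# Soundex digit for each coded consonant; vowels, h, w and y have no digit.
CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def soundex(name):
    """Classic four-character Soundex key (a stand-in for Double Metaphone)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate consonants with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


def build_phonetic_index(values):
    """Map each phonetic key to the existing Object values that share it."""
    index = defaultdict(set)
    for value in values:
        index[soundex(value)].add(value)
    return index


def resolve_objects(typed, index):
    """The value as typed plus every existing value that sounds the same."""
    return {typed} | index.get(soundex(typed), set())


index = build_phonetic_index(["John", "Jon", "Jahn", "Smith"])
print(sorted(resolve_objects("Jhon", index)))  # ['Jahn', 'Jhon', 'John', 'Jon']
```

The expanded set is what gets handed to the query as the O values, so a misspelled “Jhon” still finds the Johns that actually exist in the data.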

Relevancy is not so good yet, and all the sharing and access control is done via yet another field on the tuple: Sharing. But I’m rethinking that and considering using document permissions instead. Below are a couple of functions so people can see some of the code. This is most of the code behind a free text search field that expects you to include a first name and last name in your search.

What’s Left?

I don’t think I’ve answered what my aunt needs for her individual research, but this provides the engine and capability to build on. There is a lot of application interface coding left to do, too. Getting data for this has actually been pretty hard. Any genealogy data dumps I could get were on CDs from the 90s in Infobases format, with threatening EULAs about not using this data in another system. So I didn’t use those. Oddly there is a county in Maine that has a decent amount of census data online (trapped in HTML pages): Aroostook County. The State of Maine has several databases posted online (mostly in Access databases) like military records, divorces, court proceedings, and Revolutionary War Land Grants. I spent about half of my time getting the data and processing it into a format I could use, eventually in semantic form.

This system can be used as a way for researchers to enter, share, and publish their work and conclusions, and they can do so without any original data being in the system. They can make the Subjects individuals on the big genealogy sites, and they can make the sources point to those specific URLs elsewhere. This would then effectively be a big metadata system. The data need not be limited to name and date information, but can include photographs, scans, video, PDFs, and any other kind of data. But I really hope I can either get data or work with one of the big websites to incorporate these capabilities. It’s just time and money.

(: Free text search; expects a first name and a last name in $phrase. :)
declare function free-search($phrase, $sources, $start, $end) {

	let $terms := fn:tokenize($phrase, " ")[. != ""]
	(: Narrow to subjects with an object matching every term, then page the result. :)
	let $subjects := reduce((), $terms)[$start to $end]
	(: Also include the subjects linked to these by Same-As. :)
	let $subjects := ($subjects, get-all-same-as-subjects($subjects))
	(: Keep only subjects that have tuples from the requested sources. :)
	let $subjects :=
		cts:search(/t,
			cts:and-query((
				cts:element-value-query(xs:QName("s"), $subjects, "exact")
			,
				get-cts-search-source-query($sources)
			))
		)/s/text()
	let $subjects := fn:distinct-values($subjects)

	return
		<results>
			{
			for $s in $subjects
			let $family-name-hit := (/t[s = $s][p = $type:person-family-name]/o/text())[1]
			let $given-name-hit := (/t[s = $s][p = $type:person-given-name]/o/text())[1]
			let $name := fn:concat($given-name-hit, " ", $family-name-hit)
			(: Tuples of this one subject whose object matches a search term. :)
			let $query :=
				cts:and-query((
					cts:element-value-query(xs:QName("s"), $s, "exact")
				,
					cts:element-word-query(xs:QName("o"), $terms)
				,
					get-cts-search-source-query($sources)
				))

			let $t := cts:search(/t, $query)

			return
				<hit>
					<s>{$s}</s>
					<name>{$name}</name>
					{
					for $at in $t
					return
						<highlight>
							{$at/p}
							{$at/o}
							{$at/c}
							{$at/e}
						</highlight>
					}
				</hit>
			}
		</results>
};

(: Recursively narrows $subjects to those with an object matching every term. :)
declare function reduce($subjects, $terms) {
  let $matched-subjects :=
    if (fn:empty($subjects))
    then
      cts:search(/t,
        cts:element-word-query(xs:QName("o"), $terms[1])
      )/s/text()
    else
      cts:search(/t,
        cts:and-query((
          cts:element-value-query(xs:QName("s"), $subjects, "exact")
        ,
          cts:element-word-query(xs:QName("o"), $terms[1])
        ))
      )/s/text()
  return
    (: "le 1" also stops the recursion when no terms were supplied. :)
    if (fn:count($terms) le 1)
    then fn:distinct-values($matched-subjects)
    else reduce(fn:distinct-values($matched-subjects), $terms[2 to fn:last()])
};

(: Returns every subject reachable from $subject through Same-As links. :)
declare function get-all-same-as-subjects($subject) {
  get-same-as-subjects($subject, ())
};

declare function get-same-as-subjects($check-subjects, $found-subjects) {

  (: Both ends of every Same-As tuple that touches one of the subjects to check. :)
  let $other-subjects := fn:distinct-values(
    for $check-subject in $check-subjects
    return
      cts:search(/t,
        cts:and-query((
          cts:element-value-query(xs:QName("p"), $type:person-same-as, "exact")
        ,
          cts:or-query((
            cts:element-value-query(xs:QName("o"), $check-subject, "exact")
          ,
            cts:element-value-query(xs:QName("s"), $check-subject, "exact")
          ))
        ))
      )/(s|o)/text()
  )

  (: Drop the subjects that earlier passes already found. :)
  let $new-subjects :=
    for $other-subject in $other-subjects
    return
      if ($other-subject = $found-subjects)
      then ()
      else $other-subject

  (: Recurse until a pass turns up nothing new. :)
  return
    if (fn:exists($new-subjects))
    then get-same-as-subjects($new-subjects, ($new-subjects, $found-subjects))
    else $found-subjects

};

Categories: commentary

Code for translating content using Google Translate

January 17, 2012

Often I use machine translated content when I am developing so that I have content in various languages before I get the official translation. I’m not too concerned that the translation is correct, just as long as it is representative of content in that language. I’ve found that creating a website from the beginning in at least two different languages helps me avoid coding myself into a box and making assumptions that are only valid in a single-language website, particularly in my queries for content in the DB and in how the web page layout handles strings of different lengths.

I usually create an initial resource bundle of some sort in English and then use Google Translate to translate the strings in the bundle into various target languages. I automate this so that I just push a button or have the script execute in a trigger so that when I change or add to the resource bundle in English I can re-translate it easily.
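The re-translation loop itself is simple. Here is a rough Python sketch of the idea, not my actual script: the bundle is assumed to be a plain key-to-string mapping, and `fake_translate` is a made-up stand-in for the real call to the translation service so the example runs offline. The cache is there so that re-running after an edit only translates strings that are new or changed.

```python
def translate_bundle(bundle, target_lang, translate, cache=None):
    """Translate every string in an English bundle, reusing cached results."""
    cache = {} if cache is None else cache
    translated = {}
    for key, text in bundle.items():
        if (text, target_lang) not in cache:
            cache[(text, target_lang)] = translate(text, "en", target_lang)
        translated[key] = cache[(text, target_lang)]
    return translated


calls = []


def fake_translate(text, source_lang, target_lang):
    """Hypothetical stand-in for the real translation call; records each request."""
    calls.append(text)
    return f"[{target_lang}] {text}"


bundle = {"greeting": "Hello", "farewell": "Goodbye"}
cache = {}
first = translate_bundle(bundle, "es", fake_translate, cache)

bundle["title"] = "Welcome"  # add a string, then re-translate the whole bundle
second = translate_bundle(bundle, "es", fake_translate, cache)
print(second["title"])  # [es] Welcome
print(len(calls))       # 3
```

Only three requests are made in total: two on the first run and one for the string added afterwards.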

Here is the type of script I use to make the call to Google Translate:

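(: Calls Google Translate over HTTP and returns the first translated string. :)
declare function local:get-google-translation($text, $source-lang, $target-lang) {

  let $url := fn:concat("http://translate.google.com/translate_a/t?client=t&amp;text=", xdmp:url-encode($text), "&amp;hl=", $source-lang, "&amp;tl=", $target-lang, "&amp;multires=1&amp;sc=1")

  (: xdmp:http-get returns the response metadata first and the body second. :)
  let $response := xdmp:http-get($url,
    <options xmlns="xdmp:http-get">
      <format xmlns="xdmp:document-get">text</format>
    </options>
  )

  (: The translation is the first double-quoted string in the body. :)
  return fn:tokenize($response[2], '"')[2]
};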

let $text := "Rome (CNN) -- Transcripts published Tuesday capture the dramatic conversations between port officials and a cruise ship captain, who a judge ruled can be held under house arrest while Italian authorities investigate his role in last week's disaster."
let $source-lang := "en"
let $target-lang := "es"

return local:get-google-translation($text, $source-lang, $target-lang)


=> ROMA (CNN) - Las transcripciones publicadas el martes la captura de las conversaciones entre los funcionarios del puerto espectacular y un capitán de barco de crucero , que puede ser un juez dictaminó bajo arresto domiciliario mientras las autoridades italianas investigar su papel en el desastre de la semana pasada .

Categories: Tips n' Tricks