Archive for the ‘newbie track’ Category

How to do a deep copy of XML, and make changes using XQuery

December 1, 2010

Since XQuery is a functional language and you can’t change the value of a variable (you can only redeclare the variable to have a new value), if you want to make a change to an XML tree, you have to actually create a new copy of the XML with your changes in it. You can’t change the original XML. There are some utility functions that make the process fairly easy, like in-mem-update and node-replace, but they may not be the best options in some cases.

Below is a simple way to make a deep copy of an element and introduce a change to it or to its immediate children. If you want to make changes deeper than one level, or changes based on some rules, I’d recommend using the dispatch-passthru method or XSLT. But if you just want to add or remove an element or attribute, consider the following:

Adding an element:

xquery version '1.0-ml';

declare namespace ryan = "http://ryan";
declare namespace bob = "http://bob.com/xyz";

let $example := 
    <ryan:sky time="now">
        Clear
        <cloudy/>
        <bob:forecast>Snow</bob:forecast>
    </ryan:sky>

return 
    element { fn:QName(fn:namespace-uri($example), fn:name($example)) } 
    { $example/(@*|node()), <speak>Hello</speak> }

=> <ryan:sky time="now" xmlns:ryan="http://ryan">
        Clear
        <cloudy/><bob:forecast xmlns:bob="http://bob.com/xyz">Snow</bob:forecast><speak>Hello</speak></ryan:sky>

Excluding an element (removing it) using “except”:

xquery version '1.0-ml';

declare namespace ryan = "http://ryan";
declare namespace bob = "http://bob.com/xyz";

let $example := 
    <ryan:sky time="now">
        Clear
        <cloudy/>
        <bob:forecast>Snow</bob:forecast>
    </ryan:sky>

return 
    element { fn:QName(fn:namespace-uri($example), fn:name($example)) } 
    { $example/(@*|node()) except $example//cloudy }

=> <ryan:sky time="now" xmlns:ryan="http://ryan">
        Clear
        <bob:forecast xmlns:bob="http://bob.com/xyz">Snow</bob:forecast></ryan:sky>

Excluding an element (removing it) using the element name:

xquery version '1.0-ml';

declare namespace ryan = "http://ryan";
declare namespace bob = "http://bob.com/xyz";

let $example := 
    <ryan:sky time="now">
        Clear
        <cloudy/>
        <bob:forecast>Snow</bob:forecast>
    </ryan:sky>

return 
    element { fn:QName(fn:namespace-uri($example), fn:name($example)) } 
    { $example/(@*|node())[fn:not(fn:local-name(.) eq "cloudy")] }

=> <ryan:sky time="now" xmlns:ryan="http://ryan">
        Clear
        <bob:forecast xmlns:bob="http://bob.com/xyz">Snow</bob:forecast></ryan:sky>

Excluding an attribute (removing it) using “except”:

xquery version '1.0-ml';

declare namespace ryan = "http://ryan";
declare namespace bob = "http://bob.com/xyz";

let $example := 
    <ryan:sky time="now">
        Clear
        <cloudy/>
        <bob:forecast>Snow</bob:forecast>
    </ryan:sky>

return 
    element { fn:QName(fn:namespace-uri($example), fn:name($example)) } 
    { $example/(@*|node()) except $example/@time }

=> <ryan:sky xmlns:ryan="http://ryan">
        Clear
        <cloudy/><bob:forecast xmlns:bob="http://bob.com/xyz">Snow</bob:forecast></ryan:sky>

Adding an attribute using an attribute constructor (remember that attributes must come first in the element’s content, before any child nodes; only other attributes may precede them):

xquery version '1.0-ml';

declare namespace ryan = "http://ryan";
declare namespace bob = "http://bob.com/xyz";

let $example := 
    <ryan:sky time="now">
        Clear
        <cloudy/>
        <bob:forecast>Snow</bob:forecast>
    </ryan:sky>

return 
    element { fn:QName(fn:namespace-uri($example), fn:name($example)) } 
    { attribute { "quality" } { "good" }, $example/(@*|node()) }

=> <ryan:sky quality="good" time="now" xmlns:ryan="http://ryan">
        Clear
        <cloudy/><bob:forecast xmlns:bob="http://bob.com/xyz">Snow</bob:forecast></ryan:sky>

...and so forth
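For changes more than one level deep, the recursive approach mentioned at the top of this post can be sketched roughly as below. This is only a minimal illustration under my own assumptions: the function names local:dispatch and local:passthru and the sample XML are invented here and are not part of any library.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: copy the children of a node by sending each one back through dispatch. :)
declare function local:passthru($node as node()) as node()* {
    for $child in $node/node()
    return local:dispatch($child)
};

(: Hypothetical helper: decide per node whether to change it or copy it. :)
declare function local:dispatch($node as node()) as node()* {
    typeswitch ($node)
        (: The one rule in this sketch: swap any cloudy element for sunny. :)
        case element(cloudy) return <sunny/>
        (: Every other element is rebuilt with its attributes and processed children. :)
        case element() return
            element { fn:node-name($node) } { $node/@*, local:passthru($node) }
        (: Text, comments and so on are copied as they are. :)
        default return $node
};

local:dispatch(
    <sky time="now">
        <region><cloudy/></region>
    </sky>
)
```

With this sample input, the nested cloudy element should come back as sunny while everything else is copied unchanged.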

Categories: newbie track

Simple XML File Upload Example with MarkLogic

November 15, 2010

Below is an XQuery page that displays the contents of an XML file that you uploaded through the browser (tested on FF, IE, Opera, Safari). Note that this code reads in the file contents as XML, not just text, so you can use XPath on it.

let $doc := xdmp:get-request-field( "upload-file" )
let $xml-contents := xdmp:unquote($doc)

return
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <form name="file_submit" method="post" enctype="multipart/form-data">
            <input type="file" name="upload-file" /><input type="submit" value="Upload" />
        </form>
        Contents: {$xml-contents}
    </body>
</html>

Example XML:

<message>
    <today>Hello World!</today>
    <tomorrow>Welcome back, World!</tomorrow>
</message>
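Because the contents are parsed as XML, XPath can be applied to them directly. The sketch below shows the idea with the example file; the literal string stands in for the value of the upload-file field, which is an assumption made so the snippet can run on its own.

```xquery
xquery version "1.0-ml";

(: Stand-in for xdmp:get-request-field("upload-file"); assumed for illustration. :)
let $uploaded :=
    '<message><today>Hello World!</today><tomorrow>Welcome back, World!</tomorrow></message>'

(: xdmp:unquote parses the string into a document node. :)
let $xml-contents := xdmp:unquote($uploaded)

(: The document node is the parent of the root element, so the path starts at message. :)
return fn:string($xml-contents/message/today)
```

This should return the text Hello World!.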


Preventing XQuery Injection and XSS Attacks

November 12, 2010

Accessing or Modifying Data on the Server or Database

SQL Injection attacks are a well-known problem for many websites. The problems stem from taking user input, and using those values as part of a string concatenation of a SQL command.

For example, suppose there is a web page that has a form field for the user to enter his userid. Below is a code snippet that would be a poor way to write the query code (Java):

String userid = request.getParameter("userid");
String sql = "SELECT * from users where user = '" + userid + "'";
stmt = conn.createStatement();
rs = stmt.executeQuery(sql);
...

A malicious user might enter:

bob' OR '1'='1

This would effectively make the query:

SELECT * from users where user = 'bob' OR '1'='1'

which of course would always evaluate to TRUE and return all rows of the table.

For XQuery on MarkLogic, fortunately, XQuery injection offers hardly any opportunity to do anything malicious. This is mostly because, in the SQL example above, the SQL statement is constructed as a String and then evaluated in the database. With XPath, you don’t create the path statement as a String and then evaluate it. Rather, you write the path statement and include variables in it.

Example:

let $userid := xdmp:get-request-field("userid")
return fn:doc()/user[@id = $userid]

The $userid variable is bound to the value passed in. The statement is not built using the user input. So even if the user tried to introduce some code to inject, it wouldn’t work:

"A" or 1 = 1

This does nothing: the whole input is compared as a single value, and no user has an id equal to it.

 

xdmp:eval()
To mimic the above SQL by creating a String that you then interpret, you’d have to use xdmp:eval():

let $userid := xdmp:get-request-field("userid")
return xdmp:eval(fn:concat("fn:doc()/user[@id = '", $userid, "']"))

and then you could pass in something like

A' or '1'='1

for the value of userid which would effectively turn the XPath statement into:

fn:doc()/user[@id = 'A' or '1'='1']

which would return all the user documents.

 

xdmp:value()
xdmp:value() is very similar to xdmp:eval() in that it will evaluate whatever is passed in to it:

let $userid := xdmp:get-request-field("userid")
let $user := "Bob"
return xdmp:value($userid)

If someone submitted “$user” for the userid, the result would be
Bob

So where the easy way to execute SQL statements is to construct a String and then evaluate it (which presents a security hole), the easy way to do an XPath statement is to write the XPath with variables for the values (which does not open a security hole). This is similar to creating a PreparedStatement in Java and then binding values to the variables in the statement. But with XQuery and XPath, binding variables to values is the typical, easy way to do it. The hard way is to use xdmp:eval() or xdmp:value() to evaluate the statement and open a hole for XQuery code injection. So you’d have to get good enough at XQuery to even know how to use xdmp:eval() or xdmp:value() to create a security hole this way.

I have never used xdmp:eval() or xdmp:value() for anything. Meaning, I have never used xdmp:eval() or xdmp:value() for any code I’ve written for public and non-public websites, tools, backend code….nothing. Perhaps someone, somewhere has a legitimate need to use them for something, but I doubt it. And if you are using them anywhere in a web application, you are doing it wrong. If you don’t know how to use xdmp:eval() or xdmp:value(), don’t worry about it. You don’t need to. And if you don’t use xdmp:eval() or xdmp:value(), there are very few opportunities for XQuery Injections.
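For the rare case where xdmp:eval() seems unavoidable, the PreparedStatement idea carries over: xdmp:eval() accepts external variables as a sequence of name and value pairs, so the user input is bound as a value instead of being concatenated into the query text. The following is a minimal sketch of that, assuming the same userid field and user documents as the examples above; the empty-string default is my own choice.

```xquery
xquery version "1.0-ml";

(: Assumption: fall back to an empty string when the field is missing. :)
let $userid := xdmp:get-request-field("userid", "")
return
    xdmp:eval(
        (: The query text is a constant; nothing from the request is spliced into it. :)
        'xquery version "1.0-ml";
         declare variable $userid as xs:string external;
         fn:doc()/user[@id = $userid]',
        (: The input is passed separately, as the value of the external variable. :)
        (xs:QName("userid"), $userid[1])
    )
```

Submitting A' or '1'='1 here only searches for a user whose id is that exact string.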

Injecting Javascript to Be Executed in the User’s Browser

The above discussion revolved around preventing XQuery injections that gain access to the server or to data in the database. However, a different type of injection involves injecting HTML or Javascript that will be rendered in the user’s browser, where the malicious Javascript executes. This is called a Cross-Site Scripting (XSS) attack and is a problem for any HTML page regardless of the underlying technology.

If your page ever displays data that a user previously entered, then a malicious user could either A) enter Javascript that gets saved in the database and displayed on the webpage when others bring up the page (as with comments or profile information), or B) encode malicious code in URL parameters that the page echoes back without saving it in the database, and then send another user a link carrying those parameters. If you are unclear about these types of attacks, go read about them. What I will describe here is what you can do in your XQuery application to prevent them.

Rather than trying to create a list of problem characters that you will sanitize out of your user inputs (a blacklist), you should rather create a list of characters that your app will accept (a whitelist). This is fairly easy using regular expressions. Just create a function that all your code uses when getting request values, and your ability to fend off XSS attacks goes up dramatically.

For example, you could establish a regex pattern that strips out everything but letters, numbers, and a few punctuation characters:

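declare function local:get-sanitized-request-field($field-name) {
    (: Keep only whitelisted characters: letters, digits, spaces, quotes, colon, period, hyphen. :)
    let $field-value := xdmp:get-request-field($field-name)
    return fn:replace($field-value, "[^a-zA-Z0-9 '"":\.\-]", "" )
};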
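To see what such a pattern does to a hostile value, here is a small self-contained sketch that applies the same regular expression to a literal string. The input value is made up for illustration and stands in for whatever the request field would contain.

```xquery
xquery version "1.0-ml";

(: Same whitelist as above: anything that is not a letter, digit, space, quote, colon, period or hyphen is removed. :)
let $pattern := "[^a-zA-Z0-9 '"":\.\-]"

(: Hypothetical hostile input. :)
let $input := "<script>alert('hi')</script>"

return fn:replace($input, $pattern, "")
```

The angle brackets, parentheses, and slash are stripped, so this should return scriptalert'hi'script, which the browser would show as plain text.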

Summary

If you don’t use xdmp:eval() or xdmp:value(), there’s really no way to inject XQuery into your application, so attacks like SQL Injections really aren’t possible. If you sanitize your user inputs against a whitelist that only allows certain characters through, then you leave almost no opportunity for XSS attacks. Of course bad guys try very hard to find holes, and they may find some, but the opportunities for holes are far fewer than on other web technologies because the easy, normal way to do things doesn’t open up many holes. And it’s not just because XQuery is a newer technology that hasn’t had as much opportunity to be attacked. It’s because XQuery fundamentally operates differently than other common web programming languages. I see this as akin to Java: since the JVM itself manages memory, there really is no way to write bad code that would allow a buffer overflow that could execute malicious code, as is possible with applications written in C. Developers don’t have to spend as much time programming defensively to address vulnerabilities because there just aren’t as many to begin with, by design.

When to use an element or an attribute in your data model

November 10, 2010

When creating a data model in XML, you’ll likely have those moments where you have to decide whether a particular piece of data should be represented as an element or an attribute in your schema. I used to deliberate on this for a while, but now I’ve settled into some pretty easy guidelines:

  • Could there be multiple values for this data, like two “authors” or “types”? If so, use an element.
  • Could it be complex data, like with child nodes or attributes of its own? If so, use an element.
  • Does it belong in a large group of other elements, like address1, address2, city, state, zip, and country? If so, use an element, because all those values as attributes would make the owner element hard to read.
  • Does it seem to exist and have an identity of its own (like “city”), or does it seem to be an aspect of something else (like “color”)? If it seems to just be an aspect of something else, use an attribute.

So basically I ask if the data is a single atomic value that seems to just be an aspect of some other thing. If it is, then I make it an attribute; otherwise it’s an element.
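As a rough illustration of these guidelines, here is a made-up record written as an XQuery element constructor. All of the names and values are hypothetical: color and size are single atomic aspects of the shirt, so they are attributes, while the repeatable designer and the grouped address fields are elements.

```xquery
xquery version "1.0-ml";

(: Hypothetical record; every name and value is invented for illustration. :)
<shirt color="blue" size="M">
    <!-- Repeatable, so an element -->
    <designer>First Designer</designer>
    <designer>Second Designer</designer>
    <!-- Complex data belonging to a larger group, so elements -->
    <store>
        <address1>1 Main Street</address1>
        <city>Springfield</city>
        <state>IL</state>
        <zip>00000</zip>
    </store>
</shirt>
```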


Infosets, documents, and root elements

November 3, 2010

One surprise to new XQuery developers sometimes is that they don’t get back the root element when querying the database. Understanding the differences between the infoset, document, and root element will help.

  • The document is the top level entity of an XML document in the database. It must have one root element, but it can also have other child nodes (comments, processing instructions, declarations). fn:collection and fn:doc return the document for a given URI, not the root element.
  • The infoset is the abstract representation of a document. The infoset does not include all information in an XML document. For example, it does not include the order of attributes. But it does include information about nodes and their relationships to other nodes (parent, child, owner, etc.). The infoset can be represented different ways, and each XML parser may implement it differently, but you shouldn’t have to know the details.
  • The root element is often what we mean when we talk about an XML document, but it’s important to know the difference between it and its document when writing code to get that element.

For example, the following code selects the title element from an XML structure that is in memory:

let $article := 
	<article>
		<title>Great Article</title>
		<body>This is a really interesting article on some topic. Blah blah.</body>
	</article>
return $article/title
=> <title>Great Article</title>

If we were to save this to the database and then get the title, we might not get what we want if we use a similar XPath statement:

let $article := 
    <article>
        <title>Great Article</title>
        <body>This is a really interesting article on some topic. Blah blah.</body>
    </article>
    
return xdmp:document-insert("/myarticle.xml", $article)
;
fn:doc("/myarticle.xml")/title
=> "your query returned an empty sequence"

In order to get the title element in the above example using fn:doc, we would need to include the root element in the XPath (the last line):

let $article := 
    <article>
        <title>Great Article</title>
        <body>This is a really interesting article on some topic. Blah blah.</body>
    </article>
    
return xdmp:document-insert("/myarticle.xml", $article)
;
fn:doc("/myarticle.xml")/article/title
=> <title>Great Article</title>

Note that you can create documents in memory using the document constructor. If we created the original XML fragment this way, we would include the root in the XPath:

let $article := document {
	<article>
		<title>Great Article</title>
		<body>This is a really interesting article on some topic. Blah blah.</body>
	</article>
}
return $article/article/title
=> <title>Great Article</title>

Usually when I am capturing an XML doc in a variable, I include the root element in the XPath. That way any subsequent XPath statements would be the same as if the XML had originally just been in memory:

let $article := fn:doc("/myarticle.xml")/article
return $article/title
=> <title>Great Article</title>
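If it is ever unclear whether a variable holds a document or its root element, the node kind can be checked directly. A small sketch, using an in-memory document so that it does not depend on anything in the database:

```xquery
xquery version "1.0-ml";

let $doc := document {
    <article>
        <title>Great Article</title>
    </article>
}
return (
    (: The constructed value is a document node, not an element. :)
    $doc instance of document-node(),
    (: Its child is the root element. :)
    $doc/article instance of element(),
    (: Walking up from any node with fn:root leads back to the document node. :)
    fn:root($doc/article/title) is $doc
)
```

All three expressions should evaluate to true.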


When to use XPath, XQuery, XSLT, cts:search, and search:search

November 3, 2010

With XML technologies, and particularly with MarkLogic, there are several options that overlap in functionality. You have several choices for querying the database, navigating through nodes of a document, and transforming XML into a new document.

XML Technologies from www.w3schools.com

However, while many technologies can do the same thing, some are usually better choices than others for a particular operation, and this may not be readily apparent until you’ve been using them for a while.

This is what I have learned from experience developing XQuery applications on MarkLogic:

  1. To select documents from the database, use cts:search. You can also use XPath with fn:collection if you’re sure you know the performance implications. cts:search is as fast as or faster than XPath for selecting documents, and you will likely avoid coding queries that are extremely expensive and that could even kill your app. If you just want a particular doc and you have the URI, then you should just use fn:doc($uri).
  2. To select nodes in a given document, use XPath. Once you already have a document from the database, you probably are not going to be able to write an XPath statement that would cause any major performance problems. So go ahead and use XPath for node selection inside a doc, but use cts:search to get the doc from the database.
  3. To do full text searches or faceted searches, use search:search. This has a lot of powerful and flexible features for searching across documents in the database.
  4. To do application logic and simple element construction, use XQuery. This is what you can always fall back on to get the exact behavior that you want. I would use it more to hook all the other technologies together and handle the application flow. You can also use it to do simple changes to an XML structure, like adding an attribute or element.
  5. To do more complicated document transformations, use XSLT. XSLT is very clean and easy to read, particularly with recursive operations, so you’ll probably be less likely to introduce errors in your transformation. There is a small performance hit when using XSLT compared to XQuery, so use XSLT when your transformation starts to get too complex to be easily read in XQuery.
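The first three items in the list above can be sketched together as follows. This is only an illustration under assumptions: it presumes article documents with a title element, like the ones in the earlier post on root elements, and the search text is made up.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

(: 1. Select documents from the database with cts:search. :)
let $docs :=
    cts:search(
        fn:doc(),
        cts:element-value-query(xs:QName("title"), "Great Article")
    )

(: 2. Select nodes inside the documents you already have with XPath. :)
let $titles := $docs/article/title

(: 3. Run a full text search with search:search; the result is a search:response element. :)
let $response := search:search("great article")

return ($titles, $response)
```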

That’s what I would suggest. I’d be interested to know if anyone else has other opinions.

The value of fn:doc() [with no parameters]

May 18, 2010

UPDATE: I totally got it wrong that MarkLogic was the only implementation that could do something like fn:doc() without parameters. Rob Whitby schooled me that fn:collection() with no parameters is in the spec and that this is implemented by other XQuery engines. So the information below is not quite accurate. But the advantages of being able to query the entire database as a single document remain. End update.

—————–

I started learning XQuery using the MarkLogic platform and sometimes it’s not immediately clear to me what is an advantage because of XQuery and what is an advantage because of MarkLogic. Recently I have been tinkering with other XQuery engines and environments and I have learned a lot from the differences of implementations. One thing I have really come to appreciate is MarkLogic’s implementation of fn:doc().

According to the XQuery spec, the fn:doc() function takes one parameter which is the URI of a document in the database. If you pass it the empty sequence, you get the empty sequence back. This effectively means that you have to know the document you want to get before you use this function. But when you have a lot of documents in the database, you may not know which document has the data you are looking for.

For example, if you have a lot of books that you want to keep in the database, you would have to put all the book data into one document and then use XPath to find the data you want:

books.xml

<books>
    <book>
        <title>Where the Red Fern Grows</title>
    </book>
    <book>
        <title>To Kill a Mockingbird</title>
    </book>
</books>

XQuery code:

let $books := fn:doc("/books.xml")
return $books/books/book[title eq "Where the Red Fern Grows"]

As far as I can tell, this is how Oracle, Sausalito, eXist, and Zorba are implemented. Although these implementations may have other ways to search the database (as does MarkLogic), I would much rather use XPath to select data, rather than have to do a search across elements. I don’t want to have to do a search to select elements.

MarkLogic has implemented fn:doc() so that if you don’t pass it a parameter, it includes all documents in the database that match your XPath expression. So rather than having to know ahead of time the document you want to get, or having to use a search function to select the elements you want, you can just use fn:doc() with no parameter and the entire database acts like a single document in terms of node selection.

For example, using the same data, this query returns the book whose title matches “Where the Red Fern Grows” regardless of what document it is in:

book_1.xml

<book>
    <title>Where the Red Fern Grows</title>
</book>

book_2.xml

<book>
    <title>To Kill a Mockingbird</title>
</book>

XQuery code:

fn:doc()/book[title eq "Where the Red Fern Grows"]

This is very powerful because then you can have each book be its own document in the database, which means you can have an arbitrary number of “book” documents being inserted or updated independently without interfering with other “book” documents.

But to even take it further, you can use the shorthand for the “descendant-or-self” axis in XPath (“//”) to select all book elements in all levels of all documents in the entire database that match your query.

fn:doc()//book[title eq "Where the Red Fern Grows"]

This provides lots of flexibility in data design and allows you to use all the power of XPath to select whatever you want without having to worry about documents at all. This means you never have to know a URI. Everything is retrieved from the database by value only. All you have to deal with are elements and attributes, not URIs or documents, unless you want to. It’s just a gigantic cloud of data.
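As the update at the top of this post notes, fn:collection() with no parameters is the form that is in the spec, so the same query can be written with it. The sketch below also shows how to recover the URI afterwards if you do want it; the match wrapper element is just a made-up name for the output.

```xquery
xquery version "1.0-ml";

(: Select matching book elements from every document, wherever they are stored. :)
for $book in fn:collection()//book[title eq "Where the Red Fern Grows"]
return
    (: xdmp:node-uri reports which document each match came from. :)
    <match uri="{xdmp:node-uri($book)}">{$book/title}</match>
```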

I suspect, but I don’t know for sure, that the reason that the MarkLogic server can do this is because it indexes every document automatically. If it didn’t, using fn:doc() with no parameter would mean that the server would have to open every document in the database, perform the XPath expression on it, consolidate the results and return them. I doubt performance for that would be very good.

Until other implementations support fn:doc() with no parameters, MarkLogic is really going to be the superior implementation of XQuery, especially for large datasets. This is another example of how the MarkLogic server as a hybrid database\app server\search server can do things that really no other technology can, and I don’t know if that is always understood or appreciated when it is evaluated.

Categories: commentary, newbie track

Be aware of the different XQuery dialects in MarkLogic

May 1, 2010

When I first started learning XQuery, I read the XQuery book by O’Reilly. But I didn’t remember reading about how to do “appy” things like setting HTTP response codes, reading request parameters, logging, alerting, reading from the filesystem, etc. So I read the O’Reilly book again, and then I got pretty nervous because I still couldn’t find how to do those things. After asking around I came to realize that there is the XQuery specification, and then there are the functions that MarkLogic provides to do the “appy” things. But that’s not even quite right. There are different dialects of XQuery, including one from MarkLogic that extends the language itself, and then there are also functions MarkLogic provides that enable you to do the “appy” things. In fact, without MarkLogic’s enhancements to the language and the “appy” functions it provides, it would not be possible to use XQuery for the application code of a mature application.
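As a small taste of what those “appy” functions look like, here is a sketch of a page written in the 1.0-ml dialect. The field name and the greeting are made up for illustration; the xdmp: functions are MarkLogic extensions and are not part of the XQuery specification.

```xquery
xquery version "1.0-ml";

(: Read a request parameter, with a default when it is missing. :)
let $name := xdmp:get-request-field("name", "world")

(: Write a line to the server log. :)
let $_ := xdmp:log(fn:concat("greeting requested for ", $name))

(: Set the content type of the HTTP response. :)
let $_ := xdmp:set-response-content-type("text/plain")

return fn:concat("Hello, ", $name)
```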


Rules of thumb when you can’t see the data you expect

April 29, 2010

If your code can’t seem to get the data you think it should, keep these rules of thumb in mind:

1. If you can’t see the document, it’s probably permissions

2. If you can see the document but you can’t see the elements, it’s probably namespaces
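The namespace rule is easy to demonstrate in memory. In the sketch below, the namespace URI and the ex prefix are made up for illustration; the point is that an unprefixed step looks for an element in no namespace and so misses an element that lives in one.

```xquery
xquery version "1.0-ml";

declare namespace ex = "http://example.com/ns";

let $doc :=
    <message xmlns="http://example.com/ns">
        <today>Hello World!</today>
    </message>
return (
    (: Unprefixed step: looks for today in no namespace and finds nothing. :)
    fn:count($doc/today),
    (: Prefix bound to the same namespace as the data: finds the element. :)
    fn:count($doc/ex:today),
    (: Namespace wildcard: matches today in any namespace. :)
    fn:count($doc/*:today)
)
```

This should return 0, 1, 1.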


Commandline interface for XQuery

April 29, 2010

MarkLogic doesn’t provide an actual commandline interface, but there are some web-based tools that behave like one. This makes it easier to tinker with XQuery code since you don’t have to create an xqy file. This is the easiest way to enter ad hoc code.

CQ

You probably already have CQ in the Samples directory of your MarkLogic installation. If not, you can download it from here. You can just copy the “cq” directory to the Docs directory and then access it at http://localhost:8000/cq. The four buttons under the textarea execute the code and display the results in the various formats (text, xml, html). The Profile button will return detailed profiling information about the code execution.

DQ

DQ is an enhancement to CQ that includes syntax highlighting, tabs, and line numbers. You can download DQ here. Unzip the contents into the “cq” directory (see above) and then you can access it at http://localhost:8000/cq/dq
