Archive for November, 2011

Node.js is good for solving problems I don’t have

November 18, 2011

I have recently started programming with Node.js and I like how simple and easy it is to write HTTP server code with it. But just because it’s easy doesn’t mean it’s appropriate for my needs or that it’s ready for prime time. What I have noticed in learning and using Node is that it was created primarily as a response to a problem that I just don’t have, and in fact one that most web applications shouldn’t have.

Node was created to provide an event-based web server programming model that better utilizes threads on the server, particularly when it comes to IO operations (like filesystem reads or database calls). So rather than a thread having to wait for an IO operation to finish before program execution continues, the thread registers a callback to be called when the IO operation is finished. This way the threads are able to serve more requests because they aren’t waiting for expensive operations to complete.

Who has this problem? Whose web application performance bottleneck is that their threads are waiting for IO to complete? If this is your problem, then I don’t think you’ve got a very good web application implementation. Let me explain why.

HTML Generation is usually not the slowest part

Given that a web application is correct, available, and secure, users care most about speed. They don’t care about your hardware utilization or how many requests are handled per server. They also don’t care how fast your web app is on average; they care about how fast it is for them. Page load time and time-until-usable are what users are concerned about.

Looking at cnn.com, there are 89 requests, totaling a little less than 1MB, which took 4.24 s to load for me. Of those 89 requests, 3 were HTML requests from cnn.com (1 for the HTML page and 2 for weather). The HTML from cnn.com is about 30KB…out of 1MB! So if you want to speed up your site, where is the best place to focus? On the HTML from the server, which makes up 3% of the total weight, 3% of the total number of requests, and 9% of the total page load time? Or would you focus on reducing the number of requests, the size of assets, and the caching of those assets?

Cnn.com’s HTML took 336 ms to get to me. Let’s say you made that 10x faster. You would then have reduced the total page time by about 300 ms, or about 7% of the total page load time, and you would still have a page load of about 4 seconds. You could have a 1 ms HTML response time and still have a slow site. HTML generation and return time is usually not where the problem is for web application performance.

Most of the assets on a web page are static (meaning they don’t change per request) so they can be served by a cache server (so the origin server isn’t hit) and by the browser (so not even the cache server is hit). The origin server can generate the HTML and serve up the static assets if needed, but it really shouldn’t do that very often because the browser cache and cache servers should be serving them. So what you really need is a content server that is geared toward HTML generation, whether static or dynamic. The origin server then generates dynamic but cacheable HTML (like templated but little-changing info pages) and handles dynamic but non-cacheable HTML (like search).

The content server should need to do hardly any IO. Why would an HTML content server need to write to the filesystem? Even if it does, why does the web visitor need to wait on the result of that file write operation before seeing the server response? If you really need to write to the filesystem, spawn a thread or offload that operation to something else that can queue up write operations. Your content server doesn’t need to do it; it just needs to invoke something else to do it.

If your content server is serving up dynamic content, what else can it be doing before it gets the data from the database? It’s primarily going to be formatting and creating presentation using the data from the DB, and if it has something it can be doing in the meantime I’m arguing it should be doing it. Something else can communicate with other services and cache HTML fragments or whatever. All the content server does is process content, so if it has to wait for the data, it waits.

But why would there be any IO for data that takes much time at all? If the data is so far removed from the presentation engine (the content server) that it blocks for any noticeable amount of time, you’ve got a problem with data retrieval. The answer isn’t to create a callback for when the data finally arrives from the DB; the answer is to fix the problem of data coming back so slowly from the DB.

Functional programming facilitates optimized and parallelized execution

One of the reasons I like functional programming is that the execution engine is able to parallelize function calls, because functions only operate on data coming into the function and only output a result. Functions don’t change properties or state on objects in memory. Since there’s no shared state or objects that can be accessed by two different processes, all operations are thread-safe. Better yet, with lazy evaluation like what MarkLogic does for many things, you can capture the result of a function call in a variable, but the execution engine doesn’t need to actually make the function call until you access something on that variable, which could be at any later point in your program. In fact, if you never access the variable, the execution engine may never actually call the function that returns the value for that variable. Order of execution becomes much less important because the functions have no side effects and can be executed whenever the execution engine decides. The execution of one function does not affect another, so you can execute them all at once, or whenever resources are available. With Node, you’d be writing code to do all that: optimizing the method calling yourself. Instead, use a functional language and let the execution engine do it for you.
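As a rough illustration of the lazy evaluation described above, here is a minimal XQuery sketch. The search term "needle" and the variable name are made up for the example, and how much work the engine actually defers depends on the server and the query, so treat this as a sketch rather than a guarantee.

```xquery
xquery version "1.0-ml";

(: Binding the search to a variable does not, by itself, require the
   engine to fetch every matching document. The term "needle" is a
   placeholder for illustration. :)
let $hits := cts:search(fn:doc(), cts:word-query("needle"))

(: Only the first ten results are ever used, so a lazy engine only
   has to produce those ten. :)
return fn:subsequence($hits, 1, 10)
```

Nothing in this code says when the search runs or how many results are materialized; that decision is left to the execution engine.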

The problem I have is processing a lot of data quickly

I have megabytes and gigabytes of data to query and format for display on a web page. I need to be able to find a needle in a haystack and transform it into presentation quickly, for every request. First I need to get the speed down for just one user because that is as fast as I can go (unless another user were to cache it). Then I need that speed to remain fairly constant at scale, both with web traffic and amount of content. I am less concerned about how many requests each server can handle because I can scale horizontally if needed for both traffic and content size. With MarkLogic I have extremely fast access to the content I need. There’s no IO blocking to speak of. Even if there were, the execution engine will do some optimizing so parts of my code can execute in parallel. I spend time reducing query times, not coding callbacks for them.

Node enthusiasts are front-end coders not wanting to do server coding

I have used JavaScript for over 15 years. I learned it before I learned Java. It’s really not too bad. I think what has happened in the web developer community is that some people who know front-end programming have gotten all excited that they can use their front-end skills to program the server. In fact, they think that they can even move a lot of processing that used to be done on the server up into the browser, using the programming languages and techniques they are used to, and all of a sudden it’s revolutionary and cutting edge. That’s a big reason CouchDB gained popularity: there was no need for server programming. With HTML5, some have the idea that we hardly even need a backend service at all, just something to persist some state once in a while.

So the Node community has tried to sell Node as solving a fundamental problem with server programming (blocking IO calls), but that’s really not the problem with web page speed or even server speed, especially per request. I think the real reason is that they are mostly novices who want to use JavaScript on the server side, and they use the “blocking” argument to convince others. All the Node enthusiasts I know, some personally, are not very skilled server programmers but have pretty strong front-end skills. This revolution is more about front-end coders not having to deal with the server side than about any breakthrough in how to do the server side. And the exuberance and arrogance from enthusiasts is meant to shame non-enthusiasts into thinking they’re old school, antiquated, or unable to learn new things: this is the future, in a few years we’ll all be programming in JavaScript, and if you don’t get on board you’ll be out of a job (I heard this first-hand). Node has to be adopted, otherwise all these front-end coders will have to learn server programming.

There are lots of things I like about Node, just not the community. I plan on using Node for easy HTTP server programming and for handling a large number of connections. But I need a Big Data server and a content server to generate dynamic and personalized HTML and to handle search. I’ll offload the HTML assets and cache as much as possible to cache servers, and I’ll optimize the front-end code to increase performance. Blocking calls, including IO, are just not one of my problems.

Categories: commentary, node.js

How to set system variables on a MarkLogic App Server

November 11, 2011

Sometimes you want to be able to set variables at the system level and have your code retrieve those values at run time. For example, you might want to know what lane you are on (dev, test, prod, etc.) or which endpoint to call for a service, which would depend on what box you are running on. MarkLogic doesn’t have a formal way of setting system variables, but there is a little trick I learned today that mimics this pretty well.

Global Namespaces can be added at the Group or App Server level in MarkLogic. Through the Admin Interface on port 8001 or through the API you can set a prefix and namespace URI, which are then accessible in the code. You can set it on the Group, and then all App Servers in the Group will be able to access it, or you can set it on a single App Server. The App Server’s namespace will override an existing Group namespace with the same prefix.

So if I wanted to set the type of lane my code is running in, I could set a namespace at the Group level that has a prefix of “lane” and a URI of “prod”. The following code would get the value:

fn:namespace-uri-for-prefix("lane", <lane:blah/>)
=> prod

And if I wanted to set some endpoint, I could create a namespace on the Group with the prefix “endpoint” and the URI “http://mysite:6005”:

fn:namespace-uri-for-prefix("endpoint", <endpoint:blah/>)
=> http://mysite:6005

Since these are global namespaces you don’t have to declare the namespace in the prolog, so you don’t need any more code than shown above.
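To show how the looked-up value might be used, here is a minimal sketch that picks an endpoint based on the lane. It assumes the “lane” global namespace from above is configured on the Group or App Server; the fallback endpoint http://localhost:6005 is a hypothetical value invented for this example.

```xquery
xquery version "1.0-ml";

(: Assumes a global namespace with the prefix "lane" is configured on
   the Group or App Server, as described above. fn:string() turns the
   returned URI (or an empty sequence) into a plain string. :)
let $lane := fn:string(fn:namespace-uri-for-prefix("lane", <lane:blah/>))
return
  if ($lane eq "prod")
  then "http://mysite:6005"
  (: hypothetical fallback endpoint for non-production lanes :)
  else "http://localhost:6005"
```

Because the lane comes from server configuration, the same code can be deployed unchanged to every environment.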

Granted, this is not using global namespaces for their intended purpose, but it seems to work pretty well.

Categories: Tips n' Tricks

My Three Favorite New Features of MarkLogic 5

November 4, 2011

There are three new features in MarkLogic 5 that I am especially excited to see: better binary content handling, configuration importing and exporting, and retrieving the original URL of the request before URL rewriting. All of these save me development time and reduce the amount of code that I need to write.

Better binary content handling

MarkLogic has always been able to store binary files in the database, but if the files were too big or if you had too many files, your caches may have been adversely affected and your database merges may have taken longer than they needed to. In the past, when we had a lot of binary content that we wanted to serve off of a MarkLogic-powered website, we would keep the binary files on the filesystem and just put the metadata file in the MarkLogic database. This worked fine, even streaming the files off the filesystem through MarkLogic, but we had to code the implementation and we always had to make sure the metadata files were in sync with the binary files on the filesystem. We don’t have to do this anymore with MarkLogic 5.

MarkLogic 5 introduces Rich Media Support, which means that large binary files are handled differently than XML and text files under the covers in the server. There is a configurable threshold above which a binary file is considered “large” and so is handled in a more efficient way. These large binary files are handled by MarkLogic as efficiently as if you saved them to the filesystem yourself. But you don’t need to use a special API or different functions than you would use for XML and text files. You just insert the file using xdmp:document-insert() and MarkLogic will handle the rest.
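As a minimal sketch of what that insert might look like, the following reads a file from the filesystem as binary and stores it in the database. Both the filesystem path and the database URI are hypothetical, chosen only for illustration.

```xquery
xquery version "1.0-ml";

(: Read a file from the filesystem as a binary document.
   "/tmp/video.mp4" is a hypothetical path. :)
let $binary :=
  xdmp:document-get(
    "/tmp/video.mp4",
    <options xmlns="xdmp:document-get">
      <format>binary</format>
    </options>)

(: Insert it like any other document. "/media/video.mp4" is a
   hypothetical database URI. Whether the file counts as "large" is
   decided by the server's configured threshold, not by this code. :)
return xdmp:document-insert("/media/video.mp4", $binary)
```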

Configuration importing and exporting

The Administration Interface on port 8001 provides a nice graphical, point-and-click interface for managing and configuring a MarkLogic installation. But for mature implementations, you’ll probably want a way to declare the settings for the servers, databases, forests, etc. and script the configuration changes. There are several good implementations that do this outside of MarkLogic, but now you can just export the settings of an installation and get the full configuration in an XML file. You can import this XML file on a separate machine to stand up an installation with the exact same settings. You can also check the configuration settings file into source control, make changes to it, and re-import the file into the MarkLogic installation to effect those changes. As part of troubleshooting, you can take a fresh export of the settings of an installation and compare it to the configuration settings file you had in source control to see if there were any inadvertent changes to the installation.

Getting the original URL of the request

This may seem to be a minor feature, but it is one that can save me code and complexity. It’s always been possible to get the request URL from within XQuery code by calling xdmp:get-request-url(). But this returns the URL after the URL rewriter has rewritten it. What if you wanted to get the URL before it was rewritten? In previous versions of MarkLogic you’d have to get the request URL (by calling xdmp:get-request-url()) in the URL rewriter itself and add the original URL as a parameter to the rewritten URL. For example,

fn:concat("/new/url?orig-url=", xdmp:get-request-url())

Then in subsequent code you’d get the original URL by getting the request field, like xdmp:get-request-field("orig-url"). That works, but it can be a pain if you forget to add the URL as a parameter, or you make an error in the code to retrieve it. But now in MarkLogic 5 you can just call xdmp:get-original-url(), which will return the URL as it was before the URL rewriter changed it. Less code I have to write. Less complexity. Fewer bugs.

MarkLogic is fast in terms of performance but also in terms of development time. I spent ten years in the Java world and time-to-market was extremely important, and is still is now. I have never been able to implement mature, high-performance, enterprise solutions faster on any other platform than on MarkLogic. The new features of MarkLogic 5 that excite me the most are the ones that reduce that time-to-market for me even more. Most if not all of these features are the results of customers lobbying for them, and MarkLogic has listened. I have been vocal about binary content handling and now it’s part of the server. I’m looking forward to this new version so I can continue to push the boundaries of delivering solutions for my customers in less time and with less risk.

Categories: MarkLogic 5
Follow

Get every new post delivered to your Inbox.