Web Script: Modified Content List

For a recent customer engagement I was asked to write a web script which they’ve agreed to let me share.

The script was written to give them:

  • A list of content that had been modified from a point in time until “now” (the time of execution).
  • The ability to specify a single space, or to recurse the child spaces of the named space.
  • Results returned in a specified XML structure.

In preparing for this post I’ve added some modifications/fixes:

  • I’ve added two new return formats: JSON and HTML.
  • I’ve fixed an issue that prevented it from running on the 3.2.x release of Alfresco.

Here are some insights into a few areas of the script:

Lucene Date Queries

There are two specific things to remember about Date queries:

1 – Date Format. Lucene date queries expect the date in ISO8601 format, using combined date and time in UTC. (Even if you only care about the date, it will complain if you don’t pass the time.)

To do this I had originally used the JavaScript Date prototype extension from http://delete.me.uk/2005/03/iso8601.html. This works fine in pre-3.2 releases. But due to a change in the 3.2 code line, top-level objects are now sealed, which means you can’t make this type of prototype change to root-level objects in the JavaScript libraries. So I’ve made a simple modification to the Date prototype code, allowing me to pass a Date object into the code, which then does the conversion and returns a properly formatted ISO8601 string.

function toISO8601String(date, format, offset) { // based on http://delete.me.uk/2005/03/iso8601.html
	/*
	 * date: the Date object to convert.
	 *
	 * format (optional, defaults to 6), accepted values [1-6]:
	 * 1 Year: YYYY (eg 1997)
	 * 2 Year and month: YYYY-MM (eg 1997-07)
	 * 3 Complete date: YYYY-MM-DD (eg 1997-07-16)
	 * 4 Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD
	 *   (eg 1997-07-16T19:20+01:00)
	 * 5 Complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD
	 *   (eg 1997-07-16T19:20:30+01:00)
	 * 6 Complete date plus hours, minutes, seconds and a decimal fraction of
	 *   a second: YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45+01:00)
	 *
	 * offset (optional, defaults to 'Z'): time zone designator, eg '+01:00'.
	 */
	if (!format) {
		format = 6;
	}
	if (!offset) {
		offset = 'Z';
	} else {
		// Shift the date by the requested offset before reading the UTC fields.
		var d = offset.match(/([-+])([0-9]{2}):([0-9]{2})/);
		var offsetnum = (Number(d[2]) * 60) + Number(d[3]);
		offsetnum *= ((d[1] == '-') ? -1 : 1);
		date = new Date(Number(Number(date) + (offsetnum * 60000)));
	}

	var zeropad = function(num) {
		return ((num < 10) ? '0' : '') + num;
	};

	var str = "";
	str += date.getUTCFullYear();
	if (format > 1) {
		str += "-" + zeropad(date.getUTCMonth() + 1);
	}
	if (format > 2) {
		str += "-" + zeropad(date.getUTCDate());
	}
	if (format > 3) {
		str += "T" + zeropad(date.getUTCHours()) + ":"
				+ zeropad(date.getUTCMinutes());
	}
	if (format > 5) {
		var secs = Number(date.getUTCSeconds() + "."
				+ ((date.getUTCMilliseconds() < 100) ? '0' : '')
				+ zeropad(date.getUTCMilliseconds()));
		str += ":" + zeropad(secs);
	} else if (format > 4) {
		str += ":" + zeropad(date.getUTCSeconds());
	}

	if (format > 3) {
		str += offset;
	}
	return str;
}

I’ve turned the original prototype method into a function, and I pass a Date object holding the date I want to work with into this function. I also replaced any reference to ‘this’ with that date parameter. I will admit there is some excess code here that may never be called, but this was the quickest change to accomplish what I needed.
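As a rough illustration of where such an ISO8601 string ends up, here is a minimal sketch of building a Lucene range clause on the modified date. The helper name buildModifiedRangeClause is hypothetical and not part of the web script; it assumes both arguments are already ISO8601 strings in UTC, and that the property being queried is cm:modified.

```javascript
// Hypothetical helper (not from the original script): builds a Lucene range
// clause on cm:modified from two ISO8601 strings.
function buildModifiedRangeClause(startIso, endIso) {
	// The colon in the property name has to be escaped for Lucene, so the
	// JavaScript string literal needs a doubled backslash.
	return "@cm\\:modified:[" + startIso + " TO " + endIso + "]";
}

var clause = buildModifiedRangeClause("2010-01-01T13:10:10.000Z",
		"2010-04-26T16:21:20.612Z");
```

The resulting clause can then be combined with the space restriction and any exclusions discussed later in this post.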

2 – Only dates are indexed. By default, only the date portion of the property is indexed. In general, most use cases only require the date. But in this case, the customer was interested in what may have changed over the period of an hour.

There are two options for this: code around it, or modify how Alfresco indexes these DateTime properties. In this case I chose to code around it. This means that I am not modifying default behavior in Alfresco…always a plus.

Code Around. We have all the pieces we need to do this: we know the datetimes we use for the start and end of our range query. We also have access to the full datetime property (even if it wasn’t indexed).

I chose to pull this into a function that takes the collection (array) of documents found by our range query and tests whether the datetime property of each node (in this case the modified datetime property) is between the beginning and end of our range. If so, the node is added to our new filtered results array.

function filterResults(unfilteredResults) {
	var now = new Date();
	var filteredResults = new Array();

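	// "for each" is the Rhino (E4X) way of iterating over the values
	for each (var node in unfilteredResults) {

		var testDate = new Date(node.properties.modified);

		// filterDate (the start of the range) is expected to be set elsewhere
		// in the script. If the node's modified date is between filterDate
		// and now, add the node to the filteredResults array.
		if ((filterDate <= testDate) && (testDate <= now)){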
			filteredResults.push(node);
		}
	}

	return filteredResults;
}

This is clean and simple.
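For trying the same idea outside of Alfresco, here is a minimal sketch that takes the start and end of the range as arguments instead of relying on script-level variables. The name filterByModifiedRange and the sample data are made up for illustration; plain objects stand in for the ScriptNodes a real query would return.

```javascript
// Sketch only: plain objects stand in for Alfresco ScriptNodes here.
function filterByModifiedRange(nodes, start, end) {
	var filtered = [];
	for (var i = 0; i < nodes.length; i++) {
		var modified = new Date(nodes[i].properties.modified);
		// Keep the node only if it was modified inside the range (inclusive).
		if (start <= modified && modified <= end) {
			filtered.push(nodes[i]);
		}
	}
	return filtered;
}

// Hypothetical sample data: one node modified before the range, one inside it.
var start = new Date(2010, 3, 26, 16, 0, 0);
var end = new Date(2010, 3, 26, 17, 0, 0);
var sample = [
	{ name: "early.doc", properties: { modified: new Date(2010, 3, 26, 9, 30, 0) } },
	{ name: "simple.doc", properties: { modified: new Date(2010, 3, 26, 16, 21, 20) } }
];
var hits = filterByModifiedRange(sample, start, end);
```

Passing the range in explicitly makes the function easier to test on its own, at the cost of a slightly longer call.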

Modify Alfresco. While this sounds heavy, it actually isn’t.  Our support team pointed me at this forum post in which Andy Hind from our engineering team shows us how to change the default behaviour:

In dataTypeAnalyzers.properties (located inside the war file in WEB-INF/classes/alfresco/model) change:

d_dictionary.datatype.d_datetime.analyzer=org.alfresco.repo.search.impl.lucene.analysis.DateAnalyser

to

d_dictionary.datatype.d_datetime.analyzer=org.alfresco.repo.search.impl.lucene.analysis.DateTimeAnalyser

Now rebuild your indexes. Depending on the amount of content in your repository, this may take some time. It may also not be possible to make this change immediately, because the SLA you have with your customers won’t permit it until you have scheduled maintenance downtime. (HA clustering is helpful in these situations.)

System Folders

One thing that may show up in these queries is system folders. These folders contain any of the rules that may be associated with a space. In most queries these folders and their content have no value, so we want to exclude them.

This string can be appended to your query string to exclude them:

var systemFolder = "-TYPE:\"cm:systemfolder\" " +
		"-TYPE:\"{http://www.alfresco.org/model/rule/1.0}rule\" " +
		"-TYPE:\"{http://www.alfresco.org/model/action/1.0}compositeaction\" " +
		"-TYPE:\"{http://www.alfresco.org/model/action/1.0}actioncondition\" " +
		"-TYPE:\"{http://www.alfresco.org/model/action/1.0}action\" " +
		"-TYPE:\"{http://www.alfresco.org/model/action/1.0}actionparameter\"";

PARENT vs PATH

When we are performing queries for content within a space we have a couple of options.  Two of the most common are PARENT and PATH.

PARENT queries work directly on a space, without recursion. They return all of the nodes (spaces and content) in that space alone, but do not return any nodes that are in sub-spaces of that space.

PATH queries are a subset of XPATH. They are eagerly evaluated, so they can be memory (caching) and CPU intensive. They are useful if you want more than just what is in the space you are working in, if you may not know the location of the space, or if a space of that name may exist in multiple locations.

Be smart in your choice.  Use PARENT as often as possible.
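To make the difference concrete, here is a small sketch of what the two kinds of clause might look like when built as strings. The helper names, the node id and the path are placeholders chosen for illustration; they are not taken from the web script.

```javascript
// Hypothetical helpers, for illustration only.

// Direct children of one space, identified by its nodeRef string.
function parentClause(nodeRef) {
	return "PARENT:\"" + nodeRef + "\"";
}

// Everything below a space, identified by its qualified-name path.
// The trailing "//*" asks for descendants at any depth.
function recursivePathClause(qnamePath) {
	return "PATH:\"" + qnamePath + "//*\"";
}

// Either clause can be followed by exclusions such as the system folder type.
var childrenOnly = parentClause("workspace://SpacesStore/hypothetical-node-id")
		+ " -TYPE:\"cm:systemfolder\"";
var wholeTree = recursivePathClause("/app:company_home/cm:test")
		+ " -TYPE:\"cm:systemfolder\"";
```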

Note for Java extensions: unless the other clauses in your query are complex, it’s likely more efficient to enumerate (list / ls / dir) a space using the FileFolderService rather than a Lucene query, i.e. a simple query of the form “PARENT:[noderef]” would be better implemented using FileFolderService.list().

OK, enough about some of our design decisions; let’s talk about how to use the web script.

Install

The web script is packaged in an AMP file. It will work on Alfresco 3.1.1 and newer versions. (I tested up to Alfresco 3.2r.) The alfresco-modified-content-list AMP and the source can be found on the Google Code project site. Use the apply_amps script appropriate for your OS.

How to use it

The web script can be called using:

http://localhost:8080/alfresco/service/updated/in/{path}/since?date=01/01/2010 13:10:10

http://localhost:8080/alfresco/service/updated/{path}/since?date=01/01/2010 13:10:10

There is a slight difference in the two URLs: ‘in’

The ‘in’ allows you to tell the web script to just list the content in the space passed in the path parameter. Without the ‘in’ the web script will recurse the space structure from the passed path parameter down.
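Because the date value contains slashes, a space and colons, a client will usually want to encode it before putting it in the URL. The sketch below shows one way a caller might assemble either URL. The function name, the base URL argument and the "Company Home/test" style of path are assumptions made for illustration.

```javascript
// Hypothetical client-side helper: assembles the request URL for the web script.
function buildUpdatedUrl(baseUrl, path, date, childrenOnly) {
	// Encode each path segment separately so the "/" separators survive.
	var segments = path.split("/");
	for (var i = 0; i < segments.length; i++) {
		segments[i] = encodeURIComponent(segments[i]);
	}
	// With 'in' only the named space is listed; without it the script recurses.
	return baseUrl + "/updated/" + (childrenOnly ? "in/" : "")
			+ segments.join("/") + "/since?date=" + encodeURIComponent(date);
}

var url = buildUpdatedUrl("http://localhost:8080/alfresco/service",
		"Company Home/test", "01/01/2010 13:10:10", true);
```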

Next, the DateTime structure is not fixed. Possible (tested) formats are:

mm/dd/yyyy

yyyy/mm/dd

mm/dd/yyyy hh:mm:ss

yyyy/mm/dd hh:mm:ss

The hh expects a 24-hour clock format. Testing against milliseconds or a time zone is not supported.
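For readers who want to accept the same four formats in their own code, here is a minimal sketch of a parser. It is not the parsing code used by the web script; the name parseSinceDate is hypothetical, and it assumes the values should be read as local time.

```javascript
// Sketch of a parser for the four tested formats. Returns null when the
// input does not match. Assumption: values are read as local time.
function parseSinceDate(text) {
	var parts = /^(\d+)\/(\d+)\/(\d+)(?: (\d{1,2}):(\d{2}):(\d{2}))?$/.exec(text);
	if (!parts) {
		return null;
	}
	// A four-digit first field means yyyy/mm/dd, otherwise mm/dd/yyyy.
	var yearFirst = parts[1].length === 4;
	var year = Number(yearFirst ? parts[1] : parts[3]);
	var month = Number(yearFirst ? parts[2] : parts[1]);
	var day = Number(yearFirst ? parts[3] : parts[2]);
	// The time part is optional; a missing group is undefined, so default to 0.
	var hours = Number(parts[4] || 0);
	var minutes = Number(parts[5] || 0);
	var seconds = Number(parts[6] || 0);
	return new Date(year, month - 1, day, hours, minutes, seconds);
}

var withTime = parseSinceDate("01/01/2010 13:10:10");
var dateOnly = parseSinceDate("2010/04/26");
```

This sketch does no range checking, so a value such as 13/45/2010 would roll over into a later date rather than be rejected.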

The query tests against the modified datetime property of the content. (It is fairly easy to change it to test against the created property or any other custom metadata DateTime property.)

There are three return formats: XML (the default), JSON and HTML.

XML The XML format is a simple structure that returns the name of the file, the path to the file and the modification date.

<contentItems>
	<contentItem name='simple.doc' path='/Company Home/test/Portal' modDate='2010-04-26T16:21:20.612-06:00'/>
	<contentItem name='simpledraft.docx' path='/Company Home/test/Portal' modDate='2010-04-26T16:28:44.569-06:00'/>
</contentItems>

If there is no modified content, a simple empty element is returned.

JSON The JSON format is somewhat similar: a modified object containing a collection of node objects is returned.

{ "modified": [
              { "node": {
                         "name": "test3.txt", "path": "some/path","modified": "2010-03-17T00:05:43.311-06:00"
                        }
              },
              { "node": {
                        "name": "test2.txt", "path": "some/path","modified": "2010-03-17T15:07:30.301-06:00"
                        }
              }
              ]
    }
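A client could walk this structure along the following lines. This is only a sketch: the function name listModified is made up, and it assumes a JavaScript environment that provides JSON.parse (for example a browser or other client calling the web script).

```javascript
// Hypothetical client-side helper: turns the JSON response text into one
// display line per modified node.
function listModified(jsonText) {
	var response = JSON.parse(jsonText);
	var lines = [];
	for (var i = 0; i < response.modified.length; i++) {
		// Each entry wraps its fields in a "node" object.
		var node = response.modified[i].node;
		lines.push(node.path + "/" + node.name + " (" + node.modified + ")");
	}
	return lines;
}

var sampleResponse = '{ "modified": [' +
		'{ "node": { "name": "test3.txt", "path": "some/path", "modified": "2010-03-17T00:05:43.311-06:00" } },' +
		'{ "node": { "name": "test2.txt", "path": "some/path", "modified": "2010-03-17T15:07:30.301-06:00" } }' +
		'] }';
var lines = listModified(sampleResponse);
```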

HTML This final format is more for testing than actual production use. It returns a simple HTML page with a list of modified items. It displays a simple unordered list, where each item is a comma-separated list with keys and values separated by colons.

In all of these return formats, the FreeMarker template processes the returned ScriptNodes as a simple collection. This makes it easy to extend the output with the returned nodes’ properties to include things like nodeRef, creation date, node icons, download URL, etc.

If you have suggestions for improvements or questions, please feel free to comment here or on the Google Code project. As always, I’m willing to open access to the project to those willing to help improve the code.