Sunday, 24 October 2010

wget doesn't get all levels

I had a problem trying to grab a whole directory structure of files from an ftp site using wget. When I tried to get the files directly they were fine but when trying to get the whole site, it didn't get certain files which definitely existed.

It turns out that wget defaults to 5 levels of depth, even when downloading physical files from ftp (rather than following symlinks). I changed it so it so that it uses 10 levels instead with -l10 and it was fine. The number of levels make sense when you see the default output structure of servername/dri1/dir2/dir3/dir4 even though your files might appear to be level 4.

I'm not sure if rsync is easier to use for this purpose but the wget command is easy enough:

wget -r -N -l10 ftp://username:password@servername/dir

Where -r is recursion, -N only gets items that have been modified and -l10 increases the number of levels to 10 (That is 'L' 10, not 110!)

Tuesday, 19 October 2010

Document Generation with Word 2007 and ASP.NET - part 2

In the first part, we looked at how to build a document ready to accept dynamic data which can be used to generate dynamic reports. Whether we are planning to use a blank template, individual read/write documents or merging documents together, the basic idea is the same:

  1. Get data from data source

  2. Create XML part for this data

  3. Insert this data into the document

  4. Return the result to the client


Getting data from the data source is beyond the scope of this document but in my case, I have an existing database access layer which I call to return a single row of data (a DataRow).
Creating the XML part is quite easy using the XmlWriter class as in the following function:

private void GetData(Stream stream, string quoteRef)
{
// DataRow dr = etc..
XmlWriter writer = XmlWriter.Create(stream);
writer.WriteStartElement("Quote");
writer.WriteAttributeString("Reference", quoteRef);
writer.WriteElementString("QuoteName", "Test New Quote");
writer.WriteElementString("TotalSellingPrice", Convert.ToDecimal(dr["TotalSellingPrice"]).ToString("N2"));
writer.WriteElementString("TotalCost", Convert.ToDecimal(dr["TotalCost"]).ToString("N2"));
writer.WriteEndElement();
writer.Close();
}

Since the XML is text, it is convenient to format numbers at this point rather than playing with it in the document (but you can if you need to).
To insert this into the document, we need to use the System.IO.Packaging classes which allow you to work on the zip file (which is the docx). You might need to reference WindowsBase.dll if you haven't already to get these classes.
My function is then:

private void InsertCustomXml(MemoryStream memoryStream)
{
// Open the document in the stream and replace the custom XML part
Package pkgFile = Package.Open(memoryStream, FileMode.Open, FileAccess.ReadWrite);
PackageRelationshipCollection pkgrcOfficeDocument = pkgFile.GetRelationshipsByType(strRelRoot);
foreach (PackageRelationship pkgr in pkgrcOfficeDocument)
{
if (pkgr.SourceUri.OriginalString == "/")
{
// Add a custom XML part to the package
Uri uriData = new Uri("/customXML/item1.xml", UriKind.Relative);
if (pkgFile.PartExists(uriData))
{
// Delete template "/customXML/item1.xml" part
pkgFile.DeletePart(uriData);
}
// Load the custom XML data
PackagePart pkgprtData = pkgFile.CreatePart(uriData, "application/xml");
GetData(pkgprtData.GetStream(), "QUO-016952");
}
}
// Close the file
pkgFile.Close();
}

This function takes a stream which represents a docx document and adds the custom XML in, there is nothing in this function which you would change (except the string parameter passed to GetData) you need a const defined which matches the correct namespace:

private const string strRelRoot = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

The reason I use streams is because I use the system in different ways. To retrieve a single document:

public MemoryStream RetrieveDocument(string fileName, bool addCustomXml)
{
// Read the file into memory - default constructor is expandable
MemoryStream memoryStream = new MemoryStream();
byte[] buffer = File.ReadAllBytes(fileName);
memoryStream.Write(buffer,0,buffer.Length);
// If we want to add in the XML, do it here otherwise we might want to add it at top level
if (addCustomXml)
{
InsertCustomXml(memoryStream);
}

return memoryStream;
}

NOTE the comment that you need to use the default constructor for the memory stream otherwise it will not be expandable when you insert your custom XML.
If I am merging several documents (in my case using the Aspose Words dlls from Aspose) I merge the docs and THEN add the custom XML to the main document:

public Document MergeDocuments(List docs, string templateDoc, string outputPath)
{
Document dstDoc = new Document(templateDoc);

foreach (MemoryStream Doc in docs)
{
Document srcDoc = new Document(Doc);
dstDoc.AppendDocument(srcDoc, ImportFormatMode.UseDestinationStyles);
Doc.Close();
}
// Add in the custom XML via the memory stream
MemoryStream ms = new MemoryStream();
dstDoc.Save(ms,SaveFormat.Docx);
InsertCustomXml(ms);
dstDoc = new Document(ms);
return dstDoc;
}

Aspose merges actual docs whereas I need a memory stream for my functions so I use the Aspose doc class to append documents and then convert it to memory stream to add the custom XML, once it is done, I return a new Aspose doc so it can be saved appropriately. One of the things that is easier in docx is that to merge docs, you can simply merge the content and keep a single set of styles, custom xml and the rest of it but it does mean that any custom XML/styles in individual docs is lost when the documents are merged.
In my web layer then, I call these functions and return the result to the browser using content-disposition to hint that it can be saved instead of viewed inline:

protected void Page_Load(object sender, EventArgs e)
{
TenderGenerator gen = new TenderGenerator();
const string TemplateFile = @"~/App_Data/QuoteTemplate.docx";
List myList = new List();
myList.Add(gen.RetrieveDocument(@"c:\work\QuoteDocuments\QUO-016952\Documents\UniqueReport.docx", false));
myList.Add(gen.RetrieveDocument(@"c:\work\QuoteDocuments\QUO-016952\Documents\Financial Summary.docx", false));
Document doc = gen.MergeDocuments(myList, Server.MapPath(TemplateFile), "");
doc.Save(Response, "CustomerDocument.docx", ContentDisposition.Attachment, SaveOptions.CreateSaveOptions(SaveFormat.Docx));
Response.End();
}

In this case, I have a test page so the doc paths are hard-coded and the whole thing is driven from the page load event, in real life, this will be driven from a button press and database driven list of docs. Note that I use the template file stored in App_Data which already has the custom XML linked to it so that the top level document can have the custom XML replaced (in the previous post I mentioned that for some reason, if the custom XML is not present, the new data is not added in its place). Hopefully this is all easy enough to understand.

Document Generation with Word 2007 and ASP.NET

I had been asked to look at this for a project and spent many an hour trawling the interwebs to find out about solutions. The thing that makes it hard is that MS have a tendency to change technologies frequently (although generally for the better) and their MSDN articles are galaxian in size so trawling the various guides, APIs, white papers, support and other pages takes time.
Anyway, I've found some helpful articles and thought I would write a very easy to follow guide to getting the basics working.
If you want to generate a document on-the-fly, chances are that you want to populate the document with dynamic data from some form of database or other data source (the actual source is not important). Currently, this has to be done with various horrible COM technologies or otherwise by directly hacking Word 2003 XML files.
Word 2007 has the ability to link data controls on the page to XML data in the document, which can be generated dynamically. A Word 2007 document is actually a ZIP file (rename it and see!) which contains the various related XML files and which form the overall document, the XML which we will use to generate docs is custom XML and can follow any or no schema. In my case it is a very simple XML file with nothing more than the XML element and a single entity with 3 elements. Because this custom XML is separate from the rest of the document, I can change it at my leisure without affecting any formatting.
Now when adding these data controls, know as Content Controls, there are certain things that are not that obvious. Firstly, the controls can only be linked easily to the XML using code. There is a whole schema thing where you can define a schema and tell Word about it but I went another route by creating a Ribbon Control add-in for Word and using the C# code to add a correctly linked Content Control to a document. These can be added without the XML being present (they just won't have any data until the XML appears at some point). If you create a Word add-in project in Visual Studio and create a Ribbon(XML) component, which is fairly easy, you can then add the code which will look something like this (note the public modifier on the button handler):

// In the handler for a certain button - Ribbon1.cs
public void OnSellingPriceButton(Office.IRibbonControl control)
{
Globals.ThisAddIn.CreateBoundDataItem("Total Sale", "TotalSellingPrice");
}

the reason this calls a function via the AddIn is because the AddIn has access to the underlying document whereas the ribbon doesn't (although I think it is possible to do something weird to gain access). The add-in code in my case is then:

public void CreateBoundDataItem(string theTitle, string theProperty)
{
Microsoft.Office.Tools.Word.Document doc = Globals.ThisAddIn.Application.ActiveDocument.GetVstoObject();
Tools.PlainTextContentControl plainTextControl1 = doc.Controls.AddPlainTextContentControl("plainTextControl" + Count.ToString());
plainTextControl1.Title = theTitle;
plainTextControl1.XMLMapping.SetMapping("/Quote/" + theProperty, null, null);
++Count;
}

Special notes here! You CANNOT bind XML to the rich text control so use a plain text control. Also, you need to get the vsto object in order to get the correctly typed document to access the controls collection. Also, when adding the control, you have to provide a unique name so I use a static int variable in this class to generate a unique name.
Obviously the SetMapping code (XPath) will be unique to your XML format, mine, as I said, is very simple. Also, the title property is what appears on the control tab when it is inserted into the document (as an aide-memoire). You can set other properties here like whether the data is readonly and whether the control can be deleted.
The cool thing about the add-in is that once it is built, you simply copy the output from the bin/relase directory into your AddIns directory in AppData (search for AddIns) and Word will automatically show the ribbon.
Once the ribbon is installed, it will create content controls pointing to whatever part of data you need. At this point there may or may not be any xml in the custom parts of the document but if you are using the Packaging classes in C# (which we use in the server end), it appears that it can replace the XML but not create it (might be some permissions thing?) so you will need to make sure that the document starts with a correctly formed XML custom part, even if the element data in it is blank or invalid. You can do this by running some VB in Word 2007 with your document open (you will need to enable the developer tab in Word Options), in the immediate window, type each of these lines and press 'enter' after them to invoke them:

ActiveDocument.CustomXMLParts.Add
ActiveDocument.CustomXMLParts(4).Load ("c:\CustomerData.xml")

You need to use the index 4 (the doc already has 3 built-in ones) but your path to the XML will obviously be wherever it is. The system will create an item called item1.xml in the customxml directory inside your docx zip structure and the fields will immediately be available to your content controls.
That is the document part done. You have a couple of alternatives now, depending on what you are doing but worth work in the same way. You can either keep a single document with the custom XML present but no content, and use this as a basis for any generated documents or you can save a bunch of different documents, each with the custom XML in place and then work on these individually or together. It depends on whether your generation is one-way or whether it will be loaded and saved.
The next post will be the server side which will look at the various options.

Thursday, 14 October 2010

Embedding Excel cells into Word 2007

I'm trying to write something that generates Word documents from a database which can then be modified by a user to include optional fields - things that will change and need updating in a document.
I'm having a bit of a struggle working out how to do it but I thought about old-school DDE and embedding LINK fields into Word. It is not ideal since it requires an absolute file path but I thought I would try it out. I could then create building blocks based on these LINKs and allow users to add them in. I attempted to copy the example in the Word help and surprise, surprise, it didn't work at all. I kept getting various errors all amounting to the fact that the field was invalid. I tried all sorts of options and almost gave up when someone mentioned the easier way to do it!
Copy the cell(s) in Excel that you want to use and then use Paste Special in Word to paste as a link (it's an option on the left-hand side of the dialog). It will automatically create the correct format for the LINK field (in my case some slightly different switches) and should all be good.

Friday, 1 October 2010

When a SQL JOIN is slower than it should be

I've experienced the situation a few times where select * from viewA runs very fast, select * from viewB is very fast but when joined together, they are very slow (like 10 times the execution time).
I also had today the scenario where joining an old view was fine but joining a new one, which was not only virtually identical but with less columns and one less join, took 30 minutes instead of 10 seconds!
I have experienced minor improvements by changing join orders and trying to use more efficient WHERE clauses, avoiding SELECT in the WHERE clause etc but this was a whole other league of performance issue. When you only join two views, you don't have many options but I tried simplifying the ON clause (no different), removed the GROUP BY (no difference) remove the column which was summing an amount (no difference) and then I realised the only thing remaining was the JOIN. I changed this from INNER JOIN to LEFT OUTER JOIN and all of a sudden, I was back up and running.
As it happens, there are no rows in view 1 which do not appear in view 2 so the INNER JOIN isn't actually required here but I was still surprised. I looked into a few web posts and the best answer was, "an inner join is an implicit outer join + where column is not null" but I don't actually believe this and still wouldn't expect the problem in my 500 row dataset but obviously somehow it is a problem (although this is by FAR the worst case of it I have ever seen).