Category Archives: Document Conversion

How To : Convert Word Documents to PDF using SharePoint Server 2010 and Word Services and Stop Wasting Money on MS Field Engineers

SharePoint’s Word Automation Services is extremely powerful when it comes to converting document types and keeping the formatting.
Combine this with the flexibility of the templates that OpenXML use and you have a very powerful combination capapble of ANYTHING.
This is part of a Document Management System I developed for Nedbank that a Microsoft Field Engineer said was Impossible in SharePoint and asked R300 000 just for a few hours of his time for “analysis

SharePoint 2010Word Automation Services available with SharePoint Server 2010 supports converting Word documents to other formats. This includes PDF. This article describes using a document library list item event receiver to call Word Automation Services to convert Word documents to PDF when they are added to the list. The event receiver checks whether the list item added is a Word document. If so, it creates a conversion job to create a PDF version of the Word document and pushes the conversion job to the Word Automation Services conversion job queue.

This article describes the following steps to show how to call the Word Automation Services to convert a document:

  1. Creating a SharePoint 2010 list definition application solution in Visual Studio 2010.
  2. Adding a reference to the Microsoft.Office.Word.Server assembly.
  3. Adding an event receiver.
  4. Adding the sample code to the solution.

Creating a SharePoint 2010 List Definition Application in Visual Studio 2010

This article uses a SharePoint 2010 list definition application for the sample code.

To create a SharePoint 2010 list definition application in Visual Studio 2010

  1. Start Microsoft Visual Studio 2010 as an administrator.
  2. From the File Menu, point to the Project menu and then click New.
  3. In the New Project dialog box select the Visual C# SharePoint 2010 template type in the Project Templates pane.
  4. Select List Definition in the Templates pane.
  5. Name the project and solution ConvertWordToPDF.
    Figure 1. Creating the Solution

    Creating the solution

  6. To create the solution, click OK.
  7. Select a site to use for debugging and deployment.
  8. Select the site to use for debugging and the trust level for the SharePoint solution.
    Note
    Make sure to select the trust level Deploy as a farm solution. If you deploy as a sandboxed solution, it does not work because the solution uses the Microsoft.Office.Word.Server assembly. This assembly does not allow for calls from partially trusted callers.
    Figure 2. Selecting the trust level

    Creating the solution

  9. To finish creating the solution, click Finish.

Adding a Reference to the Microsoft Office Word Server Assembly

To use Word Automation Services, you must add a reference to the Microsoft.Office.Word.Server to the solution.

To add a reference to the Microsoft Office Word Server Assembly

  1. In Visual Studio, from the Project menu, select Add Reference.
  2. Locate the assembly. By using the Browse tab, locate the assembly. The Microsoft.Office.Word.Server assembly is located in the SharePoint 2010 ISAPI folder. This is usually located at C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\ISAPI. After the assembly is located, click OK to add the reference.
    Figure 3. Adding the Reference

    Adding the reference

Adding an Event Receiver

This article uses an event receiver that uses the Microsoft.Office.Word.Server assembly to create document conversion jobs and add them to the Word Automation Services conversion job queue.

To add an event receiver

  1. In Visual Studio, on the Project menu, click Add New Item.
  2. In the Add New Item dialog box, in the Project Templates pane, click the Visual C# SharePoint 2010 template.
  3. In the Templates pane, click Event Receiver.
  4. Name the event receiver ConvertWordToPDFEventReceiver and then click Add.
    Figure 4. Adding an Event Receiver

    Adding an event receiver

  5. The event receiver converts Word Documents after they are added to the List. Select the An item was added item from the list of events that can be handled.
    Figure 5. Choosing Event Receiver Settings

    Choosing even receiver settings

  6. Click Finish to add the event receiver to the project.

Adding the Sample Code to the Solution

Replace the contents of the ConvertWordToPDFEventReceiver.cs source file with the following code.C

 
using System;
using System.Security.Permissions;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Security;
using Microsoft.SharePoint.Utilities;
using Microsoft.SharePoint.Workflow;

using Microsoft.Office.Word.Server.Conversions;

namespace ConvertWordToPDF.ConvertWordToPDFEventReceiver
{
  /// <summary>
  /// List Item Events
  /// </summary>
  public class ConvertWordToPDFEventReceiver : SPItemEventReceiver
  {
    /// <summary>
    /// An item was added.
    /// </summary>
    public override void ItemAdded(SPItemEventProperties properties)
    {
      base.ItemAdded(properties);

      // Verify the document added is a Word document
      // before starting the conversion.
      if (properties.ListItem.Name.Contains(".docx") 
        || properties.ListItem.Name.Contains(".doc"))
      {
        //Variables used by the sample code.
        ConversionJobSettings jobSettings;
        ConversionJob pdfConversion;
        string wordFile;
        string pdfFile;

        // Initialize the conversion settings.
        jobSettings = new ConversionJobSettings();
        jobSettings.OutputFormat = SaveFormat.PDF;

        // Create the conversion job using the settings.
        pdfConversion = 
          new ConversionJob("Word Automation Services", jobSettings);

        // Set the credentials to use when running the conversion job.
        pdfConversion.UserToken = properties.Web.CurrentUser.UserToken;

        // Set the file names to use for the source Word document
        // and the destination PDF document.
        wordFile = properties.WebUrl + "/" + properties.ListItem.Url;
        if (properties.ListItem.Name.Contains(".docx"))
        {
          pdfFile = wordFile.Replace(".docx", ".pdf");
        }
        else
        {
          pdfFile = wordFile.Replace(".doc", ".pdf");
        }

        // Add the file conversion to the conversion job.
        pdfConversion.AddFile(wordFile, pdfFile);

        // Add the conversion job to the Word Automation Services 
        // conversion job queue. The conversion does not occur
        // immediately but is processed during the next run of
        // the document conversion job.
        pdfConversion.Start();

      }
    }
  }
}

Examples of the kinds of operations supported by Word Automation Services are as follows:

  • Converting between document formats (e.g. DOC to DOCX)
  • Converting to fixed formats (e.g. PDF or XPS)
  • Updating fields
  • Importing “alternate format chunks”

This article contains sample code that shows how to create a SharePoint list event handler that can create Word Automation Services conversion jobs in response to Word documents being added to the list. This section uses code examples taken from the complete, working sample code provided earlier in this article to describe the approach taken by this article.

The ItemAdded(SPItemEventProperties) event handler in the list event handler first verifies that the item added to the document library list is a Word document by checking the name of the document for the .doc or .docx file name extension.

// Verify the document added is a Word document
// before starting the conversion.
if (properties.ListItem.Name.Contains(".docx") 
    || properties.ListItem.Name.Contains(".doc"))
{

If the item is a Word document then the code creates and initializes ConversionJobSettings and ConversionJob objects to convert the document to the PDF format.

C#
 
//Variables used by the sample code.
ConversionJobSettings jobSettings;
ConversionJob pdfConversion;
string wordFile;
string pdfFile;

// Initialize the conversion settings.
jobSettings = new ConversionJobSettings();
jobSettings.OutputFormat = SaveFormat.PDF;

// Create the conversion job using the settings.
pdfConversion = 
  new ConversionJob("Word Automation Services", jobSettings);

// Set the credentials to use when running the conversion job.
pdfConversion.UserToken = properties.Web.CurrentUser.UserToken;

The Word document to be converted and the name of the PDF document to be created are added to the ConversionJob.

 
// Set the file names to use for the source Word document
// and the destination PDF document.
wordFile = properties.WebUrl + "/" + properties.ListItem.Url;
if (properties.ListItem.Name.Contains(".docx"))
{
  pdfFile = wordFile.Replace(".docx", ".pdf");
}
else
{
  pdfFile = wordFile.Replace(".doc", ".pdf");
}

// Add the file conversion to the Conversion Job.
pdfConversion.AddFile(wordFile, pdfFile);

Finally the ConversionJob is added to the Word Automation Services conversion job queue.

 
// Add the conversion job to the Word Automation Services 
// conversion job queue. The conversion does not occur
// immediately but is processed during the next run of
// the document conversion job.
pdfConversion.Start();

How To : A library to create .mht files (available at request)

There are a number of ways to do this, including hosting Word or Excel on the Web Server and dealing with COM Interop issues, or purchasing third – party MIME encoding libraries, some of which sell for $250.00 or more. But, there is no native .NET solution. So, being the curious soul that I am, I decided to investigate a bit and see what I could come up with. Internet Explorer offers a File / Save As option to save a web page as “Web Archive, single file (*.mht)”.

Image

What this does is create an RFC – compliant Multipart MIME Message. Resources such as images are serialized to their Base64 inline encoding representations and each resource is demarcated with the standard multipart MIME header – breaks. Internet Explorer, Word, Excel and most newsreader programs all understand this format. The format, if saved with the file extension “.eml”, will come up as a web page inside Outlook Express; if saved with “.mht”, it will come up in Internet Explorer when the file is double-clicked out of Windows Explorer, and — what many do not know — if saved with a “*.doc” extension, it will load in MS Word, each with all the images intact, and in the case of the EML and MHT formats, with all of the hyperlinks fully-functioning. The primary advantage of the format is, of course, that all the resources can be consolidated into a single file,. making distribution and archiving much easier — including database storage in an NVarchar or NText type field.

 

System.Web.Mail, which .NET provides as a convenient wrapper around the CDO for Windows COM library, offers only a subset of the functionality exposed by the CDO library, and multipart MIME encoding is not a part of that functionality. However, through the wonders of COM Interop, we can create our own COM reference to CDO in the Visual Studio IDE, allowing it to generate a Runtime Callable Wrapper, and help ourselves to the entire rich set of functionality of CDO as we see fit.

 

One method in the CDO library that immediately came to my notice was the CreateMHTMLBody method. That’s MHTMLBody, meaning “Multipurpose Internet Mail Extension HTML (MHTML) Body”. Well!– when I saw that, my eyes lit up like the LED’s on a 32 – way Unisys box! This is a method on the CDO Message class; the method accepts a URI to the requested resource, along with some enumerations, and creates a MultiPart MIME – encoded email message out of the requested URI responses — including images, css and script — in one fell swoop.

 

“Ah”, you say, “How convenient”! Yes, and not only that, but we also get a free “multipart COM Interop Baggage” reference to the ADODB.Stream object – and by simply calling the GetStream method on the Message Class, and then using the Stream’s SaveToFile method, we can grab any resource including images, javascript, css and everything else (except video) and save it to a single MHT Web Archive file just as if we chose the “Save As” option out of Internet Explorer.

 

If we choose not to save the file, but instead want to get back the stream contents, no problem. We just call Stream.ReadText(Stream.Size) and it returns a string containing the entire MHT encoded content. At that point we can do whatever we want with it – set a content – header and Response .Write the content to the browser, for instance — or whatever.

 

For example, when we get back our “MHT” string, we can write the following code:

Response.ContentType=”application/msword”;
Response.AddHeader( “Content-Disposition”, “attachment;filename=NAME.doc”);
Response.Write(myDataString);

 

— and the browser will dutifully offer to save the file as a Word Document. It will still be Multipart MIME encoded, but the .doc extension on the filename allows Word to load it, and Word is smart enough to be able to parse and render the file very nicely. “Ah”, you are saying, “this is nice, and so is the price!”. Yup!

And, if you are serving this MIME-encoded file from out of your database, for example, and you would like it to be able to be displayed in the browser, just change the “NAME.doc” to “NAME.MHT”, and don’t set a content-type header. Internet Explorer will prompt the user to either save or open the file. If they choose “open”, it will be saved to the IE Temporary files and open up in the browser just as if they had loaded it from their local file system.

 

So, to answer a couple of questions that came up recently, yes — you can use this method to MHTML – encode any web page – even one that is dynamically generated as with a report — provided it has a URL, and save the MIME-encoded content as a string in either an NVarchar or NText column in your database. You can then bring this string back out and send it to the browser, images,css, javascript and all.

Now here is the code for a small, very basic “Converter” class I’ve written to take advantage of the two scenarios specified above. Bear in mind, there is much more available in CDO, but I leave this wondrous trail of ecstatic discovery to your whims of fancy:

using System;
using System.Web;
using CDO;
using ADODB;
using System.Text;
namespace PAB.Web.Utils
{
 public class MIMEConverter
 {
  //private ctor as our methods are all static here
  private MIMEConverter()
  {
   
  }   
  public static bool SaveWebPageToMHTFile( string url, string filePath)
  {
   bool result=false;
   CDO.Message  msg = new CDO.MessageClass(); 
   ADODB.Stream  stm=null ;
   try
   {
    msg.MimeFormatted =true;   
    msg.CreateMHTMLBody(url,CDO.CdoMHTMLFlags.cdoSuppressNone, "" ,"" );
stm = msg.GetStream(); stm.SaveToFile(filePath,ADODB.SaveOptionsEnum.adSaveCreateOverWrite); msg=null; stm.Close(); result=true; } catch {throw;} finally { //cleanup here } return result; } public static string ConvertWebPageToMHTString( string url ) { string data = String.Empty; CDO.Message msg = new CDO.MessageClass(); ADODB.Stream stm=null; try { msg.MimeFormatted =true; msg.CreateMHTMLBody(url,CDO.CdoMHTMLFlags.cdoSuppressNone,
"", "" );
stm = msg.GetStream(); data= stm.ReadText(stm.Size); } catch { throw; } finally { //cleanup here } return data; } } }

 

NOTE: When using this type of COM Interop from an ASP.NET web page, it is important to remember that you must set the AspCompat=”true” directive in the Page declaration or you will be very disappointed at the results! This forces the ASP.NET page to run in STA threading model which permits “classic ASP” style COM calls. There is, of course, a significant performance penalty incurred, but realistically, this type of operation would only be performed upon user request and not on every page request.

<

p align=”left”>The downloadable zip file below contains the entire class library and a web solution that will exercise both methods when you fill in a valid URI with protocol, and a valid file path and filename for saving on the server. Unzip this to a folder that you have named “ConvertToMHT” and then mark the folder as an IIS Application so that your request such as “http://localhost/ConvertToMHT/WebForm1.aspx&#8221; will function correctly. You can then load the Solution file and it should work “out of the box”. And, don’t forget – if you have an ASP.NET web application that wants to write a file to the file system on the server, it must be running under an identity that has been granted this permission.