Merging SharePoint Word documents into a single PDF with iTextSharp

Question 1

I created a program which collects all the word files from a folder which belongs to our SharePoint library of documents. The program will only grab the documents if they were submitted/are a part of the HR team.

These documents are saved in a temporary folder, and then merged together into one large pdf.

I thought this seemed a bit superfluous, and figured there must be a way of achieving this without having to save the individual pdf files first. Ideally I would also like a way of dynamically creating a table of contents based on the files which are already present.

The files need to be in the correct order, so each document on SharePoint has been assigned an order number.

This list is retrieved so that I have something to order the documents against (as saving them in the file results in them being in their default, alphabetical order, appending the order number to the beginning of the file name also doesn't work, as this orders them as 1, 10, 11, 12 .. etc). I also feel that this can be improved upon. If I can work out how to convert the word files into PDFs without having to save them first, I could sort the FileInfo data before converting.

Does anyone have any ideas or suggestions on how I can improve/implement the above, or any other comments? Any advice/guidance at all is much appreciated.

Program

using System.Collections.Generic;
using System.IO;
namespace PdfConverter
{
 class Program
 {
 #region fields
 public static List<string> orderedList;
 #endregion fields
 static void Main(string[] args)
 {
 // Get the word files FileInfo
 FileInfo[] wordFileInfo = DataGrabber.GetWordFileInfo();
 // Instantiate a new 'Convert' object and convert our files to pdf, saving them in a temporary folder
 Converter convert = new Converter();
 convert.ToPdf(wordFileInfo);
 // Get the pdf filenames
 var newPdfFiles = DataGrabber.GetPdfFileNames();
 convert.MergePdfs(newPdfFiles);
 // Delete the files
 foreach (var item in orderedList)
 {
 var newFileName = "\\\\file\\IT\\SK\\test\\" + item.Split('.')[0] + ".pdf";
 System.IO.File.Delete(newFileName);
 }
 }
 }
}

DataGrabber

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.SharePoint.Client;
namespace PdfConverter
{
 class DataGrabber
 {
 #region fields
 public static string spSite = "http://SharepointSite/WorkingHere";
 public static string policyList = "Policies & Procedures";
 public static string sortBy = "Sort_x0020_Order";
 public static string sourceFolder = @"\\SharepointSite\WorkingHere\PoliciesandProcedures"; \\ network SP folder
 #endregion fields
 public ListItemCollection GetSpList(string siteName, string listName, string sortBy)
 {
 // Set the context (URL) of the site ...
 ClientContext cc = new ClientContext(siteName);
 Web web = cc.Web;
 // Get the list by title
 Microsoft.SharePoint.Client.List list = web.Lists.GetByTitle(listName);
 // A query to fileter the results
 CamlQuery caml = new CamlQuery();
 caml.ViewXml = "<View>" +
 "<Query>" +
 "<OrderBy>" +
 "<FieldRef Name='" + sortBy + "' Ascending='True'/>" +
 "</OrderBy>" +
 "</Query>" +
 //"<RowLimit>" + strCount + "</RowLimit>" +
 "</View>";
 // Store the items in a ListItemCollection
 ListItemCollection items = list.GetItems(caml);
 // Load the client context and execute
 cc.Load<Microsoft.SharePoint.Client.List>(list);
 cc.Load<ListItemCollection>(items);
 cc.ExecuteQuery();
 return items;
 }
 public static FileInfo[] GetWordFileInfo()
 {
 // Create a new data grabber
 DataGrabber d = new DataGrabber();
 // Get the SP list using spList - policyList - sortBy
 ListItemCollection spList = d.GetSpList(spSite, policyList, sortBy);
 // Get a list of the document titles from spList
 List<string> hrDocTitles = d.Sort(spList);
 // Get the fileinfo from these documents
 FileInfo[] wordFiles = d.GetFiles(hrDocTitles);
 return wordFiles;
 }
 public FileInfo[] GetFiles(List<string> fileNames)
 {
 // Get the fileinfo from each document we added to the docList 
 var length = fileNames.Count;
 string uriPath = sourceFolder;
 FileInfo[] wordFiles = new FileInfo[length];
 for (int j = 0; j < length; j++)
 {
 string path = uriPath + "\\" + fileNames[j];
 string localPath = new Uri(path).LocalPath;
 FileInfo fileInfo = new FileInfo(localPath);
 wordFiles[j] = fileInfo;
 }
 return wordFiles;
 }
 public List<string> Sort(ListItemCollection spList)
 {
 List<string> docList = new List<string>();
 // For each item in the SPList, check if it is HR and then add it to our lists
 foreach (ListItem litem in spList)
 {
 try
 {
 var rawTeam = litem.FieldValues["BRTeamHTField0"].ToString();
 var team = rawTeam.Substring(0, 2);
 var asd = litem.FieldValues["FileLeafRef"];
 string title = "";
 if (team == "HR" & litem.FieldValues["FileLeafRef"].ToString().Contains(".doc"))
 {
 try
 {
 title = litem.FieldValues["FileLeafRef"].ToString();
 }
 catch (Exception)
 {
 title = ""; 
 }
 docList.Add(title);
 }
 else
 {
 // Nothing
 }
 }
 catch 
 {
 }
 }
 // Update the ordered list field in the main program - used to maintain the correct document order
 Program.orderedList = docList;
 return docList;
 }
 public static string[] GetPdfFileNames()
 {
 // Get list of word files in specified directory
 string localPath = new Uri(sourceFolder).LocalPath;
 DirectoryInfo dirInfo = new DirectoryInfo(localPath);
 FileInfo[] pdfFiles = dirInfo.GetFiles("*.pdf");
 var length = pdfFiles.Length; // To exclude the existing HR Policies and Procedures Guide
 string[] pdfFileNames = new string[length];
 for (int i = 0; i < length; i++)
 {
 pdfFileNames[i] = localPath + "\\" + pdfFiles[i].Name;
 }
 Array.Sort(pdfFileNames);
 return pdfFileNames;
 }
 }
}

Converter

using iTextSharp.text;
using iTextSharp.text.pdf;
using Microsoft.Office.Interop.Word;
using System;
using System.IO;
namespace PdfConverter
{
 class Converter
 {
 #region fields
 public static string tempTargetFolder = @"\\file\IT\SK\test\";
 public static string targetPdf = @"\\file\IT\SK\test\Guide\testOutput.pdf";
 #endregion fields
 public void ToPdf(FileInfo[] wordFiles)
 {
 // Takes a FileInfo[] variable and convers each file in it to pdf, saving it to a temporary target folder tempTargetFolder
 // Create a new microsoft word application object
 Application word = new Application();
 word.Visible = false;
 word.ScreenUpdating = false;
 // I think this can actually be done away with and all oMissing parameters removed, as C# can handle optional parameters ...
 object oMissing = System.Reflection.Missing.Value;
 // for adding the file name to the temporary pdf files
 int counter = 1;
 foreach (FileInfo wordFile in wordFiles)
 {
 // Cast as object for word open method
 Object fileName = (Object)wordFile.FullName;
 // Use the dummy value as a placeholder for optional arguments
 Microsoft.Office.Interop.Word.Document doc = word.Documents.Open(ref fileName, ref oMissing,
 ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
 ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
 ref oMissing, ref oMissing, ref oMissing, ref oMissing);
 doc.Activate();
 object outputFileName = new object();
 if (wordFile.FullName.Contains(".docx"))
 {
 outputFileName = tempTargetFolder + wordFile.FullName.Split('\\')[5];
 outputFileName = outputFileName.ToString().Replace(".docx", ".pdf");
 }
 else if (wordFile.FullName.Contains(".doc"))
 {
 outputFileName = tempTargetFolder + wordFile.FullName.Split('\\')[5];
 outputFileName = outputFileName.ToString().Replace(".doc", ".pdf");
 }
 else
 {
 break;
 }
 object fileFormat = WdSaveFormat.wdFormatPDF;
 // Save document into pdf format
 doc.SaveAs(ref outputFileName,
 ref fileFormat, ref oMissing, ref oMissing,
 ref oMissing, ref oMissing, ref oMissing, ref oMissing,
 ref oMissing, ref oMissing, ref oMissing, ref oMissing,
 ref oMissing, ref oMissing, ref oMissing, ref oMissing);
 // Close the word document, but leave the word application open.
 // Doc has to be cast to type _Document so that it will find the
 // correct close method.
 object saveChanges = WdSaveOptions.wdDoNotSaveChanges;
 ((_Document)doc).Close(ref saveChanges, ref oMissing, ref oMissing);
 doc = null;
 counter++;
 }
 // word has to be cast to type _Application so that it will find the correct Quit method
 ((_Application)word).Quit(ref oMissing, ref oMissing, ref oMissing);
 word = null;
 }
 public void MergePdfs(string[] newPdfFiles)
 {
 using (FileStream stream = new FileStream(targetPdf, FileMode.Create))
 {
 iTextSharp.text.Document pdfDoc = new iTextSharp.text.Document(PageSize.A4);
 PdfCopy pdf = new PdfCopy(pdfDoc, stream);
 pdfDoc.Open();
 foreach (string file in Program.orderedList)
 {
 var newFileName = "\\\\file\\IT\\SK\\test\\" + file.Split('.')[0] + ".pdf";
 PdfReader test = new PdfReader(newFileName);
 pdf.AddDocument(test);
 test.Dispose();
 }
 // pdfDoc.Close(); // Doesn't seem to do anything
 pdf.Close();
 }
 }
 }
}

Question 2

Welcome to Code Review! I have rolled back the last edit. Please see what you may and may not do after receiving answers .

Question 3

orderedList: you should not have public fields in your class. Is it OK if someone else assigns an entirely new value to that field? If not then you should make a property with private setter. Also name isn't so helpful: orderer list of what? Even better: remove it. Static fields, properties and methods are terrible to test and in this case you don't even need it because it can be a function return value.

newFileName is built manually with an hard-coded constant (move it to a const private field) and a manually parsed file name. To obtain file name (it doesn't matter how trivial it seems) you should use Path methods, in this case Path.GetFileNameWithoutExtension(). It's both clear (to readers) and more reliable.

In my opinion your Main() method does little bit too many things. Why don't you extract few self-descriptive methods like ConvertPdf(), MergePdf() and DeletePdf()? It's a short method but we have too read (and understand) too much code just to have a coarse overview or program logic.

DataGrabber class isn't derived. Why don't you also mark it as sealed?

Many static strings are constants, why don't you declare them as private const string?

sortBy is an escaped XML fragment and you build XML document with string concatenation. Assuming sortBy won't change (make it clear making it const string) it's too error-prone and not human-friendly (IMO). You'd better move your XML file into resources, reading it with XmlDocument.Load() from ResourceManager.GetStream(). You can change attribute value using a simple XPath query. It's slower that raw manual approach but easier to read (also because you don't pollute your C# code with XML fragments) and less error-prone (XmlDocument will escape things for you).

ClientContext class implements IDisposable then I'd wrap its usage with using (to correctly dispose it also in case of errors).

Sort() code is too long and convoluted, in my opinion it needs to be more self-descriptive. What it is doing? What all those fields are?

I'd also change how you manage exceptions, for example:

try
{
 title = litem.FieldValues["FileLeafRef"].ToString();
}
catch (Exception)
{
 title = ""; 
}

Why that code may fail? Check instead of catch exceptions (C# is not Python) and for sure never ever catch base Exception. If there may be an error you can't prevent then catch exactly that exception otherwise it will hide possible bugs. On the same style: do not ever use catch { }. Never, no exceptions to this.

In your GetPdfFileNames() the length local variable (and its associated comment) are pretty useless. What's its purpose? Why don't you simply use pdfFiles.Length? As it is now it doesn't exclude anything, is it a bug or an outdated comment?

Also consider to use some LINQ, it may make your code slightly more readable. Note that you do that for just to concatenate FileInfo.Name with localPath however (because FileInfo are retrieved from a DirectoryInfo created from localPath) you already have that value in FileInfo.FullName property. One step more: to get full path you don't even need FileInfo:

var pdfFileNames = Directory.GetFiles(localPath, "*.pdf");

In Converter I'd change more or less same things I already suggested for other modules. Do not build path by hand, to change extension you can use Path.ChangeExtension(). Also always (especially when working with Word instances!) use using statement when applicable (also read How to release COM handle in .NET.)

Next (optional!) step would be to make your utility more testable and extensible. The fact you're converting using iTextSharp and that documents come from SharePoint are merely implementation details. If you hide this in abstract base classes (or interfaces) you may plug-in (for example) different sources and also test all your logic mocking-up class currently not under-test. As very final note I'd add some sort of retry-pattern. Things may go wrong with networks but a temporary error may be recovered if you wait few seconds (and you won't need to restart whole process from beginning), see also this post.

Question 4

Exactly! Use AddRange() if it's a List<T> otherwise make a public property (not field) of string[]. Even better REMOVE it, accessing a static field will make your code hard or impossible to test. Sorted list may simply be a function return value.

Question 5

Yes. Better to pass value (if it can't be designed in another way) . Think about a public static field as a global variable in - let's say - C. It has exactly same drawbacks. Of course in this small example the most obvious one is testability. Don't you feel strange to have objects (then instances) that share data through a static variable?

Question 6

I'm not sure why that is wrong, in my experience I have always had a need to have a public static field which can be accessed by other classes, but if this is bad practice I will need to stop doing it! I am thinking of splitting the whole thing out into some more classes (e.g. GetSpList), maybe this will give me a better understanding

Question 7

Usually, but it's just in my personal experience, I find easy to first describe what I want to do in plain English (actually in Italian) absolutely without any implementation detail. From that I pick keywords and I make them my base (often abstract) classes. Then I go with implementation. Globals are bad because anyone anywhere can modify them. Synchronization is a pain, classes become top coupled and testing is absolutely a nightmare (when not even impossible).

Question 8

It doesn't mean you can't use them but more often than not (I don't believe in axioms in our job) there is another better designed and even easier solution.

score 2 · Accepted Answer · 2016-01-18 15:31:32Z