Tuesday, March 19, 2013

Custom Crawler for Parsing PDF files with Sitecore

I recently had to create a crawler in my Sitecore 6.5 project that looked at PDF files in the Media Library and called an external API to get a list of PDF files to parse and index. You can do this with Sitecore, but the existing examples are old and no longer work, so getting it all working was a bit painful. It is not a hard process; the lack of working examples is what made it hard to put the parts together. I will break this into two parts: 1) create a custom crawler and 2) set up PDF indexing. You need both to make PDF indexing happen, and for neither could I find a working example.

Create a Custom Crawler

For the crawler I started with Sitecore’s documentation (section 3.2). The code there did not work as written, but it introduces what a crawler is and takes you most of the way there.

1) Create the new class

public class FileCrawler : BaseCrawler, ICrawler

2) Implement the two interface methods

void ICrawler.Add(IndexUpdateContext context)
void ICrawler.Initialize(Index index)

3) Add some properties

public string Root { get; set; }
public string Database { get; set; }

These properties are set via the configuration file you will set up in a moment. Root defines where in the Sitecore tree the crawler starts working. Database lets you change, via config, whether the crawler looks at the web or master database.

Add is the main entry point to your code. Initialize is also called, but you may not need to do anything in it; I only did a little logging there. The Add method is where all my code starts. With this you have all the code you need for a custom crawler. You can put a breakpoint inside the Add method, and once you complete the next steps you should be able to hit it (yes, that means you can attach to the process as you normally would for Sitecore debugging and debug the crawler).

Here is what my class now looks like:

using System.Diagnostics;      // Stopwatch
using Sitecore;
using Sitecore.Diagnostics;    // Log
using Sitecore.Search;
using Sitecore.Search.Crawlers;
 
public class FileCrawler : BaseCrawler, ICrawler
{
    // Set from the config file.
    public string Root { get; set; }
    public string Database { get; set; }
    public float Boost { get; set; }

    // Counters reported in the log at the end of each run.
    long _totalProcessedSize;
    int _fileCount;
    int _successCount;
    int _failureCount;
 
    void ICrawler.Add(IndexUpdateContext context)
    {
        _fileCount = 0;
        _totalProcessedSize = 0;
        _successCount = 0;
        _failureCount = 0;
 
        Stopwatch watch = Stopwatch.StartNew();
 
        ParseMediaLibraryFiles(context);
 
        watch.Stop();
        Log.Info(string.Format("Finished parsing files -- Total files:{0}(Errors:{1}-Success:{2}) -- {3}m:{4}s:{5}ms -- Total bytes {6}",
            _fileCount, _failureCount, _successCount, watch.Elapsed.Minutes, watch.Elapsed.Seconds,
            watch.Elapsed.Milliseconds, _totalProcessedSize), this);
    }

    void ICrawler.Initialize(Index index)
    {
        Log.Info("File Crawler Init", this);
    }
}

The call to ParseMediaLibraryFiles is where I will get into the PDF part of this post, but for now this is the code for my crawler.

4) Set up your config file

In Sitecore’s documentation they tell you to create a FileCrawler.config file. If you don’t already have a file in your project that holds your custom indexing configuration, you will need to create this new config file. If you already have one, you can add a new index, or a new entry inside the locations element of an existing index, to that file (these files live in <website>/app_config/include). Using the config details they provided I ran into all kinds of errors, such as “AddIndex method not found” or “Add method not found.” Here is what I set up to get it working.

 
<search>
  <configuration>
    <indexes>
      <index patch:after="index[@id='system']" id="MyIndexName" type="Sitecore.Search.Index, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <param desc="folder">MyIndexName</param>
        <Analyzer ref="search/analyzer" />
        <locations hint="list:AddCrawler">
          <tqsFiles type="MyNamespace.FileCrawler, MyNamespace">
            <Database>web</Database>
            <Root>/sitecore/media library/files/resources</Root>
            <Tags>PDFFiles</Tags>
            <Boost>1.0</Boost>
          </tqsFiles>
        </locations>
      </index>
    </indexes>
  </configuration>
</search>
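
If you are creating a brand new include file rather than adding to an existing one, the search block needs to sit inside the usual include-file wrapper so that the patch: prefix resolves. A minimal sketch of that wrapper is below; it assumes the standard Sitecore patch namespace and that the search section lives directly under the sitecore element, as it does in a stock 6.5 web.config. Check both against your own installation.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <search>
      <configuration>
        <indexes>
          <!-- The <index id="MyIndexName" ...> element from the listing above goes here. -->
        </indexes>
      </configuration>
    </search>
  </sitecore>
</configuration>
```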

Once you have this you should be able to go into Sitecore –> Control Panel –> Databases –> Rebuild Search Indexes and see your new index (“MyIndexName”). If you see your new index, you can attach to the w3wp process and put a breakpoint in the Add method. When your breakpoint is ready, make sure your new index is checked and click “Rebuild.” You should hit your breakpoint. That is it: you can now write whatever custom code you want in here, using your Database and Root properties to know where to look for the data. The context object passed into the Add method is where you add the new documents that go into the index. Just make sure you call context.AddDocument() for each one; without this the index will never get updated with your information.

Setup PDF Indexing

Now let's set up some code that grabs all the files in the media library and indexes any PDF file it finds. Again I was able to find an old Sitecore document on this subject (section 2.3 and chapter 5) that got me started, but it did not work on its own. Chapter 5 provides a link to some open source libraries you will need, but they are old; updated libraries for PDFBox can be found here. I will not bore you with all my code, just the important methods for PDF parsing.

First, download the zip file for PDFBox from the link above. When you unzip the file, make sure you unblock the files; otherwise you will get errors when trying to build. You will need to add these dlls as references to your project. The zip file comes with a lot of dlls, and I am not sure when each one is needed. Some are not needed at build time but are loaded at runtime, so add them to your bin folder.

Once you have the dlls and references set up you are ready for the main methods. I will touch on two methods here. The ParsePDF method does what you would think: it takes the stream from a media item and parses the text out of it. The AddPDFContent method takes a Lucene.Net Document object and adds the index fields to it.

protected virtual void AddPDFContent(Document document, MediaItem media)
{
   _totalProcessedSize += media.Size;
   _fileCount++;

   // Open the media stream once and dispose it when we are done with it.
   using (Stream mediaStream = media.GetMediaStream())
   {
       if (mediaStream != null && mediaStream.CanRead)
       {
           document.Add(this.CreateTextField(BuiltinFields.Content, this.ParsePDF(mediaStream, media.Name)));
           document.Add(this.CreateTextField(BuiltinFields.Name, media.Name));
       }
   }
}

private string ParsePDF(Stream mediaStream, string fileName)
{
   PDDocument doc = null;
   ikvm.io.InputStreamWrapper wrapper = null;
 
   try
   {
       // Wrap the .NET stream so the Java-based PDFBox API can read it.
       wrapper = new ikvm.io.InputStreamWrapper(mediaStream);
       doc = PDDocument.load(wrapper);
       PDFTextStripper stripper = new PDFTextStripper();
       var docText = stripper.getText(doc);
 
       _successCount++;
 
       return docText;
   }
   catch (Exception ex)
   {
       Log.Error("Error parsing " + fileName + " for indexing", ex, this.GetType());
       _failureCount++;
       return String.Empty;
   }
   finally
   {
       // Close each resource independently so a failed load still releases the wrapper.
       if (doc != null)
       {
           doc.close();
       }
       if (wrapper != null)
       {
           wrapper.close();
       }
   }
}

The ParsePDF method does the work of reading the stream from the PDF file and getting the text from it. Then it just returns that string to the AddPDFContent method which puts that string in the BuiltinFields.Content field (when looking at the index this is the “_content” field).

You can add a text field (this.CreateTextField) or a data field (this.CreateDataField) to the document. Text fields are used by the index to find hits, and data fields let you programmatically access information about the document when a hit is found. So if you want a value to be both searchable and programmatically accessible, add both a text field and a data field for it.
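
As a small illustration of pairing the two field types, here is a sketch that indexes the media item's name both ways. AddNameFields is a hypothetical helper name of mine, not part of Sitecore, and the sketch assumes it is placed inside the FileCrawler class so that the CreateTextField and CreateDataField helpers inherited from BaseCrawler are available.

```csharp
using Lucene.Net.Documents;
using Sitecore.Data.Items;
using Sitecore.Search;

// Place inside the FileCrawler class (hypothetical helper, not a Sitecore API).
protected virtual void AddNameFields(Document document, MediaItem media)
{
    // Text field: lets the index match search terms against the name.
    document.Add(this.CreateTextField(BuiltinFields.Name, media.Name));

    // Data field: lets your code read the name back from a hit.
    document.Add(this.CreateDataField(BuiltinFields.Name, media.Name));
}
```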

That is it. Now just pass in the root path to where your PDF files are, get the media stream from those files and call these methods.
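
To tie the pieces together, here is a rough sketch of what the ParseMediaLibraryFiles method called from Add could look like. This is not my exact code, only an outline under a few assumptions: every descendant of Root is visited, PDFs are recognised by the media item's Extension value being "pdf" (no leading dot), and one Lucene document is created per file.

```csharp
using System;
using Lucene.Net.Documents;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;
using Sitecore.Search;

// Place inside the FileCrawler class, next to AddPDFContent and ParsePDF.
private void ParseMediaLibraryFiles(IndexUpdateContext context)
{
    // Fully qualified because the crawler's own Database property is a string.
    Sitecore.Data.Database db = Sitecore.Configuration.Factory.GetDatabase(this.Database);

    Item root = db.GetItem(this.Root);
    if (root == null)
    {
        Log.Warn("File Crawler: root item not found: " + this.Root, this);
        return;
    }

    foreach (Item item in root.Axes.GetDescendants())
    {
        MediaItem media = new MediaItem(item);

        // Assumption: folders and non-PDF media have a different (or empty) extension.
        if (!string.Equals(media.Extension, "pdf", StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        Document document = new Document();
        this.AddPDFContent(document, media);

        // Without this call nothing is written to the index.
        context.AddDocument(document);
    }
}
```

Walking every descendant is the simplest approach; if your media folder is large you may want to narrow the Root path in the config rather than filter more in code.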

After I had finished my coding I finally did find a good example on CodeProject. So hopefully between this post and that one you can get what you need.
