Search text in PDF documents (C# sample)

While building apps with Apitron PDF Rasterizer for .NET, you have probably thought about implementing your own PDF text search.

It is useful for custom viewers, and it completes the PDF rendering toolkit: Apitron PDF Rasterizer for rendering, Apitron PDF Viewer (our free PDF viewer control) for viewing, and integrated text search for fast navigation to the text you need.

We did it for you.

From now on, Apitron PDF Rasterizer has an integrated text search engine that you can easily use in your apps.

The code

For demonstration purposes we will review a sample that opens a PDF file, searches for some text, renders the corresponding PDF pages, and highlights the results (the complete C# code can be found in the samples\SearchAndHighlightSpecifiedText folder of our download package).

Our PDF text search engine is built around a search index: before searching, we analyze the document and build its “index”, the data used for the actual search. You can store the index for later use, so it does not have to be rebuilt the next time the document is opened.

Index creation

FileStream pdfDocumentStreamToSearch = new FileStream( Path.Combine( pathToDocuments, "2003_ar.pdf" ), FileMode.Open, FileAccess.Read );

SearchIndex searchIndex = new SearchIndex( pdfDocumentStreamToSearch );

As you can see, index creation takes just a few lines of code. It’s also possible to save the index to an output stream for later use. Password-protected PDF files are also supported.
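To make the idea of a reusable index more concrete, here is a toy model of the concept. It is only a sketch and does not use or reproduce the Apitron index format: the class ToyTextIndex and its SaveTo and LoadFrom methods are hypothetical names invented for this illustration. It assumes the page texts have already been extracted, keeps them in a list, and shows that an index saved to a stream can be loaded later and searched without analyzing the document again.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Toy stand-in for a search index (hypothetical, not the Apitron class):
// it simply stores the extracted text of each page.
sealed class ToyTextIndex
{
    private readonly List<string> pageTexts = new List<string>();

    public ToyTextIndex(IEnumerable<string> texts)
    {
        pageTexts.AddRange(texts);
    }

    // Returns zero-based indices of the pages whose text contains the phrase.
    public List<int> Search(string phrase)
    {
        var hits = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (pageTexts[i].IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(i);
            }
        }
        return hits;
    }

    // Writes the page count followed by each page text.
    // The writer is not disposed so the caller keeps ownership of the stream.
    public void SaveTo(Stream output)
    {
        var writer = new BinaryWriter(output);
        writer.Write(pageTexts.Count);
        foreach (string text in pageTexts)
        {
            writer.Write(text);
        }
        writer.Flush();
    }

    // Reads back what SaveTo wrote, skipping the expensive analysis step.
    public static ToyTextIndex LoadFrom(Stream input)
    {
        var reader = new BinaryReader(input);
        int count = reader.ReadInt32();
        var texts = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            texts.Add(reader.ReadString());
        }
        return new ToyTextIndex(texts);
    }
}
```

A real index stores much more than raw text (for example, where each piece of text sits on the page), but the workflow is the same: build once, save, load, search.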

Text search

I used a PDF document from the Adobe website, http://www.adobe.com/aboutadobe/invrelations/pdfs/2003_ar.pdf, and all images attached to this post show actual output from the code sample.

    // 'document' and 'renderingSettings' are static fields of the sample class;
    // their declarations are omitted here.

    static void Main(string[] args)
    {
        string pathToDocument = @"..\Documents\2003_ar.pdf";

        // create index from PDF file
        using ( Stream pdfDocumentStreamToSearch = new FileStream( pathToDocument, FileMode.Open, FileAccess.Read ) )
        {
            SearchIndex searchIndex = new SearchIndex(pdfDocumentStreamToSearch);

            // create document used for rendering
            using ( Stream pdfDocumentStreamToRasterize = new FileStream( pathToDocument, FileMode.Open, FileAccess.Read ) )
            {
                document = new Document(pdfDocumentStreamToRasterize);

                // search text in PDF document and render pages containing results
                searchIndex.Search( SearchHandler, "software products derive" );
            }
        }
    }

    /// <summary>
    /// Handle search results here. Draw pages with highlighted text.
    /// </summary>
    /// <param name="handlerArgs">The handler args.</param>
    private static void SearchHandler(SearchHandlerArgs handlerArgs)
    {
        if (handlerArgs.ResultItems.Count != 0)
        {
            string outputFileName = string.Format("{0}.png", handlerArgs.PageIndex);

            Page page = document.Pages[handlerArgs.PageIndex];

            // render the page at twice its size and mark every result on it
            using (Image bm = page.Render((int)page.Width * 2, (int)page.Height * 2, renderingSettings))
            {
                foreach (SearchResultItem searchResultItem in handlerArgs.ResultItems)
                {
                    HighlightSearchResult(bm, searchResultItem, page);
                }

                bm.Save( outputFileName );
            }

            // open the saved image in the default viewer
            Process.Start( outputFileName );
        }

        // Search cancellation condition: stop once more than 3 results are found;
        // otherwise the search continues until all pages are searched
        if (handlerArgs.ResultItems.Count > 3)
        {
            handlerArgs.CancelSearch = true;
        }
    }

What happens here? We take the previously created index and call the SearchIndex.Search method, passing it a search handler and the text to find.
The handler processes the results one by one and highlights the found items by calling HighlightSearchResult; this method contains simple GDI+ code that draws a transparent rectangle around the found text (if any). The handler also sets a search cancellation condition, demonstrating the flexibility of the PDF search API.
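The body of HighlightSearchResult is not listed in this post, so here is a minimal sketch of what such GDI+ highlighting could look like. It is not the code from the sample: the class HighlightSketch and its two methods are hypothetical, and it assumes that the boundary of a found text block is available as a rectangle in page units with the origin at the bottom-left corner of the page. The first method converts that rectangle to pixel coordinates of the rendered image (origin at the top-left corner), and the second draws a semi-transparent yellow marker.

```csharp
using System;
using System.Drawing;

static class HighlightSketch
{
    // Maps a rectangle given in page units (origin at the bottom-left corner)
    // to pixel coordinates of the rendered image (origin at the top-left corner).
    public static RectangleF ToImageRectangle(
        double left, double bottom, double width, double height,
        double pageWidth, double pageHeight, int imageWidth, int imageHeight)
    {
        double scaleX = imageWidth / pageWidth;
        double scaleY = imageHeight / pageHeight;

        float x = (float)(left * scaleX);
        // flip the vertical axis: the top edge of the box becomes its image Y
        float y = (float)((pageHeight - bottom - height) * scaleY);

        return new RectangleF(x, y, (float)(width * scaleX), (float)(height * scaleY));
    }

    // Draws a semi-transparent yellow rectangle with an orange outline.
    public static void Highlight(Image image, RectangleF area)
    {
        using (Graphics g = Graphics.FromImage(image))
        using (Brush fill = new SolidBrush(Color.FromArgb(96, Color.Yellow)))
        using (Pen outline = new Pen(Color.Orange, 2f))
        {
            g.FillRectangle(fill, area);
            g.DrawRectangle(outline, area.X, area.Y, area.Width, area.Height);
        }
    }
}
```

For a 612 x 792 page rendered at double size (1224 x 1584 pixels), a text block at left 72, bottom 700, 100 units wide and 12 units high maps to the pixel rectangle X = 144, Y = 160, Width = 200, Height = 24. An axis-aligned rectangle like this only approximates rotated or curved text.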

Resulting images

Resulting image (see yellow markers)

One of the results produced by searching for “Intelligent Documents”

Result produced for spiral text

How to get it

The described PDF search engine is included in the latest Apitron PDF Rasterizer for .NET release; all related classes can be found in the Apitron.PDF.Rasterizer.Search namespace. We always welcome feedback, so feel free to ask questions and share ideas.
