SivarajMuthusamy Member since 2016 11 posts
Accenture
Posted: 3 years ago
Last activity: 2 years 11 months ago

Pega Robotic Studio: a long-awaited article on OCR. Extracting text from an image by using OpenSpan.

Hi All,

I have extracted text from an image by using the Microsoft.Office.Interop.OneNote 15.0 library.

You can find this library in Pega Robotics version 8.0.1030.

OneNote has a built-in feature to extract text from images.

Here I have added the scripts to extract the text, along with the steps to achieve it.

1. Delete any existing pages in OneNote by using the C# script below.

A. Add a reference to the Microsoft.Office.Interop.OneNote 15.0 library (the library itself, not the DLL).

B. Add a reference to System.Xml.Linq.

Script to delete

Imports :

using System;

using System.Linq;

using System.Xml.Linq;

using Microsoft.Office.Interop.OneNote;

Parameters:

out string atName,out Microsoft.Office.Interop.OneNote.Application onenoteApp

Body:

onenoteApp = new Microsoft.Office.Interop.OneNote.Application();

string notebookXml;

atName = "";

onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);

var doc = XDocument.Parse(notebookXml);

var ns = doc.Root.Name.Namespace;

 

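foreach (var pageNode in doc.Descendants(ns + "Page"))
{
    try
    {
        // Permanently delete the page (third argument is true).
        onenoteApp.DeleteHierarchy(pageNode.Attribute("ID").Value, DateTime.MinValue, true);
    }
    catch (Exception)
    {
        // Ignore pages that cannot be deleted and continue with the next one.
    }
}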
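
To see what the loop in step 1 iterates over without touching a real notebook, here is a minimal sketch that pulls the page IDs out of hierarchy XML. The sample XML is hand-written to resemble what GetHierarchy returns, and HierarchyDemo and PageIds are names I made up for illustration; they are not part of the original script.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HierarchyDemo
{
    // Hypothetical helper: returns the ID attribute of every Page element.
    public static List<string> PageIds(string hierarchyXml)
    {
        var doc = XDocument.Parse(hierarchyXml);
        var ns = doc.Root.Name.Namespace;
        return doc.Descendants(ns + "Page")
                  .Select(p => (string)p.Attribute("ID"))
                  .Where(id => id != null)
                  .ToList();
    }

    public static void Main()
    {
        // Hand-written sample (an assumption), not output captured from OneNote.
        string sample =
            "<one:Notebooks xmlns:one=\"http://schemas.microsoft.com/office/onenote/2013/onenote\">" +
            "<one:Notebook name=\"Demo\" ID=\"nb1\">" +
            "<one:Section name=\"New Section 1\" ID=\"sec1\">" +
            "<one:Page ID=\"page1\" name=\"First\" />" +
            "<one:Page ID=\"page2\" name=\"Second\" />" +
            "</one:Section></one:Notebook></one:Notebooks>";

        // Each ID printed here is what the script would pass to DeleteHierarchy.
        foreach (var id in PageIds(sample))
            Console.WriteLine(id);
    }
}
```

Printing the IDs first is a cheap way to check which pages would be removed before running the delete script against a notebook you care about.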

2. Create a new page to paste the image into. The same references have to be added.

Imports:

using Microsoft.Office.Interop.OneNote;

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using System.Xml;

using System.Xml.Linq;

Parameters:

Microsoft.Office.Interop.OneNote.Application onenoteApp, out string pageID

Body:

pageID="";

//var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();

string notebookXml;

onenoteApp.GetHierarchy(null, HierarchyScope.hsSections, out notebookXml);

var doc = XDocument.Parse(notebookXml);

var ns = doc.Root.Name.Namespace;

int i =0;

foreach (var sectionNode in from node in doc.Descendants(ns + "Section") select node)

{

//MessageBox.Show(sectionNode.Attribute("name").Value.ToString());

string sectionId = sectionNode.Attribute("ID").Value;

if (i == 0 && sectionNode.Attribute("name").Value == "New Section 1")
{
    onenoteApp.CreateNewPage(sectionId, out pageID, NewPageStyle.npsDefault);
    i++; // stop after the first matching section
}

}

3. Insert your image into the OneNote page after converting it into structured XML.

Import:

using System;

using System.Linq;

using System.Xml.Linq;

using Microsoft.Office.Interop.OneNote;

using System.Drawing;

using System.Drawing.Imaging;

using System.IO;

Parameters:

filename is the path of your locally saved image file.

string filename,string pageToBeChange,Microsoft.Office.Interop.OneNote.Application onenoteApp

Body:

string strNamespace = @"http://schemas.microsoft.com/office/onenote/2013/onenote";

string m_xmlImageContent =

"<one:Image><one:Size width=\"{1}\" height=\"{2}\" isSetByUser=\"true\" /><one:Data>{0}</one:Data></one:Image>";

string m_xmlNewOutline =

"<?xml version=\"1.0\"?><one:Page xmlns:one=\"{2}\" ID=\"{1}\"><one:Title><one:OE><one:T><![CDATA[{3}]]></one:T></one:OE></one:Title>{0}</one:Page>";

// string pageToBeChange = "Untitled page";

Bitmap bitmap = new Bitmap(filename);

MemoryStream stream = new MemoryStream();

bitmap.Save(stream, ImageFormat.Png);

string fileString = Convert.ToBase64String(stream.ToArray());

string notebookXml;

onenoteApp = new Microsoft.Office.Interop.OneNote.Application();

onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);
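
The script for step 3 as posted stops after GetHierarchy and never fills the two XML templates. Below is a minimal sketch of how they could be filled, using the same template strings and namespace as above. PageXmlDemo, BuildPageXml and the sample values are my assumptions for illustration, and the final call to UpdatePageContent is only indicated in a comment because it needs a running OneNote instance.

```csharp
using System;

public static class PageXmlDemo
{
    const string OneNs = "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Same templates as in the script above.
    const string ImageTemplate =
        "<one:Image><one:Size width=\"{1}\" height=\"{2}\" isSetByUser=\"true\" /><one:Data>{0}</one:Data></one:Image>";

    const string PageTemplate =
        "<?xml version=\"1.0\"?><one:Page xmlns:one=\"{2}\" ID=\"{1}\"><one:Title><one:OE><one:T><![CDATA[{3}]]></one:T></one:OE></one:Title>{0}</one:Page>";

    // Hypothetical helper: fills both templates and returns the page XML.
    public static string BuildPageXml(string base64Png, int width, int height,
                                      string pageId, string title)
    {
        string imageXml = string.Format(ImageTemplate, base64Png, width, height);
        return string.Format(PageTemplate, imageXml, pageId, OneNs, title);
    }

    public static void Main()
    {
        // Placeholder bytes stand in for the PNG bytes taken from the MemoryStream.
        string base64 = Convert.ToBase64String(new byte[] { 1, 2, 3 });
        string xml = BuildPageXml(base64, 640, 480, "page1", "Untitled page");
        Console.WriteLine(xml);

        // Assumption: in the real script this string would then be sent with
        // onenoteApp.UpdatePageContent(xml, DateTime.MinValue);
    }
}
```

In the real script, width and height would come from bitmap.Width and bitmap.Height, and pageId would be the ID returned by CreateNewPage in step 2.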

4. Read the text from the image.

Import:

using System;

using System.Linq;

using Microsoft.Office.Interop.OneNote;

using System.Xml.Linq;

using System.IO;

using System.Drawing;

using System.Drawing.Imaging;

Parameters:

The extracted text is shown in a text box (t).

TextBox t, string PageID, Microsoft.Office.Interop.OneNote.Application onenoteApp

Body:

onenoteApp = new Microsoft.Office.Interop.OneNote.Application();

int i = 0;

string notebookXml;

do

{

try

{

onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);

//MessageBox.Show(notebookXml);

 

var doc = XDocument.Parse(notebookXml);

var ns = doc.Root.Name.Namespace;

var pageNode = doc.Descendants(ns + "Page").Where(n =>

n.Attribute("ID").Value == PageID).FirstOrDefault();

//MessageBox.Show(pageNode.ToString());

if (pageNode != null)

{

string pageXml;

onenoteApp.GetPageContent(pageNode.Attribute("ID").Value, out pageXml);

//MessageBox.Show(XDocument.Parse(pageXml).ToString());

 

var doc1 = XDocument.Parse(pageXml);

var ns1 = doc1.Root.Name.Namespace;

//MessageBox.Show(ns1.ToString());

pageNode = doc1.Descendants(ns1 + "OCRText").FirstOrDefault();

//MessageBox.Show(pageNode.ToString());

string a = pageNode.ToString();

 

a = a.Substring(a.IndexOf("CDATA[")+6,(a.IndexOf("]]")-(a.IndexOf("CDATA["))-6));

t.Text =a;

i = 1;

}

}

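catch (Exception)
{
    // The OCR text may not be available yet; wait briefly, then retry.
    System.Threading.Thread.Sleep(200);
}

} while (i != 1);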
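
The script in step 4 cuts the text out of the CDATA section with Substring, which stops at the first "]]" in the recognized text, and it keeps looping until the OCR result appears. As an alternative, here is a minimal sketch that reads the text through XElement.Value and gives up after a fixed number of attempts. OcrTextDemo, ExtractOcrText, WaitForOcrText and the sample XML are assumptions for illustration, not part of the original script.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Xml.Linq;

public static class OcrTextDemo
{
    // Hypothetical helper: returns the first OCRText value,
    // or null if the page XML has no OCRText element yet.
    public static string ExtractOcrText(string pageXml)
    {
        var doc = XDocument.Parse(pageXml);
        var ns = doc.Root.Name.Namespace;
        var node = doc.Descendants(ns + "OCRText").FirstOrDefault();
        return node == null ? null : node.Value; // Value unwraps the CDATA section
    }

    // Hypothetical helper: calls getPageXml up to maxAttempts times,
    // sleeping delayMs between attempts, and returns null on timeout.
    public static string WaitForOcrText(Func<string> getPageXml, int maxAttempts, int delayMs)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string text = ExtractOcrText(getPageXml());
            if (text != null) return text;
            Thread.Sleep(delayMs);
        }
        return null;
    }

    public static void Main()
    {
        // Hand-written sample (an assumption), shaped like GetPageContent output.
        string sample =
            "<one:Page xmlns:one=\"http://schemas.microsoft.com/office/onenote/2013/onenote\" ID=\"page1\">" +
            "<one:Image><one:OCRData lang=\"en-US\">" +
            "<one:OCRText><![CDATA[Invoice 1234]]></one:OCRText>" +
            "</one:OCRData></one:Image></one:Page>";

        // In the real script the delegate would wrap
        // onenoteApp.GetPageContent(PageID, out pageXml).
        Console.WriteLine(WaitForOcrText(() => sample, 5, 100));
    }
}
```

Returning null on timeout lets the calling automation decide whether to retry, log, or show a message instead of hanging.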

 
