Jakub Lebeda (JakubL28)
PricewaterhouseCoopers Advisory SRO

Posted: November 16, 2018
Last activity: April 15, 2019

DocumentOCR - ProcessToText method output

I am trying to extract text from a scanned receipt using DocumentOcr's ProcessToText method. The method returns a string containing two special whitespace characters (apart from regular whitespace), \u2028 and \u2029, which should help format the string to match the layout of the scanned file. However, I cannot figure out how to achieve that, and the method seems to process different parts of the image in random order: not by lines, not by columns, and not by any recognizable regions.
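For reference, here is a minimal sketch of how the two characters could be pulled apart. It is written in Python rather than in the automation itself, just to illustrate the idea. In Unicode, \u2028 is LINE SEPARATOR and \u2029 is PARAGRAPH SEPARATOR; whether DocumentOcr uses them with exactly that meaning is an assumption on my part. The helper name `split_ocr_text` and the sample string are made up for illustration and are not real OCR output.

```python
LINE_SEP = "\u2028"       # Unicode LINE SEPARATOR
PARAGRAPH_SEP = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def split_ocr_text(text):
    """Split OCR text into paragraphs, each a list of non-empty lines.

    Assumes (not confirmed for DocumentOcr) that \u2029 separates
    blocks of text and \u2028 separates lines inside a block.
    """
    paragraphs = []
    for block in text.split(PARAGRAPH_SEP):
        # Break the block into lines and drop surrounding whitespace.
        lines = [line.strip() for line in block.split(LINE_SEP)]
        lines = [line for line in lines if line]
        if lines:
            paragraphs.append(lines)
    return paragraphs


# Hypothetical sample string, not taken from the attached output.txt.
sample = "TOTAL\u202812.50\u2029Thank you\u2028for shopping"
for index, paragraph in enumerate(split_ocr_text(sample)):
    print(index, paragraph)
```

This only recovers how the text is grouped into blocks and lines. It does not say anything about where each block sat on the page, which is the part I am still missing.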

Attached is one example of the input for the method (test_ocr.jpg), along with a text file containing the output (output.txt). In that file, a new line and ";" stand in for the two different whitespace characters. As you can see, there is no apparent order in which the method processes the text.

Is there a way to format the processed data to recreate (at least) relative positions of text from the original image?

Robotic Process Automation