Question
DocumentOCR - ProcessToText method output
I am trying to extract text from a scanned receipt using DocumentOCR's ProcessToText method. The method returns a string containing two whitespace characters (apart from ordinary line breaks), \u2028 and \u2029, which should help format the string to match the layout of the scanned file. However, I cannot figure out how to achieve that, and the method seems to process different parts of the image at random: not by lines, not by columns, not by any recognizable regions.
Attached is one example of the input for the method (test_ocr.jpg) along with a text file containing the output (output.txt); in that file, new lines and the ";" character stand in for the two whitespace characters. As you can see, there is no apparent order in which the method processes the text.
Is there a way to format the processed data to recreate (at least) relative positions of text from the original image?
Hello Jacub,
Unfortunately, this seemingly random parsing order is a limitation of the ABBYY OCR engine in cases like this (crumpled paper, tilted text, low resolution).
Try the ProcessToPdf method instead. It reads the image as one piece of text and puts it in a PDF as is. You can then use PdfConnector to parse the text from the PDF as you like. Some info on these methods: http://help.openspan.com/80/Components/DocumentOCR_Component.htm#ProcessToPdf and http://help.openspan.com/80/Components/PDFConnector_Component_Properties,_Methods,_and_Events.htm
A screenshot of a test automation and its PDF and TXT outputs are attached.
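As for the two separator characters mentioned in the question: in Unicode, \u2028 is LINE SEPARATOR and \u2029 is PARAGRAPH SEPARATOR. The following is only a rough sketch in Python (not OpenSpan code) of how the returned string could be split on them. It assumes the engine uses the characters in their standard Unicode sense, which I have not verified, and the function name `split_ocr_text` is made up for illustration. It makes the block structure visible but does not fix the order in which the blocks were recognized.

```python
# Unicode separators reported in the ProcessToText output.
LINE_SEP = "\u2028"   # LINE SEPARATOR
PARA_SEP = "\u2029"   # PARAGRAPH SEPARATOR


def split_ocr_text(text):
    """Split OCR output into paragraphs, each a list of non-empty lines.

    Assumption: \u2029 ends a block (paragraph) and \u2028 ends a line
    inside a block, as in the Unicode standard.
    """
    paragraphs = []
    for block in text.split(PARA_SEP):
        # Trim surrounding whitespace and skip lines that are empty.
        lines = [line.strip() for line in block.split(LINE_SEP)]
        lines = [line for line in lines if line]
        if lines:
            paragraphs.append(lines)
    return paragraphs


if __name__ == "__main__":
    # Made-up sample string, not taken from the attached output.txt.
    sample = "SHOP\u2028Main St 1\u2029Milk 1.20\u2028Bread 0.90\u2029"
    for number, lines in enumerate(split_ocr_text(sample), start=1):
        print(f"Block {number}:")
        for line in lines:
            print("   ", line)
```

Printing the blocks this way can at least show which pieces of text the engine grouped together, which helps when comparing against the PDF output.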