Posted: 16 Nov 2018 8:45 EST Last activity: 15 Apr 2019 5:43 EDT
DocumentOCR - ProcessToText method output
I am trying to extract text from a scanned receipt using DocumentOcr's ProcessToText method. The method returns a string containing two whitespace characters (apart from the ordinary space), \u2028 and \u2029, which should help format the string to match the layout of the scanned file. However, I cannot figure out how to achieve that, and the method seems to process different parts of the image in no discernible order: not by lines, not by columns, not by any kind of region.
Attached is one example of the input for the method (test_ocr.jpg) along with a text file containing the output (output.txt). In the text file, the new line and the ";" stand in for the two different whitespace characters. As you can see, there is no apparent order in which the method processes the text.
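For reference, U+2028 and U+2029 are the Unicode LINE SEPARATOR and PARAGRAPH SEPARATOR characters. Below is a minimal Python sketch of splitting a string on them. The function name split_ocr_text and the sample string are hypothetical (the sample is not the attached output), and the sketch assumes the returned string has already been copied into a plain string. It only groups the text into paragraphs and lines in the order the method returned them; it does not recover positions on the page.

```python
from typing import List

LINE_SEP = "\u2028"       # Unicode LINE SEPARATOR
PARAGRAPH_SEP = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def split_ocr_text(text: str) -> List[List[str]]:
    """Split a string into paragraphs, each a list of non-empty lines.

    Hypothetical helper: the order of the result is simply the order
    in which the text appears in the input string.
    """
    blocks = []
    for paragraph in text.split(PARAGRAPH_SEP):
        # Break each paragraph into lines and drop empty fragments.
        lines = [line.strip() for line in paragraph.split(LINE_SEP)]
        lines = [line for line in lines if line]
        if lines:
            blocks.append(lines)
    return blocks


if __name__ == "__main__":
    # Made-up sample, not taken from output.txt.
    sample = "Store Name\u2028123 Main St\u2029TOTAL\u202812.50"
    for index, block in enumerate(split_ocr_text(sample)):
        print(index, block)
```

Splitting this way makes the grouping visible, but since the string carries no coordinates, the order of the blocks is whatever order the method emitted them in.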
Is there a way to format the processed data so that it recreates at least the relative positions of the text in the original image?