Posted: 26 Dec 2019 10:39 EST Last activity: 30 Dec 2019 9:39 EST
OCR pattern extraction
I was trying to fetch particular details from an image, say a PAN card. We hit a blocker: for two similar scanned IDs, the OCR extracted the information in a different order. I was unable to find a pattern for pulling particular details out of the text. Any help on the above will be appreciated.
Please provide some more details. It is not clear from your post what you are asking for. Do you have any examples you can provide? Screenshots and specific behaviors are really helpful. With that, I might be able to offer some suggestions.
Thanks for the quick response. We are trying to extract the details from a scanned image using the documentocr component.
When trying to extract the details from the first sample identity card, the unique number which is at the bottom of the card is being extracted first (order of the text output).
When we extract a different sample, the text comes out in the same order as it appears on the identity card. How do we find a common pattern to fetch the details we need from the full extracted string, given that our OCR extracts all the text in the entire image? Please suggest. Happy to provide more information if needed.
I still do not think I fully understand what you are asking. It sounds like you have text that you've extracted via OCR and you are asking how to parse that text. I cannot answer that without that text. If you cannot provide the text as an example, then I would suggest looking into RegEx (Regular Expressions). These are useful for extracting patterns from text. I use a tool called "Expresso" to help build and test more complex RegEx although there may be better ones or even web-based ones as well.
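As a rough sketch of the RegEx idea, here is a small Python example. It assumes the card number follows the usual PAN layout (five letters, four digits, one letter); that format, the function name find_card_number, and both sample strings are my own illustrations and not taken from your output, so swap in whatever pattern your card actually uses. Because the pattern searches the whole string, it does not matter where in the OCR output the number lands.

```python
import re

# Assumed PAN-style layout: five letters, four digits, one letter.
# Adjust this pattern if your card number looks different.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def find_card_number(ocr_text):
    """Return the first PAN-style token found in the OCR text, or None."""
    match = PAN_RE.search(ocr_text.upper())
    return match.group(0) if match else None


# Two made-up OCR dumps holding the same fields in a different order.
sample_a = "ABCDE1234F\nINCOME TAX DEPARTMENT\nName\nJOHN DOE"
sample_b = "INCOME TAX DEPARTMENT\nName\nJOHN DOE\nABCDE1234F"

print(find_card_number(sample_a))
print(find_card_number(sample_b))
```

Both calls should return the same token even though the number is first in one sample and last in the other. Free-form fields such as the name have no fixed shape, so they usually need an anchor (a nearby label, for example) and I cannot suggest one without seeing the real text.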
We are mainly looking to extract the important details, such as the name and the card number (XXXX1234). But in both scenarios 1 and 2, the OCR output is out of order compared to the scanned card, and we couldn't find any common pattern we could use to extract the details. Please suggest.
Have you tried using the DocumentOCR component with the new PdfConnector functionality? When setting up a document type, you can add both scenarios to the document type and then test to see which is present. In my experience, when the OCR reads the document in a different order, it is usually either not the same document or there is some rotation on the document that has caused lines to skew. The DocumentOCR component handles rotation pretty well, but it is not an exact science.