Posted: 5 Mar 2018 7:54 EST Last activity: 9 Mar 2018 6:09 EST
How to read form data and especially checkboxes in a PDF file
I have a requirement to read a PDF file that contains some standard data and form data (text boxes, check boxes, etc.). We tried the PDF Connector and were able to read some of the data in the PDF, but we are having trouble reading the value of a check box.
Please find a few more details below.
How can we read a check box in a PDF file to identify whether it is checked or unchecked? In this case the check box has an ‘X’ mark, and the PDF file has some data in read-only mode.
Kindly share your inputs for both read-only and editable form data.
The PDF Connector allows you to modify the settings to read your PDF file as lines, segments and words. The settings are very specific to how the PDF is constructed. Once you have found the settings that allow you to properly grab the lines, segments and words from your document, you can read through them one line at a time. Each segment or word has a Left property. This will tell you the offset in the line for the segment or word. With some detective work you can find the offset for your checkbox and then determine whether it is checked or not based on the value of the segment or word at that location.
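The offset lookup described above could be sketched roughly as follows. This is plain Python for illustration only, not the PDF Connector's API: the `Segment` class, the `is_checked` helper, the tolerance value and the assumption that a checked box shows up as an "X" at a known Left offset are all hypothetical and would need to be confirmed against your own document.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one segment or word read from a line."""
    text: str
    left: float  # offset of the segment within the line


def is_checked(line, checkbox_left, tolerance=2.0, mark="X"):
    """Return True if a segment near checkbox_left contains the mark.

    Assumes (not confirmed by the thread) that a checked box is
    extracted as the text "X" and an unchecked box as nothing or
    as some other text at that offset.
    """
    for seg in line:
        if abs(seg.left - checkbox_left) <= tolerance:
            return seg.text.strip().upper() == mark.upper()
    # No segment found at the expected offset: treat as unchecked.
    return False


# Example: one line where the box at offset 72 is marked.
line = [Segment("X", 72.0), Segment("Yes", 90.0)]
print(is_checked(line, checkbox_left=72.0))
```

The tolerance is there because the Left value of the same box may vary slightly between documents; the right value is something you would find by inspecting real output.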
I have attached a screenshot of a sample PDF form in which the check boxes that are checked are highlighted in yellow. I have also attached a sample automation for reference.
As part of the POC we went through the trial and error below to find a solution. Please find a few details:
1. After using the PDF Connector, we could not identify the check box by any unique flag that would tell us whether it is checked or unchecked. We also tried to get the check box value using the Left property of the segment, but could not determine the value (checked/unchecked).
2. Can you please share some pointers on which category the check box falls under?
Please also confirm which category a radio button falls under in the PDF Connector.
I have been testing with your form. By setting the word threshold lower (it is 2.2 by default) you can isolate the check boxes from the text next to them. I don't know how they are represented when they are checked however.
I set the Word threshold to 1 and this is how the words were delimited. Notice how the checkboxes are highlighted by themselves.
Each word has a left value - the starting pica value. If you read through the words in order, whenever the left value is less than the previous word's left value, it means you are on a new line. I use this to assign a line number to each word. Then you can index each word by line number and left position. Here is an example.
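For readers who want the line-numbering idea spelled out, here is a minimal sketch in plain Python. It is not the automation from this thread: the `(text, left)` tuples, the `assign_line_numbers` name and the sample values are assumptions made for illustration, and it assumes the words arrive in reading order.

```python
def assign_line_numbers(words):
    """Index words by (line number, left position).

    words: iterable of (text, left) tuples in reading order.
    A new line starts whenever the left value drops below the
    left value of the previous word.
    """
    indexed = {}
    line_no = 0
    prev_left = None
    for text, left in words:
        if prev_left is not None and left < prev_left:
            line_no += 1  # left value went backwards: new line
        indexed[(line_no, left)] = text
        prev_left = left
    return indexed


# Hypothetical sample: two lines of words with made-up left values.
words = [("Name", 10.0), ("John", 60.0), ("X", 12.0), ("Yes", 30.0)]
index = assign_line_numbers(words)
print(index[(1, 12.0)])
```

Once the words are indexed this way, the check box on a given line can be looked up directly by its line number and left position.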