(But this is just for testing: you could also, for instance, create PDFs in memory using HTMLTOPDF or fetch a PDF from a website; so long as you can get the PDF bytes, the same approach should work.)
Running the Activity shows the extracted text in the Clipboard Property.
You would of course need to write additional logic to parse the text; alternatively, you could probably use other PDFBox APIs to parse the structure of the PDF in a different way.
Thanks for the additional information. You should realize that PDFs are not simply a 'wrapper' with some hidden XML inside them; they are a printable/presentation format, so structures such as TABLEs (which exist in HTML/XML) are not necessarily present in a nice, easy-to-parse form.
You can get at structures such as 'pages' within PDFs if that will help you: see this Stack Overflow post for more information on that. Possibly you can get other structures such as paragraphs or blocks of text, but I've never gone to that level myself; the Javadocs/examples for PDFBox (or perhaps 'iText', which is also present in PRPC OOTB, although in quite an old version) may provide examples.
Are you always looking for URLs in the PDFs? If so, you can probably use a regular expression for this. You might need a 'human-approval' stage at the end, but it should be able to grab a lot of the information that way.
(I'm not sure why you said that looking for 'co' or 'com' is not appropriate here. Do you mean it doesn't find all the text you need?)
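As a rough illustration of the regular-expression idea, here is a minimal sketch in plain Java that pulls URL-like strings out of a block of extracted text. The class name `UrlFinder`, the method `findUrls`, and the pattern itself are my own assumptions for illustration; the pattern is deliberately loose and will need tuning against your real PDFs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or PRPC.
class UrlFinder {
    // Loose pattern (an assumption): "http://", "https://" or "www."
    // followed by any run of characters that are not whitespace, quotes or angle brackets.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)[^\\s<>\"']+");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            // Drop sentence punctuation that the loose pattern picks up at the end.
            String url = m.group().replaceAll("[.,;:!?)]+$", "");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "Visit https://example.com/docs, or www.example.org.\nNo link on this line.";
        for (String url : findUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

Anything the pattern misses or over-matches is what a 'human-approval' stage would catch.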
Additionally: are all the PDFs essentially composed of two columns of data?
If so, you should be able to use the API to differentiate text on the left-hand side from text on the right-hand side; you *might* also be able to use the background colour to help you identify the text.
One more thing: the PDFTextStripper should have returned a big block of text that includes line endings, so you should be able to parse this text one line at a time, which should then allow you to start locating the text you need.
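A minimal sketch of that line-at-a-time parsing, in plain Java, might look like the following. It assumes (purely for illustration) that the interesting lines have a "Label: value" shape; the class name `LineParser` and the method `parseLabelledLines` are hypothetical, and your PDFs may well need a different rule per line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or PRPC.
class LineParser {
    // Collects "Label: value" lines from the extracted text, in document order.
    static Map<String, String> parseLabelledLines(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Split on any common line ending: \r\n, \n or \r.
        for (String line : text.split("\\r?\\n|\\r")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no label on this line
            }
            String label = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!label.isEmpty() && !value.isEmpty()) {
                fields.put(label, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String extracted = "Name: Alice\r\nRef: 12345\nsome free text\n";
        System.out.println(parseLabelledLines(extracted));
    }
}
```

Note that a line containing a URL would also be split at its first colon under this rule, so in practice you would probably check for known labels rather than accept every colon.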
Thanks John, this was really helpful for solving one of my issues. Can you please suggest how to use the same approach to parse PDFs attached to my Case? These PDFs will be attached during case creation through an email listener. I need to parse some information from each PDF and show it on the Case UI.
I have a requirement to parse a PDF using eForm. I am using the activity ExtractDataFromEForm and am able to get the binary data on pyEform, but I am getting an error saying it is unable to extract data from pyEform. I'm using 7.1.9.