Need to parse html content

Question

NageshwaraR

Member since 2016

5 posts

l & t infotech

Posted: 31 May 2017 8:53 EDT
Last activity: 16 Oct 2018 12:03 EDT

Closed

Hi All,

We have a requirement: we receive case ID information by email, and the email body is in HTML format. We need to parse the HTML, extract the case ID from it, and then search for that case in Pega.

Could you please help us convert the HTML content to text? I know the service has an "inline - prefer text" option, but the service carries HTML, and we must not change that format.

Whatever HTML comes in is what we need to parse.

**Moderation Team has archived post**

This post has been archived for educational purposes. Contents and links will no longer be updated. If you have the same/similar question, please write a new post.

Low-Code App Development

Posted: 31 May 2017 9:08 EDT

JOHNPW_GCS replied to NageshwaraR

Do you know whether the HTML is guaranteed to be well-formed? (If so, you could perhaps treat the HTML as XML and use a PARSE-XML rule here.)

Posted: 5 Jun 2017 5:22 EDT

NageshwaraR

l & t infotech

replied to JOHNPW_GCS

Thank you for your reply.

The HTML is formatted like the sample below, and I need the Subject from each record. Please help me parse it. If you save the code as an .html file, you can see the table that arrives; the system generated the sample HTML below.

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<h3>Records Count = 2 <o:p></o:p></h3>
<table class="MsoNormalTable" border="1" cellpadding="0">
<tbody>
<tr>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal" align="center" style="text-align:center"><b>EmailID <o:p></o:p></b></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal" align="center" style="text-align:center"><b>UserName <o:p></o:p></b></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal" align="center" style="text-align:center"><b>Subject <o:p></o:p></b></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal" align="center" style="text-align:center"><b>SentOn <o:p></o:p></b></p> </td>
</tr>
<tr>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">1234<o:p></o:p></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">TestUser1<o:p></o:p></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal"> Term Quote Requested: – ABCD1234<o:p></o:p></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">2017-05-31T14:22:52.167<o:p></o:p></p> </td>
</tr>
<tr>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">3215<o:p></o:p></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">TestUser2<o:p></o:p></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">Term Quote Requested: – ABCD5689<o:p></o:p></p> </td>
<td style="padding:.75pt .75pt .75pt .75pt"> <p class="MsoNormal">2017-05-31T08:53:30.387<o:p></o:p></p> </td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>

Please help me parse this message and extract the Subject from it. Thank you.

Posted: 6 Jun 2017 6:51 EDT

Murali.Krishnan.D

Coforge DPA

replied to NageshwaraR

Per your requirement, you need the value of the <P> inside the third <TD> of every <TR> except the first one (the header row).

You may have to create a Parse XML rule to map these specific <TD> elements to a list property on the clipboard, then use the "Apply-Parse-XML" method in an activity to execute it.

If not, you can also use the Java JAXP APIs to parse the XML (in your case, the HTML string) and read the needed values with specific XPath expressions.
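The JAXP route could look roughly like the sketch below. It assumes the HTML has already been made well-formed (an XML parser rejects unclosed tags and undeclared entities such as `&nbsp;`), and the class and method names are hypothetical, not part of Pega or of the original post. The XPath uses `local-name()` so that the default namespace on the `html` element does not get in the way.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
final class SubjectColumnReader {

    // Third <td> of every <tr> except the first <tr> under the same parent.
    // local-name() sidesteps the default namespace declared on <html>.
    private static final String SUBJECT_CELLS =
        "//*[local-name()='tr'][position() > 1]/*[local-name()='td'][3]";

    // Returns the trimmed text of each subject cell, in document order.
    static List<String> readSubjects(String wellFormedHtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Email bodies are untrusted input, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedHtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells =
                (NodeList) xpath.evaluate(SUBJECT_CELLS, doc, XPathConstants.NODESET);

            List<String> subjects = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                subjects.add(cells.item(i).getTextContent().trim());
            }
            return subjects;
        } catch (Exception e) {
            // Covers malformed input as well as parser configuration problems.
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }
}
```

Calling `SubjectColumnReader.readSubjects(cleanedHtml)` on a cleaned-up copy of the email would return one string per data row. Input that is not well-formed ends up in the `IllegalArgumentException` branch.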

Posted: 8 Jan 2018 9:05 EST

JAYASHREENS

Tata Consultancy Services

replied to Murali.Krishnan.D

Hi, can you provide detailed guidance on how to achieve this?

Posted: 8 Jan 2018 9:58 EST

JOHNPW_GCS replied to NageshwaraR

Unfortunately, that HTML is *not* well-formed, so you will *not* be able to use a PARSE-XML rule against it.

It contains two unclosed 'meta' tags in the 'head' section of the document.

You would first have to clean these up using another method (you will need to treat the document as a 'string' first, then extract the 'clean' XML).
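One possible string-level clean-up is sketched below in Java. It is not how the attached cleaned-up XML was produced. It rewrites unclosed 'meta' tags (and a few other tags that never have a closing tag) in self-closing form. The class name, the method name and the list of tags are assumptions, and the regex assumes that no attribute value contains a '>' character.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class VoidTagCloser {

    // Matches <meta ...>, <br>, <hr>, <img ...>, <link ...> and <input ...>,
    // whether or not the tag already ends in "/>".
    private static final Pattern VOID_TAG = Pattern.compile(
        "<(meta|br|hr|img|link|input)\\b([^<>]*?)\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Rewrites each matched tag in self-closing form: <meta a="b"> becomes <meta a="b"/>.
    static String selfCloseVoidTags(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2/>");
    }
}
```

Tags that are already self-closed come out unchanged, so running the method twice is harmless. This only addresses unclosed tags; other problems, such as undeclared HTML entities, would need their own handling.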

Additionally, this HTML (at least the version I 'scraped' from the post) is not in the default (UTF-8) encoding; it appears to be in "Windows-1252" format, so you might also need to take this into account.
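If the raw bytes of the message body are available, the encoding can be made explicit when they are turned into a String. The Java sketch below assumes the sender really does use Windows-1252; the class and method names are hypothetical, and how you get at the raw bytes depends on your email setup and is not shown. In Windows-1252 the byte 0x96 is the en dash that appears in the sample subjects, whereas decoding that same byte as UTF-8 produces the replacement character.

```java
import java.nio.charset.Charset;

// Hypothetical helper; names are illustrative only.
final class MailBodyDecoder {

    // Assumption: the incoming body is Windows-1252, as the scraped sample suggests.
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Decodes the raw body bytes with an explicit charset instead of the platform default.
    static String decode(byte[] rawBody) {
        return new String(rawBody, WINDOWS_1252);
    }
}
```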

And lastly, I'm not sure how much control you have over the 'shape' of the incoming HTML; if the format is even slightly different (for instance, if text is or isn't contained within a bold "<B>" element), your mapping will fail.

But for the sake of demonstration, here are some screenshots showing how to build a PARSE-XML rule that can extract portions of the (cleaned-up) XML.

I have attached the cleaned-up XML for reference.

For demonstration, it maps only the text of the *first* TD element (which is contained within a BOLD tag), but you could extend this rule to map other elements. (You could even use Repeating Structures here, but they are more complicated to build, so I went for the simpler option.)

1. Create a PRPC Text Property; in this example, "Extract":

2. Create a PARSE-XML rule (from the 'Integration Mapping' category) and call it 'ExtractTextFromEmail' (but we will *change* the namespace, see below).

Change the root element to 'html' and double-click it:

Click on the 'node' tab and change the 'Node Namespace' to "http://www.w3.org/TR/REC-html40" (because the input XML has the 'xmlns' directive set on the 'html' element).

3. Follow the structure of the input XML document; use 'add element' to add the sub-tree html>body>div>table>tbody>tr>td>p>b

4. Map the final 'b' element to the property (double-click the element):

5. Test the rule: use the ACTION button > Run.

Use the "Text to be parsed" option, paste in the complete ('cleaned-up') XML, and press 'execute':

Note that the XML representation of the Page correctly shows the text 'EmailID' (picked out of the input XML) mapped to the 'ExtractProperty':

Again, I think this approach (using PARSE-XML to process incoming HTML) isn't the ideal solution here, mainly because:

1. You have to 'pre-process' the HTML to make it well-formed and to give it the correct character-set directive.

2. It will break if the 'shape' of the incoming XML changes.

You might want to consider building something in Java (where you can do lower-level processing, for example treating the input as a String), then importing that Java as a standalone library/class into PRPC and 'wrapping' it from PRPC.
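As a rough idea of what such a standalone class might look like, here is a sketch that treats the email body purely as a String and so does not need well-formed input. Everything in it is an assumption for illustration: the class and method names are made up, the regexes assume a single flat table with no nested tables, and `lastToken` assumes the case ID is the last space-separated word of the subject (as in "ABCD1234" in the sample).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone utility; names are illustrative only.
final class HtmlTableText {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<td\\b[^>]*>(.*?)</td>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the text of the given zero-based column for every row after the first (header) row.
    static List<String> column(String html, int columnIndex) {
        List<String> values = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        boolean headerSkipped = false;
        while (row.find()) {
            if (!headerSkipped) {
                headerSkipped = true;
                continue;
            }
            Matcher cell = CELL.matcher(row.group(1));
            for (int i = 0; cell.find(); i++) {
                if (i == columnIndex) {
                    // Drop nested tags such as <p> and <o:p>, then tidy the whitespace.
                    String text = TAG.matcher(cell.group(1)).replaceAll("");
                    values.add(text.replace('\u00A0', ' ').trim());
                    break;
                }
            }
        }
        return values;
    }

    // Assumption: the case ID is the last space-separated token of the subject.
    static String lastToken(String subject) {
        String trimmed = subject.trim();
        return trimmed.substring(trimmed.lastIndexOf(' ') + 1);
    }
}
```

With the sample email, `HtmlTableText.column(html, 2)` would give the two Subject values, and `lastToken` applied to each would give the value to search for in Pega. This is just as sensitive to changes in the 'shape' of the HTML as the PARSE-XML mapping, but it skips the clean-up step.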
