<img ....> gets corrupted when using parse-html #20

Open

Description

  • Mercury Parser API Version: Latest
  • Node Version: 8

Expected Behavior

The parser should not corrupt the <img> content.

Current Behavior

The original <img> tag is

 <img src="https://cdn.example-domain.com/example1.jpg"/>

and after parsing it becomes

 <img src="https://www.example-domain.com/%22https://cdn.example-domain.com/example1.jpg/%22/">

Steps to Reproduce

  1. Take the following HTML
<html>
<head>
<body> 
Main content
<br/>
<img src="https://cdn.example-domain.com/example1.jpg"/>
More content
<br/>
More Content to Simulate main content.
<img src="https://cdn.example-domain.com/example2.jpg"/>
</body>
</html>
  2. Call the API at the path /parse-html. The API takes a POST with a JSON object containing a URL and HTML. The HTML is the markup from step 1, first converted to the following format:
<html>\\n<head>\\n<body>\\nMain content\\n<br/>\\n<img src=\"https://cdn.example-domain.com/example1.jpg\"/>\\nMore content\\n<br/>\\nMore Content to Simulate main content.\\n<img src=\"https://cdn.example-domain.com/example2.jpg\"/>\\n</body>\\n</html>\\n

and the URL value that is passed is https://www.example-domain.com

  3. The content in the returned JSON result contains the main content, including the images. The image values, however, are corrupted:
<img src="https://www.example-domain.com/%22https://cdn.example-domain.com/example1.jpg/%22/">
<img src="https://www.example-domain.com/%22https://cdn.example-domain.com/example2.jpg/%22/">
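One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: if the literal characters \" survive into the HTML that the parser sees, the src attribute is read as an unquoted value that starts with a backslash and a quote, and it is then resolved as a path relative to the URL that was passed in. The sketch below approximates that resolution by hand. The function approximate_resolve is hypothetical and is not part of the Mercury Parser API; it only imitates two behaviours commonly seen in browser-style URL resolvers (backslash treated as a slash, double quote percent-encoded).

```python
from urllib.parse import quote

BASE = "https://www.example-domain.com"


def approximate_resolve(base: str, raw_src: str) -> str:
    """Hand-rolled approximation of relative URL resolution (hypothetical helper)."""
    # Browser-style resolvers treat a backslash like a forward slash.
    path = raw_src.replace("\\", "/")
    # Characters such as the double quote get percent-encoded ('"' becomes %22).
    path = quote(path, safe="/:")
    # After the replacement the value starts with "/", so it is appended to the origin.
    return base + path


# Assumed src value when the literal characters \" reach the HTML parser and the
# slash of "/>" is absorbed into the unquoted attribute value.
raw_src = '\\"https://cdn.example-domain.com/example1.jpg\\"/'

corrupted = approximate_resolve(BASE, raw_src)
print(corrupted)
```

Under these assumptions the printed value has the same shape as the corrupted src reported above, which would point at the escaping of the HTML before it is sent rather than at the image tags themselves.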

Question/Comment

Am I using the API in a correct way? I could not find any documentation so this is a bit of reverse engineering.

The reason for not doing this directly, i.e. using /parser?url=....., is that I am trying to work around a problem where a TypeError is returned. See. The page gives back a 202, which the parser cannot handle. As a workaround, I am now downloading the content myself and trying to pass the HTML into the API instead. Unfortunately, it doesn't behave as I expected.
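For comparison, here is a minimal sketch of building the request body so that the HTML is escaped exactly once. The field names "url" and "html" are assumptions, since the issue only says the JSON object contains a URL and HTML; no request is actually sent. The second half shows what happens when the markup is escaped by hand before JSON encoding: the server side would then decode literal backslashes into the HTML.

```python
import json

html = (
    "<html>\n<head>\n<body>\nMain content\n<br/>\n"
    '<img src="https://cdn.example-domain.com/example1.jpg"/>\n'
    "</body>\n</html>\n"
)

# Let the JSON encoder escape quotes and newlines, exactly once.
# "url" and "html" are assumed field names, not confirmed by any documentation.
body = json.dumps({"url": "https://www.example-domain.com", "html": html})

# What a JSON-decoding server would recover: the original markup, unchanged.
decoded = json.loads(body)["html"]

# Escaping by hand first and then encoding again escapes everything twice.
pre_escaped = html.replace('"', '\\"').replace("\n", "\\n")
double_body = json.dumps({"url": "https://www.example-domain.com", "html": pre_escaped})

# After decoding, literal backslashes remain inside the HTML.
double_decoded = json.loads(double_body)["html"]

print(decoded == html)           # single escaping round-trips cleanly
print('\\"' in double_decoded)   # double escaping leaves \" in the markup
```

If the converted format shown in the reproduction steps is the string before JSON encoding, it would correspond to the double-escaped case here.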
