I'm trying to scrape Russian media web pages that contain Cyrillic text using the rvest package in R.
However, for some of the pages (not all, for some reason) I'm encountering an encoding issue where the text does not display correctly after scraping. Instead of the expected Cyrillic characters, I see garbled output like:
ÐлаÐ2а ÐÐ ́ÐμÑÑÑ Ð¿ÑÐ3⁄4ÑÐ ̧Ñ Ð2лаÑÑÐ ̧ ÑÑÑаÐ1⁄2Ñ ÑÐμÑÑÑ Ð·Ð° ÑÑÐ3⁄4л пÐμÑÐμÐ3Ð3⁄4Ð2Ð3⁄4ÑÐ3⁄4Ð2 Ñ Ð Ð3⁄4ÑÑÐ ̧ÐμÐ1
Take this page for example:
url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"
Both the response headers (httr::headers(httr::HEAD(url))) and the meta tags in the HTML source tell me the encoding should be UTF-8, and the page is static.
Originally, my scraper did not specify the encoding (so it should fall back to UTF-16 if UTF-8 throws an error, I guess?!), resulting in the character mess above.
Specifying Encoding in read_html():
I attempted to specify the encoding directly in the read_html() function:
text <- rvest::read_html(url, encoding = "UTF-8") %>% html_elements(".entry-title") %>% html_text2()
which results in
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xD0 0x27 0x20 0x2F [9]
I also tried other encodings like "windows-1251" and "UTF-16" (and looped over all the encodings from stringi::stri_enc_list()), but that didn't get me closer to resolving the issue.
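For context, the brute-force loop over stringi::stri_enc_list() was roughly the following (a sketch, using the URL and selector from above; note that every iteration re-downloads the page, and most encoding names simply make read_html() error out):

```r
library(rvest)
library(stringi)

url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"

# try to parse the page with a given encoding; NA if it errors or matches nothing
try_encoding <- function(enc) {
  tryCatch({
    txt <- read_html(url, encoding = enc) |>
      html_elements(".entry-title") |>
      html_text2()
    if (length(txt) == 0) NA_character_ else txt[1]
  }, error = function(e) NA_character_)
}

results <- vapply(stri_enc_list(), try_encoding, character(1))
results[!is.na(results)]  # inspect the surviving candidates manually
```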
Ex-post string manipulation:
This was the closest I could get to the intended result (though still not perfect, and ideally not necessary if the scraper could handle the encoding in the first place).
I did this directly in the MariaDB database I write the scraped text into, using SQL:
CREATE TEMPORARY TABLE ttable (text VARCHAR(255) CHARACTER SET utf8);
INSERT INTO ttable (text) VALUES
('ÐлаÐ2а ÐÐ ́ÐμÑÑÑ Ð¿ÑÐ3⁄4ÑÐ ̧Ñ Ð2лаÑÑÐ ̧ ÑÑÑаÐ1⁄2Ñ ÑÐμÑÑÑ Ð·Ð° ÑÑÐ3⁄4л пÐμÑÐμÐ3Ð3⁄4Ð2Ð3⁄4ÑÐ3⁄4Ð2 Ñ Ð Ð3⁄4ÑÑÐ ̧ÐμÐ1');
SELECT text, CONVERT(CAST(CONVERT(text USING latin1) AS BINARY) USING utf8) AS corrected_text FROM ttable;
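For what it's worth, that SQL round trip (reinterpret as latin1, cast to binary, relabel as utf8) has a base-R analogue: re-encode the mojibake back to the single-byte representation it was mis-decoded from, then declare those bytes to be UTF-8. A sketch (fix_mojibake is my own name for it; on "clean" mojibake, where every garbled character still maps to a latin1 byte, this recovers the text exactly, but on samples like the one above, which contain characters such as the 3⁄4-style fractions that have no latin1 byte, iconv() returns NA, consistent with the lossy SQL result):

```r
fix_mojibake <- function(x) {
  # map the garbled characters back to the single bytes they were mis-decoded
  # from; characters with no latin1 equivalent make iconv() return NA here
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # those bytes are the original UTF-8 text; just relabel the encoding
  Encoding(bytes) <- "UTF-8"
  bytes
}
```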
and got close:
??лава ??десс?? п??оси?? влас??и с????ан?? сес???? за с??ол пе??егово??ов с ? оссией
instead of
Глава Одессы просит власти страны сесть за стол переговоров с Россией
as displayed on the website. I could probably get to the right result from here using an LLM or so, but I'd prefer to avoid scraping messy data in the first place...
Has anyone faced a similar encoding issue when scraping Cyrillic texts?
Any help would be hugely appreciated!
[I'm using R 4.3.2 and rvest 1.0.3 on Windows 10]
Update: I ended up using chromote directly, as an explicit timeout (instead of needing to change the default as in rvest::read_html_live) comes in handy with the sometimes very slow page speeds, as in the example. Still, it would be very interesting to know where exactly the issue originates and why it works with chromote/read_html_live... any ideas that might help detect affected pages beforehand?

One suggestion from the comments: with httr2, you could just pass resp_body_raw(resp) through stri_conv() for all responses. To check whether a response is affected, you could use resp_body_raw(resp) |> stri_enc_isutf8().
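Building on that suggestion, a detect-then-convert sketch with httr2 and stringi might look like this (URL and selector from the question; note that stri_enc_detect() is a heuristic guess and may need a manual override such as "windows-1251"):

```r
library(httr2)
library(stringi)
library(rvest)

url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"

resp <- request(url) |> req_perform()
body <- resp_body_raw(resp)  # raw bytes, before any decoding

if (stri_enc_isutf8(body)) {
  html <- read_html(body)
} else {
  # affected page: guess the real encoding and convert to UTF-8 before parsing
  guess <- stri_enc_detect(body)[[1]]$Encoding[1]
  html <- read_html(stri_conv(body, from = guess, to = "UTF-8"))
}

html |> html_elements(".entry-title") |> html_text2()
```

Checking stri_enc_isutf8() on the raw body up front is also a cheap way to flag affected pages before they reach the database.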