Encoding Issues with Cyrillic Text Scraped using rvest in R

I'm trying to scrape Russian media web pages that contain Cyrillic text using the rvest package in R.

However, for some of the pages (not all, for some reason) I'm encountering an encoding issue: the text does not display correctly after scraping. Instead of the expected Cyrillic characters, I see garbled output like:

ÐлаÐ2а ÐÐ ́ÐμÑÑÑ Ð¿ÑÐ3⁄4ÑÐ ̧Ñ Ð2лаÑÑÐ ̧ ÑÑÑаÐ1⁄2Ñ ÑÐμÑÑÑ Ð·Ð° ÑÑÐ3⁄4л пÐμÑÐμÐ3Ð3⁄4Ð2Ð3⁄4ÑÐ3⁄4Ð2 Ñ Ð Ð3⁄4ÑÑÐ ̧ÐμÐ1

Take this page for example:

url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"

Both the page headers (httr::headers(httr::HEAD(url))) and all parameters in the HTML source tell me the encoding should be UTF-8 and the page is static.

Originally, my scraper did not specify the encoding (so it should fall back on UTF-16 if UTF-8 throws an error, I guess?!), resulting in the character mess above.
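One way to see what's actually coming over the wire is to fetch the raw bytes yourself and let stringi guess at them. This is a diagnostic sketch only, assuming the url defined below and the httr and stringi packages:

```r
library(httr)
library(stringi)

resp <- GET(url)                        # url as defined in the question
raw_bytes <- content(resp, as = "raw")  # server bytes, no decoding applied

stri_enc_isutf8(list(raw_bytes))        # FALSE => the bytes are not valid UTF-8
stri_enc_detect(list(raw_bytes))[[1]]   # best-guess encodings with confidence scores
```

If stri_enc_isutf8() returns FALSE despite the UTF-8 headers, the server is sending bytes that disagree with its own declared Content-Type, which would be consistent with the read_xml() error shown further down.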

Specifying Encoding in read_html():

I attempted to specify the encoding directly in the read_html() function:

text <- rvest::read_html(url, encoding = "UTF-8") %>% html_elements(".entry-title") %>% html_text2()

which results in

Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, : Input is not proper UTF-8, indicate encoding ! Bytes: 0xD0 0x27 0x20 0x2F [9]

I also tried other encodings like "windows-1251" and "UTF-16" (and looped over all the encodings from stringi::stri_enc_list()), but that didn't get me any closer to resolving the issue.

Ex-post string manipulation:

This was the closest I could get to the intended result (though it's still not perfect, and ideally it wouldn't be necessary if the scraper could handle the encoding in the first place).

I did this directly in the MariaDB I write the scraped text into using SQL:

CREATE TEMPORARY TABLE ttable (text VARCHAR(255) CHARACTER SET utf8);
INSERT INTO ttable (text) VALUES 
('ÐлаÐ2а ÐÐ ́ÐμÑÑÑ Ð¿ÑÐ3⁄4ÑÐ ̧Ñ Ð2лаÑÑÐ ̧ ÑÑÑаÐ1⁄2Ñ ÑÐμÑÑÑ Ð·Ð° ÑÑÐ3⁄4л пÐμÑÐμÐ3Ð3⁄4Ð2Ð3⁄4ÑÐ3⁄4Ð2 Ñ Ð Ð3⁄4ÑÑÐ ̧ÐμÐ1');
SELECT text, CONVERT(CAST(CONVERT(text USING latin1) AS BINARY) USING utf8) AS corrected_text FROM ttable;

and got close:

??лава ??десс?? п??оси?? влас??и с????ан?? сес???? за с??ол пе??егово??ов с ? оссией

instead of

Глава Одессы просит власти страны сесть за стол переговоров с Россией

as displayed on the website. I could probably get to the right result from here using an LLM or similar, but I'd prefer to avoid scraping messy data in the first place...
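For completeness, the same round trip the SQL performs can be done in R. This is a sketch assuming the garbling is a plain latin1 mis-decode of UTF-8 bytes; the 3⁄4-style fractions in the sample suggest an extra mangling step, so it may land on the same imperfect result as the SQL version:

```r
library(stringi)

# Re-encode the garbled string back to latin1 bytes (recovering the
# original byte stream), then reinterpret those bytes as UTF-8 --
# the same CONVERT(CAST(CONVERT(...))) round trip as the SQL above.
fix_mojibake <- function(x) {
  bytes <- stri_encode(x, from = "UTF-8", to = "ISO-8859-1", to_raw = TRUE)
  stri_encode(bytes, from = "UTF-8", to = "UTF-8")
}
```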

Has anyone faced a similar encoding issue when scraping Cyrillic texts? Any help would be hugely appreciated!

[I'm using R 4.3.2 and rvest 1.0.3 on Windows 10]

  • Thanks a lot, amazing workaround! Also great suggestion to use chromote directly, as an explicit timeout (instead of needing to change the default as in rvest::read_html_live) comes in handy with sometimes very slow page speed as in the example. Still would be very interesting to know where exactly the issue originates and why it works with chromote/read_html_live... any ideas that might help detect affected pages beforehand? Commented Nov 14, 2024 at 14:43
  • @mschro04, updated my answer and I probably got closer to the actual issue. If similar problems are common and you decide to use httr2, you could just pass resp_body_raw(resp) through stri_conv() for all responses. To check if a response is affected, you could use resp_body_raw(resp) |> stri_enc_isutf8(). Commented Nov 15, 2024 at 9:57
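Putting that comment's suggestion together, one possible httr2 pipeline might look like the following. This is a sketch, not the answerer's exact code; the fallback source encoding is left to stri_enc_detect()'s best guess rather than assumed:

```r
library(httr2)
library(stringi)
library(rvest)

resp <- request(url) |> req_perform()
body <- resp_body_raw(resp)                 # raw bytes, no decoding yet

# Repair only affected responses, as the comment suggests.
if (!stri_enc_isutf8(list(body))) {
  enc  <- stri_enc_detect(list(body))[[1]]$Encoding[1]  # best-guess encoding
  body <- charToRaw(stri_encode(list(body), from = enc, to = "UTF-8"))
}

read_html(body) |> html_elements(".entry-title") |> html_text2()
```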
