I'm trying to scrape Russian media web pages that contain Cyrillic text using the rvest package in R.
However, for some of the pages (not all, for some reason) I'm encountering an encoding issue where the text does not display correctly after scraping. Instead of the expected Cyrillic characters, I see garbled output like:
ÐлаÐ2а ÐÐ ́ÐμÑÑÑ Ð¿ÑÐ3⁄4ÑÐ ̧Ñ Ð2лаÑÑÐ ̧ ÑÑÑаÐ1⁄2Ñ ÑÐμÑÑÑ Ð·Ð° ÑÑÐ3⁄4л пÐμÑÐμÐ3Ð3⁄4Ð2Ð3⁄4ÑÐ3⁄4Ð2 Ñ Ð Ð3⁄4ÑÑÐ ̧ÐμÐ1
Take this page for example:
url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"
Both the response headers (httr::headers(httr::HEAD(url))) and the meta tags in the HTML source tell me the encoding should be UTF-8, and the page is static.
Originally, my scraper did not specify the encoding (so it should fall back to UTF-16 if UTF-8 throws an error, I guess?!), resulting in the character mess above.
Specifying Encoding in read_html():
I attempted to specify the encoding directly in the read_html() function:
text <- rvest::read_html(url, encoding = "UTF-8") %>% html_elements(".entry-title") %>% html_text2()
which results in
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xD0 0x27 0x20 0x2F [9]
I also tried other encodings like "windows-1251" and "UTF-16" (and looped over all the encodings from stringi::stri_enc_list()), but that didn't get me closer to resolving the issue.
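For context, the brute-force loop over stringi::stri_enc_list() was roughly the following (a sketch, using the URL and selector from above; note that every iteration re-downloads the page, and most encoding names simply make read_html() error out):

```r
library(rvest)
library(stringi)

url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"

# try to parse the page with a given encoding; NA if it errors or matches nothing
try_encoding <- function(enc) {
  tryCatch({
    txt <- read_html(url, encoding = enc) |>
      html_elements(".entry-title") |>
      html_text2()
    if (length(txt) == 0) NA_character_ else txt[1]
  }, error = function(e) NA_character_)
}

results <- vapply(stri_enc_list(), try_encoding, character(1))
results[!is.na(results)]  # inspect the surviving candidates manually
```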
Ex-post string manipulation:
This was the closest I could get to the intended result (though still not perfect, and ideally not necessary if the scraper could handle the encoding in the first place).
I did this directly in the MariaDB database I write the scraped text into, using SQL:
CREATE TEMPORARY TABLE ttable (text VARCHAR(255) CHARACTER SET utf8);
INSERT INTO ttable (text) VALUES
('ÐлаÐ2а ÐÐ ́ÐμÑÑÑ Ð¿ÑÐ3⁄4ÑÐ ̧Ñ Ð2лаÑÑÐ ̧ ÑÑÑаÐ1⁄2Ñ ÑÐμÑÑÑ Ð·Ð° ÑÑÐ3⁄4л пÐμÑÐμÐ3Ð3⁄4Ð2Ð3⁄4ÑÐ3⁄4Ð2 Ñ Ð Ð3⁄4ÑÑÐ ̧ÐμÐ1');
SELECT text, CONVERT(CAST(CONVERT(text USING latin1) AS BINARY) USING utf8) AS corrected_text FROM ttable;
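For what it's worth, that SQL round trip (reinterpret as latin1, cast to binary, relabel as utf8) has a base-R analogue: re-encode the mojibake back to the single-byte representation it was mis-decoded from, then declare those bytes to be UTF-8. A sketch (fix_mojibake is my own name for it; on "clean" mojibake, where every garbled character still maps to a latin1 byte, this recovers the text exactly, but on samples like the one above, which contain characters such as the 3⁄4-style fractions that have no latin1 byte, iconv() returns NA, consistent with the lossy SQL result):

```r
fix_mojibake <- function(x) {
  # map the garbled characters back to the single bytes they were mis-decoded
  # from; characters with no latin1 equivalent make iconv() return NA here
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # those bytes are the original UTF-8 text; just relabel the encoding
  Encoding(bytes) <- "UTF-8"
  bytes
}
```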
and got close:
??лава ??десс?? п??оси?? влас??и с????ан?? сес???? за с??ол пе??егово??ов с ? оссией
instead of
Глава Одессы просит власти страны сесть за стол переговоров с Россией
as displayed on the website. I could probably get to the right result from here using an LLM or so, but I'd prefer to avoid scraping messy data in the first place...
Has anyone faced a similar encoding issue when scraping Cyrillic texts?
Any help would be hugely appreciated!
[I'm using R 4.3.2 and rvest 1.0.3 on Windows 10]
Update: I ended up using chromote directly, as an explicit timeout (instead of needing to change the default as in rvest::read_html_live) comes in handy with the sometimes very slow page speeds, as in the example. Still, it would be very interesting to know where exactly the issue originates and why it works with chromote/read_html_live... any ideas that might help detect affected pages beforehand?

One suggestion from the comments: with httr2, you could just pass resp_body_raw(resp) through stri_conv() for all responses. To check whether a response is affected, you could use resp_body_raw(resp) |> stri_enc_isutf8().
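Building on that suggestion, a detect-then-convert sketch with httr2 and stringi might look like this (URL and selector from the question; note that stri_enc_detect() is a heuristic guess and may need a manual override such as "windows-1251"):

```r
library(httr2)
library(stringi)
library(rvest)

url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"

resp <- request(url) |> req_perform()
body <- resp_body_raw(resp)  # raw bytes, before any decoding

if (stri_enc_isutf8(body)) {
  html <- read_html(body)
} else {
  # affected page: guess the real encoding and convert to UTF-8 before parsing
  guess <- stri_enc_detect(body)[[1]]$Encoding[1]
  html <- read_html(stri_conv(body, from = guess, to = "UTF-8"))
}

html |> html_elements(".entry-title") |> html_text2()
```

Checking stri_enc_isutf8() on the raw body up front is also a cheap way to flag affected pages before they reach the database.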