Parse feeds in Python

Title Strange Characters issue when reading RSS XML files not encoded in utf-8 #478

Closed
Opened 9/24/2024, 4 comments, by laiyonghao
laiyonghao

For example, when the RSS XML file is encoded in windows-1252 and the last byte of the title field's text is a whitespace character such as 0xA0 (NBSP), the strip() function deletes that byte, which produces strange characters. Here is a feed URL: [https://www.lfhacks.com/index.xml](https://www.lfhacks.com/index.xml). Some of the article titles in it become garbled. For instance, the title of the article at https://www.lfhacks.com/tech/python-find-positive/ turns into strange characters because the last byte of its windows-1252 encoding is 0xA0 and gets deleted.

I have modified the code so that it no longer calls strip() to remove whitespace at both ends. As a temporary workaround this meets the needs of my project, but I think the fundamental fix is to first convert the entire XML file to UTF-8 and then parse it. I hope the developers can fix this issue as soon as possible. Thank you.
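
The effect described above can be reproduced with plain Python strings, without the library. This is only a minimal sketch: the title "déjà" is a made-up example (not taken from the feed), and the assumption that the title bytes are multi-byte UTF-8 text passing through a windows-1252 decoding step is a guess at the mechanism, not something confirmed by the library's code. The helper `roundtrip` is likewise hypothetical.

```python
# Hypothetical title: "à" is U+00E0, which UTF-8 encodes as the bytes 0xC3 0xA0.
title = "déjà"
raw = title.encode("utf-8")

# Assumption: the bytes are read as windows-1252, so the final 0xA0 becomes U+00A0 (NBSP).
as_cp1252 = raw.decode("cp1252")

# str.strip() with no arguments treats NBSP as whitespace and removes it.
stripped = as_cp1252.strip()

# Stripping only ASCII whitespace leaves the trailing NBSP in place.
ascii_stripped = as_cp1252.strip(" \t\r\n")


def roundtrip(text):
    """Hypothetical helper: turn the windows-1252 view back into UTF-8 text."""
    return text.encode("cp1252").decode("utf-8", errors="replace")


print(repr(roundtrip(stripped)))        # last character is damaged
print(repr(roundtrip(ascii_stripped)))  # original title survives
```

Under these assumptions, restricting the strip to ASCII whitespace corresponds to the workaround of not stripping the NBSP, while decoding the whole document to text with its real encoding before any stripping would avoid the intermediate windows-1252 view entirely.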
