HTML crifan
Riff Raff Sails the High Cheese by Susan Schade | Scholastic
(Note: the HTML nodes that PySpider fetches differ slightly from the ones shown in the browser.)
        description = ""
        descriptionElement = response.doc('div[id="description"]')
        print("descriptionElement=%s" % descriptionElement)
        # descLessElement = descriptionElement.find('span[class="description-less-con"]')
        # print("descLessElement=%s" % descLessElement)
        # descLessText = descLessElement.text()
        # print("descLessText=%s" % descLessText)
        # descMoreLement = descriptionElement.find('span[class="description-more-con"]')
        # print("descMoreLement=%s" % descMoreLement)
        # if descMoreLement:
        #     descMoreText = descMoreLement.text()
        #     print("descMoreText=%s" % descMoreText)
        descriptionText = descriptionElement.text()
        print("descriptionText=%s" % descriptionText)
descriptionText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?<br />With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.<br /><i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
pyquery – PyQuery complete API — pyquery 1.2.4 documentation
lxml.etree.tostring html to string
python – Incredibly basic lxml questions: getting HTML/string content of lxml.etree._Element? – Stack Overflow
python – lxml.tostring incorrectly replacing text with HTML entities – Stack Overflow
Python Why lxml etree tostring method returns bytes – Python – Makble
pyspider html remove tag to string
Python code to remove HTML tags from a string – Stack Overflow
import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
strip – How to remove tags from a string in python using regular expressions? (NOT in HTML) – Stack Overflow
[Solved] Python xml.etree.ElementTree error: AttributeError: module 'xml' has no attribute 'etree'
>>> from lxml import etree, html
>>> element = etree.fromstring('<p>Hel-lo World</p>')
import lxml
# import xml

def htmlToString(htmlText):
    # return ''.join(xml.etree.ElementTree.fromstring(htmlText).itertext())
    return ''.join(lxml.etree.ElementTree.fromstring(htmlText).itertext())
[E 181011 11:55:58 base_handler:203] 'cython_function_or_method' object has no attribute 'fromstring'
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<ScholasticStorybook>", line 227, in singleBookCallback
      File "<ScholasticStorybook>", line 19, in htmlToString
    AttributeError: 'cython_function_or_method' object has no attribute 'fromstring'
[E 181011 13:33:22 base_handler:203] syntax error: line 1, column 0
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<ScholasticStorybook>", line 232, in singleBookCallback
      File "<ScholasticStorybook>", line 23, in htmlToString
      File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
      File "<string>", line None
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
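The ParseError above is most likely raised because the description is an HTML fragment that starts with bare text, so it is not a well-formed XML document with a single root element. A minimal sketch of one workaround follows. It assumes the fragment is otherwise valid XML (self-closed `<br />`, no HTML-only entities such as `&nbsp;`). The helper name `xmlFragmentToText` and the `<root>` wrapper are illustrative and not part of the original code:

```python
import xml.etree.ElementTree as ET

def xmlFragmentToText(fragment):
    """Strip tags from an XML-compatible fragment that has no single root element."""
    # Wrapping in a dummy root gives the parser the one top-level element it requires
    root = ET.fromstring("<root>%s</root>" % fragment)
    # itertext() yields the text inside the element and all its descendants, in document order
    return "".join(root.itertext())

sample = "Shiver me whiskers!<br />With <i>Riff Raff</i> aboard."

try:
    ET.fromstring(sample)  # bare leading text: not a well-formed document
except ET.ParseError as err:
    print("unwrapped input fails: %s" % err)

print(xmlFragmentToText(sample))
```

Real-world HTML often breaks these assumptions (an unclosed `<br>`, named entities), which is one reason to prefer an HTML parser over an XML one for this job.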
python html to string
Decode HTML entities in Python string? – Stack Overflow
web scraping – Converting html to text with Python – Stack Overflow
python – Convert HTML entities to Unicode and vice versa – Stack Overflow
Extracting text from HTML file using Python – Stack Overflow
Decoding HTML Entities to Text in Python – fredericiana
EscapingHtml – Python Wiki
html — HyperText Markup Language support — Python 3.7.1rc1 documentation
19.1. HTMLParser — Simple HTML and XHTML parser — Python 2.7.15 documentation
By default, installing BeautifulSoup pulls in BeautifulSoup 3, which only supports Python 2, so the install fails here:
➜  crawler_scholastic_storybook git:(master) ✗ pipenv install BeautifulSoup
Installing BeautifulSoup...
Looking in indexes: 
Collecting BeautifulSoup
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/46/2hjxz38n22n3ypp_5f6_p__00000gn/T/pip-install-dxfdripk/BeautifulSoup/setup.py", line 22
        print "Unit tests have failed!"
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?


Error:  An error occurred while installing BeautifulSoup!
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/46/2hjxz38n22n3ypp_5f6_p__00000gn/T/pip-install-dxfdripk/BeautifulSoup/
You are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

This is likely caused by a bug in BeautifulSoup. Report this to its maintainers.
➜  crawler_scholastic_storybook git:(master) ✗ pipenv install bs4
Installing bs4...
Looking in indexes: 
Collecting bs4
Collecting beautifulsoup4 (from bs4)
Building wheels for collected packages: bs4
  Running setup.py bdist_wheel for bs4: started
  Running setup.py bdist_wheel for bs4: finished with status 'done'
  Stored in directory: /Users/crifan/Library/Caches/pipenv/wheels/d8/e6/2f/a8e9e4058de6bf1a3d0cd64e23ba5fba27e75dc282e47a5077
Successfully built bs4
Installing collected packages: beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.6.3 bs4-0.0.1

Adding bs4 to Pipfile's [packages]...
Pipfile.lock (225a5b) out of date, updating to (4a06ee)...
Locking [dev-packages] dependencies...
Locking [packages] dependencies...
Updated Pipfile.lock (4a06ee)!
Installing dependencies from Pipfile.lock (4a06ee)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 27/27 — 00:00
from bs4 import BeautifulSoup

def htmlToString(htmlText):
    soup = BeautifulSoup(htmlText)
    print("soup=%s" % soup)
    pureText = soup.text
    print("pureText=%s" % pureText)
    return pureText
Next, replace br with a newline:
python html to text with new line
Python 3: Write newlines to HTML – Stack Overflow
python – how to create the new line character – Stack Overflow
python – Converting HTML to plain text while preserving line breaks – Stack Overflow
Reading a text file with new lines in html with python – Stack Overflow
python html to text br
python – Using beautifulsoup to extract text between line breaks (e.g. <br /> tags) – Stack Overflow
html – python how to extract text after br? – Stack Overflow
<br />
from bs4 import BeautifulSoup

def htmlToString(htmlText, retainNewLine=True):
    if retainNewLine:
        htmlText = htmlText.replace("<br>", '\n')
        htmlText = htmlText.replace("<br/>", '\n')
        htmlText = htmlText.replace("<br />", '\n')

    print("htmlText=%s" % htmlText)
    soup = BeautifulSoup(htmlText)
    print("soup=%s" % soup)
    pureText = soup.text
    print("pureText=%s" % pureText)
    return pureText
descriptionText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?<br />With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.<br /><i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.

htmlText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
<i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.

soup=<html><body><p>Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
<i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.</p></body></html>

pureText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, Riff Raff Sails the High Cheese strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
Riff Raff Sails the High Cheese is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?<br />With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.<br /><i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
pipenv install bs4
from bs4 import BeautifulSoup

def htmlToString(htmlText, retainNewLine=True):
    if retainNewLine:
        htmlText = htmlText.replace("<br>", '\n')
        htmlText = htmlText.replace("<br/>", '\n')
        htmlText = htmlText.replace("<br />", '\n')

    print("htmlText=%s" % htmlText)
    soup = BeautifulSoup(htmlText)
    print("soup=%s" % soup)
    pureText = soup.text
    print("pureText=%s" % pureText)
    return pureText
Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, Riff Raff Sails the High Cheese strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
Riff Raff Sails the High Cheese is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
from bs4 import BeautifulSoup
soup = BeautifulSoup(text)
"It also places newlines in the middle of sentences if you have e.g. "<p>That's <strong>not</strong> what I want</p>""
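  • Calling get_text() directly: pureText = soup.get_text()
    • -> returns only the text; br is not turned into a newline
  • Calling get_text('\n'): pureText = soup.get_text('\n')
    • -> puts a newline before and after every tag
    • so <i>Riff Raff Sails the High Cheese</i> also ends up on a line of its own
    • which is not what we want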
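The trade-off in the list above can be sidestepped by turning only the br tags into newlines inside the parsed tree and then calling get_text() without a separator. A minimal sketch, assuming bs4 is installed. The function name `htmlToTextKeepBr` is hypothetical, and the built-in "html.parser" backend is picked here only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

def htmlToTextKeepBr(htmlText):
    """Return plain text in which only <br> tags become line breaks."""
    soup = BeautifulSoup(htmlText, "html.parser")
    # find_all returns a list-like snapshot, so replacing nodes while looping is safe
    for brTag in soup.find_all("br"):
        brTag.replace_with("\n")
    # No separator: inline tags such as <i> or <strong> stay on the same line
    return soup.get_text()

print(htmlToTextKeepBr("stolen loot?<br />With <i>Riff Raff</i> aboard.<br>The end."))
```

Because the tags are matched by the parser rather than by string comparison, variants such as `<br>`, `<br/>` and `<br />` are all handled by the same loop.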
<ScholasticStorybook>:23: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 23 of the file <ScholasticStorybook>. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
BeautifulSoup Parser Warning · Issue #49 · ckreibich/scholar.py
soup = BeautifulSoup(htmlText, "lxml")
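If installing bs4 and lxml is not an option, roughly the same conversion can be sketched with the standard library's html.parser module. This is an alternative technique, not the approach used in this post, and the class and function names below are illustrative:

```python
from html.parser import HTMLParser

class BrAwareTextExtractor(HTMLParser):
    """Collect text nodes and emit a newline for every br tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closed <br />, via the default handle_startendtag
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def htmlToStringStdlib(htmlText):
    parser = BrAwareTextExtractor()
    parser.feed(htmlText)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(htmlToStringStdlib("stolen loot?<br />With <i>Riff Raff</i> aboard."))
```

Unlike BeautifulSoup, this sketch does no tree repair, so it suits short, simple fragments like the description text here rather than whole pages.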

When reposting, please credit: 在路上 » [Solved] Converting the HTML text obtained through PyQuery in PySpider into a plain-text string with line breaks



