[Tutorial] Crawling a web page and extracting the required information: C# version

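After learning the general workflow for crawling web pages from:

[Summary] Logic, workflow, and caveats for crawling web pages, analyzing page content, and simulating website login

and combined with the previously introduced:

[Summary] Developer tools in the browser (F12 in IE9 and Ctrl+Shift+I in Chrome): a powerful aid for web page analysis

you should have a clear idea of how to use these tools to crawl a web page, analyze its source, and get the content you need.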

Below, a practical example shows how to crawl a web page and extract the required content in C#:

Suppose the task is to take my (crifan's) page on Songtaste:

http://www.songtaste.com/user/351979/

first fetch the HTML source of the page, and then extract my Songtaste user name from it: crifan

The corresponding HTML is:

<h1 class="h1user">crifan</h1>

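This task is fairly simple. Here is how to do it in C#.

Create a new C# project targeting .NET Framework 2.0 and add a few basic controls to display the results.

First, the code that fetches the HTML: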

using System.Net;
using System.IO;
using System.Text; //for Encoding

//step1: get html from url
//http://www.songtaste.com/user/351979/
string urlToCrawl = txbUrlToCrawl.Text;
//generate http request
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
//use GET method to get url's html
req.Method = "GET";
//use request to get response
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
string htmlCharset = "GBK";
//use GBK (a superset of GB2312, songtaste's html charset) to decode html
//otherwise the text will be garbled
Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding);
//read out the returned html
string respHtml = sr.ReadToEnd();
rtbExtractedHtml.Text = respHtml;

In the UI, click the "Fetch page HTML source" button and the corresponding HTML is retrieved and displayed.

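Note:
Whether you need to care about the HTML's character encoding (charset) here depends on your requirements;
as for why GBK is used here, if any of this is unclear, see:
[Summary] Explanation of the character encodings (charset) of HTML page source (GB2312, GBK, UTF-8, ISO8859-1, etc.)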
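
As an illustration of the charset point above, here is a minimal sketch of choosing the charset from a Content-Type header value instead of hard-coding it. `CharsetHelper` and `ResolveCharset` are hypothetical names that are not part of the original project, and whether Songtaste actually sends a charset in that header is not known here, so the GBK fallback is kept.

```csharp
using System;

static class CharsetHelper
{
    // Hypothetical helper (not in the original project): extracts the charset
    // parameter from a Content-Type value such as "text/html; charset=gb2312".
    // Returns the fallback when the header is empty or has no charset parameter.
    public static string ResolveCharset(string contentType, string fallback)
    {
        if (string.IsNullOrEmpty(contentType))
            return fallback;

        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // Drop the "charset=" prefix and any surrounding quotes.
                string value = p.Substring("charset=".Length).Trim().Trim('"', '\'');
                if (value.Length > 0)
                    return value;
            }
        }
        return fallback;
    }
}

static class CharsetDemo
{
    static void Main()
    {
        Console.WriteLine(CharsetHelper.ResolveCharset("text/html; charset=gb2312", "GBK")); // prints: gb2312
        Console.WriteLine(CharsetHelper.ResolveCharset("text/html", "GBK"));                 // prints: GBK
    }
}
```

In the form code, the header value would come from `resp.ContentType`, and the returned name would be passed to `Encoding.GetEncoding`, which throws an `ArgumentException` for names it does not recognize. Many pages declare the charset only in a `<meta>` tag, which this sketch does not inspect.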

Once the HTML has been fetched, use Regex, the regular expression class in the C# library, to extract the data we want:

using System.Text.RegularExpressions;
//step2: extract expected info
//<h1 class="h1user">crifan</h1>
string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
if (foundH1user.Success)
{
    //extracted the expected h1user's value
    txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
}
else
{
    txbExtractedInfo.Text = "Not found h1 user !";
}

Click "Extract the required info" and the value of h1user, crifan, is extracted.

The complete C# code is:
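
To try the extraction step without the form or a network connection, the same pattern can be run against the sample HTML line in a small console program. This is only a sketch: `ExtractDemo` and `ExtractH1User` are illustrative names, not part of the original project.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractDemo
{
    // Returns the text inside <h1 class="h1user">...</h1>, or null when the
    // tag is not present. Uses the same pattern as the tutorial code.
    public static string ExtractH1User(string html)
    {
        string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
        Match found = Regex.Match(html, h1userP);
        return found.Success ? found.Groups["h1user"].Value : null;
    }

    static void Main()
    {
        string sampleHtml = "<div><h1 class=\"h1user\">crifan</h1></div>";
        Console.WriteLine(ExtractH1User(sampleHtml));            // prints: crifan
        Console.WriteLine(ExtractH1User("<h1>x</h1>") == null);  // prints: True
    }
}
```

The lazy quantifier `.+?` stops at the first `</h1>`, and because `.` does not match a newline by default, the pattern only matches when the tag content is on a single line.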

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace crawlWebsiteAndExtractInfo
{
    public partial class frmCrawlWebsite : Form
    {
        public frmCrawlWebsite()
        {
            InitializeComponent();
        }

        private void btnCrawlAndExtract_Click(object sender, EventArgs e)
        {
            //step1: get html from url
            //http://www.songtaste.com/user/351979/
            string urlToCrawl = txbUrlToCrawl.Text;
            //generate http request
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
            //use GET method to get url's html
            req.Method = "GET";
            //use request to get response
            HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
            string htmlCharset = "GBK";
            //use GBK (a superset of GB2312, songtaste's html charset) to decode html
            //otherwise the text will be garbled
            Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
            StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding);
            //read out the returned html
            string respHtml = sr.ReadToEnd();
            rtbExtractedHtml.Text = respHtml;
        }

        private void btnExtractInfo_Click(object sender, EventArgs e)
        {
            //step2: extract expected info
            //<h1 class="h1user">crifan</h1>
            string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
            Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
            if (foundH1user.Success)
            {
                //extracted the expected h1user's value
                txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
            }
            else
            {
                txbExtractedInfo.Text = "Not found h1 user !";
            }
        }

        private void lklTutorialUrl_LinkClicked(object sender, LinkLabelLinkClickedEventArgs e)
        {
            string tutorialUrl = "https://www.crifan.com/crawl_website_html_and_extract_info_using_csharp";
            System.Diagnostics.Process.Start(tutorialUrl);
        }
    }
}

The complete VS2010 project can be downloaded here:

crawlWebsiteAndExtractInfo_csharp_2012-11-07.7z

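[Summary]

Overall, crawling a site in C# and extracting the required content from the returned HTML source is somewhat more complicated than the earlier Python version.

That is because a lot of HTTP-related details have to be handled by hand: the request, the response, the stream, the character encoding, and so on.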
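
One way to keep that manual handling in one place is a small helper that wraps the request, response, stream, and decoding, and releases them with `using` blocks. The sketch below assumes the same `HttpWebRequest` approach as the tutorial. `HtmlFetcher`, `ReadAll`, and `FetchHtml` are hypothetical names, and only the offline decoding step is exercised in `Main`.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class HtmlFetcher
{
    // Decodes the whole stream with the given encoding.
    // Disposing the reader also closes the underlying stream.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader sr = new StreamReader(stream, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    // Sends a GET request and returns the decoded body.
    // The response is disposed even if reading throws.
    public static string FetchHtml(string url, Encoding encoding)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            return ReadAll(resp.GetResponseStream(), encoding);
        }
    }

    static void Main()
    {
        // Offline check of the decoding step, using an in-memory stream.
        byte[] bytes = Encoding.UTF8.GetBytes("<h1 class=\"h1user\">crifan</h1>");
        Console.WriteLine(ReadAll(new MemoryStream(bytes), Encoding.UTF8));
    }
}
```

In the form, the click handler could then call something like `HtmlFetcher.FetchHtml(txbUrlToCrawl.Text, Encoding.GetEncoding("GBK"))`. On recent .NET versions `WebRequest.Create` is marked obsolete and compiles with a warning.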


Thanks to the author for sharing so generously. It is a bit basic, but it still has what I need.

天涯海角 (2015-06-11)

"First, the code that fetches the HTML": what exactly is this step for? I understood the analysis that comes after it. Please forgive my poor web knowledge.

长空 (2015-06-01)

I can't get the C# version to run. Any pointers?

胡 (2014-11-18)

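Upvoted. I learned a lot, especially as a complete beginner. The article explains things in a very accessible way and helped me build up quite a few concepts. Thank you.

shou.ryan (2013-10-03)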

What is your QQ number? I would like to ask you a few technical questions.

小米 (2013-08-11)

See for yourself: https://www.crifan.com/about/me/
crifan (2013-08-12)

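I have read all the tutorials, but there is one thing I don't understand: why does the C# code use .NET 2.0? With 4.5, the Baidu login cannot obtain the token correctly. Why is that?

wer (2012-12-18)

The author had an earlier article on how to analyze the Baidu login. It has probably changed since then.
长空 (2015-06-01)
- In short: you are better off reading the systematic tutorial, which will make things clearer: Detailed explanation of the principles and implementation of crawling websites, simulating login, and crawling dynamic pages (Python, C#, etc.). After reading it, you should understand most of this.
  crifan (2015-06-05)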

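Many thanks to the author for providing the feature for downloading the SongTaste website. Could you send me the source code of this download tool? I would like to add a feature for downloading album comments and song comments.

Joe (2012-11-09)

1. By "downloading the SongTaste website", do you mean downloading the HTML of some URL on songtaste? 2. Whatever you mean, I have already posted the source code above: crawlWebsiteAndExtractInfo_csharp_2012-11-07.7z. Is the sentence "The complete VS2010 project can be downloaded here: crawlWebsiteAndExtractInfo_csharp_2012-11-07.7z" really so ambiguous that you did not realize this is the source code?
crifan (2012-11-09)
- I may have worded that wrongly. "Many thanks to the author for providing the feature for downloading the SongTaste website" should have been "Many thanks to the author for providing the feature for downloading songs and albums from the SongTaste website". I would like to add a comment-download feature to the author's existing tool.
  Joe (2012-11-09)
  - See [Summary] Introduction to and usage of Google Code, then go to: DownloadSongtasteMusic - source code for downloading the currently playing song / a single song / a whole album from ST (Songtaste). Also, a hint: if you look carefully, in the "Pages" section of the right sidebar on any page of my site you will find: downloadSonstasteMusic (download Songtaste songs) v1.5 – download the currently playing song / a single song / a whole album from Songtaste (ST), where you can find the Google Code address: downloadSongtasteMusic download page
    crifan (2012-11-09)
    - There seems to be no source code, only an exe?
      Joe (2012-11-09)
      - Please read [Summary] Introduction to and usage of Google Code carefully. Thank you.
        crifan (2012-11-09)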
