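Background
For example, in
[Tutorial] A step-by-step guide to using a tool (IE9's F12) to analyze the internal logic of simulating a login to a website (the Baidu homepage)
the value of the staticpage variable is:
http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
It is clearly the encoded form of some URL (if you are not used to this it may not look obvious at all, but it will after the explanation below ^_^).
That value was observed with a web analysis tool, namely IE9's F12, described in:
[Summary] Developer tools in the browser (IE9's F12 and Chrome's Ctrl+Shift+I): a powerful aid for web page analysis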
How to decode the original URL from an encoded URL
This section explains how to recover the original URL from the encoded value above.
In Python, urllib.unquote decodes the original URL
The relevant code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Summary] URL encoding and decoding in HTTP (GET or POST) requests
https://www.crifan.com/summary_the_url_encode_and_decode_during_http_get_post_request

Version: 2012-11-20
Author: Crifan
"""

# Python 2 code: urllib.unquote and the print statement are Python 2 only
import urllib

encodedUrl = "http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html"
decodedUrl = urllib.unquote(encodedUrl)
print "encodedUrl=\t%s\r\ndecodedUrl=\t%s" % (encodedUrl, decodedUrl)
#encodedUrl=	http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
#decodedUrl=	http://www.baidu.com/cache/user/html/jump.html
Of course, the same operation can also be done interactively in Python's IDLE.
Download the Python urllib.unquote sample code:
urlEncodeDecode_python_2012-11-20.7z
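The sample above targets Python 2. As a minimal sketch, assuming Python 3 (where the function lives in urllib.parse instead of directly in urllib), the same decoding might look like this. The variable names are only illustrative:

```python
from urllib.parse import unquote

# Same encoded value as in the Python 2 sample above
encoded_url = "http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html"

# unquote() replaces every %xx escape with the character it stands for
decoded_url = unquote(encoded_url)

print("encoded_url =", encoded_url)
print("decoded_url =", decoded_url)
```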
In C#, HttpUtility.UrlDecode decodes the original URL
The relevant code:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Web; // requires a project reference to System.Web

namespace urlEncodeDecode
{
    public partial class urlEncodeDecode : Form
    {
        public urlEncodeDecode()
        {
            InitializeComponent();
        }

        private void urlEncodeDecode_Load(object sender, EventArgs e)
        {
            string encodedUrl = "http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html";
            // Decode the %xx escapes back into the original characters
            string decodedUrl = HttpUtility.UrlDecode(encodedUrl);
            MessageBox.Show("encodedUrl=" + encodedUrl + Environment.NewLine + "decodedUrl=" + decodedUrl);
        }
    }
}
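(Note: this is a WinForms project created in VS2010, which defaults to .NET 4.0. I changed it to .NET 2.0, added a reference to System.Web, and then added
using System.Web;
to the code. Only after that could
HttpUtility.UrlDecode
and related functions be used.)
When run, it shows a message box with the encoded and decoded URLs.
Download the C# HttpUtility.UrlDecode sample code:
urlEncodeDecode_csharp_2012-11-20.7z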
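How to encode an original URL
Suppose you want to encode an original URL, such as the one above,
http://www.baidu.com/cache/user/html/jump.html
as:
http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
Each language provides its own functions for this.
Before going through the per-language functions, though, it helps to explain how individual characters are encoded.
Encoding a character simply means replacing it with %xx, where xx is the character's hexadecimal value.
If you do not know a character's hexadecimal value, you can look it up in an ASCII table.
In practice, remembering the most common ones listed below is usually enough: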
ASCII character | Name | Encoded value |
' ' | space | %20 |
'!' | exclamation mark | %21 |
'&' | ampersand | %26 |
'/' | slash | %2F |
':' | colon | %3A |
'=' | equals sign | %3D |
'?' | question mark | %3F |
'~' | tilde | %7E |
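The values in the table can be reproduced by hand. The following is only a sketch to illustrate the %xx rule for single ASCII characters. percent_encode_char is a hypothetical helper made up for this example, not a library function, and real code should use the library functions described below (non-ASCII characters, for instance, must first be converted to bytes, which this sketch does not handle):

```python
def percent_encode_char(ch):
    """Return the %XX form of a single ASCII character (illustrative helper)."""
    # ord() gives the character's numeric code; format it as two uppercase hex digits
    return "%{:02X}".format(ord(ch))

# The characters listed in the table above
for ch in [" ", "!", "&", "/", ":", "=", "?", "~"]:
    print(repr(ch), "->", percent_encode_char(ch))
```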
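As noted above, the conversion itself is done by functions, so you do not need to worry about the details.
What you do need to know is which function to use.
One more thing to note: normally a space ' ' is encoded like any other character, in its hexadecimal form, %20.
In some situations, however, such as HTML form values in a query string, a space is encoded as a plus sign '+' instead.
That is why there are two sets of functions:
Python URL encoding functions
The relevant Python functions:
- Encodes a space as %20: urllib.quote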
- urllib.quote(string[, safe])
Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of the URL. The optional safe parameter specifies additional characters that should not be quoted — its default value is '/'.
Example: quote('/~connolly/') yields '/%7econnolly/'.
- Encodes a space as a plus sign '+': urllib.quote_plus
- urllib.quote_plus(string[, safe])
Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
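To make the difference between the two functions concrete, here is a short sketch. It assumes Python 3, where the functions are urllib.parse.quote and urllib.parse.quote_plus rather than urllib.quote and urllib.quote_plus. The sample string is made up for illustration:

```python
from urllib.parse import quote, quote_plus

text = "a b/c&d"  # made-up sample containing a space, a slash and an ampersand

path_style = quote(text)             # space -> %20, '/' kept because safe defaults to '/'
strict_style = quote(text, safe="")  # same, but '/' is encoded as well
form_style = quote_plus(text)        # space -> '+', '/' is encoded

print(path_style)
print(strict_style)
print(form_style)

# Encoding the whole URL from the earlier section, slashes included
print(quote_plus("http://www.baidu.com/cache/user/html/jump.html"))
```

Note that reproducing the fully encoded staticpage value needs quote_plus, or quote with safe="", because plain quote leaves '/' untouched.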
C# URL encoding functions
The C# functions for URL encoding are:
- Encodes a space as %20: HttpUtility.UrlPathEncode
The UrlPathEncode method performs the following steps:
1. Applies the encoding logic of the UrlPathEncode method to only the path part of the URL (which excludes the query string). The method assumes that the URL is encoded as a UTF-8 string.
2. Encodes non-spaces so that only a subset of the first 128 ASCII characters is used in the resulting encoded string. Any characters at Unicode value 128 and greater, or 32 and less, are URL-encoded.
3. Encodes spaces as %20.
You can encode a URL using the UrlEncode method or the UrlPathEncode method. However, the methods return different results. The UrlEncode method converts each space character to a plus character (+). The UrlPathEncode method converts each space character into the string %20, which represents a space in hexadecimal notation. Use the UrlPathEncode method when you encode the path portion of a URL in order to guarantee a consistent decoded URL, regardless of which platform or browser performs the decoding. When you use the UrlPathEncode method, the query-string values are not encoded. Therefore, any values that are past the question mark (?) in the string, will not be encoded. If you must pass a URL as a query string, use the UrlEncode method.
- Encodes a space as a plus sign '+': HttpUtility.UrlEncode
The UrlEncode(String) method can be used to encode the entire URL, including query-string values. If characters such as blanks and punctuation are passed in an HTTP stream without encoding, they might be misinterpreted at the receiving end. URL encoding converts characters that are not allowed in a URL into character-entity equivalents; URL decoding reverses the encoding. For example, when the characters < and > are embedded in a block of text to be transmitted in a URL, they are encoded as %3c and %3e.
The HttpUtility.UrlEncode method uses UTF-8 encoding by default. Therefore, calling UrlEncode(String) provides the same results as calling UrlEncode(String, Encoding) and specifying UTF8 as the second parameter.
HttpServerUtility.UrlEncode is a convenient way to access the HttpUtility.UrlEncode method at run time from an ASP.NET application. Internally, it uses HttpUtility.UrlEncode to encode strings.
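Notes on URL encoding and decoding
URLs that have been encoded more than once
Occasionally you will run into something nastier, such as this address:
http%253A%252F%252Fhuanxiao.cc%252Fmall%252Findex.php
Decoding it with the method above gives:
http%3A%2F%2Fhuanxiao.cc%2Fmall%2Findex.php
which is clearly not an ordinary URL.
Debugging showed that decoding this value one more time yields the original address:
http://huanxiao.cc/mall/index.php
The relevant code: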
# Python 2 code
import urllib

staticpage = "http://huanxiao.cc/mall/index.php"
print "Original staticpage\t\t\t\t=", staticpage
# First encoding pass: ':' -> %3A, '/' -> %2F
staticpage = urllib.quote_plus(staticpage)
print "after first quote_plus,staticpage\t\t=", staticpage
# Second encoding pass: each '%' from the first pass becomes %25
staticpage = urllib.quote_plus(staticpage)
print "after second quote_plus,staticpage\t\t=", staticpage
youWantRetUrl = "http%253A%252F%252Fhuanxiao.cc%252Fmall%252Findex.php"
print "youWantRetUrl\t\t\t\t\t=", youWantRetUrl
The output is:
Original staticpage = http://huanxiao.cc/mall/index.php
after first quote_plus,staticpage = http%3A%2F%2Fhuanxiao.cc%2Fmall%2Findex.php
after second quote_plus,staticpage = http%253A%252F%252Fhuanxiao.cc%252Fmall%252Findex.php
youWantRetUrl = http%253A%252F%252Fhuanxiao.cc%252Fmall%252Findex.php
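So, looking back at the address we started with:
http%253A%252F%252Fhuanxiao.cc%252Fmall%252Findex.php
it was encoded twice before being sent to the server in the HTTP request.
Compared with the usual case, where a URL is encoded only once before being sent, that is somewhat unusual.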
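When the number of encoding passes is not known in advance, one possible approach is to keep decoding until the value stops changing. The sketch below assumes Python 3 (urllib.parse.unquote). unquote_fully and its max_rounds parameter are hypothetical names introduced here for illustration:

```python
from urllib.parse import unquote

def unquote_fully(url, max_rounds=5):
    """Decode repeatedly until the value no longer changes (illustrative helper)."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            # Nothing left to decode
            break
        url = decoded
    return url

double_encoded = "http%253A%252F%252Fhuanxiao.cc%252Fmall%252Findex.php"
fully_decoded = unquote_fully(double_encoded)
print(fully_decoded)
```

This is a heuristic: if the original URL legitimately contained a literal %xx sequence, repeated decoding would decode it too, so it is probably only suitable when you know the original was a plain URL.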
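Why is a URL encoded before being sent to the server, rather than sent as-is?
First, a URL has to be encoded before it is sent because of the relevant specification, RFC 1738. It states that only digits (0-9), upper- and lower-case letters (a-zA-Z), the special characters $-_.+!*'(), and reserved characters used for their reserved purposes may appear unencoded in a URL.
In other words, any other character must be encoded.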
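As for why it is defined this way, a brief investigation turned up these basic reasons:
1.Many special characters already have a meaning of their own in HTML or in a URL. For example, '#' is used for positioning (HTML anchors). Characters with such a special meaning cannot be sent as literal data, so they need to be encoded.
2.Non-printable control characters cannot be displayed or printed properly, so they must be encoded before being transmitted.
Note: control characters here are the non-printable ASCII characters.
3.There are also reserved characters (&, =, :) and unsafe characters (<, >, #), which likewise require URL encoding.
4.Finally, the most obvious case: a space. If one appears in the middle of a URL, it is hard to tell whether the text before and after it belongs to one single URL, so a special character like the space definitely needs to be encoded.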
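The point about reserved characters can be seen in a small experiment. This is a sketch assuming Python 3's urllib.parse, with made-up parameter names and values. It compares a properly encoded query string with one built by naive string concatenation:

```python
from urllib.parse import urlencode, parse_qs

# Made-up parameters: the value of "q" contains the reserved characters '&' and '='
params = {"q": "a&b=c d", "page": "1"}

# urlencode() percent-encodes each value (spaces become '+', as with quote_plus)
encoded_query = urlencode(params)
print(encoded_query)

# Parsing the encoded query recovers the original value intact
parsed_encoded = parse_qs(encoded_query)
print(parsed_encoded)

# Without encoding, the '&' and '=' inside the value are read as separators
naive_query = "q=" + params["q"] + "&page=" + params["page"]
parsed_naive = parse_qs(naive_query)
print(parsed_naive)
```

In the naive case the receiving side sees an extra parameter named b and a truncated q, which is exactly the kind of misinterpretation that encoding prevents.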
[References]
1.INTRODUCTION TO URL ENCODING