This post is first created by CodingMrWang, 作者 @Zexian Wang ,please keep the original link if you want to repost it.
Cookie
Cookie is a small piece of data store in user’s computer used to identify a user. Sometime, before we crawl a website, the website requires login first, then we have to pretended to be a user who have already login. So we need to use cookie when we visit a website. We use Urllib2 package to store our cookie of login.
Get the Cookie
Firstly, we need to use CookieJar to get cookie and save it into a variable.
import urllib2
import cookielib
#decalre a CookieJar instance to save cookie
cookie = cookielib.CookieJar()
#use HTTPCookieProcessor in urllib2 package to create cookie handler
handler=urllib2.HTTPCookieProcessor(cookie)
#user handler to crate an opener
opener = urllib2.build_opener(handler)
response = opener.open('https://www.google.com.hk/')
Then cookie is saved to the variable. We can print the value of the cookie
for item in cookie:
print 'Name = '+item.name
print 'Value = '+item.value
result:
Name = 1P_JAR
Value = 2018-07-14-15
Name = NID
Value = 134=CsJE-MeMjz4QNq3svsTGAEP0HKMleMafx6h6x6Jfp8lXTcG85x29Maxe_ok2WJo1aJl-ceOPPTFKqYNAApBpxu9lFohOoBeULfUj3-2L0pn_NGU7J4BvCgKeSfrHWFHT
Save Cookie into a file
We can also save the cookie into a file, then read the cookie from the file to visit the webisite.
import cookielib
import urllib2
#set a file to write the cookie
filename = 'cookie'
#decalre a MozillaCookieJar instance to save cooki and write to file later
cookie = cookielib.MozillaCookieJar(filename)
handler = urllib2.HTTPCookieProcessor(cookie)
#use handler to build an opener
opener = urllib2.build_opener(handler)
#create an request
response = opener.open("https://www.google.com.hk/")
#save cookie to a file
cookie.save(ignore_discard=True, ignore_expires=True)
Read Cookie from the file
We can use the Cookie save in the file to visit website.
import cookielib
import urllib2
#Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
#read cookie from the file
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
#use build_opener in urllib2 to create a opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open("https://www.google.com.hk/")
print response.read()
Baidu Index Example
Here, I give you an example of Baidu Index Spider I wrote before, and I wrote it in python3, if you are using python3, you can directly use this example, if you are using python2, just change the urllib2 to urllib and cookielib to http.cookiejar will be fine.
import urllib
import http.cookiejar
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
postdata = urllib.parse.urlencode({
'username':'youaccount',
'password':'yourpassword'
}).encode("utf-8")
#login URL
loginUrl = 'https://wappass.baidu.com/passport/?login&u=http://index.baidu.com/baidu-index-mobile/#/'
#pretend to login and save the cookie
result = opener.open(loginUrl,postdata)
#write cookie into the file
cookie.save(ignore_discard=True, ignore_expires=True)
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
#this is the url you want to request
url "http://index.baidu.com/Interface/Newwordgraph/getIndex?region=0&startdate=20180315&enddate=20180321&wordlist%5B0%5D=%E5%BF%AB%E6%89%8B")
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(url)
print(response.read())
result:
b'{"status":0,"uniqid":"5b4b047a4ea518.31264000","data":[{"key":"\\u5feb\\u624b","index":[{"period":"20180315|20180321","_all":"KDfMKg6K00fgf0HiFFgD,KffgKKD,KgKiDHFgK0F00","_pc":"K60fgK,DKgKMD,gKHMHgKKHfgKKM0gKF,K","_wise":"KiFMFg,MHf,gM6iKDgDfffDgK0HHFgF6iFFgFFMHK"}]}]}'
In Crawl with cookie (II), I will introduce another method to get cookie and use cookie.