Crawl with Cookies (I)

When you need to log in

Posted by CodingMrWang on July 14, 2018

This post was first created by CodingMrWang (author: @Zexian Wang). Please keep the original link if you want to repost it.

A cookie is a small piece of data stored on a user's computer that identifies the user to a website. Sometimes a website requires login before we can crawl it, so we have to pretend to be a user who has already logged in, which means sending the right cookies with every request. Here we use the urllib2 package to store our login cookies.

First, we use CookieJar to capture the cookies and save them into a variable.

import urllib2
import cookielib
#declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
#use HTTPCookieProcessor in the urllib2 package to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
#use the handler to create an opener
opener = urllib2.build_opener(handler)
response = opener.open('https://www.google.com.hk/')

The cookies are now saved in the variable, and we can print their names and values:

for item in cookie:
    print 'Name = '+item.name
    print 'Value = '+item.value

result:

Name = 1P_JAR
Value = 2018-07-14-15
Name = NID
Value = 134=CsJE-MeMjz4QNq3svsTGAEP0HKMleMafx6h6x6Jfp8lXTcG85x29Maxe_ok2WJo1aJl-ceOPPTFKqYNAApBpxu9lFohOoBeULfUj3-2L0pn_NGU7J4BvCgKeSfrHWFHT
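
For reference, here is the same flow in Python 3, written as a minimal sketch using http.cookiejar and urllib.request (the Python 3 replacements for cookielib and urllib2):

import http.cookiejar
import urllib.request

#declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
#build an opener whose HTTPCookieProcessor stores cookies from responses
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open('https://www.google.com.hk/')
for item in cookie:
    print('Name = ' + item.name)
    print('Value = ' + item.value)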

We can also save the cookies into a file, then later read them back from the file to visit the website.

import cookielib
import urllib2

#the file the cookies will be written to
filename = 'cookie.txt'
#declare a MozillaCookieJar instance so the cookies can be saved to a file later
cookie = cookielib.MozillaCookieJar(filename)
handler = urllib2.HTTPCookieProcessor(cookie)
#use the handler to build an opener
opener = urllib2.build_opener(handler)
#open the URL; the response sets the cookies
response = opener.open("https://www.google.com.hk/")
#save the cookies to the file; ignore_discard also keeps session cookies, ignore_expires keeps expired ones
cookie.save(ignore_discard=True, ignore_expires=True)
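
If you are curious what was written, you can print the file; MozillaCookieJar stores cookies in the Netscape/Mozilla cookies.txt text format:

#print the raw contents of the saved cookie file
print(open(filename).read())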

We can then use the cookies saved in the file to visit the website.

import cookielib
import urllib2
 
#Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
#read cookie from the file
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
#use build_opener in urllib2 to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open("https://www.google.com.hk/")
print response.read()
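
As a side note, urllib2 also lets you install the opener globally, so that plain urllib2.urlopen() sends the loaded cookies automatically; a minimal sketch:

#install the cookie-aware opener as the process-wide default
urllib2.install_opener(opener)
#from now on, urllib2.urlopen uses the cookies loaded above
response = urllib2.urlopen("https://www.google.com.hk/")
print response.read()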

Baidu Index Example

Here is an example Baidu Index spider I wrote before, in Python 3. If you are using Python 3 you can use it directly; if you are using Python 2, just change urllib.request back to urllib2 and http.cookiejar back to cookielib.
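If you want one script that runs on both versions, a common trick is to alias the imports; a minimal sketch (the alias names request and cookiejar are my own choice):

try:
    #Python 2
    import urllib2 as request
    import cookielib as cookiejar
except ImportError:
    #Python 3
    import urllib.request as request
    import http.cookiejar as cookiejar

cookie = cookiejar.MozillaCookieJar('cookie.txt')
opener = request.build_opener(request.HTTPCookieProcessor(cookie))

The example below uses the Python 3 names directly.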

import urllib.request
import urllib.parse
import http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
postdata = urllib.parse.urlencode({
            'username': 'yourusername',
            'password': 'yourpassword'
        }).encode("utf-8")
#login URL
loginUrl = 'https://wappass.baidu.com/passport/?login&u=http://index.baidu.com/baidu-index-mobile/#/'
#post the login form; the response sets the login cookies
result = opener.open(loginUrl, postdata)
#write the cookies into the file
cookie.save(ignore_discard=True, ignore_expires=True)

cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
#this is the URL you want to request
url = "http://index.baidu.com/Interface/Newwordgraph/getIndex?region=0&startdate=20180315&enddate=20180321&wordlist%5B0%5D=%E5%BF%AB%E6%89%8B"
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(url)
print(response.read())

result:

b'{"status":0,"uniqid":"5b4b047a4ea518.31264000","data":[{"key":"\\u5feb\\u624b","index":[{"period":"20180315|20180321","_all":"KDfMKg6K00fgf0HiFFgD,KffgKKD,KgKiDHFgK0F00","_pc":"K60fgK,DKgKMD,gKHMHgKKHfgKKM0gKF,K","_wise":"KiFMFg,MHf,gM6iKDgDfffDgK0HHFgF6iFFgFFMHK"}]}]}'

In Crawl with Cookies (II), I will introduce another way to get and use cookies.