This post is first created by CodingMrWang, 作者 @Zexian Wang ,please keep the original link if you want to repost it.
Get the cookie
In this post, I will introduce another way to get cookies and use it in your spider. Because most cookies have its time limit, if you keep refreshing them, the roi may be low. So, you can ask your demand side to provide lots of cookies for you. The easiest way to get cookie is to use a chrome extension, easy_cookie.
Then, when you visit a website, you can easily get your cookie.
You can create a platform to maintain the cookie pool, let some people to refresh it if it doesn’t work.
The cookie is actually a dict contains key-value pairs, some of them are useful and some of them are useless.
Clean the cookie
First, we need to clean the cookie and make it into a dict.
import tempfile
import http.cookiejar
import requests
def parse_cookies_txt(cookie_txt):
cookies = {}
temp = tempfile.NamedTemporaryFile(suffix='.txt')
temp.write(cookie_txt)
temp.seek(0)
cookie_file = temp.name
cj = http.cookiejar.MozillaCookieJar(cookie_file)
cj.load()
cookies = requests.utils.dict_from_cookiejar(cj)
return cookies
print(parse_cookies_txt(cookie_read_from_db))
result:
{"BD_UPN": "123253",
"HMACCOUNT": "F6158B27195F1823",
"BAIDU_WISE_UID": "wpass_1526442213166_683",
"PANWEB": "1", "CPROID": "1C0B8F98E983E512252738843153DC5F:FG=1",
"wkview_gotodaily_tip": "1",
"dvid": "1532061020267|b6dd9958-63fd-418c-b16e-a348ec071676", "Hm_lvt_7a3960b6f067eb0085b7f96ff5e660b0": "1529758399",
"HOSUPPORT": "1", "Hm_lvt_d101ea4d2a5c67dab98251f0b5de24dc": "1532061014,1532065084", "MCITY": "-131%3A",
"BAIDUID": "1C0B8F98E983E512252738843153DC5F:FG=1",
"SAVEUSERID": "1446a2799f5aac9ff961c2c38b127220",
"TIEBAUID": "91d2c8f7431b4409cfecd6a1",
"FP_UID": "be34394c8bbc0fdba6d7d6f3569a48d3",
"USERNAMETYPE": "1",
"pgv_pvi": "5454225408",
"PTOKEN": "cc4a61a81a12fa0ee977178a662f47fb",
"bdshare_firstime": "1529758649228", "Hm_lvt_98b9d8c2fd6608d564bf2ac2ae642948": "1529758649", "BIDUPSID":
"1C0B8F98E983E512252738843153DC5F",
"Hm_lvt_d8bfb560f8d03bbefc9bdecafc4a4bf6": "1530006734",
"HISTORY": "665b76e2cd8420dff4b8d0ac7535364b10e95f13c91c19de4bd365459f56d8f4a427",
"CHKFORREG": "1f012c7fe1852eb5bbda287ea55a12c5", "Hm_lvt_294dbbdeb1fabbed8433533a3564560e": "1526635375,1526635398,1526635680,1526786487", "Hm_lvt_55b574651fcae74b0a9f1cf9c8d7c93a": "1528188362,1528960120,1529820099,1530590462",
"BDUSS": "IyRHRRY1hDQ21oMDY4T3R-cH5hTTJHOFhJdXplM1FPRXlIZk1BaUJURkRBbmxiQVFBQUFBJCQAAAAAAAAAAAEAAAAbfmTRQ29kaW5nTXJXYW5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEN1UVtDdVFbY0",
"TIEBA_USERTYPE": "251e382d4fe9a8268e036c50",
"viewedPg": "fb5f9cb765ce050876321323%3D1%7C0%26a2260c3110661ed9ad51f315%3D1%7C0",
"SCRC": "0560b51c80ecebc14e2478aec3b6da10",
"_click_param_pc_rec_doc_2017_testid": "5",
"STOKEN": "d622721f14bb7e937073a73a9539659c725a1e4b8cc0c1328db745944cde3853",
"UBI": "fi_PncwhpxZ%7ETaKAfmIgENXtdqCBa%7EIfYpGTt1sEZnHsdpor3Q62S72ZLPlO6pToy3VsWQwl0r3rJpEzBHp",
"PSTM": "1527129643"}
Use the cookie
We can see that the parse_cookies_txt method have transferred cookie text into a dict, then we can use the dict to create reuqests.
s = requests.Session()
agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
headers = {
"HOST": "index.baidu.com",
"Referer": "http://index.baidu.com",
"User-Agent": agent
}
s.headers = headers
requests.utils.add_dict_to_cookiejar(s.cookies, newcookie)
return s
url = "http://index.baidu.com/Interface/Newwordgraph/getIndex?region=0&startdate=20180315&enddate=20180321&wordlist%5B0%5D=%E5%BF%AB%E6%89%8B"
r = s.get(url)
print(r.text)
result:
u'{"status":0,"uniqid":"5b517a027bfeb6.90573684","data":[{"key":"\\u5feb\\u624b","index":[{"period":"20180315|20180321","_all":"ER5AENvEgg5N5gHZJJNR-E55NEER-ENEZRHJNEgJgg","_pc":"Evg5NE-RENEAR-NEHAHNEEH5NEEAgNEJ-E","_wise":"EZJAJN-AH5-NAvZERNR555RNEgHHJNJvZJJNJJAHE"}]}]}'
The result is same to the one in last post CookieI.
Later, I found that the item that really matters is the BDUSS, only this item is useful when sending a request, so you can also use regex to extract BDUSS part and add it into your header, then the result would be same as well.