Collecting Distributor Information with a Python Web Scraper
A problem salespeople often run into is gathering a company's channel-partner information. Suppose a company has 50 distributors worldwide: copying each one's name, address, and website into an Excel sheet means roughly 150 copy operations and 150 paste operations. People dislike this kind of repetitive, tedious work; it is exactly the kind of job a computer is good at.
Say I want to pull all the distributor information from the following page into a single table. How can this be done on a computer?
https://www.biologic.net/sales-network/
With Python, one of today's mainstream programming languages, together with the requests and Beautiful Soup modules, the problem is easy to solve.
Steps
1. Make a request
Begin by importing the Requests module:
import requests
Now, let’s try to get a webpage.
r = requests.get('https://www.biologic.net/sales-network/')
Now we have a Response object called r. We can get all the information we need from this object.
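Besides the page body, a Response object carries the HTTP status and encoding. The sketch below builds a Response by hand purely to illustrate those attributes without touching the network (setting _content directly is a demonstration trick, not something a real script needs — requests.get() fills all of this in for you):

```python
import requests

# A hand-built Response to illustrate the attributes that a real
# requests.get() call would populate (no network access needed here;
# assigning _content directly is just for demonstration).
r = requests.Response()
r.status_code = 200
r.encoding = "utf-8"
r._content = b"<html><body>hello</body></html>"

print(r.status_code)  # 200
print(r.ok)           # True, since the status code is below 400
print(r.text)         # the body decoded using r.encoding
```

In a real script you would check r.status_code (or r.ok) before parsing, and set r.encoding when the server reports the wrong charset.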
2. Make a soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. To parse a document, pass it into the BeautifulSoup constructor.
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
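To see what the constructor gives us, here is a minimal sketch parsing a tiny made-up HTML snippet instead of a live page:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a downloaded page
html = "<ul><li><span>Alice</span><span>Director</span></li></ul>"
soup = BeautifulSoup(html, "html.parser")

# The soup can be navigated by tag name or searched with find_all()
print(soup.li.span.text)              # Alice
print(soup.find_all("span")[1].text)  # Director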
3. Locate the elements
Locate the elements that contain the target information, then collect their text into lists. The CSS selectors below assume each <li> holds four <span> elements in a fixed order (name, position, organization, email), which matches the experts page used in the final code.
expert_name_list = [x.text for x in soup.select("li > span:nth-child(1)")]
expert_position_list = [x.text for x in soup.select("li > span:nth-child(2)")]
expert_organization_list = [x.text for x in soup.select("li > span:nth-child(3)")]
expert_email_list = [x.text for x in soup.select("li > span:nth-child(4)")]
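A quick way to convince yourself the selectors do what we want is to run them against made-up markup that mirrors the assumed structure:

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring what the selectors assume:
# each <li> holds four <span> elements in a fixed order.
html = """
<ul>
  <li><span>Alice</span><span>Professor</span><span>CMBA</span><span>alice@example.org</span></li>
  <li><span>Bob</span><span>Doctor</span><span>City Hospital</span><span>bob@example.org</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# nth-child(1) picks the first <span> in every <li>, nth-child(4) the fourth
names = [x.text for x in soup.select("li > span:nth-child(1)")]
emails = [x.text for x in soup.select("li > span:nth-child(4)")]

print(names)   # ['Alice', 'Bob']
print(emails)  # ['alice@example.org', 'bob@example.org']
```

If the real page nests its data differently, the selector strings are the only thing that needs to change.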
The final code
A few lines of code are all it takes. Note that this finished example targets a different page from the one above: a directory of experts on the CMBA website, where each list item holds a name, position, organization, and email.
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get('http://www.cmba.org.cn/fzjg/wylist.aspx-nodeid=144&userid=52.htm')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
expert_name_list = [x.text for x in soup.select("li > span:nth-child(1)")]
expert_position_list = [x.text for x in soup.select("li > span:nth-child(2)")]
expert_organization_list = [x.text for x in soup.select("li > span:nth-child(3)")]
expert_email_list = [x.text for x in soup.select("li > span:nth-child(4)")]
data = {'专家名字': expert_name_list,
'职务': expert_position_list,
'组织': expert_organization_list,
'邮件': expert_email_list}
frame = pd.DataFrame(data)
# str.strip() returns a new Series rather than modifying the frame
# in place, so the result must be assigned back
for col in frame.columns:
    frame[col] = frame[col].str.strip()
# Save the cleaned table to an Excel file
frame.to_excel('experts.xlsx', index=False)
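One detail worth calling out: pandas string methods such as str.strip() return a new Series and leave the DataFrame untouched, so the result must be assigned back. A minimal sketch with made-up scraped values:

```python
import pandas as pd

# Hypothetical scraped values with the stray whitespace raw page text often has
data = {"name": [" Alice ", " Bob "],
        "email": [" alice@example.org", "bob@example.org "]}
frame = pd.DataFrame(data)

# str.strip() returns a new Series; assign it back for the cleanup to stick
for col in frame.columns:
    frame[col] = frame[col].str.strip()

print(frame["name"].tolist())  # ['Alice', 'Bob']
```

Calling frame[col].str.strip() on its own line, without the assignment, would compute the stripped values and immediately discard them.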