Collecting Distributor Information with a Python Web Scraper

A problem salespeople run into all the time is pulling together a company's distributor information. Say a company has 50 distributors worldwide: copying each one's name, address, and website into an Excel spreadsheet means 150 copies and 150 pastes. People dislike repetitive, tedious work like this; it is exactly the kind of job a computer is suited for.


Suppose I want to organize all the distributor information on the following page into one spreadsheet. How can this be done on a computer?

https://www.biologic.net/sales-network/

Python, one of today's mainstream programming languages, together with two modules, requests and Beautiful Soup, is enough to solve this problem.
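
If the two modules are not installed yet, they can be installed with pip, together with pandas, which the final script uses to build the table:

pip install requests beautifulsoup4 pandas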

Program steps

1. Make a request

Begin by importing the Requests module:

import requests

Now, let’s try to get a webpage.  

r = requests.get('https://www.biologic.net/sales-network/')

Now, we have a Response object called r. We can get all the information we need from this object.
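
Before parsing, it is worth a quick look at what came back; the status code, encoding, and body are all attributes of the Response object:

print(r.status_code)   # 200 means the request succeeded
print(r.encoding)      # encoding guessed from the HTTP headers
print(r.text[:200])    # the first 200 characters of the HTML body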

2. Make a soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. To parse a document, pass it into the BeautifulSoup constructor.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
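
The soup object represents the parsed HTML tree and can be searched right away:

print(soup.title)        # the page's <title> element
print(soup.find('li'))   # the first <li> element, or None if there is none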

3. Locate the elements

Locate the elements that contain the information we want: here, each entry's name, position, organization, and email. The selectors below assume each record is an <li> element whose fields are successive <span> children; the matching text is collected into lists.

expert_name_list = [x.text for x in soup.select("li > span:nth-child(1)")]
expert_position_list = [x.text for x in soup.select("li > span:nth-child(2)")]
expert_organization_list = [x.text for x in soup.select("li > span:nth-child(3)")]
expert_email_list = [x.text for x in soup.select("li > span:nth-child(4)")]
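
The CSS selector "li > span:nth-child(1)" matches every <span> that is the first child of an <li>. A minimal, self-contained illustration, using a made-up HTML snippet in the same shape:

from bs4 import BeautifulSoup

html = '<ul><li><span>Alice</span><span>Director</span></li></ul>'
demo = BeautifulSoup(html, 'html.parser')
print([x.text for x in demo.select('li > span:nth-child(1)')])  # ['Alice']
print([x.text for x in demo.select('li > span:nth-child(2)')])  # ['Director']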

The final code

A few lines of code are enough to do the job. (Note that this final example targets a different page, an expert directory on cmba.org.cn, scraped with the same selectors.)

import requests
from bs4 import BeautifulSoup
import pandas as pd


# Fetch the page and set the encoding explicitly so the Chinese text decodes correctly
r = requests.get('http://www.cmba.org.cn/fzjg/wylist.aspx-nodeid=144&userid=52.htm')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')

# Each field is the n-th <span> inside an <li>
expert_name_list = [x.text for x in soup.select("li > span:nth-child(1)")]
expert_position_list = [x.text for x in soup.select("li > span:nth-child(2)")]
expert_organization_list = [x.text for x in soup.select("li > span:nth-child(3)")]
expert_email_list = [x.text for x in soup.select("li > span:nth-child(4)")]

# Column labels are Chinese for: expert name, position, organization, email
data = {'专家名字': expert_name_list,
        '职务': expert_position_list,
        '组织': expert_organization_list,
        '邮件': expert_email_list}
frame = pd.DataFrame(data)

# str.strip() returns a new Series, so assign the result back to each column
frame['专家名字'] = frame['专家名字'].str.strip()
frame['职务'] = frame['职务'].str.strip()
frame['组织'] = frame['组织'].str.strip()
frame['邮件'] = frame['邮件'].str.strip()
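
Since the goal was an Excel spreadsheet, the cleaned DataFrame can be written out with pandas. The filename here is only an example, and to_excel needs an engine such as openpyxl installed:

frame.to_excel('experts.xlsx', index=False)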