Web Scraping with Selenium WebDriver

Since Selenium WebDriver was created for browser automation, it can easily be used for scraping data from the web. Selenium lets you select and navigate the components of a website that are not static and need to be clicked or chosen from drop-down menus.

If any content on the page is rendered by JavaScript, Selenium WebDriver waits for the entire page to load before crawling it, whereas libraries such as BeautifulSoup, Scrapy, and Requests work only on static pages.

Any browser action can be performed with Selenium WebDriver, which is useful when content on the page only appears after a button click, scrolling, or page navigation.

Pros of using WebDriver

  • WebDriver can simulate a real user working with a browser
  • WebDriver can scrape a web site using a specific browser
  • WebDriver can scrape complicated web pages with dynamic content
  • WebDriver is able to take screenshots of the webpage

Cons of using WebDriver

  • The program becomes quite large
  • The scraping process is slower
  • The browser generates more network traffic
  • The scraping can be detected by such simple means as Google Analytics

Web Scraping Bing with Selenium Firefox driver

Let’s now load the main Bing search page and make a query for “feng li”. You need to install the selenium module for Python. You also need geckodriver, placed in a directory on your $PATH; you can download it from https://github.com/mozilla/geckodriver/releases .
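On Linux, the setup might look like the following sketch (the v0.34.0 release and the linux64 archive are examples; pick the release matching your platform from the page above):

```shell
# Install the Selenium bindings for Python (assumes pip is available)
pip install selenium

# Fetch a geckodriver release and put it on $PATH
wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
tar -xzf geckodriver-v0.34.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/
```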

In [11]:
from selenium import webdriver
driver = webdriver.Firefox()
In [12]:
driver.get("https://www.bing.com")
In [13]:
driver.find_element_by_id("sb_form_q").send_keys("feng li")

Web Scraping Baidu with headless web driver

Using a headless Firefox requires a bit of configuration.

In [32]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # actually enable headless mode
driver = webdriver.Firefox(firefox_options=options)
driver.get("https://www.baidu.com")
print("Firefox Headless Browser Invoked")
Firefox Headless Browser Invoked
In [33]:
driver.find_element_by_id("kw").send_keys("李丰 中央财经大学")
In [34]:
results = driver.find_elements_by_xpath('//div[@srcid="1599"]/h3/a')
In [35]:
for result in results:
    print(result.text)
主页- 中央财经大学教学技术服务中心
...届全国高校经管类实验教学案例大赛中取得佳绩 - 中央财经大学...
...级博士生李丰羽在《Energy Policy》发表论文_中央财经大学金融...
李丰 – Dr. Feng Li
强势围观!这么炫酷的大学实验室,你见过吗? 丨 探秘中财实验室


Exercise: use Selenium to implement the case we studied with BeautifulSoup in L2.