Web Scraping with Selenium WebDriver

Since Selenium WebDriver was created for browser automation, it can easily be used for scraping data from the web. Selenium lets you select and navigate the components of a website that are not static and need to be clicked or chosen from drop-down menus.

If any content on the page is rendered by JavaScript, Selenium WebDriver waits for the entire page to load before crawling it, whereas libraries such as BeautifulSoup, Scrapy, and Requests work only on static pages.

Any browser action can be performed with Selenium WebDriver, which is useful when content on the page only appears after a button click, scrolling, or page navigation.

Pros of using WebDriver

  • WebDriver can simulate a real user working with a browser
  • WebDriver can scrape a web site using a specific browser
  • WebDriver can scrape complicated web pages with dynamic content
  • WebDriver is able to take screenshots of the webpage

Cons of using WebDriver

  • The program becomes quite large
  • The scraping process is slower
  • The browser generates more network traffic
  • The scraping can be detected by such simple means as Google Analytics

Web Scraping Bing with Selenium Firefox driver

Let’s now load the main Bing search page and make a query for “feng li”. You need to install the selenium module for Python. You also need geckodriver, placed in a directory on your $PATH; you can download it from https://github.com/mozilla/geckodriver/releases .
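On Linux, the setup might look like the following sketch (the v0.34.0 release and the linux64 archive are examples; pick the release matching your platform from the page above):

```shell
# Install the Selenium bindings for Python (assumes pip is available)
pip install selenium

# Fetch a geckodriver release and put it on $PATH
wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
tar -xzf geckodriver-v0.34.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/
```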

In [11]:
from selenium import webdriver
driver = webdriver.Firefox()
In [12]:
driver.get("https://www.bing.com")
In [13]:
driver.find_element_by_id("sb_form_q").send_keys("feng li")

Web Scraping Baidu with headless web driver

Using a headless Firefox requires a bit of configuration.

In [32]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # actually enable headless mode
driver = webdriver.Firefox(firefox_options=options)
driver.get("https://www.baidu.com")
print("Firefox Headless Browser Invoked")
Firefox Headless Browser Invoked
In [33]:
driver.find_element_by_id("kw").send_keys("李丰 中央财经大学")
In [34]:
results = driver.find_elements_by_xpath('//div[@srcid="1599"]/h3/a')
In [35]:
for result in results:
    print(result.text)
主页- 中央财经大学教学技术服务中心
...届全国高校经管类实验教学案例大赛中取得佳绩 - 中央财经大学...
...级博士生李丰羽在《Energy Policy》发表论文_中央财经大学金融...
李丰 – Dr. Feng Li
强势围观!这么炫酷的大学实验室,你见过吗? 丨 探秘中财实验室


Exercise: use Selenium to implement the case we studied with BeautifulSoup in L2.