# Interactive Scraping with Selenium


Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/python](https://feng.li/python)

## Web scraping with `Selenium` web driver

Since Selenium WebDriver was created for browser automation, it can easily be used to scrape data from the web. Selenium is used to select and navigate the components of a website that are not static and need to be clicked or chosen from drop-down menus.

If any content on the page is rendered by JavaScript, Selenium WebDriver loads the page in a real browser and can wait for it to finish loading before crawling, whereas libraries such as BeautifulSoup, Scrapy, and Requests work only on the static HTML returned by the server.

Any browser action can be performed with Selenium WebDriver, which helps when content is displayed only after a button click, scrolling, or page navigation.
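As an illustration of content that appears only after scrolling, the sketch below keeps scrolling to the bottom of the page until the page height stops growing. It relies only on the driver's `execute_script` method. The function name `scroll_to_bottom`, the pause length, and the round limit are assumptions made for this example.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds


# Usage with a live browser (not executed here):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://www.bing.com/")
#   scroll_to_bottom(driver)
```

A fixed pause is the simplest choice. On slow connections it may need to be longer, since a page that has not finished loading looks the same as a page with nothing more to load.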

#### Pros of using WebDriver

- WebDriver can simulate a real user working with a browser
- WebDriver can scrape a web site using a specific browser
- WebDriver can scrape complicated web pages with dynamic content
- WebDriver is able to take screenshots of the webpage
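
The screenshot capability in the last point can be sketched as follows. The helper `capture_page` and the default file name are hypothetical. `get` and `save_screenshot` are the WebDriver methods it relies on.

```python
def capture_page(driver, url, path="page.png"):
    """Load `url` and save a PNG screenshot of the visible page to `path`."""
    driver.get(url)
    # save_screenshot returns False if the file could not be written
    return driver.save_screenshot(path)


# Usage with a live browser (not executed here):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   capture_page(driver, "https://www.bing.com/", "bing.png")
#   driver.quit()
```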


#### Cons of using WebDriver

- The program becomes quite large
- The scraping process is slower
- The browser generates more network traffic
- The scraping can be detected by such simple means as Google Analytics

## Web Scraping Bing with `Selenium` Firefox driver

- Let's now load the main Bing search page and run a query for `feng li cufe`.

- You need to install the `selenium` module for Python.

- You also need `geckodriver`. This program provides the HTTP API described by the WebDriver protocol to communicate with Gecko browsers, such as Firefox.

- Place `geckodriver` in a directory listed in `$PATH`. You can download it from https://github.com/mozilla/geckodriver/releases.
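
Before starting the browser, it can help to confirm that `geckodriver` is actually reachable through `$PATH`. The check below is a small sketch using only the standard library. The function name `find_driver` is made up for this example.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print(f"geckodriver found at {path}")
```

Selenium 4.6 and later can also fetch a matching driver automatically through Selenium Manager, so this check matters mostly for older releases.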

In [2]:
! pip3 install selenium

Looking in indexes: https://mirrors.163.com/pypi/simple/
Collecting selenium
 Downloading https://mirrors.163.com/pypi/packages/ad/24/39cab5fbaf425ff522e1e51cce79f94f10f9523f015d2b2251e43f45e8a2/selenium-4.0.0-py3-none-any.whl (954 kB)
Collecting trio-websocket~=0.9
 Downloading https://mirrors.163.com/pypi/packages/db/c5/b5e8bc1f40568a354f2a9cc296b8892605a9d2f22e725290fc33836dd2a3/trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
 Downloading https://mirrors.163.com/pypi/packages/35/c3/5a4befc3812b3b606e0ae9338bfdd02da3ad0a90df27dc66c37315e94f5c/trio-0.19.0-py3-none-any.whl (356 kB)
Collecting outcome
 Downloading https://mirrors.163.com/pypi/packages/0d/bb/f60ce97b304b1979d1fef96b6517af47b9bb026770b1f198b6e921b5edf5/outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting sniffio
 Downloading https://mirrors.163.com/pypi/packages/

In [24]:
from selenium import webdriver
driver = webdriver.Firefox()

In [25]:
driver.get("https://www.bing.com/")

In [26]:
from selenium.webdriver.common.by import By
driver.find_element(By.ID,"sb_form_q").send_keys("feng li cufe")

In [27]:
driver.find_element(By.ID, "search_icon").click()

In [28]:
driver.close()

## Web Scraping Baidu with headless web driver

Running Firefox in headless mode requires a bit of configuration.

In [34]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

# Pass --headless so that Firefox runs without opening a window
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
print("Firefox Headless Browser Invoked")

Firefox Headless Browser Invoked


In [35]:
driver.get("https://www.baidu.com/")
# Type the query 李丰 中央财经大学 (Feng Li, Central University of Finance and Economics)
driver.find_element(By.ID, "kw").send_keys("李丰 中央财经大学")

In [36]:
driver.find_element(By.ID, "su").click()

In [37]:
# The XPath matches the container with id "content_left", which holds all search results
results = driver.find_elements(By.XPATH, '//*[@id="content_left"]')

In [38]:
for result in results:
    print(result.text)

李丰-中央财经大学统计与数学学院
[官方]2014年4月25日 李丰博士现任中央财经大学统计与数学学院副院长、副教授、大数据分析专业硕士导师,中国统计教育学会高等教育分会会副秘书长。博士毕业于瑞典斯德哥尔摩大学,研究领域包括贝叶斯统计...
sam.cufe.edu.cn/info/1043/35.....

百度快照
李丰(中央财经大学统计与数学学院教师) - 百度百科
职业:教师
毕业院校:瑞典斯德哥尔摩大学
简介:李丰,中央财经大学统计与数学学院教师,副院长,大数据分析专业硕士导师,中国教育统计学高等教育分会会副秘书长,北...
教育背景 工作经历 研究方向 近期学术论文 著作成果 更多 >
百度百科
李丰 – Dr. Feng Li
李丰,中央财经大学统计与数学学院副院长,大数据分析专业硕士导师,中国统计教育学会高等教育分会会副秘书长。博士毕业于瑞典斯德哥尔摩大学,研究领域包括贝叶斯计算,统计预测,...
feng.li/cn/

百度快照
COS访谈第22期:李丰老师
2016年11月21日 李丰,博士, 中央财经大学统计与数学学院,副院长,硕士研究生导师, 主要研究方向为大数据与复杂模型、贝叶斯推断与统计计算、计量经济与预测方法以及多元模型。现任北京大数据协会理...
搜狐网

百度快照
中央财经大学李丰博士应邀到我校做学术报告-广西科技大学...
2019年12月9日 12月7日下午,中央财经大学统计与数学学院副院长李丰博士应邀在理学院206报告厅做主题为《统计与大数据:工具与未来》的学术报告。理学院张涛副院长主持报告会,部分教师、应用统计学专...
www.gxust.edu.cn/lxy/info/1023...

百度快照
其他人还在搜
中央财经大学四大才子李丰简介孙志猛中央财经大学王成章中央财经大学中央财经大学孙晓伟中央财经大学方意李丰是谁中央财经大学林木
中央财经大学统计与数学学院导师教师师资介绍简介-李丰
2020年4月20日 2007年8月-2008年7月 瑞典达拉那大学统计学系硕士研究生,获统计学硕士学位2003年9月-2007年6月 中国人民大学统计学院本科学生,获经济学学士学位 工作经历2013年...
school.freekaoyan.com/bj/cufe/...

百度快照
海量数据驱动场景及其数据科学方法 - 

In [39]:
driver.close()
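
If an exception is raised between starting the browser and closing it, the browser process can be left running. One possible way to guard against this is a small context manager that always calls `quit()`, which closes every window and ends the WebDriver session. The name `managed_driver` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and make sure it is shut down afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the scraping code raised an exception


# Usage with a live browser (not executed here):
#   from selenium import webdriver
#   with managed_driver(webdriver.Firefox) as driver:
#       driver.get("https://www.baidu.com/")
```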

## Lab

Use `selenium` to implement the case we studied with `BeautifulSoup` in the previous lab.