{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Interactive Scraping with Selenium\n", "\n", "\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/python](https://feng.li/python)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Web scraping with `Selenium` web driver\n", "\n", "Since Selenium WebDriver is created for browser automation, it can be easily used for scraping data from the web. Selenium is to select and navigate the components of a website that are non-static and need to be clicked or chosen from drop-down menus. \n", "\n", "If there is any content on the page rendered by javascript then Selenium webdriver wait for the entire page to load before crwaling whereas other libs like BeautifulSoup,Scrapy and Requests works only on static pages.\n", "\n", "Any browsyer actions can be done with the help of Selenium webdriver, if there is any content on the page displayed by on button click or Scrolling or Page Navigation." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Pros of using WebDriver\n", "\n", "- WebDriver can simulate a real user working with a browser\n", "- WebDriver can scrape a web site using a specific browser\n", "- WebDriver can scrape complicated web pages with dynamic content\n", "- WebDriver is able to take screenshots of the webpage\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Cons of using WebDriver\n", "\n", "- The program becomes quite large\n", "- The scraping process is slower\n", "- The browser generates a bigger network traffic\n", "- The scraping can be detected by such simple means as Google Analytics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Web Scraping Bing with `Selenium` Firefox driver\n", "\n", "- Let’s now load the main bing search page and makes a query to look for `feng li`.\n", "\n", "- You need to install `selenium` module for Python. \n", "\n", "- You also need `geckodriver`. This program provides the HTTP API described by the WebDriver protocol to communicate with Gecko browsers, such as Firefox.\n", "\n", "\n", "- Place `geckodriver` in a directory where `$PATH` can find. You could download it from https://github.com/mozilla/geckodriver/releases." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://mirrors.163.com/pypi/simple/\n", "Collecting selenium\n", " Downloading https://mirrors.163.com/pypi/packages/ad/24/39cab5fbaf425ff522e1e51cce79f94f10f9523f015d2b2251e43f45e8a2/selenium-4.0.0-py3-none-any.whl (954 kB)\n", "\u001b[K |████████████████████████████████| 954 kB 3.6 MB/s eta 0:00:01\n", "\u001b[?25hCollecting trio-websocket~=0.9\n", " Downloading https://mirrors.163.com/pypi/packages/db/c5/b5e8bc1f40568a354f2a9cc296b8892605a9d2f22e725290fc33836dd2a3/trio_websocket-0.9.2-py3-none-any.whl (16 kB)\n", "Collecting trio~=0.17\n", " Downloading https://mirrors.163.com/pypi/packages/35/c3/5a4befc3812b3b606e0ae9338bfdd02da3ad0a90df27dc66c37315e94f5c/trio-0.19.0-py3-none-any.whl (356 kB)\n", "\u001b[K |████████████████████████████████| 356 kB 6.9 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: urllib3[secure]~=1.26 in /usr/lib/python3/dist-packages (from selenium) (1.26.5)\n", "Collecting outcome\n", " Downloading https://mirrors.163.com/pypi/packages/0d/bb/f60ce97b304b1979d1fef96b6517af47b9bb026770b1f198b6e921b5edf5/outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)\n", "Requirement already satisfied: attrs>=19.2.0 in /usr/lib/python3/dist-packages (from trio~=0.17->selenium) (20.3.0)\n", "Collecting sniffio\n", " Downloading https://mirrors.163.com/pypi/packages/52/b0/7b2e028b63d092804b6794595871f936aafa5e9322dcaaad50ebf67445b3/sniffio-1.2.0-py3-none-any.whl (10 kB)\n", "Collecting sortedcontainers\n", " Downloading https://mirrors.163.com/pypi/packages/32/46/9cb0e58b2deb7f82b84065f37f3bffeb12413f947f9388e4cac22c4621ce/sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)\n", "Requirement already satisfied: async-generator>=1.9 in /usr/local/lib/python3.9/dist-packages (from trio~=0.17->selenium) (1.10)\n", "Requirement already satisfied: idna in /usr/lib/python3/dist-packages (from trio~=0.17->selenium) (2.10)\n", "Collecting wsproto>=0.14\n", " Downloading https://mirrors.163.com/pypi/packages/ea/25/0934b1d00f404d75335b144d4396e01998f25db8953bf54b4d6fe65b80ab/wsproto-1.0.0-py3-none-any.whl (24 kB)\n", "Requirement already satisfied: certifi in /usr/lib/python3/dist-packages (from urllib3[secure]~=1.26->selenium) (2020.6.20)\n", "Requirement already satisfied: cryptography>=1.3.4 in /usr/lib/python3/dist-packages (from urllib3[secure]~=1.26->selenium) (3.3.2)\n", "Requirement already satisfied: pyOpenSSL>=0.14 in /usr/lib/python3/dist-packages (from urllib3[secure]~=1.26->selenium) (21.0.0)\n", "Collecting h11<1,>=0.9.0\n", " Downloading https://mirrors.163.com/pypi/packages/60/0f/7a0eeea938eaf61074f29fed9717f2010e8d0e0905d36b38d3275a1e4622/h11-0.12.0-py3-none-any.whl (54 kB)\n", "\u001b[K |████████████████████████████████| 54 kB 1.7 MB/s eta 0:00:01\n", "\u001b[?25hInstalling collected packages: sortedcontainers, sniffio, outcome, h11, wsproto, trio, trio-websocket, selenium\n", "Successfully installed h11-0.12.0 outcome-1.1.0 selenium-4.0.0 sniffio-1.2.0 sortedcontainers-2.4.0 trio-0.19.0 trio-websocket-0.9.2 wsproto-1.0.0\n" ] } ], "source": [ "! 
pip3 install selenium" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from selenium import webdriver\n", "driver = webdriver.Firefox()" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "driver.get(\"https://www.bing.com/\")" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from selenium.webdriver.common.by import By\n", "driver.find_element(By.ID, \"sb_form_q\").send_keys(\"feng li cufe\")" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "driver.find_element(By.ID, \"search_icon\").click()" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "driver.close()" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Web Scraping Baidu with a headless web driver\n", "\n", "Using headless Firefox requires a bit of extra configuration." ] },
{ "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Firefox Headless Browser Invoked\n" ] } ], "source": [ "from selenium import webdriver\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.firefox.options import Options\n", "\n", "# Run Firefox without a visible browser window\n", "options = Options()\n", "options.add_argument(\"--headless\")\n", "driver = webdriver.Firefox(options=options)\n", "print(\"Firefox Headless Browser Invoked\")" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "driver.get(\"https://www.baidu.com/\")\n", "driver.find_element(By.ID, \"kw\").send_keys(\"李丰 中央财经大学\")" ] },
{ "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "driver.find_element(By.ID, \"su\").click()" ] },
{ "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "results = driver.find_elements(By.XPATH, '//*[@id=\"content_left\"]')" ] },
{ "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "李丰-中央财经大学统计与数学学院\n", "[官方]2014年4月25日 李丰博士现任中央财经大学统计与数学学院副院长、副教授、大数据分析专业硕士导师,中国统计教育学会高等教育分会会副秘书长。博士毕业于瑞典斯德哥尔摩大学,研究领域包括贝叶斯统计...\n", "sam.cufe.edu.cn/info/1043/35.....\n", "\n", "百度快照\n", "李丰(中央财经大学统计与数学学院教师) - 百度百科\n", "职业:教师\n", "毕业院校:瑞典斯德哥尔摩大学\n", "简介:李丰,中央财经大学统计与数学学院教师,副院长,大数据分析专业硕士导师,中国教育统计学高等教育分会会副秘书长,北...\n", "教育背景 工作经历 研究方向 近期学术论文 著作成果 更多 >\n", "百度百科\n", "李丰 – Dr. 
Feng Li\n", "李丰,中央财经大学统计与数学学院副院长,大数据分析专业硕士导师,中国统计教育学会高等教育分会会副秘书长。博士毕业于瑞典斯德哥尔摩大学,研究领域包括贝叶斯计算,统计预测,...\n", "feng.li/cn/\n", "\n", "百度快照\n", "COS访谈第22期:李丰老师\n", "2016年11月21日 李丰,博士, 中央财经大学统计与数学学院,副院长,硕士研究生导师, 主要研究方向为大数据与复杂模型、贝叶斯推断与统计计算、计量经济与预测方法以及多元模型。现任北京大数据协会理...\n", "搜狐网\n", "\n", "百度快照\n", "中央财经大学李丰博士应邀到我校做学术报告-广西科技大学...\n", "2019年12月9日 12月7日下午,中央财经大学统计与数学学院副院长李丰博士应邀在理学院206报告厅做主题为《统计与大数据:工具与未来》的学术报告。理学院张涛副院长主持报告会,部分教师、应用统计学专...\n", "www.gxust.edu.cn/lxy/info/1023...\n", "\n", "百度快照\n", "其他人还在搜\n", "中央财经大学四大才子李丰简介孙志猛中央财经大学王成章中央财经大学中央财经大学孙晓伟中央财经大学方意李丰是谁中央财经大学林木\n", "中央财经大学统计与数学学院导师教师师资介绍简介-李丰\n", "2020年4月20日 2007年8月-2008年7月 瑞典达拉那大学统计学系硕士研究生,获统计学硕士学位2003年9月-2007年6月 中国人民大学统计学院本科学生,获经济学学士学位 工作经历2013年...\n", "school.freekaoyan.com/bj/cufe/...\n", "\n", "百度快照\n", "海量数据驱动场景及其数据科学方法 - 李丰 副教授 (中央财...\n", "报告人:李丰 副教授 (中央财经大学) 时间: 2021年10月24日(周日)下午16:00-17:00 地点: 腾讯会议 会议 ID: 804 4124 3260 摘要: 随着海量数据的涌现,全新的商业管理和决策场...\n", "math.cnu.edu.cn/xzhd1/ebb9da99...\n", "\n", "百度快照\n", "李丰简历_中央财经大学统计与数学学院副院长李丰受邀参会...\n", "2021年5月19日 李丰,博士, 中央财经大学统计与数学学院,副院长,硕士研究生导师, 主要研究方向为大数据与复杂模型、贝叶斯推断与统计计算、计量经济与预测方法以及多元模型。现...\n", "news.huodongjia.com/1391...html\n", "\n", "百度快照\n", "COS 访谈第 22 期: 李丰老师 | 统计之都\n", "李丰,博士,中央财经大学统计与数学学院,副院长,硕士研究生导师, 主要研究方向为大数据与复杂模型、贝叶斯推断与统计计算、计量经济与预测方法以及多元模型。现任北京大数据协...\n", "cosx.org/2016/11/interview-fen...\n", "\n", "百度快照\n", "凝心聚力谋发展,继往开来谱新篇——记中央财经大学统计与...\n", "2018年7月8日 统计与数学学院院长殷先军,副院长贾尚晖,副院长李丰,党委副书记、纪委书记李全敏,中央财经大学校学生会主席温文,国际经济与贸易学院学生会主席罗丹,统计与数学学院团委书记姚逍,学工...\n", "sam.cufe.edu.cn/info/1036/21.....\n", "\n", "百度快照\n" ] } ], "source": [ "for result in results:\n", " print(result.text)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "driver.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lab\n", "\n", "Use `selenium` to implement the case we studied with `BeautifulSoup` in the previous lab." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "589px", "left": "10px", "top": "150px", "width": "225px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }