# Web Scraping with Python

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/python](https://feng.li/python)

# What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.


- In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.

- In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamental basics of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first part and delve into the more specific in the second part as needed.
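The "request, then parse, then extract" cycle described above can be tried without touching the network. The following is only a minimal sketch using the standard library's `html.parser`. The `TitleParser` class and the `raw_html` string are made up for illustration and stand in for a real server response.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.in_title:
            self.title += data


# Hypothetical page body, standing in for what a web server would return
raw_html = "<html><head><title>Example Page</title></head><body><p>Hello</p></body></html>"

parser = TitleParser()
parser.feed(raw_html)   # parse the document
print(parser.title)     # extract the piece we need
```

In a real scraper the string would come from an HTTP response, and a library such as `BeautifulSoup` would replace the hand-written parser class.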

# Your First Web Scraper

## Let's try a toy example first

In [1]:
from urllib.request import urlopen
html = urlopen('https://feng.li/python/')
print(html.read()) # does not look nice for human eyes.

b'<!doctype html>\n<html lang="en-US" class="respect-color-scheme-preference">\n<head>\n\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<title>Python\xe7\xa8\x8b\xe5\xba\x8f\xe8\xae\xbe\xe8\xae\xa1\xe4\xb8\x8e\xe8\xb4\xa2\xe7\xbb\x8f\xe6\x95\xb0\xe6\x8d\xae\xe6\x8c\x96\xe6\x8e\x98 &#8211; Dr. Feng Li</title>\n<meta name=\'robots\' content=\'max-image-preview:large\' />\n<link rel="alternate" type="application/rss+xml" title="Dr. Feng Li &raquo; Feed" href="https://feng.li/feed/" />\n<link rel="alternate" type="application/rss+xml" title="Dr. Feng Li &raquo; Comments Feed" href="https://feng.li/comments/feed/" />\n<script>\nwindow._wpemojiSettings = {"baseUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/14.0.0\\/72x72\\/","ext":".png","svgUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/14.0.0\\/svg\\/","svgExt":".svg","source":{"concatemoji":"https:\\/\\/feng.li\\/wordpress\\/wp-includes\\/js\\/wp-emoji-release.min.js?ver=6.

The raw bytes above are hard to read. Parsing them with `BeautifulSoup` gives a much cleaner result.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://feng.li/python/')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs)

<!DOCTYPE html>

<html class="respect-color-scheme-preference" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Python程序设计与财经数据挖掘 – Dr. Feng Li</title>
<meta content="max-image-preview:large" name="robots"/>
<link href="https://feng.li/feed/" rel="alternate" title="Dr. Feng Li » Feed" type="application/rss+xml"/>
<link href="https://feng.li/comments/feed/" rel="alternate" title="Dr. Feng Li » Comments Feed" type="application/rss+xml"/>
<script>
window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/feng.li\/wordpress\/wp-includes\/js\/wp-emoji-release.min.js?ver=6.1.1"}};
/*! This file is auto-generated */
!function(e,a,t){var n,r,o,i=a.createElement("canvas"),p=i.getContext&&i.getContext("2d");function s(e,t){var a=String.fromCharCode,e=(p.clearRe
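Much of the difference between the two outputs above comes from decoding: `html.read()` returns `bytes`, so non-ASCII characters appear as `\x..` escapes and HTML entities stay unexpanded. As a small sketch (the `raw` value is copied from the title in the first output; the variable names are illustrative), the standard library can already make such a fragment readable:

```python
import html

# A slice of the raw response shown above: bytes with UTF-8 escapes
raw = (b'<title>Python\xe7\xa8\x8b\xe5\xba\x8f\xe8\xae\xbe\xe8\xae\xa1'
       b'\xe4\xb8\x8e\xe8\xb4\xa2\xe7\xbb\x8f\xe6\x95\xb0\xe6\x8d\xae'
       b'\xe6\x8c\x96\xe6\x8e\x98 &#8211; Dr. Feng Li</title>')

text = raw.decode('utf-8')       # bytes -> str
readable = html.unescape(text)   # expand entities such as &#8211;
print(readable)
```

`BeautifulSoup` performs both steps internally, which is why the title in the parsed output is readable.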

## The complete case

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://feng.li/python/')
bs = BeautifulSoup(html.read(), 'html.parser')
nameList = bs.find_all('div', {'class': 'entry-content'})  # find_all is the current name of findAll
for name in nameList:
    print(name.get_text())




Contents1 课程简介2 授课教师3 参考书4 讲课视频5 第一部分：Python程序设计6 第二部分：Python财经应用7 第三部分：财经数据挖掘
课程简介
Python程序设计是面向财经和统计专业学生开设的一门以应用为主的编程课程，该课程最早由李丰老师在中央财经大学以公开讲座的形式开设，后成为中央财经大学金融、会计和MBA项目的核心课程。 本课程分为三部分，第一部分为Python程序设计，第二部分为Python财经应用，第三部分为基于Python的财经数据挖掘。
授课教师


李丰博士现任中央财经大学统计与数学学院副院长、副教授、硕士生导师。博士毕业于瑞典斯德哥尔摩大学，研究领域包括贝叶斯统计学，预测方法，大数据分布式学习等。曾获瑞典皇家统计学会 Cramér 奖，国际贝叶斯学会青年奖励基金， 第二届全国高校经管类实验教学案例大赛二等奖。主持和参与多项国家自然科学基金项目。
李丰博士最新研究成果发表在统计期刊 Journal of Computational and Graphical Statistics，Journal of Business and Economic Statistics, Statistical Analysis and Data Mining，经济与管理学期刊 International Journal of Forecasting，Journal of Business Research，运筹学期刊European Journal of Operational Research, Journal of the Operational Research Society，人工智能期刊 Expert Systems with Applications，医学期刊 BMJ Open, Journal of Surgical Research, Journal of Affective Disorders等。同时著有 Bayesian Modeling of Conditional Densities，《大数据分布式计算与案例》和《统计计算》。


参考书
Python可以被广泛地使用在财经领域，以下列出一些零基础书目。
类别书名中译本数据分析Python for Data Analysis (by Wes McKinney)利用P
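The class filter used in the complete case can be explored offline. The sketch below runs on a small made-up HTML string (the markup and class names other than `entry-content` are assumptions for illustration). It compares three ways of selecting the same elements with `BeautifulSoup`.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; only the two "entry-content" blocks are wanted
html = """
<html><body>
  <div class="entry-content"><p>First entry</p></div>
  <div class="sidebar"><p>Not wanted</p></div>
  <div class="entry-content"><p>Second entry</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("div", {"class": "entry-content"})   # attribute dictionary
by_keyword = soup.find_all("div", class_="entry-content")    # class_ keyword
by_css = soup.select("div.entry-content")                    # CSS selector

texts = [div.get_text(strip=True) for div in by_dict]
print(texts)
```

All three calls return the same two `div` elements here. Which form to use is mostly a matter of taste.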

## Web Scraping with `BeautifulSoup`

Let's start with this page

https://finance.eastmoney.com/a/cgnjj_1.html

In [4]:
import requests
from bs4 import BeautifulSoup

page = 1  # We start with a single page

# Set a browser-like User-Agent so the server treats the request as coming from a person, not a machine
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'}

href = 'https://finance.eastmoney.com/a/cgnjj_%s.html' % page
html = requests.get(href, headers=headers)

Notes:

- Adding a browser-like `User-Agent` header may convince the server that the connection comes from a human visitor rather than from an attack.
- You could visit https://ifconfig.me/ to quickly find your browser's user agent.
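The same idea applies to the `urllib` calls used in the first scraper. As a sketch (no request is actually sent here), a `Request` object can carry the header and later be passed to `urlopen`:

```python
from urllib.request import Request

user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'
req = Request('https://feng.li/python/', headers={'User-Agent': user_agent})

# urllib stores header names in capitalized form, i.e. 'User-agent'
print(req.get_header('User-agent'))

# Passing `req` to urllib.request.urlopen(req) would send the request with this header.
```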

In [5]:
# Check the request headers
html.request.headers

{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [6]:
# Check the HTTP status code of the response
html.status_code

200
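A status code of 200 means the request succeeded. Other codes (for example 403 or 404) usually mean there is nothing useful to parse. The helper below is a hypothetical sketch, not part of `requests`. It uses the standard library's `http.HTTPStatus` to turn a code into a readable phrase.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (ok, phrase) for an HTTP status code (illustrative helper)."""
    ok = 200 <= code < 300          # 2xx codes signal success
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:              # code not known to the standard library
        phrase = "Unknown"
    return ok, phrase


for code in (200, 403, 404):
    print(code, describe_status(code))
```

With `requests`, calling `html.raise_for_status()` on the response is a common alternative: it raises an exception for 4xx and 5xx responses.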

In [7]:
# Parsing html
soup = BeautifulSoup(html.content, 'html.parser')
soup


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--published at 2023/1/12 15:44:59 by finance.eastmoney.com WG NEWS 240-->
<html lang="en">
<head>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="webkit" name="renderer"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>国内经济 _ 东方财富网</title>
<link href="emres/css/newslistbefore.css?v=2023.01.12.15.24.06" rel="stylesheet"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon">
<base target="_blank"/>
</link></head>
<body style="margin-top:43px">
<div style="background-color:#fff;width:1000px;margin:0 auto;">
<img id="weixin-share" src="//cmsjs.eastmoney.com/common/weixin-share.png" style="position: absolute;width: 0;height: 0;left: -1000px;z-index: -1;"/>
<div class="page">
<!--头部-->
<div class="clearfix" id="header">
<!-- banner -->
<div style="float:left;height:60px;">
<iframe class="lyad" f

In [8]:
# Note that some parts of the page are much more difficult to scrape because their source code is hidden.
# The contents may change from time to time.
# Let's start with a simple one. The code below worked on Jan 12, 2023.

divs = soup.find_all('div', {"class": "title"})  # the "评论精华" (Comment Highlights) section
divs

[<div class="title">
 <a href="http://finance.eastmoney.com/a/202301122612004230.html" target="_blank">思勰投资总经理吴家麒：2023年股票和期货投资展望</a>
 </div>,
 <div class="title">
 <a href="http://finance.eastmoney.com/a/202301122611528897.html" target="_blank">中信证券：成飞拟被注入中航电测 国企混改登上新高峰</a>
 </div>,
 <div class="title">
 <a href="http://finance.eastmoney.com/a/202301122611518334.html" target="_blank">光大证券：12月对公中长期贷款为何实现了高增？</a>
 </div>,
 <div class="title">
 <a href="http://hk.eastmoney.com/a/202301122611513618.html" target="_blank">中信建投：创新药行业迎来多重拐点 看好头部创新药公司（名单）</a>
 </div>,
 <div class="title">
 <a href="http://finance.eastmoney.com/a/202301122611319281.html" target="_blank">国泰君安：下游需求高增长 芳纶涂覆隔膜打开空间</a>
 </div>,
 <div class="title">
 <a href="http://finance.eastmoney.com/a/202301122612023768.html" target="_blank">A股三大指数缩量震荡 北向资金净买入近百亿元</a>
 </div>,
 <div class="title">
 <a href="http://futures.eastmoney.com/a/202301122611958758.html" target="_blank">华尔街开年最重要一天：今晚美国CPI可能惊现环比负增长？</a>
 </div>,
 <div clas

In [9]:
# Let's loop over the results and save all the information into a csv file.
# We use a different delimiter, "\001", instead of the commonly used ones (, or ;).

import csv

# newline='' is what the csv module expects; the with block closes the file for us
with open("data/topCommentedNews.csv", 'w', newline='') as newsData:
    csv_writer = csv.writer(newsData, delimiter="\001")
    for div in divs:
        # News title
        titleinfo = div.find('a')
        title = titleinfo.get_text().strip()
        # News url
        url = titleinfo['href']

        print([title, url])
        csv_writer.writerow([title, url])

['思勰投资总经理吴家麒：2023年股票和期货投资展望', 'http://finance.eastmoney.com/a/202301122612004230.html']
['中信证券：成飞拟被注入中航电测 国企混改登上新高峰', 'http://finance.eastmoney.com/a/202301122611528897.html']
['光大证券：12月对公中长期贷款为何实现了高增？', 'http://finance.eastmoney.com/a/202301122611518334.html']
['中信建投：创新药行业迎来多重拐点 看好头部创新药公司（名单）', 'http://hk.eastmoney.com/a/202301122611513618.html']
['国泰君安：下游需求高增长 芳纶涂覆隔膜打开空间', 'http://finance.eastmoney.com/a/202301122611319281.html']
['A股三大指数缩量震荡 北向资金净买入近百亿元', 'http://finance.eastmoney.com/a/202301122612023768.html']
['华尔街开年最重要一天：今晚美国CPI可能惊现环比负增长？', 'http://futures.eastmoney.com/a/202301122611958758.html']
['工信部力挺5G和千兆光网建设 业绩猛增的概念股来了', 'http://finance.eastmoney.com/a/202301122611942169.html']
['上海警方通报王某某等打人被行政处罚 权威人士：王某某系王思聪', 'http://finance.eastmoney.com/a/202301122612024096.html']
['中汽协：若一季度汽车销量下滑较严重 相关部门会考虑相关政策的延续', 'http://finance.eastmoney.com/a/202301122612039557.html']
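The point of the unusual `"\001"` delimiter is that news titles may themselves contain commas or semicolons. The sketch below uses made-up rows and an in-memory buffer instead of a file. It checks that such rows survive a write and read round trip with the `csv` module.

```python
import csv
import io

# Hypothetical rows whose titles contain the usual delimiters
rows = [
    ["Title with, a comma", "http://example.com/a/1.html"],
    ["Another title; with a semicolon", "http://example.com/a/2.html"],
]

buffer = io.StringIO()                      # stands in for the csv file
writer = csv.writer(buffer, delimiter="\001")
writer.writerows(rows)

buffer.seek(0)                              # rewind before reading back
reader = csv.reader(buffer, delimiter="\001")
restored = [row for row in reader]
print(restored == rows)
```

Reading the file back with `csv.reader` and the same delimiter is a slightly safer counterpart to splitting each line by hand.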


- Let's get the full information from one of the above URLs. After the loop, the variable `url` still holds the last one.

In [10]:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.content, 'html.parser')

In [11]:
soup 


<!DOCTYPE html>

<!--published at 2023/1/12 15:45:12 by finance.eastmoney.com WG NEWS 239-->
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<meta content="webkit" name="renderer"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>中汽协：若一季度汽车销量下滑较严重 相关部门会考虑相关政策的延续 _ 东方财富网</title>
<meta content="相关,政策,延续,汽车销量,部门,季度,下滑,严重,考虑,表示" name="keywords">
<meta content="【中汽协：若一季度汽车销量下滑较严重 相关部门会考虑相关政策的延续】中汽协副秘书长陈士华今日表示，燃油车购置税减半政策在去年6月份出台以来，对汽车市场的促进作用非常大，行业呼吁购置税减半政策能够在2023年延续。中国汽车工业协会副总工程师许海东则表示，如果一季度汽车销量下滑比较严重，相关部门会考虑政策的延续。" name="description"/>
<meta content="format=html5; url=https://wap.eastmoney.com/a/202301122612039557.html" name="mobile-agent"/>
<meta content="noindex, nofollow" name="robots"/>
<script type="text/javascript">
        var __WAPURL = "https://wap.eastmoney.com/a/202301122612039557.html";
        var _NewsTag = '';
    

- Let's get the time stamp and the news source

In [12]:
infos = soup.find('div', {'class': 'infos'})
time_source = infos.find_all('div', {"class": "item"})
time = time_source[0].get_text()
source = time_source[1].get_text().replace("\n", "")
print(time)
print(source)

2023年01月12日 15:20
                            作者：徐昊                        


- Contents part

In [13]:
# Contents part
divs = soup.find('div', {"class": "zwinfos"})
divs

<div class="zwinfos">
<!-- 摘要 -->
<div class="abstract">
<div class="tit">摘要</div>
<div class="txt">
                                    【中汽协：若一季度汽车销量下滑较严重 相关部门会考虑相关政策的延续】中汽协副秘书长陈士华今日表示，燃油车购置税减半政策在去年6月份出台以来，对汽车市场的促进作用非常大，行业呼吁购置税减半政策能够在2023年延续。中国汽车工业协会副总工程师许海东则表示，如果一季度汽车销量下滑比较严重，相关部门会考虑政策的延续。
                                </div>
<div class="keywords">
</div>
</div>
<!-- 文本区域 -->
<div class="txtinfos" id="ContentBody" style="margin-top:0;">
<!--浪客直播-->
<!--文章主体-->
<p>　　中汽协副秘书长陈士华今日表示，燃油车购置税减半政策在去年6月份出台以来，对汽车市场的促进作用非常大，行业呼吁购置税减半政策能够在2023年延续。中国汽车工业协会副总工程师许海东则表示，如果一季度汽车销量下滑比较严重，相关部门会考虑政策的延续。</p><h3 class="emh3">　　<strong>相关报道</strong></h3><p>　　<a href="https://finance.eastmoney.com/a/202301122612046794.html" target="_blank">2022年汽车销量同比增长2.1% 预计今年一季度终端市场压力较大</a></p><p>　　中国汽车工业协会12日最新数据显示，2022年12月，汽车产量环比微降，销量小幅增长，同比均呈现下降。当月，汽车产销分别达到238.3万辆和255.6万辆，产量环比下降0.3%，销量环比增长9.7%，同比分别下降18.2%和8.4%。2022年，汽车产销分别完成2702.1万辆和2686.4万辆，同比增长3.4%和2.1%，全年实现小幅增长。</p><p>　　中汽协副秘书长陈士华就12月销量情况分析称，随着疫情防控优化调整，燃油车购置税减半政

- Retrieve the abstract from the full text

In [14]:
abstract = divs.find('div', {"class": "txt"}).get_text().replace("\n", "").replace("\r", "").replace(" ", "")
abstract

'【中汽协：若一季度汽车销量下滑较严重相关部门会考虑相关政策的延续】中汽协副秘书长陈士华今日表示，燃油车购置税减半政策在去年6月份出台以来，对汽车市场的促进作用非常大，行业呼吁购置税减半政策能够在2023年延续。中国汽车工业协会副总工程师许海东则表示，如果一季度汽车销量下滑比较严重，相关部门会考虑政策的延续。'
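The chained `.replace()` calls above delete every space, including the ones that separate words inside the headline. If that is not wanted, one possible alternative is to collapse each run of whitespace into a single space. The sketch below works on a made-up string.

```python
import re

# Hypothetical text with padding, line breaks and meaningful inner spaces
raw = "\n      【Headline one Headline two】Body text, first sentence.\r\n   "

# Same approach as the chained .replace() calls: every space disappears
squeezed = raw.replace("\n", "").replace("\r", "").replace(" ", "")

# Alternative: collapse whitespace runs to one space, then trim both ends
normalized = re.sub(r"\s+", " ", raw).strip()

print(squeezed)
print(normalized)
```

For Chinese text without inner spaces both approaches give the same result; the difference only matters when spaces carry meaning.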

- Join all paragraphs of the full text into one single string

In [15]:
content = ''
paras = divs.find_all('p')
for p in paras:
    ptext = p.get_text().strip().replace("\n", "")
    content += ptext
print(content)

中汽协副秘书长陈士华今日表示，燃油车购置税减半政策在去年6月份出台以来，对汽车市场的促进作用非常大，行业呼吁购置税减半政策能够在2023年延续。中国汽车工业协会副总工程师许海东则表示，如果一季度汽车销量下滑比较严重，相关部门会考虑政策的延续。2022年汽车销量同比增长2.1% 预计今年一季度终端市场压力较大中国汽车工业协会12日最新数据显示，2022年12月，汽车产量环比微降，销量小幅增长，同比均呈现下降。当月，汽车产销分别达到238.3万辆和255.6万辆，产量环比下降0.3%，销量环比增长9.7%，同比分别下降18.2%和8.4%。2022年，汽车产销分别完成2702.1万辆和2686.4万辆，同比增长3.4%和2.1%，全年实现小幅增长。中汽协副秘书长陈士华就12月销量情况分析称，随着疫情防控优化调整，燃油车购置税减半政策和新能源汽车补贴政策年底退出，厂商优惠幅度加大，叠加春节假期临近，12月终端市场“翘尾现象”明显。由于12月的回补效应，提前透支了部分需求，预计一季度终端市场压力较大，销量可能会出现明显下降。对此，他表示，为进一步激发市场主体和消费活力，呼吁能够继续出台购置税减半等促汽车消费政策，助力汽车产业稳定增长。总结全年发展，陈士华表示，2022年，尽管受疫情散发频发、芯片结构性短缺、动力电池原材料价格高位运行、局部地缘政治冲突等诸多不利因素冲击，但在购置税减半等一系列稳增长、促消费政策的有效拉动下，在全行业企业共同努力下，中国汽车市场在逆境下整体复苏向好，实现正增长，展现出强大的发展韧性。具体来看，乘用车在稳增长、促消费等政策拉动下，实现较快增长，为全年小幅增长贡献重要力量；商用车处于叠加因素的运行低位；新能源汽车持续爆发式增长，全年销量超680万辆，市场占有率提升至25.6%，逐步进入全面市场化拓展期，迎来新的发展和增长阶段；汽车出口继续保持较高水平，屡创月度历史新高，自8月份以来月均出口量超过30万辆，全年出口突破300万辆，有效拉动行业整体增长；中国品牌表现亮眼，紧抓新能源、智能网联转型机遇全面向上，产品竞争力不断提升，其中乘用车市场份额接近50%，为近年新高。具体数据显示，2022年12月，乘用车产销分别完成212.5万辆和226.5万辆，产量环比下降1.4%，销量环比增长9%，同比分别下降16.1%和6.7%。在乘用车主要品种中，与上月相比

- Now we can wrap the steps above in a function and reuse it directly

In [16]:
import requests
from bs4 import BeautifulSoup


def get_body(href):
    """Function to retrieve the details of a news article given its url.
    Args:
        href: url of the news to be crawled.
    Returns:
        A list [time, source, abstract, content] for the crawled news.

    """
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'}
    html = requests.get(href, headers=headers)
    soup = BeautifulSoup(html.content, 'html.parser')

    # Time and source
    infos = soup.find('div', {'class': 'infos'})
    time_source = infos.find_all('div', {"class": "item"})
    time = time_source[0].get_text()
    source = time_source[1].get_text().replace("\n", "").replace("\r", "").replace(" ", "")

    divs = soup.find('div', {"class": "zwinfos"})
    # Abstract
    abstract = divs.find('div', {"class": "txt"}).get_text().replace("\n", "").replace("\r", "").replace(" ", "")

    # Full text: join all paragraphs into one string
    content = ''
    paras = divs.find_all('p')
    for p in paras:
        ptext = p.get_text().strip().replace("\n", "")
        content += ptext

    # Return a list
    return [time, source, abstract, content]

# Let's run the function with the previously obtained csv file.
if __name__ == "__main__":
    # Getting and printing content for each url in the crawled web list pages
    with open("data/topCommentedNews.csv") as f:
        for line in f:
            title, href = line.strip().split('\001')
            # Printing progress onto console
            print('Scraping ' + href)
            full_info = get_body(href)
            print([title, href] + full_info)

Scraping http://finance.eastmoney.com/a/202301122612004230.html
['思勰投资总经理吴家麒：2023年股票和期货投资展望', 'http://finance.eastmoney.com/a/202301122612004230.html', '2023年01月12日 14:28', '来源：东方财富网', '打开微信，点击底部的“发现”使用“扫一扫”即可将网页分享至朋友圈', '2023开年之际，东方财富特邀业内大咖齐聚2023年度投资策略会，把脉2023年投资机会。此次策略会将于2023年1月10日-1月12日隆重举行，十五场精彩直播，等您来看。1月12日上午，思勰投资总经理吴家麒在2023年度投资策略会上发表演讲，演讲的题目是《2023年股票和期货投资展望》。嘉宾简介：吴家麒，思勰投资的创始合伙人兼总经理，拥有12年在中国从事量化投资管理的经验。吴先生拥有广泛的量化策略行业的各方面经验，包括各类量化策略开发、金融数据的生产及应用以及量化产品结构设计。在创立思勰投资前，曾在券商研究所负责金融工程研究。在此之前，曾在多家私募基金及券商自营工作。以下为演讲摘要：股票市场回顾与展望期货市场回顾与展望打开微信，点击底部的“发现”使用“扫一扫”即可将网页分享至朋友圈扫描二维码关注东方财富官网微信']
Scraping http://finance.eastmoney.com/a/202301122611528897.html
['中信证券：成飞拟被注入中航电测 国企混改登上新高峰', 'http://finance.eastmoney.com/a/202301122611528897.html', '2023年01月12日 08:29', '来源：证券时报·e公司', '【中信证券：成飞拟被注入中航电测国企混改登上新高峰】中信证券在研报中表示，1月11日晚中航电测发布公告，正在筹划发行股份向航空工业集团购买成飞集团100%股权。成飞集团是我国航空武器装备研制生产和出口主要基地、民机零部件重要制造商，通过本次重组战斗机龙头有望实现整体上市。预计2023年国企改革将继续加速推进提振板块情绪。从基本面角度看，军工行业有计划属性，免疫宏观经济波动，且行业正处于“十四五”黄金发展期，具备长期配置价值。当前时点无论是从市场情

['国泰君安：下游需求高增长 芳纶涂覆隔膜打开空间', 'http://finance.eastmoney.com/a/202301122611319281.html', '2023年01月12日 07:39', '来源：证券时报', '【国泰君安：下游需求高增长芳纶涂覆隔膜打开空间】国泰君安研报指出，芳纶为三大人造高性能纤维之一，受下游安防，5G建设驱动需求持续高增长，汽车、航空结构件、过滤等领域应用不断拓展。芳纶作为性能最好的锂电隔膜涂覆材料从0到1向上空间弹性大。推荐具备一体化产业链，产能规模具备优势，向多元化应用不断开拓，有望实现进口替代的芳纶龙头。推荐标的：中化国际，受益标的：泰和新材。', '国泰君安研报指出，芳纶为三大人造高性能纤维之一，受下游安防，5G建设驱动需求持续高增长，汽车、航空结构件、过滤等领域应用不断拓展。芳纶作为性能最好的锂电隔膜涂覆材料从0到1向上空间弹性大。推荐具备一体化产业链，产能规模具备优势，向多元化应用不断开拓，有望实现进口替代的芳纶龙头。推荐标的：中化国际，受益标的：泰和新材。国君石化 | 下游需求高增长，芳纶涂覆隔膜打开空间投资建议：芳纶为三大人造高性能纤维之一，受下游安防，5G建设驱动需求持续高增长，汽车、航空结构件、过滤等领域应用不断拓展。芳纶作为性能最好的锂电隔膜涂覆材料从0到1向上空间弹性大。推荐具备一体化产业链，产能规模具备优势，向多元化应用不断开拓，有望实现进口替代的芳纶龙头。推荐标的：中化国际，受益标的：泰和新材。芳纶为三大人造高性能纤维之一，部分性能优势显著：芳纶与碳纤维、超高分子量聚乙烯并称为三大人造高性能纤维。具备高强度、高模量、高耐磨性、高耐温性等特点。芳纶在耐温性能与断裂伸长度方面分别较超高分子量聚乙烯与碳纤维有优势。芳纶在结构增强、安全防护、耐磨等方面有明显优势。多元化因素驱动芳纶需求高增长，新能源应用打开市场空间：根据帝人预测，受地缘争端加剧，个人以及企业需求增加驱动，橡胶增强领域的增长，全球芳纶市场规模将从2021年的36亿美元增加到2025年的53亿美元。对位芳纶增速快于间位芳纶。此外锂电池安全性，寿命，性能等要素重要性不断提升背景下芳纶作为性能最好的锂电隔膜涂覆材料有望替代传统陶瓷或有机涂覆，渗透率有望大幅提升。保守预测2025年芳纶涂覆隔膜市场空间44亿元，且在涂覆成本进一步下降情况下向上

['上海警方通报王某某等打人被行政处罚 权威人士：王某某系王思聪', 'http://finance.eastmoney.com/a/202301122612024096.html', '2023年01月12日 14:59', '作者：甄珺茹王昆鹏', '【上海警方通报王某某等打人被行政处罚权威人士：王某某系王思聪】1月12日，上海静安警方通报，11日4时许接报南京西路一商务楼门口有人被打。经查，王某某等人误以为在路边候车的陈某某对其拍照，遂要求陈某某不要拍摄，陈某某称未拍摄，双方发生争吵。王某某等人对陈某某殴打。经司法鉴定，陈某某综合评定为轻微伤。警方对存在殴打他人违法行为的王某某、孙某某作出行政拘留7日。因王某某等提请行政复议，警方对四人暂缓执行行政拘留。12日，记者从权威信源获悉，打人者王某某系王思聪。', '1月12日，上海静安警方通报，11日4时许接报南京西路一商务楼门口有人被打。经查，王某某等人误以为在路边候车的陈某某对其拍照，遂要求陈某某不要拍摄，陈某某称未拍摄，双方发生争吵。王某某等人对陈某某殴打。经司法鉴定，陈某某综合评定为轻微伤。警方对存在殴打他人违法行为的王某某、孙某某作出行政拘留7日。因王某某等提请行政复议，警方对四人暂缓执行行政拘留。12日，记者从权威信源获悉，打人者王某某系王思聪。王思聪在上海打人？刚刚，警方通报王思聪又出大事？上海警方通报王某某怀疑被偷拍打人上海静安警方通报：1月 11 日4时 40 分，静安公安分局接报警称，南京西路一商务楼门口有人被打，民警第一时间到场处置。经查，王某某(男，34 岁)、孙某某(男，28 岁)、魏某某(男，38 岁)、余某某(男，39 岁)等人误以为在路边候车的陈某某对其拍照，遂要求陈某某不要拍摄，陈某某称未拍摄，双方发生争吵。随后，王某某、孙某某先后挥拳击打陈某某面部，致陈某某鼻部受伤并倒地。魏某某、余某某也对陈某某进行了殴打。经司法鉴定，陈某某左侧鼻骨骨折，面部多处挫擦伤及挫伤，综合评定为轻微伤。目前，警方根据《治安管理处罚法》对存在殴打他人违法行为的王某某、孙某某作出行政拘留7日，并处罚款500 元的处罚决定；对存在殴打他人违法行为的魏某某、余某某作出行政拘留 5日，并处罚款 500 元的处罚决定。现因王某某等四人对公安机关作出的行政处罚决定提请行政复议，公安机关依法对王某某等四人暂缓执行行政拘
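When looping over many URLs, one page with a different layout can stop the whole run: `soup.find` returns `None` if an element is missing, and the next method call then raises `AttributeError`. The sketch below shows one possible way to keep going and to pause between requests. `crawl_all` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for `get_body` so that the example runs without network access.

```python
import time


def crawl_all(hrefs, fetch, delay=1.0):
    """Call fetch(href) for every url, pausing between calls and collecting failures."""
    results, failures = {}, {}
    for i, href in enumerate(hrefs):
        if i > 0:
            time.sleep(delay)            # be polite: wait between requests
        try:
            results[href] = fetch(href)
        except Exception as exc:         # record the problem and continue
            failures[href] = repr(exc)
    return results, failures


def fake_fetch(href):
    """Stand-in for get_body: fails on one url, succeeds on the others."""
    if href.endswith("missing.html"):
        raise AttributeError("'NoneType' object has no attribute 'find_all'")
    return ["time", "source", "abstract", "content"]


results, failures = crawl_all(
    ["http://example.com/ok.html", "http://example.com/missing.html"],
    fetch=fake_fetch,
    delay=0,                             # no waiting needed for the offline demo
)
print(list(results), list(failures))
```

In a real run one would pass `get_body` as `fetch` and keep a delay of a second or more.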

## Web Crawling with `Scrapy`*

One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: find all links on a page, evaluate the difference between internal and external links, go to new pages. These basic patterns are useful to know and to be able to write from scratch, but the Scrapy library handles many of these details for you.
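To make one of these recurring tasks concrete, the sketch below separates internal from external links using only `urllib.parse`. The function name `classify_links` and the sample links are illustrative; Scrapy ships its own link-extraction tools, which are not shown here.

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url, hrefs):
    """Split hrefs found on page_url into internal and external absolute urls."""
    base_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)      # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                            # skip mailto:, javascript:, ...
        if parsed.netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = classify_links(
    "https://en.wikipedia.org/wiki/Monty_Python",
    [
        "/wiki/Python_(programming_language)",
        "https://www.python.org/",
        "#History",
        "mailto:someone@example.com",
    ],
)
print(internal)
print(external)
```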

###  Installing Scrapy

- After Anaconda is installed, you can install Scrapy by using this command:
   
      conda install -c conda-forge scrapy

### Dealing with Different Website Layouts

Fortunately, in most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are pre-selected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.

The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.
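One way to organize such per-site parsers is sketched below: each parser takes a URL and a `BeautifulSoup` object and returns the same kind of Python object. The `Content` class, the two parser functions, the domains, and the tag and class names are all assumptions made up for this illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

from bs4 import BeautifulSoup


@dataclass
class Content:
    """The thing that was scraped, in a site-independent form."""
    url: str
    title: str
    body: str


def parse_site_a(url, soup):
    # Hypothetical layout: title in <h1>, text in <div class="article-body">
    title = soup.find("h1").get_text(strip=True)
    body = soup.find("div", {"class": "article-body"}).get_text(strip=True)
    return Content(url, title, body)


def parse_site_b(url, soup):
    # Hypothetical layout: title in <h2 class="post-title">, text in <div class="post">
    title = soup.find("h2", {"class": "post-title"}).get_text(strip=True)
    body = soup.find("div", {"class": "post"}).get_text(strip=True)
    return Content(url, title, body)


# One parser per pre-selected website, looked up by domain
PARSERS = {"news.example.com": parse_site_a, "blog.example.org": parse_site_b}


def parse_page(url, html):
    parser = PARSERS[urlparse(url).netloc]
    return parser(url, BeautifulSoup(html, "html.parser"))


page = parse_page(
    "https://news.example.com/a/1.html",
    '<h1>Headline</h1><div class="article-body">Story text.</div>',
)
print(page)
```

Adding a new website then only requires one more parser function and one more entry in `PARSERS`.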


## Initializing a New Spider

To create a new Scrapy project in the current directory, run the following from the **command line (NOT THE PYTHON PROMPT)**:
```
    scrapy startproject wikiSpider
```    
    
This creates a new subdirectory named wikiSpider in the directory where the command was run. Inside this directory is the following file structure:

- scrapy.cfg
- wikiSpider
  - spiders
     - `__init__.py`
  - items.py
  - middlewares.py
  - pipelines.py
  - settings.py
  - `__init__.py`
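Among these files, settings.py holds the project-wide configuration. The fragment below is a sketch of a few entries one might set for polite crawling. The setting names are standard Scrapy settings, but every value shown is an assumption chosen for illustration, not the content of the generated file.

```python
# settings.py (illustrative values only)

BOT_NAME = "wikiSpider"

# Identify the crawler; the contact address here is a placeholder
USER_AGENT = "wikiSpider (+https://example.com/contact)"

# Respect the rules a site publishes in robots.txt
ROBOTSTXT_OBEY = True

# Wait one second between requests to the same site
DOWNLOAD_DELAY = 1

# Limit parallel requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Write exported files in UTF-8 so non-ASCII text stays readable
FEED_EXPORT_ENCODING = "utf-8"
```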

### Generate some spiders with templates from the command line

    scrapy genspider example example.com 
    scrapy genspider example2 example.com 
    scrapy genspider example3 example2.com 

### Writing a Simple Scraper

To create a crawler, you will add a new file inside the spiders directory at wikiSpider/wikiSpider/spiders/article.py. In your newly created **article.py** file, write the following:

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    # The name used to refer to this spider on the command line
    name = 'article'

    def start_requests(self):
        # Pages to download; each response is handed to self.parse
        urls = [
            'http://en.wikipedia.org/wiki/Python_%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        # Extract the text of the first <h1> element with a CSS selector
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
```

### Run this article spider

You can run this article spider by navigating to the wikiSpider/wikiSpider/spiders directory, where article.py was saved, and running from the command line:

    scrapy runspider article.py
        
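### Run your project from the project root directory

Here `table` stands for the `name` attribute of the spider to run, `-o` writes the scraped items to a file, and `--logfile` sends the log to a file:

    scrapy crawl table -o table.csv  --logfile table.log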
    

### Scrapy Shell

To explore a page interactively, start the Scrapy shell from the command line:

```bash
scrapy shell "http://en.wikipedia.org/wiki/Python_%28programming_language%29"
```

# Lab 

Use the `scrapy` framework to implement the scraper we studied with `BeautifulSoup`.