Python study log #1 Scrapy

Install Scrapy

pip install scrapy

Observe target website structure

Example

This time I want to download the pictures from a Weibo page.

The page structure looks like this:

<html>
<head>...</head>
<body>
  ...
  <div class="card-wrap">
    <div class="card-main">
      <header class="weibo-top m-box m-avatar-box">
        <a class="m-img-box">
          <img src="https://tvax2.sinaimg.cn/crop.0.0.1125.1125.180/9f599f58ly8fpxm8fy94qj20v90v9774.jpg">
        </a>
        <div class="m-box-col m-box-dir m-box-center">
          <div class="m-text-box">
            <a>
              <h3 class="m-text-cut">
                {{ weiboAuthorName }}
                <i class="m-icon m-icon-vipl2"></i>
              </h3>
            </a>
            ...
      <!-- img content, our target -->
      <div class="weibo-media-wraps weibo-media media-b">
        <ul class="m-auto-list">
          <li class="m-auto-box">
            <div class="m-img-box m-imghold-square">
              <img src="https://wx4.sinaimg.cn/orj360/9f599f58ly1fuqfsg6pckj202e029t8q.jpg">
            </div>
          </li>
          <li class="m-auto-box">
            <div class="m-img-box m-imghold-square">
              <img src="https://wx3.sinaimg.cn/orj360/9f599f58ly1fuqfsptm6zj20c309dmy1.jpg">
            </div>
          </li>
          ...
        ...
      ...
    ...
  ...
  ...
  ...
</body>
</html>

Analysis

From this structure we can see that each picture is wrapped like this:

<li class="m-auto-box">
  <div class="m-img-box m-imghold-square">
    <img src="https://wx3.sinaimg.cn/orj360/9f599f58ly1fuqfsptm6zj20c309dmy1.jpg">
  </div>
</li>

So we can use the ‘m-imghold-square’ class to locate the img tags.

Then we read each img tag’s src attribute and save the picture to disk or to a database.
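The lookup described above can be sketched with only the Python standard library, so it runs without Scrapy installed. This is an illustration under assumptions, not the final spider: the names SquareImageParser, extract_image_urls and local_name are made up for this example, and SAMPLE is a trimmed copy of the structure shown earlier. Inside a Scrapy spider, the same lookup would normally be a CSS selector such as response.css("div.m-imghold-square img::attr(src)").getall().

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse


class SquareImageParser(HTMLParser):
    """Collect the src of every <img> nested in a div with class m-imghold-square."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside a matching <div>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs too, so the matching </div> is tracked correctly.
            if self.depth or "m-imghold-square" in classes:
                self.depth += 1
        elif tag == "img" and self.depth and attrs.get("src"):
            self.urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_image_urls(html):
    """Return the picture URLs found in the given HTML text."""
    parser = SquareImageParser()
    parser.feed(html)
    return parser.urls


def local_name(url):
    """Hypothetical helper: derive a file name for saving from the URL path."""
    return PurePosixPath(urlparse(url).path).name


# Trimmed sample: one avatar image (not a target) and two target pictures.
SAMPLE = """
<a class="m-img-box">
  <img src="https://tvax2.sinaimg.cn/crop.0.0.1125.1125.180/9f599f58ly8fpxm8fy94qj20v90v9774.jpg">
</a>
<ul class="m-auto-list">
  <li class="m-auto-box">
    <div class="m-img-box m-imghold-square">
      <img src="https://wx4.sinaimg.cn/orj360/9f599f58ly1fuqfsg6pckj202e029t8q.jpg">
    </div>
  </li>
  <li class="m-auto-box">
    <div class="m-img-box m-imghold-square">
      <img src="https://wx3.sinaimg.cn/orj360/9f599f58ly1fuqfsptm6zj20c309dmy1.jpg">
    </div>
  </li>
</ul>
"""

urls = extract_image_urls(SAMPLE)
names = [local_name(u) for u in urls]
```

The avatar image is skipped because its parent is an a tag with only the m-img-box class, which is why matching on m-imghold-square rather than m-img-box is the safer choice here.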

Coding

Init project

Initialize a Scrapy project named ‘tutorial’:

scrapy startproject tutorial

The generated project structure looks like this:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

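What each entry is for:

  • scrapy.cfg: the project’s configuration file
  • tutorial/: the project’s Python module, where your code lives
  • tutorial/items.py: the item definitions
  • tutorial/pipelines.py: the item pipelines
  • tutorial/settings.py: the project settings
  • tutorial/spiders/: the folder that holds the spiders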

Define Item

An Item is the container for the data you scrape.

Define one by subclassing scrapy.Item and declaring each attribute as a scrapy.Field:

import scrapy


class WeiboItem(scrapy.Item):
    # One field per piece of data we want to keep for each post
    title = scrapy.Field()
    content = scrapy.Field()
    pictures = scrapy.Field()