Web Crawler Final Project
Published: 2019-06-09


Source of this assignment:

The data saved earlier can be read back with pandas:

import pandas as pd

newsdf = pd.read_csv(r'F:\duym\gzccnews.csv')

I. Saving the scraped content to an sqlite3 database

import sqlite3

# Write the DataFrame into an sqlite3 table
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews', con=db)

# Read it back to check the contents
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews', con=db)

Saving to a MySQL database

import pandas as pd
import pymysql
from sqlalchemy import create_engine

# Connection string: user, passwd, host, and port are placeholders to fill in
conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')

# allnews is the list of scraped records built earlier in the assignment
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
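To verify the write, the news table can be read back through the same engine. A minimal sketch, assuming the placeholder connection string above has been filled in:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8")
# Pull the whole table back into a DataFrame
df_check = pd.read_sql('SELECT * FROM news', con=engine)
print(df_check.head())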

II. Comprehensive crawler project

1. Topic: I watched the film Green Book (《绿皮书》) and thought it was quite good, so I wanted to see how it was being received online and took the chance to analyze its reviews.

 

[Figure 1: screenshot of the web page]

2. Scraping target: the first 300 comments (the top-ranked ones).

 

[Figure 2: the scraped comment data]

3. Constraints on the scraped content

The content scraped here is publicly available on the open web and involves little private data, so there are few restrictions to worry about. Even so, I did not dare to crawl too much for fear of being detected and blocked, so requests are spaced out with random pauses (see the sketch below). The pages contain many images, but none were downloaded; the goal is simply to use the text data to analyze the film.
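The pacing idea, pulled out of the core code below as a standalone sketch (the 5-second upper bound matches the scraper; the helper name is mine):

import time
import random

def polite_sleep(max_seconds=5):
    # Wait a random fraction of max_seconds so requests don't hit the server at a fixed rhythm
    time.sleep(random.random() * max_seconds)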

4. Core code

Code for scraping the comments:
import pandas
import requests
from bs4 import BeautifulSoup
import time
import random

def getHtml(url):
    # Session cookie and User-Agent copied verbatim from the browser so Douban serves the logged-in page
    cookies = {'PHPSESSID': 'Cookie: bid=7iHsqC-UoSo; ap_v=0,6.0; '
               '__utma=30149280.1922890802.1557152542.1557152542.1557152542.1; '
               '__utmc=30149280; __utmz=30149280.1557152542.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); '
               '__utma=223695111.1923787146.1557152542.1557152542.1557152542.1; '
               '__utmb=223695111.0.10.1557152542; __utmc=223695111; '
               '__utmz=223695111.1557152542.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); '
               '_pk_ses.100001.4cf6=*; push_noty_num=0; push_doumail_num=0; __utmt=1; '
               '__utmv=30149280.19600; ct=y; ll="118281"; '
               '__utmb=30149280.12.9.1557154640902; __yadk_uid=6FEHGUf1WakFoINiOARNsLcmmbwf3fRJ; '
               '_vwo_uuid_v2=DE694EB251BD96736CA7C8B8D85C2E9A7|9505affee4012ecfc57719004e3e5789; '
               '_pk_id.100001.4cf6=1f5148bca7bc0b13.1557152543.1.1557155093.1557152543.; '
               'dbcl2="196009385:lRmza0u0iAA"'}
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    req = requests.get(url, headers=headers, cookies=cookies, verify=False)
    req.encoding = 'utf8'
    soup = BeautifulSoup(req.text, "html.parser")
    return soup

def alist(soup):  # originally declared as alist(url), but it actually receives the parsed page
    comment = []
    for ping in soup.select('.comment-item'):
        pinglundict = {}
        pinglundict['user'] = ping.select('.comment-info')[0]('a')[0].text
        pinglundict['userUrl'] = ping.select('.comment-info')[0]('a')[0]['href']
        pinglundict['look'] = ping.select('.comment-info')[0]('span')[0].text      # "watched" status
        pinglundict['score'] = ping.select('.comment-info')[0]('span')[1]['title'] # rating label
        pinglundict['time'] = ping.select('.comment-time')[0]['title']
        pinglundict['pNum'] = ping.select('.votes')[0].text                        # upvote count
        pinglundict['pingjia'] = ping.select('.short')[0].text                     # the comment text
        comment.append(pinglundict)
    return comment

url = 'https://movie.douban.com/subject/27060077/comments?start={}&limit=20&sort=new_score&status=P'
comment = []
for i in range(15):  # 15 pages x 20 comments per page = 300 comments
    soup = getHtml(url.format(i * 20))
    comment.extend(alist(soup))
    time.sleep(random.random() * 5)  # random pause so requests are not sent at a fixed rate
    print(len(comment))

print('-------------------- scraped a total of', len(comment), 'comments --------------------')
print(comment)
pingtheking = pandas.DataFrame(comment)
pingtheking.to_csv('jia.csv', encoding='utf_8_sig')
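The CSV is written with encoding='utf_8_sig' (UTF-8 with a byte-order mark) so the Chinese text displays correctly when opened in Excel. A quick round-trip check on the output, assuming the jia.csv written above:

import pandas as pd

df = pd.read_csv('jia.csv', encoding='utf_8_sig')
print(df.shape)             # roughly (300, 8): 15 pages x 20 comments, 7 fields plus the index column
print(df.columns.tolist())  # user, userUrl, look, score, time, pNum, pingjia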

Code for counting word frequencies:

# coding=utf-8
# jieba provides Chinese word segmentation
import jieba
import pandas as pd

# Read all the scraped comments
f = open(r'jia.csv', 'r', encoding='utf8')
text = f.read()
f.close()
print(text)

# Strip punctuation and digits
ch = "《》\n:,,。、-!?0123456789"
for c in ch:
    text = text.replace(c, '')
print(text)

# Segment the text and count each word, skipping single-character tokens
newtext = jieba.lcut(text)
te = {}
for w in newtext:
    if len(w) == 1:
        continue
    else:
        te[w] = te.get(w, 0) + 1

tesort = list(te.items())
tesort.sort(key=lambda x: x[1], reverse=True)

# Print the TOP-20 most frequent words
for i in range(0, 20):
    print(tesort[i])
pd.DataFrame(tesort).to_csv('tongji.csv', encoding='utf-8')
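For comparison, collections.Counter expresses the same top-20 count more compactly. A sketch under the same assumptions (jia.csv exists, single-character tokens are dropped):

import jieba
from collections import Counter

with open(r'jia.csv', 'r', encoding='utf8') as f:
    text = f.read()

# Keep only tokens longer than one character, then take the 20 most frequent
words = [w for w in jieba.lcut(text) if len(w) > 1]
for word, count in Counter(words).most_common(20):
    print(word, count)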

Code for generating the word cloud:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba

# Read the comments and join the segmented words with spaces, as WordCloud expects
text_from_file_with_apath = open('jia.csv', encoding='utf-8').read()
wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)

# A Chinese font must be given via font_path, or the Chinese words render as empty boxes
my_wordcloud = WordCloud(background_color="white", width=1000, height=860,
                         font_path="C:\\Windows\\Fonts\\STFANGSO.ttf").generate(wl_space_split)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
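To keep the image rather than just display it, WordCloud can also write it straight to disk; one line appended after the block above (the filename is illustrative):

my_wordcloud.to_file('wordcloud.png')  # save the rendered cloud as a PNG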

5. Data statistics and analysis

From the comments we can see that many viewers brought up the film's treatment of racial discrimination; some came to it because of the Oscars; some felt the acting was excellent and full of detail; and some pointed out the clever inversion in the roles, with the white man as the driver serving the distinguished Black pianist. Overall I think the film is very successful: audiences picked up what it was trying to express, and the reviews are broadly positive.
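That "broadly positive" impression can be checked against the score column the scraper stored (Douban's rating labels such as 力荐 and 推荐). A sketch, assuming the jia.csv produced above:

import pandas as pd

df = pd.read_csv('jia.csv', encoding='utf_8_sig')
# Tally how many comments fall under each rating label
print(df['score'].value_counts())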

Reposted from: https://www.cnblogs.com/hekairui/p/10775703.html
