How can I write a Python script to scrape the annual reports of specified companies from Sina Finance (新浪财经)?

WBOY
Release: 2016-06-06 16:11:12

The deadline for my second-major graduation thesis in accounting is approaching. My topic is an empirical study of the relationship between food companies' accounting information and their stock prices, and I need to collect the annual reports of about 100 food companies for the last five years from Sina Finance. Collecting them by hand means taking each stock code from the CSRC's Q4 2014 industry classification of listed companies, typing it into the search box on the Sina Finance stock homepage, opening the company's page (e.g. 康达尔 (000048): quotes, news, financial data), clicking "公司年报" (company annual reports), and downloading the reports for the last five years.
The target companies are all those in categories 13, 14 and 15 of the Q4 2014 classification, which is more than 100 firms, so collecting everything manually would be quite a lot of work. Is there a way to write a Python script that does the above? (I took one university course on computational thinking taught in Python, so I have a little bit of Python background.)
Many thanks!

Replies:

Hi~ I'm here to answer~
Even though the OP has already solved the problem...
"Solved it a week after asking, using Excel Power Query + the Yahoo Finance API. I'll come back and update the question once this week's graduation-project work is done... thanks a lot anyway!"
I'll treat it as practice~ There are plenty of ways to solve this, and using an existing API is quite convenient, but I'll still try the clumsy way, following the OP's original approach.
As usual, I'll write, run and tweak as I go~
#newbie here, please go easy, always happy to learn from others
#start coding
Step one is naturally to collect the stock codes... I used an online PDF-to-DOC site and then copy-pasted the stock codes of categories 13, 14 and 15 into a text file. Like this...
[screenshot: the stock_num.txt file, one stock code per line]
Then we need Python to read the file line by line and store its contents in a list. Very simple.
# read the stock codes, one per line, into a list
f = open('stock_num.txt')
stock = []
for line in f.readlines():
    line = line.replace('\n', '')   # strip the trailing newline
    stock.append(line)
f.close()
print(stock)
The selenium module can then be used to script the whole manual click-through process.
It feels a bit like writing a 按键精灵 (auto-clicker) macro.
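To make that concrete, here is a minimal selenium sketch of that idea. The Sina Finance URL pattern and the "年度报告" link text are assumptions about how the report pages were laid out at the time, so the selectors need to be checked against the real pages first.

# Minimal sketch only: the money.finance.sina.com.cn URL pattern and the
# '年度报告' link text are assumptions and must be verified in a browser first.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

stock = ['000048']                     # or the full list read from stock_num.txt above

driver = webdriver.Chrome()            # or webdriver.Firefox()
for code in stock:
    # assumed URL of a company's annual-report bulletin page on Sina Finance
    url = ('http://money.finance.sina.com.cn/corp/go.php/vCB_Bulletin'
           '/stockid/%s/page_type/ndbg.phtml' % code)
    driver.get(url)
    time.sleep(2)                      # crude wait for the page to render
    # list the annual-report links found on the page
    for link in driver.find_elements(By.PARTIAL_LINK_TEXT, '年度报告'):
        print(code, link.text, link.get_attribute('href'))
driver.quit()

From there, the same find/click pattern can open each report and save it, which is exactly the "simulate the manual clicks" idea of this answer.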
That's it.

scrapy plus Chrome or Firefox and this is a matter of minutes.

I'd recommend grabbing the data from 东方财富网 (Eastmoney) instead, because it can be saved directly as Excel files, which makes later processing easier. The idea is as follows:
1. First get the stock codes and names of the listed companies you need. For this step, see @段晓晨's answer.
2. Work out the download link. Taking 康达尔 as an example, the annual-report page is soft-f9.eastmoney.com/s and the download link is an eastmoney.com page. Of the 8 digits at the end of the link, the first 6 are the stock code and the last two are 01 for Shanghai-listed companies (codes starting with 60) or 02 for Shenzhen-listed companies. Then a single loop can download all the data (see the sketch after the code below)!
3. Convert the downloaded xml files into xls files; code below:
1) Handling possible Chinese-encoding errors in the xml
import os

# 1) fix possible bad characters in the downloaded xml (garbled Chinese / stray '&'):
#    a literal '&nbsp;' entity is dropped, everything else is copied unchanged
def xml_Error_C(filename):
    fp_xml = open(filename, 'rb')      # binary mode so the relative seek() calls work
    fp_x = b''                         # the corrected content
    for i in range(os.path.getsize(filename)):
        a = fp_xml.read(1)
        if a == b'&':
            fp_xml.seek(-1, 1)         # step back to the '&'
            if fp_xml.read(6) == b'&nbsp;':
                continue               # skip the whole '&nbsp;' entity
            else:
                fp_xml.seek(-5, 1)     # not an entity: resume just after the '&'
        fp_x += a
    fp_xml.close()
    fp_xml = open(filename, 'wb')      # overwrite the file with the fixed bytes
    fp_xml.write(fp_x)
    fp_xml.flush()
    fp_xml.close()
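Once the encoding is fixed, the cleaned xml can be parsed (for instance with xml.etree.ElementTree) and its rows written to an .xls file with a library such as xlwt; the exact tag and field names depend on eastmoney's XML layout, so they have to be checked against a downloaded file.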
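For step 2, a download loop might look like the sketch below. The URL template is only a placeholder (the answer gives just the truncated addresses soft-f9.eastmoney.com/s and eastmoney.com), so the real link pattern has to be copied from the browser; the only part taken from the answer is the 6-digit-code-plus-01/02 suffix rule.

# Sketch of the download loop from step 2. DOWNLOAD_URL is a placeholder:
# copy a real report link from eastmoney in a browser and keep its trailing
# 8-digit identifier as the '%s' part.
import requests

DOWNLOAD_URL = 'http://soft-f9.eastmoney.com/...%s...'   # placeholder pattern

stock = ['000048']                       # the code list from step 1

def market_suffix(code):
    # 01 = Shanghai-listed (codes starting with 60), 02 = Shenzhen-listed
    return '01' if code.startswith('60') else '02'

for code in stock:
    ident = code + market_suffix(code)   # e.g. '00004802' for 康达尔
    resp = requests.get(DOWNLOAD_URL % ident)
    with open(ident + '.xml', 'wb') as fp:
        fp.write(resp.content)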
Try 神箭手云爬虫 (a cloud crawler service): it runs entirely in the cloud, is quick to write, and comes with data export/publishing and chart generation for analysis. A real weapon of the big-data era ( ̄▽ ̄)

Use tushare: tushare.waditu.com

Write a crawler with scrapy; it pulls resources blazingly fast!

If what you want is the "annual-report data" rather than the "annual reports" themselves, just pull the figures with Wind's Excel plugin and you can get whatever you want... The OP studies accounting, so the school must have a business school, and a business school must have a Wind terminal... half an hour in the department's computer room and you're done.
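On the tushare suggestion: a minimal sketch is below, assuming the classic (pre-Pro) tushare interface with get_report_data(year, quarter) and its 'code' column still behaves as in the old releases; it pulls the quarterly performance-report summary for all listed companies and then filters down to the food-company codes.

# Assumes the classic tushare API: get_report_data(year, quarter) returning a
# DataFrame with a 'code' column (check the current docs, the API has evolved).
import pandas as pd
import tushare as ts

stock = ['000048']                    # the food-company codes collected earlier

frames = []
for year in range(2010, 2015):        # the five report years
    df = ts.get_report_data(year, 4)  # Q4 report = full-year figures
    frames.append(df[df['code'].isin(stock)])

# combine everything and save for analysis in Excel
pd.concat(frames).to_excel('food_reports.xlsx', index=False)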