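I don't have much data to handle, only a few tens of thousands of records, and the processing is simple, so Excel is enough for now. I see two rough approaches:
1. Store the data scraped with Python in MySQL through the Python DB API, then process it there.
2. Write the scraped data straight to an Excel file (.xlsx) with the Python XlsxWriter module.
I'm not sure which route is more convenient for generating files automatically and for later processing. I'm not very comfortable with MySQL, so I lean toward using XlsxWriter to generate the Excel file. From what I've read of the XlsxWriter docs, though, it only covers writing cell contents and formatting and then producing the Excel file. Is there a simpler way to turn scraped data into an Excel file automatically? (Some crawler frameworks can do this, but I haven't learned any framework yet, since I only want some simple functionality.)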
I ran into a similar problem and ended up choosing Excel.
It all depends on your needs; pick whichever is more convenient for you. Here is my situation first.
At the time I only scraped a few hundred records per run, and I discarded them after each run.
So Excel was more convenient. I used openpyxl to work with it.
I only used it to save the scraped data, not to manipulate styles, and it was quite simple to use.
In your case it depends on how your tens of thousands of records will grow. If you expect the volume to keep increasing, storing the data directly in a database makes later work easier.
Then again, if saving to Excel meets your needs for now and is more convenient, save it in Excel.
If the data later outgrows Excel, you can write a script that imports the Excel data straight into a database.
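A minimal sketch of that workflow, assuming openpyxl is installed: save the scraped rows to an .xlsx file first, then import them into a database later. SQLite stands in for the database here so the example runs without a server; the function names, the `pages` table, and the `title`/`url` columns are made up for illustration.

```python
import sqlite3
from openpyxl import Workbook, load_workbook


def save_to_xlsx(rows, path):
    """Write (title, url) rows to an .xlsx file with a header row."""
    wb = Workbook()
    ws = wb.active
    ws.append(["title", "url"])  # header row (hypothetical columns)
    for row in rows:
        ws.append(list(row))
    wb.save(path)


def import_xlsx_to_sqlite(xlsx_path, db_path):
    """Copy the data rows of the workbook into a SQLite table."""
    wb = load_workbook(xlsx_path)
    ws = wb.active
    # min_row=2 skips the header; values_only yields plain tuples.
    rows = list(ws.iter_rows(min_row=2, values_only=True))
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)"
            )
            conn.executemany(
                "INSERT INTO pages (title, url) VALUES (?, ?)", rows
            )
    finally:
        conn.close()
    return len(rows)
```

For MySQL the import step would look much the same through a DB API driver, though the placeholder style and connection call differ by driver.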
The asker is worried about not being familiar with MySQL. That is not really a problem: if you have learned another database, MySQL is not hard to pick up.
You will have to deal with databases sooner or later anyway.
With this little data, storing it in a plain text file is better than Excel...
I don't think this depends on which database you use for storage. The crawler can save what it scrapes to Excel, and later you can write a program that imports the Excel data into the database. This can also speed up the crawler; writing to the database during the crawl is not as good.
If you don’t know MySQL, just use openpyxl.
Save it as a CSV text file. Excel can still open it, and it is also easy to import into a database.
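A rough sketch of the CSV route using only Python's standard `csv` module. The function names and the `title`/`url` header are hypothetical. The `utf-8-sig` encoding is a choice made here so that Excel tends to detect non-ASCII text correctly; plain `utf-8` also works for database imports.

```python
import csv


def save_rows_to_csv(rows, path, header=("title", "url")):
    """Write a header plus the scraped rows to a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows_from_csv(path):
    """Read the data rows back, skipping the header."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [tuple(row) for row in reader]
```

Fields containing commas or quotes are escaped by the writer, so scraped titles do not need cleaning first. Note that every value comes back as a string when the file is read again.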
SQLite
With little data and low concurrency, use SQLite. If you are not familiar with SQL, use an ORM such as peewee.
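As a sketch of how little setup SQLite needs, here is one way to store scraped rows with the standard-library `sqlite3` module rather than an ORM. The `pages` table, its columns, and the function names are assumptions for illustration. Making `url` the primary key is a design choice in this sketch so that re-crawling a page overwrites the old row instead of duplicating it.

```python
import sqlite3


def open_store(db_path):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    return conn


def save_pages(conn, pages):
    """Insert (url, title) pairs; a repeated url replaces the earlier row."""
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", pages
        )


def count_pages(conn):
    """Return the number of stored pages."""
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

Passing `":memory:"` as `db_path` gives a throwaway in-memory database, which is handy for trying this out before writing to a file.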
Definitely use a database for post-processing.