The file contents look like this:
(data pair)       (info)
----------------- ------------------------
1 2 3 4 5
----------------- ------------------------
pr333 sd23a2 thisisa 1001 1005
pr333 sd23a2 sentence 1001 1005
pr33w sd11aa we 1022 1002
pr33w sd11aa have 1022 1002
pr33w sd11aa adream 1033 1002
......
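Columns 1 and 2 together form a data pair.
If two rows have the same first two columns, compare the remaining columns; where the values differ, join them with | and merge the rows into one.
The result should look like this: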
pr333 sd23a2 thisisa|sentence 1001 1005
pr33w sd11aa we|have|adream 1022|1033 1002
....
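I'm a beginner and don't know how to do this. The only thing I could think of was a dictionary, but that doesn't seem to work. Any help would be appreciated.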
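To preserve the output order, you need an OrderedDict: use an OrderedDict for the keys to keep the rows in order, and a list for the trailing fields to keep them in order. If the order of the trailing fields doesn't matter, a set is the better choice.
Let me explain the considerations behind this code.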
The first is order. Order matters in two places here: the order of the output lines, and the order of the items after merging. In the question's example, the five input rows become two merged rows, which tells us:
The order of the output lines must be preserved: pr333 comes before pr33w
The order of the merged items must be preserved: thisisa comes before sentence
So the data type we use must be able to maintain order.
The second is speed. Sequence types look items up with a linear search, so for efficiency it is better to use a mapping type.
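To make the speed point concrete, here is a rough sketch that times a membership test on a list against the same test on a dict. The container size, the repeat count, and the variable names are arbitrary choices for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit

# 100,000 integers, once as a sequence and once as the keys of a mapping.
haystack_list = list(range(100_000))
haystack_dict = dict.fromkeys(haystack_list)

# Look for the last element: the worst case for a linear search.
needle = 99_999

list_time = timeit.timeit(lambda: needle in haystack_list, number=100)
dict_time = timeit.timeit(lambda: needle in haystack_dict, number=100)

print(f"list: {list_time:.6f}s  dict: {dict_time:.6f}s")
```

The list has to be scanned element by element, while the dict finds the key through its hash, so the gap grows with the size of the data.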
Taking both considerations together, OrderedDict is a good choice, as moling3650 said. It solves the ordering of the output lines. For the merged items, however, only the keys are needed and the values go unused, so OrderedDict is a bit wasteful there. The standard library currently has no OrderedSet, though, so it will have to do.
For more on OrderedDict, see OrderedDict.
There is in fact a third-party OrderedSet library: OrderedSet.
Or you can implement one yourself; see OrderedSet (Python recipe).
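As a small illustration of using an OrderedDict as a stand-in for an ordered set, the sketch below keeps only the keys and stores a dummy value. The helper name `ordered_unique` is made up for this example. On Python 3.7 and later a plain dict also preserves insertion order, so it would behave the same way here.

```python
from collections import OrderedDict


def ordered_unique(values):
    """Return the distinct values in first-seen order."""
    seen = OrderedDict()
    for value in values:
        seen[value] = None  # dummy value; only the key matters
    return list(seen)


words = ["we", "have", "we", "adream", "have"]
print("|".join(ordered_unique(words)))  # we|have|adream
```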
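Finally, linkse7en makes a very good point: for this kind of file-processing problem, reading and writing at the same time, processing each line as it is read, is definitely efficient (the file only needs to be traversed once; for discussion, see moling's point in the comments) and light on resources (output is written immediately, so no space is spent storing data). But because a repeated data pair may show up again on non-adjacent lines, I still spend a bit more resources to keep the solution robust.
Code (Python3):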
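The following is only a minimal sketch of the approach described in this answer, not the answerer's own listing. The names `merge_rows`, `KEY`, `REST`, and `table` are hypothetical, and it assumes whitespace-separated columns with the data pair in the first two.

```python
from collections import OrderedDict

KEY = slice(0, 2)      # columns 1-2: the data pair
REST = slice(2, None)  # columns 3 onward: the info to merge


def merge_rows(lines):
    """Merge rows sharing a data pair, joining distinct values with '|'."""
    table = OrderedDict()  # data pair -> one ordered "set" per info column
    for line in lines:
        items = line.split()
        if len(items) < 3:
            continue  # skip blank or malformed lines
        key, rest = tuple(items[KEY]), items[REST]
        # First sighting of a pair creates its per-column containers.
        columns = table.setdefault(key, [OrderedDict() for _ in rest])
        for column, value in zip(columns, rest):
            column[value] = None  # keys keep first-seen order, no duplicates
    return [
        " ".join(key + tuple("|".join(column) for column in columns))
        for key, columns in table.items()
    ]


rows = [
    "pr333 sd23a2 thisisa 1001 1005",
    "pr333 sd23a2 sentence 1001 1005",
    "pr33w sd11aa we 1022 1002",
    "pr33w sd11aa have 1022 1002",
    "pr33w sd11aa adream 1033 1002",
]
for merged in merge_rows(rows):
    print(merged)
```

Because every row is collected before anything is written, a data pair that reappears later in the file is still merged into its first row, at the cost of holding the whole table in memory.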
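A few notes on the code (mine may not be the best way to write it, but I have some takeaways to share).
The first is the use of the slice class. As Python programmers, we should all be familiar with slicing sequence types.
In fact, the same operation can be written with the slice class: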
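A minimal sketch of the equivalence, using a made-up list named `letters`: the bracket syntax with colons is shorthand for indexing with a slice object, and an omitted bound corresponds to None.

```python
letters = list("abcdef")

# The usual slicing syntax...
with_colons = letters[1:4]
# ...gives the same result as indexing with an explicit slice object.
with_slice = letters[slice(1, 4)]

print(with_colons)  # ['b', 'c', 'd']
print(with_slice)   # ['b', 'c', 'd']

# Omitted bounds become None in the slice object.
print(letters[2:] == letters[slice(2, None)])           # True
print(letters[::2] == letters[slice(None, None, 2)])    # True
```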
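So what is the benefit?
We can use this feature to give slices names. Taking the code for this question as an example, extracting the data pair and the other data could originally be done with literal slice bounds.
But that is not very clear to read. We can give the two ranges names instead: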
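A short sketch of the contrast, with one row from the question split into `items`. The names `KEY` and `REST` are chosen here for illustration and are not necessarily the ones used in the answer.

```python
items = "pr33w sd11aa adream 1033 1002".split()

# Literal bounds work, but the reader has to decode what 0:2 and 2: mean.
pair_plain, info_plain = items[0:2], items[2:]

# Naming the ranges documents the intent at the point of use.
KEY = slice(0, 2)      # columns 1-2: the data pair
REST = slice(2, None)  # columns 3 onward: the other data
pair, info = items[KEY], items[REST]

print(pair)  # ['pr33w', 'sd11aa']
print(info)  # ['adream', '1033', '1002']
```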
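This lets us pull values out of items in a more elegant and readable way.
The second is setdefault, a very handy method. Take a call such as dic.setdefault(key, default_value): if key exists in the dictionary (or any other compatible mapping type), it returns dic[key]; otherwise it inserts the new pair dic[key] = default_value into the dictionary and returns default_value.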
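Both branches of setdefault can be seen in a small sketch. The dictionary name `groups` and the sample keys are invented for the example.

```python
groups = {}

# Key is absent: the default is inserted and that same list is returned.
first = groups.setdefault("pr333", [])
first.append("thisisa")

# Key is present: the stored list is returned and the new default is ignored.
second = groups.setdefault("pr333", [])
second.append("sentence")

print(groups)           # {'pr333': ['thisisa', 'sentence']}
print(first is second)  # True
```

This is what makes the "create the container on first sight, then append to it" pattern a one-liner when grouping rows by key.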
The last thing I want to share is unpacking nested tuples. This technique makes it easy to take nested tuples apart.
Thanks, everyone, for not complaining that I talk too much...
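As an illustration of nested-tuple unpacking, the sketch below loops over made-up `records`, each holding a data pair and its merged info, and takes the inner tuple apart in the loop header itself. The variable names are hypothetical.

```python
records = [
    (("pr333", "sd23a2"), ["thisisa|sentence", "1001", "1005"]),
    (("pr33w", "sd11aa"), ["we|have|adream", "1022|1033", "1002"]),
]

lines = []
# The target pattern mirrors the shape of each record, so the inner
# tuple is unpacked in the same step as the outer one.
for (first, second), info in records:
    lines.append(" ".join([first, second] + info))

print(lines[0])  # pr333 sd23a2 thisisa|sentence 1001 1005
print(lines[1])  # pr33w sd11aa we|have|adream 1022|1033 1002
```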
I think pandas makes this much more convenient.
Four lines solve the problem.
I saved the file as example.txt first.