I'm looking for an algorithm related to computing date-time intervals. The specific problem is as follows:
I have a fairly large list of date-time strings (tens of thousands of entries; the times are increasing but follow no regular pattern). Suppose dt=[
'2015-01-02 10:21:02',
'2015-01-02 10:21:02',
'2015-01-02 10:21:03',
'2015-01-02 10:21:04',
'2015-01-02 10:21:06',
'2015-01-02 10:21:10',
'2015-01-02 10:21:11',
'2015-01-02 10:21:13',
'2015-01-02 10:22:00',
'2015-01-02 10:22:12',
... ...
]
Given a time interval timeframe (say 4 s), the question is: what is the maximum number of items in the list that fall within any continuous interval of that length?
The simplest method is, for each item x in the list, to scan forward until reaching an item greater than x + timeframe and count the items in between. But because the list is huge and time arithmetic is involved as well, this approach cannot meet the performance requirement, so I'm asking for advice!
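For reference, the brute-force method described above could be sketched as follows; the function name and the time format are illustrative assumptions, not the original poster's code:

```python
from datetime import datetime, timedelta

def count_max_naive(timestrs, timeframe):
    """For each entry, scan forward until the gap exceeds `timeframe` seconds, and count."""
    times = [datetime.strptime(s, '%Y-%m-%d %H:%M:%S') for s in timestrs]
    tf = timedelta(seconds=timeframe)
    best = 0
    for i, start in enumerate(times):
        j = i
        while j < len(times) and times[j] - start <= tf:
            j += 1
        best = max(best, j - i)  # number of items within [start, start + timeframe]
    return best
```

In the worst case this rescans large parts of the list for every starting item, which is why it becomes too slow on tens of thousands of entries.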
Let me give my answer first.

Remember that the code needs to import datetime and timedelta. This is an O(n) approach, where n is the number of time strings (timestrs).
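The answer's original code block is not reproduced here; a minimal sliding-window sketch of an O(n) approach in this spirit (the two-pointer formulation, the name findmax, and the time format are assumptions) could look like this:

```python
from datetime import datetime, timedelta

def findmax(timestrs, timeframe):
    """Two-pointer sliding window over the already-sorted time strings."""
    times = [datetime.strptime(s, '%Y-%m-%d %H:%M:%S') for s in timestrs]
    tf = timedelta(seconds=timeframe)
    best, left = 0, 0
    for right, t in enumerate(times):
        # Advance the left edge while the window spans more than `timeframe` seconds.
        while t - times[left] > tf:
            left += 1
        best = max(best, right - left + 1)
    return best
```

Each of the two pointers only ever moves forward, so apart from the O(n) parsing the scan itself is linear in n.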
The code is a bit ugly and I will clean it up later, but in order to compare speed and correctness I ran some experiments. The complete experiment code contains:

- A handy timing decorator, simpletimer: it wraps a func and prints how long the call takes to execute.
- A function that generates random test data, gentimestrs: it generates count time strings, where the gap between consecutive entries is a random integer no greater than delta seconds.
- findmax and findmax2, which are my approach and @citaret's approach respectively (anyone interested can follow the same interface and add their own function, and the original poster can also compare against the original method). Both are timed with simpletimer.

When testing, first generate the test data, then run each function to be compared in funclst.
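The original harness is not shown above; a sketch of what such a harness could look like, using the names described (simpletimer, gentimestrs, funclst) with all other details assumed:

```python
import random
import time
from datetime import datetime, timedelta
from functools import wraps

def simpletimer(func):
    """Decorator: print how long each call to `func` takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print('{}: {:.6f}s'.format(func.__name__, time.perf_counter() - start))
        return result
    return wrapper

def gentimestrs(count, delta):
    """Generate `count` increasing time strings; consecutive gaps are random integers <= `delta` seconds."""
    t = datetime(2015, 1, 2, 10, 21, 2)
    out = []
    for _ in range(count):
        out.append(t.strftime('%Y-%m-%d %H:%M:%S'))
        t += timedelta(seconds=random.randint(0, delta))
    return out

if __name__ == '__main__':
    # funclst would hold the implementations to compare, e.g. [findmax, findmax2];
    # it is left empty here so the sketch runs on its own.
    funclst = []
    timeframe = 5
    for count in (1000, 10000, 100000):
        data = simpletimer(gentimestrs)(count, 3)  # generation itself is timed, as a demo of the decorator
        for func in funclst:
            print(count, simpletimer(func)(data, timeframe))
```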
Here are some results:
Using timeframe==5 to test with data sizes of 1000, 10000, and 100000, and then timeframe==10 with the same sizes, I observed the following:

With the timeframe held fixed, the running time of both methods grows linearly with the amount of test data.
But when the timeframe doubles, the second method takes roughly twice as long.
This suggests that the complexity of @citaret's method (findmax2) is about O(n * timeframe).
One more thing: sometimes the results of the two methods differ. If the original poster has a reference method, it can be used to check which one is correct.
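One simple way to check, assuming the gentimestrs, count_max_naive, and findmax sketches above are in scope: compare a fast implementation against the brute-force count on small random inputs, for example:

```python
import random

def crosscheck(trials=200):
    """Compare a fast implementation against the brute-force count on small random inputs."""
    for _ in range(trials):
        data = gentimestrs(random.randint(2, 200), delta=3)
        timeframe = random.randint(1, 10)
        assert findmax(data, timeframe) == count_max_naive(data, timeframe), (timeframe, data)
    print('crosscheck passed')
```

A discrepancy like this often comes down to a boundary question, namely whether an item exactly timeframe seconds after the window start counts as inside the window.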
From a quick check my results seem to be correct, but please correct me if I have made any mistakes. Thank you!
More people are welcome to join the discussion.
Questions I answered: Python-QA
Without the data I can't profile this properly, so let me give a version first and see whether others have better solutions.
@dokelung tested it using citaret's algorithm. For a list with 27505 elements in total (extracted from a single log file with 630,000 lines and aggregated over multiple source IPs), it took 3.8842415850296845 s, whereas my original approach took 16.544873371035163 s, so the efficiency improved quite a bit. It is still on the order of seconds, though, and this is just one file per day. At the same time this helped me discover that the more serious efficiency problem is actually extracting the log lines that meet the criteria inside the function. Thanks, everyone, for your answers and participation!