datetime - 关于一个日期时间间隔计算相关的算法(Python)

Question

求一个关于日期时间间隔计算相关的算法，具体问题如下：当前有一个比较大（上万个）日期时间的list（时间递增但无规律），假设 dt=['2015-01-02 10:21:02','2015-01-02 10:21:02''2015-01-02 10:21:03''2015-01-0...

大家讲道理 · Answer

Give me my answer first:

def findmax(timestrs, timeframe):
    fmt = "%Y-%m-%d %H:%M:%S"
    times = [datetime.strptime(timestr, fmt) for timestr in timestrs]
    frame = timedelta(seconds=timeframe)
    window = dict(bidx=0, eidx=0)
    maxwindow = dict(bidx=0, eidx=0)

    # search
    for idx, time in enumerate(times):
        if time-times[window['bidx']] > frame:
            count = window['eidx'] - window['bidx'] + 1
            if count > maxwindow['eidx'] - maxwindow['bidx']:
                maxwindow = dict(window)
        while time-times[window['bidx']] > frame:
            window['bidx'] += 1
        window['eidx'] = idx
    else:
        count = window['eidx'] - window['bidx'] + 1
        if count > maxwindow['eidx'] - maxwindow['bidx']:
            maxwindow = dict(window)

    # output results
    maxcount = maxwindow['eidx'] - maxwindow['bidx'] + 1
    print('max count:{} of winodw from {}:{} to {}:{}'.format(
        maxcount, maxwindow['bidx'], timestrs[maxwindow['bidx']],
        maxwindow['eidx'], timestrs[maxwindow['eidx']]))

Remember to import the above code datetime 和 timedelta．

This is a O(n) approach, n is the number of timestrs.

The code is a bit ugly, I will fix it later, but in order to compare the speed and accuracy, I did some experiments. Here is the complete code I used for experiments:

from datetime import datetime, timedelta
import collections
import random
import sys


def simpletimer(func):
    """ a simple decorator to time a function"""
    import time
    def modified_func(*args, **kwargs):
        btime = time.time()
        result = func(*args, **kwargs)
        print('[time]', time.time()-btime, 'sec.')
        return result
    return modified_func


def gentimestrs(count, delta):
    """ generate time strings for input data"""
    timestrs = []
    time = datetime(2015, 1, 2)
    for i in range(count):
        time = time + timedelta(seconds=random.randrange(delta))
        timestrs.append(str(time))
    return timestrs

@simpletimer
def findmax(timestrs, timeframe):

    fmt = "%Y-%m-%d %H:%M:%S"
    times = [datetime.strptime(timestr, fmt) for timestr in timestrs]
    frame = timedelta(seconds=timeframe)

    window = dict(bidx=0, eidx=0)
    maxwindow = dict(bidx=0, eidx=0)

    for idx, time in enumerate(times):
        if time-times[window['bidx']] > frame:
            count = window['eidx'] - window['bidx'] + 1
            if count > maxwindow['eidx'] - maxwindow['bidx']:
                maxwindow = dict(window)
        while time-times[window['bidx']] > frame:
            window['bidx'] += 1
        window['eidx'] = idx
    else:
        count = window['eidx'] - window['bidx'] + 1
        if count > maxwindow['eidx'] - maxwindow['bidx']:
            maxwindow = dict(window)

    maxcount = maxwindow['eidx'] - maxwindow['bidx'] + 1
    print('max count:{} of winodw from {}:{} to {}:{}'.format(
        maxcount, maxwindow['bidx'], timestrs[maxwindow['bidx']],
        maxwindow['eidx'], timestrs[maxwindow['eidx']]))


@simpletimer
def findmax2(timestrs, timeframe):
    fmt = "%Y-%m-%d %H:%M:%S"
    ys = ((datetime.strptime(x, fmt) + timedelta(seconds=-i)).strftime(fmt) for x in timestrs for i in range(timeframe))
    print(collections.Counter(ys).most_common(1))


if __name__ == '__main__':

    count = int(sys.argv[1])
    delta = int(sys.argv[2])
    frame = int(sys.argv[3])

    timestrs = gentimestrs(count, delta)
    with open('timestrs', 'w') as writer:
        for timestr in timestrs:
            print(timestr, file=writer)

    funclst = [findmax, findmax2]

    for func in funclst:
        func(timestrs, frame)

Contains

A very cool decorator for timing: simpletimer
- He will decorate func, print out the time it takes to execute
A function that generates random test results: gentimestrs
- He will generate count 筆 time strings 測資, 且各資料的時間間隔是小於等於 delta random integers
The following findmax and findmax2 are my approach and @citaret's approach respectively (if you are interested, you can imitate the interface and add your own functions, and the questioner can also Compare with your original practice), and use findmax 和 findmax2 分別是我的做法跟 @citaret 的做法 (有興趣的大家可以自行仿照介面加入自己的 function, 題主也可以把自己原先的做法拿來比比看), 並且用 simpletimer to time it.
When testing, first generate the test data, and then execute the funcions to be compared in funclst

The execution example is as follows:

                           timeframe
                           ＿
$ python3 test.py 10000 30 6
                  ^^^^^ ^^ 
                  測資數 隨機最大時間間隔

Here are some results:

Use timeframe==5 to test the amount of funds 1000, 10000, 100000

$ python3 td.py 1000 30 5
max count:4 of winodw from 798:2015-01-02 03:25:56 to 801:2015-01-02 03:26:01
[time] 0.02829718589782715 sec.
[('2015-01-02 03:25:56', 3)]
[time] 0.15825390815734863 sec.

$ python3 td.py 10000 30 5
max count:5 of winodw from 1624:2015-01-02 06:37:32 to 1628:2015-01-02 06:37:37
[time] 0.19979143142700195 sec.
[('2015-01-02 12:17:22', 4)]
[time] 1.6265528202056885 sec.

$ python3 td.py 100000 30 5
max count:6 of winodw from 91688:2015-01-17 09:41:33 to 91693:2015-01-17 09:41:38
[time] 1.8956780433654785 sec.
[('2015-01-15 01:36:41', 5)]
[time] 15.950862407684326 sec.

Use timeframe==10 to test the amount of funds 1000, 10000, 100000

$ python3 td.py 1000 30 10
max count:5 of winodw from 910:2015-01-02 03:36:08 to 914:2015-01-02 03:36:17
[time] 0.02788376808166504 sec.
[('2015-01-02 03:24:06', 5)]
[time] 0.3295907974243164 sec.

$ python3 td.py 10000 30 10
max count:6 of winodw from 5304:2015-01-02 21:45:41 to 5309:2015-01-02 21:45:47
[time] 0.20185112953186035 sec.
[('2015-01-02 21:45:38', 6)]
[time] 3.2009713649749756 sec.

$ python3 td.py 100000 30 10
max count:7 of winodw from 16224:2015-01-04 17:16:17 to 16230:2015-01-04 17:16:23
[time] 1.8946728706359863 sec.
[('2015-01-04 17:16:17', 7)]
[time] 33.85069417953491 sec.

If the timeframe size is kept unchanged, both times will show a linear growth in the number of test funds
But when the timeframe doubles, the second method will take twice as long

This means that the maximum method complexity of @citaret should be O(n * timeframe)

Another point is that sometimes the results of these two methods are not the same, there will be a difference. If the questioner has a method, he can verify the correctness

I just look at the answer I found and it seems to be correct, but please correct me if I have made any mistakes. Thank you!

We also welcome more people to discuss this issue.

Questions I answered: Python-QA

高洛峰 · Answer

There is no data and no good profile. Let’s give a version first to see if others have better solutions.

xs = [
'2015-01-02 10:21:02',
'2015-01-02 10:21:02',
'2015-01-02 10:21:03',
'2015-01-02 10:21:04',
'2015-01-02 10:21:06',
'2015-01-02 10:21:10',
'2015-01-02 10:21:11',
'2015-01-02 10:21:13',
'2015-01-02 10:22:00',
'2015-01-02 10:22:12' ]

from datetime import datetime, timedelta
import collections

delta = 5
fmt = "%Y-%m-%d %H:%M:%S"
ys = ((datetime.strptime(x, fmt) + timedelta(seconds=-i)).strftime(fmt) for x in xs for i in range(delta))
print collections.Counter(ys).most_common(1)
# output: [('2015-01-02 10:21:02', 5)]

伊谢尔伦 · Answer

@dokelung tested it using the citaret algorithm. For a list with a total of 27505 elements (extracted from a single log file with 630,000 lines and aggregated from multiple source IPs), it took 3.8842415850296845s. My original time took 16.544873371035163s, and the efficiency was improved. Quite a few, although still on the order of seconds, this is just one file per day. At the same time, it helped me find that the more serious efficiency problem is actually the extraction of log lines that meet the requirements within the function. Thank you everyone for your answers and participation, thank you!