python - Questions about NumPy array operations
ringa_lee 2017-06-30 09:56:09
['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv']

The NumPy array has tens of thousands of elements in total. I want to keep only the leading number of each element (000001 and so on) and remove duplicates, so that each number appears once. The result should be ['000001', '000002', '000003', '000004'].
Besides using a for loop, is there a more efficient way?


reply all (3)
迷茫

Here is a way to do it with NumPy:

python3

>>> import numpy as np
>>> a = np.array(['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv'])
>>> b = np.unique(np.fromiter(map(lambda x: x.split('_')[0], a), '|S6'))
>>> b
array([b'000001', b'000002', b'000003', b'000004'], dtype='|S6')

'|S6' stores each string in 6 bytes.

You can also write it with np.frompyfunc. Here '<U6' is a string stored as 6 little-endian Unicode characters:

>>> b = np.array(np.unique(np.frompyfunc(lambda x: x[:6], 1, 1)(a)), dtype='<U6')
>>> b
array(['000001', '000002', '000003', '000004'], dtype='<U6')
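As a further sketch along the same lines (not part of the answer above), the prefix can also be extracted without a Python-level lambda. It assumes the same array a as in the question. np.char.partition splits each element at the first underscore, and casting to '<U6' truncates each string to six characters, which only applies if the prefix length is fixed. The names by_separator and by_truncation are made up for illustration.

```python
import numpy as np

a = np.array(['000001_2017-03-17.csv', '000001_2017-03-20.csv',
              '000002_2017-03-21.csv', '000002_2017-03-22.csv',
              '000003_2017-03-23.csv', '000004_2017-03-24.csv'])

# Split each element at the first '_'. The result has shape (n, 3):
# text before the separator, the separator, and the text after it.
by_separator = np.unique(np.char.partition(a, '_')[:, 0])

# Assumption: the prefix is always six characters long.
# Casting to a shorter string dtype truncates each element.
by_truncation = np.unique(a.astype('<U6'))

print(by_separator.tolist())
print(by_truncation.tolist())
```

Both variants return str elements rather than bytes, so no decoding step is needed afterwards.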
    学习ing

    Building on the approaches in the other two answers
    @agree and accept @xiaojieluoff

    If the number is always the first six characters, the fastest variant is the first one below (the set comprehension, labelled dic)

    import time

    lst = ['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv'] * 1000000

    # Set comprehension (labelled 'dic' in the output)
    start = time.time()
    data = {_[:6] for _ in lst}
    print('dic: {}'.format(time.time() - start))

    # set() over a generator expression
    start = time.time()
    data = set(_[:6] for _ in lst)
    print('set: {}'.format(time.time() - start))

    # set() over map() with a lambda
    start = time.time()
    data = set(map(lambda _: _[:6], lst))
    print('map:{}'.format(time.time() - start))

    # Adding to the set one element at a time
    start = time.time()
    data = set()
    [data.add(_[:6]) for _ in lst]
    print('for:{}'.format(time.time() - start))

    Time taken:

    dic: 0.72798705101
    set: 0.929664850235
    map:1.89214396477
    for:1.76194214821
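Single time.time() measurements like these can be noisy. As a hedged sketch, the same slicing variants could be compared with the standard timeit module, taking the best of several runs. The sample size, the repeat counts and the candidates dictionary are assumptions chosen so the sketch finishes quickly; absolute numbers will differ by machine and Python version.

```python
import timeit

# Assumption: a smaller sample than in the post, so the sketch runs quickly.
lst = ['000001_2017-03-17.csv', '000001_2017-03-20.csv',
       '000002_2017-03-21.csv', '000002_2017-03-22.csv',
       '000003_2017-03-23.csv', '000004_2017-03-24.csv'] * 10000

# Hypothetical labels mapped to the variants being compared.
candidates = {
    'set comprehension': lambda: {s[:6] for s in lst},
    'set(generator)': lambda: set(s[:6] for s in lst),
    'set(map(lambda))': lambda: set(map(lambda s: s[:6], lst)),
}

for name, func in candidates.items():
    # repeat() returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(func, number=5, repeat=3))
    print('{}: {:.4f} s'.format(name, best))
```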
      某草草

      Use map and anonymous functions

      lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv']
      data = list(set(map(lambda x: x.split('_')[0], lists)))
      print(data)

      Output:

      ['000003', '000004', '000001', '000002']
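The order of this output differs from the input order because a set does not preserve order. If a predictable order is needed, one minimal sketch is to sort the result, or to use dict.fromkeys, which keeps first-seen order on Python 3.7 and later. The names prefixes, sorted_unique and first_seen_unique are made up for illustration.

```python
lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv',
         '000002_2017-03-21.csv', '000002_2017-03-22.csv',
         '000003_2017-03-23.csv', '000004_2017-03-24.csv']

# Text before the first '_' for every file name.
prefixes = [name.split('_')[0] for name in lists]

# Option 1: deduplicate with a set, then sort.
sorted_unique = sorted(set(prefixes))

# Option 2: dict keys are unique and keep insertion order (Python 3.7+).
first_seen_unique = list(dict.fromkeys(prefixes))

print(sorted_unique)
print(first_seen_unique)
```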

      Run the following code and you can see that with 6 million items, map is about 0.6 s faster than the for loop

      import time

      lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv'] * 1000000

      # time.clock() was removed in Python 3.8; time.perf_counter() is used here instead
      map_start = time.perf_counter()
      map_data = list(set(map(lambda x: x.split('_')[0], lists)))
      map_end = time.perf_counter() - map_start
      print('map run time:{}'.format(map_end))

      for_start = time.perf_counter()
      data = set()
      for k in lists:
          data.add(k.split('_')[0])
      for_end = time.perf_counter() - for_start
      print('for run time:{}'.format(for_end))

      Output:

      map run time:2.36173
      for run time:2.9405870000000003

      If the test data is expanded to 60 million items, the absolute gap grows to about 3.5 s

      map run time:29.620203
      for run time:33.132621