How to Efficiently Parse Fixed Width Files in Python: Struct Module vs. Optimized String Slicing?

DDD

Release: 2024-10-31 15:43:03

Efficiently Parsing Fixed Width Files

In a fixed width file, every field occupies a predetermined range of character positions on each line, so fields are located by offset rather than by a delimiter. Extracting them efficiently matters when processing large amounts of data.

Problem Statement

Given a file of fixed width lines in which each column holds a specific value, find an efficient way to split each line into its separate fields. Hand-written string slicing does the job, but it raises concerns about readability and about its suitability for large files.
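
For reference, the hand-written slicing approach described above might look like the following minimal sketch. The function name parse_naive and the layout (a 2-character field, 10 characters to skip, then a 24-character field) are assumptions chosen to match the widths used in the examples in this article.

```python
# Assumed layout: a 2-character code, 10 characters of padding to skip,
# then a 24-character value.
def parse_naive(line):
    """Slice each field out using hard-coded offsets."""
    code = line[0:2]     # columns 0-1
    value = line[12:36]  # columns 12-35, after the 10 skipped characters
    return code, value

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
print(parse_naive(line))  # ('AB', 'MNOPQRSTUVWXYZ0123456789')
```

Every offset is written out by hand, which is what makes this style hard to read and easy to get wrong once a record has many columns.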

Solution

Two efficient parsing methods are presented:

Method 1: Using the struct Module

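The Python standard library's struct module unpacks fields from binary data according to a format string. For a fixed width file, the format string gives the width of each column: Ns for an N-byte field to keep and Nx for N pad bytes to skip. Because struct works on bytes, a text line must be encoded before unpacking and the resulting fields decoded afterwards, and the widths count bytes rather than characters. In the example below, a negative width marks a column to skip.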

Example:

<code class="python">import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))</code>
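
When the whole file is read as bytes and every record has exactly the same length, struct can also walk the buffer record by record with Struct.iter_unpack. The sketch below is one possible way to do that, not part of the original example. It assumes single-byte (ASCII) text and exactly one newline byte ending every line, which is skipped with an extra 1x; the variable names are illustrative.

```python
import struct

# Assumed layout: 2-byte field, 10 skipped bytes, 24-byte field, then one
# skipped byte for the trailing newline, so each record is exactly 37 bytes.
record = struct.Struct('2s 10x 24s 1x')

# In-memory stand-in for the contents of a file opened in binary mode ('rb').
data = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
        b'abcdefghijklmnopqrstuvwxyz0123456789\n')

# iter_unpack requires len(data) to be an exact multiple of record.size.
rows = [tuple(field.decode('ascii') for field in fields)
        for fields in record.iter_unpack(data)]
print(rows)
```

If the file may contain short lines, a different line terminator, or a missing final newline, the buffer length will not be a multiple of the record size and iter_unpack raises struct.error, so the per-line parse shown above is the safer choice in that case.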

Method 2: Using String Slicing with Compilation

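Plain string slicing can be made faster by generating the parser's source code and compiling it with eval(). The generated lambda contains every slice boundary as a literal constant, so nothing about the layout has to be computed while parsing a line. As in Method 1, a negative width marks a padding column to skip. Because eval() executes whatever source it is given, the field widths should come from a trusted source.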

Example (Optimized):

<code class="python">from itertools import accumulate, zip_longest

def make_parser(fieldwidths):
    # Cumulative end offset of every column, padding columns included.
    cuts = tuple(accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # True for padding columns
    # (pad, start, end) per column; zip_longest yields one extra tuple, dropped here.
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse</code>
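
Where eval() is undesirable, a similar effect can be had by precomputing slice objects once and reusing them for every line. This is a different technique from the compiled lambda above, shown only as a sketch; the name make_slice_parser is hypothetical, and it follows the same convention that negative widths mark padding.

```python
from itertools import accumulate

def make_slice_parser(fieldwidths):
    """Return a parser built from precomputed slice objects (no eval)."""
    ends = list(accumulate(abs(fw) for fw in fieldwidths))
    starts = [0] + ends[:-1]
    # Keep only the slices for real fields; negative widths mark padding.
    slices = tuple(slice(i, j)
                   for fw, i, j in zip(fieldwidths, starts, ends) if fw > 0)

    def parse(line):
        return tuple(line[s] for s in slices)

    return parse

parse = make_slice_parser((2, -10, 24))
print(parse('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'))
# ('AB', 'MNOPQRSTUVWXYZ0123456789')
```

This version loops over the slices on every call, so it trades some of the speed of the generated lambda for code that never executes a constructed string.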

Both methods parse fixed width files efficiently. Method 1, the struct module, is the simpler of the two to use, while Method 2, compiled string slicing, offers slightly better performance and works on text directly, without the encode and decode steps that struct needs.
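
The relative speed of the two approaches depends on the Python version, the machine, and the record layout, so it is worth measuring on real data. The following is a rough timing sketch using the standard timeit module. The helper names parse_struct and parse_slices are illustrative, and the slicing function hard-codes the boundaries that a generated parser would contain for the widths (2, -10, 24).

```python
import struct
import timeit

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'

# struct-based parser: encode to bytes, unpack, decode each field.
unpack = struct.Struct('2s 10x 24s').unpack_from

def parse_struct(text):
    return tuple(s.decode() for s in unpack(text.encode()))

# Slicing parser with constant boundaries, like the eval-generated lambda.
def parse_slices(text):
    return (text[0:2], text[12:36])

for name, fn in (('struct', parse_struct), ('slices', parse_slices)):
    seconds = timeit.timeit(lambda: fn(line), number=100_000)
    print('{:>6}: {:.3f} s per 100,000 calls'.format(name, seconds))
```

Both functions return the same tuple for this line, so the comparison measures only the parsing strategy.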
