Let’s talk about how to use PHP to read large files (tutorial sharing)

青灯夜游
2022-09-22 20:09:26

How does PHP read large files? The following article will introduce to you how to use PHP to read large files. I hope it will be helpful to you!

As PHP developers, we rarely need to worry about memory management. The PHP engine does an excellent job of cleaning up behind us, and the web server model of short-lived execution contexts means even the sloppiest code has no lasting impact.

In rare cases, we may need to step outside this comfort zone: for example, when we try to run Composer for a large project on the smallest VPS we can create, or when we need to read large files on an equally small server.

The latter is the problem we will look at in this tutorial.

The code for this tutorial can be found on GitHub.

Measuring success

The only way to confirm that an improvement to our code is effective is to measure a bad situation and then compare it with the same measurement taken after we have applied the fix. In other words, we don’t know whether a “solution” is a solution unless we know how much (if at all) it helps us.

We can pay attention to two indicators. The first is CPU usage. How fast or slow does the process we are dealing with run? Second is memory usage. How much memory does the script take up to execute? These are usually inversely proportional - meaning we can reduce memory usage at the expense of CPU usage, and vice versa.

In an asynchronous execution model (such as a multi-process or multi-threaded PHP application), both CPU and memory usage are important considerations. In a traditional PHP architecture, they generally become a problem only when either one reaches the limits of the server.

Measuring CPU usage inside PHP is difficult to achieve. If you really care about this, consider using a command like top in Ubuntu or macOS. For Windows, consider using the Linux subsystem so you can use the top command in Ubuntu.

In this tutorial we will measure memory usage. We'll look at how much memory a "traditional" script uses, then implement a couple of optimization strategies and measure those too. By the end, I hope you will be able to make an educated choice.

Here are the methods we will use to view memory usage:

// formatBytes is taken from the php.net documentation

memory_get_peak_usage();

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

We will use these methods at the end of the script so that we can understand which script is using the most memory at one time.
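If a rough timing figure is wanted alongside the memory figure, a small wrapper can report both. This is only a sketch: the measure helper is hypothetical and not part of the tutorial's code, and wall-clock time from microtime is not the same thing as CPU usage.

```php
<?php
// Hypothetical helper: reports wall-clock time and peak memory for a task.
function measure(callable $task): array
{
    $start = microtime(true);

    $task();

    return [
        "seconds" => microtime(true) - $start,
        // Peak memory for the whole script so far, not just for $task.
        "peak_bytes" => memory_get_peak_usage(),
    ];
}

$report = measure(function () {
    // Stand-in workload: build a 1 MB string.
    $data = str_repeat("x", 1024 * 1024);

    return strlen($data);
});

printf("%.4f s, %d bytes peak\n", $report["seconds"], $report["peak_bytes"]);
```

Because the peak is tracked per process, each strategy is best measured in its own script run, which is what the tutorial's separate files do.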

What are our options?

There are many approaches we could take to read files efficiently, but there are two likely scenarios in which we would use them. We might want to read and process data all at once, outputting the processed data or performing other actions based on what we read. We might also want to transform a stream of data without ever really needing access to the data itself.

For the first scenario, imagine we want to read a file and hand every 10,000 lines to a separate queue for processing. We would need to keep at least 10,000 lines in memory and pass them along to the queue manager (whatever form that takes).

For the second case, suppose we want to compress the content of an API response that is particularly large. Although we don't care what its contents are here, we do need to make sure that it is backed up in a compressed format.

In both cases, we need to read large files. The difference is that in the first case we need to know what the data is, while in the second case we don't care what the data is. Next, let's discuss these two approaches in depth...

Read files line by line

PHP has many functions for working with files. Let's combine a few into a simple, naive file reader:

// from memory.php

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

// from reading-files-line-by-line-1.php

function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile("shakespeare.txt");

require "memory.php";

We are reading a text file containing the complete works of Shakespeare. The file is about 5.5 MB, and memory usage peaked at 12.8 MB. Now, let's use a generator to read each line:

// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile("shakespeare.txt");

require "memory.php";

The file size is the same, but memory usage peaks at 393 KB. This number means little on its own, because a generator does no work until something iterates over it, and we have not yet done anything with the data we read. For example, we could split the document into chunks whenever we see two blank lines:

// from reading-files-line-by-line-3.php

$iterator = readTheFile("shakespeare.txt");

$buffer = "";

foreach ($iterator as $iteration) {
    preg_match("/\n{3}/", $buffer, $matches);

    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require "memory.php";

Anyone have a guess at how much memory is used this time? Even if we divide the text document into 126 chunks, we still only use 459 KB of memory. Given the nature of the generator, the maximum memory we will use is the memory needed to store the largest chunk of text during the iteration. In this case, the largest block is 101985 characters.
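The reader used so far leans on feof and never checks whether fopen succeeded. A slightly more defensive variant, offered as a sketch rather than as the tutorial's own code, tests the return value of fgets directly and closes the handle in a finally block. The name readLines is hypothetical.

```php
<?php
// Hypothetical helper, not part of the tutorial's repository.
function readLines(string $path): Generator
{
    $handle = fopen($path, "r");

    if ($handle === false) {
        throw new RuntimeException("Unable to open {$path}");
    }

    try {
        // fgets returns false at end of file (or on error), which ends the loop.
        while (($line = fgets($handle)) !== false) {
            // Strip only the line ending, keeping other whitespace intact.
            yield rtrim($line, "\r\n");
        }
    } finally {
        // Runs when the loop finishes or the generator is destroyed early.
        fclose($handle);
    }
}
```

Checking fgets directly avoids the extra empty value that a feof-based loop can produce when a file ends with a newline, and the memory profile stays the same because only one line is held at a time.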

I have already written about using generators to improve performance and about a generator extension package; if you are interested, check those out for more.

Generators have other uses, but they are demonstrably good for performant reading of large files. If we need to work on the data, generators are probably the best way to go.
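The earlier scenario of handing every 10,000 lines to a queue can be sketched with the same technique. The readInBatches helper below is hypothetical, and the queue itself is left out because the tutorial does not name one; the point is only that memory is bounded by one batch rather than by the whole file.

```php
<?php
// Hypothetical helper: groups lines into arrays of at most $size entries.
function readInBatches(string $path, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException("Batch size must be at least 1");
    }

    $handle = fopen($path, "r");

    if ($handle === false) {
        throw new RuntimeException("Unable to open {$path}");
    }

    $batch = [];

    try {
        while (($line = fgets($handle)) !== false) {
            $batch[] = rtrim($line, "\r\n");

            if (count($batch) === $size) {
                yield $batch;
                $batch = []; // release the previous batch before reading on
            }
        }

        if ($batch !== []) {
            yield $batch; // final, possibly smaller, batch
        }
    } finally {
        fclose($handle);
    }
}

// Usage idea (the queue object is an assumption, not a real API):
// foreach (readInBatches("shakespeare.txt", 10000) as $batch) {
//     $queue->push($batch);
// }
```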

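Piping between files

In situations where we don't need to operate on the data, we can pass file data from one file to another. This is commonly called piping (presumably because we can't see inside a pipe except at each end, as long as it's opaque, of course). We can achieve this with streams. Let's first write a script that transfers one file to another, so that we can measure the memory usage: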

// from piping-files-1.php

file_put_contents(
    "piping-files-1.txt", file_get_contents("shakespeare.txt")
);

require "memory.php";

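Unsurprisingly, this script uses more memory to run than the text file it copies. That's because it has to read the whole file into memory and keep it there until it has been written to the new file. For small files, that may be okay. For bigger files, not so much.

Let's try streaming (or piping) from one file to another: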

// from piping-files-2.php

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

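This code is slightly strange. We open handles to both files, the first in read mode and the second in write mode. Then we copy from the first into the second. We finish by closing both files again. It may surprise you to learn that the memory used is 393 KB.

That number looks familiar: it is what the generator-based script used when reading line by line. In both cases only a small piece of the file is in memory at any moment. The second argument to fgets caps how many bytes are read per call (when it is omitted, fgets reads to the end of the line).

The third argument to stream_copy_to_stream plays a similar role: it caps how many bytes are copied in total (when it is omitted, everything that remains is copied). stream_copy_to_stream reads from one stream a chunk at a time and writes each chunk to the other. Because we don't need to work with the values, it skips the part where a generator would yield them.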
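As a rough illustration of copying a stream piece by piece without ever holding the whole file in memory, here is a hand-rolled sketch. It is not how stream_copy_to_stream is implemented internally; the copyInChunks name and the 8192-byte chunk size are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper; the chunk size is an illustrative assumption.
function copyInChunks($source, $target, int $chunkSize = 8192): int
{
    $copied = 0;

    while (!feof($source)) {
        // At most $chunkSize bytes are in memory at any one time.
        $chunk = fread($source, $chunkSize);

        if ($chunk === false || $chunk === "") {
            break;
        }

        $written = fwrite($target, $chunk);

        if ($written === false) {
            throw new RuntimeException("Write failed");
        }

        $copied += $written;
    }

    return $copied;
}
```

In practice stream_copy_to_stream is the better choice; the sketch only makes visible why peak memory stays flat regardless of file size.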

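Piping text around isn't very useful on its own, so let's think of other examples. Suppose we wanted to output an image from our CDN. We could describe that with the following code: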

// from piping-files-3.php

file_put_contents(
    "piping-files-3.jpeg", file_get_contents(
        "https://github.com/assertchris/uploads/raw/master/rick.jpg"
    )
);

// ...or write this straight to stdout, if we don't need the memory info

require "memory.php";

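Imagine an application route brought us to this code. Instead of serving a file from the local file system, we want to get it from a CDN. We use file_get_contents in place of something more elegant (like Guzzle), but under the hood it's much the same.

The memory usage comes to 581 KB. Now, how about we try to stream this instead?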

// from piping-files-4.php

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);

$handle2 = fopen(
    "piping-files-4.jpeg", "w"
);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

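The memory usage is slightly lower (at 400 KB), but the result is the same. If we didn't need the memory information, we could just as well print to standard output. PHP provides a simple way to do this: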

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);

$handle2 = fopen(
    "php://stdout", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

// require "memory.php";

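Other streams

There are a few other streams we can pipe to or read from:

  • php://stdin (read-only)
  • php://stderr (write-only, like php://stdout)
  • php://input (read-only), which gives us access to the raw request body
  • php://output (write-only), which lets us write to the output buffer
  • php://memory and php://temp (read-write) are places to store data temporarily. The difference is that php://temp moves the data to the file system once it grows large enough, while php://memory keeps storing it in memory until that runs out.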
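To make the last item concrete, here is a minimal sketch of using php://temp as a scratch buffer. The maxmemory suffix sets how many bytes are kept in memory before PHP switches to a temporary file (the default is 2 MB); the 1 MB figure and the sample text are arbitrary choices for illustration.

```php
<?php
// Keep up to 1 MB in memory; beyond that PHP transparently uses a temporary file.
$buffer = fopen("php://temp/maxmemory:1048576", "w+");

fwrite($buffer, "To be, or not to be\n");
fwrite($buffer, "that is the question\n");

// Move back to the start before reading what was written.
rewind($buffer);
$contents = stream_get_contents($buffer);

// Closing the handle discards the temporary data.
fclose($buffer);

print $contents;
```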

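Filters

There's another trick we can use with streams, called filters. They're a kind of in-between step, providing a little control over the stream data without exposing it to us. Imagine we wanted to compress our shakespeare.txt. We might use the Zip extension: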

// from filters-1.php

$zip = new ZipArchive();
$filename = "filters-1.zip";

$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString("shakespeare.txt", file_get_contents("shakespeare.txt"));
$zip->close();

require "memory.php";

This is a neat bit of code, but it uses around 10.75 MB of memory. We can do better with filters:

// from filters-2.php

$handle1 = fopen(
    "php://filter/zlib.deflate/resource=shakespeare.txt", "r"
);

$handle2 = fopen(
    "filters-2.deflated", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

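Here, we can see the php://filter/zlib.deflate filter, which reads and compresses the contents of a resource. We can then pipe this compressed data into another file. This only uses 896 KB.

I know this is not the same format, and that there are upsides to making a zip archive. You have to wonder, though: if you could choose a different format and use about 12 times less memory, wouldn't you?

To uncompress the data, we can run the deflated file back through another zlib filter: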

// from filters-2.php

file_get_contents(
    "php://filter/zlib.inflate/resource=filters-2.deflated"
);
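The same filter can also be attached to an already open handle with stream_filter_append, which compresses on the way out rather than on the way in. The round trip below is a sketch under a few assumptions: the deflateFile helper is hypothetical, the zlib extension is available, and throwaway temporary files stand in for shakespeare.txt.

```php
<?php
// Hypothetical helper: stream $source into $target through zlib.deflate.
function deflateFile(string $source, string $target): void
{
    $in = fopen($source, "r");
    $out = fopen($target, "w");

    if ($in === false || $out === false) {
        throw new RuntimeException("Unable to open source or target");
    }

    // Everything written to $out is compressed before it reaches the file.
    stream_filter_append($out, "zlib.deflate", STREAM_FILTER_WRITE);

    stream_copy_to_stream($in, $out);

    fclose($in);
    fclose($out); // closing flushes whatever the filter still holds
}

// Demo with throwaway files instead of shakespeare.txt.
$plain = tempnam(sys_get_temp_dir(), "plain");
$packed = tempnam(sys_get_temp_dir(), "packed");

$original = str_repeat("All the world's a stage\n", 1000);
file_put_contents($plain, $original);

deflateFile($plain, $packed);

// Read it back through the matching inflate filter.
$restored = file_get_contents(
    "php://filter/read=zlib.inflate/resource=" . $packed
);

print $restored === $original ? "round trip ok\n" : "mismatch\n";

unlink($plain);
unlink($packed);
```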

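Streams have been covered extensively in Understanding Streams in PHP and Using PHP Streams Effectively. If you'd like a different perspective, check those two articles out.

Customizing streams

fopen and file_get_contents have their own set of default options, but these are completely customizable. To define them, we need to create a new stream context: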

// from creating-contexts-1.php

$data = join("&", [
    "twitter=assertchris",
]);

$headers = join("\r\n", [
    "Content-type: application/x-www-form-urlencoded",
    "Content-length: " . strlen($data),
]);

$options = [
    "http" => [
        "method" => "POST",
        "header"=> $headers,
        "content" => $data,
    ],
];

$context = stream_context_create($options);

$handle = fopen("https://example.com/register", "r", false, $context);
$response = stream_get_contents($handle);

fclose($handle);

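In this example, we're trying to make a POST request to an API. The API endpoint is secure, but we still use the http context property (it applies to both http and https). We set a few headers and open a file handle to the API. We can open the handle as read-only, since the context takes care of the writing.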
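A context can be built and inspected without making any request, which is handy when checking what will be sent. The sketch below wraps the same idea in a hypothetical buildPostContext helper and adds two standard http context options, timeout and ignore_errors, purely as examples of further customization; the values chosen are assumptions.

```php
<?php
// Hypothetical helper: builds an HTTP POST context from form fields.
function buildPostContext(array $fields, float $timeout = 5.0)
{
    // http_build_query handles the joining and URL-encoding of the fields.
    $data = http_build_query($fields);

    $headers = implode("\r\n", [
        "Content-Type: application/x-www-form-urlencoded",
        "Content-Length: " . strlen($data),
    ]);

    return stream_context_create([
        "http" => [
            "method" => "POST",
            "header" => $headers,
            "content" => $data,
            "timeout" => $timeout,   // seconds to wait while reading
            "ignore_errors" => true, // still return the body on 4xx/5xx
        ],
    ]);
}

$context = buildPostContext(["twitter" => "assertchris"]);

// Inspect the options without touching the network.
$options = stream_context_get_options($context);

print $options["http"]["content"] . PHP_EOL;
```

The resulting context can be passed as the fourth argument to fopen or the third argument to file_get_contents.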

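There are loads of things we can customize, so it's best to check the documentation if you want to know more.

Making custom protocols and filters

Before we wrap up, let's talk about making custom protocols. If you look at the documentation, you can find an example class to implement: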

Protocol {
    public resource $context;
    public __construct ( void )
    public __destruct ( void )
    public bool dir_closedir ( void )
    public bool dir_opendir ( string $path , int $options )
    public string dir_readdir ( void )
    public bool dir_rewinddir ( void )
    public bool mkdir ( string $path , int $mode , int $options )
    public bool rename ( string $path_from , string $path_to )
    public bool rmdir ( string $path , int $options )
    public resource stream_cast ( int $cast_as )
    public void stream_close ( void )
    public bool stream_eof ( void )
    public bool stream_flush ( void )
    public bool stream_lock ( int $operation )
    public bool stream_metadata ( string $path , int $option , mixed $value )
    public bool stream_open ( string $path , string $mode , int $options ,
        string &$opened_path )
    public string stream_read ( int $count )
    public bool stream_seek ( int $offset , int $whence = SEEK_SET )
    public bool stream_set_option ( int $option , int $arg1 , int $arg2 )
    public array stream_stat ( void )
    public int stream_tell ( void )
    public bool stream_truncate ( int $new_size )
    public int stream_write ( string $data )
    public bool unlink ( string $path )
    public array url_stat ( string $path , int $flags )
}

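We're not going to implement one of these, since I think it deserves its own tutorial. There's a lot of work to be done. But once that work is done, we can register our stream wrapper quite easily: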

if (in_array("highlight-names", stream_get_wrappers())) {
    stream_wrapper_unregister("highlight-names");
}

stream_wrapper_register("highlight-names", "HighlightNamesProtocol");

$highlighted = file_get_contents("highlight-names://story.txt");
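A full protocol is beyond this tutorial, but a deliberately tiny read-only sketch can show the moving parts. Everything here is hypothetical: the UppercaseProtocol class and the upper:// scheme are stand-ins for illustration, not the HighlightNamesProtocol mentioned above, and only a handful of read-related methods are implemented.

```php
<?php
// Hypothetical read-only wrapper: upper-cases a local file as it is read.
class UppercaseProtocol
{
    /** @var resource|null Set by PHP when a context is passed to fopen(). */
    public $context;

    /** @var resource|false Handle to the underlying local file. */
    private $handle = false;

    public function stream_open(string $path, string $mode, int $options, ?string &$openedPath): bool
    {
        // This sketch only supports reading.
        if (strpbrk($mode, "wax+c") !== false) {
            return false;
        }

        // Strip the "scheme://" prefix to get the local path.
        $file = substr($path, strpos($path, "://") + 3);
        $this->handle = fopen($file, "r");

        return $this->handle !== false;
    }

    public function stream_read(int $count)
    {
        $chunk = fread($this->handle, $count);

        // Transform each chunk as it passes through.
        return $chunk === false ? false : strtoupper($chunk);
    }

    public function stream_eof(): bool
    {
        return feof($this->handle);
    }

    public function stream_stat()
    {
        return fstat($this->handle);
    }

    public function stream_close(): void
    {
        fclose($this->handle);
    }
}

if (in_array("upper", stream_get_wrappers())) {
    stream_wrapper_unregister("upper");
}

stream_wrapper_register("upper", UppercaseProtocol::class);
```

With the wrapper registered, a call such as file_get_contents("upper://story.txt") would read story.txt through it. Writing, seeking, and the directory methods are left out, which is exactly the work that makes a complete protocol a larger job.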

Similarly, it's also possible to create custom stream filters. The documentation has an example filter class:

Filter {
    public $filtername;
    public $params;
    public int filter ( resource $in , resource $out , int &$consumed ,
        bool $closing )
    public void onClose ( void )
    public bool onCreate ( void )
}

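After the class is registered under a name with stream_filter_register, it can be attached to a stream just as easily: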

$handle = fopen("story.txt", "w+");
stream_filter_append($handle, "highlight-names", STREAM_FILTER_READ);

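highlight-names needs to match the filtername property of the new filter class. It's also possible to use custom filters in a php://filter/highlight-names/resource=story.txt string. Filters are much easier to define than protocols. One reason is that protocols need to handle directory operations, whereas filters only need to handle each chunk of data.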
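As a sketch of how little a filter needs, here is a minimal one that follows the usual php_user_filter pattern. The UppercaseFilter class and the demo.uppercase name are hypothetical stand-ins; a real highlight-names filter would also have to cope with names that straddle two chunks, which this sketch sidesteps by using a transformation that works byte by byte.

```php
<?php
// Hypothetical filter: upper-cases every chunk that passes through it.
class UppercaseFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        // Data arrives in "buckets"; transform each one and pass it along.
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = strtoupper($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

// The name used here is the one later passed to stream_filter_append
// or embedded in a php://filter string.
stream_filter_register("demo.uppercase", UppercaseFilter::class);

// Demo with a throwaway file.
$path = tempnam(sys_get_temp_dir(), "flt");
file_put_contents($path, "but soft, what light");

$shouted = file_get_contents(
    "php://filter/read=demo.uppercase/resource=" . $path
);

unlink($path);

print $shouted . PHP_EOL;
```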

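If you have the gumption, I strongly encourage you to experiment with creating custom protocols and filters. If you can apply filters to stream_copy_to_stream operations, your applications will use next to no memory even when working with obscenely large files. Imagine writing a resize-image filter or an encrypt-for-application filter.

Summary

Though this isn't a problem we run into frequently, it's easy to mess up when working with large files. In asynchronous applications, it's just as easy to bring the whole server down when we're not careful about memory usage.

This tutorial has hopefully introduced you to a few new ideas (or refreshed your memory of them), so that you can think more about how to read and write large files efficiently. When we become familiar with streams and generators, and stop using functions like file_get_contents, an entire category of errors disappears from our applications. That seems like a good thing to aim for.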

Original English article: https://www.sitepoint.com/performant-reading-big-files-php/

Translated article: https://learnku.com/php/t/39751
