(Solved) What should I use in Python to process a 300 MB CSV file?

Last edited by bongbong3481 on 2023-8-30 19:12

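I have a 300 MB CSV file that needs to be filtered, then grouped and summed.

The file is too big for Excel to open. I read online that pandas can handle it, but some people say it is slow. Besides pandas, are there other tools that can do this?

Result: I went with pandas after all. A few lines of code did the job, and it ran extremely fast, finishing in under 1 minute.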
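
For reference, here is a minimal sketch of what a "few lines" of pandas for filter, group, and sum might look like. The poster's actual code and schema were not shared, so the column names `region`, `status`, and `amount` and the in-memory sample data are assumptions made purely for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the 300 MB file;
# the column names are assumptions, not the poster's real schema.
sample_csv = io.StringIO(
    "region,status,amount\n"
    "HK,paid,100\n"
    "HK,void,50\n"
    "KLN,paid,70\n"
    "HK,paid,30\n"
)

# For a real file, pass its path instead of the StringIO object.
df = pd.read_csv(sample_csv)

# Filter the rows first, then group and sum.
paid = df[df["status"] == "paid"]
totals = paid.groupby("region", as_index=False)["amount"].sum()

# With no path given, to_csv returns the CSV text;
# index=False keeps the row index out of the output.
result_csv = totals.to_csv(index=False)
print(result_csv)
```

To write the summary to disk instead, pass a file name to `to_csv`, for example `totals.to_csv("summary.csv", index=False)` (the file name is also hypothetical).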

300 MB loaded from an SSD is basically read in about a second.
What counts as slow? How much throughput do you need? Have you tried it yet?

https://www.w3schools.com/python/pandas/pandas_csv.asp

A pandas DataFrame also makes any later summary or analysis convenient.
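
One rough way to "try it" is to time a raw file read and work out the throughput. The sketch below uses a small throwaway file (about 5 MB, an assumption to keep the demo quick) rather than the real 300 MB file. A freshly written file is usually served from the operating system's cache, so the number it prints is optimistic; pointing the same timing code at the real file gives a more meaningful figure.

```python
import os
import tempfile
import time

# Write a small throwaway CSV-like file (12 bytes per line, 400,000 lines).
line = b"HK,paid,100\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(line * 400_000)
    path = tmp.name

# Time a plain binary read of the whole file.
start = time.perf_counter()
with open(path, "rb") as f:
    size_bytes = len(f.read())
elapsed = time.perf_counter() - start

# Clean up the temporary file.
os.remove(path)

size_mb = size_bytes / 1_000_000
print(f"read {size_mb:.1f} MB in {elapsed:.4f} s")
```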


300 MB loaded from an SSD is basically read in about a second.
What counts as slow? How much throughput do you need? Have you tried it yet?

Pandas / datafr ...
LoneGumMan posted on 2023-8-29 23:01



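Actually, I am still learning pandas and have not used it for real yet, so I have not seen how it performs.
I do not know where the slowness is supposed to be either. I just need to filter and group, then generate an Excel or CSV file, and that is the whole job.

If the file can be exported within 5 minutes, that is acceptable to me.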


Reply to 2# LoneGumMan


I just did it with pandas today. It was extremely fast: processing finished and the CSV file was exported in under 1 minute.


Reply to 4# bongbong3481


pandas is C under the hood, which is why it is fast.


For large CSV files, Python offers several alternatives to pandas. pandas is a popular library for data manipulation and analysis, but by default it loads the whole file into memory, so it can slow down or fail once a file approaches the size of the available RAM. Here are a few options you can consider:

1. **Dask**: Dask is a flexible parallel computing library that integrates well with pandas. It can handle larger-than-memory datasets by performing operations in smaller, more manageable chunks. The `dask.dataframe` module offers a pandas-like API for filtering and grouping your CSV data; it is lazy, so results are only computed when you call `.compute()`.

2. **Modin**: Modin is another library that aims to provide pandas-like functionality while scaling to larger datasets. It leverages parallel and distributed computing engines, such as Dask or Ray, to speed up data processing. You can try using `modin.pandas` as a replacement for pandas and see if it improves the performance.

3. **Apache Arrow**: Apache Arrow is an in-memory columnar data format that provides a high-performance interface for working with large datasets. You can use the `pyarrow` library in Python to read and process CSV files, leveraging its efficient memory utilization and vectorized operations.

4. **Chunked Reading**: Instead of loading the entire CSV file into memory, you can read and process it in smaller pieces. The `csv` module in Python's standard library reads the file row by row, and pandas itself can read in fixed-size chunks through the `chunksize` parameter of `read_csv`. Processing the data incrementally keeps the memory footprint small, which matters most when the file does not fit in RAM.

5. **Database Solutions**: If the data manipulation task involves complex filtering and grouping, you might consider importing the CSV file into a database management system like PostgreSQL or SQLite. These systems are optimized for handling large datasets and offer powerful querying capabilities. You can use libraries like `psycopg2` or `sqlite3` in Python to interact with the database and perform the required operations.
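
To illustrate the incremental approach from item 4, here is a minimal sketch that streams rows with the standard-library `csv` module and keeps only running totals in memory. The column names `region`, `status`, and `amount` and the sample data are assumptions for illustration, not the original poster's schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; for a real file use
# open(path, newline="", encoding="utf-8") instead of StringIO.
sample_csv = io.StringIO(
    "region,status,amount\n"
    "HK,paid,100\n"
    "HK,void,50\n"
    "KLN,paid,70\n"
    "HK,paid,30\n"
)

# Running sum per group; missing keys start at 0.0.
totals = defaultdict(float)

# DictReader yields one row at a time, so the whole file
# is never held in memory at once.
for row in csv.DictReader(sample_csv):
    if row["status"] == "paid":  # filter
        totals[row["region"]] += float(row["amount"])  # group and sum

print(dict(totals))
```

This trades speed for a near-constant memory footprint: the loop runs in pure Python, so it will generally be slower than pandas on a file that fits comfortably in RAM.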

Remember to benchmark the performance of different approaches and choose the one that best suits your specific requirements in terms of processing speed, memory usage, and ease of implementation.
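
As one possible illustration of the database option in item 5, the sketch below loads CSV rows into an in-memory SQLite database with the standard-library `sqlite3` module and lets SQL do the filtering and grouping. The table name `sales`, the column names, and the sample data are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real CSV file.
sample_csv = io.StringIO(
    "region,status,amount\n"
    "HK,paid,100\n"
    "HK,void,50\n"
    "KLN,paid,70\n"
    "HK,paid,30\n"
)

# ":memory:" keeps the database in RAM; use a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, status TEXT, amount REAL)")

# Stream rows from the CSV reader straight into the table.
reader = csv.DictReader(sample_csv)
conn.executemany(
    "INSERT INTO sales (region, status, amount) VALUES (?, ?, ?)",
    ((r["region"], r["status"], float(r["amount"])) for r in reader),
)
conn.commit()

# Filter with WHERE, then group and sum with GROUP BY.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE status = ? GROUP BY region ORDER BY region",
    ("paid",),
).fetchall()
conn.close()

print(rows)
```

For a one-off filter-and-sum on a 300 MB file this is probably more setup than needed, but it can pay off when the same data will be queried repeatedly.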


Reply to 6# s20012797

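Before I started using Python, I kept hearing people say that Python can handle big data. I did not believe it back then (I assumed it would be like Excel), but now I do. I never expected there to be so many tools.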


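Reply to s20012797

Before I started using Python, I kept hearing people say that Python can handle big data. I did not believe it back then (I assumed it would be like Excel), ...
bongbong3481 posted on 2023/8/31 05:54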


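Java, without any extra tools, can already do 80~90% of what Python plus the tools listed above can do. Apart from being easy to pick up, Python does not seem to have much of an edge.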


Reply to 8# s20012797

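I think that, besides being easy to pick up, Python has simple syntax and a very large number and variety of tools, so programs can be developed very quickly; a few lines of code can do quite complicated things. The error messages are also simple and clear, which makes debugging easy.

I only use the built-in Windows Notepad plus CMD, and I can still quickly build what I want.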


Since 2010 I have had to do ETL every now and then. I could not use SSIS, so I quietly used Talend Studio instead. Time to get my feet wet and try Python.
