PatchTST: A Breakthrough in Time Series Forecasting (PatchTST = patch time series transformer)
I wondered what could be done with Python and what I consider to be Big Data. I wrote a Python script that read 8,639,828 X and Y values (time series data) into a Pandas DataFrame and then calculated descriptive statistics on the Y channel. All of that was accomplished in only 1.2 seconds. I repeated the test on time series data with 10 million values, and the data was read and the statistics computed in less than 2 seconds. At 50 million values, NumPy was unable to allocate enough memory to read and analyze the data. I see a lot of potential.
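As a rough illustration of that test, the sketch below times reading a large CSV of X/Y values into a Pandas DataFrame and computing descriptive statistics on the Y channel; the file name "timeseries.csv" and the column names are assumptions, not the actual script.

# A minimal sketch of the timing test described above; the file name and
# the column names "X"/"Y" are assumptions.
import time
import pandas as pd

start = time.perf_counter()

# Read millions of X/Y rows into a DataFrame
df = pd.read_csv("timeseries.csv")

# Descriptive statistics (count, mean, std, min, quartiles, max) on the Y channel
stats = df["Y"].describe()

elapsed = time.perf_counter() - start
print(stats)
print(f"Read and analyzed {len(df):,} rows in {elapsed:.2f} s")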
Dask gives you the ability to perform pandas, NumPy, and ML operations on large datasets. Learn how Dask works and how to use it.
pip install dask
# Import the Dask DataFrame module
import dask.dataframe as dd
# Lazily read a large CSV file into the Dask DataFrame ddf
ddf = dd.read_csv("data.csv")
# Check the number of partitions (try to keep each under 100 MB)
ddf.npartitions
Dask handles large arrays too.
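For example, a Dask array can compute statistics over more values than NumPy could allocate at once by working through the data in chunks; the array length and chunk size below are arbitrary and only illustrative.

# A minimal sketch of dask.array on a dataset too large for one NumPy allocation;
# the sizes here are arbitrary examples.
import dask.array as da

# 50 million values split into 5-million-element chunks
x = da.random.random(50_000_000, chunks=5_000_000)

# These build a lazy task graph; nothing is computed yet
mean = x.mean()
std = x.std()

# compute() evaluates the graph chunk by chunk instead of holding the full array in memory
print(mean.compute(), std.compute())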
Datashader is an open-source Python library for visualizing big data (millions of values). It supports columnar data in Pandas or Dask DataFrames, multidimensional arrays in xarray or CuPy, columnar data on the GPU in cuDF, and ragged arrays (via SpatialPandas).
Tutorial: How to Render Huge Datasets in Python through Datashader
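The sketch below shows the basic Datashader pattern of rasterizing millions of points into a single image; the DataFrame columns "X"/"Y", the file names, and the canvas size are assumptions.

import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

# Millions of X/Y rows
df = pd.read_csv("timeseries.csv")

# Aggregate every point onto a fixed-size grid rather than drawing each one
canvas = ds.Canvas(plot_width=1000, plot_height=400)
agg = canvas.points(df, x="X", y="Y")

# Shade the aggregate into an image; intensity reflects point density
img = tf.shade(agg)
img.to_pil().save("timeseries.png")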
How to filter large (GB) CSV files for import into Pandas
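One common approach is to read the CSV in chunks and keep only the rows of interest, so the full file never has to fit in memory at once; the file name, column, and threshold below are placeholders.

import pandas as pd

kept = []
for chunk in pd.read_csv("big_data.csv", chunksize=1_000_000):
    # Filter each 1-million-row chunk before it accumulates in memory
    kept.append(chunk[chunk["Y"] > 100.0])

filtered = pd.concat(kept, ignore_index=True)
print(f"{len(filtered):,} rows kept after filtering")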
Practical SQL for Data Analysis by Haki Benita is loaded with great examples of using SQL to perform many of the tasks you might otherwise do in Pandas, often achieving the same result with less memory and better speed.
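As a sketch of the idea, an aggregation that might otherwise be a Pandas groupby can be pushed into the database so only the small summary comes back; SQLite and the table/column names here are placeholders for whatever database holds the data.

import sqlite3
import pandas as pd

conn = sqlite3.connect("measurements.db")

# The database does the grouping and averaging; only the summary rows
# are pulled into a DataFrame
query = """
    SELECT channel,
           COUNT(*)   AS n,
           AVG(value) AS mean_value
    FROM readings
    GROUP BY channel
"""
summary = pd.read_sql_query(query, conn)
print(summary)
conn.close()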