如何高效地存取巨量的单细胞数据:python包xarray

ComputationalBio Computational Biology 2023-05-19 13:33 发表于美国

微信公众号:Computational Biology

关注生物信息学和计算表观遗传学。问题或建议,请公众号留言。

随着单细胞测序的样本数的增加,新发表的CNS论文的细胞数动辄达到几十万,得到的测序文件单位都是T级别
那么,面对这么大的数据,如何存取和分析,一直是是一个让人头疼的问题。
比如,人类基因组中,单细胞RNA-Seq的数据中有几万个基因,单细胞DNA甲基化数据有两千多万个CpG位点。
如果要从这几千万个位点中,提取某几个特定区间的位点,或者是只读取某几个指定的样本(细胞),传统的方法需要将整个大矩阵读到内存里面去,再根据行和列去做索引提取。
这么大的文件,如果将数据全部读到内存中,那么会消耗非常大的内存,电脑甚至可能会直接卡死。

图片

今天,我们来介绍xarray这个python包,使用xarray,用户可以快速存取大数据,快速索引和计算。
下面,我们来介绍一下xarray的使用方法:

Import packages

import xarray as xr
import numpy as np
import pandas as pd

Create a DataArray

array=np.random.randn(23)
print(array)
data = xr.DataArray(array, dims=("x""y"), coords={"x": [1020]}) #data, coords, dims, name, attrs
print(data)
[[-1.2099241  -0.63543639  1.43413058]
[ 0.15186551 -0.1011179 -0.02198635]]
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
df=pd.Series(range(3), index=list("abc"), name="foo")
print(df)
print(xr.DataArray(df))
a    0
b 1
c 2
Name: foo, dtype: int64
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
* dim_0 (dim_0) object 'a' 'b' 'c'
print(data.values)
print(data.dims)
data.attrs
[[-1.2099241  -0.63543639  1.43413058]
[ 0.15186551 -0.1011179 -0.02198635]]
('x', 'y')



{}

Indexing

#positional indexing
print(data[0, :]) # by integer 
print(data.loc[10]) # by label

#label-based indexing
print(data.isel(x=0)) # by integer
print(data.sel(x=10)) # by label
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639, 1.43413058])
Coordinates:
x int64 10
Dimensions without coordinates: y
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639, 1.43413058])
Coordinates:
x int64 10
Dimensions without coordinates: y
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639, 1.43413058])
Coordinates:
x int64 10
Dimensions without coordinates: y
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639, 1.43413058])
Coordinates:
x int64 10
Dimensions without coordinates: y

Attributes

print(data.attrs) #attrs is a dict
data.attrs["long_name"] = "random velocity"
data.attrs["units"] = "metres/sec"
data.attrs["description"] = "A random variable created as an example."
data.attrs["random_attribute"] = 123
data.x.attrs["units"] = "x units"
data.attrs
{}





{'long_name': 'random velocity',
'units': 'metres/sec',
'description': 'A random variable created as an example.',
'random_attribute': 123}
data.x.attrs
{'units': 'x units'}

Computation

print(data)
print(data + 10)
print(np.sin(data))
print(data.T)
print(data.sum())
print(data.mean(dim='x'))
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray (x: 2, y: 3)>
array([[ 8.7900759 , 9.36456361, 11.43413058],
[10.15186551, 9.8988821 , 9.97801365]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
<xarray.DataArray (x: 2, y: 3)>
array([[-0.93558921, -0.59352878, 0.99067576],
[ 0.15128243, -0.10094567, -0.02198458]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray (y: 3, x: 2)>
array([[-1.2099241 , 0.15186551],
[-0.63543639, -0.1011179 ],
[ 1.43413058, -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray ()>
array(-0.38246865)
<xarray.DataArray (y: 3)>
array([-0.5290293 , -0.36827714, 0.70607211])
Dimensions without coordinates: y
a = xr.DataArray(np.random.randn(3), [data.coords["y"]])

b = xr.DataArray(np.random.randn(4), dims="z")

print(a)

print(b)

print(a + b)
<xarray.DataArray (y: 3)>
array([ 0.03569927, -0.97316472, -1.81326771])
Coordinates:
* y (y) int64 0 1 2
<xarray.DataArray (z: 4)>
array([0.81001756, 0.24334516, 0.16533663, 0.65630593])
Dimensions without coordinates: z
<xarray.DataArray (y: 3, z: 4)>
array([[ 0.84571682, 0.27904443, 0.20103589, 0.6920052 ],
[-0.16314716, -0.72981956, -0.80782809, -0.31685879],
[-1.00325016, -1.56992255, -1.64793109, -1.15696178]])
Coordinates:
* y (y) int64 0 1 2
Dimensions without coordinates: z
print(data)
print(data.T)
print(data-data.T) # don't need to worry about the order of dimension
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray (y: 3, x: 2)>
array([[-1.2099241 , 0.15186551],
[-0.63543639, -0.1011179 ],
[ 1.43413058, -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray (x: 2, y: 3)>
array([[0., 0., 0.],
[0., 0., 0.]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y

GroupBy

labels = xr.DataArray(["E""F""E"], [data.coords["y"]], name="labels")
print(data)
print(labels)
print(data.groupby(labels).mean("y"))
print(data.groupby(labels).map(lambda x: x - x.min()))
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'], dtype='<U1')
Coordinates:
* y (y) int64 0 1 2
<xarray.DataArray (x: 2, labels: 2)>
array([[ 0.11210324, -0.63543639],
[ 0.06493958, -0.1011179 ]])
Coordinates:
* x (x) int64 10 20
* labels (labels) object 'E' 'F'
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray (x: 2, y: 3)>
array([[0. , 0. , 2.64405468],
[1.36178961, 0.53431849, 1.18793775]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
print(data.groupby(labels).mean())
print(data.mean(dim='x'))
<xarray.DataArray (x: 2, labels: 2)>
array([[ 0.11210324, -0.63543639],
[ 0.06493958, -0.1011179 ]])
Coordinates:
* x (x) int64 10 20
* labels (labels) object 'E' 'F'
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray (y: 3)>
array([-0.5290293 , -0.36827714, 0.70607211])
Dimensions without coordinates: y

Plotting

data.plot()
<matplotlib.collections.QuadMesh at 0x1519af16d510>
图片

pandas

series = data.to_series()

print(series)
print(data.to_dataframe(name="Col1"))
df=data.to_pandas()
print(df)
print(df.to_xarray()) # pandas to DataArray, columns are converted to variables, not what we wanted.
# xr.DataArray(df)
print(df.stack()) #to series
print(df.stack().to_xarray()) #series to DataArray
print(series.to_xarray())
x   y
10 0 -1.209924
1 -0.635436
2 1.434131
20 0 0.151866
1 -0.101118
2 -0.021986
dtype: float64
Col1
x y
10 0 -1.209924
1 -0.635436
2 1.434131
20 0 0.151866
1 -0.101118
2 -0.021986
y 0 1 2
x
10 -1.209924 -0.635436 1.434131
20 0.151866 -0.101118 -0.021986
<xarray.Dataset>
Dimensions: (x: 2)
Coordinates:
* x (x) int64 10 20
Data variables:
0 (x) float64 -1.21 0.1519
1 (x) float64 -0.6354 -0.1011
2 (x) float64 1.434 -0.02199
x y
10 0 -1.209924
1 -0.635436
2 1.434131
20 0 0.151866
1 -0.101118
2 -0.021986
dtype: float64
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
* y (y) int64 0 1 2
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
* y (y) int64 0 1 2

Datasets

ds = xr.Dataset(dict(foo=data, bar=("x", [12]), baz=np.pi))

print(ds)
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Data variables:
foo (x, y) float64 -1.21 -0.6354 1.434 0.1519 -0.1011 -0.02199
bar (x) int64 1 2
baz float64 3.142
print(ds["foo"])
# or
print(ds.foo)
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639, 1.43413058],
[ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
* x (x) int64 10 20
Dimensions without coordinates: y
Attributes:
long_name: random velocity
units: metres/sec
description: A random variable created as an example.
random_attribute: 123

Read & write netCDF files

ds.to_netcdf("example.nc")

reopened = xr.open_dataset("example.nc"#open_dataarray()

print(reopened)
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int32 10 20
Dimensions without coordinates: y
Data variables:
foo (x, y) float64 ...
bar (x) int32 ...
baz float64 ...

欢迎关注本公众号,后期将持续更新本系列和其它更多的原创生物信息学文章。


往期精选文章


微信扫一扫
关注该公众号