如何高效地存取巨量的单细胞数据：python包xarray

原创 ComputationalBio Computational Biology 2023-05-19 13:33

微信公众号：Computational Biology
关注生物信息学和计算表观遗传学。问题或建议，请公众号留言。

随着单细胞测序的样本数的增加，新发表的CNS论文的细胞数动辄达到几十万，得到的测序文件单位都是T级别。
那么，面对这么大的数据，如何存取和分析，一直是是一个让人头疼的问题。
比如，人类基因组中，单细胞RNA-Seq的数据中有几万个基因，单细胞DNA甲基化数据有两千多万个CpG位点。
如果要从这几千万个位点中，提取某几个特定区间的位点，或者是只读取某几个指定的样本（细胞），传统的方法需要将整个大矩阵读到内存里面去，再根据行和列去做索引提取。
这么大的文件，如果将数据全部读到内存中，那么会消耗非常大的内存，电脑甚至可能会直接卡死。

今天，我们来介绍xarray这个python包，使用xarray，用户可以快速存取大数据，快速索引和计算。
下面，我们来介绍一下xarray的使用方法：

Import packages

import xarray as xr
import numpy as np
import pandas as pd

Create a DataArray

array=np.random.randn(2, 3)
print(array)
data = xr.DataArray(array, dims=("x", "y"), coords={"x": [10, 20]}) #data, coords, dims, name, attrs
print(data)

[[-1.2099241  -0.63543639  1.43413058]
 [ 0.15186551 -0.1011179  -0.02198635]]
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y

df=pd.Series(range(3), index=list("abc"), name="foo")
print(df)
print(xr.DataArray(df))

a    0
b    1
c    2
Name: foo, dtype: int64
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'

print(data.values)
print(data.dims)
data.attrs

[[-1.2099241  -0.63543639  1.43413058]
 [ 0.15186551 -0.1011179  -0.02198635]]
('x', 'y')



{}

Indexing

#positional indexing
print(data[0, :]) # by integer 
print(data.loc[10]) # by label

#label-based indexing
print(data.isel(x=0)) # by integer
print(data.sel(x=10)) # by label

<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639,  1.43413058])
Coordinates:
    x        int64 10
Dimensions without coordinates: y
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639,  1.43413058])
Coordinates:
    x        int64 10
Dimensions without coordinates: y
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639,  1.43413058])
Coordinates:
    x        int64 10
Dimensions without coordinates: y
<xarray.DataArray (y: 3)>
array([-1.2099241 , -0.63543639,  1.43413058])
Coordinates:
    x        int64 10
Dimensions without coordinates: y

Attributes

print(data.attrs) #attrs is a dict
data.attrs["long_name"] = "random velocity"
data.attrs["units"] = "metres/sec"
data.attrs["description"] = "A random variable created as an example."
data.attrs["random_attribute"] = 123
data.x.attrs["units"] = "x units"
data.attrs

{}





{'long_name': 'random velocity',
 'units': 'metres/sec',
 'description': 'A random variable created as an example.',
 'random_attribute': 123}

data.x.attrs

{'units': 'x units'}

Computation

print(data)
print(data + 10)
print(np.sin(data))
print(data.T)
print(data.sum())
print(data.mean(dim='x'))

<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray (x: 2, y: 3)>
array([[ 8.7900759 ,  9.36456361, 11.43413058],
       [10.15186551,  9.8988821 ,  9.97801365]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
<xarray.DataArray (x: 2, y: 3)>
array([[-0.93558921, -0.59352878,  0.99067576],
       [ 0.15128243, -0.10094567, -0.02198458]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray (y: 3, x: 2)>
array([[-1.2099241 ,  0.15186551],
       [-0.63543639, -0.1011179 ],
       [ 1.43413058, -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray ()>
array(-0.38246865)
<xarray.DataArray (y: 3)>
array([-0.5290293 , -0.36827714,  0.70607211])
Dimensions without coordinates: y

a = xr.DataArray(np.random.randn(3), [data.coords["y"]])

b = xr.DataArray(np.random.randn(4), dims="z")

print(a)

print(b)

print(a + b)

<xarray.DataArray (y: 3)>
array([ 0.03569927, -0.97316472, -1.81326771])
Coordinates:
  * y        (y) int64 0 1 2
<xarray.DataArray (z: 4)>
array([0.81001756, 0.24334516, 0.16533663, 0.65630593])
Dimensions without coordinates: z
<xarray.DataArray (y: 3, z: 4)>
array([[ 0.84571682,  0.27904443,  0.20103589,  0.6920052 ],
       [-0.16314716, -0.72981956, -0.80782809, -0.31685879],
       [-1.00325016, -1.56992255, -1.64793109, -1.15696178]])
Coordinates:
  * y        (y) int64 0 1 2
Dimensions without coordinates: z

print(data)
print(data.T)
print(data-data.T) # don't need to worry about the order of dimension

<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray (y: 3, x: 2)>
array([[-1.2099241 ,  0.15186551],
       [-0.63543639, -0.1011179 ],
       [ 1.43413058, -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray (x: 2, y: 3)>
array([[0., 0., 0.],
       [0., 0., 0.]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y

GroupBy

labels = xr.DataArray(["E", "F", "E"], [data.coords["y"]], name="labels")
print(data)
print(labels)
print(data.groupby(labels).mean("y"))
print(data.groupby(labels).map(lambda x: x - x.min()))

<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'], dtype='<U1')
Coordinates:
  * y        (y) int64 0 1 2
<xarray.DataArray (x: 2, labels: 2)>
array([[ 0.11210324, -0.63543639],
       [ 0.06493958, -0.1011179 ]])
Coordinates:
  * x        (x) int64 10 20
  * labels   (labels) object 'E' 'F'
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray (x: 2, y: 3)>
array([[0.        , 0.        , 2.64405468],
       [1.36178961, 0.53431849, 1.18793775]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y

print(data.groupby(labels).mean())
print(data.mean(dim='x'))

<xarray.DataArray (x: 2, labels: 2)>
array([[ 0.11210324, -0.63543639],
       [ 0.06493958, -0.1011179 ]])
Coordinates:
  * x        (x) int64 10 20
  * labels   (labels) object 'E' 'F'
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray (y: 3)>
array([-0.5290293 , -0.36827714,  0.70607211])
Dimensions without coordinates: y

Plotting

data.plot()

<matplotlib.collections.QuadMesh at 0x1519af16d510>

pandas

series = data.to_series()

print(series)
print(data.to_dataframe(name="Col1"))
df=data.to_pandas()
print(df)
print(df.to_xarray()) # pandas to DataArray, columns are converted to variables, not what we wanted.
# xr.DataArray(df)
print(df.stack()) #to series
print(df.stack().to_xarray()) #series to DataArray
print(series.to_xarray())

x   y
10  0   -1.209924
    1   -0.635436
    2    1.434131
20  0    0.151866
    1   -0.101118
    2   -0.021986
dtype: float64
          Col1
x  y          
10 0 -1.209924
   1 -0.635436
   2  1.434131
20 0  0.151866
   1 -0.101118
   2 -0.021986
y          0         1         2
x                               
10 -1.209924 -0.635436  1.434131
20  0.151866 -0.101118 -0.021986
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 10 20
Data variables:
    0        (x) float64 -1.21 0.1519
    1        (x) float64 -0.6354 -0.1011
    2        (x) float64 1.434 -0.02199
x   y
10  0   -1.209924
    1   -0.635436
    2    1.434131
20  0    0.151866
    1   -0.101118
    2   -0.021986
dtype: float64
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) int64 0 1 2
<xarray.DataArray (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) int64 0 1 2

Datasets

ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))

print(ds)

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 -1.21 -0.6354 1.434 0.1519 -0.1011 -0.02199
    bar      (x) int64 1 2
    baz      float64 3.142

print(ds["foo"])
# or
print(ds.foo)

<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[-1.2099241 , -0.63543639,  1.43413058],
       [ 0.15186551, -0.1011179 , -0.02198635]])
Coordinates:
  * x        (x) int64 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123

Read & write netCDF files

ds.to_netcdf("example.nc")

reopened = xr.open_dataset("example.nc") #open_dataarray()

print(reopened)

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 ...
    bar      (x) int32 ...
    baz      float64 ...

欢迎关注本公众号，后期将持续更新本系列和其它更多的原创生物信息学文章。

往期精选文章

喜欢此内容的人还喜欢

微信扫一扫
关注该公众号