A simple guide for learning bioinformatics

Posted on 2017-07-25 | Category: bioinformatics

Preparation

python

learn-python-the-hard-way

linux

鸟哥的linux私房菜

R

R in action

Practice

Create a suitable directory to hold the files produced at each step.
Keep the scripts used for each step of the analysis.

mission 1

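Quality control of fastq files with FastQC

Data path:

OM-mRNA-dujuanhua-P20161221

Requirements:

Link the data into the current working directory.
Run FastQC on the data for quality control.
Produce a QC summary table with, for each sample, the data volume, Q30, GC content, and duplication level.
Plot the GC content distribution and the duplication rate of the samples.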
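
To make the columns of the QC table concrete, here is a minimal sketch that computes read count, total bases, Q30, and GC content directly from FASTQ text. It assumes four-line records and Phred+33 quality encoding; the function name fastq_stats and the toy reads are made up for illustration, and FastQC's own report remains the reference for this mission.

```python
import io


def fastq_stats(handle):
    """Return read count, base count, Q30 fraction and GC fraction.

    Assumes four-line FASTQ records with Phred+33 qualities.
    """
    reads = bases = q30 = gc = 0
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of input (or a blank line)
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        reads += 1
        bases += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        # Phred+33: quality score = ord(char) - 33
        q30 += sum(1 for char in qual if ord(char) - 33 >= 30)
    return {
        "reads": reads,
        "bases": bases,
        "q30": q30 / bases if bases else 0.0,
        "gc": gc / bases if bases else 0.0,
    }


# Two toy reads; 'I' encodes Q40 and '#' encodes Q2.
toy_fastq = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nII##\n"
stats = fastq_stats(io.StringIO(toy_fastq))
print(stats)
```

For real data the same function could be fed an opened file (gzip.open in text mode for .fastq.gz files).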

mission 2

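Quantify each sample with kallisto

Species:

Chinese name: 家鸡
English name: chicken
Latin name: Gallus gallus

Requirements:

Learn the basic data formats: fastq, fasta, gtf.
Learn the Ensembl database.
Download the annotation files needed for the species.
Build the kallisto index.
Quantify all samples with kallisto.
Merge the per-sample quantification results and plot:
- Box plot and density plot of sample expression.
- Pairwise Pearson correlation between samples, shown as a heatmap.
- PCA plot of sample expression.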
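
The pairwise correlation step can be sketched in plain Python as follows. The sample names and TPM values are invented, and the merged table is assumed to already hold one TPM vector per sample in the same transcript order. In practice R's cor() or numpy would do the same job, and expression values are often log-transformed first.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical merged TPM table: sample name -> TPM per transcript.
tpm = {
    "sampleA": [1.0, 2.0, 3.0, 4.0],
    "sampleB": [2.0, 4.0, 6.0, 8.0],
    "sampleC": [4.0, 3.0, 2.0, 1.0],
}

samples = sorted(tpm)
# Full symmetric matrix, stored as a dict keyed by sample pair.
corr = {(a, b): pearson(tpm[a], tpm[b]) for a in samples for b in samples}
for a in samples:
    print(a, ["%.2f" % corr[(a, b)] for b in samples])
```

The resulting matrix is what the heatmap would display, with samples on both axes.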

mission 3

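Differential expression analysis with edgeR

Requirements:

Learn tximport and edgeR.
Use tximport to import the kallisto results and summarize transcript-level expression to gene level.
Run the differential expression analysis with edgeR.
Plot the results:
- Volcano plot and MA plot.
- Clustered heatmap of the differentially expressed genes.
- Clustered line plot of the differentially expressed genes.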

mission 4

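Enrichment analysis of differentially expressed genes

Requirements:

Learn goseq, topGO, KOBAS, and pathview, and set them up on the server.
Download the GO annotation of the species from the BioMart database (biomaRt can batch-download BioMart data).
Run GO enrichment analysis on the differentially expressed genes with goseq and plot the results.
Draw the GO directed acyclic graph with topGO.
Run KEGG enrichment analysis on the differentially expressed genes with KOBAS and plot the results.
Download KEGG pathway maps with pathview.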
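
To get a feel for what an enrichment p-value means, here is a small sketch of a plain hypergeometric over-representation test. This is a simplification chosen for illustration, not what goseq does internally: goseq additionally corrects for gene-length bias. The function name and the toy numbers are made up.

```python
from math import comb  # available in Python 3.8+


def hypergeom_pvalue(total, annotated, selected, hits):
    """P(X >= hits) when drawing `selected` genes from `total` genes,
    of which `annotated` carry the term of interest."""
    upper = min(annotated, selected)
    numerator = sum(
        comb(annotated, i) * comb(total - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return numerator / comb(total, selected)


# Toy numbers: 10 genes in total, 3 annotated to the term,
# 3 differentially expressed genes, all 3 of them annotated.
p = hypergeom_pvalue(total=10, annotated=3, selected=3, hits=3)
print(p)
```

A small p-value says that the overlap between the differentially expressed genes and the term is larger than expected by chance.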

mission 5

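RNAseq mapping with STAR

Requirements:

Understand the RNAseq mapping process (see this ppt).
Create a directory under your home directory and set up STAR there.
Run the alignment with STAR:
- Build the alignment index for the reference genome.
- Run the alignment.
Summarize the alignment results and plot them.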

mission 6

RNAseq data quality control with RSeQC

Use geneBody_coverage.py to assess the RNA integrity of the samples.
Use inner_distance.py to estimate the insert size of the sequenced fragments.

Advanced analysis

SNP

GATK

Alternative splicing

MISO
rMATS

Co-expression

WGCNA

Novel transcript assembly

StringTie

to be continued

Useful Tools

Biopython

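A Python module for handling many kinds of biological data.

HTSeq

A Python module that can quantify genes from bam files. Its GFF_Reader class makes gtf files easy to process.

cufflinks

The classic transcriptome assembly software, now superseded by StringTie. Its cuffcompare and gffread components can quickly compare gtf files, extract sequences, and so on.

bedtools

A powerful toolkit for analyzing genomic regions.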

to be continued

Websites

RNAseq blog

to be continued

docopt creates beautiful command-line interfaces

Posted on 2017-01-05 | Category: python

What’s docopt

docopt parses command-line arguments according to the help message (generate option parser based on the beautiful help message).

example


"""Naval Fate.
Usage:
  naval_fate.py ship new <name>...
  naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
  naval_fate.py ship shoot <x> <y>
  naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
  naval_fate.py (-h | --help)
  naval_fate.py --version
Options:
  -h --help     Show this screen.
  --version     Show version.
  --speed=<kn>  Speed in knots [default: 10].
  --moored      Moored (anchored) mine.
  --drifting    Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
    arguments = docopt(__doc__, version='Naval Fate 2.0')
    print(arguments)

API

from docopt import docopt
docopt(doc, argv=None, help=True, version=None, options_first=False)

The return value is a simple dictionary with options, arguments and commands as keys, spelled exactly like in your help message. For example:

naval_fate.py ship Guardian move 100 150 --speed=15
{'--drifting': False,    'mine': False,
 '--help': False,        'move': True,
 '--moored': False,      'new': False,
 '--speed': '15',        'remove': False,
 '--version': False,     'set': False,
 '<name>': ['Guardian'], 'ship': True,
 '<x>': '100',           'shoot': False,
 '<y>': '150'}
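
A short sketch of how a script might consume that dictionary. To keep it runnable without docopt installed, the dictionary is hard-coded from the output shown here; in a real script it would come from docopt(__doc__, ...). The message format is invented for illustration.

```python
# Hard-coded copy of the parsed result for:
#   naval_fate.py ship Guardian move 100 150 --speed=15
arguments = {
    '--drifting': False, 'mine': False,
    '--help': False, 'move': True,
    '--moored': False, 'new': False,
    '--speed': '15', 'remove': False,
    '--version': False, 'set': False,
    '<name>': ['Guardian'], 'ship': True,
    '<x>': '100', 'shoot': False,
    '<y>': '150',
}

# Commands are booleans, so they can drive simple dispatch.
if arguments['ship'] and arguments['move']:
    name = arguments['<name>'][0]
    # Argument and option values arrive as strings; convert explicitly.
    x, y = int(arguments['<x>']), int(arguments['<y>'])
    speed = float(arguments['--speed'])
    message = '%s moves to (%d, %d) at %.1f knots' % (name, x, y, speed)
    print(message)
```

Note that <name> is a list because the usage pattern declares it with "...", while <x>, <y> and --speed are plain strings.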

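Help message format

The help message consists of two parts: usage (Usage) and options (Options).

Usage format

The usage pattern is the part of the docstring that starts with usage (case-insensitive) and ends with an empty line.

Its elements are:

<arguments>: arguments are written in upper case or enclosed in angle brackets.
-options: options start with a dash (-). Single-letter options can share one dash, e.g. -vof is equivalent to -v -o -f. An option can take an argument: --input=FILE, -i FILE, or -iFILE.
commands: any other word is treated as a command.

Syntax:

"[ ]" marks optional elements.
"( )" marks required elements; anything not inside "[ ]" is also required.
"|" separates mutually exclusive elements.
"..." means one or more elements.
"[--]": "--" separates options from positional arguments; add "[--]" to the usage pattern to enable it.
"[-]": a single "-" means the script reads from standard input; add "[-]" to the usage pattern to enable it.

Option format

Option descriptions go below the usage pattern. They are needed when:

an option has both a short and a long form,
an option takes an argument,
an option has a default value.

The rules for writing option descriptions are:

Each line starts with - or -- (leading spaces are not counted).
Attach an option's argument with a space or an equals sign. The argument is written in upper case or enclosed in angle brackets. A comma may separate the short and long forms, e.g.:
-o FILE --output=FILE # without comma, with "=" sign
-i <file>, --input <file> # with comma, without "=" sign
Use two or more spaces to separate an option from its description.

A default value is written as "[default: <my-default-value>]", e.g.:

--coefficient=K  The K coefficient [default: 2.95]
--output=FILE    Output file [default: test.txt]
--directory=DIR  Some directory [default: ./]

If the option is not repeatable, the default is treated as a string; otherwise the default is split on whitespace into a list.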

String formatting

Posted on 2016-12-16 | Category: python

BASIC

Calling by position and by keyword


template = '{0}, {1} and {2}'
template.format('a', 'b', 'c')  ## by position
template = '{motto}, {pork} and {food}'
template.format(motto = 'spam', pork = 'ham', food = 'eggs') ## by keyword
template = '{motto}, {0} and {food}'
template.format('ham', motto = 'spam', food = 'eggs') ## by both

Adding keys, attributes, and offsets

Keys and attributes

import sys
'My {1[spam]} runs {0.platform}'.format(sys, {'spam':'laptop'})

Offsets (only non-negative offsets work inside the format string syntax)

somelist = list('SPAM')
'first={0[0]}, third={0[2]}'.format(somelist)
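
A quick sketch of that restriction, written for Python 3: a negative offset inside the braces is not parsed as an integer, so indexing a list with it fails. The variable names are only for illustration.

```python
somelist = list('SPAM')

# Non-negative integer offsets work inside the replacement field.
first_and_third = 'first={0[0]}, third={0[2]}'.format(somelist)
print(first_and_third)  # first=S, third=A

# A negative offset is passed to the list as the string '-1',
# which raises TypeError.
try:
    'last={0[-1]}'.format(somelist)
    negative_ok = True
except TypeError:
    negative_ok = False
print(negative_ok)  # False

# Workaround: index outside the format string and pass the result in.
last = 'last={0}'.format(somelist[-1])
print(last)  # last=M
```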

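Adding specific formatting

{fieldname!conversionflag:formatspec}

fieldname: a number or keyword that identifies the argument.
conversionflag: r, s, or a, which call the built-in repr, str, or ascii function on the value, respectively.
formatspec: specifies how the value is presented, including field width, alignment, zero padding, decimal precision, and so on, in the form [[fill]align][sign][#][0][width][.precision][typecode]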
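
A few concrete replacement fields may help in reading that template. The values are arbitrary and the variable names a to d exist only for this sketch.

```python
# Field 0, no conversion, spec: right-aligned, width 10, 2 decimals.
a = '{0:>10.2f}'.format(3.14159)
print(repr(a))  # '      3.14'

# Keyword field with the !r conversion (calls repr on the value).
b = '{word!r}'.format(word='spam')
print(b)  # 'spam'

# Fill character '*', centered, width 9.
c = '{0:*^9}'.format('ham')
print(c)  # ***ham***

# Explicit sign, zero padding, width 8, 3 decimals.
d = '{0:+08.3f}'.format(2.5)
print(d)  # +002.500
```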

ConfigParser

Posted on 2016-11-16 | Category: python

basic

import ConfigParser
config = ConfigParser.ConfigParser()
config.read('example.cfg')
# Set the third, optional argument of get to 1 if you wish to use raw mode.
print config.get('Section1', 'foo', 0)  # -> "Python is fun!"
print config.get('Section1', 'foo', 1)  # -> "%(bar)s is %(baz)s!"
# The optional fourth argument is a dict with members that will take
# precedence in interpolation.
print config.get('Section1', 'foo', 0, {'bar': 'Documentation',
                                        'baz': 'evil'})

advanced

getboolean()

getboolean() converts a value to a Boolean.

For example:

[section1]
option1 = 0

Calling getboolean('section1', 'option1') returns False.
1/0, yes/no, true/false, and on/off are converted accordingly.
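
The same behaviour can be tried without a file. This sketch uses the Python 3 module name configparser and its read_string() method rather than the Python 2 ConfigParser used elsewhere in this post; option2 to option5 are made-up options.

```python
import configparser  # Python 3 name of the ConfigParser module

config = configparser.ConfigParser()
# read_string() parses configuration text directly, handy for a demo.
config.read_string("""
[section1]
option1 = 0
option2 = yes
option3 = Off
option4 = true
""")

values = {name: config.getboolean('section1', name)
          for name in ('option1', 'option2', 'option3', 'option4')}
print(values)

# Anything outside the recognized spellings raises ValueError.
config.set('section1', 'option5', 'maybe')
try:
    config.getboolean('section1', 'option5')
    raised = False
except ValueError:
    raised = True
print(raised)
```

The matching is case-insensitive, which is why Off is accepted.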

[DEFAULT]

When an option is not found in the requested section, it is looked up in the [DEFAULT] section.
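
A minimal sketch of the fallback, again using Python 3's configparser and read_string(); the section and option names are invented.

```python
import configparser  # Python 3 name of the ConfigParser module

config = configparser.ConfigParser()
config.read_string("""
[DEFAULT]
host = localhost
port = 3306

[db1]
port = 3307
""")

# 'host' is not defined in [db1], so the [DEFAULT] value is used.
host = config.get('db1', 'host')
# 'port' is defined in both; the section's own value wins.
port = config.get('db1', 'port')
print(host, port)
```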

ConfigParser supports string interpolation

format.conf looks like this:

[DEFAULT]
conn_str = %(dbn)s://%(user)s:%(pw)s@%(host)s:%(port)s/%(db)s
dbn = mysql
user = root
host = localhost
port = 3306
[db1]
user = aaa
pw = ppp
db = example

Python script:

import ConfigParser
conf = ConfigParser.ConfigParser()
conf.read('format.conf')
print conf.get('db1', 'conn_str')

Running the script prints:

mysql://aaa:ppp@localhost:3306/example
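
For readers on Python 3, here is a rough equivalent that needs no file on disk. It assumes the module name configparser and feeds the configuration in through read_string(). The template in this sketch spells out %(user)s:%(pw)s so that both values appear in the result.

```python
import configparser

FORMAT_CONF = """
[DEFAULT]
conn_str = %(dbn)s://%(user)s:%(pw)s@%(host)s:%(port)s/%(db)s
dbn = mysql
user = root
host = localhost
port = 3306

[db1]
user = aaa
pw = ppp
db = example
"""

conf = configparser.ConfigParser()
conf.read_string(FORMAT_CONF)

# Interpolated lookup: placeholders are filled from [db1], then [DEFAULT].
conn = conf.get('db1', 'conn_str')
print(conn)

# raw=True returns the template without interpolation.
template = conf.get('db1', 'conn_str', raw=True)
print(template)
```

The user value comes from [db1] (aaa) rather than [DEFAULT] (root), because a section's own options take precedence during interpolation.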

tips

Use # or ; to write comments in a config file.

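The powerful sort

Posted on 2016-10-17 | Category: python

Syntax

sorted(iterable[, cmp[, key[, reverse]]])
sort([cmp[, key[, reverse]]])

cmp: a comparison function that takes two comparable arguments (Python 2 only)
key: a one-argument function that extracts a comparison value from each element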
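
The difference between the two is easier to see side by side. This sketch targets Python 3, where the cmp parameter no longer exists, so an old-style comparison function has to be wrapped with functools.cmp_to_key; the word list and function name are made up.

```python
from functools import cmp_to_key

words = ['banana', 'Apple', 'cherry']

# key: a one-argument function, called once per element.
by_key = sorted(words, key=str.lower)


# cmp-style: a two-argument function returning negative, zero or positive.
def compare_ignoring_case(a, b):
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)


# cmp_to_key adapts the old-style function to the key parameter.
by_cmp = sorted(words, key=cmp_to_key(compare_ignoring_case))

print(by_key)
print(by_cmp)
```

Both calls give the same order; the key version is shorter and does less work per comparison.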

In practice

Sorting a dictionary

phonebook = {'Linda': '7790', 'Bob':'9876', 'Carol':'6754'}
from operator import itemgetter
sorted_pb = sorted(phonebook.iteritems(), key = itemgetter(1)) ## sort the (name, number) pairs by number

Sorting a multi-dimensional list

from operator import itemgetter
gameresult = [['Bob', 95, 'A'], ['Alan', 86, 'B']] ## each entry is the student's name, score, and grade
sorted(gameresult, key = itemgetter(2, 1))

Sorts by grade; entries with the same grade are sorted by score.

Sorting a dictionary whose values are lists

Sort the dictionary my_dict by m, the second element of each [n, m] value, in ascending order.

my_dict = { 'Li':['A',1],
            'Zhang':['B',2]}
from operator import itemgetter
sorted(my_dict.iteritems(), key = lambda (k,v): itemgetter(1)(v))

Sorting a list of dictionaries

Sort the list by rating, then by name.

gameresult = [{"name":"Bob", "wins":10, "losses":3, "rating":75},
              {"name":"David", "wins":3, "losses":10, "rating":50}]
from operator import itemgetter
sorted(gameresult, key = itemgetter("rating", 'name'))
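
The examples in this post are written for Python 2. As a sketch, the same sorts in Python 3 might look like the following: items() replaces iteritems(), and the lambda indexes the pair because tuple parameters are no longer allowed. The entries 'Wang' and 'Alice' are added here only to make the ordering visible.

```python
from operator import itemgetter

phonebook = {'Linda': '7790', 'Bob': '9876', 'Carol': '6754'}
# Sort (name, number) pairs by number.
sorted_pb = sorted(phonebook.items(), key=itemgetter(1))

my_dict = {'Li': ['A', 1], 'Zhang': ['B', 2], 'Wang': ['C', 0]}
# Each item is a (key, [n, m]) pair; sort by m.
sorted_md = sorted(my_dict.items(), key=lambda item: item[1][1])

gameresult = [
    {"name": "Bob", "wins": 10, "losses": 3, "rating": 75},
    {"name": "David", "wins": 3, "losses": 10, "rating": 50},
    {"name": "Alice", "wins": 7, "losses": 6, "rating": 75},
]
# Sort by rating first, then by name to break ties.
sorted_gr = sorted(gameresult, key=itemgetter("rating", "name"))

print(sorted_pb)
print(sorted_md)
print([player["name"] for player in sorted_gr])
```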

Tips

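sort() works in place without copying the list, so it uses less memory and is more efficient; choose sort() when the original order does not need to be kept.
key is more efficient than cmp.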
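
The in-place behaviour is easy to check with a small sketch; the score list is arbitrary.

```python
scores = [86, 95, 70]

# sorted() returns a new list and leaves the original untouched.
ranked = sorted(scores, reverse=True)
print(scores, ranked)

# list.sort() reorders the list in place and returns None.
result = scores.sort()
print(scores, result)
```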

my-first-blog

Posted on 2016-10-16 | Category: test

Writing high-quality code