Juhász Judit / python_graph_minhash
Commit a5ff07c7 authored Nov 20, 2023 by Ligeti Balázs
SRA assemblies
parent
abbabdb3
Branches
assemblyinfo
No related tags found
No related merge requests found
Showing 2 changed files with 10306 additions and 1 deletion
bin/assembly_example.ipynb: 474 additions, 1 deletion
data/clean_all_phage_sra_metadata_spades.csv: 9832 additions, 0 deletions
bin/assembly_example.ipynb
+474 −1
@@ -60,16 +60,489 @@
" return expected_assembly_folder"
]
},
{
"cell_type": "markdown",
"id": "0cd4c7e1-f333-4450-8799-711675d2ec1f",
"metadata": {},
"source": [
"## Init"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
"id": "8730f244-ed8e-4ec4-bea9-30419b206614",
"metadata": {},
"outputs": [],
"source": [
"## Loading the database\n",
"phage_sequences_dir = '/home/ligeti/NAR2022Phage/data'\n",
"assembly_basedir = '/scratch/behemoth00/nar_assembly/assembly'\n",
"phage_mapping_file = '/home/ligeti/NAR2022Phage/phage_bac_mapping.tsv'\n",
"assembly_file = '../data/clean_all_phage_sra_metadata_spades.csv'\n",
"\n",
"phage_mapping = pd.read_csv(phage_mapping_file, sep='\\t')\n",
"assembly_info = pd.read_csv(assembly_file)\n",
"\n",
"# Only assemblies with batch\n",
"assembly_info = assembly_info[~assembly_info['srr_batch'].isnull()]\n",
"\n",
"assembly_batch_mapping = assembly_info[['Run', 'srr_batch']]\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "63646c56-3597-4609-bed1-b967cfc4a4fa",
"metadata": {},
"source": [
"## Usage\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "192b81e9-1969-49bd-864a-2e3c279344e9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"assemblydir: /scratch/behemoth00/nar_assembly/assembly/srr_batch_023/ERR3515864\n"
]
}
],
"source": [
"act_runid = 'ERR3515864'\n",
"\n",
"assemblydir = get_assembly_folder(act_runid, assembly_batch_mapping, assembly_basedir)\n",
"print(f'assemblydir: {assemblydir}')\n",
"\n",
"phage_names = get_phage_filenames_fromdb(phage_mapping, act_runid)\n",
"phage_fasta_files = get_phage_fasta_files(phage_names, phage_sequences_dir)\n",
"\n",
"sequences = get_phage_sequences(phage_fasta_files)\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ab363fe3-5157-4a70-8f08-dbbb34860e95",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Run</th>\n",
" <th>ReleaseDate</th>\n",
" <th>LoadDate</th>\n",
" <th>spots</th>\n",
" <th>bases</th>\n",
" <th>spots_with_mates</th>\n",
" <th>avgLength</th>\n",
" <th>size_MB</th>\n",
" <th>AssemblyName</th>\n",
" <th>...</th>\n",
" <th>Histological_Type</th>\n",
" <th>Body_Site</th>\n",
" <th>CenterName</th>\n",
" <th>Submission</th>\n",
" <th>dbgap_study_accession</th>\n",
" <th>Consent</th>\n",
" <th>RunHash</th>\n",
" <th>ReadHash</th>\n",
" <th>spades_time</th>\n",
" <th>srr_batch</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>203</td>\n",
" <td>ERR829817</td>\n",
" <td>2015-03-25 11:26:03</td>\n",
" <td>2016-03-08 20:05:11</td>\n",
" <td>2023883</td>\n",
" <td>404776600</td>\n",
" <td>2023883</td>\n",
" <td>200</td>\n",
" <td>252</td>\n",
" <td>GCA_000011505.1</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>THE WELLCOME TRUST SANGER INSTITUTE</td>\n",
" <td>ERA422299</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>D93140F4BA1DA8869ABD791AB6975A6A</td>\n",
" <td>1C9478B92E0D843598618563F3F00F04</td>\n",
" <td>194.0</td>\n",
" <td>srr_batch_027</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>204</td>\n",
" <td>ERR829820</td>\n",
" <td>2015-03-25 11:26:03</td>\n",
" <td>2016-03-08 20:08:18</td>\n",
" <td>2155402</td>\n",
" <td>431080400</td>\n",
" <td>2155402</td>\n",
" <td>200</td>\n",
" <td>267</td>\n",
" <td>GCA_000011505.1</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>THE WELLCOME TRUST SANGER INSTITUTE</td>\n",
" <td>ERA422299</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>C7C230BE779F3B292B0E5F9C12CBB1FA</td>\n",
" <td>D120886C5C01F5809F3DEFDF45A195FD</td>\n",
" <td>173.0</td>\n",
" <td>srr_batch_027</td>\n",
" </tr>\n",
" <tr>\n",
" <th>821</th>\n",
" <td>821</td>\n",
" <td>ERR900483</td>\n",
" <td>2015-05-27 06:33:11</td>\n",
" <td>2016-01-18 00:49:41</td>\n",
" <td>3938509</td>\n",
" <td>787701800</td>\n",
" <td>3938509</td>\n",
" <td>200</td>\n",
" <td>479</td>\n",
" <td>GCA_000011505.1</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>THE WELLCOME TRUST SANGER INSTITUTE</td>\n",
" <td>ERA441502</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>1EA82E1DC3866DD5D45C960F4A676A27</td>\n",
" <td>D246A763EFA059F03A4D20472B506585</td>\n",
" <td>355.0</td>\n",
" <td>srr_batch_028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>822</th>\n",
" <td>822</td>\n",
" <td>ERR900489</td>\n",
" <td>2015-05-27 06:33:12</td>\n",
" <td>2016-01-17 17:31:13</td>\n",
" <td>2375440</td>\n",
" <td>475088000</td>\n",
" <td>2375440</td>\n",
" <td>200</td>\n",
" <td>291</td>\n",
" <td>GCA_000011505.1</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>THE WELLCOME TRUST SANGER INSTITUTE</td>\n",
" <td>ERA441502</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>C354AB0C14C86D98AED8AAC9ABB193FD</td>\n",
" <td>F8758BB433CA8433490545D81FC74515</td>\n",
" <td>219.0</td>\n",
" <td>srr_batch_028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>823</th>\n",
" <td>823</td>\n",
" <td>ERR900490</td>\n",
" <td>2015-05-27 06:33:12</td>\n",
" <td>2016-01-18 07:03:14</td>\n",
" <td>2239366</td>\n",
" <td>447873200</td>\n",
" <td>2239366</td>\n",
" <td>200</td>\n",
" <td>274</td>\n",
" <td>GCA_000011505.1</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>THE WELLCOME TRUST SANGER INSTITUTE</td>\n",
" <td>ERA441502</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>2836DEEF81E0C7EAF2EEB12A51AB8B53</td>\n",
" <td>C5468195F81B3A3497F899B70493F00C</td>\n",
" <td>308.0</td>\n",
" <td>srr_batch_028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9825</th>\n",
" <td>9825</td>\n",
" <td>ERR3491282</td>\n",
" <td>2019-08-29 08:12:06</td>\n",
" <td>2019-08-29 16:43:37</td>\n",
" <td>914113</td>\n",
" <td>416231752</td>\n",
" <td>914113</td>\n",
" <td>455</td>\n",
" <td>238</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>UNIVERSITY OF COPENHAGEN</td>\n",
" <td>ERA2101231</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>F26FBF6EA4E164DA80ABC8EE4F3E7A5C</td>\n",
" <td>39DA443AD578F0F7D23F838140254516</td>\n",
" <td>254.0</td>\n",
" <td>srr_batch_021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9826</th>\n",
" <td>9826</td>\n",
" <td>ERR3491285</td>\n",
" <td>2019-08-29 08:12:06</td>\n",
" <td>2019-08-29 16:44:22</td>\n",
" <td>794175</td>\n",
" <td>357519189</td>\n",
" <td>794175</td>\n",
" <td>450</td>\n",
" <td>205</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>UNIVERSITY OF COPENHAGEN</td>\n",
" <td>ERA2101231</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>2F79FBDAA96EBCEC5A41EB3BCFBA95AF</td>\n",
" <td>8C590944290B059036A3EA93B409AC56</td>\n",
" <td>224.0</td>\n",
" <td>srr_batch_021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9827</th>\n",
" <td>9827</td>\n",
" <td>ERR3491288</td>\n",
" <td>2019-08-29 08:12:06</td>\n",
" <td>2019-08-29 16:44:44</td>\n",
" <td>740969</td>\n",
" <td>335487659</td>\n",
" <td>740969</td>\n",
" <td>452</td>\n",
" <td>194</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>UNIVERSITY OF COPENHAGEN</td>\n",
" <td>ERA2101231</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>1B9BDDF08DFFE966E583BB8510D1A1C9</td>\n",
" <td>2D4E33AA83E3DF0C6743C2941FB92E06</td>\n",
" <td>184.0</td>\n",
" <td>srr_batch_021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9828</th>\n",
" <td>9828</td>\n",
" <td>ERR3491372</td>\n",
" <td>2019-08-29 08:12:07</td>\n",
" <td>2019-08-29 16:55:42</td>\n",
" <td>686603</td>\n",
" <td>335627151</td>\n",
" <td>686603</td>\n",
" <td>488</td>\n",
" <td>178</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>UNIVERSITY OF COPENHAGEN</td>\n",
" <td>ERA2101231</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>20486576F4823DAA58B831EC46E2C3BD</td>\n",
" <td>23B4D8A735FA899AE1E8AB1D9E94211F</td>\n",
" <td>251.0</td>\n",
" <td>srr_batch_022</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9829</th>\n",
" <td>9829</td>\n",
" <td>ERR3491395</td>\n",
" <td>2019-08-29 08:12:08</td>\n",
" <td>2019-08-29 16:59:00</td>\n",
" <td>666494</td>\n",
" <td>300810331</td>\n",
" <td>666494</td>\n",
" <td>451</td>\n",
" <td>159</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>UNIVERSITY OF COPENHAGEN</td>\n",
" <td>ERA2101231</td>\n",
" <td>NaN</td>\n",
" <td>public</td>\n",
" <td>CC0B1B2B7982BEB010327DBF4050F917</td>\n",
" <td>1B39810F4998FDE338225C5B5408676A</td>\n",
" <td>183.0</td>\n",
" <td>srr_batch_023</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>800 rows × 50 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Run ReleaseDate LoadDate \\\n",
"203 203 ERR829817 2015-03-25 11:26:03 2016-03-08 20:05:11 \n",
"204 204 ERR829820 2015-03-25 11:26:03 2016-03-08 20:08:18 \n",
"821 821 ERR900483 2015-05-27 06:33:11 2016-01-18 00:49:41 \n",
"822 822 ERR900489 2015-05-27 06:33:12 2016-01-17 17:31:13 \n",
"823 823 ERR900490 2015-05-27 06:33:12 2016-01-18 07:03:14 \n",
"... ... ... ... ... \n",
"9825 9825 ERR3491282 2019-08-29 08:12:06 2019-08-29 16:43:37 \n",
"9826 9826 ERR3491285 2019-08-29 08:12:06 2019-08-29 16:44:22 \n",
"9827 9827 ERR3491288 2019-08-29 08:12:06 2019-08-29 16:44:44 \n",
"9828 9828 ERR3491372 2019-08-29 08:12:07 2019-08-29 16:55:42 \n",
"9829 9829 ERR3491395 2019-08-29 08:12:08 2019-08-29 16:59:00 \n",
"\n",
" spots bases spots_with_mates avgLength size_MB \\\n",
"203 2023883 404776600 2023883 200 252 \n",
"204 2155402 431080400 2155402 200 267 \n",
"821 3938509 787701800 3938509 200 479 \n",
"822 2375440 475088000 2375440 200 291 \n",
"823 2239366 447873200 2239366 200 274 \n",
"... ... ... ... ... ... \n",
"9825 914113 416231752 914113 455 238 \n",
"9826 794175 357519189 794175 450 205 \n",
"9827 740969 335487659 740969 452 194 \n",
"9828 686603 335627151 686603 488 178 \n",
"9829 666494 300810331 666494 451 159 \n",
"\n",
" AssemblyName ... Histological_Type Body_Site \\\n",
"203 GCA_000011505.1 ... NaN NaN \n",
"204 GCA_000011505.1 ... NaN NaN \n",
"821 GCA_000011505.1 ... NaN NaN \n",
"822 GCA_000011505.1 ... NaN NaN \n",
"823 GCA_000011505.1 ... NaN NaN \n",
"... ... ... ... ... \n",
"9825 NaN ... NaN NaN \n",
"9826 NaN ... NaN NaN \n",
"9827 NaN ... NaN NaN \n",
"9828 NaN ... NaN NaN \n",
"9829 NaN ... NaN NaN \n",
"\n",
" CenterName Submission dbgap_study_accession \\\n",
"203 THE WELLCOME TRUST SANGER INSTITUTE ERA422299 NaN \n",
"204 THE WELLCOME TRUST SANGER INSTITUTE ERA422299 NaN \n",
"821 THE WELLCOME TRUST SANGER INSTITUTE ERA441502 NaN \n",
"822 THE WELLCOME TRUST SANGER INSTITUTE ERA441502 NaN \n",
"823 THE WELLCOME TRUST SANGER INSTITUTE ERA441502 NaN \n",
"... ... ... ... \n",
"9825 UNIVERSITY OF COPENHAGEN ERA2101231 NaN \n",
"9826 UNIVERSITY OF COPENHAGEN ERA2101231 NaN \n",
"9827 UNIVERSITY OF COPENHAGEN ERA2101231 NaN \n",
"9828 UNIVERSITY OF COPENHAGEN ERA2101231 NaN \n",
"9829 UNIVERSITY OF COPENHAGEN ERA2101231 NaN \n",
"\n",
" Consent RunHash \\\n",
"203 public D93140F4BA1DA8869ABD791AB6975A6A \n",
"204 public C7C230BE779F3B292B0E5F9C12CBB1FA \n",
"821 public 1EA82E1DC3866DD5D45C960F4A676A27 \n",
"822 public C354AB0C14C86D98AED8AAC9ABB193FD \n",
"823 public 2836DEEF81E0C7EAF2EEB12A51AB8B53 \n",
"... ... ... \n",
"9825 public F26FBF6EA4E164DA80ABC8EE4F3E7A5C \n",
"9826 public 2F79FBDAA96EBCEC5A41EB3BCFBA95AF \n",
"9827 public 1B9BDDF08DFFE966E583BB8510D1A1C9 \n",
"9828 public 20486576F4823DAA58B831EC46E2C3BD \n",
"9829 public CC0B1B2B7982BEB010327DBF4050F917 \n",
"\n",
" ReadHash spades_time srr_batch \n",
"203 1C9478B92E0D843598618563F3F00F04 194.0 srr_batch_027 \n",
"204 D120886C5C01F5809F3DEFDF45A195FD 173.0 srr_batch_027 \n",
"821 D246A763EFA059F03A4D20472B506585 355.0 srr_batch_028 \n",
"822 F8758BB433CA8433490545D81FC74515 219.0 srr_batch_028 \n",
"823 C5468195F81B3A3497F899B70493F00C 308.0 srr_batch_028 \n",
"... ... ... ... \n",
"9825 39DA443AD578F0F7D23F838140254516 254.0 srr_batch_021 \n",
"9826 8C590944290B059036A3EA93B409AC56 224.0 srr_batch_021 \n",
"9827 2D4E33AA83E3DF0C6743C2941FB92E06 184.0 srr_batch_021 \n",
"9828 23B4D8A735FA899AE1E8AB1D9E94211F 251.0 srr_batch_022 \n",
"9829 1B39810F4998FDE338225C5B5408676A 183.0 srr_batch_023 \n",
"\n",
"[800 rows x 50 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa3c733a-0b37-4167-b12b-7168fa4370a4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
...
...
%% Cell type:code id:43ecd078-674a-4027-a226-f2ca8d7edb02 tags:
```python
# Dependencies
from os.path import join
from Bio import SeqIO
import pandas as pd
```
%% Cell type:markdown id:1273f0c1-2851-4508-98d1-1098a5126e0e tags:
## Some function without error handling
%% Cell type:code id:4334e544-06de-41f2-8760-6b4d94604534 tags:
```python
def get_phage_filenames_fromdb(phage_mapping, query_srr):
    phage_name_list = list(phage_mapping[phage_mapping['SRR number'] == query_srr]['Phage name'])
    return phage_name_list

def get_phage_fasta_files(phage_name_list, phage_seqdir):
    phage_seq_paths = [join(phage_seqdir, f'{phage_name}.fa') for phage_name in phage_name_list]
    return phage_seq_paths

def get_phage_sequences(phage_fasta_files):
    '''
    Loading the phage sequences with biopython
    '''
    phage_seqs = []
    for phage_seq_file in phage_fasta_files:
        act_phage_seqs = list(SeqIO.parse(phage_seq_file, "fasta"))
        phage_seqs.extend(act_phage_seqs)
    return phage_seqs

def get_assembly_folder(run_id, assembly_batch_mapping, assembly_basedir):
    srr_batch_id = list(assembly_batch_mapping[assembly_batch_mapping['Run'] == run_id]['srr_batch'])[0]
    expected_assembly_folder = join(assembly_basedir, srr_batch_id, run_id)
    return expected_assembly_folder
```
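The heading for this cell notes that the functions have no error handling. In particular, `get_assembly_folder` indexes `[0]` into a list that is empty when the run ID is not in the mapping, so an unknown run raises a bare `IndexError`. The following is a minimal sketch of a checked variant, not part of the commit: the name `get_assembly_folder_checked`, the toy mapping, and the base directory `/scratch/assembly` are assumptions made for illustration.

```python
from os.path import join

import pandas as pd


def get_assembly_folder_checked(run_id, assembly_batch_mapping, assembly_basedir):
    """Hypothetical variant of get_assembly_folder that fails with a clear message."""
    # Collect the distinct, non-null batch labels recorded for this run.
    batches = assembly_batch_mapping.loc[
        assembly_batch_mapping['Run'] == run_id, 'srr_batch'
    ].dropna().unique()
    if len(batches) == 0:
        raise KeyError(f'No srr_batch found for run {run_id!r}')
    if len(batches) > 1:
        raise ValueError(f'Run {run_id!r} maps to several batches: {list(batches)}')
    return join(assembly_basedir, batches[0], run_id)


# Toy mapping for illustration only; the run/batch pairs are taken from the
# notebook's printed output, the base directory is made up.
toy_mapping = pd.DataFrame({
    'Run': ['ERR3515864', 'ERR3491395'],
    'srr_batch': ['srr_batch_023', 'srr_batch_023'],
})
folder = get_assembly_folder_checked('ERR3515864', toy_mapping, '/scratch/assembly')
```

Raising `KeyError` for a missing run and `ValueError` for an ambiguous one is one possible convention; returning `None` would work as well if callers prefer to skip unknown runs.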
%% Cell type:markdown id:0cd4c7e1-f333-4450-8799-711675d2ec1f tags:
## Init
%% Cell type:code id:8730f244-ed8e-4ec4-bea9-30419b206614 tags:
```python
## Loading the database
phage_sequences_dir = '/home/ligeti/NAR2022Phage/data'
assembly_basedir = '/scratch/behemoth00/nar_assembly/assembly'
phage_mapping_file = '/home/ligeti/NAR2022Phage/phage_bac_mapping.tsv'
assembly_file = '../data/clean_all_phage_sra_metadata_spades.csv'

phage_mapping = pd.read_csv(phage_mapping_file, sep='\t')
assembly_info = pd.read_csv(assembly_file)

# Only assemblies with batch
assembly_info = assembly_info[~assembly_info['srr_batch'].isnull()]

assembly_batch_mapping = assembly_info[['Run', 'srr_batch']]
```
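The "Only assemblies with batch" step keeps the rows whose `srr_batch` value is not null. A small sketch on made-up data may make the effect easier to see. The column names follow the notebook, while the rows and the `toy_` names are assumptions for illustration only.

```python
import pandas as pd

# Toy metadata table; the rows are invented, only the column names match the notebook.
toy_info = pd.DataFrame({
    'Run': ['RUN_A', 'RUN_B', 'RUN_C'],
    'srr_batch': ['srr_batch_001', None, 'srr_batch_002'],
})

# Same filter as in the notebook: keep rows whose srr_batch is not null.
with_batch = toy_info[~toy_info['srr_batch'].isnull()]

# An equivalent spelling that drops rows with a missing value in that one column.
with_batch_alt = toy_info.dropna(subset=['srr_batch'])

# Two-column lookup table, as built in the notebook.
toy_batch_mapping = with_batch[['Run', 'srr_batch']]
```

Both spellings keep the original row index, so `RUN_C` stays at index 2 rather than being renumbered.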
%% Cell type:markdown id:63646c56-3597-4609-bed1-b967cfc4a4fa tags:
## Usage
%% Cell type:code id:192b81e9-1969-49bd-864a-2e3c279344e9 tags:
```python
act_runid = 'ERR3515864'

assemblydir = get_assembly_folder(act_runid, assembly_batch_mapping, assembly_basedir)
print(f'assemblydir: {assemblydir}')

phage_names = get_phage_filenames_fromdb(phage_mapping, act_runid)
phage_fasta_files = get_phage_fasta_files(phage_names, phage_sequences_dir)

sequences = get_phage_sequences(phage_fasta_files)
```
%% Output
assemblydir: /scratch/behemoth00/nar_assembly/assembly/srr_batch_023/ERR3515864
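The usage cell passes every constructed `.fa` path straight to the parser, so a phage name without a matching file would presumably fail when that file is opened. One way to surface this earlier is to split the paths into found and missing before parsing. This is a sketch under that assumption; `split_existing_paths` is a hypothetical helper and not part of the commit.

```python
from os.path import exists, join


def split_existing_paths(phage_names, phage_seqdir):
    """Hypothetical helper: build '<name>.fa' paths and separate found from missing."""
    # Same path convention as get_phage_fasta_files in the notebook.
    paths = [join(phage_seqdir, f'{name}.fa') for name in phage_names]
    found = [p for p in paths if exists(p)]
    missing = [p for p in paths if not exists(p)]
    return found, missing
```

The `found` list could then be handed to `get_phage_sequences`, with `missing` logged or reported separately.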
%% Cell type:code id:ab363fe3-5157-4a70-8f08-dbbb34860e95 tags:
```python
```
%% Output
Unnamed: 0 Run ReleaseDate LoadDate \
203 203 ERR829817 2015-03-25 11:26:03 2016-03-08 20:05:11
204 204 ERR829820 2015-03-25 11:26:03 2016-03-08 20:08:18
821 821 ERR900483 2015-05-27 06:33:11 2016-01-18 00:49:41
822 822 ERR900489 2015-05-27 06:33:12 2016-01-17 17:31:13
823 823 ERR900490 2015-05-27 06:33:12 2016-01-18 07:03:14
... ... ... ... ...
9825 9825 ERR3491282 2019-08-29 08:12:06 2019-08-29 16:43:37
9826 9826 ERR3491285 2019-08-29 08:12:06 2019-08-29 16:44:22
9827 9827 ERR3491288 2019-08-29 08:12:06 2019-08-29 16:44:44
9828 9828 ERR3491372 2019-08-29 08:12:07 2019-08-29 16:55:42
9829 9829 ERR3491395 2019-08-29 08:12:08 2019-08-29 16:59:00
spots bases spots_with_mates avgLength size_MB \
203 2023883 404776600 2023883 200 252
204 2155402 431080400 2155402 200 267
821 3938509 787701800 3938509 200 479
822 2375440 475088000 2375440 200 291
823 2239366 447873200 2239366 200 274
... ... ... ... ... ...
9825 914113 416231752 914113 455 238
9826 794175 357519189 794175 450 205
9827 740969 335487659 740969 452 194
9828 686603 335627151 686603 488 178
9829 666494 300810331 666494 451 159
AssemblyName ... Histological_Type Body_Site \
203 GCA_000011505.1 ... NaN NaN
204 GCA_000011505.1 ... NaN NaN
821 GCA_000011505.1 ... NaN NaN
822 GCA_000011505.1 ... NaN NaN
823 GCA_000011505.1 ... NaN NaN
... ... ... ... ...
9825 NaN ... NaN NaN
9826 NaN ... NaN NaN
9827 NaN ... NaN NaN
9828 NaN ... NaN NaN
9829 NaN ... NaN NaN
CenterName Submission dbgap_study_accession \
203 THE WELLCOME TRUST SANGER INSTITUTE ERA422299 NaN
204 THE WELLCOME TRUST SANGER INSTITUTE ERA422299 NaN
821 THE WELLCOME TRUST SANGER INSTITUTE ERA441502 NaN
822 THE WELLCOME TRUST SANGER INSTITUTE ERA441502 NaN
823 THE WELLCOME TRUST SANGER INSTITUTE ERA441502 NaN
... ... ... ...
9825 UNIVERSITY OF COPENHAGEN ERA2101231 NaN
9826 UNIVERSITY OF COPENHAGEN ERA2101231 NaN
9827 UNIVERSITY OF COPENHAGEN ERA2101231 NaN
9828 UNIVERSITY OF COPENHAGEN ERA2101231 NaN
9829 UNIVERSITY OF COPENHAGEN ERA2101231 NaN
Consent RunHash \
203 public D93140F4BA1DA8869ABD791AB6975A6A
204 public C7C230BE779F3B292B0E5F9C12CBB1FA
821 public 1EA82E1DC3866DD5D45C960F4A676A27
822 public C354AB0C14C86D98AED8AAC9ABB193FD
823 public 2836DEEF81E0C7EAF2EEB12A51AB8B53
... ... ...
9825 public F26FBF6EA4E164DA80ABC8EE4F3E7A5C
9826 public 2F79FBDAA96EBCEC5A41EB3BCFBA95AF
9827 public 1B9BDDF08DFFE966E583BB8510D1A1C9
9828 public 20486576F4823DAA58B831EC46E2C3BD
9829 public CC0B1B2B7982BEB010327DBF4050F917
ReadHash spades_time srr_batch
203 1C9478B92E0D843598618563F3F00F04 194.0 srr_batch_027
204 D120886C5C01F5809F3DEFDF45A195FD 173.0 srr_batch_027
821 D246A763EFA059F03A4D20472B506585 355.0 srr_batch_028
822 F8758BB433CA8433490545D81FC74515 219.0 srr_batch_028
823 C5468195F81B3A3497F899B70493F00C 308.0 srr_batch_028
... ... ... ...
9825 39DA443AD578F0F7D23F838140254516 254.0 srr_batch_021
9826 8C590944290B059036A3EA93B409AC56 224.0 srr_batch_021
9827 2D4E33AA83E3DF0C6743C2941FB92E06 184.0 srr_batch_021
9828 23B4D8A735FA899AE1E8AB1D9E94211F 251.0 srr_batch_022
9829 1B39810F4998FDE338225C5B5408676A 183.0 srr_batch_023
[800 rows x 50 columns]
%% Cell type:code id:fa3c733a-0b37-4167-b12b-7168fa4370a4 tags:
```python
```
data/clean_all_phage_sra_metadata_spades.csv
0 → 100644
+9832 −0
This diff is collapsed.