sqlite datatype

classic Classic list List threaded Threaded
2 messages Options
| Threaded
Open this post in threaded view
|

sqlite datatype

Jorrit Boekel
Hi JJ and mailing list,

I think it was John who mailed earlier about implementing an SQLite datatype for proteomics. I sort of like this, since I find myself using SQLite all the time for my own command line tool to do filtering, protein grouping, and relating quant output to ID data.
https://github.com/glormph/msstitch  — I basically wrote this to combine percolator, mzidentml, and openMS intermediate formats. It’s currently a bit of a mixture of different databases and tables for different operations, but maybe it would be a good idea to have one large proteomics experiment SQLite database. Note that this is nothing I want to push onto people like it would be the next PSI standard, more like a very convenient and fast intermediate datatype.

I believe that the proteome discoverer msf files are also based on sqlite. While I’m not sure if it’s a good fit for a server environment since it can get locking/access problems when accessed by too many processes, it’s a very nice fit for a bioinformatic cli tool. Are you guys currently working on any standardized SQLite thing?

cheers,

Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden



_______________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: sqlite datatype

Jim Johnson
On 2/12/15 1:50 AM, Jorrit Boekel wrote:

> Hi JJ and mailing list,
>
> I think it was John who mailed earlier about implementing an SQLite datatype for proteomics. I sort of like this, since I find myself using SQLite all the time for my own command line tool to do filtering, protein grouping, and relating quant output to ID data.
> https://github.com/glormph/msstitch  — I basically wrote this to combine percolator, mzidentml, and openMS intermediate formats. It’s currently a bit of a mixture of different databases and tables for different operations, but maybe it would be a good idea to have one large proteomics experiment SQLite database. Note that this is nothing I want to push onto people like it would be the next PSI standard, more like a very convenient and fast intermediate datatype.
>
> I believe that the proteome discoverer msf files are also based on sqlite. While I’m not sure if it’s a good fit for a server environment since it can get locking/access problems when accessed by too many processes, it’s a very nice fit for a bioinformatic cli tool. Are you guys currently working on any standardized SQLite thing?
>
> cheers,
> —
> Jorrit Boekel
> Proteomics systems developer
> BILS / Lehtiö lab
> Scilifelab Stockholm, Sweden
>
>
>
Jorrit,

We would like to define a Galaxy datatype that subclasses the Galaxy SQLite datatype and defines a schema for proteomics.

Our current use case is as a Galaxy dataprovider for a visualization plugin.
Ira and I had discussed this last summer, and he quickly put together demo visualization plugin.

MSI is currently prototyping a Galaxy visualization plugin using the lorikeet spectral viewer.

For feasibility testing, I just used the schema that fell out of the mzR bioconductor package (details below),
but we would now like to design a schema that would be responsive, but also generic enough to cover a variety of needs.

I'll take a look at what you've done for guidance.

Thanks,

JJ


I've been using this data as a test case for the Visualization plugin:

http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/iTRAQ_Erwinia_1uLSike_Top10HCD_40CE_60min_01_100614184128-centroided.mzML
http://proteome.sysbiol.cam.ac.uk/lgatto/RforProteomics/data/iTRAQ_Erwinia_1uLSike_Top10HCD_40CE_60min_01_100614184128-centroided.mzid

I generated the SQLite DB using this R script:

## install.packages(c("/Users/jj/src/bio/prot/bioc/mzR_2.0.0.tar.gz"), lib = "/Library/Frameworks/R.framework/Resources/library/",repos=NULL,type="source")
library(mzR)
library(msdata)
library(jsonlite)
# Read in scan data
z <- openMSfile("iTRAQ_Erwinia.mzML")
# Not used in any DB table as yet
zri <- runInfo(z)
zii <- instrumentInfo(z)
# Read in mzIdentML
x <- openIDfile("iTRAQ_Erwinia.mzid")
ix <- mzidInfo(x)
p <- psms(x)
m <- modifications(x)
s <- score(x)
psm <- data.frame(scanNum=sub("^.*=(\\d+)$","\\1",p[,1]),p)
# Need to get spectrumID
# Then parse out scan num
## pl <- peaks(z,6591)
## toJSON(p[1,])
## toJSON(header(z,6591))
## toJSON(peaks(z,6591))
scanl <- 1:runInfo(z)$scanCount
moz <- sapply(scanl,function(y) toJSON(peaks(z,y)[,1]))
intensity <- sapply(scanl,function(y) toJSON(peaks(z,y)[,2]))
pkdf <- data.frame(scanNum = scanl, moz = moz, intensity = intensity)
## mzR to RSQLite
library("RSQLite")
sqlite <- dbDriver("SQLite")
dbconn <- dbConnect(sqlite,"iTRAQ_Erwinia.sqlite")
dbWriteTable(dbconn,"psm",psm)
dbWriteTable(dbconn,"scan",header(z))
dbWriteTable(dbconn,"score",s)
dbWriteTable(dbconn,"modification",m)
dbWriteTable(dbconn,"peaks",pkdf)


$ sqlite3 iTRAQ_Erwinia.sqlite
SQLite version 3.4.0
Enter ".help" for instructions
sqlite> .schema
CREATE TABLE modification
( "spectrumID" TEXT,
         "sequence" TEXT,
         "name" TEXT,
         "mass" REAL,
         "location" INTEGER
);
CREATE TABLE peaks
( "scanNum" INTEGER,
         "moz" TEXT,
         "intensity" TEXT
);
CREATE TABLE psm
( "scanNum" TEXT,
         "spectrumID" TEXT,
         "chargeState" INTEGER,
         "rank" INTEGER,
         "passThreshold" INTEGER,
         "experimentalMassToCharge" REAL,
         "calculatedMassToCharge" REAL,
         "sequence" TEXT,
         "modNum" INTEGER,
         "isDecoy" INTEGER,
         "post" TEXT,
         "pre" TEXT,
         "start" INTEGER,
         "end" INTEGER,
         "DatabaseAccess" TEXT,
         "DatabaseSeq" TEXT,
         "DatabaseDescription" TEXT
);
CREATE TABLE scan
( "seqNum" INTEGER,
         "acquisitionNum" INTEGER,
         "msLevel" INTEGER,
         "polarity" INTEGER,
         "peaksCount" INTEGER,
         "totIonCurrent" REAL,
         "retentionTime" REAL,
         "basePeakMZ" REAL,
         "basePeakIntensity" REAL,
         "collisionEnergy" REAL,
         "ionisationEnergy" REAL,
         "lowMZ" REAL,
         "highMZ" REAL,
         "precursorScanNum" INTEGER,
         "precursorMZ" REAL,
         "precursorCharge" INTEGER,
         "precursorIntensity" REAL,
         "mergedScan" INTEGER,
         "mergedResultScanNum" INTEGER,
         "mergedResultStartScanNum" INTEGER,
         "mergedResultEndScanNum" INTEGER
);
CREATE TABLE score
( "spectrumID" TEXT,
         "MS_GF_RawScore" REAL,
         "MS_GF_DeNovoScore" REAL,
         "MS_GF_SpecEValue" REAL,
         "MS_GF_EValue" REAL
);






--
James E. Johnson Minnesota Supercomputing Institute University of Minnesota
_______________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/