getGenesFromKEGG Retrieves information on all genes stored in KEGG database keggPath this function reads data from a local FTP dump of the KEGG database. keggPath is the pathway to the root of the database koList the number of genes in KEGG is very large. koList can be a cell array with KO identifiers, in which case only genes belonging to one of those KEGG orthologies are retrieved (opt, default all KOs with associated reactions) model a model structure generated from the database. The following fields are filled id: 'KEGG' description: 'Automatically generated from KEGG database' rxns: KO ids rxnNames: Name for each entry genes: IDs for all the genes. Genes are saved as organism abbreviation:id (same as in KEGG). 'HSA:124' for example is alcohol dehydrogenase in Homo sapiens rxnGeneMat A binary matrix that indicates whether a specific gene is present in a KO id If the file keggGenes.mat is in the RAVEN directory it will be loaded instead of parsing of the KEGG files. If it does not exist it will be saved after parsing of the KEGG files. In general, you should remove the keggGenes.mat file if you want to rebuild the model structure from a newer version of KEGG. Usage: model=getGenesFromKEGG(keggPath,koList) Rasmus Agren, 2013-08-01
0001 function model=getGenesFromKEGG(keggPath,koList) 0002 % getGenesFromKEGG 0003 % Retrieves information on all genes stored in KEGG database 0004 % 0005 % keggPath this function reads data from a local FTP dump of the KEGG 0006 % database. keggPath is the pathway to the root of the database 0007 % 0008 % koList the number of genes in KEGG is very large. koList can be a 0009 % cell array with KO identifiers, in which case only genes 0010 % belonging to one of those KEGG orthologies are retrieved (opt, 0011 % default all KOs with associated reactions) 0012 % 0013 % model a model structure generated from the database. The following 0014 % fields are filled 0015 % id: 'KEGG' 0016 % description: 'Automatically generated from KEGG database' 0017 % rxns: KO ids 0018 % rxnNames: Name for each entry 0019 % genes: IDs for all the genes. Genes are saved as 0020 % organism abbreviation:id (same as in KEGG). 0021 % 'HSA:124' for example is alcohol dehydrogenase 0022 % in Homo sapiens 0023 % rxnGeneMat A binary matrix that indicates whether a 0024 % specific gene is present in a KO id 0025 % 0026 % If the file keggGenes.mat is in the RAVEN directory it will be loaded 0027 % instead of parsing of the KEGG files. If it does not exist it will be 0028 % saved after parsing of the KEGG files. In general, you should remove 0029 % the keggGenes.mat file if you want to rebuild the model structure from 0030 % a newer version of KEGG. 0031 % 0032 % Usage: model=getGenesFromKEGG(keggPath,koList) 0033 % 0034 % Rasmus Agren, 2013-08-01 0035 % 0036 0037 %NOTE: This is how one entry looks in the file 0038 0039 % ENTRY K11440 KO 0040 % NAME gbsB 0041 % DEFINITION choline dehydrogenase [EC:1.1.1.1] 0042 % CLASS Metabolism; Amino Acid Metabolism; Glycine, serine and threonine 0043 % metabolism [PATH:ko00260] 0044 % DBLINKS RN: R08557 R08558 0045 % COG: COG1454 0046 % GENES BSU: BSU31050(gbsB) 0047 % BCY: Bcer98_1477 0048 % BLI: BL02563(gbsB) 0049 % BLD: BLi03269(gbsB) 0050 % BCL: ABC0979(gbsB) 0051 % BAY: RBAM_028150(gbsB) 0052 % BPU: BPUM_2734(gbsB) 0053 % ADDITIONAL INFO... 0054 % /// 0055 0056 %The file is not tab-delimited. Instead each label is 12 characters 0057 %(except for '///') 0058 0059 %Check if the genes have been parsed before and saved. If so, load the 0060 %model. 0061 [ST I]=dbstack('-completenames'); 0062 ravenPath=fileparts(ST(I).file); 0063 genesFile=fullfile(ravenPath,'kegg','keggGenes.mat'); 0064 if exist(genesFile, 'file') 0065 fprintf(['NOTE: Importing KEGG genes from ' strrep(genesFile,'\','/') '.\n']); 0066 load(genesFile); 0067 else 0068 %Download required files from KEGG if it doesn't exist in the directory 0069 downloadKEGG(keggPath); 0070 0071 %Get all KOs that are associated to reactions 0072 allKOs=getAllKOs(keggPath); 0073 0074 %Since the list of genes will be accessed many times I use a hash table 0075 geneMap=containers.Map('KeyType','char','ValueType','int32'); 0076 0077 %Add new functionality in the order specified in models 0078 model.id='KEGG'; 0079 model.description='Automatically generated from KEGG database'; 0080 0081 %Preallocate memory 0082 model.rxns=cell(numel(allKOs),1); 0083 model.rxnNames=cell(numel(allKOs),1); 0084 0085 %Reserve space for 500000 genes 0086 model.genes=cell(500000,1); 0087 0088 %Load information on KO ID, name, and associated genes 0089 fid = fopen(fullfile(keggPath,'ko'), 'r'); 0090 0091 %Keeps track of how many KOs that have been added 0092 koCounter=0; 0093 nGenes=0; 0094 addingGenes=false; 0095 0096 %These contain the mapping between KOs and genes and are used to 0097 %generate the model.rxnGeneMat in the end 0098 koIndex=zeros(5000000,1); 0099 geneIndex=koIndex; 0100 nMappings=0; 0101 0102 skipRead=false; 0103 0104 %Loop through the file 0105 while 1 0106 %Get the next line 0107 if skipRead==false 0108 tline = fgetl(fid); 0109 else 0110 skipRead=false; 0111 end 0112 0113 %Abort at end of file 0114 if ~ischar(tline) 0115 break; 0116 end 0117 0118 %Skip '///' 0119 if numel(tline)<12 0120 addingGenes=false; 0121 continue; 0122 end 0123 0124 %Check if it's a new reaction 0125 if strcmp(tline(1:12),'ENTRY ') 0126 %Check if it should be added 0127 koID=tline(13:18); 0128 0129 if ismember(koID,allKOs) 0130 addMe=true; 0131 koCounter=koCounter+1 0132 else 0133 addMe=false; 0134 continue; 0135 end 0136 0137 %Add reaction ID (always 6 characters) 0138 model.rxns{koCounter}=koID; 0139 model.rxnNames{koCounter}=''; %Will be overwritten if it exists 0140 end 0141 0142 %To get here we must be in a KO that should be added 0143 if addMe==true 0144 %Add name 0145 if strcmp(tline(1:12),'DEFINITION ') 0146 model.rxnNames{koCounter}=tline(13:end); 0147 end 0148 0149 %Check if it's time to start adding genes 0150 if strcmp(tline(1:12),'GENES ') 0151 addingGenes=true; 0152 end 0153 0154 %Add genes 0155 if addingGenes==true 0156 %We are now adding genes for the current KO. All gene rows start 0157 %with 12 spaces. If this is not true it means that there is 0158 %additional annotation such as AUTHORS. This is not common but it 0159 %exists. 0160 if strcmp(tline(1:12),' ') || strcmp(tline(1:12),'GENES ') 0161 geneLine=tline(13:end); 0162 0163 %Check if the line is from a new organism of from the same as 0164 %before 0165 if strcmp(geneLine(4),':') 0166 currentOrganism=geneLine(1:3); 0167 end 0168 0169 %Parse the string 0170 genes=regexp(geneLine(6:end),' ','split'); 0171 genes=strcat(currentOrganism,':',genes(:)); 0172 0173 %Add the genes to the gene list 0174 for i=1:numel(genes) 0175 geneString=genes{i}; 0176 if geneMap.isKey(geneString); 0177 index=geneMap(geneString); 0178 else 0179 nGenes=nGenes+1; 0180 model.genes{nGenes}=geneString; 0181 index=nGenes; 0182 geneMap(geneString)=index; 0183 end 0184 0185 %Add the connection to the KO 0186 nMappings=nMappings+1; 0187 koIndex(nMappings)=koCounter; 0188 geneIndex(nMappings)=index; 0189 end 0190 else 0191 %Now we want to throw away everything until the next 0192 %entry 0193 while 1 0194 tline = fgetl(fid); 0195 if strcmp(tline,'///') 0196 %When the new entry is found, skip reading one line to 0197 %fit with the rest of the code 0198 skipRead=true; 0199 break; 0200 end 0201 end 0202 end 0203 end 0204 end 0205 end 0206 %Close the file 0207 fclose(fid); 0208 0209 %If too much space was allocated, shrink the model 0210 model.rxns=model.rxns(1:koCounter); 0211 model.rxnNames=model.rxnNames(1:koCounter); 0212 model.genes=model.genes(1:nGenes); 0213 0214 %Retrieve and add the gene associations 0215 model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1)); 0216 0217 %To make sure the size is correct if the last KOs don't have genes 0218 if size(model.rxnGeneMat,1)~=koCounter 0219 model.rxnGeneMat(koCounter,1)=0; 0220 end 0221 0222 %Save the model structure 0223 save(genesFile,'model'); 0224 end 0225 0226 %Only get the KOs in koList 0227 I=~ismember(model.rxns,koList); 0228 model=removeRxns(model,I,true,true); 0229 end 0230 0231 function allKOs=getAllKOs(keggPath) 0232 %Retrieves all KOs that are associated to reactions. This is because 0233 %the number of genes in KEGG is very large so without this parsing it 0234 %would take many hours 0235 0236 allKOs={}; 0237 0238 %First check if the reactions have already been parsed 0239 [ST I]=dbstack('-completenames'); 0240 ravenPath=fileparts(ST(I).file); 0241 rxnsFile=fullfile(ravenPath,'kegg','keggRxns.mat'); 0242 if exist(rxnsFile, 'file') 0243 fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']); 0244 load(rxnsFile,'model'); 0245 0246 %Loop through the reactions and add the corresponding genes 0247 for i=1:numel(model.rxns) 0248 if isstruct(model.rxnMiriams{i}) 0249 %Get all KOs 0250 allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'urn:miriam:kegg.ko'))]; 0251 end 0252 end 0253 else 0254 %Parse the reaction file instead 0255 %First load information on reaction ID, reaction name, KO, pathway, and ec-number 0256 fid = fopen(fullfile(keggPath,'reaction'), 'r'); 0257 orthology=false; 0258 while 1 0259 %Get the next line 0260 tline = fgetl(fid); 0261 0262 %Abort at end of file 0263 if ~ischar(tline) 0264 break; 0265 end 0266 0267 %Skip '///' 0268 if numel(tline)<12 0269 continue; 0270 end 0271 0272 %Check if it's a new reaction 0273 if strcmp(tline(1:12),'ENTRY ') 0274 orthology=false; 0275 end 0276 0277 if strcmp(tline(1:9),'REFERENCE') 0278 orthology=false; 0279 end 0280 0281 %Add KO-ids 0282 if numel(tline)>16 0283 if strcmp(tline(1:16),'ORTHOLOGY KO: ') || strcmp(tline(1:16),' KO: ') || strcmp(tline(1:12),'ORTHOLOGY ') || orthology==true 0284 if strcmp(tline(13:16),'KO:') %This is in the old version 0285 allKOs=[allKOs;tline(17:22)]; 0286 else 0287 if strcmp(tline(1:12),'ORTHOLOGY ') 0288 %This means that it found one KO in the new format and that 0289 %subsequent lines might be other KOs 0290 orthology=true; 0291 end 0292 allKOs=[allKOs;tline(13:18)]; 0293 end 0294 end 0295 end 0296 end 0297 0298 %Close the file 0299 fclose(fid); 0300 end 0301 allKOs=unique(allKOs); 0302 end