parseScores
  Parses the output from a predictor to generate the geneScoreStructure.

  inputFile           a file with the output from the predictor
  predictor           the predictor that was used. 'tsv' for tab-separated
                      values, where the first row holds the names of the
                      compartments and each row after that corresponds to
                      a gene. 'wolf' for WoLFPSORT. (opt, default 'tsv')

  The function normalizes the scores so that the best score for each gene
  is 1.0.

  geneScoreStructure  a structure to be used in predictLocalization

  Usage: geneScoreStructure=parseScores(inputFile,predictor)

  Rasmus Agren, 2013-08-01
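The normalization step described above can be illustrated with a minimal sketch. The score matrix here is made up for illustration (rows are genes, columns are compartments); only `max` and `bsxfun` are used, as in the function itself, though this sketch divides directly rather than multiplying by the reciprocal.

```matlab
% Hypothetical raw scores: 2 genes (rows) x 3 compartments (columns)
scores = [14 8 4; 2 0 6];

% Best (largest) score for each gene, taken along the second dimension
best = max(scores, [], 2);

% Divide every row by its own maximum so the best score becomes 1.0
normalized = bsxfun(@rdivide, scores, best);

disp(normalized);
```

Note that a gene whose scores are all zero would yield NaN values under this scheme, since 0/0 is NaN in MATLAB.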
function geneScoreStructure=parseScores(inputFile,predictor)
% parseScores
%   Parses the output from a predictor to generate the geneScoreStructure.
%
%   inputFile           a file with the output from the predictor
%   predictor           the predictor that was used. 'tsv' for tab-separated
%                       values, where the first row holds the names of the
%                       compartments and each row after that corresponds to
%                       a gene. 'wolf' for WoLFPSORT. (opt, default 'tsv')
%
%   The function normalizes the scores so that the best score for each gene
%   is 1.0.
%
%   geneScoreStructure  a structure to be used in predictLocalization
%
%   Usage: geneScoreStructure=parseScores(inputFile,predictor)
%
%   Rasmus Agren, 2013-08-01

if nargin<2
    predictor='tsv';
end

fid=fopen(inputFile,'r');

if fid<1
    dispEM('Could not open file');
end

%NOTE: only the 'wolf' branch is present in this listing. For any other
%predictor, geneScoreStructure is never assigned before it is used below.
if strcmpi(predictor,'wolf')
    A=textscan(fid,'%s','Delimiter','\n','CommentStyle','#');

    %Each element should be for one gene, but some of them are on the form
    %"Pc20g11350: treating 9 X's as Glycines". Those should be removed.
    I=~cellfun(@any,strfind(A{1},'treating'));

    B=regexp(A{1}(I),' ','split');

    %Reserve space for the output
    geneScoreStructure.compartments={};
    geneScoreStructure.scores=[]; %Don't know number of comps yet
    geneScoreStructure.genes=cell(numel(B),1);

    %Parsing is a bit cumbersome as ', ' is used as a delimiter in some cases
    %and ' ' in others. Use strrep to get rid of ','.
    for i=1:numel(B)
        b=strrep(B{i},',','');
        geneScoreStructure.genes{i}=b{1};

        %Then go through the compartments and add new ones as they are
        %found
        for j=2:2:numel(b)-1
            [~,J]=ismember(b(j),geneScoreStructure.compartments);

            %Add new compartment if it doesn't exist
            if J==0
                geneScoreStructure.compartments=[geneScoreStructure.compartments;b(j)];
                J=numel(geneScoreStructure.compartments);
                geneScoreStructure.scores=[geneScoreStructure.scores zeros(numel(B),1)];
            end

            geneScoreStructure.scores(i,J)=str2double(b(j+1));
        end
    end
end

%Check if there are duplicate genes. J indexes one occurrence of each
%unique gene; K has one element per original gene.
[~,J,K]=unique(geneScoreStructure.genes);

if numel(J)~=numel(K)
    dispEM('There are duplicate genes in the input file',false);
    geneScoreStructure.genes=geneScoreStructure.genes(J);
    geneScoreStructure.scores=geneScoreStructure.scores(J,:);
end

%Normalize so that the best score for each gene is 1.0
I=max(geneScoreStructure.scores,[],2);
geneScoreStructure.scores=bsxfun(@times, geneScoreStructure.scores, 1./I);

fclose(fid);
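The tokenization used in the 'wolf' branch can be sketched on a single line of input. The input string below is hypothetical, constructed only to match what the loop expects (a gene name followed by alternating compartment names and scores, with ', ' between pairs); it is not taken from real WoLF PSORT output.

```matlab
% Hypothetical input line (assumed layout: gene, then compartment/score pairs)
line = 'geneA nucl 11, cyto 2, mito 1';

% Split on spaces, then strip the commas left attached to the scores
tokens = strrep(regexp(line, ' ', 'split'), ',', '');

% First token is the gene name
gene = tokens{1};

% Even positions hold compartment names, the following odd positions the scores
compartments = tokens(2:2:end-1);
scores = str2double(tokens(3:2:end));

fprintf('%s: %d compartments\n', gene, numel(compartments));
```

The function itself grows the compartment list and the score matrix incrementally because different genes may list different compartments; this sketch only covers the per-line splitting.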