Lineage Barcode Extract
-------------------------------------------------------------------------------

If you’re looking to identify lineage-defining mutations for your lineage of choice, look no further! You can extract this information using Freyja’s existing barcodes and a bit of Python.

1. Run ``freyja update`` and write the barcodes to the directory of your choosing (this will take a few minutes):

::

   freyja update --outdir /my/local/directory/

2. Load the barcodes using pandas and extract the barcodes for your lineages of interest.

.. code:: python

   import pandas as pd

   def sortFun(x):
       # sort based on nuc position, ignoring nuc identities
       # (e.g. 'A23403G' -> 23403)
       return int(x[1:-1])

   # replace "/my/local/directory/" with your chosen directory
   df = pd.read_csv('/my/local/directory/usher_barcodes.csv', index_col=0)

   # specify your lineages of interest here
   lins = ['B.1.617.2', 'BA.1', 'BA.2', 'B.1.1.7', 'P.1']
   df = df.loc[lins]

   # keep only columns with at least one mutation across the lins
   keepcols = df.columns[df.sum(axis=0) > 0]
   df = df.loc[:, keepcols]

Optional:

3. Load in gene coordinate information to translate nucleotide mutations into possible AA mutations. An example file for doing this is available `here `__. Then append the AA mutations to the corresponding nucleotide mutations.

.. code:: python

   # build dataframe with gene data
   import json
   from Bio import SeqIO
   from Bio.Seq import Seq

   with open('/my/local/directory/SARS-CoV-2.json') as f1:
       dat = json.load(f1)

   # collect one row per gene (DataFrame.append was removed in pandas 2.0)
   rows = [{'gene': d,
            'start': dat["genes"][d]['coordinates']['from'],
            'end': dat["genes"][d]['coordinates']['to']}
           for d in dat["genes"]]
   df_genes = pd.DataFrame(rows, columns=["gene", "start", "end"])
   df_genes = df_genes.sort_values('start').set_index('gene', drop=False)

   # ref0 must hold the reference genome sequence. Assumption: it is read with
   # Biopython from a FASTA file that you supply (placeholder file name below).
   ref0 = str(next(SeqIO.parse('/my/local/directory/reference.fasta', 'fasta')).seq)

   # function for converting to gene-wise coordinate numbering
   def getGeneNum(pos, df0):
       j = 0
       while j < df0.shape[0]:
           if df0.iloc[j].loc['start'] <= pos <= df0.iloc[j].loc['end']:
               return df0.iloc[j].loc['gene'], (pos - df0.iloc[j].loc['start'])//3 + 1
           else:
               j += 1
       # position does not fall inside any gene
       return '', ''

   # add AA mutation info into our dataframe
   cols = sorted(df.columns, key=sortFun)
   df = df[cols]
   for i, mut in enumerate(cols):
       pos = sortFun(mut)
       g, pa = getGeneNum(pos, df_genes)
       if len(g) == 0:
           refAA = ''
           mutAA = ''
       else:
           # offset of the mutated base within its codon (0, 1 or 2)
           posInTriple = (pos - df_genes.loc[g, 'start']) % 3
           codonStart = pos - posInTriple - 1
           triple = ref0[codonStart:codonStart + 3]
           refAA = Seq(triple).translate()
           # swap in the mutant base and translate again
           tripleMut = triple[:posInTriple] + mut[-1] + triple[posInTriple + 1:]
           mutAA = Seq(tripleMut).translate()
       cols[i] = cols[i] + '(' + g + ':' + str(refAA) + str(pa) + str(mutAA) + ')'
   df.columns = cols

4. Export to CSV format for later viewing.

.. code:: python

   df.to_csv('mutations_lineage_subset.csv')
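
Once the table is filtered, it can be handy to read the mutations off per lineage rather than scanning the 0/1 matrix by eye. The following is a minimal sketch of one way to do that, not part of Freyja itself: the helper names ``mutations_by_lineage`` and ``unique_mutations`` are hypothetical, and the small ``toy`` table uses made-up lineage names and mutations in place of the real ``df``.

.. code:: python

   import pandas as pd

   # Toy stand-in for the filtered barcode table (rows: lineages, columns: mutations).
   # The values are made up for illustration and are not real barcodes.
   toy = pd.DataFrame(
       [[1, 0, 1], [1, 1, 0]],
       index=['lineageA', 'lineageB'],
       columns=['A100G', 'C200T', 'G300A'],
   )

   def mutations_by_lineage(barcodes):
       # hypothetical helper: map each lineage to the mutations flagged in its row
       return {lin: list(row.index[(row > 0).to_numpy()])
               for lin, row in barcodes.iterrows()}

   def unique_mutations(barcodes):
       # hypothetical helper: keep only mutations carried by exactly one
       # of the selected lineages, then list them per lineage
       counts = (barcodes > 0).sum(axis=0)
       return mutations_by_lineage(barcodes.loc[:, counts == 1])

   per_lineage = mutations_by_lineage(toy)
   unique = unique_mutations(toy)
   print(per_lineage)
   print(unique)

Passing the real ``df`` from the steps above in place of ``toy`` should work the same way, under the assumption that the barcode table stores presence as values greater than zero. Note that "unique" here is relative to the lineages you selected, not to all lineages in the barcode file.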