Software solutions for large data sets

Working with large data sets or files
    GCG's limitations

    One of the problems of dealing with human genome data is the large size of many of the sequence files. Individual PAC or BAC clone submissions may exceed 200,000 bp (see AL049830 from chromosome 14, or the contigs from map region Xq25-q26 on chromosome X).

    It is best to narrow your search as much as possible before you start downloading data. ACHS's primary platform for working with sequences is the GCG package, which can be accessed from the command line in a terminal session, graphically with SeqLab, or over the web with SeqWeb. GCG has several limitations that can complicate work with very long sequences. For example, PileUp, GCG's multiple sequence alignment program, has a sequence length limit of 7,000 characters (including gap characters) and an absolute limit of 2 million characters per data set.
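
    The limits quoted above can be checked before data are sent to the alignment program. The following is a minimal Python sketch and not part of GCG: the function names read_fasta and check_limits are hypothetical, the input is assumed to be FASTA-formatted text, and the two constants simply restate the figures given above, which may differ between GCG versions.

```python
# Limits as quoted in the text; treat them as assumptions for your GCG version.
MAX_SEQ_LEN = 7_000          # characters per sequence, gap characters included
MAX_TOTAL_CHARS = 2_000_000  # characters per data set


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_limits(records, max_len=MAX_SEQ_LEN, max_total=MAX_TOTAL_CHARS):
    """Return (names of over-long sequences, total characters, fits_total)."""
    too_long = [n for n, seq in records.items() if len(seq) > max_len]
    total = sum(len(seq) for seq in records.values())
    return too_long, total, total <= max_total


if __name__ == "__main__":
    # Hypothetical two-record data set used only for illustration.
    example = ">short\nACGTACGT\n>long\n" + "A" * 8000 + "\n"
    too_long, total, fits = check_limits(read_fasta(example))
    print("Over the per-sequence limit:", too_long)
    print("Total characters:", total, "| within data-set limit:", fits)
```

    Any sequence reported as over the limit is a candidate for trimming to the region of interest, or for alignment on a server without these constraints.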

    If you run into these limits with your sequence data, there are ways around the length and data-set size constraints. One option is to edit the sequences to remove all but the specific regions of interest. This can be done in GCG's editor, SeqEd, or with a variety of other software (see the links below for suggestions). Another is to use web-server applications, which may not share the limitations of the software available at ACHS; in particular, several web-based hosts can create multiple alignments of very large or very long data sets.
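
    Trimming a sequence to a region of interest can also be scripted. The sketch below is an illustration in Python rather than a GCG or SeqEd feature: extract_region and to_fasta are hypothetical helper names, coordinates are assumed to be 1-based and inclusive (as in most sequence database entries), and the example coordinates are made up.

```python
def extract_region(sequence, start, end):
    """Return positions start..end (1-based, inclusive) of a sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(
            f"region {start}-{end} is outside the sequence (1-{len(sequence)})"
        )
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical 200,000 bp clone; real data would be read from a file.
    clone = "ACGT" * 50_000
    # Hypothetical region of interest: bases 50,001 to 55,000.
    region = extract_region(clone, 50_001, 55_000)
    print(len(region), "bp kept")
    print(to_fasta("clone_50001_55000", region)[:80])
```

    Recording the original coordinates in the new sequence name, as in the example, makes it easier to map results back to the full clone later.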

    *Researchers who want their own copy of the GCG manuals can obtain PDF files from Accelrys' GCG web site.

     

    Useful web sites:

    Institut Pasteur's Biological Software

    This is a large site with many programs. Some are limited to local users, but most are publicly available through web interfaces. There are several alignment programs and sequence format converters, as well as excellent programs for evolutionary and phylogenetic studies, such as web interfaces to the PHYLIP programs, and FastDNAml and MOLPHY for maximum likelihood analysis of DNA and protein sequences.

    SeWeR = Sequence analysis using WEB resources

    A Java application that provides an interface and links to a collection of web-based programs for DNA and protein analyses, hosted by Indiana University and several mirror sites (see the bottom of the SeWeR home page). You can run SeWeR from one of the hosting sites, or download the Java applet and run it in the web browser on your own PC or Mac.

    ClustalW at EBI, Baylor College of Medicine, or GenomeNet (from Kyoto University)

    ClustalW is an excellent multiple alignment program, and several web servers exist that can handle very large data sets. ClustalW is also available for PCs, Macs, and UNIX from numerous sites, such as Finland's CSC server or Indiana University's FTP server.

    Editing sequences

    SeqEd in GCG can handle many basic editing tasks. Researchers who want a sequence editor on their desktop can invest in one of several commercial products, or experiment with the many freeware and shareware programs available on the web. For Windows PC users, two popular sequence editors freely available on the web are BioEdit and GeneDoc. Freely available software for Mac users is rarer, but one program in development is SeqPup, available from Indiana University's software archive. SeqPup is a Java application, so users will also need to download and install a copy of the Java Runtime Environment (JRE), available for Mac, PC, or Linux.

    SeqPup will also run on blue.unix.virginia.edu. Download SeqPup.jar and move it to a directory of your choice, then start the program with the command "jre -cp SeqPup.jar run". After the first use it will extract a startup script, which can be executed as "./seqpup".

    Apple OS X users can use the excellent sequence editor Se-Al by Andrew Rambaut at the University of Oxford.

    Several sites host large collections of links to other software and web-based applications. The main ones include:

    The ExPASy Molecular Biology server

    The BioList

    and, for Linux users, the Scientific Applications on Linux (SAL) site

     

    *** If you find any web sites or web applications that have proven useful in your research, please e-mail the links and information to Michael Black (mblack@virginia.edu) so that we can make them easily accessible to others.

    Please email comments or suggestions about the ACHS MolBiol pages to mblack@virginia.edu.

    Academic Computing Health Sciences
    Box 800555
    Charlottesville, VA 22908
    (434) 982-4025

    © 2008 by the Rector and Visitors of the University of Virginia.
