ALLSTAR (Assembled Labeled Library for Static Analysis Research) is a public repository of Debian source code, compiler artifacts, binaries, and symbols. It is hosted by APL as a community resource for improving software reverse engineering research.
ALLSTAR consists of over 30,000 Debian (jessie) packages, each built (or at least attempted ☺) for each Debian architecture (x86, amd64, ARM, MIPS, PPC, s390x) using Dockcross containers.
For each package/architecture, an attempt is made to build the package using debuild, overriding some of the build flags. If the build is successful, any ELF files from the build (containing symbols) are saved, along with source code, header files, and intermediate compiler files (.o, .class, .gkd, .gimple). System headers and libraries are also saved.
You can see an example package here.
The dataset is divided alphabetically into four partitions to make management simpler. Each package has an HTML file for browsing and a JSON file for programmatic parsing. A package's HTML file often lists multiple binaries, because any test binaries built during the build process are collected too. You can browse the HTML files by clicking on the “dataset” links below. Individual binaries are indexed at the “binaries by arch” link. That page is generated by finding an amd64 binary, listing its MD5, and then checking whether that binary exists in the directory structure for the other architectures. You can also pull the entire dataset over rsync; see the instructions below.
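The indexing procedure described above can be sketched roughly as follows. This is only an illustration, not the script used to generate the page: the directory layout (`<root>/<arch>/...`), the architecture directory names, and the function names are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

# Hypothetical per-architecture directory names; the real dataset layout may differ.
OTHER_ARCHES = ["x86", "ARM", "MIPS", "PPC", "s390x"]


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_binaries(root: Path, reference_arch: str = "amd64", other_arches=OTHER_ARCHES):
    """List each reference-arch file with its MD5 and the other arches that have the same relative path."""
    ref_root = root / reference_arch
    index = []
    for binary in sorted(p for p in ref_root.rglob("*") if p.is_file()):
        rel = binary.relative_to(ref_root)
        index.append({
            "path": rel.as_posix(),
            "md5": md5_of(binary),
            # Presence is checked by relative path only; contents differ per architecture.
            "also_built_for": [arch for arch in other_arches if (root / arch / rel).is_file()],
        })
    return index
```

Calling `index_binaries(Path("/path/to/local/copy"))` on a tree laid out this way returns one record per amd64 file. Matching by relative path is a design choice of this sketch, since the MD5 of a binary necessarily differs from one architecture to the next.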
If the package build could not start due to unsatisfied dependencies (this happens more than you might think), the package’s HTML/JSON will be seemingly empty (no entries under “Documentation” or “Binaries”).
If the build succeeded but produced no ELF binaries and no documentation was found, the package’s HTML/JSON will be seemingly empty (no entries under “Documentation” or “Binaries”).
Missing .o files: The build script parses the DWARF symbol information for the ELF binaries to get original source file names. It then assumes that name is used for the .o file (e.g., foo.c produces foo.o). Occasionally a package will do something like “gcc -c foo.c -o foo-main.o”. This results in the build outputting a bad link for foo.o.
Missing .h files: Sometimes the absolute path to the header file is included in the DWARF symbol information rather than a relative path. This will result in a broken link, although if you pull the full dataset, the .h file should be there.
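The two link issues above can be illustrated with a short sketch. The first function mirrors the naming assumption the build script is described as making (source name to .o name). The second shows one way to locate a header in a local copy of the dataset when the recorded path is absolute. Both function names and the idea of searching by file name are assumptions of this example, not part of the ALLSTAR tooling.

```python
from pathlib import Path, PurePosixPath
from typing import List


def expected_object_name(source_name: str) -> str:
    """Apply the foo.c -> foo.o assumption to a source file name taken from DWARF."""
    return PurePosixPath(source_name).with_suffix(".o").name


def find_header(dataset_root: Path, dwarf_path: str) -> List[Path]:
    """Search a local dataset copy for files whose name matches the header named in DWARF."""
    name = PurePosixPath(dwarf_path).name
    return sorted(p for p in dataset_root.rglob("*") if p.is_file() and p.name == name)


# A package built with "gcc -c foo.c -o foo-main.o" breaks the naming assumption:
actual_object = "foo-main.o"
link_is_broken = expected_object_name("foo.c") != actual_object
```

Searching by file name can return several candidates when different packages ship headers with the same name, so the result is a list rather than a single path.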
If you want to troubleshoot package building or if you need something from the jessie distribution, we have the entire distribution (including packages for all architectures) mirrored here. Official mirroring of jessie will end soon, but we plan to keep this mirror active as a community resource to support ALLSTAR-based research.
To grab packages from our jessie repo, add the following lines to /etc/apt/sources.list in a jessie Docker container or jessie VM:
deb http://allstar.jhuapl.edu/debian jessie main
deb-src http://allstar.jhuapl.edu/debian jessie main
part 1 | (0ad - libapr1-dev) | dataset | package list | binaries by arch |
part 2 | (libapreq2-3 - liblingua-stem-perl) | dataset | package list | binaries by arch |
part 3 | (liblingua-stem-snowball-da-perl - mate-system-tools) | dataset | package list | binaries by arch |
part 4 | (mate-terminal - zzuf) | dataset | package list | binaries by arch |
To do experiments on individual binaries without pulling the full dataset, you can use the Python API available on GitHub.
You can pull the entire dataset over rsync. This also gives you access to the build logs if you want to diagnose build issues for particular packages.
From a Linux command prompt (the password is "allstar"):
rsync -avz rsync://allstar@allstar.jhuapl.edu/share .
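For unattended pulls, the same rsync command can be wrapped in a small script. The sketch below is one possible approach, not an official client: the function names are invented for this example, and it relies on rsync's RSYNC_PASSWORD environment variable to supply the password without an interactive prompt.

```python
import os
import subprocess

RSYNC_URL = "rsync://allstar@allstar.jhuapl.edu/share"


def build_rsync_command(destination: str = "."):
    """Build the argument list for the rsync command shown above."""
    return ["rsync", "-avz", RSYNC_URL, destination]


def build_rsync_env(password: str = "allstar"):
    """Copy the current environment and add the rsync daemon password."""
    env = dict(os.environ)
    env["RSYNC_PASSWORD"] = password
    return env


def pull_dataset(destination: str = "."):
    """Run rsync; raises CalledProcessError if the transfer fails."""
    subprocess.run(build_rsync_command(destination), env=build_rsync_env(), check=True)
```

Calling `pull_dataset("allstar-data")` starts the transfer into that directory. The full dataset is large, so check the available disk space before running it.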
@misc{allstarDataset,
  author = {JHU/APL Staff},
  year = {2019},
  month = {Dec},
  title = {Assembled Labeled Library for Static Analysis Research (ALLSTAR) Dataset},
  url = {http://allstar.jhuapl.edu/},
}