January 2013

While exploring blog posts on Kaggle, I came across an open source data visualization tool that I thought was really neat, called Gephi. It is described as "Photoshop but for data." I decided to try out this program with my own Facebook network. After looking at the Gephi tutorials page, I found the Facebook app called Netvizz that allows you to download a copy of your Facebook network in a GDF or TAB file.

After downloading the files, I opened the .gdf file in Gephi. After opening the file, an Import Report screen comes up (Fig. 1). I selected the graph type to be undirected, which means that if A connects to B, then B also connects to A, which is how a Facebook network would work. If I chose a directed graph, then that means that A connects to B, but B might not be connected to A.

Fig. 1

After selecting ok, the data was imported and I got a preview of my network (Fig. 2). Each circle, or node, represents a person and the lines, or edges, shows the connections between them. You can zoom in and out by using the scroll wheel on the mouse and you can pan left or right by using the right mouse button and dragging along the screen. If you lose view of the graph, you can press the magnifying glass on the left bottom toolbar to center the graph in the window.

Fig. 2

In the ranking window in the top left is where the color and sizes of the nodes can be adjusted. By selecting the degree option, I first changed the color scheme (Fig. 3). I created my own by hovering over the color bar and clicking the triangles to pick a color. There is also an option to pick a default color scheme and to invert the chosen color scheme. Next, I adjusted the sizes of the nodes (Fig.4). I set my minimum node size to 1 and the maximum to 10.

Fig. 3

Fig. 4

You are also able to add labels to the nodes (Fig. 5). By clicking the T button on the bottom toolbar will toggle the labels. The size of the text and the position and color can also be adjusted. I decided to have the labels turned off because it got too cluttered.

Fig. 5

Next, I chose a layout algorithm for the data (Fig. 6). This will set the shape of the graph. I chose Fruchterman Reingold for this particular graph (Fig. 7). I left the layout properties unchanged and ran the algorithm. Once it was finished, I got a more aesthetically pleasing graph than the giant square I started with (Fig. 8).

Fig. 6

Fig. 7

Fig. 8

Being pleased with the output, I went to the preview window to see the complete rendered version of the graph (Fig. 9). I set the edges to be curved to get a nice looking output. I then exported the final result. There are three options: SVG, PDF, or PNG. I exported to a PNG and got the final graph (Fig.10).

Fig. 9

Fig. 10

Analyzing the graph, I noticed a pattern. The big super cluster on the left are people I went to high school with, while the smaller cluster on the right are people from my university. There's a few people in the middle that share connections with both groups.

I then played with the data set a little more and explored different layout algorithms, filters, and color schemes. I ran the Forced Atlas layout and got a graph that further divides my data into more clusters (Fig. 11). I noticed five clusters, each one representing a different group of my connections. Blue represents high school, dark green represents my family, yellow represent my junior college, red represent the group of friends I spend the most time with, and the light green on the left represents my university.

Fig. 11

This tool was a great find and I know that once I explore more of its features, I will be able to graphically represent data sets to find important trends and patterns that I wouldn't have been able to see through other programs and data filtering techniques.

In the field of astronomy, most of the data that a scientist works with is collected by telescopes and large sky surveys. This growth in new data is on the order of terabytes, and will continue to grow with new data from future sky surveys. This challenges current astronomical research methods, and as a result, encourages new approaches to studying it. With the incorporation of data mining techniques, scientists would be able to organize, classify, and find trends in the growing data sets with more ease. This push to a more data-driven area of astronomical research has called for a new discipline that combines astronomy and data mining, called astroinformatics.

Why is astroinformatics so important? Take as an example the Sloan Digital Sky Survey (SDSS). The SDSS is a major optical telescope that produces multi-color images covering more than a quarter of the sky. In addition to these images, it has produced 3-dimensional maps containing numerous amounts of galaxies and quasars. The survey has also produced the largest picture of the sky to date. The SDSS takes a few hundred gigabytes of raw data every night, and since the beginning of the survey, the amount of raw data collected has grown to a few terabytes, and continues to grow as the survey goes on. With such a large amount of data, the task of processing it becomes more difficult and time-consuming.

But the SDSS isn't the only sky survey out there. A future survey, called the Large Synoptic Survey Telescope (LSST) that plans to be operational by 2022, will be producing up to 30 terabytes of data per night. That's about as much data that the SDSS collects in its whole operation! At that rate, the LSST is expected to produce about 2 petabytes of uncompressed data per year. That is far more than what humans are able to review, and reviewing the data is the most difficult part of the project. This is why astroinformatics is important.

With effective data mining techniques, the ability to review the enormous data output from the survey becomes much more simplified. By creating well-tested, straightforward data mining algorithms, the process of reviewing data and classifying objects allows for the data to be placed in searchable and downloadable databases in real-time. This allows scientists and researchers from all over the world to have immediate access to large, high quality data sets that would otherwise take months or years to be published if the data were to have been reviewed just by humans.

Wednesday, January 30, 2013

Gephi Data Visualization

Saturday, January 5, 2013

Astroinformatics: A Blend of Astronomy and Data Mining