Transit data. You probably use it on a daily basis, though most likely through a high-level interface such as the Google Maps Trip Planner. One nice outcome of commoditized transit routing applications is a set of standards for transit data that work across agencies. With them, we can answer neat questions like:
- When is the bus supposed to arrive at the stop I'm at? (Obvious.)
- Is the bus late? (Less obvious, needs real-time data.)
- How often is the bus late?
- Given that the bus is late, is it going to come eventually, or is there a catastrophic failure in the system, meaning that I'll have to find an alternative route? (This happens more in San Francisco than people outside it are likely to believe.)
I'll give a quick overview of the real-time data sources I'm using in a transit tracking application, live at pureinformation.net/transit-tracker.
Where's the data from?
I'll talk about two sources of raw, unprocessed transit data: **GTFS**, which is provided by many agencies, as well as **NextBus**, a proprietary service which Firebase and many popular transit apps are built on.
GTFS is as dead simple as it gets - a set of CSV files. The spec is here: developers.google.com/transit/gtfs. Most of the complaints you'll find about GTFS come down to it being quite low-level - it's intended for scheduling and route planning purposes, and requires some massaging to create maps. If all you want is one shape per route, you may be better served by getting LineStrings out of an OpenStreetMap extract.
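To make "a set of CSV files" concrete, here's a minimal sketch that reads one GTFS table using only Go's standard library (it does not use go.gtfs). The `readTable` helper and the sample rows are made up for illustration; `route_id`, `route_short_name`, `route_long_name` and `route_type` are column names from the GTFS spec for routes.txt.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sampleRoutes mimics a tiny routes.txt. The rows are illustrative,
// not copied from a real feed.
const sampleRoutes = `route_id,route_short_name,route_long_name,route_type
1093,N,JUDAH,0
1071,71,HAIGHT-NORIEGA,3
`

// readTable (hypothetical helper) parses the text of one GTFS file into
// rows keyed by the column names found in the header line.
func readTable(text string) ([]map[string]string, error) {
	// csv.Reader rejects rows whose field count differs from the header.
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	header := records[0]
	rows := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]string, len(header))
		for i, name := range header {
			row[name] = rec[i]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	rows, err := readTable(sampleRoutes)
	if err != nil {
		panic(err)
	}
	for _, row := range rows {
		fmt.Println(row["route_id"], row["route_short_name"], row["route_long_name"])
	}
}
```

Looking columns up by header name rather than by position matters in practice, since the spec doesn't fix the column order and agencies include different optional columns.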
GTFS-Data-Exchange has a list of many GTFS sources. These usually link directly to transit agencies, who update their GTFS files periodically.
I've published a Golang library for reading GTFS data directories here: github.com/bdon/go.gtfs. It's not as full-featured as some of the other GTFS readers out there, but it doesn't use any external database, is fast, and works completely in memory for most operations.
GTFS data issues for creating maps and visualizations
*Example of different scheduled services on a transit line, in this case in the morning to distribute trains evenly. Each possible trip extent is represented as a separate shape in GTFS.*
GTFS has multiple route "shapes" for service that may only cover a part of the total route. For example, the "N-Judah" route (1093) has a large number of associated Trips, each of which references one of a set of Shapes.
A simple heuristic I'm using is to pick the shape with the most points for mapping purposes. This can be done with the go.gtfs library as follows:
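As a library-independent illustration of the heuristic, here's a minimal sketch using only the standard library. It is not the go.gtfs API: `longestShape` and `column` are hypothetical helpers, the sample data is invented, and it assumes trips.txt carries the optional `shape_id` column.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Illustrative stand-ins for trips.txt and shapes.txt (not real feed data).
const sampleTrips = `route_id,service_id,trip_id,shape_id
1093,1,t1,full
1093,1,t2,short
1093,1,t3,full
1071,1,t4,other
`

const sampleShapes = `shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
full,37.7600,-122.5090,1
full,37.7650,-122.4500,2
full,37.7760,-122.3940,3
short,37.7650,-122.4500,1
short,37.7760,-122.3940,2
other,37.7700,-122.4300,1
other,37.7700,-122.4400,2
other,37.7700,-122.4500,3
other,37.7700,-122.4600,4
`

// column returns the index of a named column in a CSV header, or -1.
func column(header []string, name string) int {
	for i, h := range header {
		if h == name {
			return i
		}
	}
	return -1
}

// longestShape returns the shape_id with the most points among the shapes
// referenced by the given route's trips, along with that point count.
func longestShape(tripsCSV, shapesCSV, routeID string) (string, int, error) {
	trips, err := csv.NewReader(strings.NewReader(tripsCSV)).ReadAll()
	if err != nil {
		return "", 0, err
	}
	shapes, err := csv.NewReader(strings.NewReader(shapesCSV)).ReadAll()
	if err != nil {
		return "", 0, err
	}
	if len(trips) == 0 || len(shapes) == 0 {
		return "", 0, fmt.Errorf("empty input")
	}
	tRoute, tShape := column(trips[0], "route_id"), column(trips[0], "shape_id")
	sShape := column(shapes[0], "shape_id")
	if tRoute < 0 || tShape < 0 || sShape < 0 {
		return "", 0, fmt.Errorf("missing required column")
	}

	// Collect the shapes used by this route's trips.
	used := map[string]bool{}
	for _, rec := range trips[1:] {
		if rec[tRoute] == routeID {
			used[rec[tShape]] = true
		}
	}
	// Count points per shape; each shapes.txt row is one point.
	counts := map[string]int{}
	for _, rec := range shapes[1:] {
		if used[rec[sShape]] {
			counts[rec[sShape]]++
		}
	}
	// Pick the largest count, breaking ties by shape_id so the result
	// doesn't depend on map iteration order.
	best, bestN := "", 0
	for id, n := range counts {
		if n > bestN || (n == bestN && id < best) {
			best, bestN = id, n
		}
	}
	if best == "" {
		return "", 0, fmt.Errorf("no shapes found for route %s", routeID)
	}
	return best, bestN, nil
}

func main() {
	id, n, err := longestShape(sampleTrips, sampleShapes, "1093")
	if err != nil {
		panic(err)
	}
	fmt.Println(id, n) // prints "full 3"
}
```

Point count is only a proxy for geographic extent: a short shape that was digitized densely can beat a longer, sparser one, so summing segment lengths would be a more robust variant of the same idea.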
Determining route destination signs
*Example destination signs for the N-Judah route:*
Destination signs or "rollsigns" are the endpoints of transit routes. The foolproof way to determine these is to take the endpoints of the longest route, e.g. "Ocean Beach" and "Caltrain". However, GTFS also provides a headsign field (`trip_headsign`) in the 'Trips' table, because destination signs may carry additional information to orient riders - for example, "Via Downtown" to generalize the end location. The go.gtfs library also performs a bit of this legwork for you:
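One way to get a single sign per direction from the headsign field is to take the most common `trip_headsign` for each `direction_id`. The sketch below does that with the standard library only; it is not the go.gtfs API, `commonHeadsigns` is a hypothetical helper, and the sample trips (including which direction gets 0 or 1) are invented. Both columns are optional in GTFS, so the function reports an error when either is absent.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Illustrative stand-in for trips.txt (not real feed data).
const sampleTrips = `route_id,trip_id,trip_headsign,direction_id
1093,t1,Ocean Beach,0
1093,t2,Ocean Beach,0
1093,t3,Caltrain,1
1093,t4,Caltrain,1
1093,t5,Embarcadero Station,1
`

// commonHeadsigns returns, for one route, the most frequent trip_headsign
// per direction_id. Ties are broken alphabetically.
func commonHeadsigns(tripsCSV, routeID string) (map[string]string, error) {
	recs, err := csv.NewReader(strings.NewReader(tripsCSV)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(recs) == 0 {
		return nil, fmt.Errorf("empty trips input")
	}
	idx := map[string]int{}
	for i, h := range recs[0] {
		idx[h] = i
	}
	for _, name := range []string{"route_id", "trip_headsign", "direction_id"} {
		if _, ok := idx[name]; !ok {
			return nil, fmt.Errorf("trips input is missing column %q", name)
		}
	}

	// counts[direction][headsign] = number of trips showing that sign.
	counts := map[string]map[string]int{}
	for _, rec := range recs[1:] {
		if rec[idx["route_id"]] != routeID {
			continue
		}
		dir, sign := rec[idx["direction_id"]], rec[idx["trip_headsign"]]
		if sign == "" {
			continue
		}
		if counts[dir] == nil {
			counts[dir] = map[string]int{}
		}
		counts[dir][sign]++
	}

	result := map[string]string{}
	for dir, signs := range counts {
		best, bestN := "", 0
		for sign, n := range signs {
			if n > bestN || (n == bestN && sign < best) {
				best, bestN = sign, n
			}
		}
		result[dir] = best
	}
	return result, nil
}

func main() {
	signs, err := commonHeadsigns(sampleTrips, "1093")
	if err != nil {
		panic(err)
	}
	fmt.Println(signs["0"], "/", signs["1"]) // prints "Ocean Beach / Caltrain"
}
```

Taking the majority sign hides short-turn trips such as the "Embarcadero Station" row above, which is usually what you want for a route label but not for per-trip display.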
Determining Uptown/Outbound or Downtown/Inbound
Ideally I'd like to label trips as "Uptown/Outbound" or "Downtown/Inbound". However, this is difficult to determine programmatically, and can change along the extent of a transit line. For example, MUNI light rail vehicles heading *downtown* travel northeast while west of Market Street, but vehicles heading in the opposite direction also travel north once they are east of Market Street.
To make things even more confusing, SFMTA itself uses an inconsistent mix of terminology to refer to trip directions - e.g. the mythical "Northbound" light rail trip, which only travels north from the Caltrain terminus to Embarcadero:
If you have any ideas on how to properly characterize transit routes as downtown/uptown/crosstown, please leave a comment!
Handling cases with non-linear routes
All of the above assumes your transit system doesn't have forking routes, e.g. the MBTA Red Line. In these cases routes aren't simply bidirectional: they have a single destination sign in one direction, while they fork in the other direction. The go.gtfs library doesn't quite handle these yet.
NextBus provides an XML-based pull API for all of its agencies at a single endpoint, documented here: NextBus documentation
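To give a feel for working with the XML, here's a minimal sketch that decodes a vehicle-locations style response with `encoding/xml`. Everything about the response shape is an assumption to check against the NextBus documentation: the `body` and `vehicle` element names, the attribute names, and the sample values are written from memory of the feed rather than taken from this article, and `parseVehicles` is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleResponse imitates a vehicle locations response. The element and
// attribute names are assumptions; the values are invented.
const sampleResponse = `<body>
  <vehicle id="5432" routeTag="30" dirTag="30_OB3" lat="37.8" lon="-122.41" secsSinceReport="12"/>
  <vehicle id="1410" routeTag="N" dirTag="N__IB1" lat="37.76" lon="-122.47" secsSinceReport="45"/>
</body>`

// vehicle holds the attributes this sketch cares about.
type vehicle struct {
	ID              string  `xml:"id,attr"`
	RouteTag        string  `xml:"routeTag,attr"`
	DirTag          string  `xml:"dirTag,attr"`
	Lat             float64 `xml:"lat,attr"`
	Lon             float64 `xml:"lon,attr"`
	SecsSinceReport int     `xml:"secsSinceReport,attr"`
}

// response is the assumed top-level <body> element.
type response struct {
	XMLName  xml.Name  `xml:"body"`
	Vehicles []vehicle `xml:"vehicle"`
}

// parseVehicles decodes the XML text into a slice of vehicles.
func parseVehicles(xmlText string) ([]vehicle, error) {
	var r response
	if err := xml.Unmarshal([]byte(xmlText), &r); err != nil {
		return nil, err
	}
	return r.Vehicles, nil
}

func main() {
	vehicles, err := parseVehicles(sampleResponse)
	if err != nil {
		panic(err)
	}
	for _, v := range vehicles {
		fmt.Printf("vehicle %s on route %s (%s), reported %ds ago\n",
			v.ID, v.RouteTag, v.DirTag, v.SecsSinceReport)
	}
}
```

Since this is a pull API, a tracker would call something like this on a timer and use the age of each report to decide how stale a position is.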
Joining GTFS with NextBus
It's a bit of a pain to compare route identifiers across API providers. GTFS assigns a mutable ID to routes - that is, the GTFS route with ID 1071 (MUNI bus route 71) is not guaranteed to keep the 1071 ID when its schedule changes. In most cases it's best to use the `route_short_name` as a universal identifier for a transit route.
Some routes from NextBus, e.g. SF cable cars, have longer names such as "CALIFORNIA", while in the GTFS they are given the name "61". There's not much choice in your application other than providing a hardcoded dictionary lookup to convert between the two. For San Francisco MUNI, the main exceptions look like:
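In Go, a hardcoded lookup of this kind can be sketched as a map with a pass-through default. Only the CALIFORNIA entry below comes from the example above; the other exceptions would need to be filled in by hand, and `nextBusToGTFS` and `gtfsShortName` are hypothetical names.

```go
package main

import "fmt"

// nextBusToGTFS maps NextBus route tags to GTFS route_short_name values
// for routes where the two differ. Only "CALIFORNIA" is taken from the
// text; add further mismatches for your agency here.
var nextBusToGTFS = map[string]string{
	"CALIFORNIA": "61",
}

// gtfsShortName converts a NextBus route tag to the GTFS route_short_name,
// falling back to the tag itself when there is no override.
func gtfsShortName(nextBusTag string) string {
	if name, ok := nextBusToGTFS[nextBusTag]; ok {
		return name
	}
	return nextBusTag
}

func main() {
	fmt.Println(gtfsShortName("CALIFORNIA")) // prints "61"
	fmt.Println(gtfsShortName("30"))         // prints "30"
}
```

Defaulting to the unchanged tag keeps the table small, since most routes already agree between the two sources.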
Other NextBus Quirks
NextBus directions have awkward values such as `30_OB3` for SFMUNI route 30 Outbound, and it's not documented what the 3 means. The heuristic I'm using in these cases is to look for the token IB in the tag as indicating Inbound, but this is frequently wrong, as the direction tags seem to be linked to some driver instrumentation on Muni vehicles. NextBus also seems to interpolate vehicle locations.
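The token heuristic can be sketched in a few lines. `directionFromTag` is a hypothetical helper; treating OB as Outbound follows from the `30_OB3` example, while dropping everything before the last underscore is my own addition so that a route tag can't accidentally match. As noted, the answer is only a guess and is frequently wrong.

```go
package main

import (
	"fmt"
	"strings"
)

// directionFromTag guesses a direction from a NextBus direction tag such
// as "30_OB3". It returns "inbound", "outbound", or "unknown".
func directionFromTag(dirTag string) string {
	// Keep only the part after the last underscore ("30_OB3" -> "OB3").
	if i := strings.LastIndex(dirTag, "_"); i >= 0 {
		dirTag = dirTag[i+1:]
	}
	switch {
	case strings.Contains(dirTag, "IB"):
		return "inbound"
	case strings.Contains(dirTag, "OB"):
		return "outbound"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(directionFromTag("30_OB3")) // prints "outbound"
	fmt.Println(directionFromTag("30_IB1")) // prints "inbound"
	fmt.Println(directionFromTag(""))       // prints "unknown"
}
```

Returning "unknown" instead of defaulting to one direction lets the caller fall back to something else, such as comparing successive vehicle positions against the route shape.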
Other data sources
GTFS adoption is next to zero for agencies outside the US. Some agencies such as the MBTA rail system and the NYC MTA have their own APIs and data formats. GTFS-Realtime sounds like it fits alongside these tools, but it doesn't seem to be heavily adopted, so I haven't looked into it yet.