The first approach assigns each police report to one of several logical text corpora and displays a wordcloud of the most frequent terms in each. The rules defining the corpora may be based on time, location, gender, race, etc. To illustrate, I define 4 logical documents - Night, Morning, Day, and Evening - by dividing the 24-hour day into 4 segments: Night (reports with time of offense from midnight to 6 a.m.), Morning (6 a.m. to noon), Day (noon to 6 p.m.), and Evening (6 p.m. to midnight). This results in 4 clouds, displayed in pairs of opposing times: Night vs. Day and then Morning vs. Evening:
At night we are looking at ASSAULT, VEHICLE, and FLED, while the day is about SUSPECT, THEFT, and NFI (No F...ing Idea). I am not sure what CAUSING and CONSENT mean at the top of the Night cloud, but PAIN there makes sense. Evening and Morning seem to differ the most on ASSAULT and BURGLARY.
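For readers who want to try the time-of-day split themselves, here is a minimal Python sketch of the bucketing and term counting behind the four clouds. It is not the code used for this post: the sample reports, the stopword list, and the names `segment` and `term_counts` are all made up for illustration, and it assumes each report is available as an hour of offense (0 to 23) plus its narrative text.

```python
import re
from collections import Counter

SEGMENTS = ("Night", "Morning", "Day", "Evening")

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"A", "AND", "IN", "OF", "THE"}

# Hypothetical sample data: (hour of offense 0-23, narrative text).
reports = [
    (2, "SUSPECT ASSAULTED COMPLAINANT CAUSING PAIN AND FLED IN VEHICLE"),
    (9, "UNKNOWN SUSPECT COMMITTED BURGLARY OF RESIDENCE"),
    (14, "UNKNOWN SUSPECT COMMITTED THEFT NFI"),
    (20, "SUSPECT COMMITTED ASSAULT AND FLED"),
]


def segment(hour):
    """Map an hour of offense (0-23) to one of the 4 logical documents."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    # Each segment is 6 hours long: 0-5, 6-11, 12-17, 18-23.
    return SEGMENTS[hour // 6]


def term_counts(reports):
    """Return one Counter of term frequencies per logical document."""
    counts = {name: Counter() for name in SEGMENTS}
    for hour, text in reports:
        words = re.findall(r"[A-Z]+", text.upper())
        counts[segment(hour)].update(w for w in words if w not in STOPWORDS)
    return counts


counts = term_counts(reports)
for name in SEGMENTS:
    # The top terms per segment are what a wordcloud would draw largest.
    print(name, counts[name].most_common(3))
```

The per-segment counters can be handed to any wordcloud tool that accepts term frequencies. Swapping `segment` for a rule based on location or another attribute gives a different set of logical documents without touching the counting step.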
Another way to look at the same data is with a slopegraph (example): terms (rows) move up or down according to their frequency across the 4 logical documents (columns). I am leaving this for my next post.
Unstructured data such as text can still make use of the types of plots usually associated with structured data. A bar graph (histogram) by hour of day of the reports containing a certain term is one way to achieve this. To illustrate, let's count police reports containing the term BURGLARY for each hour (from 0 to 23) across all data (roughly 6 months). This results in the following plot:
Not exactly surprising, but it offers several conclusions. For example, most burglars' work day runs from 9 in the morning until 10 in the evening. They also take a lunch break at about 3 p.m., or possibly it's the police officers who do. Finally, the high peak at 5 is something everyone should be aware of unless, again, it's the police officers who skew the time of offense in reports towards the end of their shift.
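The hourly counting behind a plot like this is simple enough to sketch in a few lines of Python. Again, this is only an illustration under assumptions, not the code that produced the graphs here: the sample reports and the function name `hourly_counts` are hypothetical, and a report is assumed to be a pair of hour of offense (0 to 23) and narrative text.

```python
import re
from collections import Counter

# Hypothetical sample data: (hour of offense 0-23, narrative text).
reports = [
    (9, "UNKNOWN SUSPECT COMMITTED BURGLARY OF RESIDENCE"),
    (9, "SUSPECT COMMITTED BURGLARY OF BUSINESS"),
    (17, "UNKNOWN SUSPECT COMMITTED BURGLARY NFI"),
    (17, "UNKNOWN SUSPECT COMMITTED BMV"),
    (22, "SUSPECT COMMITTED ASSAULT AND FLED"),
]


def hourly_counts(reports, term):
    """Count, for each hour 0-23, the reports whose text contains `term`.

    The term is matched as a whole word, case-insensitively, and each
    report is counted at most once no matter how often the term occurs.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = Counter(hour for hour, text in reports if pattern.search(text))
    # Counter returns 0 for hours with no matching reports.
    return [hits[hour] for hour in range(24)]


counts = hourly_counts(reports, "BURGLARY")

# Crude text bar chart; a plotting library would take `counts` directly.
for hour, n in enumerate(counts):
    if n:
        print(f"{hour:02d} {'#' * n}")
```

Calling `hourly_counts(reports, "BMV")` or `hourly_counts(reports, "ASSAULT")` gives the series for the other terms. Matching whole words keeps a search for a short term like BMV from picking up longer tokens that merely contain it.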
A different trend can be observed for reports containing BMV (Burglary Motor Vehicle):
The last graph is for the term ASSAULT: