Sorting of raw data #2

Open
opened 2022-07-14 08:13:32 +02:00 by irrtum_limited · 5 comments
irrtum_limited commented 2022-07-14 08:13:32 +02:00 (Migrated from gitlab.com)

More a proposal than an issue:

Sorting of raw data

I know it would be sufficient to declare that sorting is entirely a matter for the client, i.e. whoever is scraping or using the data for their own needs.
However, the more entries the CSV gets, the more sense a "native" sort order makes.
And I don't want to over-engineer this: just sort alphabetically by the first few characters of each line (in CSV terms: sort by the first column).
So, on Linux: "sort _data/wim.csv"
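
A plain `sort` treats every line the same way, so a header row (if the file has one) would be sorted in with the data, and the resulting order depends on the locale. As a minimal sketch, assuming the file has a header row and that a case-insensitive order is wanted (neither is confirmed in this thread), sorting by the first column could look like this. The function name `sort_csv_text` and the column names in the usage note are made up for illustration.

```python
import csv
import io


def sort_csv_text(text, has_header=True):
    """Return CSV text with its data rows sorted by the first column.

    Assumption: comparison is case-insensitive; the header row, if any,
    stays on top. Blank lines are dropped.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if has_header:
        header, body = rows[:1], rows[1:]
    else:
        header, body = [], rows
    # casefold() gives a case-insensitive sort key
    body.sort(key=lambda row: row[0].casefold())
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(header + body)
    return out.getvalue()
```

To apply it to the real file, one would read `_data/wim.csv` into a string, pass it through `sort_csv_text`, and write the result back. With hypothetical columns `name,link`, the header stays first and the rows below it come out ordered by name.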

Why?

Because the amount of data may grow significantly. Even this simple, primitive sorting would help to prevent duplicate entries (otherwise, even now, I have to check for duplicates before every entry I'd like to make).

A new entry could then be inserted right at the alphabetical position where it belongs, instead of just being appended at the end. Tracking of new entries is taken care of by git anyway.
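
As a rough sketch of that idea, assuming the names from the first column are already in case-insensitive alphabetical order, a binary search finds the insert position and catches an existing entry in the same step. The helper name `insert_sorted` and the sample names in the usage note are hypothetical.

```python
import bisect
from typing import List


def insert_sorted(names: List[str], new_name: str) -> bool:
    """Insert new_name at its alphabetical position in a sorted list.

    Returns False and leaves the list unchanged if the name is already
    present (compared case-insensitively).
    """
    keys = [name.casefold() for name in names]
    key = new_name.casefold()
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return False  # duplicate entry, nothing inserted
    names.insert(pos, new_name)
    return True
```

For example, inserting "Artist B" into `["Artist A", "Artist C"]` places it in the middle and returns `True`; inserting "artist c" afterwards returns `False`. Note that this only catches identical names, so two different bands sharing one name would still need a human decision.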

What do you say?

(If you are okay with this, I'd assign this to myself.)

By the way, on duplicate names: this is something for the future, but it will hit us sooner or later. Example (unfortunately male bands, but it's just for the sake of an example): there are at least two bands called "Pankow", both active at the same time. One is from Berlin, Germany, the other from Florence, Italy.
Right now we have no obvious way to deal with this (the artist link would be different, though). Normed name databases (authority files) may help; Wikipedia uses them, for example. This would significantly reduce the fun of editing, so I'm not going further down this path right now, just raising attention.

irrtum_limited commented 2022-07-16 14:16:53 +02:00 (Migrated from gitlab.com)

Oh, now that I have actually cloned the repo to work in it: there is that "date" column. It seems to be the date when the entry should come up in the automatic feeds.
Right now, each new entry adds one day and one hour to the date of the previous entry.
I had thought "date" meant the point in time when a record was made.

Umm. I have to think about this.

For one thing, I understand: this project was originally quite purpose-driven (one woman in music every day). But on the other hand, this makes a "universal"/"generic" approach hard.

sakrecoer commented 2022-07-28 14:16:47 +02:00 (Migrated from gitlab.com)

We don't have an order. The schedule advances by 1 day + 1 hour per entry through the year to make sure it reaches every time zone (minus some downtime that is inevitable at this scale of ops).
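
To make the arithmetic concrete, here is a small sketch of the 1 day + 1 hour stepping. The start date and the function name `schedule` are made up for illustration; the real build may compute its dates differently.

```python
from datetime import datetime, timedelta

# Step described in this thread: one day plus one hour per entry
STEP = timedelta(days=1, hours=1)


def schedule(start, count):
    """Return `count` publication dates, each one STEP after the previous."""
    return [start + i * STEP for i in range(count)]
```

Starting from a hypothetical 2022-07-14 08:00, the next two dates are 2022-07-15 09:00 and 2022-07-16 10:00. After 24 steps the hour of day is back where it started, so the publication time cycles through all 24 hours.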

We just add artists as we find them. It's the easiest impartial way. The date is only there for the Jekyll build mechanism. I'm thinking about removing everything Jekyll-related and only maintaining the list here. I can have scripts for whatever else is needed to build https://basspistol.com/wim.xml, and so could anyone else :)

sakrecoer commented 2022-07-28 14:17:39 +02:00 (Migrated from gitlab.com)

Thank you for your contributions anyway, @irrtum_limited, I will merge them now! :)

sakrecoer commented 2022-07-28 14:30:42 +02:00 (Migrated from gitlab.com)

Having no order does make it more cumbersome to keep the list free of duplicates, indeed. If you want to sort things, you are welcome to do it. :) I would vote for alphabetical order.

However, do we want to publish the list in alphabetical order? If we don't, the feed builder has to shuffle the new entries and append them at the end of its current shuffled version, which strikes me as a complex task. But I could live with it :)
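
One possible shape for that step, purely as a sketch: it assumes the feed builder keeps the previously published order somewhere (nothing in this thread says it does) and that artist names are unique. The function name `extend_published_order` and its parameters are hypothetical.

```python
import random


def extend_published_order(published, sorted_names, seed=None):
    """Keep the existing published order and append new entries shuffled.

    `published` is the order already used by the feed (an assumption),
    `sorted_names` is the first column of the alphabetically sorted CSV.
    """
    known = set(published)
    new_entries = [name for name in sorted_names if name not in known]
    # A fixed seed makes the shuffle reproducible between builds
    random.Random(seed).shuffle(new_entries)
    return list(published) + new_entries
```

Entries that were already published keep their position, so the alphabetical order of the file never leaks into the feed; only the newly added names are shuffled among themselves.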

No order makes it more cumbersome to maintain free of doubles, indeed. If you want to sort things, you are welcome to do it. :) I would vouch for alaphabetic order. However, do we want to publish the list in alphabetic order? If we don't want that, it means the feed builder has to reshuffle the new entries at the end of their current shuffled version. Which strikes me as a complex task. But i could live with it :)
sakrecoer commented 2022-07-28 14:30:47 +02:00 (Migrated from gitlab.com)

assigned to @irrtum_limited

Reference: hq/women-in-music#2